Explore the full two-day session lineup, and stay tuned for additional exciting sessions and speakers.

Day 1 - June 07

11:45 AM PDT
2:45 PM EDT

DataComp: In search of the next generation of multimodal datasets

Ludwig Schmidt

Assistant Professor in Computer Science
University of Washington

Large multimodal datasets have been instrumental in recent breakthroughs such as CLIP, Stable Diffusion, and GPT-4. At the same time, datasets rarely receive the same attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a benchmark where the training code is fixed and researchers innovate by proposing new training sets. We provide a testbed for dataset experiments centered around a new candidate pool of 12.8B image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing on 38 downstream test sets. Our benchmark consists of multiple scales, which facilitates the study of scaling trends and makes the benchmark accessible to researchers with varying resources.
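The workflow above can be illustrated with a minimal sketch of a filtering step a participant might submit. This is purely hypothetical: the field names (`caption`, `clip_score`) and thresholds are illustrative assumptions, not the actual DataComp schema or baselines.

```python
# Hypothetical sketch of a DataComp-style filtering step: the training code
# stays fixed, and participants only change how the candidate pool is
# filtered. Field names and thresholds here are illustrative, not the
# actual DataComp schema.

def filter_pool(samples, min_caption_words=2, min_clip_score=0.25):
    """Keep image-text pairs whose caption is non-trivial and whose
    precomputed image-text similarity passes a threshold."""
    kept = []
    for s in samples:
        caption = s.get("caption", "")
        if len(caption.split()) < min_caption_words:
            continue  # drop near-empty captions
        if s.get("clip_score", 0.0) < min_clip_score:
            continue  # drop weakly aligned image-text pairs
        kept.append(s)
    return kept

pool = [
    {"caption": "a photo of a dog in a park", "clip_score": 0.31},
    {"caption": "img", "clip_score": 0.40},                 # caption too short
    {"caption": "stock banner image", "clip_score": 0.10},  # weak alignment
]
subset = filter_pool(pool)
print(len(subset))  # prints 1
```

The resulting subset would then be fed to the fixed CLIP training code and scored on the downstream evaluation suite.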

Our baseline experiments show that the DataComp workflow is a promising way of improving multimodal datasets. We introduce a new dataset, DataComp-1B, and show that CLIP models trained on it outperform OpenAI's CLIP model by 3.7 percentage points on ImageNet while using the same compute budget. Compared to LAION-5B, our improved data curation corresponds to a 9x savings in compute cost.


Watch on demand

Watch all of the live sessions on-demand and discover the latest developments in data-centric AI.