Better Visual Representations through Language Supervision


CLIP [1] demonstrates the power of simple contrastive learning on a large-scale dataset of image and caption pairs collected directly from the internet, without expensive manual labeling. In this project, we seek to improve CLIP's data efficiency and its performance when trained on uncurated datasets, and to explore additional capabilities beyond classification.
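To make the contrastive objective concrete, the following is a minimal sketch of a symmetric image-caption contrastive loss in the spirit of CLIP [1], written in plain Python on toy embeddings. The function names and the fixed temperature of 0.07 are assumptions made for illustration (CLIP treats the temperature as a learned parameter), and real training would use encoder outputs and a tensor library rather than Python lists.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    # Numerically stable cross-entropy of one row of logits against one target index.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, caption) pairs.

    Pair i is the positive for image i; every other caption in the batch
    acts as a negative, and vice versa for captions.
    The temperature value is an illustrative assumption, not a tuned setting.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and caption j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text: each row should peak on the diagonal.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column should peak on the diagonal.
    loss_t2i = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: matched pairs point in the same direction, mismatched pairs do not.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_captions = [[1.0, 0.0], [0.0, 1.0]]
shuffled_captions = [[0.0, 1.0], [1.0, 0.0]]
matched_loss = contrastive_loss(images, matched_captions)
shuffled_loss = contrastive_loss(images, shuffled_captions)
```

In this toy batch the loss is near zero when each image is paired with its own caption and large when the captions are shuffled, which is the signal that drives both encoders.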

Data Efficiency

CLIP achieves impressive results when trained on WIT400M, a dataset of 400M image and caption pairs collected by searching for a curated list of common terms (including every ImageNet class name). When trained on an uncurated dataset of 14M pairs, performance suffers dramatically, which suggests a reliance on massive scale and manual curation. One contributing factor is that the interdependence between vision and language gives rise to a self-reinforcing feedback loop: suboptimal textual features result in suboptimal visual features, and vice versa. Our current experiments show that bootstrapping the image encoder with self-supervised pretraining improves zero-shot classification accuracy by 12%. Inspired by ideas from unsupervised machine translation, we also seek to combine paired and unpaired data to further improve performance.
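For readers unfamiliar with the metric, zero-shot classification can be sketched as nearest-neighbor matching between an image embedding and the text embeddings of the class names. The snippet below is an illustrative sketch only: the embeddings are hand-written toy vectors, and the helper names are hypothetical rather than part of any CLIP codebase.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def zero_shot_predict(image_emb, class_text_embs):
    """Return the class whose text embedding is most similar to the image embedding."""
    return max(class_text_embs, key=lambda name: cosine(image_emb, class_text_embs[name]))

def zero_shot_accuracy(image_embs, labels, class_text_embs):
    """Fraction of images whose predicted class matches the ground-truth label."""
    correct = sum(
        zero_shot_predict(emb, class_text_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return correct / len(labels)

# Toy text embeddings, standing in for encoded prompts such as "a photo of a dog".
class_text_embs = {"dog": [1.0, 0.1], "cat": [0.1, 1.0]}
# Toy image embeddings with their true labels; the last one is deliberately ambiguous.
image_embs = [[0.9, 0.2], [0.2, 0.8], [0.4, 0.6]]
labels = ["dog", "cat", "dog"]
accuracy = zero_shot_accuracy(image_embs, labels, class_text_embs)
```

No classifier is trained on the target classes here; accuracy depends entirely on how well the image and text embeddings are aligned.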

A second factor is that ViT models are known to require much larger datasets or additional regularization to generalize well, yet heavy data augmentation appears to harm the relatively weak training signal from captions. We are therefore also exploring new ViT variants and smoother optimization algorithms.

Natural Concepts

Visual concepts in the real world exhibit a long-tailed distribution. In contrast, the WIT400M dataset is constructed from a curated list of search terms and then artificially class-balanced. Such curation is not realistic in the open-world setting, in which the operator does not know a priori which search terms are even relevant. In our work, we focus on the natural data distribution of internet content. We will explore methods from long-tailed classification, including loss reweighting, data resampling, and decoupled learning, to tackle this setting head-on.
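As a rough illustration of two of these long-tailed techniques, the sketch below computes inverse-frequency loss weights and class-balanced sampling probabilities from a list of labels. It assumes each example carries a single discrete concept label, which is a simplification for illustration (captions do not come with such labels), and the function names are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class loss weights proportional to 1 / class frequency.

    Weights are scaled as n / (k * count) so that the average weight
    over all examples is 1, keeping the overall loss scale unchanged.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

def class_balanced_sampling_probs(labels):
    """Per-example sampling probabilities that make every class equally likely.

    Each class receives total probability 1 / k, split evenly among its examples.
    """
    counts = Counter(labels)
    k = len(counts)
    return [1.0 / (k * counts[label]) for label in labels]

# Toy long-tailed label set: a frequent "head" class and a rare "tail" class.
labels = ["dog"] * 8 + ["axolotl"] * 2
weights = inverse_frequency_weights(labels)
probs = class_balanced_sampling_probs(labels)
```

Reweighting leaves the data stream untouched and rescales each example's loss, while resampling changes how often tail examples are drawn; both counteract the dominance of head classes in different ways.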


One hope of grounding vision models in language is that the resulting representations will be less prone to spurious correlations and therefore more robust and generalizable. CLIP hints at better robustness against distribution shift, but this evaluation merits further examination, since the training set may include the very distribution shifts (e.g. sketches, artistic renditions) that are used to evaluate robustness. We will conduct a careful evaluation by constructing new benchmarks that leverage concept drift and compositionality to collect novel image distributions unlikely to be seen during training.

[1]: Radford, A., Kim, J.W., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020.


Norman Mu

Saining Xie

Alexander Kirillov

David Wagner