Overview
CLIP [1] demonstrates the power of simple contrastive learning on a large-scale dataset of image and caption pairs collected directly from the internet, without expensive manual labeling. In our project, we seek to improve CLIP's data efficiency and its performance when trained on uncurated datasets, and to explore additional capabilities beyond classification.
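For reference, the contrastive objective CLIP trains with is a symmetric InfoNCE loss: matched image/caption embeddings are pulled together while mismatched pairs in the batch are pushed apart. Below is a minimal PyTorch sketch of that loss; the function name, the fixed temperature value, and the feature tensors are illustrative placeholders rather than CLIP's actual implementation (which, for example, learns the temperature during training).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text embeddings.

    image_features, text_features: (N, D) tensors where row i of each
    tensor is a matched image/caption pair. The temperature here is a
    fixed hyperparameter for illustration.
    """
    # L2-normalize so dot products are cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) similarity matrix: logits[i, j] = sim(image_i, text_j)
    logits = image_features @ text_features.t() / temperature

    # The matched caption for image i sits at index i
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image->text and text->image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Because every other example in the batch serves as a negative, the strength of the training signal grows with batch size, which is one reason contrastive pretraining at this scale works without manual labels.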
Data Efficiency
CLIP achieves impressive results when trained on WIT400M, a dataset of 400M image and caption pairs collected by searching for a curated list of...