Data Curation for Web-Scale Datasets


Data curation is a promising direction for improving the efficiency and performance of large-scale models. Current efforts towards curation are ad hoc and disconnected. We propose to develop new, principled approaches to data curation inspired by Sorscher et al. (2022). Because data is central to every large-scale learning task, this research has the potential for broad impact.

Why is data curation necessary?

Tremendous improvements in model performance are being driven by scaling self-supervised learning (SSL) on enormous datasets. These datasets contain billions of loosely curated data points. This scaling enables impressive performance, but growing dataset and model sizes have dramatically increased training costs. We propose to develop methods that curate data, reducing the amount of data needed for training without negatively impacting model performance.

Increases in the data size, model size, and compute time of state-of-the-art models have inspired empirical studies of how model performance changes as these quantities grow. These studies consistently observe power law scaling (Kaplan et al., 2020; Hoffmann et al., 2022). Such scaling is unsustainable, especially at large data scales, where ever-increasing amounts of data are required to achieve diminishing performance improvements. Recently, Sorscher et al. (2022) demonstrated both theoretically and empirically that power law scaling can be improved upon if data are appropriately selected. Intuitively, the more data a model has already seen, the more likely the average marginal data point is to be redundant. By selecting data intelligently, one can ensure that each added data point is less redundant and therefore more informative. This work also demonstrated that the quality of the data curation approach is critical to beating power law scaling, which motivates research on the design of curation metrics. Encouragingly, Sorscher et al. (2022) also found that the benefits of data pruning should grow as the data size increases, which points to the potential of these approaches at massive scale.
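The two ideas in the paragraph above can be made concrete with a small sketch. It is illustrative only and rests on assumptions that are not part of this proposal: the error is assumed to follow error = a * N ** (-exponent) exactly, the exponent value 0.1 is a hypothetical choice, and the curation score (Euclidean distance to the nearest precomputed cluster centroid, keeping the farthest points) is just one simple stand-in for a curation metric, loosely modeled on prototype-distance pruning. The function names are made up for this example.

```python
import math


def data_multiplier(error_ratio, exponent):
    """Factor by which the dataset must grow to multiply the error by
    `error_ratio`, ASSUMING error = a * N ** (-exponent) holds exactly."""
    if not 0 < error_ratio <= 1:
        raise ValueError("error_ratio must be in (0, 1]")
    if exponent <= 0:
        raise ValueError("exponent must be positive")
    # (N_new / N_old) ** (-exponent) = error_ratio  =>  solve for N_new / N_old
    return error_ratio ** (-1.0 / exponent)


def prune_by_centroid_distance(embeddings, centroids, keep_fraction):
    """Keep the `keep_fraction` of points farthest from their nearest centroid.

    ASSUMPTION: points close to a centroid are treated as redundant and
    points far from every centroid as informative. `centroids` are taken
    as given (for example, from a clustering step not shown here).
    Returns the sorted indices of the kept points.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not embeddings:
        return []
    # Score each point by its distance to the closest centroid.
    scores = [min(math.dist(x, c) for c in centroids) for x in embeddings]
    n_keep = math.ceil(keep_fraction * len(embeddings))
    # Rank indices from highest score (least redundant) to lowest.
    ranked = sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    # Hypothetical exponent: how much more data is needed to halve the error?
    print(data_multiplier(0.5, 0.1))

    # Toy 2-D "embeddings" clustered around two hypothetical centroids.
    points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.3), (1.0, 1.0), (5.0, 5.0), (5.2, 5.0)]
    print(prune_by_centroid_distance(points, [(0.0, 0.0), (5.0, 5.0)], 0.5))
```

With the hypothetical exponent of 0.1, halving the error requires 2 ** 10 = 1024 times as much data, which is the sense in which power law scaling yields diminishing returns. The pruning function shows the alternative in miniature: rather than adding data indiscriminately, rank points by a curation score and keep only those judged least redundant. Which score to use, and whether to keep the hardest or the easiest points, is exactly the kind of design question that curation metric research would need to settle.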

While significant attention is often paid to the architecture of the models themselves, data curation has been understudied in the literature, particularly for unsupervised learning. This presents an opportunity, as even moderate reductions in data size can yield substantial efficiency gains and are likely to increase learning speed. This is critical because modern models take a very long time to train and often show no sign of converging (for example, see the training curves for OpenCLIP, OpenCLIP ViT-G/14, and OPT). Faster learning means that, for a fixed training budget, data curation can actually lead to higher-performing models, especially when misleading data is filtered out.

List of Researchers:

Justin Kang (UC Berkeley)

Nived Rajaraman (UC Berkeley)

Ari Morcos (Meta AI)

Prof. Kannan Ramchandran (UC Berkeley)

Prof. Anant Sahai (UC Berkeley)


Sorscher, B., Geirhos, R., Shekhar, S., Ganguli, S., & Morcos, A. S. (2022). Beyond neural scaling laws: Beating power law scaling via data pruning. Advances in Neural Information Processing Systems, 35.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., ... & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., ... & Sifre, L. (2022). Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., ... & Jitsev, J. (2022). Reproducible scaling laws for contrastive language-image learning. arXiv preprint arXiv:2212.07143.

Feldman, V., & Zhang, C. (2020). What neural networks memorize and why: Discovering the long tail via influence estimation. Advances in Neural Information Processing Systems, 33, 2881-2891.

Zhang, R., Isola, P., Efros, A. A., Shechtman, E., & Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 586-595).

Li, A. C., Brown, E., Efros, A. A., & Pathak, D. (2023). Internet Explorer: Targeted representation learning on the open web. arXiv preprint arXiv:2302.14051.