A Curriculum for Foundational AI Models Inspired by Human Cognition

Contributors

Eunice Yiu (ey242@berkeley.edu)

Shiry Ginosar (shiry@berkeley.edu)

Kate Saenko (saenko@meta.com)

Alison Gopnik (gopnik@berkeley.edu)

Abstract

Foundational AI models trained on multimodal data, such as GPT-4V, are becoming more powerful, yet there is no comprehensive way to compare their performance to that of humans. Researchers typically evaluate these models on existing academic benchmarks covering a collection of tasks such as zero-shot classification, text-to-image retrieval, and phrase grounding. These tests are somewhat arbitrary in coverage and focus largely on the recognition of basic concepts, leaving many gaps in visual, spatial, and linguistic reasoning. Inspired by tests of human visual and language reasoning abilities used in developmental cognitive science, we propose to construct a benchmark for multimodal models that serves as a test of broad visuo-linguistic intelligence. Children do not learn vision and language with the sole goal of zero-shot classification or VQA; their learning is multifaceted and progressive. In our developing benchmark, we evaluate AI models and children on a level playing field: their ability to infer an underlying object characteristic, relationship, or transformation, and to generalize it to novel, out-of-distribution instances at test time. Our curriculum spans from spatial reasoning and transformations to more causal and functional abstractions.
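To make this evaluation paradigm concrete, the following is a minimal Python sketch of how a benchmark item and its scoring might be structured. The BenchmarkItem fields, the query_model callable, and all names are hypothetical illustrations for exposition, not a released data format or API.

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class BenchmarkItem:
    # One forced-choice item: demonstrations establish an underlying rule, and the
    # test image probes generalization to a novel, out-of-distribution instance.
    skill: str                              # e.g. "simple object transformation"
    demonstrations: List[Tuple[str, str]]   # (image path, description) pairs illustrating the rule
    test_image: str                         # held-out instance the rule must be applied to
    candidate_answers: List[str]            # the same options are shown to models and children
    correct_answer: str

def score_item(item: BenchmarkItem,
               query_model: Callable[[List[Tuple[str, str]], str, List[str]], str]) -> bool:
    # Ask the model to infer the rule from the demonstrations and apply it to the test image.
    prediction = query_model(item.demonstrations, item.test_image, item.candidate_answers)
    return prediction == item.correct_answer

def generalization_accuracy(items: List[BenchmarkItem], query_model) -> float:
    # Aggregate accuracy across items, directly comparable to children's accuracy on the same items.
    return sum(score_item(item, query_model) for item in items) / len(items)

Under this framing, a model and a child see identical demonstrations and identical forced-choice options, so any accuracy difference reflects the ability to abstract and transfer the underlying rule rather than differences in task format.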

Examples of Currently Proposed Skills

These tasks all take inspiration from studies in developmental psychology and artificial intelligence; hypothetical example items for each skill are sketched after the list.

1. Simple object transformations: 

detecting simple transformations in orientation, number, color, size, etc. [1,2]

2. Spatial relations:

identifying the spatial relationships between specific object entities in a scene [3,4]

3. Task-driven object detection & tool use:

recognizing functional objects that are suitable for solving a specified task in a scene [5,6]
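As an illustration, below are three hypothetical items, one per skill, written as plain Python dictionaries so the sketch stands on its own; the image names, answer options, and correct answers are invented placeholders rather than actual benchmark content.

example_items = [
    {
        "skill": "simple object transformation",
        "demonstrations": [("pair_1.png", "the objects double in number"),
                           ("pair_2.png", "the objects double in number")],
        "test_image": "novel_pair.png",
        "candidate_answers": ["unchanged", "doubled", "halved", "recolored"],
        "correct_answer": "doubled",
    },
    {
        "skill": "spatial relation",
        "demonstrations": [("scene_1.png", "the cup is to the left of the book")],
        "test_image": "scene_2.png",
        "candidate_answers": ["left of", "right of", "above", "below"],
        "correct_answer": "above",
    },
    {
        "skill": "task-driven object detection",
        "demonstrations": [("toolbox.png", "task: drive a nail -> hammer")],
        "test_image": "kitchen_without_hammer.png",
        "candidate_answers": ["sponge", "dish towel", "frying pan", "paper plate"],
        "correct_answer": "frying pan",  # a functionally suitable substitute when no hammer is present
    },
]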

References

[1] Goddu, M. K., Lombrozo, T., & Gopnik, A. (2020). Transformations and transfer: Preschool children understand abstract relations and reason analogically in a causal task. Child Development, 91(6), 1898-1915.

[2] Moskvichev, A., Odouard, V. V., & Mitchell, M. (2023). The ConceptARC Benchmark: Evaluating Understanding and Generalization in the ARC Domain. arXiv preprint arXiv:2305.07141.

[3] Park, Y., & Casasola, M. (2017). The impact of object type on the spatial analogies in Korean preschoolers. Cognitive Psychology, 94, 53-66.

[4] Liu, F., Emerson, G., & Collier, N. (2023). Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11, 635-651.

[5] Yiu, E., & Gopnik, A. (2023). Discovering new functions in everyday tools by children, adults and LLMs. In Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 45, No. 45).

[6] Sawatzky, J., Souri, Y., Grund, C., & Gall, J. (2019). What object should I use? - Task driven object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 7605-7614).