Trevor Darrell

Modeling Latent Variables for Self-supervised Learning

Abstract: Masked Image Modeling (MIM) is a promising self-supervised learning approach that enables learning from unlabeled images. Despite its recent success, learning good representations through MIM remains challenging because it requires predicting the right semantic content at the right locations. For example, given an incomplete picture of a dog, we can guess that there is a tail, but we cannot determine its exact location (a). To address this, we follow LeCun et al. (b), who suggested using a latent variable to capture such uncertainties.
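A minimal sketch of this idea, assuming a VAE-style latent variable on top of a patch encoder; the module shapes, pooling, and KL weight below are illustrative assumptions, not the project's actual architecture:

import torch
import torch.nn as nn

class LatentMIM(nn.Module):
    """Toy masked image modeling head with a latent variable z that
    captures uncertainty about the masked content (illustrative only)."""
    def __init__(self, patch_dim=768, latent_dim=32):
        super().__init__()
        self.encoder = nn.Linear(patch_dim, patch_dim)     # stand-in for a ViT encoder
        self.to_mu = nn.Linear(patch_dim, latent_dim)      # posterior mean of z
        self.to_logvar = nn.Linear(patch_dim, latent_dim)  # posterior log-variance of z
        self.decoder = nn.Linear(patch_dim + latent_dim, patch_dim)

    def forward(self, visible_patches, num_masked):
        h = self.encoder(visible_patches).mean(dim=1)         # pool visible context
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        ctx = torch.cat([h, z], dim=-1)
        pred = self.decoder(ctx).unsqueeze(1).expand(-1, num_masked, -1)
        return pred, mu, logvar

model = LatentMIM()
visible = torch.randn(4, 10, 768)   # 4 images, 10 visible patches each
pred, mu, logvar = model(visible, num_masked=5)
target = torch.randn(4, 5, 768)     # ground-truth features of the 5 masked patches
recon = ((pred - target) ** 2).mean()
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
loss = recon + 0.01 * kl            # reconstruction + KL, VAE-style

Sampling different values of z then yields different plausible completions (e.g., tails in different locations), rather than a single blurry average.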

...

Reliable Multimodal Models

Machine learning has advanced dramatically, narrowing the accuracy gap to humans in multimodal tasks like visual question answering (VQA) and image captioning. These tasks are especially important for assisting people with visual impairments, for example in daily routines or when interacting with visual content on the web. To provide such utility, users must be able to trust the output of these tools, as they may base decisions or actions on it. While improving the accuracy of approaches may be an important factor for trusting models, models are imperfect and will...

Self-supervised Open-World Segmentation

Overview

Standard benchmarks in image segmentation assume a "closed-world" setting, in which a pre-determined set of non-overlapping object categories is exhaustively segmented and labeled in all training and evaluation images. This significantly increases the difficulty of data collection, requiring either complex quality control and post-processing schemes if using crowd-sourced labeling or...

Fate of Snow

Northstar: “Develop iterative, meaningful benchmarks for AI researchers that enable substantial progress on problems related to climate change as well as impactful AI methodology.”

Summary: Learning from Observational, Multimodal, Multiscale, Spatiotemporal (OMMS) data sources is critical for researchers and practitioners working on problems related to climate change. AI methods for handling these types of data – and the many associated problems – remain largely undeveloped, and...

Towards Human-like Attention

Overview

Convolutional Neural Networks (CNNs) can already match human performance on clean images, but they are not as robust. Recently proposed self-attention mechanisms seem to help robustness, yet they still fail in many cases. However, previous studies do show a close relationship between attention and robustness in the human visual system. We hypothesize that attention is the key to robustness, but that self-attention is not the right formulation for it. We propose to study the neuronal foundations of human visual attention and to design a human-like attention mechanism that achieves higher robustness....

Self-supervised Semantic Segmentation in the Wild

Overview

Self-supervised learning (SSL) enables the learning of effective task-agnostic representations that generalize to a wide range of downstream applications. Recent advances in SSL have adopted strong augmentation pipelines combined with pretext tasks to achieve results competitive with supervised learning while using a fraction of the labels. The goal of this project is to transfer this success to real-world applications. SSL in the wild specifically targets semantic segmentation, where pixel-level annotation is hard and time-consuming for humans....
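As a concrete illustration of such an augmentation pipeline, here is a minimal SimCLR-style two-view sketch; the specific transforms and parameter values are assumptions, not this project's recipe:

from torchvision import transforms

# Two independently sampled "views" of the same image feed an
# instance-discrimination pretext task.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

def two_views(pil_image):
    # Each call re-samples the random transforms, so the two outputs differ.
    return augment(pil_image), augment(pil_image)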

Modeling Interpersonal Multimodal Signals in Social Conversation

With the integration of VR/AR and robotics into society, the need for socially intelligent AI systems has become more compelling, as people seek to build systems that are more responsive to human interactions, or strive to recreate more embodied telepresence. Furthermore, current advances in 3D human pose estimation have reached levels of accuracy that allow us to tap into in-the-wild datasets to extract poses and study human behavior, analyses that were previously possible only on constrained mocap datasets. Coupled with the demand for social AI, the time is ripe for investigating social...

Vision and Language Testbed

More information on this project is coming soon; please check back.

Interactive Learning from Vision and Touch

The virtuoso plays the piano with passion, poetry, and extraordinary technical ability. As Liszt said, a virtuoso “must call up scent and blossom, and breathe the breath of life.” Despite recent advances, how to enable a robot to accurately, naturally, and poetically play the piano remains an open and largely unexplored question....

Self-supervised Learning for Generic Visual Representation

The objective of this proposed collaboration is to explore self-supervised learning beyond the current paradigm of exploiting instance discrimination and contrastive learning, as current exploitation-driven research may be spiraling around a local optimum, while larger-scale algorithmic changes are needed.
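For reference, the instance-discrimination paradigm the passage describes typically optimizes an InfoNCE-style contrastive loss; a minimal sketch, with illustrative names and temperature value:

import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: embeddings of two augmented views of the same batch, shape (N, D)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # cosine similarities of all view pairs
    labels = torch.arange(z1.size(0))    # matching views sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))

Each image is pulled toward its other augmented view and pushed away from every other image in the batch; it is this per-instance objective that the project proposes to move beyond.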

Researchers

Tete Xiao, UC Berkeley, http://tetexiao.com/

Piotr Dollár, Facebook AI Research,...