Active

Data Curation for Web-Scale Datasets

Abstract

Data curation is a promising direction for improving the efficiency and performance of large-scale models. Current efforts toward curation are ad hoc and disconnected. We propose to develop new, principled approaches to data curation inspired by Sorscher et al...
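
As a point of reference for the score-based pruning studied by Sorscher et al., here is a minimal sketch of difficulty-based data pruning; the difficulty scores, the keep fraction, and the keep-hard heuristic are illustrative assumptions, not the proposal's method.

    import numpy as np

    def prune_by_difficulty(scores: np.ndarray, keep_fraction: float, keep_hard: bool = True) -> np.ndarray:
        """Return indices of training examples to keep after score-based pruning.

        scores: one difficulty score per example (higher = harder).
        keep_hard: with abundant data, keeping the hardest examples tends to help;
        with scarce data, keeping the easier examples can be preferable.
        """
        n_keep = int(len(scores) * keep_fraction)
        order = np.argsort(scores)                    # easy -> hard
        return order[-n_keep:] if keep_hard else order[:n_keep]

    # Illustrative usage with random scores standing in for real per-example metrics.
    scores = np.random.rand(10_000)
    kept_indices = prune_by_difficulty(scores, keep_fraction=0.3)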

Dynamic Compression Techniques for Efficient Transformers

Abstract

Transformers are a class of deep neural networks that have achieved state-of-the-art results across a wide range of domains, including natural language processing, computer vision, and computational biology. The widespread success of these models has been attributed to the attention mechanism, which identifies complex dependencies between elements of each input sequence. While the attention mechanism is incredibly...
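
For reference, a minimal NumPy sketch of the scaled dot-product attention the abstract refers to; the toy shapes are illustrative, and the compression techniques themselves are not shown here.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

        Each output position is a weighted sum over all value vectors, which is how
        attention captures dependencies between elements of the input sequence.
        """
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                       # (seq_len, seq_len)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
        return weights @ V

    # Toy self-attention over a sequence of 4 tokens with 8-dimensional embeddings.
    x = np.random.randn(4, 8)
    out = scaled_dot_product_attention(x, x, x)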

Learning Large Touch-Vision-Language Models Using Self-Supervised Robot Learning

Abstract

Humans depend on the integration of multiple sensory inputs, including but not limited to vision, language, audio, and touch, to successfully carry out daily tasks. Giving robots an analogous ability to perceive and process information from different sensory modalities enables a richer understanding of the physical...

Writing with Speech — Using LLMs for Gist-Level Manipulation of Spoken Text

Dictation enables efficient text input on mobile devices. However, writing with speech can produce verbose, incoherent, and inconsistent text that requires substantial editing. Our project creates Rambler, an LLM-integrated user interface designed for conceptual-level editing of dictated content through two main sets of functions: gist extraction and macro revision. Gist extraction...
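
Purely as an illustration of what a gist-extraction call might look like, a hypothetical sketch; call_llm, extract_gists, and the prompt wording are placeholders introduced here, not Rambler's actual implementation.

    def call_llm(prompt: str) -> str:
        """Placeholder for whichever LLM backend the interface uses."""
        raise NotImplementedError

    def extract_gists(dictated_text: str, max_gists: int = 5) -> list[str]:
        """Condense rambling dictation into a handful of short gist phrases."""
        prompt = (
            f"Summarize the following dictated text as at most {max_gists} "
            f"short gist phrases, one per line:\n\n{dictated_text}"
        )
        return [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]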

Automated State and Action Space Design for Multi-Objective Reinforcement Learning

Introduction

While Reinforcement Learning (RL) has shown impressive performance in games (e.g., Go and chess [14,13], Dota 2 [2], StarCraft II [18,17], etc.), using RL in real-world scenarios remains challenging due to its enormous sample complexity and limited ability to generalize to unseen environments. To train an RL agent with traditional online methods, millions of interactions with the environment are often needed, which is definitely...
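
For concreteness, a minimal sketch of the standard online interaction loop whose step count drives this sample complexity, written against the Gymnasium API; the environment, the random policy, and the one-million-step budget are illustrative assumptions.

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    total_steps, budget = 0, 1_000_000      # online RL budgets are often of this order

    obs, info = env.reset()
    while total_steps < budget:
        action = env.action_space.sample()  # stand-in for the agent's current policy
        obs, reward, terminated, truncated, info = env.step(action)
        total_steps += 1
        # ...update the agent from (obs, action, reward) here...
        if terminated or truncated:
            obs, info = env.reset()
    env.close()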

Self-supervised Open-World Segmentation

Overview

Standard benchmarks in image segmentation assume a "closed-world" setting, in which a pre-determined set of non-overlapping object categories is exhaustively segmented and labeled in all training and evaluation images. This significantly increases the difficulty of data collection, requiring either complex quality control and post-processing schemes if using crowd-sourced labeling or...

Multiscale Modeling for Control

Abstract

A long-standing goal of AI research is to build or learn representations that lead to generalization in real-world sequential decision-making settings. Object-level representations that enable abstract reasoning in visual environments are a good candidate for this goal. Indeed, RL agents equipped with this inductive bias are able to generalize to tasks that involve a different...

Towards a Unified Understanding of Privacy and Generalization for Better Algorithm Design

Abstract

Machine learning and deep learning have emerged as important technologies that enable a wide range of applications, including computer vision, natural language processing, healthcare, and recommendation systems. However, to deploy these machine learning algorithms responsibly in society, it is critical to design them to conform to ethical values such as privacy, safety, and fairness. For instance, researchers have found that information about training data can be extracted from a released machine learning model, which raises important privacy concerns, and adversarial attacks or...
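
One concrete instance of the training-data leakage mentioned above is membership inference; below is a minimal sketch of a loss-thresholding attack, where the per-example losses, the threshold, and the synthetic data are assumptions made for illustration.

    import numpy as np

    def membership_inference(per_example_losses: np.ndarray, threshold: float) -> np.ndarray:
        """Guess that examples with unusually low loss were in the training set.

        per_example_losses: losses of the released model on candidate examples.
        Returns a boolean array; True means "predicted training-set member".
        A model that memorizes its training data makes these guesses accurate,
        which is exactly the privacy concern described above.
        """
        return per_example_losses < threshold

    # Illustrative usage: members tend to receive lower loss than non-members.
    member_losses = np.random.exponential(0.2, size=100)
    nonmember_losses = np.random.exponential(1.0, size=100)
    guesses = membership_inference(np.concatenate([member_losses, nonmember_losses]), threshold=0.5)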

GuBERT: Grounded units for Self-Supervised Pre-training of Speech

Self-Supervised Learning (SSL) techniques have proved to be highly effective for representation learning in multiple modalities such as text, images, and, more recently, speech. In the speech domain, SSL approaches to pretraining have achieved state-of-the-art results in several downstream applications such as speech recognition (wav2vec, wav2vec 2.0), spoken language modeling (GSLM), and speech resynthesis (HuBERT). However, this approach requires massive amounts of speech data (thousands of hours of speech) and computational resources to train such large models. Also, while...
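
As a reference for how pretrained SSL speech models of this kind are typically consumed downstream, a minimal sketch using the publicly released wav2vec 2.0 checkpoint from the Hugging Face transformers library; the checkpoint name and the random waveform are illustrative, and this shows feature extraction rather than GuBERT's own pretraining.

    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    # Publicly released SSL speech model, used here purely as an example checkpoint.
    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

    # One second of 16 kHz audio standing in for real speech.
    waveform = torch.randn(16_000).numpy()
    inputs = extractor(waveform, sampling_rate=16_000, return_tensors="pt")

    with torch.no_grad():
        features = model(**inputs).last_hidden_state   # (1, frames, 768) SSL representations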