GuBERT: Grounded Units for Self-Supervised Pre-training of Speech

Self-Supervised Learning (SSL) techniques have proved effective for representation learning in multiple modalities, including text, images, and more recently speech. In the speech domain, SSL approaches to pretraining have produced state-of-the-art results in several downstream applications, such as speech recognition (WAV2VEC, WAV2VEC2.0), spoken language modeling (GSLM), and speech resynthesis (HuBERT). However, this approach requires massive amounts of speech data (thousands of hours of speech) and substantial computational resources to train such large models. Also, while pretrained models based on SSL perform well on downstream tasks in terms of objective measures, the learned representations themselves are not “explainable” and not amenable to “probing” in systematic ways.

While SSL is itself inspired by skill acquisition in children, there are several fundamental differences between current SSL and human learning strategies. Recent studies in human neurophysiology have revealed new organizational schemes of speech production, and have offered insights into the nature of the fundamental units of speech production that are shared across all humans. Specifically, we now know that high-dimensional speech acoustics can be characterized by a low-dimensional manifold of speech articulation, and that this manifold is characterized by an inventory of articulatory movement patterns that explain all variance in the acoustic signals. These insights are pertinent to speech representation learning, which is the ultimate goal of self-supervised learning from data.

Our goal in this work is to bridge the gap between these disparate representation learning schemes. We propose to incorporate human-inspired inductive biases into self-supervised learning for speech representation learning. We call this GuBERT (Grounded Units for Bidirectional Encoder Representations from Transformers), referring to the idea that the units are grounded in speech production theories. Specifically, we first aim to investigate how well current SSL models are grounded by analyzing representations from HuBERT using articulation data collected in two modalities: electromagnetic articulography (EMA) and real-time MRI (rtMRI). Based on our analysis, we will design a new articulation-based inductive bias to improve the sample efficiency and robustness of current techniques.


  • Cheol Jun Cho, UC Berkeley
  • Gopala K. Anumanchipalli, UC Berkeley
  • Abdelrahman Mohamed, Meta AI