Gopala Anumanchipalli

Grounded and Structured Self-Supervised Pre-training of Speech for Spoken Language Model

Self-supervised learning (SSL) techniques have been successful in learning rich representations for high-dimensional natural data. In the speech domain, SSL approaches for pretraining have resulted in state-of-the-art demonstrations in several downstream applications, including automatic speech recognition, spoken language modeling, and speech resynthesis. SSL approaches employ self-derived targets as a reference to train the models, which allows the use of large-scale data without labels. However, current speech SSL methods often suffer from the arbitrariness...
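To make the notion of self-derived targets concrete, the following is a minimal illustrative sketch (not the exact recipe of this work): frame-level MFCCs from unlabeled audio are clustered with k-means, and the resulting cluster ids serve as pseudo-labels that a model is trained to predict for masked frames. The file list, cluster count, and helper name are hypothetical.

import numpy as np
import librosa
from sklearn.cluster import KMeans

def derive_pseudo_targets(wav_paths, n_clusters=100, sr=16000):
    """Cluster MFCC frames from unlabeled audio into discrete pseudo-targets."""
    frames = []
    for path in wav_paths:
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, T)
        frames.append(mfcc.T)                               # (T, 13)
    all_frames = np.concatenate(frames, axis=0)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(all_frames)
    # Per-utterance sequences of cluster ids become the reference targets
    # for masked-frame prediction during pretraining.
    return [km.predict(f) for f in frames], km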

GuBERT: Grounded units for Self-Supervised Pre-training of Speech

Self-Supervised Learning (SSL) techniques have proved to be quite effective for representation learning in multiple modalities such as text, images, and, more recently, speech. In the speech domain, SSL approaches for pretraining have resulted in state-of-the-art demonstrations in several downstream applications, including speech recognition (wav2vec, wav2vec 2.0), spoken language modeling (GSLM), and speech resynthesis (HuBERT). As such, this approach requires massive amounts of speech data (thousands of hours) and considerable computational resources to train such large models. Also, while...
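As a rough sketch of how a pretrained SSL model is reused for downstream tasks (an illustrative example using torchaudio's publicly released HuBERT checkpoint, not the pipeline described in this work), frame-level representations can be extracted from unlabeled audio and fed to a lightweight task-specific head; the audio filename below is hypothetical.

import torch
import torchaudio

# Load a public HuBERT checkpoint and extract frame-level features;
# a small downstream classifier would then be trained on these.
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

waveform, sample_rate = torchaudio.load("utterance.wav")  # hypothetical file
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    # extract_features returns one tensor of shape (batch, frames, dim) per layer.
    features, _ = model.extract_features(waveform)
print(len(features), features[-1].shape)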