Grounded and Structured Self-Supervised Pre-training of Speech for Spoken Language Model

Self-supervised learning (SSL) techniques have been successful in learning rich representations of high-dimensional natural data. In the speech domain, SSL pretraining has produced state-of-the-art results in several downstream applications, including automatic speech recognition, spoken language modeling, and speech resynthesis. SSL approaches train models against self-generated targets, which allows the use of large-scale unlabeled data. However, the target space in current speech SSL methods is often designed arbitrarily, which leaves the resulting models as black boxes. Given that speech is composed of linguistic content at several levels, this lack of structure has been a major obstacle to understanding and controlling the behavior of the models.

Besides interpretability, better-structured representations are crucial to expanding the role of speech in larger systems, where it has so far been largely limited to automatic speech recognition (ASR) or detection. However, speech conveys rich content beyond the text, which current speech interfaces (e.g., the Whisper interface in ChatGPT) do not fully leverage. For example, prosody is a significant cue for inferring context and intention, but it is absent from textual representations. Thus, it is crucial for a speech model to extract content enriched by non-textual information. Because information in speech signals lies at multiple time scales, good targets that properly encode and fully leverage this underlying information are lacking. We propose better targets and representations for building a performant speech LLM: a general foundation model that takes audio signals as input and is versatile in solving downstream tasks.

Our goal is to develop a new SSL method that learns a structured and grounded representation of speech. Here, "structured" means that the learned representations exhibit the linguistic hierarchy of speech, and "grounded" means that they are tied to specific linguistic entities. To fully extract the representation of speech itself, we adopt a "text-less" approach as the overarching principle of the project. By building a model without relying on text, we aim to find a rich representational space of speech that is potentially lost in the text representation.

The speech hierarchy is broadly divided into three levels by type of information: acoustics, phonetics, and semantics. Most traditional speech feature engineering, such as the mel-spectrogram, falls at the acoustic level. Recent studies have demonstrated that higher-level representations can be learned through SSL. Our paper published at ICASSP 2023 demonstrates that the representations of speech SSL models are significantly correlated with actual human vocal tract articulation, suggesting that they encode articulatory phonetic information [3]. This paper came out of last year's Meta-BAIR Commons project, and our ongoing collaboration has revealed that SSL is a universal articulatory learner, agnostic to gender, speaker, dialect, and even language (submitted to ICASSP 2024, preprint).
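
One common way to measure how well SSL features correlate with articulation is a linear probe: fit a linear map from frame-level features to articulatory trajectories, then report the Pearson correlation on held-out frames. The sketch below illustrates that idea on synthetic data only. The feature dimension, the number of articulatory channels, the ridge penalty, and all function names are assumptions made for illustration, and this is not necessarily the exact protocol of [3].

```python
import numpy as np

def fit_linear_probe(X, Y, ridge=1e-3):
    """Fit a ridge-regularized linear map (with bias) from features X to targets Y.

    For simplicity the ridge penalty is applied to all weights, including the bias.
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])          # append a bias column
    gram = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(gram, Xb.T @ Y)             # shape: (D + 1, A)

def predict(X, W):
    """Apply a fitted probe to new features."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return Xb @ W

def pearson_per_channel(Y_true, Y_pred):
    """Pearson correlation between true and predicted values, per target channel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom

# Synthetic stand-ins (assumptions): 32-dim "SSL features" and
# 12 "articulatory channels" that depend linearly on them, plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))
true_map = rng.normal(size=(32, 12))
Y = X @ true_map + 0.1 * rng.normal(size=(2000, 12))

# Fit on the first 1500 frames, evaluate on the remaining 500.
W = fit_linear_probe(X[:1500], Y[:1500])
corr = pearson_per_channel(Y[1500:], predict(X[1500:], W))
print("mean held-out correlation:", corr.mean())
```

With real data, `X` would hold frame-level features from one layer of an SSL model and `Y` the time-aligned articulatory measurements; a high held-out correlation from a purely linear map is what would suggest that the articulatory information is readily available in the representation.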

While current SSL models can effectively represent phonetic information, higher-level linguistic content is not yet fully leveraged. Speech SSL models such as Wav2Vec2 [1] and HuBERT [2] have been shown to encode some lexical semantics. However, their representations are highly entangled with phonetics, and extracting higher-level linguistic content has been challenging. We discovered that by self-distilling a pretrained SSL model on a sentence-level representation, syllabic organization naturally emerges from speech without any supervision. Our proposed model, SD-HuBERT, effectively learns syllabic units and definite boundaries within speech, and outperforms previous methods in unsupervised syllable discovery. Furthermore, this emergent property provides valuable insight into the transition between phonetics and semantics in speech recognition, which is not yet fully understood in the linguistics literature. We submitted this result to ICASSP 2024 (preprint).

So far, we have shown that current SSL models are articulatory phonetic models and have developed a new syllable-level model. Our next goal is to leverage these findings to build a better spoken language model. By quantizing each representation, we can obtain separate sets of phonetic units and syllabic units. Combined with the acoustic units from EnCodec [4], we will build a hierarchical language model encompassing all these levels of units. Compared to previous approaches that use speech units to build a language model (so-called "text-less NLP") [5,6], we have an additional unit space: the syllable space from SD-HuBERT. We hypothesize that this addition will enable better training and performance of the model. For example, the high sampling rate of phonetic units has been a bottleneck for long-term consistency in GPT-style autoregressive generation. By leveraging syllabic units, which span longer durations at a lower temporal frequency, learning long-term relations should become more tractable and stable. In the end, we will integrate this model with LLMs to reduce the gap between the two modalities.
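
As a rough illustration of the quantization step, the sketch below clusters frame-level features with plain k-means to obtain discrete unit IDs and then collapses consecutive repeats, which shortens the token sequence passed to a language model. It runs on synthetic features only. The codebook size, feature dimension, and function names are assumptions for illustration, k-means is used here as one common choice of quantizer, and the project's actual quantizer and unit inventory may differ.

```python
import numpy as np

def kmeans(features, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means. Returns (codebook, unit_ids)."""
    rng = np.random.default_rng(seed)
    # Initialize the codebook from k distinct frames (fancy indexing copies).
    codebook = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every frame to every codebook entry.
        dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        units = dists.argmin(axis=1)
        for j in range(k):
            members = features[units == j]
            if len(members) > 0:            # leave empty clusters unchanged
                codebook[j] = members.mean(axis=0)
    # Final assignment so the unit IDs match the returned codebook.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook, dists.argmin(axis=1)

def collapse_repeats(units):
    """Merge runs of identical consecutive unit IDs into a single token."""
    units = np.asarray(units)
    if len(units) == 0:
        return units
    keep = np.concatenate(([True], units[1:] != units[:-1]))
    return units[keep]

# Synthetic stand-in (assumption): 50 frames drawn around 4 cluster centers,
# with each segment held for 10 consecutive frames.
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(4, 16))
frame_labels = np.repeat([0, 2, 1, 3, 0], 10)
features = centers[frame_labels] + 0.1 * rng.normal(size=(50, 16))

codebook, units = kmeans(features, k=4)
tokens = collapse_repeats(units)
print("frames:", len(units), "tokens after collapsing repeats:", len(tokens))
```

The same procedure could be applied separately to phonetic-level and syllable-level features to obtain two unit vocabularies. Because syllabic units change less often than frame-level phonetic units, the syllable-level token sequence for an utterance would be shorter, which is the property the paragraph above relies on for long-range modeling.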

Meta Collaborator: Shang-Wen (Daniel) Li

BAIR PI: Gopala K. Anumanchipalli

BAIR Student Researchers: Cheol Jun Cho, Akshat Gupta, Nicholas Lee, Jiachen Lian

Other Collaborators: Abdelrahman Mohamed (Rembrand), Alan W Black (CMU)


[1] Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449-12460.

[2] Hsu, W. N., Bolte, B., Tsai, Y. H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451-3460.

[3] Cho, C. J., Wu, P., Mohamed, A., & Anumanchipalli, G. K. (2023, June). Evidence of Vocal Tract Articulation in Self-Supervised Learning of Speech. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.

[4] Défossez, A., Copet, J., Synnaeve, G., & Adi, Y. (2022). High fidelity neural audio compression. arXiv preprint arXiv:2210.13438.

[5] Lakhotia, K., Kharitonov, E., Hsu, W. N., Adi, Y., Polyak, A., Bolte, B., ... & Dupoux, E. (2021). On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9, 1336-1354.

[6] Borsos, Z., Marinier, R., Vincent, D., Kharitonov, E., Pietquin, O., Sharifi, M., ... & Zeghidour, N. (2023). AudioLM: A language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing.