Dynamic Compression Techniques for Efficient Transformers

Abstract

Transformers are a class of deep neural networks that have achieved state-of-the-art results across a wide range of domains, including natural language processing, computer vision, and computational biology. The widespread success of these models has been attributed to the attention mechanism, which captures complex dependencies between elements of each input sequence. While the attention mechanism is highly effective at processing sequential data, its cost scales quadratically with the length of the input sequence, which limits the sequence lengths these models can handle and drives up training and inference costs. In recent years, several techniques have been proposed to create more efficient Transformer models that can handle long input sequences without incurring high computational costs.
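
To make the quadratic scaling concrete, the sketch below implements standard scaled dot-product attention in PyTorch. It is a minimal illustration rather than the project's own code: the full (seq_len x seq_len) score matrix is what grows quadratically with sequence length, so doubling the input length quadruples its size.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Standard attention: the (seq_len x seq_len) score matrix is the
    source of the quadratic cost in sequence length."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # shape: (..., seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Doubling the sequence length quadruples the number of score entries.
for seq_len in (1024, 2048, 4096):
    q = k = v = torch.randn(1, seq_len, 64)
    out = scaled_dot_product_attention(q, k, v)
    print(seq_len, "-> score matrix entries:", seq_len * seq_len)

Efficient-Transformer techniques generally avoid materializing this full score matrix, for example by sparsifying, approximating, or compressing the attention computation.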

In our research, we aim to make Transformer-based models more computationally efficient and viable for applications involving long sequences of data. In particular, we are motivated by code completion and code translation as applications, which not only require longer sequences than natural language, but also impose tight constraints on inference time.

Researchers

  • Karna Mendonca, UC Berkeley
  • Matteo Guarrera, UC Berkeley
  • Mostafa Elhoushi, Meta AI
  • Chunxing Yin, Meta AI
  • Syed Shakib Sarwar, Meta AI
  • Kannan Ramchandran, UC Berkeley
  • Alberto Sangiovanni-Vincentelli, UC Berkeley

Acknowledgements

This project is in part based upon work sponsored by Meta AI.

Updates

[to be added] Closing Report.