We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear “reservoir” layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as in overall performance, on various machine translation benchmarks.
As larger models trained on larger datasets have often shown better performance [Li2020], the parameter counts of NLP models keep growing over time. For instance, RoBERTa [Liu2019, Strubell2019] was trained using a thousand GPUs, and training the final GPT-3 model cost nearly five million dollars [Brown2020]. To curb these costs, in this project we revisit an old idea in deep learning, echo state networks [Jaeger2001], and add fixed layers with random features to Transformers. These layers increase the depth of the model, and consequently its representational power, but are much more computationally efficient.
Our approach is based on a very simple idea. Neural networks are trained via backpropagation, which involves consecutive steps of matrix addition and multiplication, i.e.,

θ ← θ − η ∂J/∂θ,    ∂J/∂θ_i = (∂J/∂L_N)(∂L_N/∂L_{N−1})⋯(∂L_{i+1}/∂L_i)(∂L_i/∂θ_i),

for some objective J, parameterization θ and learning rate η, with the gradient computed via the chain rule, where L_i is the i-th layer of the neural network and L_0 = x is the input. Let L = Transformer(X) be a single layer in a Transformer network, i.e.,

Z = LayerNorm(X + MultiHeadSelfAttn(X)),
Transformer(X) = LayerNorm(Z + FFN(Z)).
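As a concrete sketch, one such post-norm Transformer layer (a self-attention sublayer and a feed-forward sublayer, each wrapped in a residual connection and layer norm) might look as follows in PyTorch; the class name and dimensions are illustrative, not the project's actual code:

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """One post-norm Transformer encoder layer:
    Z = LayerNorm(X + MultiHeadSelfAttn(X));  out = LayerNorm(Z + FFN(Z))."""

    def __init__(self, d_model=512, n_heads=8, d_ffn=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention sublayer with residual connection and layer norm.
        attn_out, _ = self.attn(x, x, x)
        z = self.norm1(x + attn_out)
        # Feed-forward sublayer with residual connection and layer norm.
        return self.norm2(z + self.ffn(z))
```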
Now, during every “backward pass”, we compute the Jacobian for the parameters θ_L at layer L, which is used both to update θ_L and to compute the next layer’s Jacobian, thus back-propagating the gradients. In this project, however, for some of the layers we still backpropagate through them to compute gradients for earlier layers, but we never update their parameters. As a result, these layers stay fixed at their random initialization, saving computational resources.
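In PyTorch terms, this amounts to disabling parameter gradients for the reservoir layers while leaving the rest of the graph intact. A minimal sketch, assuming the model exposes its stack as a `model.layers` attribute (a hypothetical name; the helper function is ours, not fairseq API):

```python
import torch.nn as nn

def freeze_reservoir_layers(model: nn.Module, reservoir_indices: set) -> None:
    """Fix the chosen layers at their random initialization.

    Autograd still backpropagates *through* frozen layers (their
    Jacobians with respect to the input are computed as usual), but no
    parameter gradients are accumulated for them, so the optimizer
    never touches their weights.

    Assumes the model exposes its stack as ``model.layers``
    (hypothetical attribute for illustration).
    """
    for i, layer in enumerate(model.layers):
        if i in reservoir_indices:
            for p in layer.parameters():
                p.requires_grad_(False)

# Hand only the trainable parameters to the optimizer, e.g.:
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=5e-4)
```

Note that gradients for earlier (trainable) layers are unaffected: the frozen layer's input Jacobian is still part of the chain rule.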
In an early trial of this project, we train reservoir transformers with various types of echo-state layers on IWSLT translation tasks. We run this experiment on a single V100 GPU and set all hyper-parameters to the fairseq defaults.
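For concreteness, the baseline setup roughly follows the standard fairseq IWSLT'14 De-En recipe; this is a hedged configuration sketch, where the data path is illustrative and the reservoir-specific wiring lives in model code rather than in command-line flags:

```shell
# Standard fairseq IWSLT'14 De-En training recipe (illustrative data path).
fairseq-train data-bin/iwslt14.tokenized.de-en \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096
```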
The table below presents the wall-clock time (averaged over multiple runs) saved on IWSLT for different model types and encoder depths. We vary the number of encoder layers while keeping the decoder depth fixed at 2; the ratio is computed against a regular transformer with a comparable number of layers. The table reports the convergence time to the maximum validation BLEU score relative to the regular transformer, demonstrating that reservoir transformers consistently converge faster in wall-clock time, saving up to 22% with the same number of updateable layers. As a point of reference, a half-hour gain on IWSLT translates into a gain of several days when training bigger transformer models like GPT-3 [Brown2020].
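The reported ratio can be computed from time-to-best-validation-BLEU measurements; a small sketch (the function name and the example numbers are illustrative, not the experiment's actual results):

```python
def convergence_speedup(baseline_hours: float, reservoir_hours: float) -> float:
    """Fraction of wall-clock time saved relative to the baseline
    transformer, where both times are measured as the wall-clock time
    to reach the maximum validation BLEU score."""
    return 1.0 - reservoir_hours / baseline_hours

# e.g. a run that converges in 3.9h against a 5h baseline
# gives convergence_speedup(5.0, 3.9) ≈ 0.22, i.e. a 22% saving.
```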
We plan to extend this idea with a synthetic gradient predictor that allows skipping the backward pass entirely, and to conduct a detailed probing analysis of its performance.
[Liu2019] Liu, Yinhan, et al. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692 (2019).
[Brown2020] Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
[Li2020] Li, Zhuohan, et al. "Train large, then compress: Rethinking model size for efficient training and inference of transformers." in ICML (2020).
[Wieting2019] Wieting, John, and Douwe Kiela. "No training required: Exploring random encoders for sentence classification." in ICLR (2019).
[Jaeger2001] Jaeger, Herbert. "The “echo state” approach to analysing and training recurrent neural networks-with an erratum note." in GMD Technical Report (2001).
[Strubell2019] Emma Strubell, Ananya Ganesh and Andrew McCallum. “Energy and Policy Considerations for Deep Learning in NLP.” in ACL (2019).