Overview
Large neural network models have improved accuracy and generalization in various domains. However, this trend cannot continue indefinitely because hardware memory is limited. As a result, researchers have devised a number of memory-saving techniques to alleviate the memory bottleneck, such as checkpointing, quantization, and swapping.
In this project, we first conduct a case study on the Bert model to assess how effective such memory-saving solutions are in practice.
Surprisingly, we find that although these strategies do lower peak memory usage, their associated overhead (e.g., recomputation or communication between CPU and GPU) is too high for training to actually benefit.
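To make the source of this overhead concrete, the sketch below (our own illustration, not the project's code) uses PyTorch's torch.utils.checkpoint API: activations inside a checkpointed segment are discarded during the forward pass and recomputed during the backward pass, which is exactly the recomputation cost described above. The layer sizes and the use_reentrant flag (available in recent PyTorch versions) are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A toy segment of a model; sizes are arbitrary, for illustration only.
block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Without checkpointing: intermediate activations are kept for backward,
# costing memory but no extra compute.
loss = block(x).sum()
loss.backward()

# With checkpointing: activations are freed after the forward pass and the
# block is re-executed during backward, adding roughly one extra forward
# pass worth of compute for this segment in exchange for lower peak memory.
x.grad = None
loss = checkpoint(block, x, use_reentrant=False).sum()
loss.backward()
```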
As shown in the figure above, throughput increases with batch size for all training configurations, although to varying degrees. Every memory-saving approach eventually reaches a point where throughput improves only slowly, whereas the original training runs out of device memory before reaching that plateau. Compared to the original training, all memory-saving approaches also increase the maximum batch size. This looks promising at first, since most memory-saving methods are evaluated by the maximum batch size they enable. Unfortunately, if the goal is efficient model training, throughput should be the primary criterion, and by that measure none of these memory-saving methods actually speed up training.
To explain why this is the case, we devise an intuitive performance model that quantitatively captures the trade-off between memory and training time. We then show how this performance model can determine when to apply the various memory optimization strategies when training different models. Evaluating the performance model on Bert, Swin Transformer, and GPT-3 demonstrates that it can accurately estimate the effectiveness of multiple memory-saving strategies.
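As a rough illustration of the kind of trade-off such a performance model captures (this is our simplified sketch, not the model developed in the project), one can estimate throughput as batch size divided by iteration time, where iteration time adds the extra compute from recomputation and the CPU-GPU transfer time from swapping to the baseline forward and backward time. All timings, bandwidths, and the assumption that transfers do not overlap with compute are illustrative.

```python
def throughput(batch_size, fwd_time, bwd_time,
               recompute_fraction=0.0, swap_bytes=0.0, pcie_bw=16e9):
    """Estimated samples/second for one training iteration.

    recompute_fraction: fraction of the forward pass re-executed during the
        backward pass (checkpointing overhead).
    swap_bytes: bytes moved between CPU and GPU per iteration (swapping
        overhead), converted to time via the PCIe bandwidth.
    """
    compute = fwd_time + bwd_time + recompute_fraction * fwd_time
    transfer = swap_bytes / pcie_bw
    # Pessimistic assumption: transfers are not overlapped with compute.
    return batch_size / (compute + transfer)

# Illustrative numbers: doubling the batch size via checkpointing only pays
# off if the throughput gain outweighs the recomputation cost. Here it does
# not, mirroring the finding above.
base = throughput(batch_size=32, fwd_time=0.10, bwd_time=0.20)
ckpt = throughput(batch_size=64, fwd_time=0.20, bwd_time=0.40,
                  recompute_fraction=1.0)
print(f"baseline: {base:.1f} samples/s, checkpointing: {ckpt:.1f} samples/s")
```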
Researchers
- Prof. Alvin Cheung, UC Berkeley
- Lily Liu, UC Berkeley
- Xiaoyong Liu, Alibaba