Addressing Challenges in Large-scale Distributed AI Systems

Training neural network (NN) models is becoming increasingly expensive, requiring scaling to thousands of processes. The problem is compounded by the exponential growth of training data, especially in light of recent unsupervised learning methods, and this has made it difficult to apply NN models to large-scale problems. Importantly, the bottleneck is largely not a lack of computing power in data centers, but a lack of scalable algorithms that enable large-scale training without loss of accuracy.
Addressing this challenge requires scalable frameworks that can optimally exploit the available computational resources, as well as algorithmic innovations that need minimal hyper-parameter tuning. This project pursues a multipronged approach to efficient training, including scalable frameworks and tools for distributed-memory training with optimal communication-computation trade-offs, efficient and accurate inference through systematic pruning and quantization, and scalable Neural Architecture Search.

Update

August 28, 2021


Overview

Deep Neural Networks (DNNs) have proven to be very effective in diverse applications ranging from semantic segmentation and detection in computer vision to scientific applications such as astronomy, climate science, and medical image analysis. In these and many other applications of machine learning (ML) and artificial intelligence (AI), finding the right DNN architecture for a particular application and then training a high-quality model require extensive hyper-parameter tuning and architecture search, often on very large datasets. The delays associated with training DNNs are often the main bottleneck in the design process, and this bottleneck limits the usefulness of DNNs in many applications.

This prohibitive training time for state-of-the-art AI/ML models has impeded research, especially on larger datasets. So far, there has been successful work on optimizing DNN kernels for single-node execution. However, the memory available on a single device is limited, and in most cases the hardware limits of a single node have already been reached. The next milestone in accelerating training is distributed computing, which can open new horizons.

Designing scalable methods to tackle this challenge has been a major focus of our research group over the past years. We were the first group to publish results on scaling ImageNet training to 128 GPUs of the Titan supercomputer with our work on FireCaffe [12]. Since then, we have published many papers on this topic [1, 2, 3, 4, 5], tackling different challenges with scaling, most importantly the loss of accuracy. Notable works include LARS/LAMB [6, 7], which have become industry standards and helped reduce ImageNet/BERT training times to seconds/minutes; a sketch of the layer-wise scaling idea behind LARS is given below.
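As a rough illustration (not the published implementation), LARS rescales each layer's update by a trust ratio formed from the layer's weight and gradient norms, which is what makes the method far less sensitive to the global learning rate at large batch sizes. The hyper-parameter names and defaults below are illustrative only:

    import torch

    def lars_step(params, lr=0.1, weight_decay=1e-4, trust_coef=0.001):
        # One simplified LARS-style update (no momentum). Each layer's step is
        # rescaled by a layer-wise trust ratio ||w|| / ||g + wd * w||, so a single
        # global learning rate works across layers with very different scales.
        with torch.no_grad():
            for w in params:
                if w.grad is None:
                    continue
                g = w.grad + weight_decay * w
                w_norm, g_norm = w.norm(), g.norm()
                trust_ratio = trust_coef * w_norm / g_norm if w_norm > 0 and g_norm > 0 else 1.0
                w -= lr * trust_ratio * g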


Figure 1: Illustration of data-parallel training in our in-house FireCaffe framework. FireCaffe was the first framework to scale ImageNet training to 128 GPUs of the Titan supercomputer [12].


Figure 2: Illustration of model and data parallelism for training NN models. In our prior work on integrated parallelism, we showed how an optimal partitioning with optimal communication time can be found for training a model on distributed-memory processes [4].
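To give a flavor of the trade-off such a partitioning has to resolve, the toy comparison below estimates the per-layer communication volume of data versus model parallelism for a fully-connected layer. This is only a back-of-the-envelope illustration under simplifying assumptions (ring collectives, no overlap, no latency terms), not the cost model used in [4]:

    def cheaper_parallelism(batch_size, d_in, d_out, num_procs):
        # Words communicated per training step for one fully-connected layer.
        # Data parallelism all-reduces the weight gradient (d_in * d_out words);
        # model parallelism exchanges activations and their gradients
        # (batch_size * d_out words) when the layer is split across processes.
        ring_factor = 2.0 * (num_procs - 1) / num_procs  # cost factor of ring collectives
        data_parallel_cost = ring_factor * d_in * d_out
        model_parallel_cost = ring_factor * batch_size * d_out
        return "data" if data_parallel_cost <= model_parallel_cost else "model"

    # Example: a 4096x4096 layer with batch size 256 on 64 processes favors model
    # parallelism, since the weight gradient is much larger than the activations.
    # print(cheaper_parallelism(256, 4096, 4096, 64))  # -> "model"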

However, major challenges remain. First, the size of recent NN models is increasing exponentially, especially for Natural Language Processing and recommendation systems. As such, one has to combine model and data parallelism to efficiently utilize a distributed set of processes. Model parallelism includes both partitioning individual layers and pipeline parallelism. A major challenge with these approaches is the added communication cost of all-to-all and all-reduce collectives, which limits the scalability of the framework. A potential solution is reduced-precision communication, as well as variants of asynchronous training. However, a naive implementation of these approaches leads to suboptimal accuracy as compared to synchronous SGD with full-precision communication. We are designing a systematic framework to obtain optimal trade-offs between training time and accuracy through second-order-based low-precision communication. Moreover, we are working on a framework that supports optimal execution of training on heterogeneous systems, building upon our prior work on Integrated Model and Data Parallelism [4]. This requires both determining the optimal partitioning of the model and systematic prefetching to reduce communication.
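As a minimal sketch of what reduced-precision gradient communication looks like in practice (assuming a model wrapped in PyTorch's DistributedDataParallel; this is not our production framework, and error feedback or second-order-aware compression would be layered on top), a DDP communication hook can all-reduce gradients in half precision and cast the result back:

    import torch
    import torch.distributed as dist

    def fp16_allreduce_hook(state, bucket):
        # DDP communication hook: all-reduce gradients in fp16, then cast back to
        # full precision. Similar in spirit to PyTorch's built-in fp16_compress_hook.
        world_size = dist.get_world_size()
        compressed = bucket.buffer().to(torch.float16).div_(world_size)
        fut = dist.all_reduce(compressed, async_op=True).get_future()

        def decompress(fut):
            buf = bucket.buffer()
            buf.copy_(fut.value()[0])  # copy back into the bucket's fp32 buffer
            return buf

        return fut.then(decompress)

    # Usage, assuming `ddp_model` is a torch.nn.parallel.DistributedDataParallel model:
    # ddp_model.register_comm_hook(state=None, hook=fp16_allreduce_hook)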

 

Figure 3: Illustration of the gradient and Hessian operator for an NN model. Most existing optimization frameworks only use first-order information obtained from the gradient. However, using Hessian information can enable more robust training with less hyper-parameter tuning. Our work on AdaHessian shows how such second-order information can be efficiently utilized to obtain higher accuracy with less hyper-parameter tuning [14].
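The key primitive that makes this practical is the Hessian-vector product, which can be computed by differentiating the gradient a second time; combined with Hutchinson's randomized estimator, it yields the Hessian diagonal (or trace) without ever forming the matrix. The following is a minimal PyTorch sketch in the spirit of PyHessian/AdaHessian [8, 14], not the released code:

    import torch

    def hutchinson_hessian_diag(loss, params, num_samples=1):
        # Estimate diag(H) via Hutchinson's method: E[z * (Hz)] with Rademacher z.
        # The Hessian-vector product Hz is obtained by differentiating the gradient.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        diag_est = [torch.zeros_like(p) for p in params]
        for _ in range(num_samples):
            zs = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]  # +/-1 entries
            hvps = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
            for d, z, hvp in zip(diag_est, zs, hvps):
                d += z * hvp / num_samples
        return diag_est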

Second, the proposed methods for large-scale training (including our prior work on LARS/LAMB) are sensitive to the choice of hyper-parameters, including the learning rate. Using a slightly non-optimal learning rate can easily lead to divergence of the training. This has been a minor problem for hardware vendors, who can afford extensive tuning to show the best possible training time on a new platform. However, it poses major issues for practitioners, for whom massive hyper-parameter tuning is simply not feasible. In large part, this situation is due to the first-order Stochastic Gradient Descent (SGD) methods that are widely used for training DNNs. Despite SGD's well-known benefits, vanilla SGD tends to perform poorly, and thus one introduces many (essentially ad-hoc) knobs and hyper-parameters to make it work. These hyper-parameters become significantly more sensitive when training at large scale with SGD, and this has impeded the effective use of supercomputing systems. A very promising direction is the class of stochastic Newton's methods, which are known to have superior properties compared to SGD. Our goal is to extend recent results from our group [2, 3, 8, 14] on second-order robust optimization methods and develop a scalable framework for large-scale training. This will impact a wide range of applications, especially autonomous driving, where training time is an important bottleneck. The target problems we will consider include large-scale image classification, object detection, segmentation, and transformer-based models.
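To make the connection to the Hessian sketch above concrete, the update below divides the gradient by an estimate of the absolute Hessian diagonal, so flat directions take large steps and sharp directions take small ones, which is what reduces the sensitivity to the global learning rate. It is an illustrative simplification that omits the momentum terms and spatial averaging used in AdaHessian [14]:

    import torch

    def preconditioned_step(params, hess_diag, lr=0.01, eps=1e-8):
        # Adam-style step with the absolute Hessian diagonal in place of the
        # squared-gradient statistics: sharp directions (large |H_ii|) are damped,
        # flat directions are amplified. `hess_diag` can come from a Hutchinson
        # estimator such as the one sketched in the previous code block.
        with torch.no_grad():
            for w, d in zip(params, hess_diag):
                if w.grad is None:
                    continue
                w -= lr * w.grad / (d.abs() + eps)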

 

Figure 4: Illustration of Hessian AWare Quantization (HAWQ), our advanced framework for NN quantization. HAWQ finds an efficient low-precision configuration by finding the optimal trade-off between a parameter's second-order sensitivity and the speed-up gains achieved through quantization [9, 10, 11, 15].

Third, we need methods to enable efficient deployment of large AI models with real-time latency. Deploying state-of-the-art NN models is often not possible due to application-specific thresholds on latency and power consumption, and the prohibitive memory footprint of some of the new NN models, especially for Natural Language Processing. Making inference efficient is important since more than 90% of the AI workload is spent on inference [13]. Several approaches have been proposed to enable efficient inference of NN models, including quantization, pruning, and model distillation. However, these methods often rely on ad-hoc heuristics with brute-force parameter tuning, which may work for a subset of problems but fail for slightly different NN models/tasks. The heuristic nature of these approaches, and the fact that the performance of existing solutions varies significantly, has created a frustrating barrier for practitioners. We are working to address this and develop a comprehensive framework for efficient hardware-aware NN design and deployment. This will build upon our earlier work on quantization, in which we performed a systematic study and showed an important theoretical correlation between the Hessian of the loss landscape and the quantizability of NN models.
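As a toy illustration of how such a Hessian-based sensitivity measure can drive mixed-precision decisions (a deliberately simplified heuristic, not the constrained optimization used in HAWQ-style frameworks [9, 11, 15]), layers can be ranked by their average Hessian trace times parameter count, and the least sensitive ones assigned the lowest bit-width:

    def assign_bitwidths(layer_stats, high_bits=8, low_bits=4):
        # layer_stats: list of (layer_name, avg_hessian_trace, num_params) tuples.
        # Sensitivity score: average Hessian trace times parameter count, so large,
        # sharp layers are kept at higher precision.
        ranked = sorted(layer_stats, key=lambda s: s[1] * s[2], reverse=True)
        cutoff = len(ranked) // 2  # keep the most sensitive half at high precision
        return {name: (high_bits if i < cutoff else low_bits)
                for i, (name, _, _) in enumerate(ranked)}

    # Example with hypothetical per-layer statistics:
    # assign_bitwidths([("conv1", 0.9, 1e4), ("fc", 0.01, 4e6), ("conv2", 0.2, 1e5)])
    # -> {"fc": 8, "conv2": 4, "conv1": 4}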

References

[1] Jin, Peter H and Yuan, Qiaochu and Iandola, Forrest and Keutzer, Kurt. How to scale distributed deep learning? arXiv:1611.04581

[2] Yao, Zhewei and Gholami, Amir and Lei, Qi and Keutzer, Kurt and Mahoney, Michael W. Hessian-based analysis of large batch training and robustness to adversaries. NeurIPS’18

[3] Yao, Zhewei and Gholami, Amir and Keutzer, Kurt and Mahoney, Michael. Large Batch Size Training of Neural Networks with Adversarial Training and Second-Order Information. arXiv:1810.01021

[4] Gholami, Amir and Azad, Ariful and Jin, Peter and Keutzer, Kurt and Buluc, Aydin. Integrated model, batch, and domain parallelism in training neural networks. SPAA’18

[5] Jain, Paras and Jain, Ajay and Nrusimha, Aniruddha and Gholami, Amir and Abbeel, Pieter and Keutzer, Kurt and Stoica, Ion and Gonzalez, Joseph E. Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization. MLSys’20

[6] You, Yang and Gitman, Igor and Ginsburg, Boris. Large batch training of convolutional networks. SC’19

[7] You, Yang and Li, Jing and Reddi, Sashank and Hseu, Jonathan and Kumar, Sanjiv and Bhojanapalli, Srinadh and Song, Xiaodan and Demmel, James and Hsieh, Cho-Jui. Large batch optimization for deep learning: Training bert in 76 minutes. ICLR’20

[8] Yao, Zhewei and Gholami, Amir and Keutzer, Kurt and Mahoney, Michael. PyHessian: Neural Networks Through the Lens of the Hessian. arXiv:1912.07145

[9] Dong, Zhen and Yao, Zhewei and Cai, Yaohui and Arfeen, Daiyaan and Gholami, Amir and Mahoney, Michael W and Keutzer, Kurt. HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks. NeurIPS’19 Workshop

[10] Shen, Sheng and Dong, Zhen and Ye, Jiayu and Ma, Linjian and Yao, Zhewei and Gholami, Amir and Mahoney, Michael W and Keutzer, Kurt. Q-BERT: Hessian based ultra low precision quantization of BERT. AAAI'20

[11] Dong, Zhen and Yao, Zhewei and Gholami, Amir and Mahoney, Michael W and Keutzer, Kurt. HAWQ: Hessian aware quantization of neural networks with mixed-precision. ICCV'19

[12] Iandola, Forrest N and Moskewicz, Matthew W and Ashraf, Khalid and Keutzer, Kurt. FireCaffe: near-linear acceleration of deep neural network training on compute clusters. CVPR'16

[13] Huang, R. Accelerating the pace of AWS Inferentia chip development, from concept to end customers' use. ESWEEK 2020, HENP Workshop.

[14] Yao, Zhewei and Gholami, Amir and Shen, Sheng and Keutzer, Kurt and Mahoney, Michael W. ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning. arXiv:2006.00719, 2020.

[15] Yao Z, Dong Z, Zheng Z, Gholami A, Yu J, Tan E, Wang L, Huang Q, Wang Y, Mahoney MW, Keutzer K. HAWQV3: Dyadic Neural Network Quantization. arXiv:2011.10680, 2020.