Training Sparse High Capacity Models with Implicit Neural Networks and Frank-Wolfe

Diverse applications such as mobile vision and on-device intelligence [Choudhary 20] are driving efforts to reduce the parameter footprint and inference latency of machine learning models, and the need grows as models become larger. In this work, we propose an alternative to the current train-then-compress paradigm: we train sparse high-capacity models from scratch, simultaneously achieving low training cost and high sparsity.


  • Geoffrey Négiar, UC Berkeley
  • Michael Mahoney, UC Berkeley
  • Laurent El Ghaoui, UC Berkeley
  • Brandon Amos, FAIR
  • Aaron Defazio, FAIR


To achieve sparse model training, we combine two threads of research: Implicit Neural Networks [Négiar 17, Askari 18, Gu 18, El Ghaoui 19] and Frank-Wolfe (FW) algorithms for constrained optimization [Pedregosa 18, Négiar 20]. Implicit models are a novel class of machine learning models that bridge deep learning and control theory, using implicit optimization [Amos 17, Agrawal 19] to encompass both. Implicit Neural Networks are a superset of deep learning models in which the computation graph between “neurons” may contain loops. To be well-defined (and to avoid exploding gradients), they are naturally specified with constraints reminiscent of stability conditions in control theory. They reach state-of-the-art accuracy on deep learning benchmark problems, and they have become a topic of interest in the optimization and machine learning communities [Bai 19]. The FW algorithm controls model complexity by initializing with all-zero weights and adding a sparse or low-rank update [Jaggi 13] at each iteration, so the number of iterations bounds the sparsity or rank of the weights. This makes it well suited to the constrained nature of Implicit Models.
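To make the idea of a computation graph with loops concrete, here is a minimal sketch of a forward pass through an implicit layer. It assumes one common formulation, in which the hidden state solves the fixed-point equation x = ReLU(Ax + Bu), and it assumes the sufficient well-posedness condition that every row of A has absolute sum below 1. The function names (`implicit_forward`, `matvec`, `relu`) and the tolerance values are hypothetical and chosen for illustration only; this is not necessarily the exact model or solver used in this project.

```python
def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, vi) for vi in v]


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def implicit_forward(A, B, u, tol=1e-10, max_iters=1000):
    """Solve x = relu(A x + B u) by fixed-point iteration.

    Assumption (illustrative): the infinity norm of A is below 1, which
    makes the map a contraction, so the iteration converges to a unique x.
    """
    inf_norm = max(sum(abs(a) for a in row) for row in A)
    if inf_norm >= 1.0:
        raise ValueError("A violates the assumed well-posedness constraint")

    Bu = matvec(B, u)
    x = [0.0] * len(A)
    for _ in range(max_iters):
        x_new = relu([ax + bu for ax, bu in zip(matvec(A, x), Bu)])
        # Stop once successive iterates agree to within the tolerance.
        if max(abs(p - q) for p, q in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    return x


# Two neurons that feed each other: a loop in the computation graph.
A = [[0.0, 0.5],
     [0.5, 0.0]]
B = [[1.0, 0.0],
     [0.0, 1.0]]
x_star = implicit_forward(A, B, u=[1.0, 1.0])
```

In this toy example the constraint on A is what guarantees that the loop settles to a single state; dropping it (for instance, using off-diagonal entries of 1.5) would make the iteration diverge, which is why the sketch rejects such matrices up front.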

We combine the strengths of our Implicit model family with our expertise in the Frank-Wolfe method to obtain accurate, lightweight models that are ready to deploy at the end of training.
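As an illustration of why Frank-Wolfe yields light models, the sketch below runs the classical algorithm on a small least-squares problem constrained to an l1 ball. This is a toy stand-in chosen for brevity, not the training procedure proposed here: the objective, the function name `frank_wolfe_l1`, the radius `tau`, and the step size 2/(t+2) are all assumptions made for the example. Starting from zero, each iteration touches a single coordinate, so after t iterations the weight vector has at most t nonzero entries.

```python
def frank_wolfe_l1(A, b, tau, n_iters):
    """Minimize 0.5 * ||A x - b||^2 subject to ||x||_1 <= tau.

    Classical Frank-Wolfe with the standard 2 / (t + 2) step size.
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols  # all-zero initialization
    for t in range(n_iters):
        # Residual r = A x - b and gradient g = A^T r.
        r = [sum(A[i][j] * x[j] for j in range(n_cols)) - b[i]
             for i in range(n_rows)]
        g = [sum(A[i][j] * r[i] for i in range(n_rows))
             for j in range(n_cols)]

        # Linear minimization oracle over the l1 ball: the minimizer is a
        # signed, scaled coordinate vector at the largest |gradient| entry.
        k = max(range(n_cols), key=lambda j: abs(g[j]))
        if g[k] == 0.0:
            break  # zero gradient: already optimal
        s_k = -tau if g[k] > 0 else tau

        # Convex combination of the iterate and the one-sparse atom.
        gamma = 2.0 / (t + 2.0)
        x = [(1.0 - gamma) * x_j for x_j in x]
        x[k] += gamma * s_k
    return x


A = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 3.0, 0.0]]
b = [1.0, 2.0, 3.0]

x1 = frank_wolfe_l1(A, b, tau=1.0, n_iters=1)
x3 = frank_wolfe_l1(A, b, tau=1.0, n_iters=3)
nnz3 = sum(1 for x_j in x3 if x_j != 0.0)
```

Because every iterate is a convex combination of points in the l1 ball, the constraint holds throughout without any projection step, and stopping early trades accuracy for sparsity directly.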



  • Jaggi, M. (2013). Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. ICML.
  • Négiar, G., Askari, A., Pedregosa, F. & El Ghaoui, L. (2017). Lifted Neural Networks for Weight Initialization. OPT-ML Workshop at NeurIPS 2017.
  • Amos, B., & Kolter, J.Z. (2017). OptNet: Differentiable Optimization as a Layer in Neural Networks. ICML 2017.
  • Askari, A.*, Négiar, G.*, Sambharya, R., & El Ghaoui, L. (2018). Lifted Neural Networks. ArXiv, abs/1805.01532.
  • Gu, F., Askari, A., & El Ghaoui, L. (2018). Fenchel Lifted Networks: A Lagrange Relaxation of Neural Network Training. AISTATS 2020.
  • Pedregosa, F., Négiar, G., Askari, A., & Jaggi, M. (2018). Linearly Convergent Frank-Wolfe with Backtracking Line-Search. AISTATS 2020.
  • El Ghaoui, L., Gu, F., Travacca, B., & Askari, A. (2019). Implicit Deep Learning. ArXiv, abs/1908.06315.
  • Dong, Z., Yao, Z., Cai, Y., Arfeen, D., Gholami, A., Mahoney, M.W., & Keutzer, K. (2019). HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks. ArXiv, abs/1911.03852.
  • Bai, S., Kolter, J.Z., & Koltun, V. (2019). Deep Equilibrium Models. NeurIPS 2019.
  • Agrawal, A., Amos, B., Barratt, S.T., Boyd, S., Diamond, S., & Kolter, J.Z. (2019). Differentiable Convex Optimization Layers. NeurIPS 2019.
  • Shen, S., Dong, Z., Ye, J., Ma, L., Yao, Z., Gholami, A., Mahoney, M.W., & Keutzer, K. (2019). Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT. AAAI 2020.
  • Négiar, G., Dresdner, G., Tsai, A., El Ghaoui, L., Locatello, F., Freund, R.M. & Pedregosa, F. (2020). Stochastic Frank-Wolfe for Constrained Finite-Sum Minimization. ICML 2020.
  • Choudhary, T., Mishra, V., Goswami, A. & Sarangapani, J. (2020). A comprehensive survey on model compression and acceleration. Artificial Intelligence Review 2020.