Towards a Unified Understanding of Privacy and Generalization for Better Algorithm Design

Abstract

Machine learning and deep learning have emerged as important technologies that enable a wide range of applications, including computer vision, natural language processing, healthcare, and recommendation systems. However, to deploy machine learning algorithms responsibly in society, it is critical to design them to conform to ethical values such as privacy, safety, and fairness. For instance, researchers have found that information about training data can be extracted from a released machine learning model, raising serious privacy concerns, and that adversarial attacks or even simple distribution shifts (such as subpopulation shifts) can cause large drops in model performance. In this project, we aim to deepen the community's understanding of the fundamental connections between privacy, robustness, and generalization, and to leverage these connections to develop new techniques.

(Figure from [CYS2021])

Researchers

  • BAIR Faculty Member: Yi Ma, UC Berkeley
  • BAIR PhD student/postdoc: Yaodong Yu, UC Berkeley
  • Meta AI Researcher: Chuan Guo, Fundamental AI Research (FAIR) team at Meta

Overview

Prior work has mainly studied privacy, robustness, and generalization in isolation, but the three topics are in fact closely related. For example, [TMJ+2022] found that membership inference vulnerability increases with the number of parameters in linear models, whereas [LTL+2022] observed that larger differentially private NLP models can achieve state-of-the-art performance; [KYY+2022] identified interesting connections between differential privacy and distributional generalization (with applications to subpopulation shifts).

Specifically, we are interested in investigating the following directions:

1. Understand the effect of overparameterization/model size, model architecture, and training protocol on empirical privacy by conducting well-designed experiments, which will give us a concrete empirical picture of how each of these factors influences privacy leakage in practice (a minimal membership inference sketch appears after this list).

2. Develop a theoretical framework that characterizes the tradeoff between privacy and generalization, and propose new algorithms to improve privacy guarantees in the overparameterized regime. Specifically, we will investigate overparameterized linear models built on empirical neural tangent kernel (eNTK) representations, evaluated on real-world datasets (see the eNTK sketch below).

3. Inspired by these empirical and theoretical privacy measurements, propose new algorithms to predict out-of-distribution generalization performance and to improve distributional generalization. Specifically, we will focus on real-world distribution shifts [KSM+2021].
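
To make the notion of empirical privacy in direction 1 concrete, here is a minimal sketch of a loss-threshold membership inference attack, in the spirit of the attacks analyzed in [TMJ+2022]. The synthetic dataset, the logistic regression target model, and the median-based threshold are illustrative assumptions, not our actual experimental setup:

```python
# Minimal loss-threshold membership inference attack (illustrative sketch).
# Assumptions: synthetic data, a logistic regression "target" model, and a
# median-based threshold; real experiments would use the models under study.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Split a synthetic dataset into "members" (training data) and "non-members".
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_mem, y_mem, X_non, y_non = X[:1000], y[:1000], X[1000:], y[1000:]

# Train the target model only on the member set.
model = LogisticRegression(max_iter=1000).fit(X_mem, y_mem)

def per_example_loss(model, X, y):
    """Cross-entropy loss of each example under the target model."""
    probs = model.predict_proba(X)[np.arange(len(y)), y]
    return -np.log(np.clip(probs, 1e-12, None))

loss_mem = per_example_loss(model, X_mem, y_mem)
loss_non = per_example_loss(model, X_non, y_non)

# Attack rule: predict "member" when the loss falls below a threshold
# (here simply the median of all observed losses).
threshold = np.median(np.concatenate([loss_mem, loss_non]))
attack_acc = 0.5 * ((loss_mem < threshold).mean() + (loss_non >= threshold).mean())
print(f"membership inference accuracy: {attack_acc:.3f} (0.5 = chance)")
```

The gap between the attack accuracy and chance (0.5) serves as one empirical privacy measurement; direction 1 asks how this gap changes as we vary model size, architecture, and training protocol.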

Overall, we will start with empirical measurements and simple linear models, and then extend these to more complex scenarios.
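
As a concrete starting point for direction 2, the sketch below fits an overparameterized linear (ridge) classifier on eNTK features, i.e., per-example gradients of a randomly initialized one-hidden-layer ReLU network. The network width, the synthetic dataset, and the ridge penalty are illustrative assumptions:

```python
# Overparameterized linear model on eNTK features (illustrative sketch).
# Assumptions: a one-hidden-layer ReLU network at random initialization,
# synthetic data, and a ridge classifier; real experiments would extract
# eNTK features from the actual networks and datasets under study.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d, width = 20, 64

# Network f(x) = v^T relu(W x) at random initialization; the eNTK feature of
# an example x is the gradient of f(x) with respect to the parameters (W, v).
W = rng.normal(size=(width, d)) / np.sqrt(d)
v = rng.normal(size=width) / np.sqrt(width)

def entk_features(X):
    pre = X @ W.T                              # pre-activations, shape (n, width)
    act = np.maximum(pre, 0.0)                 # relu(W x) = gradient w.r.t. v
    gate = (pre > 0).astype(X.dtype)           # relu'(W x)
    grad_W = (gate * v)[:, :, None] * X[:, None, :]  # gradient w.r.t. W
    return np.concatenate([act, grad_W.reshape(len(X), -1)], axis=1)

X, y = make_classification(n_samples=1000, n_features=d, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Feature dimension = width + width * d = 1344 > 700 training examples,
# so the linear model operates in the overparameterized regime.
clf = RidgeClassifier(alpha=1.0).fit(entk_features(X_tr), y_tr)
print("eNTK linear model test accuracy:", clf.score(entk_features(X_te), y_te))
```

Privacy mechanisms (for example, noisy gradient descent as analyzed in [CYS2021]) would then be applied on top of this linear problem, which is where we expect the privacy-generalization tradeoff to be most tractable to characterize.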

References

[TMJ+2022] Jasper Tan, Blake Mason, Hamid Javadi, Richard G. Baraniuk. Parameters or Privacy: A Provable Tradeoff Between Overparameterization and Membership Inference. https://arxiv.org/abs/2202.01243

[LTL+2022] Xuechen Li, Florian Tramèr, Percy Liang, Tatsunori Hashimoto. Large Language Models Can Be Strong Differentially Private Learners. ICLR 2022.

[KYY+2022] Bogdan Kulynych, Yao-Yuan Yang, Yaodong Yu, Jarosław Błasiok, Preetum Nakkiran. What You See is What You Get: Distributional Generalization for Algorithm Design in Deep Learning. https://arxiv.org/abs/2204.03230

[CYS2021] Rishav Chourasia, Jiayuan Ye, Reza Shokri. Differential Privacy Dynamics of Langevin Diffusion and Noisy Gradient Descent. NeurIPS 2021.
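
[KSM+2021] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, et al. WILDS: A Benchmark of in-the-Wild Distribution Shifts. ICML 2021.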