Distributed Learning: Privacy and Data Summarization

Machine learning is increasingly used in applications involving sensitive data, such as healthcare and finance. This necessitates approaches that incorporate secure and private use of data. Differential privacy is the main framework for addressing these needs. However, its adoption has been fraught with barriers, especially for distributed data. One reason is that theoretical guarantees often consider extreme cases where the data is fully distributed across agents (one data point per agent). This has led to impractical privacy guarantees: for example, some methods require sample sizes that are larger than the domain size, and companies have used privacy parameters that are meaningless. We believe this is partly because these settings fail to account for the fact that, in many modern distributed applications of machine learning, most agents own reasonably large data sets. Such agents include hospitals and educational institutions, which are tasked with protecting the privacy of their data sets under HIPAA and FERPA, respectively. Our project focuses on these important settings and will contribute distributed private learning algorithms that work well by exploring connections between distributed privacy and data summarization.

In more detail, this project will explore a wide range of multi-agent learning tasks through the lens of privacy. We will consider distributed learning (e.g., where the objective is to learn a model that performs well over the agents) and collaborative learning (e.g., where the objective is to learn a model that performs well for each agent). Our goal is to design algorithms that preserve the privacy of the data with guarantees that significantly outperform those that agents can achieve on their own. We plan to address these challenges by bridging privacy and data summarization. We expect privacy and data summarization to become closely related when agents have sufficiently large data sets. That is, preserving the privacy of the data will approximately translate to creating a synthetic data set that summarizes the data effectively. By exploring these connections further, our project will build synergies between two well-established areas and will contribute to their further progress by providing a unified perspective for the use of ML on important and sensitive data sets.