Graph Data Augmentation for Computer Systems

Graphs are a common state representation for structured-input problems such as molecule property prediction, code representation learning, and computer systems. Learning algorithms embed these graph structures using graph neural networks (GNNs). However, many domains lack large training datasets because samples are expensive to acquire; for example, Mirhoseini et al. trained chip placement policies on a dataset of only 20 examples due to the cost and complexity of designing new chips. In data-scarce settings, data augmentation is widely used to improve generalization: simple transformations like cropping improve test performance in computer vision. It is challenging to apply data augmentation to graphs, however, because there are few simple analogous transformations. Our key research question is: how can we augment graph data to improve the generalization of machine learning models in data-scarce systems applications?

Researchers

Overview

A significant body of recent literature applies ML to systems problems, and these approaches increasingly leverage GNNs to represent the graph structure in the data. However, such methods require large training datasets or synthetic data. Prior work has explored methods to augment graphs; recent work highlights edge addition and deletion as potential graph augmentations. In contrast, we wish to build upon the rich body of work on generative models for graphs. Critically, prior work is mostly concerned with graph structure rather than semantics; systems problems rely upon the properties of nodes, which affects how augmentation should be performed.
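To make the structural augmentations mentioned above concrete, the sketch below applies random edge deletion, edge addition, and node-feature masking to a toy graph stored as a plain edge list. The function names, probabilities, and the scalar node features are illustrative choices for this sketch, not the project's actual augmentation pipeline; as noted above, systems tasks would likely need domain-aware rules rather than uniform random masking.

```python
import random

def drop_edges(edges, drop_prob=0.1, seed=None):
    """Randomly delete edges; a common structural augmentation."""
    rng = random.Random(seed)
    return [e for e in edges if rng.random() >= drop_prob]

def add_edges(edges, num_nodes, num_new=2, max_tries=100, seed=None):
    """Randomly add edges between existing nodes, skipping self-loops and duplicates."""
    rng = random.Random(seed)
    existing = set(edges)
    added = []
    for _ in range(max_tries):
        if len(added) == num_new:
            break
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u != v and (u, v) not in existing:
            existing.add((u, v))
            added.append((u, v))
    return list(edges) + added

def mask_node_features(features, mask_prob=0.1, seed=None):
    """Zero out node features at random; purely structural masking ignores node semantics."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < mask_prob else f for f in features]

# Toy example: a 4-node cycle with one scalar feature per node.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
features = [0.5, 1.2, -0.3, 0.9]
print(drop_edges(edges, drop_prob=0.25, seed=0))
print(add_edges(edges, num_nodes=4, seed=0))
print(mask_node_features(features, mask_prob=0.25, seed=0))
```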

Earlier work by some of the authors studied the effect of data augmentation and self-supervised learning on the structured domain of programs. The ContraCode algorithm leverages a compiler to automatically generate semantically equivalent but textually divergent training samples. This work demonstrated that BERT-style pre-training was not effective for program inputs, whereas compiler-derived augmentations outperformed supervised learning. Moreover, ContraCode representations significantly improve zero-shot generalization on a code clone detection benchmark.

Figure 1: ContraCode approach to self-supervised code representation learning
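To illustrate the kind of objective used in contrastive pre-training, here is a minimal sketch of an InfoNCE-style loss over two views of the same batch of programs (e.g., the original code and a compiler-transformed but semantically equivalent version). The function name, temperature value, and toy usage are our own illustrative choices; ContraCode's actual encoder and training details are described in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.07):
    """InfoNCE-style loss: embeddings of two views of the same program should be
    similar to each other and dissimilar to other programs in the batch.

    z1, z2: [batch, dim] embeddings of the two augmented views.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # pairwise cosine similarities
    labels = torch.arange(z1.size(0))    # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for encoder outputs.
batch, dim = 8, 128
loss = contrastive_loss(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```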

We are exploring ways to apply insights from ContraCode to new domains in the Open Graph Benchmark. Some of these tasks involve large graphs where data augmentation is challenging. We are also considering benchmark tasks similar to those in the ContraCode work, including computer-aided programming tasks such as code summarization.

Updates

Final project update (August 24, 2021)

Links

Please contact paras_jain@berkeley.edu with questions or for more resources.