Creating effective visualizations is an important part of data analytics. While many libraries exist for creating visualizations, writing such code remains difficult given the myriad parameters that users need to provide. In this project, we propose the new task of synthesizing visualization programs from a combination of natural language utterances and code context. To tackle the learning problem, we introduce PlotCoder, a new hierarchical encoder-decoder architecture that models both the code context and the input utterance. PlotCoder first determines the template of the visualization code, then predicts the data to be plotted. We train PlotCoder on Jupyter notebooks containing visualization programs crawled from GitHub, and our experiments show that it correctly predicts about 35% of samples on hard data splits, outperforming our baselines by 3-4.5%.
Researchers
- Xinyun Chen, UC Berkeley, https://jungyhuk.github.io/
- Dawn Song, UC Berkeley, https://people.eecs.berkeley.edu/~dawnsong/
- Rishabh Singh, Google Brain, https://rishabhmit.bitbucket.io/
Current work also involves collaboration with Linyuan Gong (UC Berkeley) and Prof. Alvin Cheung (UC Berkeley).
Overview
In this project, we study neural program synthesis from a combination of program specifications in different formats. Currently, we are working on synthesizing visualization code in Python Jupyter notebooks from a combination of natural language utterances and the programmatic context in which the visualization program will reside (i.e., the preceding code cells in the notebook), focusing on programs that create static visualizations (e.g., line charts, scatter plots, etc.). In the figure below, we provide an example to illustrate our setup. While there has been prior work on synthesizing code from natural language, sometimes with additional information such as database schemas or input-output examples, synthesizing general-purpose code from natural language remains highly difficult due to the ambiguity of the natural language input and the complexity of the target programs.
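As a concrete illustration, here is a minimal, hypothetical instance of the setup (the data, column names, and utterance below are invented for illustration and do not come from our dataset):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen so the example runs headless
import matplotlib.pyplot as plt

# --- Code context: the preceding cells of the notebook ---
df = pd.DataFrame({"hours": [1, 2, 3, 4, 5],
                   "grade": [52, 60, 71, 80, 88]})

# --- Natural language utterance (e.g., taken from a markdown cell) ---
# "Make a scatter plot of grade against hours studied."

# --- Target visualization code to be synthesized ---
plt.scatter(df["hours"], df["grade"])
plt.xlabel("hours")
plt.ylabel("grade")
```

The synthesizer must both pick the right plotting command (`plt.scatter`) and point to the right data (`df["hours"]`, `df["grade"]`) defined earlier in the notebook.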
To improve visualization code synthesis with complex and ambiguous specifications, we design a hierarchical deep neural network code generation model called PlotCoder that decomposes synthesis into two subtasks: first generating the plotting command, then the parameters to pass to that command. Specifically, PlotCoder employs a pointer network architecture, which allows the model to directly select code tokens from the preceding code cells in the same notebook as the data used for plotting. In addition, inspired by the schema linking techniques proposed for semantic parsing with structured inputs, e.g., text-to-SQL tasks, we design an encoder architecture in PlotCoder that connects the embeddings of the natural language descriptions with their corresponding code fragments in preceding code cells within each notebook. Although the constructed links can be noisy because the code context is less structured than the database tables in text-to-SQL problems, we observe that our approach yields substantial performance gains.
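The linking idea can be illustrated with a toy sketch. In PlotCoder the links feed into a learned encoder, but the seed signal is surface-form matching between utterance tokens and code-context tokens, roughly as follows (the function and example tokens here are illustrative, not our actual implementation):

```python
def link_tokens(utterance_tokens, code_tokens):
    """Pair up utterance tokens with code-context tokens that match
    case-insensitively (quotes stripped), e.g. a dataframe column name
    that the natural language description mentions by name."""
    code_norm = [t.strip("'\"").lower() for t in code_tokens]
    return [(i, j)
            for i, u in enumerate(utterance_tokens)
            for j, c in enumerate(code_norm)
            if u.lower() == c]

utterance = ["plot", "price", "against", "date"]
context = ["df", "[", "'price'", "]", ";", "df", "[", "'date'", "]"]
print(link_tokens(utterance, context))  # → [(1, 2), (3, 7)]
```

Links like `(1, 2)` tell the encoder that the utterance token `price` likely refers to the string literal `'price'` in the code context, which in turn helps the pointer decoder select the right data tokens.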
We evaluate PlotCoder's ability to synthesize visualization programs using Jupyter notebooks of homework assignments and exam solutions. On the gold test set, where the notebooks are official solutions, our best model precisely predicts both the plot types and the plotted data for over 50% of the samples. On the noisier test splits with notebooks written by students, which may include work-in-progress code, our model still achieves around 35% accuracy for generating the entire code, showing how PlotCoder's design decisions improve prediction accuracy.
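To make the evaluation criterion concrete, the sketch below shows one way to check a prediction on both parts, plot type and plotted data. This is a simplified, regex-based approximation; the matching used in our actual evaluation handles more plotting APIs and argument forms:

```python
import re

def plot_type(code):
    """Extract the plotting command name, e.g. 'scatter' from 'plt.scatter(...)'."""
    m = re.search(r"\bplt\.(\w+)\s*\(", code)
    return m.group(1) if m else None

def plotted_data(code):
    """Extract the argument expressions passed to the plotting call."""
    m = re.search(r"\bplt\.\w+\s*\((.*?)\)", code)
    return [a.strip() for a in m.group(1).split(",")] if m else []

def exact_match(pred, gold):
    """A prediction counts as fully correct only when both the plot
    type and the plotted data agree with the gold program."""
    return (plot_type(pred) == plot_type(gold)
            and plotted_data(pred) == plotted_data(gold))

print(exact_match("plt.scatter(df.x, df.y)", "plt.scatter(df.x, df.y)"))  # True
print(exact_match("plt.plot(df.x, df.y)", "plt.scatter(df.x, df.y)"))     # False
```

Under this two-part criterion, a model can receive partial credit analysis: it may predict the correct plot type (e.g., `scatter`) while pointing to the wrong data, and the gold-set versus student-written splits measure how both parts degrade with noisier context.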