Executable Semantic Parsing
For the constrained domain of plot generation, Chen et al. (ACL 2021) proposed a model that combines LSTM encoding and attention to predict a single plot command and the relevant arguments. Their focus was on choosing which parts of a dataframe to plot and the type of plot. Interestingly, BERT did not help, probably because the language in their setting was different: the text associated with code notebooks. Complete-program accuracy is around 56%, another ~7% of cases are semantically correct, and a further ~28% are difficult to answer due to missing content in the language or code context (based on manual analysis; no human results are given for the task).
For the style of plots, Shao and Nakashole (ACL 2020) showed that the generation problem can be formulated as slot filling, in the style of task-oriented dialogue, where each slot corresponds to a property of the figure. They also released a demonstration that adds some personalisation by remembering previous slot values, and continual learning by asking users which of a k-best set of options is correct (Wang et al., NAACL 2021).
Most work in NLP treats this task as translation, mapping one string to another, sometimes with grammatical constraints. Xi et al. (TACL 2020) draw on the program synthesis literature to consider how examples can inform the generation process. They use an NLP-style model (either grammar-based or neural), but rather than generating code directly, it generates a sketch of the code. That sketch then constrains a search over the space of programs for one that satisfies both the sketch and a set of example input-output pairs (in their case, for regular expressions). To train without sketch annotations, they search for the best sketch that agrees with the target program. This almost solves KB13 and Turk (96.3% and 98.9% accuracy) when given 20 examples (10 positive, 10 negative), but the Stackoverflow data remains challenging.
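To make the sketch-constrained search concrete, here is a minimal toy sketch (not the authors' system): a "sketch" is a pattern with holes marked `?`, and we enumerate fillings from a small set of character classes until one regex accepts all positive examples and rejects all negative ones. The sketch language and candidate set are invented for illustration.

```python
import itertools
import re

# Hypothetical hole fillers; the real system searches a much richer space.
CANDIDATES = [r"\d", r"[a-z]", r"[A-Z]", r"\w"]

def fill(sketch, fillers):
    """Replace each '?' hole in the sketch, left to right."""
    out = sketch
    for f in fillers:
        out = out.replace("?", f, 1)
    return out

def search(sketch, positives, negatives):
    """Return the first filling whose regex accepts all positives
    and rejects all negatives, or None if no filling works."""
    n_holes = sketch.count("?")
    for fillers in itertools.product(CANDIDATES, repeat=n_holes):
        pattern = fill(sketch, fillers)
        if all(re.fullmatch(pattern, s) for s in positives) and \
           not any(re.fullmatch(pattern, s) for s in negatives):
            return pattern
    return None

print(search("?{3}-?{2}", ["123-ab", "456-cd"], ["abc-12", "12-abc"]))
# → \d{3}-[a-z]{2}
```

The sketch prunes the search dramatically: only the holes are enumerated, while the surrounding structure is fixed by the model's prediction.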
Gupta et al. (EMNLP 2021) proposed a new learning objective that encourages consistency in program outputs (in their case for neural module networks). The idea is to improve performance by encouraging semantically similar spans of text to be converted into the same code. Suitable pairs can be identified in several ways, with the best being either using templates to augment the training data or finding semantically similar spans in existing questions.
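A consistency objective of this flavour can be sketched as follows. This is an illustrative formulation, not the authors' exact loss: for a pair of paraphrased spans, we penalise disagreement between the model's predicted distributions over program operations.

```python
import math

def kl(p, q, eps=1e-9):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(dist_a, dist_b):
    # Symmetrised KL divergence between the two predicted distributions;
    # zero when the paraphrases map to identical program predictions.
    return 0.5 * (kl(dist_a, dist_b) + kl(dist_b, dist_a))

# Hypothetical distributions over three candidate operations for the
# paraphrases "number of cities" and "how many cities".
p_a = [0.7, 0.2, 0.1]
p_b = [0.6, 0.3, 0.1]
loss = consistency_loss(p_a, p_b)  # added to the usual task loss
```

In training this term would be weighted and summed with the standard supervised loss, pushing paraphrases towards the same program.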
One approach to incorporating human effort is to show the user the generated program, or a description of it, and ask for feedback. If the feedback is in natural language, we then need a model to interpret it and make a suitable update to the query. Elgohary et al. (ACL 2020) developed a dataset based on Spider for text-to-SQL experiments with this kind of updating. Using existing models, the accuracy of the queries produced is higher than without feedback, but far below human accuracy. Their follow-up paper (Elgohary et al., NAACL 2021) improved results by modelling edits as a series of smaller transformations and by using synthetic data at the start of training. The value of interaction is smaller for better text-to-SQL systems (both relative and absolute), though still significant (e.g. improving RAT-SQL by 4.3 points). Tandon et al. (AAAI Workshop, 2022) constructed a dataset for script learning (a series of natural language steps to achieve a goal) with corrections described by people and the script before and after correction.
Another approach is to design UI components that allow users to identify and correct aspects of the model's interpretation. Narechania et al. (IUI 2021) built a system in which the user is shown the mapping from words in their query to table names and actions, with menus of alternatives that they can select from. To aid the process, the UI also shows users a small sample of the dataset, selected to illustrate the impact of components of the query (e.g., either case for a WHERE condition), with a breakdown of the impact of each step of the query. A user study showed positive opinions, but they did not measure how frequently the system fixes incorrect model outputs (as Elgohary's work above does).
For prompt-based models, Austin et al. (arXiv 2021) showed that interaction can work, with clear short clarifications from users leading the model to make suitable updates.
Program synthesis is a closely related field with a long history. The key difference between that work and the work in NLP is that the input is not in natural language. So far, most work in NLP has not connected to that literature, but there is great scope for using it in systems and data collection.
In programming by example (PBE), the desired program is expressed by providing examples of what the user wants. Systems generate programs that satisfy the examples, rank them by some criteria, and interact with users to work out the correct one. Zhang et al. (UIST 2020) proposed two ways to help this refinement process: (1) allow users to label parts of their examples (e.g., for regex generation, saying a character is part of a class of possibilities such as digits), and (2) generate new examples that are close to the provided ones but capture important variations that can distinguish between potential programs. For (2), they also use clustering to group the generated examples for rapid reading. However, those ideas still leave the actual synthesis process as a black box. Their next paper, Zhang et al. (CHI 2021), visualises the synthesis process in three ways. Most interestingly, they present the search tree over regular expressions and allow users to mark particular paths to be avoided or tried first. This format compactly captures the search, providing a picture of what the system is doing and a natural way to influence it.
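The core of idea (2), finding an input that distinguishes between candidate programs, can be sketched in a few lines. This toy example (not Zhang et al.'s system) uses two string programs that agree on the user's single example and searches a pool of inputs for one where they disagree, which is exactly the kind of input worth asking the user about.

```python
# Two hypothetical candidate programs, both consistent with the example
# ("a" -> "A"): uppercase everything vs. capitalise only the first letter.
candidates = [
    lambda s: s.upper(),
    lambda s: s[:1].upper() + s[1:],
]

examples = [("a", "A")]
assert all(f(x) == y for f in candidates for x, y in examples)

def distinguishing_input(candidates, pool):
    """Find an input on which the candidate programs disagree,
    returning it with the conflicting outputs."""
    for x in pool:
        outputs = {f(x) for f in candidates}
        if len(outputs) > 1:
            return x, sorted(outputs)
    return None

print(distinguishing_input(candidates, ["b", "ab", "cd"]))
# → ('ab', ['AB', 'Ab'])
```

Showing the user "ab" and asking whether the answer is "AB" or "Ab" resolves the ambiguity with one interaction.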
For visualisations, Wang et al. (CHI 2021) built a system where users specify examples of how data should be used to make points in a plot and then synthesis methods infer the general transformation needed.
For spreadsheets, FlashFill is the classic example, generating string manipulation code in Excel based on a single user example. Chen et al. (ICML 2021) developed a neural approach (transformer encoder, LSTM decoder) for generation that considers more context and generates formulas in Google Sheets that cover operations beyond string manipulation.
For program generation, splitting the data into train and test sets can be tricky, depending on what we aim to measure. A random split over (utterance, program) pairs can put the same program in both train and test, just with different variables (Finegan-Dollak et al., ACL 2018). Splitting randomly based on programs can create a situation where some symbols are never seen in training. Bogin et al. (arXiv 2022) showed that models have the most difficulty with unseen structures; if a structure has been seen, just with a different symbol, then models can learn the substitution. Their measure of program similarity correlates with system performance across a range of models (and does so better than their prior work on Partial Component Match; Hazoom et al., NLP4Prog 2021).
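A template-based split in the spirit of Finegan-Dollak et al. (ACL 2018) can be sketched as follows: anonymise literal values so that queries differing only in their values share a template, then keep each template entirely in train or entirely in test. The data and helper names here are illustrative.

```python
import re
from collections import defaultdict

def template(sql):
    """Anonymise string literals and numbers to a <VAL> placeholder."""
    return re.sub(r"'[^']*'|\b\d+\b", "<VAL>", sql)

data = [
    ("cities in Ohio", "SELECT name FROM city WHERE state = 'Ohio'"),
    ("cities in Texas", "SELECT name FROM city WHERE state = 'Texas'"),
    ("how many rivers are there", "SELECT COUNT(*) FROM river"),
]

# Group examples by template, then split at the template level so the
# same program shape never appears on both sides.
groups = defaultdict(list)
for utterance, sql in data:
    groups[template(sql)].append((utterance, sql))

templates = sorted(groups)          # deterministic order for this sketch
cut = len(templates) // 2
train = [ex for t in templates[:cut] for ex in groups[t]]
test = [ex for t in templates[cut:] for ex in groups[t]]
```

Here the two "cities in ..." questions share a template, so both land in the same split; a random split over pairs could have put one in each.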
Most work in NLP focuses on accuracy of models for these tasks. Once those models are integrated into user interfaces, we also need to evaluate the overall effectiveness of the system.
Text to SQL
| Name | Task | Source | Train / Dev / Test |
| --- | --- | --- | --- |
| Academic, Advising, ATIS, Geography, Restaurants, Scholar, Spider (train), IMDB, Yelp, WikiSQL | | | |
| Spider-DK | SQL for query | Spider with queries selected or modified to require domain knowledge | - / 535 / - |
| Splash | Fix SQL given a query and a request | Spider with queries generated from a model and correction requests written by people | 7,481 / 871 / 962 |
| SEDE | SQL for query | Stack Exchange Data Explorer user questions and T-SQL queries | 10,309 / 1,714 |
| MIMICSQL | SQL for query | Questions are auto-generated, filtered, and rephrased | 8,000 / 1,000 / 1,000 |
Regular Expressions
- Stackoverflow: collected from posts, including example inputs (Xi et al., TACL 2020)
- Turk: auto-generated with a synchronous grammar, then paraphrased (Locascio et al., EMNLP 2016)
- KB13: crowd workers wrote descriptions of a set of lines, and others wrote regexes for the descriptions (Kushman and Barzilay, NAACL 2013)
General Purpose Languages
| Name | Task | Source | Train / Dev / Test |
| --- | --- | --- | --- |
| CONCODE | Generate a Java class method given a description of the method and the rest of the class definition | GitHub repositories | 100,000 / 2,000 / 2,000 |
| JuICe | Given the text above a cell and earlier context, predict the contents of the cell | Python notebooks | 1,518,049 / 1,744 / 1,981 |
| Okapi | Generate an API call based on a request; benchmarks focus on handling longer or unseen structures | Auto-generated APIs, with sentences written by people, then paraphrased by people | 22,628 |
| CodeContests | Solve programming competition problems | Various | 13,328 / 117 / 165 |
| HumanEval | Generate Python code given a function name, arguments, and docstring | GitHub | 164 |
| MBPP | Generate Python code given a task description and three test cases | Crowdsourced | 974 |
| MathQA-Python | Generate Python code to solve a maths problem phrased as a paragraph | | 19,209 / 2,822 / 1,883 |
Plotting

| Name | Task | Source | Train / Dev / Test |
| --- | --- | --- | --- |
| PlotCoder | | Derived from JuICe | |
| ChartDialog | | One worker describes a target plot and another sets parameters to achieve it | 3,200 |