Executable Semantic Parsing

Generating code that represents the meaning of text

Models

General Code

Large language models have been applied to code generation using the prompt-based approach (e.g., Codex and Austin et al. (arXiv 2021)). Fine-tuning improves performance, even with a tiny number of samples, and sampling many outputs then filtering them (keeping outputs that behave correctly on sample input-output pairs) is critical to success. AlphaCode introduced a refinement to this inference process, training a separate model to generate sample inputs and then clustering generated programs based on how they behave with the generated inputs. Solutions are sampled from across these clusters, ensuring diversity in behaviour. For web code (HTML, CSS, JavaScript), Jiang et al. (UIST 2021) used prompt-based methods to generate small snippets of code.
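The sample, filter, and cluster idea can be sketched in a few lines. This is a toy illustration only: the candidate programs are hand-written lambdas standing in for model samples, the probe inputs stand in for inputs from a learned input generator, and visiting clusters from largest to smallest is an assumption of this sketch. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, List, Sequence, Tuple

Program = Callable[[int], int]


def passes_examples(program: Program, examples: Sequence[Tuple[int, int]]) -> bool:
    """True if the program reproduces every (input, output) pair."""
    try:
        return all(program(x) == y for x, y in examples)
    except Exception:
        return False  # a crashing candidate counts as a failure


def behaviour(program: Program, probe_inputs: Sequence[int]) -> Tuple:
    """Outputs of the program on the probe inputs, used as a cluster key."""
    outputs = []
    for x in probe_inputs:
        try:
            outputs.append(program(x))
        except Exception:
            outputs.append("error")
    return tuple(outputs)


def filter_and_cluster(candidates, examples, probe_inputs) -> List[Program]:
    """Keep candidates that pass the examples, then pick one per behaviour cluster."""
    survivors = [p for p in candidates if passes_examples(p, examples)]
    clusters = defaultdict(list)
    for p in survivors:
        clusters[behaviour(p, probe_inputs)].append(p)
    # Assumption of this sketch: visit clusters from largest to smallest.
    ordered = sorted(clusters.values(), key=len, reverse=True)
    return [group[0] for group in ordered]


# Toy stand-ins for sampled programs; the intended task is "double the input".
candidates = [
    lambda x: x * 2,
    lambda x: x + x,
    lambda x: x ** 2,
    lambda x: x + 2,
    lambda x: x + 1,
]
examples = [(2, 4)]      # the known input-output pair
probe_inputs = [0, 3]    # stand-ins for generated inputs
picked = filter_and_cluster(candidates, examples, probe_inputs)
print(len(picked), [p(3) for p in picked])
```

Four of the five candidates pass the single example, but the probe inputs reveal that they fall into three behaviourally distinct groups, so only three programs need to be submitted or shown.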

It is possible to train strong small models if the data is high quality. For example, Gunasekar et al. (arXiv 2023) used (a) data filtered by a model trained on 100,000 GPT-4 judgements, and (b) data generated with GPT-3.5. That led to performance comparable to GPT-4 but with a much smaller model.

SQL

Sequence-to-sequence models for SQL generation were either specialised to a single database (e.g., our work) or took the schema as an input (e.g., WikiSQL’s Seq2SQL model). The latter approach worked in simple domains, like the single tables of WikiSQL, but when the database is complex, Li et al. (AAAI 2023) showed that it helps to filter the fields for relevance to the query and organise the tables so that the most relevant appear first.
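To make the filtering and ordering step concrete, here is a minimal sketch. It scores relevance by word overlap between the question and the table or column names, which is a stand-in assumption for illustration and not the scoring method of the paper. The schema, the question, and all function names are hypothetical.

```python
import re
from typing import Dict, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens; underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(name: str, question_tokens: Set[str]) -> int:
    """Number of name tokens that also occur in the question."""
    return len(tokens(name) & question_tokens)


def filter_schema(
    question: str,
    schema: Dict[str, List[str]],
    max_tables: int = 2,
    max_columns: int = 2,
) -> List[Tuple[str, List[str]]]:
    """Keep the most relevant tables (best first) and their top columns."""
    q = tokens(question)
    ranked = []
    for table, columns in schema.items():
        by_relevance = sorted(columns, key=lambda c: score(c, q), reverse=True)
        best_column = max((score(c, q) for c in columns), default=0)
        ranked.append((score(table, q) + best_column, table, by_relevance[:max_columns]))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return [(table, cols) for _, table, cols in ranked[:max_tables]]


schema = {
    "building": ["building_id", "address", "floors"],
    "course": ["course_id", "course_title", "credits"],
    "student": ["student_id", "student_name", "birth_date"],
}
question = "List the name of each student enrolled in a course"
result = filter_schema(question, schema)
for table, cols in result:
    print(table, cols)
```

The reduced, reordered schema is what would be serialised alongside the question as input to the sequence-to-sequence model.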

Visualisation

For the constrained domain of plot generation, Chen et al. (ACL 2021) proposed a model that combines LSTM encoding and attention to predict a single plot command and the relevant arguments. Their focus was on choosing the parts of a dataframe to plot and the type of plot. Interestingly, BERT did not help, probably because the language in their setting was different: the text associated with code notebooks. Complete program accuracy is around 56%, though another ~7% of cases are semantically correct, and another ~28% are difficult to answer due to missing content in the language or code context (based on manual analysis; no human results are given for the task).

For the style of plots, Shao and Nakashole (ACL 2020) showed that the generation problem can be formulated as slot-extraction based task-oriented dialogue. Each slot corresponds to a property of the figure. They also released a demonstration that includes some personalisation by remembering previous slot values and continuous learning by asking users which of a k-best set of options is right (Wang et al., NAACL 2021).

Regular Expressions

Most work in NLP treats this task like translation, mapping a string to another string, sometimes with grammatical constraints. Xi et al. (TACL 2020) draw on the program synthesis literature to consider how examples can inform the generation process. They have an NLP-style model (either grammar based or neural), but rather than generating code directly, it generates a sketch of the code. That sketch then constrains a search in the space of programs (in their case, regular expressions) for one that satisfies the sketch and a set of example input-output pairs. To train without annotations of sketches, they search for the best sketch that agrees with the target program. This almost solves KB13 and Turk (96.3% and 98.9% accuracy) when given 20 examples (10 positive, 10 negative), but the Stackoverflow data remains challenging.
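A toy version of sketch-constrained search is shown below. It assumes a sketch is a list of literal pieces and named holes, and it fills the holes by brute-force enumeration, checking each completed regex against positive and negative examples. The hole names, the option lists, and the enumeration strategy are assumptions for illustration; the actual system uses a richer sketch language and a more sophisticated synthesiser.

```python
import itertools
import re
from typing import Iterator, List, Optional, Sequence

# Hypothetical hole types and the fragments allowed to fill them.
HOLE_OPTIONS = {
    "<CLASS>": ["[0-9]", "[a-z]", "[A-Z]"],
    "<REP>": ["+", "{3}"],
}


def completions(sketch: Sequence[str]) -> Iterator[str]:
    """Enumerate every regex obtained by filling the holes in the sketch."""
    options = [HOLE_OPTIONS.get(part, [part]) for part in sketch]
    for choice in itertools.product(*options):
        yield "".join(choice)


def consistent(pattern: str, positives: List[str], negatives: List[str]) -> bool:
    """True if the regex accepts all positives and rejects all negatives."""
    regex = re.compile(pattern)
    return all(regex.fullmatch(s) for s in positives) and not any(
        regex.fullmatch(s) for s in negatives
    )


def synthesise(sketch, positives, negatives) -> Optional[str]:
    """Return the first completion of the sketch that fits the examples."""
    for pattern in completions(sketch):
        if consistent(pattern, positives, negatives):
            return pattern
    return None


# Sketch for a description like "capital letters, a dash, then digits".
# Literal pieces (here "-") are assumed to already be valid regex syntax.
sketch = ["<CLASS>", "<REP>", "-", "<CLASS>", "<REP>"]
positives = ["ABC-1", "XYZ-2024"]
negatives = ["AB-12", "abc-7", "ABCD-5"]
found = synthesise(sketch, positives, negatives)
print(found)
```

The language model only has to get the coarse structure right; the examples resolve details, such as exact repetition counts, that the description leaves ambiguous.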

Training

Gupta et al. (EMNLP 2021) proposed a new learning objective to encourage consistency in program outputs (in their case for neural module networks). The idea is to improve performance by encouraging semantically similar spans of text to be converted into the same code. We can identify suitable pairs in several ways, with the best being either templates to augment training data or identifying semantically similar spans in existing questions.

Improving speed

Human-in-the-Loop

SQL

One approach for incorporating human effort is to show the user the generated program or a description of it and ask for feedback. If the feedback is in natural language, we then need a model to interpret it and make a suitable update to the query. Elgohary et al. (ACL 2020) developed a dataset based on Spider for text-to-SQL experiments with this kind of updating. Using existing models, the accuracy of queries produced is higher than without feedback, but far below human accuracy. Their follow-up paper, Elgohary et al. (NAACL 2021), improved results by modeling edits as a series of smaller transformations and by using synthetic data at the start of training. The value of interaction is smaller for better text-to-SQL systems (both relative and absolute), though still significant (e.g., improving RAT-SQL by 4.3 points). Tandon et al. (Workshop at AAAI, 2022) constructed a dataset for script learning (a series of natural language steps to achieve a goal) with corrections described by people and the script before and after correction.

Another approach is to design UI components that allow users to identify and correct aspects of the model’s interpretation. Narechania et al. (IUI 2021) built a system in which the user is shown the mapping from words in their query to table names and actions, with menus of alternatives that they can select. To aid the process, the UI also shows users a small sample of the dataset, selected to illustrate the impact of components of the query (e.g., either case for a WHERE condition), with a breakdown of the impacts of each step of the query. A user study showed positive opinions, but they did not measure how frequently it corrects model outputs (as Elgohary’s work above does).

Other

For prompt-based models, Austin et al. (arXiv 2021) showed that interaction can work, with clear short clarifications from users leading the model to make suitable updates.

Without Language

Program synthesis is a closely related field with a long history. The key difference between that work and the work in NLP is that the input is not in natural language. So far, most work in NLP has not connected to that literature, but there is great scope for using it in systems and data collection.

In programming by example (PBE), the desired program is expressed by providing examples of what the user wants. Systems generate programs that satisfy the examples, rank them by some criteria, and interact with users to work out the correct one. Zhang et al. (UIST 2020) proposed a few ways to help this refinement process: (1) allow users to label parts of their examples (e.g. for RegEx generation, say a character is part of a class of possibilities like digits), (2) generate new examples that are close to the provided ones, but capture important variations that can distinguish between potential programs. For (2), they also use clustering to group the generated examples for rapid reading. However, those ideas still leave the actual synthesis process as a black box. Their next paper, Zhang et al. (CHI 2021), visualises the synthesis process in three ways. Most interestingly, they present the search tree over regular expressions and allow users to mark particular paths to be avoided or tried first. This format compactly captures the search, providing a picture of what the system is doing and a natural way to influence it.
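Idea (2) above can be sketched as follows: take one user example, generate nearby strings, and keep only those on which the candidate programs disagree, grouped by which candidates accept them. The mutation scheme, the small alphabet, and grouping by accept/reject pattern are assumptions of this sketch, not the method from the paper.

```python
import re
from typing import Dict, Iterator, List, Tuple


def mutate(example: str, alphabet: str = "aA0-") -> Iterator[str]:
    """Yield distinct strings one edit (delete, substitute, append) away."""
    variants = []
    for i in range(len(example)):
        variants.append(example[:i] + example[i + 1:])           # delete
        for ch in alphabet:
            variants.append(example[:i] + ch + example[i + 1:])  # substitute
    for ch in alphabet:
        variants.append(example + ch)                            # append
    seen = {example}
    for variant in variants:
        if variant not in seen:
            seen.add(variant)
            yield variant


def distinguishing_examples(
    example: str, candidates: List[str]
) -> Dict[Tuple[bool, ...], List[str]]:
    """Group variants by which candidate regexes accept them, dropping ties."""
    compiled = [re.compile(p) for p in candidates]
    groups: Dict[Tuple[bool, ...], List[str]] = {}
    for variant in mutate(example):
        verdicts = tuple(bool(r.fullmatch(variant)) for r in compiled)
        if len(set(verdicts)) > 1:  # the candidates disagree on this string
            groups.setdefault(verdicts, []).append(variant)
    return groups


# Three candidate regexes that all accept the user's example "123".
candidates = ["[0-9]+", "[0-9]{3}", "[0-9a-z]+"]
groups = distinguishing_examples("123", candidates)
for verdicts, variants in groups.items():
    print(verdicts, variants)
```

Each group corresponds to one question for the user (for example, "should shorter or longer digit strings match?"), and a single answer rules out every candidate on the wrong side.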

For visualisations, Wang et al. (CHI 2021) built a system where users specify examples of how data should be used to make points in a plot and then synthesis methods infer the general transformation needed.

For spreadsheets, FlashFill is the classic example, generating string manipulation code in Excel based on a single user example. Chen et al. (ICML 2021) developed a neural approach (transformer encoder, LSTM decoder) for generation that considers more context and generates formulas in Google Sheets that cover operations beyond string manipulation.

Evaluation

For program generation, splitting the data into train and test sets can be tricky depending on what we are aiming to measure. A random split of the set of (utterance, program) examples could lead to a situation where the same programs appear in the train and test sets, just with different variables (Finegan-Dollak et al., ACL 2018). Splitting randomly based on programs could create the situation where some symbols are not seen in training. Bogin et al. (arXiv 2022) showed that models have most difficulty with unseen structures. If a structure has been seen, just with a different symbol, then models can learn the substitution. Their measurement of program similarity correlates with system performance across a range of models (and does so better than their prior work on Partial Component Match (Hazoom et al., NLP4Prog 2021)).
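A program-based split can be sketched as follows: anonymise the variables in each program to get a template, then assign whole templates to either train or test so that no template appears in both. The two regexes used for anonymisation are deliberately simplistic assumptions (single-quoted strings and bare numbers only), and the function names and example data are hypothetical.

```python
import random
import re
from collections import defaultdict
from typing import List, Tuple

Pair = Tuple[str, str]  # (utterance, program)


def template(program: str) -> str:
    """Replace quoted strings and numbers with placeholders."""
    program = re.sub(r"'[^']*'", "'VALUE'", program)
    return re.sub(r"\b\d+(?:\.\d+)?\b", "NUM", program)


def split_by_template(
    pairs: List[Pair], test_fraction: float = 0.34, seed: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """Split so that every program template is entirely in train or in test."""
    groups = defaultdict(list)
    for utterance, program in pairs:
        groups[template(program)].append((utterance, program))
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, int(len(keys) * test_fraction))
    test_keys = set(keys[:n_test])
    train = [p for k in keys if k not in test_keys for p in groups[k]]
    test = [p for k in keys if k in test_keys for p in groups[k]]
    return train, test


pairs = [
    ("flights from Boston", "SELECT * FROM flight WHERE origin = 'Boston'"),
    ("flights from Denver", "SELECT * FROM flight WHERE origin = 'Denver'"),
    ("how many cities are there", "SELECT COUNT(*) FROM city"),
    ("cities with over 1000 people", "SELECT name FROM city WHERE population > 1000"),
    ("cities with over 50000 people", "SELECT name FROM city WHERE population > 50000"),
]
train, test = split_by_template(pairs)
print(len(train), len(test))
```

A random split of the same five pairs could easily place the Boston query in train and the Denver query in test, which would measure memorisation of a template rather than generalisation to a new one.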

Most work in NLP focuses on accuracy of models for these tasks. Once those models are integrated into user interfaces, we also need to evaluate the overall effectiveness of the system.

Data

Text to SQL

Name	Task	Source	Train / Dev / Test
Academic, Advising, ATIS, Geography, Restaurants, Scholar, Spider (train), IMDB, Yelp, WikiSQL
Spider
Spider-DK	SQL for query	Spider with queries selected or modified to require domain knowledge	- / 535 / -
Splash	Fix SQL given a query and a request	Spider with queries generated from a model and correction requests written by people	7,481 / 871 / 962
CoSQL
SParC
WikiSQL
StaQC
SEDE	SQL for query	Stack Exchange Data Explorer user questions and T-SQL queries	10,309 / 1,714
MIMICSQL	SQL for query	Questions are auto-generated, filtered, and rephrased	8,000 / 1,000 / 1,000

Regular Expressions

https://github.com/xiye17/SketchRegex/tree/master/DeepSketch/datasets contains:

Stackoverflow, collected from posts, including example inputs, from Xi et al. (TACL 2020)
Turk, auto-generated with a synchronous grammar, then paraphrased, from Locascio et al. (EMNLP 2016)
KB13, crowd workers wrote descriptions of a set of lines, others wrote regex for the descriptions, from Kushman and Barzilay (NAACL 2013)

General Purpose Languages

Name	Task	Source	Train / Dev / Test
CoNaLa
Pseudogen
StaQC
code-docstring-corpus
CodeNN
EMSE-DeepCom
FunCom
CONCODE	Generate a Java class method given a description of the method and the rest of the class definition	Github repositories	100,000 / 2,000 / 2,000
JuICe	Given the text above a cell and earlier context, predict the contents of the cell	Python notebooks	1,518,049 / 1,744 / 1,981
Okapi	Generate an API call based on a request. Benchmarks focus on handling longer or unseen structures.	Auto-generated APIs, people wrote sentences for them, people paraphrased	22,628
CodeContests	Solve programming competition problems	Various	13,328 / 117 / 165
HumanEval	Generate Python code given a function name, arguments, and docstring	GitHub	164
MBPP	Generate Python code given a task description and three test cases	Crowdsourced	974
MathQA-Python	Generate Python code to solve a maths problem phrased as a paragraph		19,209 / 2,822 / 1,883

Visualisation

Name	Task	Source	Train / Dev / Test
PlotCoder		Derived from JuICe
ChartDialog		One worker describes a target plot and another sets parameters to achieve it	3,200