Notes on data sources and history

Summary Table:

Dataset	Main Paper	Data Source
academic	Li and Jagadish, 2014	Contacted authors
advising	Finegan-Dollak et al., 2018	This repository
atis	Iyer et al., 2017	UW
geography	Iyer et al., 2017	UW
imdb	Yaghmazadeh et al., 2017	UT
restaurants	Popescu et al., 2003	Trento
scholar	Iyer et al., 2017	UW
spider	Yu et al., 2018	Yale
yelp	Yaghmazadeh et al., 2017	UT
wikisql	Zhong et al., 2017	Salesforce

academic

Created for NaLIR by enumerating all of the different queries possible with the Microsoft Academic Search interface, then writing questions for each query.

advising

Collected questions from Facebook and undergraduates (past CLAIR lab students), then wrote further questions of a similar style.
Four people wrote SQL queries for all of the questions (one per question).
Six people scored the queries for helpfulness and accuracy (two people per query).
Collected paraphrases on Mechanical Turk, then one person checked them all, correcting/filtering for major grammatical or correctness issues and adding paraphrases to stay above a minimum of 10 per query.

The default student is in EECS (needed for assumed content of queries). In the database they are represented by student record ID 1.

atis

Originally collected for “The ATIS spoken language systems pilot corpus”.
Modified by Iyer et al. to reduce nesting.

geoquery

Originally a dataset created at UT Austin with sentences and logical forms.
Prolog converted to SQL at UW in the early 2000s.
Further queries converted and SQL improved at the University of Trento.
2017 UW paper uses the earlier UW work with additions to cover the remaining queries.

We have corrected some minor issues in the data:

References to population density of cities, which is not in the database
Inconsistent handling of rivers
Inconsistent use of either sorting or a subquery for questions that ask for the max of something
Inconsistent use of ‘US’
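
To make the sorting-versus-subquery point concrete, here is a minimal sketch that runs both styles against a toy in-memory table. The `city` table, its columns, and its rows are invented for illustration and are not the schema or data of this dataset; the sketch only shows that the two styles return the same answer when there is a single maximum.

```python
import sqlite3

# Hypothetical toy table; not the real schema or data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?)",
    [("alpha", 100), ("beta", 300), ("gamma", 200)],
)

# Style 1: sort by the value and keep only the top row.
by_sort = conn.execute(
    "SELECT name FROM city ORDER BY population DESC LIMIT 1"
).fetchall()

# Style 2: compare each row against a MAX subquery.
by_subquery = conn.execute(
    "SELECT name FROM city WHERE population = (SELECT MAX(population) FROM city)"
).fetchall()

print(by_sort == by_subquery)
```

The two styles can differ when several rows tie for the maximum: the sorted form with `LIMIT 1` returns a single row, while the subquery form returns every tied row.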

restaurants

Originally a dataset created at UT Austin with sentences and logical forms.
Converted to SQL by Popescu et al. (UW)
Improved by Giordani and Moschitti (Trento)

scholar

Constructed at UW in 2017

spider

Combination of data from this repository (1,659 queries) and new data (8,034 queries) across a large set of tables.
SQL canonicalised and variables detected automatically by us

yelp and imdb

Constructed at UT Austin in 2017

Note - in the imdb dataset there are some cases where multiple SQL queries are provided because of ambiguity in the question. For example:

"What is the nationality of Ben Affleck?"
"select director_0.nationality from director as director_0 where director_0.name = \" Ben Affleck \" "
"select actor_0.nationality from actor as actor_0 where actor_0.name = \" Ben Affleck \" "

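One way to handle questions with several acceptable queries is to count a prediction as correct if it matches any of them. The sketch below assumes exact string matching after whitespace normalisation; the helper names `normalise` and `matches_any` are hypothetical, and this is not necessarily how any published evaluation treats these cases.

```python
def normalise(sql):
    """Collapse runs of whitespace so spacing differences do not matter."""
    return " ".join(sql.split())


def matches_any(predicted, gold_queries):
    """Return True if the prediction equals any gold query after normalisation."""
    return normalise(predicted) in {normalise(g) for g in gold_queries}


# The two gold queries from the example above.
gold = [
    'select director_0.nationality from director as director_0 '
    'where director_0.name = " Ben Affleck "',
    'select actor_0.nationality from actor as actor_0 '
    'where actor_0.name = " Ben Affleck "',
]

# A prediction that differs from the second gold query only in spacing.
predicted = (
    'select  actor_0.nationality  from actor as actor_0 '
    'where actor_0.name = " Ben Affleck "'
)

print(matches_any(predicted, gold))
```

Matching on strings is a deliberately simple choice; comparing query results against a database would be an alternative, at the cost of needing the data loaded.
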
wikiSQL

Data collected by Salesforce
SQL converted to our format and duplicate queries detected by us