Notes on data sources and history
Summary Table:
Dataset | Main Paper | Data Source |
---|---|---|
academic | Li and Jagadish, 2014 | Contacted authors |
advising | Finegan-Dollak et al., 2018 | Here! |
atis | Iyer et al., 2017 | UW |
geography | Iyer et al., 2017 | UW |
imdb | Yaghmazadeh et al., 2017 | UT |
restaurants | Popescu et al., 2003 | Trento |
scholar | Iyer et al., 2017 | UW |
spider | Yu et al., 2018 | Yale |
yelp | Yaghmazadeh et al., 2017 | UT |
wikisql | Zhong et al., 2017 | Salesforce |
academic
Created for NaLIR by enumerating all of the different queries possible with the Microsoft Academic Search interface, then writing questions for each query.
advising
- Collected questions from Facebook and undergraduates (past CLAIR lab students), then wrote further questions of a similar style.
- Four people wrote SQL queries for all of the questions (one per question).
- Six people scored the queries for helpfulness and accuracy (two people per query).
- Collected paraphrases on Mechanical Turk, then one person checked them all, correcting or filtering out those with major grammatical or correctness issues and adding paraphrases where needed to maintain a minimum of 10 per query.
The default student is in EECS (needed for assumed content of queries). In the database they are represented by student record ID 1.
atis
- Originally collected for “The ATIS spoken language systems pilot corpus”.
- Modified by Iyer et al. to reduce nesting.
geoquery
- Originally a dataset created at UT Austin with sentences and logical forms.
- Prolog converted to SQL at UW in the early 2000s.
- Further queries converted and SQL improved at the University of Trento.
- The 2017 UW paper uses the earlier UW work, with additions to cover the remaining queries.
We have corrected some minor issues in the data:
- References to population density of cities, which is not in the database
- Inconsistent handling of rivers
- Inconsistent use of either sorting or a subquery for questions that ask for the max of something
- Inconsistent use of ‘US’
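To illustrate why mixing sorting and subqueries for "max" questions matters, here is a minimal sketch using Python's built-in `sqlite3` module. The `state` table, its columns, and the values in it are hypothetical placeholders chosen for illustration; they are not the schema or contents of the actual database, and the sketch does not show which style the corrected data uses.

```python
import sqlite3

# Hypothetical toy table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state (state_name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO state VALUES (?, ?)",
    [("alpha", 100), ("beta", 300), ("gamma", 300), ("delta", 50)],
)

# Style 1: sort by the value and keep only the first row.
by_sorting = conn.execute(
    "SELECT state_name FROM state ORDER BY population DESC, state_name LIMIT 1"
).fetchall()

# Style 2: compare each row against the maximum computed in a subquery.
by_subquery = conn.execute(
    "SELECT state_name FROM state "
    "WHERE population = (SELECT MAX(population) FROM state) "
    "ORDER BY state_name"
).fetchall()

print(by_sorting)   # one row, even though two rows share the maximum
print(by_subquery)  # every row that attains the maximum
```

The two styles agree when the maximum is unique, but on ties the sorting form returns a single row while the subquery form returns all tied rows, so a dataset that mixes them can give different answers to questions of the same kind.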
restaurants
- Originally a dataset created at UT Austin with sentences and logical forms.
- Converted to SQL by Popescu et al. (UW)
- Improved by Giordani and Moschitti (Trento)
scholar
Constructed at UW in 2017
spider
- Combination of data from this repository (1,659 queries) and new data (8,034 queries) across a large set of tables.
- SQL canonicalised and variables detected automatically by us
yelp and imdb
Constructed at UT Austin in 2017
Note: in the imdb dataset, some questions have multiple SQL queries because the question is ambiguous. For example:
"What is the nationality of Ben Affleck?"
"select director_0.nationality from director as director_0 where director_0.name = \" Ben Affleck \" "
"select actor_0.nationality from actor as actor_0 where actor_0.name = \" Ben Affleck \" "
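One way to account for such questions when scoring a system is to count a prediction as correct if it matches any of the provided queries. The sketch below assumes that approach; the helper names `normalise` and `matches_any` are hypothetical, and the whitespace and case normalisation is a crude simplification, not the evaluation procedure of any of the papers listed above.

```python
from typing import Sequence


def normalise(sql: str) -> str:
    """Collapse whitespace and lowercase; a crude stand-in for real SQL canonicalisation.

    Note that this also lowercases string literals, which a careful
    implementation would leave untouched.
    """
    return " ".join(sql.lower().split())


def matches_any(predicted: str, gold_queries: Sequence[str]) -> bool:
    """Return True if the prediction equals any of the acceptable gold queries."""
    target = normalise(predicted)
    return any(target == normalise(gold) for gold in gold_queries)


# The two acceptable queries for "What is the nationality of Ben Affleck?"
gold = [
    'select director_0.nationality from director as director_0 '
    'where director_0.name = " Ben Affleck "',
    'select actor_0.nationality from actor as actor_0 '
    'where actor_0.name = " Ben Affleck "',
]

# A hypothetical system output that differs only in keyword case.
prediction = (
    'SELECT actor_0.nationality FROM actor AS actor_0 '
    'WHERE actor_0.name = " Ben Affleck "'
)

print(matches_any(prediction, gold))
```

Exact string matching after normalisation is the simplest choice here; comparing execution results against each gold query would be an alternative under the same "any match counts" assumption.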
wikiSQL
- Data collected by Salesforce
- SQL converted to our format and duplicate queries detected by us