
Notes on data sources and history

Summary Table:

Dataset        Main Paper                     Data Source
academic       Li and Jagadish, 2014          Contacted authors
advising       Finegan-Dollak et al., 2018    Here!
atis           Iyer et al., 2017              UW
geography      Iyer et al., 2017              UW
imdb           Yaghmazadeh et al., 2017       UT
restaurants    Popescu et al., 2003           Trento
scholar        Iyer et al., 2017              UW
spider         Yu et al., 2018                Yale
yelp           Yaghmazadeh et al., 2017       UT
wikisql        Zhong et al., 2017             Salesforce

academic

Created for NaLIR by enumerating all of the different queries possible with the Microsoft Academic Search interface, then writing questions for each query.

advising

  1. Collected questions from Facebook and undergraduates (past CLAIR lab students), then wrote further questions of a similar style.
  2. Four people wrote SQL queries for all of the questions (one per question).
  3. Six people scored the queries for helpfulness and accuracy (two people per query).
  4. Collected paraphrases on Mechanical Turk; one person then checked them all, correcting or filtering out paraphrases with major grammatical or correctness issues, and adding new paraphrases to keep at least 10 per query.

Queries assume a default student, who is in EECS (questions asked in the first person are answered with respect to this student). In the database, this student is represented by student record ID 1.

atis

  1. Originally collected for “The ATIS Spoken Language Systems Pilot Corpus”.
  2. Modified by Iyer et al. to reduce nesting.

geography

  1. Originally a dataset created at UT Austin with sentences and logical forms.
  2. Prolog converted to SQL at UW in the early 2000s.
  3. Further queries converted and SQL improved at the University of Trento.
  4. The 2017 UW paper used the earlier UW SQL, with additions to cover the remaining queries.

We have corrected some minor issues in the data.

restaurants

  1. Originally a dataset created at UT Austin with sentences and logical forms.
  2. Converted to SQL by Popescu et al. (UW).
  3. Improved by Giordani and Moschitti (Trento).

scholar

Constructed at UW in 2017.

spider

  1. Combination of data from this repository (1,659 queries) and new data (8,034 queries) across a large set of tables.
  2. SQL canonicalised and variables detected automatically by us (sketched below).
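
Roughly, variable detection means replacing literal values in each query with numbered placeholders and recording what they stood for. A minimal sketch of the idea in Python (the function name, placeholder scheme, and example query are illustrative assumptions, not this repository's actual tooling):

    import re

    def anonymise_literals(sql):
        # Replace each double-quoted string literal with a numbered
        # placeholder (var0, var1, ...) and record the mapping.
        # Simplified sketch: the real process also types variables
        # and handles unquoted numeric literals.
        variables = {}
        def replace(match):
            name = "var%d" % len(variables)
            variables[name] = match.group(1)
            return '"%s"' % name
        return re.sub(r'"([^"]*)"', replace, sql), variables

    sql, variables = anonymise_literals(
        'select river.name from river where river.state = "Texas"')
    # sql       -> 'select river.name from river where river.state = "var0"'
    # variables -> {'var0': 'Texas'}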

yelp and imdb

Constructed at UT Austin in 2017.

Note: in the imdb dataset there are some cases where multiple SQL queries are provided because of ambiguity in the question. For example, the following question could refer to either a director or an actor:

"What is the nationality of Ben Affleck?"
"select director_0.nationality from director as director_0 where director_0.name = \" Ben Affleck \" "
"select actor_0.nationality from actor as actor_0 where actor_0.name = \" Ben Affleck \" "

wikisql

  1. Data collected by Salesforce.
  2. SQL converted to our format and duplicate queries detected by us (a sketch of the idea follows).
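
For illustration, duplicate detection can be as simple as normalising each query and grouping identical strings. A minimal sketch under that assumption (the actual scripts here may do more, e.g. compare queries after variable anonymisation):

    import re
    from collections import defaultdict

    def normalise(sql):
        # Collapse whitespace and lowercase so trivially different
        # spellings of the same query compare equal.
        return re.sub(r"\s+", " ", sql.strip()).lower()

    def find_duplicates(queries):
        # Group query indices by normalised form; any group with
        # more than one member is a set of duplicates.
        groups = defaultdict(list)
        for i, sql in enumerate(queries):
            groups[normalise(sql)].append(i)
        return [idxs for idxs in groups.values() if len(idxs) > 1]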