These are useful tools for processing the SQL data.

canonicaliser.py

This is the code we wrote to modify SQL to have a consistent style, specifically:

Tests written while developing the code are also included. If you use this, proceed with care: if your SQL contains phenomena we had not considered, the results could be unexpected.

corpus_stats.py

Collects a few simple statistics about a dataset:

json_to_flat.py

A convenient tool to convert from our JSON format to three files (train, dev, test) containing one example per line: sentence | query with variables filled in.
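As a rough illustration of the conversion described above, here is a minimal sketch. It is not the code in json_to_flat.py. The field names (`sentences`, `text`, `variables`, `question-split`, `sql`) and the helper names `fill_variables` and `flatten` are assumptions made for this example, so check them against your data before relying on it.

```python
from collections import defaultdict


def fill_variables(template, variables):
    """Substitute each variable name in the template with its value."""
    # Replace longer names first so "city10" is not clobbered by "city1".
    for name in sorted(variables, key=len, reverse=True):
        template = template.replace(name, variables[name])
    return template


def flatten(entries):
    """Return {split_name: ["sentence | query", ...]} for the given entries.

    Assumed (hypothetical) schema: each entry has a list of SQL strings
    under "sql" and a list of sentences, each with "text", "variables",
    and a "question-split" label such as "train", "dev", or "test".
    """
    lines = defaultdict(list)
    for entry in entries:
        query = entry["sql"][0]  # use the first query variant
        for sentence in entry["sentences"]:
            variables = sentence["variables"]
            text = fill_variables(sentence["text"], variables)
            sql = fill_variables(query, variables)
            lines[sentence["question-split"]].append(f"{text} | {sql}")
    return dict(lines)


example = [{
    "sql": ['SELECT name FROM city WHERE state = "state0"'],
    "sentences": [{
        "text": "cities in state0",
        "variables": {"state0": "ohio"},
        "question-split": "train",
    }],
}]
flat = flatten(example)
```

Each list in the result can then be written to its own file, one example per line.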

reformat_text2sql_data.py

A utility script that writes JSON-formatted datasets separated by question split or query split, and further divided into train/dev/test or cross-validation splits. Each split can then be read independently, which simplifies data loading.
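The grouping this script performs might look roughly like the sketch below. It is an illustration under assumptions, not the script itself: the keys `query-split` (on each entry) and `question-split` (on each sentence), and the function name `split_entries`, are hypothetical names chosen for this example.

```python
from collections import defaultdict


def split_entries(entries, by="question"):
    """Group entries into buckets keyed by split label.

    by="query":    the whole entry goes to the bucket named by its
                   (assumed) "query-split" field.
    by="question": sentences are grouped by their (assumed)
                   "question-split" field, and each bucket receives a
                   copy of the entry holding only its own sentences.
    Labels may be "train"/"dev"/"test" or cross-validation fold ids.
    """
    buckets = defaultdict(list)
    for entry in entries:
        if by == "query":
            buckets[entry["query-split"]].append(entry)
            continue
        groups = defaultdict(list)
        for sentence in entry["sentences"]:
            groups[sentence["question-split"]].append(sentence)
        for label, sentences in groups.items():
            buckets[label].append({**entry, "sentences": sentences})
    return dict(buckets)


data = [{
    "query-split": "dev",
    "sql": ["SELECT 1"],
    "sentences": [
        {"text": "first", "question-split": "train"},
        {"text": "second", "question-split": "test"},
    ],
}]
by_question = split_entries(data, by="question")
by_query = split_entries(data, by="query")
```

Writing each bucket to its own JSON file (for example with `json.dump`) would let a loader read one split without touching the others.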