Datasets
IE/NER from Cybercriminal Forums
Forum posts with annotations of products.
https://evidencebasedsecurity.org/forums/#data
DSTC 7 track 1: Next Utterance Selection
Data from the Noetic End-to-End Response Selection Challenge. Dialogue from Ubuntu tech support and Michigan course advising.
https://ibm.github.io/dstc-noesis/public/index.html
DSTC 8 track 2: Next Utterance Selection
Data from NOESIS II: Predicting Responses, Identifying Success, and Managing Complexity in Task-Oriented Dialogue. Dialogue from Ubuntu tech support and Michigan course advising.
https://github.com/dstc8-track2/NOESIS-II/
IRC Disentanglement
Annotation of IRC messages with reply-to structure, which disentangles simultaneous conversations. The largest such annotated resource.
https://jkk.name/irc-disentanglement/
Crowdsourced Paraphrases
Paraphrases collected while conducting experiments on factors influencing crowd performance.
https://aclanthology.org/anthology/attachments/P/P17/P17-2017.Datasets.zip
Spine and Arc version of the Penn Treebank
Code to convert the standard Penn Treebank into a version where each word is assigned a spine of non-terminals, and arcs to indicate attachments from one spine to another.
https://jkk.name/1ec-graph-parser/format-conversion
Text to SQL datasets
A collection of datasets containing questions in English paired with SQL queries for a provided database. Our version homogenises the style of the SQL and corrects errors in previous versions of the data.
https://jkk.name/text2sql-data