Datasets

IE/NER from Cybercriminal Forums

Forum posts annotated with product mentions.

https://evidencebasedsecurity.org/forums/#data

DSTC 7 track 1: Next Utterance Selection

Data from the Noetic End-to-End Response Selection Challenge. Dialogues from Ubuntu tech support and Michigan course advising.

https://ibm.github.io/dstc-noesis/public/index.html

DSTC 8 track 2: Next Utterance Selection

Data from NOESIS II: Predicting Responses, Identifying Success, and Managing Complexity in Task-Oriented Dialogue. Dialogues from Ubuntu tech support and Michigan course advising.

https://github.com/dstc8-track2/NOESIS-II/

IRC Disentanglement

IRC messages annotated with reply-to structure, which disentangles simultaneous conversations. This is the largest such annotated resource.

https://jkk.name/irc-disentanglement/
[Figure: Example IRC conversation]
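
As a rough illustration of how reply-to structure yields separate conversations, the sketch below groups messages into connected components of the reply graph. The in-memory representation (integer message indices and `(message, antecedent)` pairs) and the function name are assumptions made for this example; they are not the dataset's file format or the authors' tooling.

```python
from collections import defaultdict


def conversations_from_links(num_messages, reply_links):
    """Group message indices into conversations using reply-to links.

    reply_links is a hypothetical list of (message, antecedent) index pairs.
    Messages connected by any chain of links end up in the same conversation.
    """
    parent = list(range(num_messages))

    def find(i):
        # Follow parent pointers to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for message, antecedent in reply_links:
        root_a, root_b = find(message), find(antecedent)
        if root_a != root_b:
            # Keep the earliest message as the representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = defaultdict(list)
    for i in range(num_messages):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Six interleaved messages: 1 and 4 belong with 0, while 3 and 5 belong with 2.
links = [(1, 0), (3, 2), (4, 1), (5, 3)]
threads = conversations_from_links(6, links)
print(threads)
```

A message with no link to any other message forms a conversation of its own under this grouping.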

Crowdsourced Paraphrases

Paraphrases collected while conducting experiments on factors influencing crowd performance.

https://aclanthology.org/anthology/attachments/P/P17/P17-2017.Datasets.zip

Spine and Arc version of the Penn Treebank

Code to convert the standard Penn Treebank into a version where each word is assigned a spine of non-terminals, and arcs indicate attachments from one spine to another.

https://jkk.name/1ec-graph-parser/format-conversion
[Figure: Example parse in my split-head format]
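
To make the spine-and-arc idea concrete, here is a minimal sketch of one way such a parse could be held in memory. The class, its field names, and the toy sentence are hypothetical and chosen only for illustration; the converter's actual output format is documented at the link above and may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpineToken:
    word: str
    spine: List[str]              # non-terminals above the word, bottom to top
    head: Optional[int] = None    # index of the word whose spine this one attaches to
    attach_to: Optional[str] = None  # non-terminal in the head's spine receiving the arc


def invalid_attachments(tokens):
    """Return indices of tokens whose arc targets a non-terminal missing from the head's spine."""
    bad = []
    for i, tok in enumerate(tokens):
        if tok.head is None:
            continue  # the root spine attaches to nothing
        if tok.attach_to not in tokens[tok.head].spine:
            bad.append(i)
    return bad


# Toy parse of "We like parsing": (S (NP We) (VP like (NP parsing)))
parse = [
    SpineToken("We", ["NP"], head=1, attach_to="S"),
    SpineToken("like", ["VP", "S"]),
    SpineToken("parsing", ["NP"], head=1, attach_to="VP"),
]
print(invalid_attachments(parse))
```

Storing the attachment site alongside the head index is what lets the original constituency tree be rebuilt from the spines and arcs.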

Text to SQL datasets

A collection of datasets containing questions in English paired with SQL queries for a provided database. Our version homogenises the style of the SQL and corrects errors in previous versions of the data.

https://jkk.name/text2sql-data
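
The pairing of an English question with a SQL query over a provided database can be sketched as follows. The table, rows, question, and query are invented for this example and do not come from any of the datasets; the dictionary layout is likewise an assumption, not the collection's file format.

```python
import sqlite3

# A tiny in-memory stand-in for the "provided database" (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (id INTEGER PRIMARY KEY, name TEXT, credits INTEGER)")
conn.executemany(
    "INSERT INTO course (name, credits) VALUES (?, ?)",
    [("Algorithms", 4), ("Databases", 3), ("Compilers", 4)],
)

# One hypothetical example: an English question paired with its SQL query.
example = {
    "question": "Which courses are worth 4 credits?",
    "sql": "SELECT name FROM course WHERE credits = 4 ORDER BY name",
}

# Running the paired query against the database gives the answer to the question.
rows = [row[0] for row in conn.execute(example["sql"])]
print(example["question"], rows)
```

Executing the query is one simple way to check that a pair is consistent with its database.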