Ordinal Common-sense Inference (Zhang et al., 2017)

A new task and dataset of 39k examples for common sense reasoning, with a sentence generated for each prompt and a manual label indicating their relation, from very likely to impossible.

When people read a sentence they form an entire world around it, making inferences about unwritten properties based on their prior knowledge. If we want NLP systems to do the same, we need data to train and test this common sense aspect of language understanding.

The paper introduces a dataset of automatically generated sentence pairs with human ratings. Each rating says that, given the first sentence, the second sentence is very likely, likely, plausible, technically possible, or impossible. The ratings are crowdsourced, and each example's label is the median of three ratings. The pay rate is fairly low, at $3.45 / hour (1.99c / example and 20.71 seconds / example), though the average time may be skewed by outliers, and it's unclear exactly how pay was determined (does this include Amazon's cut? Why is it reported as an average cost per example, rather than just the cost?).
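As a rough check on those numbers, here is a minimal sketch of the median aggregation and the pay arithmetic. The integer encoding of the labels (1 for impossible up to 5 for very likely) and the names `SCALE`, `aggregate`, and `hourly_rate` are my own assumptions for illustration, not something taken from the paper's code.

```python
from statistics import median

# Assumed integer encoding of the five labels (hypothetical, for illustration).
SCALE = {
    "impossible": 1,
    "technically possible": 2,
    "plausible": 3,
    "likely": 4,
    "very likely": 5,
}
NAMES = {value: name for name, value in SCALE.items()}


def aggregate(ratings):
    """Return the median label for an odd number of crowdsourced ratings."""
    values = [SCALE[r] for r in ratings]
    # With an odd number of ratings the median is one of the ratings itself.
    return NAMES[int(median(values))]


def hourly_rate(cents_per_example, seconds_per_example):
    """Convert per-example pay and time into dollars per hour."""
    examples_per_hour = 3600 / seconds_per_example
    return cents_per_example * examples_per_hour / 100


print(aggregate(["likely", "impossible", "plausible"]))
print(round(hourly_rate(1.99, 20.71), 2))
```

With the reported 1.99c and 20.71 seconds this comes out at about $3.46 / hour, which matches the reported $3.45 up to rounding of the inputs.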

The main contribution is the novel way of generating the sentences. For each prompt sentence, an argument is chosen, and then a hypothesis is generated in one of three ways (all trained with Gigaword). (1) A sequence-to-sequence model takes the full sentence as input and generates a sentence. (2) The same as (1), but with only the argument provided. (3) A sentence is sampled from templates generated by abstraction of sentences in the training data. Together these produce a diverse set of examples that get a range of ratings, with only ’likely’ being somewhat rarer. They also labeled some pairs from SNLI and COPA, to enable analysis of how this task compares.

They also provide a set of baselines for the new task. Using the baselines, they show that the generated sentences are somewhat more difficult than the pairs from existing datasets. The standard metrics proposed are MSE and Spearman's Rho (both are necessary because otherwise always guessing the middle label would get an MSE better than any of the proposed baselines). Interestingly, regression does quite a bit better than a set of one-vs-all SVMs on MSE, and also slightly better on rho (I'm surprised because while there is an ordinal scale, it doesn't feel like it should have a strong continuous interpretation).
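To see why both metrics are needed, here is a small sketch with made-up numbers (not from the paper). It assumes labels are encoded as integers 1 to 5, and the helpers `mse`, `average_ranks`, and `spearman_rho` are my own plain-Python versions rather than the authors' evaluation code. A constant middle guess gets a lower MSE than a predictor that ranks the examples sensibly but is badly calibrated, yet the constant guess carries no ranking information at all.

```python
from statistics import mean


def mse(gold, pred):
    """Mean squared error between gold and predicted ordinal values."""
    return mean((g - p) ** 2 for g, p in zip(gold, pred))


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(gold, pred):
    """Spearman's rho as the Pearson correlation of the (tie-averaged) ranks.

    Returns None when either side is constant, since rho is undefined there.
    """
    rg, rp = average_ranks(gold), average_ranks(pred)
    mg, mp = mean(rg), mean(rp)
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    var_g = sum((a - mg) ** 2 for a in rg)
    var_p = sum((b - mp) ** 2 for b in rp)
    if var_g == 0 or var_p == 0:
        return None
    return cov / (var_g * var_p) ** 0.5


gold = [1, 2, 3, 4, 5]       # toy gold labels
middle = [3] * len(gold)     # always guess the middle label
skewed = [4, 4, 5, 5, 5]     # right ordering, poorly calibrated

print(mse(gold, middle), spearman_rho(gold, middle))
print(mse(gold, skewed), spearman_rho(gold, skewed))
```

In this toy case the middle guess wins on MSE but has no defined correlation with the gold labels, while the skewed predictor loses on MSE and still has a clearly positive rho.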



author = {Zhang, Sheng and Rudinger, Rachel and Duh, Kevin and Van Durme, Benjamin},
title = {Ordinal Common-sense Inference},
journal = {Transactions of the Association for Computational Linguistics},
volume = {5},
year = {2017},
keywords = {},
issn = {2307-387X},
url = {https://www.transacl.org/ojs/index.php/tacl/article/view/1082},
pages = {379--395}