There is a new effort to create a set of consistent datasets across multiple domains and languages: Universal Anaphora.
General approaches to annotation:
- Mentions are provided, and annotators go through them one by one.
- They can link each mention either to an earlier span or to an entity in a list. Aralikatte and Søgaard (LREC 2020) showed that these two approaches take the same amount of time, though linking to a list may be more accurate (they provided minimal annotation guidelines, which may mean the flexibility of the first method led to more variation). Sachan et al. (IJCAI 2015) found that pairwise judgements were faster, though few details are provided for the experiment, which was not the focus of the paper.
- When linking to earlier spans, they can be asked to link to (a) any span, (b) the first span, (c) the most recent span. It’s unclear what the speed and accuracy tradeoffs of these are.
- Mention pairs are presented and annotators confirm whether they are coreferent.
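The "link to an entity list" workflow above can be made concrete with a minimal sketch. Here the `decisions` input (an index into the entity list, or `None` for "new entity") stands in for the human judgement; the function and variable names are hypothetical, not from any of the cited tools.

```python
# Replay "link each mention to an entity in a list" decisions into clusters.
# Mentions are visited in document order; each decision is either None
# (start a new entity) or the index of an existing entity to link to.

def build_clusters(mentions, decisions):
    """Replay annotator decisions into entity clusters (lists of mentions)."""
    entities = []
    for mention, choice in zip(mentions, decisions):
        if choice is None:          # annotator says: new entity
            entities.append([mention])
        else:                       # annotator links to entity `choice`
            entities[choice].append(mention)
    return entities

mentions = ["Ada Lovelace", "She", "the engine", "it"]
decisions = [None, 0, None, 1]      # "She" -> entity 0, "it" -> entity 1
clusters = build_clusters(mentions, decisions)
# clusters == [["Ada Lovelace", "She"], ["the engine", "it"]]
```

The span-linking variants in the list above differ only in which antecedent the decision points at; the cluster-building step is the same.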
Active learning has had mixed success for coreference:
- Simulated studies: The earliest work found no benefit in a biomedical domain (Gasperin, NAACL Workshop 2009). Sachan et al. (IJCAI 2015) did see a benefit, particularly when combining several selection methods in an ensemble. Yuan et al. (ACL 2022) explored a range of strategies for choosing samples, finding that a measure of uncertainty for mention detection was most effective when trying to minimise switching between documents (which is important given the cost in time spent reading). However, the gain over random sampling was small.
- Human studies: Li et al. (ACL 2020) proposed asking "is this mention pair coreferent?" and, if not, "what should this mention link to?". Structuring the task that way provides more information at little time cost to the annotator. They compared several ways of choosing pairs, with all doing well (query-by-committee was slightly worse).
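The document-batched uncertainty idea from Yuan et al. can be sketched as follows: score each candidate mention by the model's uncertainty, then spend the annotation budget on the documents whose mentions are most uncertain in aggregate, so the annotator rarely switches documents. The scoring and helper names here are illustrative, not the authors' implementation.

```python
# Select documents for annotation by aggregate mention-detection uncertainty,
# so each unit of annotation effort stays within one document.
import math
from collections import defaultdict

def mention_entropy(p):
    """Binary entropy of the model's mention-detection probability."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def select_documents(candidates, budget):
    """candidates: list of (doc_id, mention_prob); pick docs by total entropy."""
    per_doc = defaultdict(float)
    for doc_id, prob in candidates:
        per_doc[doc_id] += mention_entropy(prob)
    ranked = sorted(per_doc, key=per_doc.get, reverse=True)
    return ranked[:budget]

candidates = [("d1", 0.5), ("d1", 0.45), ("d2", 0.99), ("d2", 0.01)]
select_documents(candidates, budget=1)   # picks "d1": its mentions are near 0.5
```

Random sampling, the baseline the papers compare against, would simply shuffle the document list instead of ranking it.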
Elazar et al. (2021) used MTurk: workers were shown a document with an NP highlighted and the entities annotated so far, and asked whether the NP is (a) new, (b) the same as an entity so far, shown as a list, (c) a time expression, or (d) idiomatic. Agreement was 76.8-82.1 (CoNLL 2012 score). Workers were paid $1.50 for ~160 word documents, which works out to about 12 minutes a document (the aim was for US minimum wage). Training data was single-annotated; test data was double-annotated and adjudicated by an expert.
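The CoNLL 2012 score used for the agreement numbers above is the average of three metrics (MUC, B-cubed, CEAF-e). As a concrete piece, B-cubed alone can be sketched: each mention's precision and recall is the overlap between its predicted and gold clusters, averaged over mentions. The input format here is illustrative and assumes both sides cover the same mentions.

```python
# B-cubed (B3), one of the three metrics averaged into the CoNLL 2012 score.

def b_cubed(gold, pred):
    """gold/pred: dicts mapping each mention -> frozenset of its cluster's
    members. Assumes both dicts cover the same mentions."""
    def avg_overlap(a, b):
        return sum(len(a[m] & b[m]) / len(a[m]) for m in a) / len(a)
    precision = avg_overlap(pred, gold)
    recall = avg_overlap(gold, pred)
    return 2 * precision * recall / (precision + recall)

gold_clusters = [{"a", "b"}, {"c"}]
pred_clusters = [{"a", "b", "c"}]      # over-merged: "c" wrongly joined
to_map = lambda cs: {m: frozenset(c) for c in cs for m in c}
b_cubed(to_map(gold_clusters), to_map(pred_clusters))   # 5/7, about 0.71
```

Over-merging hurts precision but not recall here, which is why B-cubed is usually reported alongside the link-based MUC metric.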
- CoRefi uses exhaustive sequential linking (go through a set of provided mentions one at a time and build clusters by linking them). It allows for modification of spans, but only supports non-overlapping spans. Actions have keyboard commands. It includes an adjudication mode with an algorithm to map different annotators' clusters onto each other. Speed was ~400 mentions / hour.
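Adjudication needs the two annotators' clusterings aligned so disagreements can be surfaced. A minimal sketch of one standard approach, greedy maximum-overlap matching (not necessarily the tool's exact algorithm):

```python
# Greedily pair clusters from two annotators by the size of their overlap,
# so an adjudicator can review each aligned pair and the leftovers.

def align_clusters(ann_a, ann_b):
    """ann_a/ann_b: lists of clusters (sets of mention ids).
    Returns matched (index_in_a, index_in_b) pairs, largest overlap first."""
    candidates = sorted(
        ((len(a & b), i, j)
         for i, a in enumerate(ann_a)
         for j, b in enumerate(ann_b)),
        reverse=True)
    pairs, matched_a, matched_b = [], set(), set()
    for overlap, i, j in candidates:
        if overlap == 0 or i in matched_a or j in matched_b:
            continue
        matched_a.add(i)
        matched_b.add(j)
        pairs.append((i, j))
    return pairs

ann_a = [{"m1", "m2"}, {"m3"}]
ann_b = [{"m3", "m4"}, {"m1", "m2"}]
align_clusters(ann_a, ann_b)   # [(0, 1), (1, 0)]
```

Unmatched clusters on either side (pure disagreements) are exactly the indices missing from the returned pairs.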
Concepts: Cattan et al. (AKBC 2021) developed a dataset and model focused on concepts in scientific articles. As well as linking concepts that are the same, they constructed a hierarchy over the concepts. The process is guided by automatically identifying concept spans using resources such as the knowledge base of concepts in Papers with Code.
Events: Pratapa et al. (CoNLL 2021) developed a dataset for event coreference based on WikiNews with crowdsourced annotations. Prior resources (e.g. ECB+) focused on certain event types, while this dataset includes all events detected by two open-domain systems, with expert checking. It also considers quasi-identity based on time, location, and participants.
Most work on the cross-document task evaluates with gold mentions. Cattan et al. (ACL Findings 2021) build a model based on the e2e-coref model that also predicts mentions, which makes the problem significantly harder (as suggested by their *SEM 2021 paper, which also pointed out that clustering documents in the ECB+ dataset gives an unrealistic advantage due to the way the dataset was constructed).
- ECB+, News
- SciCo, Concepts in scientific papers
- CD2CR, Entities in scientific papers and newspaper articles
Text-based NP Enrichment (Elazar et al., 2021) includes coreference as a subtask, but then involves defining relationships between noun phrases using prepositions (between NPs rather than entities because in some cases a relation is valid for some NPs in an entity, but not others).
The Winograd Schema Challenge aims to probe commonsense knowledge by testing a model's ability to correctly resolve within-sentence coreference examples. WinoGrande scaled up the idea, crowdsourcing a huge dataset and using a model-based process to identify systematic bias that could artificially inflate results. The leaderboard shows dramatic improvements in performance, up to accuracies over 80%, but recent work has shown that the structure of the task is leading to inflated results (Elazar et al., EMNLP 2021): specifically, performance above random can be achieved even without seeing key chunks of the sentence. Also, in a zero-shot setting where models need to predict the verb (rather than coreference), language models have much lower results.