Improving Human-Labeled Data through Dynamic Automatic Conflict Resolution (Sun et al., CoLing 2020)
The standard approach in crowdsourcing is to have a fixed number of workers annotate each instance and then aggregate annotations in some way (possibly with experts resolving disagreements). This paper proposes a way to dynamically allocate workers.
The process is as follows:
- Get two workers to annotate an example. If they agree, assign the label.
- If they disagree, ask additional annotators to label the example one at a time until a simple majority emerges or a limit on the number of annotators is reached.
- For cases where the limit is reached without a majority, fall back to some aggregation approach or expert adjudication (a rough sketch of this loop is below).
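To make the procedure concrete, here is a minimal sketch of the allocation loop. It assumes a hypothetical `request_annotation(example)` call that returns one worker's label and an illustrative `max_annotators` budget; neither name is from the paper.

```python
from collections import Counter

def dacr_label(example, request_annotation, max_annotators=7):
    """Collect labels for one example, adding workers only on disagreement.

    Returns (label, labels), where label is None if no simple majority is
    reached within the annotator budget (those cases go to experts or some
    other aggregation step).
    """
    # Start with two independent annotations.
    labels = [request_annotation(example) for _ in range(2)]
    while True:
        top_label, top_count = Counter(labels).most_common(1)[0]
        if top_count * 2 > len(labels):
            # Strict majority among the labels collected so far: assign it.
            return top_label, labels
        if len(labels) >= max_annotators:
            # Budget exhausted without a majority: defer to experts.
            return None, labels
        # Otherwise, request one more annotation and re-check.
        labels.append(request_annotation(example))
```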
I really like this idea - it’s simple to apply and the intuition for why it should work is clear. Unfortunately, the experiments in the paper do not do the comparison I am most interested in: real data, with multiple annotation strategies applied. The simulated study supports the effectiveness of the approach, but relying on it means buying a range of assumptions about annotator behaviour (e.g. that all errors are equally likely and that all workers have the same pattern of behaviour). There is a large-scale experiment with real data in which the approach collects 3.74 labels per instance on average (with a minimum of 3), with only 5% of cases failing to reach a consensus. That seems very good!
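As a rough illustration of what those simulation assumptions look like, here is a toy comparison (reusing `dacr_label` from the sketch above) between a fixed three-way vote and the dynamic strategy. The worker accuracy, number of classes, and annotator budget are made-up values, not numbers from the paper.

```python
import random
from collections import Counter

rng = random.Random(0)
ACCURACY, N_CLASSES, N_ITEMS = 0.8, 5, 10_000   # illustrative values only

def simulated_worker(truth):
    """Uniform-error worker: correct with probability ACCURACY, otherwise
    picks a wrong class uniformly at random, identically for every worker."""
    if rng.random() < ACCURACY:
        return truth
    return rng.choice([c for c in range(N_CLASSES) if c != truth])

fixed_correct = dyn_correct = dyn_cost = 0
for _ in range(N_ITEMS):
    truth = rng.randrange(N_CLASSES)
    # Fixed strategy: always collect 3 labels and take the most common.
    fixed = Counter(simulated_worker(truth) for _ in range(3)).most_common(1)[0][0]
    fixed_correct += fixed == truth
    # Dynamic strategy: add workers only when the current labels disagree.
    label, labels = dacr_label(truth, simulated_worker, max_annotators=7)
    dyn_correct += label == truth   # unresolved (None) counts as incorrect here
    dyn_cost += len(labels)

print(f"fixed-3:  accuracy {fixed_correct / N_ITEMS:.3f}, 3.00 labels/item")
print(f"dynamic:  accuracy {dyn_correct / N_ITEMS:.3f}, {dyn_cost / N_ITEMS:.2f} labels/item")
```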
Citation
@inproceedings{sun-etal-2020-improving,
title = "Improving Human-Labeled Data through Dynamic Automatic Conflict Resolution",
title: "Improving Human-Labeled Data through Dynamic Automatic Conflict Resolution",
author = "Sun, David Q. and
Kotek, Hadas and
Klein, Christopher and
Gupta, Mayank and
Li, William and
Williams, Jason D.",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
month = "dec",
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2020.coling-main.316",
pages = "3547--3557",
abstract = "This paper develops and implements a scalable methodology for (a) estimating the noisiness of labels produced by a typical crowdsourcing semantic annotation task, and (b) reducing the resulting error of the labeling process by as much as 20-30{\%} in comparison to other common labeling strategies. Importantly, this new approach to the labeling process, which we name Dynamic Automatic Conflict Resolution (DACR), does not require a ground truth dataset and is instead based on inter-project annotation inconsistencies. This makes DACR not only more accurate but also available to a broad range of labeling tasks. In what follows we present results from a text classification task performed at scale for a commercial personal assistant, and evaluate the inherent ambiguity uncovered by this annotation strategy as compared to other common labeling strategies.",
}