Crowdsourcing and Data Annotation

Including non-experts in data creation and system functions

Crowdsourcing, collecting annotations of data from a distributed group of people online, is a major source of data for AI research. The original idea involved people doing it as volunteers (e.g. Folding@home) or as a byproduct of some other goal (e.g. reCAPTCHA), but most of the data collected in AI today is from paid workers.


Combining human and AI effort:

  • If some errors are acceptable, you can train a model and have it take over once its accuracy is high enough (using the data annotated so far to train and evaluate it). A variation on this approach from Kobayashi et al. (HComp 2021) is to treat all models as clustering methods and check the accuracy on individual clusters (where each cluster gets one label). The benefit of that approach is that the models can contribute before they are accurate everywhere.
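
As a rough illustration of the per-cluster check, here is a minimal sketch. It is not the method from the paper: the function name, the input layout, and the thresholds (`min_accuracy`, `min_checked`) are all assumptions made for the example. A cluster is handed over to the model only when enough of its items have human labels and those labels agree closely enough on a single label.

```python
from collections import Counter, defaultdict


def clusters_ready_for_automation(cluster_ids, human_labels,
                                  min_accuracy=0.9, min_checked=5):
    """Return {cluster_id: label} for clusters that could be auto-labelled.

    cluster_ids[i]  : cluster the model assigned to item i (hypothetical input)
    human_labels[i] : human annotation for item i, or None if not annotated yet
    The thresholds are illustrative assumptions, not values from any paper.
    """
    # Collect the human labels seen so far in each cluster.
    labels_by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, human_labels):
        if label is not None:
            labels_by_cluster[cluster].append(label)

    ready = {}
    for cluster, labels in labels_by_cluster.items():
        if len(labels) < min_checked:
            continue  # too few human checks to trust this cluster
        # If the whole cluster received its most common label, this fraction
        # is the accuracy measured against the human annotations.
        top_label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_accuracy:
            ready[cluster] = top_label
    return ready


cluster_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2]
human_labels = (["pos"] * 5
                + ["pos", "neg", "pos", "neg", "neg"]
                + ["neg", None])
ready = clusters_ready_for_automation(cluster_ids, human_labels)
print(ready)
```

In this toy run only cluster 0 qualifies: cluster 1 is too mixed and cluster 2 has too few checked items, so both would stay with human annotators.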

Encouraging diversity:

  • Tell workers they can’t use certain words in a response. This was used in my own work on dialogue (Larson et al., EMNLP 2020) and in Parrish et al.’s (EMNLP Findings 2021) work on NLI. Interestingly, both studies find that it leads to more challenging data, but Parrish et al. show no impact on out-of-domain performance of models trained on the data.

Communicating with workers:

  • It is generally considered good practice to be responsive to email and to give workers a way to provide feedback.
  • When running multiple rounds of annotation, provide feedback to workers and/or edit the instructions based on analysis of the data collected.
  • Parrish et al. (EMNLP Findings 2021) carefully compared conditions, including one where workers and linguists were in a Slack team. They found that using Slack did not improve results or worker satisfaction more than providing feedback/changes between rounds, and that it was a significant time cost.

Detecting bias:

Improving accuracy:

  • Post-processing methods: majority vote, various versions of EM (e.g. MACE, NAACL 2013).
  • Asking workers to write a justification for their judgement and consider an argument for another option can improve accuracy (Drapeau et al., HComp 2016). In particular, it can help with difficult examples where otherwise a majority of annotators may get the wrong answer. This works even better if workers are filtered to be the ones who write better arguments (measured with the Flesch-Kincaid readability test). Note that this is not the same as providing rationales that are intended to explain the answer to a machine (e.g., for MT in Zaidan et al. (NAACL 2007)). The synchronous version of the idea (Cicero, CHI 2019) adds direct discussion between workers, leading to further improvements. This benefit from peer communication has also been observed for other tasks (Tang et al., WWW 2019).
  • Framing questions differently: Bayesian Truth Serum asks people to give what they think the distribution of answers from the population would be. This provides quite a different signal.
  • Account for cases where there is no single right answer (Bowman and Dahl, NAACL 2021). This complicates annotation and evaluation, but is particularly critical in certain tasks.
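
Majority vote, the simplest of the post-processing methods above, can be sketched in a few lines. The function name and the input layout (a dict from item id to a non-empty list of worker labels) are assumptions for illustration. Ties return `None` rather than an arbitrary label, so those items can be sent for more annotation, and the agreement rate is kept as a rough confidence signal.

```python
from collections import Counter


def majority_vote(annotations):
    """Aggregate worker labels per item.

    annotations: dict mapping item id -> non-empty list of labels
                 (hypothetical layout chosen for this sketch).
    Returns a dict mapping item id -> (label or None on a tie, agreement rate).
    """
    results = {}
    for item, labels in annotations.items():
        counts = Counter(labels)
        top_label, top_count = counts.most_common(1)[0]
        # A tie means more than one label reached the top count.
        tied = [label for label, c in counts.items() if c == top_count]
        winner = top_label if len(tied) == 1 else None
        results[item] = (winner, top_count / len(labels))
    return results


votes = majority_vote({
    "q1": ["yes", "yes", "no"],
    "q2": ["yes", "no"],
})
print(votes)
```

Methods like MACE go further by estimating how reliable each worker is, which this sketch does not attempt.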


  • Bonus for sticking with a task. For example, Parrish et al. (EMNLP Findings 2021) increased pay 5c in each round of collection, gave a $20 bonus for workers at the end, and gave 10% bonuses after completing 10, 50, and 100 HITs in a single round.
  • Bonus for accuracy. For example, Parrish et al. (EMNLP Findings 2021) gave $5 to workers with 25 HITs in a round and a 95% validation rate in the task. They did note that many workers say they would have done more if the pay was higher for bonuses (e.g. 15+% of base pay).
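
An accuracy bonus of this kind comes down to a simple eligibility check. The sketch below uses the figures quoted above as defaults, but it reads "25 HITs" as "at least 25 HITs", which is an assumption. The function and parameter names are hypothetical too.

```python
def accuracy_bonus(hits_completed, hits_validated,
                   min_hits=25, min_rate=0.95, bonus=5.00):
    """Return the bonus in dollars for one worker in one round.

    Assumes the bonus applies when the worker did at least `min_hits` HITs
    and at least `min_rate` of them passed validation (an interpretation,
    not a rule confirmed by the source).
    """
    if hits_completed < min_hits:
        return 0.0
    validation_rate = hits_validated / hits_completed
    return bonus if validation_rate >= min_rate else 0.0


print(accuracy_bonus(30, 29))  # enough HITs, high validation rate
print(accuracy_bonus(30, 27))  # enough HITs, validation rate too low
print(accuracy_bonus(20, 20))  # perfect rate, but too few HITs
```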

Data sources:

  • Where possible, collect real text data rather than asking workers / participants to write examples. Many recent datasets have not effectively measured performance because the data has spurious patterns that models learn to exploit (Bowman and Dahl, NAACL 2021). This is true of both expert and non-expert written examples.
  • Adversarial generation and filtering (creating examples that models do poorly on) can make a benchmark more difficult, but risks moving away from a representative measurement of performance on the task of interest (Bowman and Dahl, NAACL 2021).
  • Whatever the source, there is a risk of bias in the data because it reflects the authors. Gaps can be addressed by asking experts to contribute (see Section 4.1 of Bowman and Dahl (NAACL 2021)).

User Interfaces

Hettiachchi et al. (CHI 2020) showed that voice assistants are another viable method for some tasks. Participants completed tasks entirely via audio interaction. Accuracy was lower than a web UI in lab conditions, but similar in a small field study.

Examples in NLP

Games with a Purpose


Companies fall into a few categories:

  • Crowd providers, which directly connect with workers.
  • Crowd enhancers, which provide a layer on top of the providers that adds features (e.g. active learning, nice templates, sophisticated workflows).
  • Annotation tools, which are designed to integrate with crowd providers (or your own internal workers).
  • Interfaces, which make it easier to use one of the crowd providers.

I decided not to break the first two categories apart because it was sometimes unclear whether a service was using its own crowd or providing a layer over another, but I have roughly sorted them. Where possible I have included pricing, though some services did not make it easy to find, and of course it is likely to change over time. Take note of the description in each case because the data collected varies substantially. Also note that many tasks can be structured as a classification task (e.g. “Is this coreference link correct?”), making many of these services more flexible than the “text classification” label below may seem (though structuring your task so costs don’t explode may require some thought).

  • Mechanical Turk, a small set of templates and the option to define a web UI that does whatever you want. Cost is a 20% fee on top of whatever you choose to pay workers (though note it jumps to 40% if you have more than 10 assignments for a HIT!).
  • Hybrid, seems to be any task you can define in text (including with links?). 40% fee, though there is a discount of some type for academic and non-profit institutions.
  • Prolific, seems to be that you just provide a link to a site for annotations (originally intended for survey research). 30% fee. Last year they had a research grant program.
  • Gorilla, designed for social science research, but could be used for any classification or free text task. Costs $1.19 / response, though note that you construct a questionnaire with a series of questions. There are also discounts available when collecting thousands of responses.
  • Scale, classification tasks for 8c / annotation. There is an academic program, but details are not available online (mentioned here).
  • Amazon SageMaker Ground Truth, text classification for 8c / label, decreasing after 50,000 annotations + a workflow fee of 1.2c / label.
  • iMerit, NER, classification, and sentiment tasks. When used on the Amazon Marketplace they are 5 dollars / hour (India based workers) or 25 (US based workers).
  • Appen
  • Hive
  • Samasource
  • Labelbox
  • Cloudfactory
  • DataLoop
  • SuperAnnotate
  • Datasaur
  • Allegion
  • Tasq
  • Superb AI
  • Quansight
  • HumanFirst
  • 1715Labs
  • LXT
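
To see how percentage fees affect a budget, here is a small worked example using the Mechanical Turk rates quoted above (20%, rising to 40% with more than 10 assignments per HIT). The function names are made up for this sketch, and since pricing is likely to change, the rates should be treated as placeholders.

```python
def mturk_fee_rate(assignments_per_hit):
    """Fee rate as described in the list above (may be out of date)."""
    return 0.40 if assignments_per_hit > 10 else 0.20


def mturk_total_cost(reward_per_assignment, num_hits, assignments_per_hit):
    """Total cost in dollars: worker pay plus the platform fee."""
    worker_pay = reward_per_assignment * num_hits * assignments_per_hit
    return worker_pay * (1 + mturk_fee_rate(assignments_per_hit))


# 1,000 HITs at $0.10 per assignment.
cost_10 = mturk_total_cost(0.10, 1000, 10)  # 10 assignments: 20% fee
cost_11 = mturk_total_cost(0.10, 1000, 11)  # 11 assignments: 40% fee
print(round(cost_10, 2), round(cost_11, 2))
```

With these numbers, one extra assignment per HIT raises worker pay by $100 but the total by $340, because the higher fee applies to all assignments.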

Mechanical Turk Integration Interfaces

These are interfaces for Mechanical Turk that provide an easier way to set up HITs without having to mess with Amazon’s APIs yourself. Both are free, but have slightly different features:

  • MTurk Manager, self-hosted, includes features for custom views of responses from workers.
  • Mephisto
  • LegionTools, self-hosted or not, includes key features for real-time systems. No longer maintained.
  • Crowd-Kit

Annotation User Interfaces

There are many annotation tools for NLP (e.g. my own, SLATE!). These annotation tools are designed to integrate with providers above to collect annotations:

  • Prodigy, span classification (e.g. NER), multiple choice questions (which can be used to do a wide range of tasks), and relations (see examples). Cost is whatever you pay a crowd provider + $390 for a lifetime license, or $490 for a company. One distinctive property is that you download and run it yourself, providing complete control over your data.
  • LightTAG, span classification and links. Cost is 1c / annotation + the cost from a crowd provider, but there is an academic license that makes it free.

Misc Notes

In 2019, Hal Daumé III mentioned on Twitter that Figure Eight, a paid crowdsourcing service, had removed their free licenses for academics, and asked for alternatives (note that Figure Eight has since been acquired by Appen). A bunch of people had suggestions, which I wanted to record for my own future reference. I wrote a blog post on my old site, the content of which is now here.