Controlled Crowdsourcing for High-Quality QA-SRL Annotation (Roit, et al., ACL 2020)

Semantic Role Labeling captures the content of a sentence by labeling the word sense of the verbs and identifying their arguments. Over the last few years, Luke Zettlemoyer’s Group has been exploring using question-answer pairs to represent this structure. This approach has the big advantage that it is easier to explain than the sense inventory and role types of more traditional SRL resources like PropBank. However, even with that advantage, crowdsourcing this annotation is difficult, as this paper shows.

I got three main things out of this paper:

1. It shifted my approach to crowdsourcing to consider workers more like traditional expert annotators.
2. It reinforced the idea that small shifts in crowd workflows can have a major impact on annotation quality.
3. QA-SRL can capture roles not covered by PropBank.

The work also provides a new dataset that will be useful for future work on this problem, and useful benchmarks of systems and measurements of data quality. Expanding on the three points above:

Crowd workers: The paper argues in favour of putting more time into training workers. Most of the work I’ve seen in NLP for crowdsourcing (including my own) focuses on modifying task design or using ML post-processing to improve results. Here, they run a large-scale qualification task and filter workers based on their performance, then train those workers by paying them to read a set of instructions (23 text-dense slides) and do two small annotation rounds with feedback after each one. This increases the upfront cost, but reduces the cost of annotation by reducing the need for multiple annotations of each item. The paper doesn’t provide quite enough detail to quantify the cost. We do know that to get to 11 workers they needed to train 30 workers at a cost of 2 hours each plus 30 minutes of researcher time each. If we assume 60 workers did the preliminary round, each taking 5 minutes, and that workers cost $12 / hour ($10 to the workers, $2 to Amazon), that’s almost $800 plus 15 hours of researcher time. For a large annotation effort, the savings during annotation will make that worth it (or, as in this case, it will lead to higher quality data). I am curious which aspect was more important though: filtering the pool of workers, or training workers.
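To make that back-of-the-envelope estimate easy to check, here is a minimal sketch of the arithmetic. The 30 trained workers, 2 hours of training each, and 30 minutes of researcher time each are the figures quoted above; the 60 preliminary-round workers, the 5 minutes each, and the $12 / hour rate are my assumptions, not numbers from the paper. The function and constant names are just for illustration.

```python
# Rough upfront cost of recruiting and training the worker pool.
# Figures quoted from the paper (as summarised above):
TRAINED_WORKERS = 30             # workers who went through training
TRAINING_HOURS_EACH = 2          # paid training time per worker
RESEARCHER_MINUTES_EACH = 30     # researcher time per trained worker

# Assumptions made in this post, NOT reported in the paper:
PRELIM_WORKERS = 60              # assumed size of the preliminary round
PRELIM_MINUTES_EACH = 5          # assumed time per preliminary task
HOURLY_RATE = 12.0               # assumed $10 to the worker + $2 to Amazon


def upfront_cost():
    """Return (worker cost in dollars, researcher time in hours)."""
    prelim_hours = PRELIM_WORKERS * PRELIM_MINUTES_EACH / 60
    training_hours = TRAINED_WORKERS * TRAINING_HOURS_EACH
    worker_dollars = (prelim_hours + training_hours) * HOURLY_RATE
    researcher_hours = TRAINED_WORKERS * RESEARCHER_MINUTES_EACH / 60
    return worker_dollars, researcher_hours


if __name__ == "__main__":
    dollars, hours = upfront_cost()
    print(f"Worker cost: ${dollars:.0f}, researcher time: {hours:.0f} hours")
```

Under these assumptions the worker cost is $780, with training (30 workers x 2 hours) accounting for $720 of it, so the estimate is much more sensitive to the training figures than to my guesses about the preliminary round.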

Workflow impact: In previous QA-SRL work, one worker wrote a question and its answers and two workers checked the question and independently added answers. Here, two workers independently write a question+answer and a third worker consolidates the annotations into a final annotation. The cost for a label is about the same (54c / predicate vs. 51c / predicate), but coverage is considerably higher. The design space for crowd workflows is huge and this is another example of how important it is to explore. It’s also possible that the changes in recruitment and training were more critical than the workflow shift, but the study didn’t include evaluation with only one or the other.

QA-SRL vs. PropBank: This may be less surprising to someone who works more on SRL, but they found their approach captured many implicit roles that PropBank does not. Specifically, of 100 annotated arguments that were not in PropBank, 68 were valid implicit arguments. I’m curious about what those implicit arguments are capturing. Maybe targeted re-annotation could be used to add them to PropBank (identifying relevant sentences by trace parsing).

Citation

@inproceedings{roit-etal-2020-controlled,
title = "Controlled Crowdsourcing for High-Quality {QA}-{SRL} Annotation",
author = "Roit, Paul  and
Klein, Ayal  and
Stepanov, Daniela  and
Mamou, Jonathan  and
Michael, Julian  and
Stanovsky, Gabriel  and
Zettlemoyer, Luke  and
Dagan, Ido",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = "jul",
year = "2020",