Most work in NLP uses datasets with a diverse set of speakers. In practice, everyone speaks and writes slightly differently, and our models would be better if they accounted for that. This has been the motivation for a line of work by [Charlie Welch](http://cfwelch.com/) that I've been a collaborator on (in [CICLing 2019](https://www.jkk.name/publication/cicling19personal), [IEEE Intelligent Systems 2019](https://www.jkk.name/publication/ieee19personal/), [COLING 2020](https://www.jkk.name/publication/coling20personal/), and this paper).
My [previous post](https://www.jkk.name/post/2020-09-25_crowdqasrl/) discussed work on crowdsourcing QA-SRL, a way of capturing semantic roles in text by asking workers to answer questions. This post covers a paper I contributed to that also considers crowdsourcing SRL, but collects the more traditional form of annotation used in resources like PropBank.
This paper explores two questions. First, what is the impact of a few key design decisions for word embeddings in language models? Second, given the answer to the first, how can we improve results when we have 50 million+ words of text but only one GPU for training?
Semantic Role Labeling captures the content of a sentence by labeling the word sense of the verbs and identifying their arguments. Over the last few years, [Luke Zettlemoyer's Group](https://www.cs.washington.edu/people/faculty/lsz/) has been exploring using question-answer pairs to represent this structure. This approach has the big advantage that it is easier to explain than the sense inventory and role types of more traditional SRL resources like PropBank. However, even with that advantage, crowdsourcing this annotation is difficult, as this paper shows.
Training models requires massive amounts of labeled data. We usually sample data iid from the target domain (e.g. newspapers), but it seems intuitive that this wastes effort labeling samples that are obvious or easy, and so not informative during training. Active Learning follows that intuition, labeling data incrementally and selecting the next example(s) to label based on what a model considers uncertain. Lots of work has shown this can be effective for that model, but if the labeled dataset is then used to train another model, will it also do well?
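For concreteness, here is a minimal sketch of pool-based active learning with uncertainty sampling; the dataset, model, and entropy scoring are generic stand-ins, not the setup studied in the paper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
labeled = list(range(10))  # small seed set of "annotated" examples
pool = [i for i in range(len(X)) if i not in labeled]

for step in range(20):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[pool])
    # Entropy of the predicted distribution as the uncertainty score.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    pick = pool[int(np.argmax(entropy))]
    labeled.append(pick)  # label the example the model is least sure about
    pool.remove(pick)
```

The question the paper asks is what happens when a `labeled` set built this way is used to train a different model.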
Natural language interfaces to computer systems are an exciting area with new workshops ([WNLI](https://aclanthology.org/volumes/2020.nli-1/) at ACL and [IntEx-SemPar](https://intex-sempar.github.io/) at EMNLP), a range of datasets (including my own work on [text-to-SQL](/publication/acl18sql/)), and many papers. Most work focuses on either (1) commands for simple APIs, (2) generating a database query, or (3) generating general purpose code. This paper considers an interesting application: interaction with data visualisation tools.
It is difficult to predict how well a model will work in the real world. Carefully curated test sets provide some signal, but only if they are large, representative, and have not been overfit to. This paper builds on two ideas for this problem: constructing challenge datasets and breaking performance down into subcategories. Together, these become a process of designing specific tests that measure how well a model handles certain types of variation in data.
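As a toy illustration of what one such test might look like (the `model.predict` interface and the negation template here are hypothetical, not from the paper):

```python
def negation_test(model, sentences):
    """Rate of failures when negating inputs to a sentiment classifier."""
    failures = 0
    for sent in sentences:
        original = model.predict(sent)
        negated = model.predict("It is not true that " + sent)
        if negated == original:  # we expected the label to flip
            failures += 1
    return failures / len(sentences)
```

Each test targets one specific type of variation, so a failure points directly at what the model mishandles.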
Games have been a focus of AI research for decades, from Samuel's checkers program in the 1950s, to Deep Blue playing Chess in the 1990s, and AlphaGo playing Go in the 2010s. All of those are two-player...
This post is about my own paper to appear at ACL later this month. What is interesting about this paper will depend on your research interests, so that’s how I’ve broken down this blog post.
A few key points first:
- Data and code are available on GitHub.
- The paper is also available.
Surprisingly, word2vec (skip-gram with negative sampling) produces vectors that point in a consistent direction, a pattern not seen in GloVe (but also one that doesn't seem to cause a problem for downstream tasks).
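One quick way to see this effect, assuming `vecs` is an `(n_words, dim)` NumPy array of embeddings you have loaded, is to measure the cosine between each vector and the mean vector:

```python
import numpy as np

def cosines_with_mean(vecs):
    mean = vecs.mean(axis=0)
    mean /= np.linalg.norm(mean)
    normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    return normed @ mean  # cosine of each word vector with the mean

# The observation above predicts these cosines are overwhelmingly
# positive for word2vec vectors, but centred near zero for GloVe:
# print((cosines_with_mean(vecs) > 0).mean())
```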
It seems intuitive that a coreference system could benefit from information about what nouns a verb selects for, but experiments that explicitly add a representation of this information to a neural system do not lead to gains, implying the model either already learns it or does not benefit from it.
To explain structured outputs in terms of which inputs have the most impact, treat it as identifying components in a bipartite graph whose weights are determined by perturbing the input and observing the impact on outputs.
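A rough sketch of that perturb-and-observe idea, with `predict` and `perturb` as hypothetical stand-ins for the model and the perturbation function:

```python
import numpy as np

def influence_matrix(tokens, predict, perturb, n_samples=50):
    """Edge weights of the bipartite input-output graph."""
    base = np.asarray(predict(tokens))  # scores for the original input
    weights = np.zeros((len(tokens), len(base)))
    for i in range(len(tokens)):
        for _ in range(n_samples):
            changed = perturb(tokens, i)  # e.g. replace token i
            out = np.asarray(predict(changed))
            weights[i] += np.abs(out - base)  # impact on each output
    return weights / n_samples
```

Components in this graph then group the inputs that jointly drive each part of the structured output.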
Neural abstractive summarisation can be dramatically improved with a beam search that favours output matching the source document, and further improved with attention based on PageRank, modified to avoid attending to the same sentence more than once.
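One simple way to bias beam search toward the source, shown purely as an illustration rather than the paper's actual scoring function, is an overlap bonus when ranking hypotheses:

```python
def rescore(hypothesis, model_logprob, source_tokens, alpha=0.5):
    """Add a bonus for hypothesis tokens that appear in the source."""
    source = set(source_tokens)
    overlap = sum(1 for tok in hypothesis if tok in source)
    return model_logprob + alpha * overlap / max(len(hypothesis), 1)
```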
By introducing a new loss that encourages sparsity, an auto-encoder can be used to go from existing word vectors to new ones that are sparser and more interpretable, though the impact on downstream tasks is mixed.
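The overall shape of such a model might look like this sketch, using an L1 penalty as the sparsity term (the paper's exact losses may differ):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.decode = nn.Linear(hidden, dim)

    def forward(self, x):
        code = self.encode(x)  # the new, sparser word vectors
        return code, self.decode(code)

def loss_fn(x, code, recon, l1_weight=1e-3):
    # Reconstruction keeps the information in the original vectors;
    # the L1 term pushes most dimensions of the code to exactly zero.
    return nn.functional.mse_loss(recon, x) + l1_weight * code.abs().mean()
```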
Parsing performance on the semantic structures of UCCA can be boosted by using a transition system that combines ideas from transition systems for discontinuous and constituent parsing, covering the full space of UCCA structures.
Vectors for words and entities can be learned by trying to model the text written about the entities. This leads to word vectors that score well on similarity tasks and entity vectors that produce excellent results on entity linking and question answering.
Using in-order traversal for transition-based parsing (put the non-terminal on the stack after its first child but before the rest) is consistently better than pre-order / top-down or post-order / bottom-up traversal.
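To make the traversal orders concrete, here is a toy in-order linearisation of a bracketed tree; pre-order would emit each label before all of its children, post-order after all of them:

```python
def in_order(node, out):
    """Emit a node's label after its first child, then the rest."""
    if isinstance(node, str):  # leaf: a word
        out.append(node)
        return
    label, children = node
    in_order(children[0], out)
    out.append(label)  # after the first child, before the rest
    for child in children[1:]:
        in_order(child, out)

tree = ("S", [("NP", ["the", "cat"]), ("VP", ["sat"])])
seq = []
in_order(tree, seq)
print(seq)  # ['the', 'NP', 'cat', 'S', 'sat', 'VP']
```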
During task-oriented dialogue generation, to take a table of information about entities into account, represent it as a graph, run message passing to get a vector representation of each entity, and apply attention over those vectors.
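A minimal sketch of one round of that message passing, with names and the update rule chosen for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn

class EntityGraphLayer(nn.Module):
    """One message-passing step over the entity graph."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(2 * dim, dim)

    def forward(self, entity_vecs, adj):
        # adj is a row-normalised adjacency matrix of the entity graph,
        # so this averages each entity's neighbours into a message.
        messages = adj @ entity_vecs
        combined = torch.cat([entity_vecs, messages], dim=-1)
        return torch.relu(self.linear(combined))  # updated entity vectors
```

The decoder then attends over the resulting entity vectors when generating each response token.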
Stack-LSTM models for dependency parsing can be adapted to constituency parsing by considering a spinal version of the parse and adding a single 'create-node' operation to the transition-based parsing scheme, giving an elegant algorithm and competitive results.
With some tweaks (domain-specific heuristics), coreference systems can be used to identify the set of characters in a novel, which in turn can be used to run large-scale tests of hypotheses from literary analysis.