Compositional Demographic Word Embeddings (Welch et al., EMNLP 2020)

Most work in NLP uses datasets with a diverse set of speakers. In practice, everyone speaks / writes slightly differently and our models would be better if they accounted for that. This has been the motivation for a line of work by [Charlie Welch]( that I've been a collaborator on (in [CICLing 2019](, [IEEE Intelligent Systems 2019](, [CoLing 2020](, and this paper).

A Novel Workflow for Accurately and Efficiently Crowdsourcing Predicate Senses and Argument Labels (Jiang, et al., Findings of EMNLP 2020)

My [previous post]( discussed work on crowdsourcing QA-SRL, a way of capturing semantic roles in text by asking workers to answer questions. This post covers a paper I contributed to that also considers crowdsourcing SRL, but collects the more traditional form of annotation used in resources like Propbank.

Improving Low Compute Language Modeling with In-Domain Embedding Initialisation (Welch, Mihalcea, and Kummerfeld, EMNLP 2020)

This paper explores two questions. First, what is the impact of a few key design decisions for word embeddings in language models? Second, based on the first answer, how can we improve results in the situation where we have 50 million+ words of text, but only 1 GPU for training?

Controlled Crowdsourcing for High-Quality QA-SRL Annotation (Roit, et al., ACL 2020)

Semantic Role Labeling captures the content of a sentence by labeling the word sense of the verbs and identifying their arguments. Over the last few years, [Luke Zettlemoyer's Group]( has been exploring using question-answer pairs to represent this structure. This approach has the big advantage that it is easier to explain than the sense inventory and role types of more traditional SRL resources like PropBank. However, even with that advantage, crowdsourcing this annotation is difficult, as this paper shows.

Practical Obstacles to Deploying Active Learning (Lowell, et al., EMNLP 2019)

Training models requires massive amounts of labeled data. We usually sample data iid from the target domain (e.g. newspapers), but it seems intuitive that this means we waste effort labeling samples that are obvious or easy and so not informative during training. Active Learning follows that intuition, labeling data incrementally and selecting the next example(s) to label based on what a model considers uncertain. Lots of work has shown this can be effective for that model, but if the labeled dataset is then used to train another model, will it also do well?
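
To make the selection step concrete, here is a minimal sketch of one common form of uncertainty sampling: rank unlabeled examples by the entropy of the model's predicted label distribution and pick the top few. The function names and the toy `pool` are my own illustration, and entropy is only one of several uncertainty measures, so this is not necessarily the strategy used in the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool_probs, k):
    """Return the ids of the k pool examples the model is least sure about.

    pool_probs maps a (hypothetical) example id to the model's predicted
    probabilities over labels for that example.
    """
    ranked = sorted(pool_probs, key=lambda ex: entropy(pool_probs[ex]), reverse=True)
    return ranked[:k]


# Toy pool: "a" is a coin flip for the model, "c" is nearly certain.
pool = {"a": [0.5, 0.5], "b": [0.9, 0.1], "c": [0.99, 0.01]}
to_label = select_most_uncertain(pool, k=2)
```

The chosen examples depend entirely on the model that produced `pool_probs`, which is why a dataset collected this way may suit a different model less well.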

ChartDialogs: Plotting from Natural Language Instructions (Shao and Nakashole, ACL 2020)

Natural language interfaces to computer systems are an exciting area with new workshops ([WNLI]( at ACL and [IntEx-SemPar]( at EMNLP), a range of datasets (including my own work on [text-to-SQL](/publication/acl18sql/)), and many papers. Most work focuses on either (1) commands for simple APIs, (2) generating a database query, or (3) generating general purpose code. This paper considers an interesting application: interaction with data visualisation tools.

Beyond Accuracy: Behavioral Testing of NLP Models with CheckList (Ribeiro, et al., ACL 2020 Best Paper)

It is difficult to predict how well a model will work in the real world. Carefully curated test sets provide some signal, but only if they are large, representative, and have not been overfit to. This paper builds on two ideas for this problem: constructing challenge datasets and breaking performance down into subcategories. Together, these become a process of designing specific tests that measure how well a model handles certain types of variation in data.

No-Press Diplomacy: Modeling Multi-Agent Gameplay (Paquette et al., 2019)

Games have been a focus of AI research for decades, from Samuel's checkers program in the 1950s, to Deep Blue playing Chess in the 1990s, and AlphaGo playing Go in the 2010s. All of those are two-player...

A Large-Scale Corpus for Conversation Disentanglement (Kummerfeld et al., 2019)

This post is about my own paper to appear at ACL later this month. What is interesting about this paper will depend on your research interests, so that’s how I’ve broken down this blog post. A few key points first: Data and code are available on Github. The paper is also available.

PreCo: A Large-scale Dataset in Preschool Vocabulary for Coreference Resolution (Chen et al., 2018)

The OntoNotes dataset, which is the focus of almost all coreference resolution research, had several compromises in its development (as is the case for any dataset). Some of these are discussed in...

Evaluating the Utility of Hand-crafted Features in Sequence Labelling (Minghao Wu et al., 2018)

A common argument in favour of neural networks is that they do not require 'feature engineering', manually defining functions that produce useful representations of the input data (e.g. a function...

Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples (Vidur Joshi et al., 2018)

Virtually all systems trained using data have trouble when applied to datasets that differ even slightly - even switching from Wall Street...

The Fine Line between Linguistic Generalization and Failure in Seq2Seq-Attention Models (Weber et al., 2018)

We know that training a neural network involves optimising over a non-convex space, but using standard evaluation methods we see that our models...

An Analysis of Neural Language Modeling at Multiple Scales (Merity et al., 2018)

Assigning a probability distribution over the next word or character in a sequence (language modeling) is a useful component of many systems...

Provenance for Natural Language Queries (Deutch et al., 2017)

Being able to query a database in natural language could help make data accessible ...

Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning (Tsvetkov et al., 2016)

Reordering training sentences for word vectors may impact their usefulness for downstream tasks.

Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations (Wieting et al., 2017)

With enough training data, the best vector representation of a sentence is to concatenate an average over word vectors and an average over character trigram vectors.
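
As a rough illustration of that representation, the sketch below averages word vectors and character trigram vectors separately and concatenates the two averages. The whitespace tokenisation, lowercasing, space padding, and the toy lookup tables are all assumptions for the example, not details taken from the paper.

```python
def mean_vector(keys, table, dim):
    """Average the vectors of the keys found in table (zeros if none are found)."""
    found = [table[key] for key in keys if key in table]
    if not found:
        return [0.0] * dim
    return [sum(column) / len(found) for column in zip(*found)]


def sentence_embedding(sentence, word_vectors, trigram_vectors, dim):
    """Concatenate the average word vector and the average character trigram vector."""
    text = sentence.lower()
    words = text.split()
    padded = " " + text + " "  # assumed padding so word edges form trigrams
    trigrams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    return mean_vector(words, word_vectors, dim) + mean_vector(trigrams, trigram_vectors, dim)


# Toy 2-dimensional tables; real tables would be learned from data.
word_vectors = {"hi": [1.0, 3.0]}
trigram_vectors = {" hi": [2.0, 0.0], "hi ": [0.0, 2.0]}
embedding = sentence_embedding("hi", word_vectors, trigram_vectors, dim=2)
```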

Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time (Huang et al., 2018)

For a more flexible dialogue system, use the crowd to propose and vote on responses, then introduce agents and a model for voting, gradually learning to replace the crowd.

A Simple Regularization-based Algorithm for Learning Cross-Domain Word Embeddings (Yang et al., 2017)

To leverage out-of-domain data, learn multiple sets of word vectors but with a loss term that encourages them to be similar.
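
A minimal sketch of what such a loss term could look like: a squared-distance penalty between the two vectors of every word that appears in both domains, added to the usual embedding objective. The single `strength` weight is a simplification I have assumed for the example; the paper's actual weighting may differ.

```python
def similarity_penalty(source_vectors, target_vectors, strength=1.0):
    """Penalty that grows as shared words drift apart across the two domains.

    Both arguments map words to vectors (lists of floats). Words that occur in
    only one domain contribute nothing. `strength` is a hypothetical knob.
    """
    penalty = 0.0
    for word, source in source_vectors.items():
        target = target_vectors.get(word)
        if target is None:
            continue
        penalty += sum((s - t) ** 2 for s, t in zip(source, target))
    return strength * penalty


# Toy example: only "x" is shared, so only "x" is penalised.
source = {"x": [1.0, 0.0]}
target = {"x": [0.0, 0.0], "y": [5.0, 5.0]}
loss_term = similarity_penalty(source, target, strength=0.5)
```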

The strange geometry of skip-gram with negative sampling (Mimno et al., 2017)

Surprisingly, word2vec (skip-gram with negative sampling) produces vectors that point in a consistent direction, a pattern not seen in GloVe (but also one that doesn't seem to cause a problem for downstream tasks).

Sequence Effects in Crowdsourced Annotations (Mathur et al., 2017)

Annotator sequence bias, where the label for one item affects the label for the next, occurs across a range of datasets. Avoid it by separately randomising the order of items for each annotator.
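
In code, the fix is just an independent shuffle per annotator. This is a small sketch with hypothetical item and annotator names; the seed is only there to make the assignment reproducible.

```python
import random


def per_annotator_orders(items, annotators, seed=0):
    """Give every annotator their own independently shuffled copy of the items."""
    rng = random.Random(seed)  # one seeded generator, so the result is reproducible
    orders = {}
    for annotator in annotators:
        order = list(items)  # copy, so the original list is left untouched
        rng.shuffle(order)
        orders[annotator] = order
    return orders


items = ["item-%d" % i for i in range(10)]
orders = per_annotator_orders(items, ["ann-1", "ann-2", "ann-3"], seed=13)
```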

High-risk learning: acquiring new word vectors from tiny data (Herbelot et al., 2017)

The simplest way to learn word vectors for rare words is to average their context. Tweaking word2vec to make greater use of the context may do slightly better, but it's unclear.
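
The averaging baseline is simple enough to sketch in a few lines. Here the new word's vector is the mean of the known vectors of the words around it; the choice to skip the target word and out-of-vocabulary words is my assumption, and the nonsense word and toy vectors are made up for the example.

```python
def average_context_vector(contexts, vectors, target):
    """Estimate a vector for `target` as the mean of its context word vectors.

    contexts is a list of tokenised sentences containing the target word and
    vectors maps known words to lists of floats.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    count = 0
    for sentence in contexts:
        for word in sentence:
            if word == target or word not in vectors:
                continue  # skip the target itself and unknown words
            for i, value in enumerate(vectors[word]):
                total[i] += value
            count += 1
    if count == 0:
        return total  # no usable context: fall back to the zero vector
    return [value / count for value in total]


vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
new_vector = average_context_vector([["a", "zorp", "b"]], vectors, target="zorp")
```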

Revisiting Selectional Preferences for Coreference Resolution (Heinzerling et al., 2017)

It seems intuitive that a coreference system could benefit from information about what nouns a verb selects for, but explicitly adding a representation of these selectional preferences to a neural system does not lead to gains, implying that the system is already learning them or that they are not useful.

Neural Semantic Parsing over Multiple Knowledge-bases (Herzig et al., 2017)

Training a single parser on multiple domains can improve performance, and sharing more parameters (encoder and decoder as opposed to just one) seems to help more.

A causal framework for explaining the predictions of black-box sequence-to-sequence models (Alvarez-Melis et al., 2017)

To explain structured outputs in terms of which inputs have most impact, treat it as identifying components in a bipartite graph where weights are determined by perturbing the input and observing the impact on outputs.

A Local Detection Approach for Named Entity Recognition and Mention Detection (Xu et al., 2017)

Effective NER can be achieved without sequence prediction using a feedforward network that labels every span with a fixed attention mechanism for getting contextual information.

Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme (Zheng et al., 2017)

By encoding the relation type and role of each word in tags, an LSTM can be applied to relation extraction with great success.

Abstractive Document Summarization with a Graph-Based Attentional Neural Model (Tan et al., 2017)

Neural abstractive summarisation can be dramatically improved with a beam search that favours output that matches the source document, and further improved with attention based on PageRank, with a modification to avoid attending to the same sentence more than once.

SPINE: SParse Interpretable Neural Embeddings (Subramanian et al., 2017)

By introducing a new loss that encourages sparsity, an auto-encoder can be used to go from existing word vectors to new ones that are sparser and more interpretable, though the impact on downstream tasks is mixed.

Ordinal Common-sense Inference (Zhang et al., 2017)

A new task and dataset of 39k examples for common sense reasoning, with a sentence generated for each prompt and a manual label indicating their relation, from very likely to impossible.

Error-repair Dependency Parsing for Ungrammatical Texts (Sakaguchi et al., 2017)

Grammatical error correction can be improved by jointly parsing the sentence being corrected.

Attention Strategies for Multi-Source Sequence-to-Sequence Learning (Libovicky et al., 2017)

To apply attention across multiple input sources, it is best to apply attention independently and then have a second phase of attention over the summary vectors for each source.

Robust Incremental Neural Semantic Graph Parsing (Buys et al., 2017)

A neural transition based parser with actions to create non-local links can perform well on Minimal Recursion Semantics parsing.

A Two-Stage Parsing Method for Text-Level Discourse Analysis (Wang et al., 2017)

Breaking discourse parsing into separate relation identification and labeling tasks can boost performance (by dealing with limited training data).

A Transition-Based Directed Acyclic Graph Parser for UCCA (Hershcovich et al., 2017)

Parsing performance on the semantic structures of UCCA can be boosted by using a transition system that combines ideas from discontinuous and constituent transition systems, covering the full space of structures.

Learning Distributed Representations of Texts and Entities from Knowledge Base (Yamada et al., 2017)

Vectors for words and entities can be learned by trying to model the text written about the entities. This leads to word vectors that score well on similarity tasks and entity vectors that produce excellent results on entity linking and question answering.

In-Order Transition-based Constituent Parsing (Liu et al., 2017)

Using in-order traversal for transition based parsing (put the non-terminal on the stack after its first child but before the rest) is consistently better than pre-order / top-down or post-order / bottom-up traversal.
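
To see what the in-order sequence looks like, here is a small sketch that converts a tree into parser actions: handle the first child, then introduce the non-terminal, then handle the remaining children, then close the constituent. The tuple-based tree encoding and the action names (`SHIFT`, `PROJECT`, `REDUCE`) are my own labels for illustration and may not match the paper's.

```python
def in_order_actions(tree):
    """Return the in-order action sequence for a constituency tree.

    A tree is either a word (str) or a (label, children) pair, where children
    is a non-empty list of trees.
    """
    if isinstance(tree, str):
        return ["SHIFT " + tree]
    label, children = tree
    actions = in_order_actions(children[0])  # first child comes before the label
    actions.append("PROJECT " + label)       # non-terminal goes on after it
    for child in children[1:]:               # then the rest of the children
        actions.extend(in_order_actions(child))
    actions.append("REDUCE")                 # close the constituent
    return actions


tree = ("S", [("NP", ["the", "cat"]), ("VP", ["sleeps"])])
actions = in_order_actions(tree)
```

A pre-order system would instead emit the label before any child, and a post-order system only after all of them.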

Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog (Kottur et al., 2017)

Constraining the language of a dialogue agent can improve performance by encouraging the use of more compositional language.

Leveraging Knowledge Bases in LSTMs for Improving Machine Reading (Yang et al., 2017)

Incorporating vector representations of entities from structured resources like NELL and WordNet into the output of an LSTM can improve entity and event extraction.

Frames: a corpus for adding memory to goal-oriented dialogue systems (El Asri et al., 2017)

A new dialogue dataset that has annotations of multiple plans (frames) and dialogue acts that indicate modifications to them.

Learning Symmetric Collaborative Dialogue Agents with Dynamic Knowledge Graph Embeddings (He et al., 2017)

During task-oriented dialogue generation, to take into consideration a table of information about entities, represent it as a graph, run message passing to get vector representations of each entity, and use attention.

Arc-Standard Spinal Parsing with Stack-LSTMs (Ballesteros et al., 2017)

Stack-LSTM models for dependency parsing can be adapted to constituency parsing by considering a spinal version of the parse and adding a single 'create-node' operation to the transition-based parsing scheme, giving an elegant algorithm and competitive results.

Mr. Bennet, his coachman, and the Archbishop walk into a bar but only one of them gets recognized: On The Difficulty of Detecting Characters in Literary Texts (Vala et al., 2015)

With some tweaks (domain-specific heuristics), coreference systems can be used to identify the set of characters in a novel, which in turn can be used to do large scale tests of hypotheses from literary analysis.

Joint Modeling of Content and Discourse Relations in Dialogues (Qin et al., 2017)

Identifying the key phrases in a dialogue at the same time as identifying the type of relations between pairs of utterances leads to substantial improvements on both tasks.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Shazeer et al., 2017)

Neural networks for language can be scaled up by using a form of selective computation, where a noisy single-layer model chooses among feed-forward networks (experts) that sit between LSTM layers.

Real-time Captioning by Groups of Non-experts (Lasecki et al., 2012)

By dividing a task up among multiple annotators carefully we can achieve high-quality real-time annotation of data, in this case transcription of audio.

Dynamic Evaluation of Neural Sequence Models (Krause et al., 2017)

Language model perplexity can be reduced by maintaining a separate model that is updated during application of the model, allowing adaptation to short-term patterns in the text.

Searching for Activation Functions (Ramachandran et al., 2017)

Switching from the ReLU non-linearity, $\text{max}(0, x)$, to Swish, $x \cdot \text{sigmoid}(x)$, consistently improves performance in neural networks across both vision and machine translation tasks.
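
For reference, the two functions are easy to write down directly. This is a plain scalar sketch of the formulas above (deep learning libraries provide their own vectorised versions); the branch in `sigmoid` is only there to avoid overflow for large negative inputs.

```python
import math


def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow in exp for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x):
    """Swish: x * sigmoid(x)."""
    return x * sigmoid(x)


# Compare the two on a few inputs.
samples = [-5.0, -1.0, 0.0, 1.0, 5.0]
comparison = [(x, relu(x), swish(x)) for x in samples]
```

Unlike ReLU, Swish is smooth and dips slightly below zero for negative inputs before approaching zero, while for large positive inputs the two are nearly identical.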

Multimodal Word Distributions (Athiwaratkun and Wilson, 2017)

By switching from representing words as points in a vector space to representing them as multiple Gaussian regions, we can get a better model, scoring higher on multiple word similarity metrics than a range of techniques.

Shift-Reduce Constituency Parsing with Dynamic Programming and POS Tag Lattice (Mi and Huang, 2015)

An implementation of the idea of treating transition-based parsing as a dynamic program, leading to fast parsing and strong performance.