Semantic Parsing with Semi-Supervised Sequential Autoencoders (Kocisky et al., EMNLP 2016)

By training a parser and language generation system together, we can use semantic parses without associated sentences for training (the sentence becomes a latent representation that is being learnt).

A causal framework for explaining the predictions of black-box sequence-to-sequence models (Alvarez-Melis et al., 2017)

To explain structured outputs in terms of which inputs have most impact, treat it as identifying components in a bipartite graph where weights are determined by perturbing the input and observing the impact on outputs.

A Factored Neural Network Model for Characterizing Online Discussions in Vector Space (Cheng et al., EMNLP 2017)

A proposal for how to improve vector representations of sentences by using attention over (1) fixed vectors, and (2) a context sentence.

A Local Detection Approach for Named Entity Recognition and Mention Detection (Xu et al., 2017)

Effective NER can be achieved without sequence prediction using a feedforward network that labels every span with a fixed attention mechanism for getting contextual information.

A Novel Workflow for Accurately and Efficiently Crowdsourcing Predicate Senses and Argument Labels (Jiang, et al., Findings of EMNLP 2020)

My previous post discussed work on crowdsourcing QA-SRL, a way of capturing semantic roles in text by asking workers to answer questions. This post covers a paper I contributed to that also considers crowdsourcing SRL, but collects the more traditional form of annotation used in resources like PropBank.

A Simple Regularization-based Algorithm for Learning Cross-Domain Word Embeddings (Yang et al., 2017)

To leverage out-of-domain data, learn multiple sets of word vectors but with a loss term that encourages them to be similar.
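
As a rough sketch of what such a loss term could look like, the snippet below computes a squared-distance penalty between two sets of vectors over their shared vocabulary. The function name, the uniform `strength` weight, and the toy vectors are all assumptions for illustration; the paper's actual weighting scheme may differ.

```python
from typing import Dict, List

Vector = List[float]


def similarity_penalty(
    vecs_a: Dict[str, Vector],
    vecs_b: Dict[str, Vector],
    strength: float = 1.0,
) -> float:
    """Sum of squared distances between the two vectors of each shared word.

    Assumption: every shared word gets the same weight (`strength`).
    """
    shared = vecs_a.keys() & vecs_b.keys()
    return strength * sum(
        sum((x - y) ** 2 for x, y in zip(vecs_a[word], vecs_b[word]))
        for word in shared
    )


# Hypothetical in-domain and out-of-domain vectors.
in_domain = {"bank": [1.0, 0.0], "river": [0.0, 1.0]}
out_of_domain = {"bank": [0.0, 0.0], "loan": [1.0, 1.0]}

# Only "bank" is shared, so only its two vectors are pulled together.
penalty = similarity_penalty(in_domain, out_of_domain, strength=0.5)
```

During training this penalty would be added to the usual embedding losses for each domain, so words that appear in both are encouraged, but not forced, to have similar vectors.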

A Transition-Based Directed Acyclic Graph Parser for UCCA (Hershcovich et al., 2017)

Parsing performance on the semantic structures of UCCA can be boosted by using a transition system that combines ideas from discontinuous and constituent transition systems, covering the full space of structures.

A Two-Stage Parsing Method for Text-Level Discourse Analysis (Wang et al., 2017)

Breaking discourse parsing into separate relation identification and labeling tasks can boost performance (by dealing with limited training data).

Abstractive Document Summarization with a Graph-Based Attentional Neural Model (Tan et al., 2017)

Neural abstractive summarisation can be dramatically improved with a beam search that favours output that matches the source document, and further improved with attention based on PageRank, with a modification to avoid attending to the same sentence more than once.

Addressing the Data Sparsity Issue in Neural AMR Parsing (Peng et al., EACL 2017)

Another paper looking at the issue of output symbol sparsity in AMR parsing, though here the solution is to group the consistent but rare symbols (rather than graph fragments like the paper last week). This drastically increases neural model performance, but does not reach the level of hybrid systems.

An Analysis of Neural Language Modeling at Multiple Scales (Merity et al., 2018)

Assigning a probability distribution over the next word or character in a sequence (language modeling) is a useful component of many systems…

Approaching Conferences

Am I getting the most out of my time at conferences? This post was a way for me to think through that question and come up with strategies.

Arc-Standard Spinal Parsing with Stack-LSTMs (Ballesteros et al., 2017)

Stack-LSTM models for dependency parsing can be adapted to constituency parsing by considering a spinal version of the parse and adding a single ‘create-node’ operation to the transition-based parsing scheme, giving an elegant algorithm and competitive results.

Attention Is All You Need (Vaswani et al., ArXiv 2017)

To get context-dependence without recurrence we can use a network that applies attention multiple times over both input and output (as it is generated).

Attention Strategies for Multi-Source Sequence-to-Sequence Learning (Libovicky et al., 2017)

To apply attention across multiple input sources, it is best to apply attention independently and then have a second phase of attention over the summary vectors for each source.
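
The two-phase idea can be sketched in a few lines. This is a toy version that assumes plain dot-product scoring and reuses the same query in both phases; the function names are made up for illustration and a real model would use learned projections.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, states: List[Vector]) -> Vector:
    """Weighted average of `states`, weighted by dot product with `query`."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    return [sum(w * state[i] for w, state in zip(weights, states)) for i in range(dim)]


def hierarchical_attend(query: Vector, sources: List[List[Vector]]) -> Vector:
    # Phase 1: attend over each source independently to get one summary each.
    summaries = [attend(query, states) for states in sources]
    # Phase 2: attend over the per-source summaries.
    return attend(query, summaries)


# Two hypothetical sources, e.g. a text encoder and an image encoder.
text_states = [[1.0, 0.0], [3.0, 0.0]]
image_states = [[0.0, 2.0]]
context = hierarchical_attend([0.0, 0.0], [text_states, image_states])
```

With an all-zero query every score is zero, so both phases reduce to plain averaging, which makes the example easy to check by hand.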

Beyond Accuracy: Behavioral Testing of NLP Models with CheckList (Ribeiro, et al., ACL 2020 Best Paper)

It is difficult to predict how well a model will work in the real world. Carefully curated test sets provide some signal, but only if they are large, representative, and have not been overfit to. This paper builds on two ideas for this problem: constructing challenge datasets and breaking performance down into subcategories. Together, these become a process of designing specific tests that measure how well a model handles certain types of variation in data.

ChartDialogs: Plotting from Natural Language Instructions (Shao and Nakashole, ACL 2020)

Natural language interfaces to computer systems are an exciting area with new workshops (WNLI at ACL and IntEx-SemPar at EMNLP), a range of datasets (including my own work on text-to-SQL), and many papers. Most work focuses on either (1) commands for simple APIs, (2) generating a database query, or (3) generating general purpose code. This paper considers an interesting application: interaction with data visualisation tools.

Compositional Demographic Word Embeddings (Welch et al., EMNLP 2020)

Most work in NLP uses datasets with a diverse set of speakers. In practice, everyone speaks / writes slightly differently and our models would be better if they accounted for that. This has been the motivation for a line of work by Charlie Welch that I’ve been a collaborator on (in CICLing 2019, IEEE Intelligent Systems 2019, COLING 2020, and this paper).

Controlled Crowdsourcing for High-Quality QA-SRL Annotation (Roit, et al., ACL 2020)

Semantic Role Labeling captures the content of a sentence by labeling the word sense of the verbs and identifying their arguments. Over the last few years, Luke Zettlemoyer’s Group has been exploring using question-answer pairs to represent this structure. This approach has the big advantage that it is easier to explain than the sense inventory and role types of more traditional SRL resources like PropBank. However, even with that advantage, crowdsourcing this annotation is difficult, as this paper shows.

DeftNN: Addressing Bottlenecks for DNN Execution on GPUs via Synapse Vector Elimination and Near-compute Data Fission (Hill et al., MICRO 2017)

GPU processing can be sped up ~2x by removing low impact rows from weight matrices, and switching to a specialised floating point representation.

Detecting annotation noise in automatically labelled data (Rehbein and Ruppenhofer, ACL 2017)

When labeling a dataset automatically there are going to be errors, but we can use a generative model and active learning to direct checking effort towards the examples most likely to be incorrect.

Dynamic Evaluation of Neural Sequence Models (Krause et al., 2017)

Language model perplexity can be reduced by maintaining a separate model that is updated during application of the model, allowing adaptation to short-term patterns in the text.

Dynamic Programming Algorithms for Transition-Based Dependency Parsers (Kuhlmann et al., ACL 2011)

Transition based algorithms can be transformed into dynamic programs by defining sequences of actions that correspond to the same overall transformation.

Error-repair Dependency Parsing for Ungrammatical Texts (Sakaguchi et al., 2017)

Grammatical error correction can be improved by jointly parsing the sentence being corrected.

Evaluating the Utility of Hand-crafted Features in Sequence Labelling (Minghao Wu et al., 2018)

A common argument in favour of neural networks is that they do not require ‘feature engineering’, manually defining functions that produce useful representations of the input data (e.g. a function…

Evorus: A Crowd-powered Conversational Assistant Built to Automate Itself Over Time (Huang et al., 2018)

For a more flexible dialogue system, use the crowd to propose and vote on responses, then introduce agents and a model for voting, gradually learning to replace the crowd.

Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples (Vidur Joshi et al., 2018)

Virtually all systems trained using data have trouble when applied to datasets that differ even slightly - even switching from Wall Street…

Filling the Blanks (hint: plural noun) for Mad Libs Humor (Hossain et al., EMNLP 2017)

A new task and associated evaluation method plus system for Mad Libs - filling in missing words in a story in a funny way. While the system does poorly, using it as a first pass with human rerankers produces funnier stories than people alone.

Frames: a corpus for adding memory to goal-oriented dialogue systems (El Asri et al., 2017)

A new dialogue dataset that has annotations of multiple plans (frames) and dialogue acts that indicate modifications to them.

Getting the Most out of AMR Parsing (Wang and Xue, EMNLP 2017)

Two ideas for improving AMR parsing: (1) take graph distance into consideration when generating alignments, (2) during parsing, for concept generation, generate individual concepts in some cases and frequently occurring subgraphs in other cases.

Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation (Johnson et al., TACL 2017)

A translation model trained on sentence pairs from a mixture of languages can do very well across all of the languages, and even generalise somewhat to new pairs of the languages. That’s useful as one model can do the work of $O(n^2)$ models, and with a fraction of the parameters.

High-risk learning: acquiring new word vectors from tiny data (Herbelot et al., 2017)

The simplest way to learn word vectors for rare words is to average their context. Tweaking word2vec to make greater use of the context may do slightly better, but it’s unclear.

Improving Low Compute Language Modeling with In-Domain Embedding Initialisation (Welch, Mihalcea, and Kummerfeld, EMNLP 2020)

This paper explores two questions. First, what is the impact of a few key design decisions for word embeddings in language models? Second, based on the first answer, how can we improve results in the situation where we have 50 million+ words of text, but only 1 GPU for training?

In-Order Transition-based Constituent Parsing (Liu et al., 2017)

Using in-order traversal for transition based parsing (put the non-terminal on the stack after its first child but before the rest) is consistently better than pre-order / top-down or post-order / bottom-up traversal.
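
To make the difference in traversal order concrete, here is a small sketch that turns a toy tree into an action sequence under in-order and pre-order schemes. The tree encoding and the action names (`SHIFT`, `PJ-X`, `NT-X`, `REDUCE`) are illustrative choices and may not match the paper's exact inventory.

```python
from typing import List, Tuple, Union

# A tree is either a word (str) or a (label, children) pair.
Tree = Union[str, Tuple[str, list]]


def in_order_actions(tree: Tree) -> List[str]:
    """First child, then the non-terminal, then the remaining children."""
    if isinstance(tree, str):
        return ["SHIFT"]
    label, children = tree
    actions = in_order_actions(children[0])
    actions.append(f"PJ-{label}")  # project the non-terminal over the first child
    for child in children[1:]:
        actions.extend(in_order_actions(child))
    actions.append("REDUCE")
    return actions


def pre_order_actions(tree: Tree) -> List[str]:
    """Top-down: the non-terminal comes before all of its children."""
    if isinstance(tree, str):
        return ["SHIFT"]
    label, children = tree
    actions = [f"NT-{label}"]
    for child in children:
        actions.extend(pre_order_actions(child))
    actions.append("REDUCE")
    return actions


tree = ("S", [("NP", ["The", "cat"]), ("VP", ["sleeps"])])
print(in_order_actions(tree))
# ['SHIFT', 'PJ-NP', 'SHIFT', 'REDUCE', 'PJ-S', 'SHIFT', 'PJ-VP', 'REDUCE', 'REDUCE']
print(pre_order_actions(tree))
# ['NT-S', 'NT-NP', 'SHIFT', 'SHIFT', 'REDUCE', 'NT-VP', 'SHIFT', 'REDUCE', 'REDUCE']
```

In the in-order sequence each non-terminal is predicted after the parser has seen the first child, so it has some bottom-up evidence, while still being available as top-down guidance for the remaining children.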

Iterative Feature Mining for Constraint-Based Data Collection to Increase Data Diversity and Model Robustness (Larson, et al., EMNLP 2020)

When we crowdsource data for tasks like SRL and sentiment analysis we only care about accuracy. For tasks where workers write new content, such as paraphrasing and creating questions, we also care about data diversity. If our data is not diverse then models trained on it will not be robust in the real world. The core idea of this paper is to encourage creativity by constraining workers.

Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme (Zheng et al., 2017)

By encoding the relation type and role of each word in tags, an LSTM can be applied to relation extraction with great success.

Joint Modeling of Content and Discourse Relations in Dialogues (Qin et al., 2017)

Identifying the key phrases in a dialogue at the same time as identifying the type of relations between pairs of utterances leads to substantial improvements on both tasks.

Learning Distributed Representations of Texts and Entities from Knowledge Base (Yamada et al., 2017)

Vectors for words and entities can be learned by trying to model the text written about the entities. This leads to word vectors that score well on similarity tasks and entity vectors that produce excellent results on entity linking and question answering.

Learning Symmetric Collaborative Dialogue Agents with Dynamic Knowledge Graph Embeddings (He et al., 2017)

During task-oriented dialogue generation, to take into consideration a table of information about entities, represent it as a graph, run message passing to get vector representations of each entity, and use attention.

Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning (Tsvetkov et al., 2016)

Reordering training sentences for word vectors may impact their usefulness for downstream tasks.

Learning Whom to Trust with MACE (Hovy et al., NAACL 2013)

By using a generative model to explain worker annotations, we can more effectively predict the correct label, and which workers are spamming.

Leveraging Knowledge Bases in LSTMs for Improving Machine Reading (Yang et al., 2017)

Incorporating vector representations of entities from structured resources like NELL and WordNet into the output of an LSTM can improve entity and event extraction.

Mastering the game of Go without human knowledge (Silver et al., Nature 2017)

By using a single core model to build a game state representation, which then gives input to both state evaluation and move choice, DeepMind are able to apply reinforcement learning with self-play with no supervision and achieve state-of-the-art performance.

Mr. Bennet, his coachman, and the Archbishop walk into a bar but only one of them gets recognized: On The Difficulty of Detecting Characters in Literary Texts (Vala et al., 2015)

With some tweaks (domain-specific heuristics), coreference systems can be used to identify the set of characters in a novel, which in turn can be used to do large scale tests of hypotheses from literary analysis.

Multimodal Word Distributions (Athiwaratkun and Wilson, 2017)

By switching from representing words as points in a vector space to multiple Gaussian regions we can get a better model, scoring higher on multiple word similarity metrics than a range of techniques.

Named Entity Disambiguation for Noisy Text (Eshel et al., CoNLL 2017)

The WikiLinks dataset of text mentions that are hyperlinked to Wikipedia articles provides a nice testing space for named entity disambiguation, and a neural network using attention over local context does reasonably well.

Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog (Kottur et al., 2017)

Constraining the language of a dialogue agent can improve performance by encouraging the use of more compositional language.

Neural Semantic Parsing over Multiple Knowledge-bases (Herzig et al., 2017)

Training a single parser on multiple domains can improve performance, and sharing more parameters (encoder and decoder as opposed to just one) seems to help more.

No-Press Diplomacy: Modeling Multi-Agent Gameplay (Paquette et al., 2019)

Games have been a focus of AI research for decades, from Samuel’s checkers program in the 1950s, to Deep Blue playing Chess in the 1990s, and AlphaGo playing Go in the 2010s. All of those are two-player…

Ordinal Common-sense Inference (Zhang et al., 2017)

A new task and dataset of 39k examples for common sense reasoning, with a sentence generated for each prompt and a manual label indicating their relation, from very likely to impossible.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Shazeer et al., 2017)

Neural networks for language can be scaled up by using a form of selective computation, where a noisy single-layer model chooses among feed-forward networks (experts) that sit between LSTM layers.
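
A minimal sketch of the selective computation, assuming the gate's clean logits and noise logits have already been computed from the input by two linear maps. The function names and the scalar-valued toy experts are hypothetical, chosen only to keep the example short.

```python
import math
import random
from typing import Callable, List


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def noisy_top_k_gate(
    clean_logits: List[float],
    noise_logits: List[float],
    k: int,
    rng: random.Random,
) -> List[float]:
    """Add scaled Gaussian noise, keep the top k logits, softmax over those."""
    noisy = [
        c + rng.gauss(0.0, 1.0) * softplus(n)
        for c, n in zip(clean_logits, noise_logits)
    ]
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    peak = max(noisy[i] for i in top)
    exps = {i: math.exp(noisy[i] - peak) for i in top}
    total = sum(exps.values())
    # Experts outside the top k get a weight of exactly zero.
    return [exps[i] / total if i in exps else 0.0 for i in range(len(noisy))]


def mixture_output(
    gates: List[float], experts: List[Callable[[float], float]], x: float
) -> float:
    # Experts with zero weight are never evaluated, which is where the savings come from.
    return sum(g * expert(x) for g, expert in zip(gates, experts) if g > 0.0)


# Four toy experts, each just scaling its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gates = noisy_top_k_gate([0.0, 1.0, 2.0, 3.0], [-5.0] * 4, k=2, rng=random.Random(0))
y = mixture_output(gates, experts, x=1.0)
```

Only `k` experts run per input, so the number of experts (and parameters) can grow without a matching growth in computation.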

Practical Obstacles to Deploying Active Learning (Lowell, et al., EMNLP 2019)

Training models requires massive amounts of labeled data. We usually sample data iid from the target domain (e.g. newspapers), but it seems intuitive that this means we waste effort labeling samples that are obvious or easy and so not informative during training. Active Learning follows that intuition, labeling data incrementally, selecting the next example(s) to label based on what a model considers uncertain. Lots of work has shown this can be effective for that model, but if the labeled dataset is then used to train another model will it also do well?

PreCo: A Large-scale Dataset in Preschool Vocabulary for Coreference Resolution (Chen et al., 2018)

The OntoNotes dataset, which is the focus of almost all coreference resolution research, had several compromises in its development (as is the case for any dataset). Some of these are discussed in…

Provenance for Natural Language Queries (Deutch et al., 2017)

Being able to query a database in natural language could help make data accessible …

Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations (Wieting et al., 2017)

With enough training data, the best vector representation of a sentence is to concatenate an average over word vectors and an average over character trigram vectors.
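
As a sketch of that representation, the code below averages word vectors and character trigram vectors separately and concatenates the two averages. The padding with spaces, the skipping of unknown items, and the tiny lookup tables are assumptions made to keep the example self-contained.

```python
from typing import Dict, List

Vector = List[float]


def average(vectors: List[Vector], dim: int) -> Vector:
    # Fall back to a zero vector when nothing was found in the lookup table.
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def char_trigrams(sentence: str) -> List[str]:
    # Assumption: pad with one space on each side so edge characters are covered.
    padded = f" {sentence} "
    return [padded[i : i + 3] for i in range(len(padded) - 2)]


def embed(
    sentence: str,
    word_vecs: Dict[str, Vector],
    trigram_vecs: Dict[str, Vector],
    dim: int,
) -> Vector:
    words = [word_vecs[w] for w in sentence.split() if w in word_vecs]
    trigrams = [trigram_vecs[t] for t in char_trigrams(sentence) if t in trigram_vecs]
    # List concatenation: the result has 2 * dim entries.
    return average(words, dim) + average(trigrams, dim)


# Hypothetical two-dimensional lookup tables.
word_vecs = {"a": [1.0, 0.0], "b": [3.0, 2.0]}
trigram_vecs = {" a ": [2.0, 4.0], " b ": [4.0, 0.0]}
print(embed("a b", word_vecs, trigram_vecs, dim=2))  # [2.0, 1.0, 3.0, 2.0]
```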

Real-time Captioning by Groups of Non-experts (Lasecki et al., 2012)

By dividing a task up among multiple annotators carefully we can achieve high-quality real-time annotation of data, in this case transcription of audio.

Revisiting Selectional Preferences for Coreference Resolution (Heinzerling et al., 2017)

It seems intuitive that a coreference system could benefit from information about which nouns a verb selects for, but explicitly adding a representation of selectional preferences to a neural system does not lead to gains, implying that the system is already learning them or that they are not useful.

Robust Incremental Neural Semantic Graph Parsing (Buys et al., 2017)

A neural transition based parser with actions to create non-local links can perform well on Minimal Recursion Semantics parsing.

Search-based Neural Structured Learning for Sequential Question Answering (Iyyer et al., ACL 2017)

A new dataset containing multi-turn questions about a table, and a model that generates a kind of logical form, but scores actions based on the content of the table.

Searching for Activation Functions (Ramachandran et al., 2017)

Switching from the ReLU non-linearity, $\text{max}(0, x)$, to Swish, $x \cdot \text{sigmoid}(x)$, consistently improves performance in neural networks across both vision and machine translation tasks.
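
For reference, both functions are only a few lines of plain Python. This sketch works on single floats rather than tensors, and the helper names are just for illustration.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    # Split on sign to avoid overflow in exp for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float) -> float:
    return x * sigmoid(x)


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  swish={swish(x):.4f}")
```

Unlike ReLU, Swish is smooth and takes small negative values for negative inputs instead of being exactly zero.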

Sequence Effects in Crowdsourced Annotations (Mathur et al., 2017)

Annotator sequence bias, where the label for one item affects the label for the next, occurs across a range of datasets. Avoid it by separately randomising the order of items for each annotator.
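
Per-annotator randomisation takes very little code. In this sketch the function name and the item and annotator identifiers are made up, and a fixed seed is used only so that the assignment is reproducible.

```python
import random
from typing import Dict, List, Sequence


def per_annotator_orders(
    items: Sequence[str], annotators: Sequence[str], seed: int = 0
) -> Dict[str, List[str]]:
    """Give every annotator their own independently shuffled copy of the items."""
    rng = random.Random(seed)
    orders = {}
    for annotator in annotators:
        order = list(items)  # copy, so the shared list is never shuffled in place
        rng.shuffle(order)
        orders[annotator] = order
    return orders


items = ["item-1", "item-2", "item-3", "item-4", "item-5"]
orders = per_annotator_orders(items, ["ann-a", "ann-b", "ann-c"])
for annotator, order in orders.items():
    print(annotator, order)
```

Because each annotator sees a different order, any effect of the previous item on the current label is spread across different item pairs rather than consistently biasing the same items.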

Shift-Reduce Constituency Parsing with Dynamic Programming and POS Tag Lattice (Mi and Huang, 2015)

An implementation of the idea of casting transition-based parsing as a dynamic program, leading to fast parsing and strong performance.

SPINE: SParse Interpretable Neural Embeddings (Subramanian et al., 2017)

By introducing a new loss that encourages sparsity, an auto-encoder can be used to go from existing word vectors to new ones that are sparser and more interpretable, though the impact on downstream tasks is mixed.

The Fine Line between Linguistic Generalization and Failure in Seq2Seq-Attention Models (Weber et al., 2018)

We know that training a neural network involves optimising over a non-convex space, but using standard evaluation methods we see that our models…

The strange geometry of skip-gram with negative sampling (Mimno et al., 2017)

Surprisingly, word2vec (skip-gram with negative sampling) produces vectors that point in a consistent direction, a pattern not seen in GloVe (but also one that doesn’t seem to cause a problem for downstream tasks).