By training a parser and language generation system together, we can use semantic parses without associated sentences for training (the sentence becomes a latent representation that is being learnt).
To explain structured outputs in terms of which inputs have the most impact, treat it as identifying components in a bipartite graph whose weights are determined by perturbing the input and observing the impact on the outputs.
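A minimal sketch of the perturbation idea, assuming a hypothetical `predict` function that maps a token list to a vector of output-component scores; the masking token and sampling scheme here are illustrative, not the paper's:

```python
import numpy as np

def perturbation_weights(predict, tokens, n_samples=200, mask="<unk>", seed=0):
    # Weight matrix between input tokens and output components, built by
    # masking random subsets of the input and measuring how much each
    # output component changes.  `predict` (a stand-in) maps a token list
    # to a vector with one score per output component.
    rng = np.random.default_rng(seed)
    base = np.asarray(predict(tokens), dtype=float)
    weights = np.zeros((len(tokens), len(base)))
    counts = np.zeros(len(tokens))
    for _ in range(n_samples):
        keep = rng.random(len(tokens)) > 0.5
        perturbed = [t if k else mask for t, k in zip(tokens, keep)]
        change = np.abs(base - np.asarray(predict(perturbed), dtype=float))
        for i, kept in enumerate(keep):
            if not kept:  # credit the change to the tokens that were masked
                weights[i] += change
                counts[i] += 1
    return weights / np.maximum(counts, 1)[:, None]
```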
A proposal for how to improve vector representations of sentences by using attention over (1) fixed vectors, and (2) a context sentence.
Effective NER can be achieved without sequence prediction, using a feedforward network that labels every span, with a fixed attention mechanism providing contextual information.
My previous post discussed work on crowdsourcing QA-SRL, a way of capturing semantic roles in text by asking workers to answer questions. This post covers a paper I contributed to that also considers crowdsourcing SRL, but collects the more traditional form of annotation used in resources like Propbank.
To leverage out-of-domain data, learn multiple sets of word vectors but with a loss term that encourages them to be similar.
Parsing performance on the semantic structures of UCCA can be boosted by using a transition system that combines ideas from discontinuous and constituent transition systems, covering the full space of structures.
Breaking discourse parsing into separate relation identification and labeling tasks can boost performance (by dealing with limited training data).
Neural abstractive summarisation can be dramatically improved with a beam search that favours output that matches the source document, and further improved with attention based on PageRank, with a modification to avoid attending to the same sentence more than once.
Another paper looking at the issue of output symbol sparsity in AMR parsing, though here the solution is to group the consistent but rare symbols (rather than graph fragments like the paper last week). This drastically increases neural model performance, but does not reach the level of hybrid systems.
Assigning a probability distribution over the next word or character in a sequence (language modeling) is a useful component of many systems…
Am I getting the most out of my time at conferences? This post was a way for me to think through that question and come up with strategies.
Stack-LSTM models for dependency parsing can be adapted to constituency parsing by considering a spinal version of the parse and adding a single ‘create-node’ operation to the transition-based parsing scheme, giving an elegant algorithm and competitive results.
To get context-dependence without recurrence we can use a network that applies attention multiple times over both input and output (as it is generated).
To apply attention across multiple input sources, it is best to apply attention independently and then have a second phase of attention over the summary vectors for each source.
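A minimal numpy sketch of the two-phase idea, using plain dot-product attention with a shared query vector (the paper's scoring functions and parameterisation will differ):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys):
    # Dot-product attention: a weighted sum of `keys` (shape [n, dim])
    # using their similarity to `query` (shape [dim]).
    return softmax(keys @ query) @ keys

def hierarchical_attention(query, sources):
    # Phase 1: attend within each source independently to get a summary
    # vector per source.  Phase 2: attend over those summary vectors.
    summaries = np.stack([attend(query, keys) for keys in sources])
    return attend(query, summaries)
```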
It is difficult to predict how well a model will work in the real world. Carefully curated test sets provide some signal, but only if they are large, representative, and have not been overfit to. This paper builds on two ideas for this problem: constructing challenge datasets and breaking performance down into subcategories. Together, these become a process of designing specific tests that measure how well a model handles certain types of variation in data.
Natural language interfaces to computer systems are an exciting area with new workshops (WNLI at ACL and IntEx-SemPar at EMNLP), a range of datasets (including my own work on text-to-SQL), and many papers. Most work focuses on either (1) commands for simple APIs, (2) generating a database query, or (3) generating general purpose code. This paper considers an interesting application: interaction with data visualisation tools.
Most work in NLP uses datasets with a diverse set of speakers. In practice, everyone speaks / writes slightly differently, and our models would be better if they accounted for that. This has been the motivation for a line of work by Charlie Welch that I’ve been a collaborator on (in CICLing 2019, IEEE Intelligent Systems 2019, CoLing 2020, and this paper).
Semantic Role Labeling captures the content of a sentence by labeling the word sense of the verbs and identifying their arguments. Over the last few years, Luke Zettlemoyer’s Group has been exploring using question-answer pairs to represent this structure. This approach has the big advantage that it is easier to explain than the sense inventory and role types of more traditional SRL resources like PropBank. However, even with that advantage, crowdsourcing this annotation is difficult, as this paper shows.
GPU processing can be sped up ~2x by removing low impact rows from weight matrices, and switching to a specialised floating point representation.
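A rough sketch of the row-pruning half of that idea, using L2 norm as a stand-in for the paper's notion of "low impact" (the specialised floating point representation is not shown):

```python
import numpy as np

def prune_rows(W, keep_fraction=0.5):
    # Keep only the rows of a weight matrix with the largest L2 norms,
    # returning the pruned matrix and the kept row indices so that any
    # layer consuming this one can be sliced to match.
    norms = np.linalg.norm(W, axis=1)
    k = max(1, int(round(len(norms) * keep_fraction)))
    keep = np.sort(np.argsort(norms)[-k:])
    return W[keep], keep
```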
When labeling a dataset automatically there are going to be errors, but we can use a generative model and active learning to direct checking effort towards the examples most likely to be incorrect.
Language model perplexity can be reduced by maintaining a separate model that is updated during application of the model, allowing adaptation to short-term patterns in the text.
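For intuition, here is a simple unigram cache mixed with a static model's probability; the paper's adaptation mechanism is more sophisticated, so treat this as an illustration of the cache idea only:

```python
from collections import Counter

def cache_mixture_prob(word, base_prob, history, cache_size=100, lam=0.1):
    # Mix the static model's probability for `word` with a cache
    # distribution over the most recent words, so words repeated in the
    # current text get a boost.
    recent = Counter(history[-cache_size:])
    total = sum(recent.values())
    cache_prob = recent[word] / total if total else 0.0
    return (1.0 - lam) * base_prob + lam * cache_prob
```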
Transition based algorithms can be transformed into dynamic programs by defining sequences of actions that correspond to the same overall transformation.
Grammatical error correction can be improved by jointly parsing the sentence being corrected.
A common argument in favour of neural networks is that they do not require ‘feature engineering’, manually defining functions that produce useful representations of the input data (e.g. a function…
For a more flexible dialogue system, use the crowd to propose and vote on responses, then introduce agents and a model for voting, gradually learning to replace the crowd.
Virtually all systems trained using data have trouble when applied to datasets that differ even slightly - even switching from Wall Street…
A new task and associated evaluation method plus system for Mad Libs - filling in missing words in a story in a funny way. While the system does poorly, using it as a first pass with human rerankers produces funnier stories than people alone.
A new dialogue dataset that has annotations of multiple plans (frames) and dialogue acts that indicate modifications to them.
Two ideas for improving AMR parsing: (1) take graph distance into consideration when generating alignments, (2) during parsing, for concept generation, generate individual concepts in some cases and frequently occurring subgraphs in other cases.
A translation model trained on sentence pairs from a mixture of languages can do very well across all of the languages, and even generalise somewhat to new pairs of the languages. That’s useful as one model can do the work of $O(n^2)$ models, and with a fraction of the parameters.
The simplest way to learn word vectors for rare words is to average their context. Tweaking word2vec to make greater use of the context may do slightly better, but it’s unclear.
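A minimal sketch of that baseline, averaging the embeddings of words in a window around each occurrence of the rare word (`word_vecs` is a hypothetical token-to-vector lookup):

```python
import numpy as np

def rare_word_vector(occurrences, word_vecs, dim, window=5):
    # Average the embeddings of words within a window of each occurrence
    # of the rare word.  `occurrences` is a list of (tokenised sentence,
    # index of the rare word) pairs.
    context = []
    for sent, idx in occurrences:
        lo, hi = max(0, idx - window), min(len(sent), idx + window + 1)
        context.extend(w for i, w in enumerate(sent[lo:hi], start=lo) if i != idx)
    vecs = [word_vecs[w] for w in context if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```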
This paper explores two questions. First, what is the impact of a few key design decisions for word embeddings in language models? Second, based on the first answer, how can we improve results in the situation where we have 50 million+ words of text, but only 1 GPU for training?
Using in-order traversal for transition based parsing (put the non-terminal on the stack after its first child but before the rest) is consistently better than pre-order / top-down or post-order / bottom-up traversal.
When we crowdsource data for tasks like SRL and sentiment analysis we only care about accuracy. For tasks where workers write new content, such as paraphrasing and creating questions, we also care about data diversity. If our data is not diverse then models trained on it will not be robust in the real world. The core idea of this paper is to encourage creativity by constraining workers.
By encoding the relation type and role of each word in tags, an LSTM can be applied to relation extraction with great success.
Identifying the key phrases in a dialogue at the same time as identifying the type of relations between pairs of utterances leads to substantial improvements on both tasks.
Vectors for words and entities can be learned by trying to model the text written about the entities. This leads to word vectors that score well on similarity tasks and entity vectors that produce excellent results on entity linking and question answering.
During task-oriented dialogue generation, to take into consideration a table of information about entities, represent it as a graph, run message passing to get vector representations of each entity, and use attention.
Reordering training sentences for word vectors may impact their usefulness for downstream tasks.
By using a generative model to explain worker annotations, we can more effectively predict the correct label, and which workers are spamming.
Incorporating vector representations of entities from structured resources like NELL and WordNet into the output of an LSTM can improve entity and event extraction.
By using a single core model to build a game state representation, which then provides input to both state evaluation and move choice, DeepMind are able to apply reinforcement learning with self-play and no supervision, achieving state-of-the-art performance.
With some tweaks (domain-specific heuristics), coreference systems can be used to identify the set of characters in a novel, which in turn can be used to do large scale tests of hypotheses from literary analysis.
By switching from representing words as points in a vector space to multiple Gaussian regions, we can get a better model, scoring higher on multiple word similarity metrics than a range of techniques.
The WikiLinks dataset of text mentions that are hyperlinked to Wikipedia articles provides a nice testing space for named entity disambiguation, and a neural network using attention over local context does reasonably well.
Constraining the language of a dialogue agent can improve performance by encouraging the use of more compositional language.
Training a single parser on multiple domains can improve performance, and sharing more parameters (encoder and decoder as opposed to just one) seems to help more.
Games have been a focus of AI research for decades, from Samuel’s checkers program in the 1950s, to Deep Blue playing Chess in the 1990s, and AlphaGo playing Go in the 2010s. All of those are two-player…
A new task and dataset of 39k examples for common sense reasoning, with a sentence generated for each prompt and a manual label indicating their relation, from very likely to impossible.
Neural networks for language can be scaled up by using a form of selective computation, where a noisy single-layer model chooses among feed-forward networks (experts) that sit between LSTM layers.
Training models requires massive amounts of labeled data. We usually sample data i.i.d. from the target domain (e.g. newspapers), but it seems intuitive that this means we waste effort labeling samples that are obvious or easy, and so not informative during training. Active Learning follows that intuition, labeling data incrementally, selecting the next example(s) to label based on what a model considers uncertain. Lots of work has shown this can be effective for that model, but if the labeled dataset is then used to train another model will it also do well?
The OntoNotes dataset, which is the focus of almost all coreference resolution research, had several compromises in its development (as is the case for any dataset). Some of these are discussed in…
Being able to query a database in natural language could help make data accessible …
With enough training data, the best vector representation of a sentence is to concatenate an average over word vectors and an average over character trigram vectors.
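A rough sketch of that representation, assuming hypothetical `word_vecs` and `trigram_vecs` lookup tables trained elsewhere:

```python
import numpy as np

def char_trigrams(word):
    # Character trigrams with boundary markers, e.g. "cat" -> "#ca", "cat", "at#"
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def sentence_vector(sentence, word_vecs, trigram_vecs, dim):
    # Concatenate an average over word vectors with an average over
    # character trigram vectors (assumes a non-empty sentence).
    tokens = sentence.lower().split()
    word_avg = np.mean([word_vecs.get(t, np.zeros(dim)) for t in tokens], axis=0)
    trigrams = [g for t in tokens for g in char_trigrams(t)]
    tri_avg = np.mean([trigram_vecs.get(g, np.zeros(dim)) for g in trigrams], axis=0)
    return np.concatenate([word_avg, tri_avg])
```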
By dividing a task up among multiple annotators carefully we can achieve high-quality real-time annotation of data, in this case transcription of audio.
It seems intuitive that a coreference system could benefit from information about what nouns a verb selects for, but experiments explicitly adding a representation of this to a neural system do not lead to gains, implying either that the model is already learning it or that it is not useful.
A neural transition based parser with actions to create non-local links can perform well on Minimal Recursion Semantics parsing.
A new dataset containing multi-turn questions about a table, and a model that generates a kind of logical form, but scores actions based on the content of the table.
Switching from the ReLU non-linearity, $\text{max}(0, x)$, to Swish, $x \cdot \text{sigmoid}(x)$, consistently improves performance in neural networks across both vision and machine translation tasks.
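For reference, the two activations side by side (a direct transcription of the formulas above):

```python
import numpy as np

def relu(x):
    # max(0, x)
    return np.maximum(0.0, x)

def swish(x):
    # x * sigmoid(x), equivalently x / (1 + exp(-x))
    return x / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
print(np.round(relu(x), 3))
print(np.round(swish(x), 3))
```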
Annotator sequence bias, where the label for one item affects the label for the next, occurs across a range of datasets. Avoid it by separately randomising the order of items for each annotator.
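A minimal sketch of the fix, giving each annotator the same items in an independently shuffled order:

```python
import random

def per_annotator_orders(items, annotators, seed=0):
    # Give each annotator the same items in an independently shuffled
    # order, so sequence effects are not shared across annotators.
    rng = random.Random(seed)
    orders = {}
    for annotator in annotators:
        order = list(items)
        rng.shuffle(order)
        orders[annotator] = order
    return orders

orders = per_annotator_orders(["item-1", "item-2", "item-3"], ["ann-A", "ann-B"])
```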
An implementation of the transition-parsing as a dynamic program idea, leading to fast parsing and strong performance.
By introducing a new loss that encourages sparsity, an auto-encoder can be used to go from existing word vectors to new ones that are sparser and more interpretable, though the impact on downstream tasks is mixed.
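A rough illustration of reconstruction plus a sparsity penalty; the function names and the exact penalty here are placeholders, not the paper's formulation:

```python
import numpy as np

def sparse_autoencoder_loss(X, W_enc, W_dec, l1_weight=0.5):
    # Reconstruction error plus an L1 penalty on the hidden codes, which
    # pushes most dimensions of each code towards zero.
    H = np.maximum(0.0, X @ W_enc)   # non-negative hidden codes
    X_hat = H @ W_dec                # reconstructed word vectors
    reconstruction = np.mean((X - X_hat) ** 2)
    sparsity = np.mean(np.abs(H))
    return reconstruction + l1_weight * sparsity
```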
We know that training a neural network involves optimising over a non-convex space, but using standard evaluation methods we see that our models…
Surprisingly, word2vec (negative skipgram sampling) produces vectors that point in a consistent direction, a pattern not seen in GloVe (but also one that doesn’t seem to cause a problem for downstream tasks).
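One simple way to check for a shared direction, which is an illustrative measurement of mine rather than the paper's exact analysis:

```python
import numpy as np

def mean_direction_similarity(vectors):
    # Average cosine similarity between each vector and the mean direction;
    # values well above zero suggest the vectors share a common direction.
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    mean = V.mean(axis=0)
    mean /= np.linalg.norm(mean)
    return float((V @ mean).mean())
```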