Slot-filling models in task-driven dialog systems rely on carefully annotated training data. However, annotations by crowd workers are often inconsistent or contain errors. Simple solutions like manually checking annotations or having multiple workers label each sample are expensive and waste effort on samples that are correct. If we can identify inconsistencies, we can focus effort where it is needed. Toward this end, we define six inconsistency types in slot-filling annotations. Using three new noisy crowd-annotated datasets, we show that a wide range of inconsistencies occur and can impact system performance if not addressed. We then introduce automatic methods of identifying inconsistencies. Experiments on our new datasets show that these methods effectively reveal inconsistencies in data, though there is further scope for improvement.
In this paper, we introduce personalized word embeddings, and examine their value for language modeling. We compare the performance of our proposed prediction model when using personalized versus generic word representations, and study how these representations can be leveraged for improved performance. We provide insight into what types of words can be more accurately predicted when building personalized models. Our results show that a subset of words belonging to specific psycholinguistic categories tends to vary more in their representations across users and that combining generic and personalized word embeddings yields the best performance, with a 4.7% relative reduction in perplexity. Additionally, we show that a language model using personalized word embeddings can be effectively used for authorship attribution.
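The abstract above does not specify how the generic and personalized embeddings are combined; a minimal sketch of one plausible reading (simple vector concatenation, with made-up table names and dimensions) is shown below as an illustration rather than the paper's actual method:

```python
import numpy as np

# Hypothetical embedding tables; names, words, and dimensions are assumptions
# made purely for illustration.
generic_emb = {"coffee": np.random.randn(100), "meeting": np.random.randn(100)}
user_emb = {"alice": {"coffee": np.random.randn(50)}}

def combined_vector(word: str, user: str) -> np.ndarray:
    """Concatenate a generic vector with a user-specific vector.

    Falls back to zeros for the personalized half when this user has no
    representation for the word (e.g., a rare word or a new user).
    """
    generic = generic_emb[word]
    personal = user_emb.get(user, {}).get(word, np.zeros(50))
    return np.concatenate([generic, personal])

print(combined_vector("coffee", "alice").shape)   # (150,)
print(combined_vector("meeting", "alice").shape)  # (150,)
```

The combined vector would then feed the input layer of the language model in place of the generic embedding alone.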
Diverse data is crucial for training robust models, but crowdsourced text often lacks diversity as workers tend to write simple variations from prompts. We propose a general approach for guiding workers to write more diverse text by iteratively constraining their writing. We show how prior workflows are special cases of our approach, and present a way to apply the approach to dialog tasks such as intent classification and slot-filling. Using our method, we create more challenging versions of test sets from prior dialog datasets and find dramatic performance drops for standard models. Finally, we show that our approach is complementary to recent work on improving data diversity, and training on data collected with our approach leads to more robust models.
Blog Post Abstract Code DOI Supplementary Material ArXiv
Many NLP applications, such as biomedical data and technical support, have 10-100 million tokens of in-domain data and limited computational resources for learning from it. How should we train a language model in this scenario? Most language modeling research considers either a small dataset with a closed vocabulary (like the standard 1 million token Penn Treebank), or the whole web with byte-pair encoding. We show that for our target setting in English, initialising and freezing input embeddings using in-domain data can improve language model performance by providing a useful representation of rare words, and this pattern holds across several different domains. In the process, we show that the standard convention of tying input and output embeddings does not improve perplexity when initializing with embeddings trained on in-domain data.
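A minimal PyTorch sketch of the setup described above: the input embedding matrix is initialized from vectors trained on in-domain data and frozen, while the output embedding is left untied and trainable. The shapes, names, and random stand-in vectors are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 10_000, 300, 512

# Stand-in for embeddings trained on the 10-100M token in-domain corpus
# (in practice these would be loaded from a word2vec/GloVe-style file).
in_domain_vectors = torch.randn(vocab_size, emb_dim)

class SmallLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Input embeddings: initialized from in-domain vectors and frozen.
        self.embed = nn.Embedding.from_pretrained(in_domain_vectors, freeze=True)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        # Output projection: separate and trainable, i.e. not tied to the input.
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        hidden, _ = self.lstm(self.embed(tokens))
        return self.out(hidden)

model = SmallLM()
tokens = torch.randint(0, vocab_size, (4, 20))  # (batch, sequence length)
print(model(tokens).shape)                      # torch.Size([4, 20, 10000])
```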
Word embeddings are usually derived from corpora containing text from many individuals, thus leading to general purpose representations rather than individually personalized representations. While personalized embeddings can be useful to improve language model performance and other language processing tasks, they can only be computed for people with a large amount of longitudinal data, which is not the case for new users. We propose a new form of personalized word embeddings that use demographic-specific word representations derived compositionally from full or partial demographic information for a user (i.e., gender, age, location, religion). We show that the resulting demographic-aware word representations outperform generic word representations on two tasks for English: language modeling and word associations. We further explore the trade-off between the number of available attributes and their relative effectiveness and discuss the ethical implications of using them.
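One way to read "derived compositionally from full or partial demographic information" is to combine attribute-specific word vectors for whichever attributes are known about a user. The sketch below uses simple averaging over hypothetical attribute-specific tables; it illustrates the idea only and is not the paper's composition function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical attribute-specific tables: one vector per word for each
# (attribute, value) pair, e.g., trained on text from users who share it.
attr_emb = {
    ("gender", "female"): {"home": rng.normal(size=100)},
    ("age", "18-25"): {"home": rng.normal(size=100)},
    ("location", "uk"): {"home": rng.normal(size=100)},
}

def demographic_vector(word, attributes):
    """Average the attribute-specific vectors available for this user.

    Unknown attributes are simply skipped, so partial demographic
    information still produces a representation.
    """
    parts = [table[word]
             for key, table in attr_emb.items()
             if key in set(attributes.items()) and word in table]
    if not parts:
        raise KeyError(f"no demographic vectors available for {word!r}")
    return np.mean(parts, axis=0)

# A user with only two known attributes still gets a composed vector.
vec = demographic_vector("home", {"gender": "female", "location": "uk"})
print(vec.shape)  # (100,)
```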
Resources for Semantic Role Labeling (SRL) are typically annotated by experts at great expense. Prior attempts to develop crowdsourcing methods have either had low accuracy or required substantial expert annotation. We propose a new multi-stage crowd workflow that substantially reduces expert involvement without sacrificing accuracy. In particular, we introduce a unique filter stage based on the key observation that crowd workers are able to almost perfectly filter out incorrect options for labels. Our three-stage workflow produces annotations with 95% accuracy for predicate labels and 93% for argument labels, which is comparable to expert agreement. Compared to prior work on crowdsourcing for SRL, we decrease expert effort by 4x, from 56% to 14% of cases. Our approach enables more scalable annotation of SRL, and could enable annotation of NLP tasks that have previously been considered too complex to effectively crowdsource.
Extensive work has argued in favour of paying crowd workers a wage that is at least equivalent to the U.S. federal minimum wage. Meanwhile, research on collecting high quality annotations (e.g. for Natural Language Processing) suggests using qualifications such as a minimum number of previously completed tasks. If most requesters who pay fairly use this kind of minimum qualification, then workers may be forced to complete a substantial amount of poorly paid work for other requesters before they can earn a fair wage. This paper (1) explores current conventions for the threshold, (2) discusses possible alternatives, and (3) presents a study of correlation between approved work and work quality.
Abstract Dataset DOI Citations (14)
This paper provides detailed information about the seventh Dialog System Technology Challenge (DSTC7) and its three tracks, which explore the problem of building robust and accurate end-to-end dialog systems. In more detail, DSTC7 focuses on developing and exploring end-to-end technologies for the following three pragmatic challenges: (1) sentence selection for multiple domains, (2) generation of informational responses grounded in external knowledge, and (3) audio visual scene-aware dialog to allow conversations with users about objects and events around them. This paper summarizes the overall setup and results of DSTC7, including detailed descriptions of the different tracks, the provided datasets and annotations, and an overview of the submitted systems and their final results. For Track 1, LSTM-based models performed best across both datasets, allowing teams to effectively handle task variants where no correct answer was present or where multiple paraphrases were included. For Track 2, the best results came from RNN-based architectures augmented to incorporate facts through two encoders (a dialog encoder and a fact encoder), combined with attention mechanisms and a pointer-generator approach. Finally, for Track 3, the best model used hierarchical attention mechanisms to combine the text and vision information, obtaining a 22% better human rating score than the baseline LSTM system. More than 220 participants registered and about 40 teams participated in the final challenge. 32 scientific papers reporting the systems submitted to DSTC7, and 3 general technical papers on dialog technologies, were presented during the one-day wrap-up workshop at AAAI-19. During the workshop, we reviewed the state-of-the-art systems, shared novel approaches to the DSTC7 tasks, and discussed future directions for the challenge (DSTC8).
RAP-Net: Recurrent Attention Pooling Networks for Dialogue Response Selection
Chao-Wei Huang, Ting-Rui Chiang, Shang-Yu Su, Yun-Nung Chen, CSL, 2020
Learning Multi-Level Information for Dialogue Response Selection by Highway Recurrent Transformer
Ting-Rui Chiang, Chao-Wei Huang, Shang-Yu Su, Yun-Nung Chen, CSL, 2020
Knowledge-Grounded Response Generation with Deep Attentional Latent-Variable Model
Hao-Tong Ye, Kai-Lin Lo, Shang-Yu Su, Yun-Nung Chen, CSL, 2020
Cluster-based beam search for pointer-generator chatbot grounded by knowledge
Yik-Cheung Tam, CSL, 2020
Investigating Topics, Audio Representations and Attention for Multimodal Scene-Aware Dialog
Shachi H Kumar, Eda Okur, Saurav Sahay, Jonathan Huang, Lama Nachman, CSL, 2020
Treating Dialogue Quality Evaluation as an Anomaly Detection Problem
Rostislav Nedelchev, Ricardo Usbeck, Jens Lehmann, LREC, 2020
Hierarchical multimodal attention for end-to-end audio-visual scene-aware dialogue response generation
Hung T. Le, Doyen Sahoo, Nancy F. Chen, Steven C. H. Hoi, CSL, 2020
Counterfactual Augmentation for Training Next Response Selection,
Seungtaek Choi, Myeongho Jeong, Jinyoung Yeo, Seung-won Hwang, EMNLP Workshop on Simple and Efficient Natural Language Processing, 2020
Conditional Response Augmentation for Dialogue using Knowledge Distillation
Myeongho Jeong, Seungtaek Choi, Hojae Han, Kyungho Kim, Seung-won Hwang, Interspeech, 2020
A Quantum-Like multimodal network framework for modeling interaction dynamics in multiparty conversational sentiment analysis
Yazhou Zhang, Dawei Song, Xiang Li, Peng Zhang, Panpan Wang, Lu Rong, Guangliang Yu, Bo Wang, Information Fusion, 2020
Language Model Transformers as Evaluators for Open-domain Dialogues
Rostislav Nedelchev, Jens Lehmann, Ricardo Usbeck, CoLing, 2020
Spoken Medical Prescription Acquisition Through a Dialogue System on Smartphone: Perspective of a Healthcare Software Company
Ali Can Kocabiyikoglu, François Portet, Jean-Marc Babouchkine, Hervé Blanchon, LREC, 2020
MEISD: A Multimodal Multi-Label Emotion, Intensity and Sentiment Dialogue Dataset for Emotion Recognition and Sentiment Analysis in Conversations
Mauajama Firdaus, Hardik Chauhan, Asif Ekbal, Pushpak Bhattacharyya, CoLing, 2020
Abstract Dataset Citations (1)
Real-world conversation often involves more than two participants and complex conversation structures, but most datasets for dialogue research simplify the task to make it more tractable. This shared task built on prior tasks for goal-oriented dialogue, moving towards more realistic settings. Seventeen teams participated in the primary task, predicting the next utterance in a multi-party conversation, and several teams participated in supplementary tasks. All of the datasets have been publicly released, providing a standard benchmark for future work in this space.
Online Conversation Disentanglement with Pointer Networks
Tao Yu, Shafiq Joty, EMNLP, 2020
Detecting rhetoric that manipulates readers’ emotions requires distinguishing intrinsically emotional content (IEC; e.g., a parent losing a child) from emotionally manipulative language (EML; e.g., using fear-inducing language to spread anti-vaccine propaganda). However, this remains an open classification challenge for both automatic and crowdsourcing approaches. Machine Learning approaches only work in narrow domains where labeled training data is available, and non-expert annotators tend to conflate IEC with EML. We introduce an approach, anchor comparison, that leverages workers’ ability to identify and remove instances of EML in text to create a paraphrased ‘anchor text’, which is then used as a comparison point to classify EML in the original content. We evaluate our approach with a dataset of news-style text snippets and show that precision and recall can be tuned for system builders’ needs. Our contribution is a crowdsourcing approach that enables non-expert disentanglement of social references from content.
Word embeddings are powerful representations that form the foundation of many natural language processing architectures and tasks, both in English and in other languages. To gain further insight into word embeddings in multiple languages, we explore their stability, defined as the overlap between the nearest neighbors of a word in different embedding spaces. We discuss linguistic properties that are related to stability, drawing out insights about how morphological and other features relate to stability. This has implications for the usage of embeddings, particularly in research that uses embeddings to study language trends.
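Computing stability as defined above reduces to comparing nearest-neighbor sets across spaces; a minimal sketch, assuming two word-to-vector dictionaries over a shared vocabulary (the toy data, cosine similarity, and k value are illustrative choices):

```python
import numpy as np

def nearest_neighbors(word, space, k):
    """Return the k words whose vectors are most cosine-similar to `word`."""
    query = space[word]
    sims = {
        other: float(np.dot(query, vec) /
                     (np.linalg.norm(query) * np.linalg.norm(vec)))
        for other, vec in space.items() if other != word
    }
    return set(sorted(sims, key=sims.get, reverse=True)[:k])

def stability(word, space_a, space_b, k=10):
    """Overlap (0-1) between the word's k nearest neighbors in two spaces."""
    return len(nearest_neighbors(word, space_a, k) &
               nearest_neighbors(word, space_b, k)) / k

# Toy example: two embedding spaces over the same (tiny) vocabulary.
rng = np.random.default_rng(0)
vocab = ["cat", "dog", "fish", "bird", "tree", "car", "road", "book"]
space_a = {w: rng.normal(size=50) for w in vocab}
space_b = {w: rng.normal(size=50) for w in vocab}
print(stability("cat", space_a, space_b, k=3))
```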
NEMO: Frequentist Inference Approach to Constrained Linguistic Typology Feature Prediction in SIGTYP 2020 Shared Task
Alexander Gutkin, Richard Sproat, Workshop on Computational Research in Linguistic Typology, 2020
Abstract Dataset ArXiv Citations (7)
This paper introduces the Eighth Dialog System Technology Challenge. In line with recent challenges, the eighth edition focuses on applying end-to-end dialog technologies in a pragmatic way for multi-domain task-completion, noetic response selection, audio visual scene-aware dialog, and schema-guided dialog state tracking tasks. This paper describes the task definition, provided datasets, and evaluation set-up for each track. We also summarize the results of the submitted systems to highlight the overall trends of the state-of-the-art technologies for the tasks.
Multi-step Joint-Modality Attention Network for Scene-Aware Dialogue System
Yun-Wei Chu, Kuan-Yen Lin, Chao-Chun Hsu, Lun-Wei Ku, DSTC, 2020
Is Your Goal-Oriented Dialog Model Performing Really Well? Empirical Analysis of System-wise Evaluation
Ryuichi Takanobu, Qi Zhu, Jinchao Li, Baolin Peng, Jianfeng Gao, Minlie Huang, SigDial, 2020
End-to-End Neural Pipeline for Goal-Oriented Dialogue System using GPT-2
Donghoon Ham, Jeong-Gwan Lee, Youngsoo Jang, Kee-Eung Kim, ACL, 2020
Adaptability as a skill for goal-oriented dialog systems
Oralie Cattan, Traitement Automatique des Langues Naturelles, 2020
Multi-turn Response Selection using Dialogue Dependency Relations
Qi Jia, Yizhu Liu, Siyu Ren, Kenny Q. Zhu, Haifeng Tang, EMNLP, 2020
Black-Box Testing of Financial Virtual Assistants
Iosif Itkin, Elena Treshcheva, Luba Konnova, Pavel Braslavski, Rostislav Yavorskiy, Conference on Software Quality, Reliability and Security Companion, 2020
Conversation Graph: Data Augmentation, Training and Evaluation for Non-Deterministic Dialogue Management
Milan Gritta, Gerasimos Lampouras, Ignacio Iacobacci, TACL, 2020
Blog Post Abstract Supplementary Material ArXiv Citations (4)
Diplomacy is a seven-player non-stochastic, non-cooperative game, where agents acquire resources through a mix of teamwork and betrayal. Reliance on trust and coordination makes Diplomacy the first non-cooperative multi-agent benchmark for complex sequential social dilemmas in a rich environment. In this work, we focus on training an agent that learns to play the No Press version of Diplomacy, where there is no dedicated communication channel between players. We present DipNet, a neural-network-based policy model for No Press Diplomacy, trained on a new dataset of more than 150,000 human games. The model is first trained by supervised learning (SL) on expert trajectories and is then used to initialize a reinforcement learning (RL) agent trained through self-play. Both the SL and RL agents demonstrate state-of-the-art No Press performance by beating popular rule-based bots.
It Takes Two to Lie: One to Lie, and One to Listen
Denis Peskov, Benny Cheng, Ahmed Elgohary, Joe Barrow, Cristian Danescu-Niculescu-Mizil, Jordan Boyd-Graber, ACL, 2020
Learning to Play No-Press Diplomacy with Best Response Policy Iteration,
T. C. Anthony, Tom Eccles, Andrea Tacchetti, János Kramár, Ian Gemp, Thomas C. Hudson, Nicolas Porcel, Marc Lanctot, Julien Pérolat, Richard L. Everett, Satinder Singh, Thore Graepel, Yoram Bachrach, arXiv, 2020
Learning to Resolve Alliance Dilemmas in Many-Player Zero-Sum Games
Edward Hughes, Thomas W. Anthony, Tom Eccles, Joel Z. Leibo, David Balduzzi, Yoram Bachrach, AAMAS, 2020
Learning to Play: Reinforcement Learning and Games
Aske Plaat, Springer, 2020
Abstract Dataset DOI ArXiv Citations (14)
Task-oriented dialog systems need to know when a query falls outside their range of supported intents, but current text classification corpora only define label sets that cover every example. We introduce a new dataset that includes queries that are out-of-scope—i.e., queries that do not fall into any of the system’s supported intents. This poses a new challenge because models cannot assume that every query at inference time belongs to a system-supported intent class. Our dataset also covers 150 intent classes over 10 domains, capturing the breadth that a production task-oriented agent must handle. We evaluate a range of benchmark classifiers on our dataset along with several different out-of-scope identification schemes. We find that while the classifiers perform well on in-scope intent classification, they struggle to identify out-of-scope queries. Our dataset and evaluation fill an important gap in the field, offering a way of more rigorously and realistically benchmarking text classification in task-driven dialog systems.
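One simple out-of-scope identification scheme of the kind such a dataset allows to be benchmarked is confidence thresholding: if the classifier's most likely in-scope intent is not confident enough, the query is flagged as out-of-scope. A minimal scikit-learn sketch under that assumption (toy data and threshold are illustrative, not taken from the paper):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy in-scope training data, purely for illustration.
train_texts = ["book a flight to boston", "reserve a table for two",
               "what is my account balance", "transfer money to savings"]
train_intents = ["book_flight", "restaurant_reservation",
                 "balance", "transfer"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train_texts, train_intents)

def predict_with_oos(text, threshold=0.5):
    """Predict an in-scope intent, or 'out_of_scope' when confidence is low.

    The threshold would normally be tuned on held-out data that contains
    labeled out-of-scope queries.
    """
    probs = clf.predict_proba([text])[0]
    best = int(np.argmax(probs))
    return clf.classes_[best] if probs[best] >= threshold else "out_of_scope"

print(predict_with_oos("book a flight to denver"))
print(predict_with_oos("tell me a joke about penguins"))
```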
User Utterance Acquisition for Training Task-Oriented Bots: A Review of Challenges, Techniques and Opportunities
Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, Fabio Casati, Moshe Chai Barukh, Shayan Zamanirad, IEEE Internet Computing, 2020
Data Query Language and Corpus Tools for Slot-Filling and Intent Classification Data
Stefan Larson, Eric Guldan, Kevin Leach, LREC, 2020
Out-of-Domain Detection for Natural Language Understanding in Dialog Systems
Yinhe Zheng, Guanyi Chen, Minlie Huang, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020
"None of the Above": Measure Uncertainty in Dialog Response Retrieval,
Yulan Feng, Shikib Mehri, Maxine Eskenazi, Tiancheng Zhao, ACL, 2020
Revisiting One-vs-All Classifiers for Predictive Uncertainty and Out-of-Distribution Detection in Neural Networks
Shreyas Padhy, Zachary Nado, Jie Ren, Jeremiah Liu, Jasper Snoek, Balaji Lakshminarayanan, ICML Workshop: Uncertainty and Robustness in Deep Learning, 2020
KLOOS: KL Divergence-based Out-of-Scope Intent Detection in Human-to-Machine Conversations
Eyup Halit Yilmaz, Cagri Toraman, SIGIR, 2020
Using Optimal Embeddings to Learn New Intents with Few Examples: An Application in the Insurance Domain
Shailesh Acharya, Glenn Fung, KDD: Workshop on Conversational Systems Towards Mainstream Adoption, 2020
HINT3: Raising the bar for Intent Detection in the Wild
Gaurav Arora, Chirag Jain, Manas Chaturvedi, Krupal Modi, EMNLP Insights Workshop, 2020
Intent Detection-Based Lithuanian Chatbot Created via Automatic DNN HyperParameter Optimization
Jurgita Kapociute-Dzikiene, Human Language Technologies - The Baltic Perspective, 2020
Probing Task-Oriented Dialogue Representation from Language Models
Chien-Sheng Wu, Caiming Xiong, EMNLP, 2020
Discriminative Nearest Neighbor Few-Shot Intent Detection by Transferring Natural Language Inference
Jian-Guo Zhang, Kazuma Hashimoto, Wenhao Liu, Chien-Sheng Wu, Yao Wan, Philip S. Yu, Richard Socher, Caiming Xiong, EMNLP, 2020
Improving Out-of-Scope Detection in Intent Classification by Using Embeddings of the Word Graph Space of the Classes,
Paulo Cavalin, Victor Henrique Alves Ribeiro, Ana Appel, Claudio Pinhanez, EMNLP, 2020
Few-shot Pseudo-Labeling for Intent Detection
Thomas Dopierre, C. Gravier, Julien Subercaze, Wilfried Logerais, CoLing, 2020
A Deep Generative Distance-Based Classifier for Out-of-Domain Detection with Mahalanobis Space
H. Xu, Keqing He, Yuanmeng Yan, Si-hong Liu, Z. Liu, Weiran Xu, CoLing, 2020
Sequential Neural Networks for Noetic End-to-End Response Selection
Qian Chen, Wen Wang, CSL, 2020
Dialog Modelling Experiments with Finnish One-to-One Chat Data
Janne Kauttonen, Lili Aunimo, Conference on Artificial Intelligence and Natural Language, 2020
Multi-turn Response Selection using Dialogue Dependency Relations
Qi Jia, Yizhu Liu, Siyu Ren, Kenny Q. Zhu, Haifeng Tang, EMNLP, 2020
Abstract Code Poster DOI Citations (3)
Many annotation tools have been developed, covering a wide variety of tasks and providing features like user management, pre-processing, and automatic labeling. However, all of these tools use a Graphical User Interface, and often require substantial effort for installation and configuration. This paper presents a new annotation tool that is designed to fill the niche of a lightweight interface for users with a terminal-based workflow. Slate supports annotation at different scales (spans of characters, tokens, and lines, or a document) and of different types (free text, labels, and links), with easily customisable keybindings, and unicode support. In a user study comparing with other tools it was consistently the easiest to install and use. Slate fills a need not met by existing systems, and has already been used to annotate two corpora, one of which involved over 250 hours of annotation effort.
An extensive review of tools for manual annotation of documents
Mariana Neves, Jurica Ševa, Briefings in Bioinformatics, 2019
Eras: Improving the quality control in the annotation process for Natural Language Processing tasks
Jonatas S. Grosman, Pedro H.T. Furtado, Ariane M.B. Rodrigues, Guilherme G. Schardong, Simone D.J. Barbosa, Helio C.V. Lopes, Information Systems, 2020
Annobot: Platform for Annotating and Creating Datasets through Conversation with a Chatbot
Rafal Poswiata, Michal Perelkiewicz, CoLing, 2020
Blog Post Abstract Code Dataset Poster DOI Supplementary Material ArXiv Citations (30)
Disentangling conversations mixed together in a single stream of messages is a difficult task, made harder by the lack of large manually annotated datasets. We created a new dataset of 77,563 messages manually annotated with reply-structure graphs that both disentangle conversations and define internal conversation structure. Our data is 16 times larger than all previously released datasets combined, the first to include adjudication of annotation disagreements, and the first to include context. We use our data to re-examine prior work, in particular, finding that 89% of conversations in a widely used dialogue corpus are either missing messages or contain extra messages. Our manually-annotated data presents an opportunity to develop robust data-driven methods for conversation disentanglement, which will help advance dialogue research.
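For readers unfamiliar with reply-structure annotation, the graphs described above can be viewed as links from each message to the message(s) it responds to; a conversation is then a connected component. The sketch below illustrates that view with made-up message IDs (the convention of self-links for conversation starters is an assumption for this example):

```python
from collections import defaultdict

# (child, parent) pairs: message `child` replies to message `parent`.
# A self-link marks a message that starts a new conversation.
reply_links = [(0, 0), (1, 0), (2, 2), (3, 1), (4, 2), (5, 3)]

def disentangle(links):
    """Group messages into conversations: connected components of the reply graph."""
    graph = defaultdict(set)
    for child, parent in links:
        graph[child].add(parent)
        graph[parent].add(child)

    seen, conversations = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, component = [node], []
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            component.append(current)
            stack.extend(graph[current] - seen)
        conversations.append(sorted(component))
    return conversations

print(disentangle(reply_links))  # [[0, 1, 3, 5], [2, 4]]
```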
Query-Focused Scenario Construction,
Su Wang, Greg Durrett, Katrin Erk, EMNLP, 2019
Constructing Interpretive Spatio-Temporal Features for Multi-Turn Responses Selection,
Junyu Lu, Chenbin Zhang, Zeying Xie, Guang Ling, Tom Chao Zhou, Zenglin Xu, ACL, 2019
Who did They Respond to? Conversation Structure Modeling using Masked Hierarchical Transformer
Henghui Zhu, Feng Nan, Zhiguo Wang, Ramesh Nallapati, Bing Xiang, AAAI, 2020
Evaluation Benchmarks and Learning Criteria for Discourse-Aware Sentence Representations,
Mingda Chen, Zewei Chu, Kevin Gimpel, EMNLP, 2019
Sequential Neural Networks for Noetic End-to-End Response Selection
Qian Chen, Wen Wang, CSL, 2020
Noetic end-to-end response selection with supervised neural network based classifiers and unsupervised similarity models
Pawel Skorzewski, Weronika Sieinska, Marek Kubis, CSL, 2020
End-to-End Response Selection Based on Multi-Level Context Response Matching
Basma El Amel Boussaha, Nicolas Hernandez, Christine Jacquin, Emmanuel Morin, CSL, 2020
Software-related Slack Chats with Disentangled Conversations
Preetha Chatterjee, Kostadin Damevski, Nicholas A. Kraft, Lori Pollock, International Conference on Mining Software Repositories: Data Showcase Track, 2020
RAP-Net: Recurrent Attention Pooling Networks for Dialogue Response Selection
Chao-Wei Huang, Ting-Rui Chiang, Shang-Yu Su, Yun-Nung Chen, DSTC, 2019
RAP-Net: Recurrent Attention Pooling Networks for Dialogue Response Selection
Chao-Wei Huang, Ting-Rui Chiang, Shang-Yu Su, Yun-Nung Chen, CSL, 2020
Learning Multi-Level Information for Dialogue Response Selection by Highway Recurrent Transformer
Ting-Rui Chiang, Chao-Wei Huang, Shang-Yu Su, Yun-Nung Chen, CSL, 2020
Time to Take Emoji Seriously: They Vastly Improve Casual Conversational Models (short paper)
Pieter Delobelle, Bettina Berendt, Proceedings of the Reference AI & ML Conference for Belgium, Netherlands & Luxemburg, 2019
Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring
Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, Jason Weston, ICLR, 2020
Noetic end-to-end response selection with supervised neural network based classifiers and unsupervised similarity models
Pawel Skorzewski, Weronika Sieinska, Marek Kubis, DSTC, 2019
Multi-level Context Response Matching in Retrieval-Based Dialog Systems
Basma El Amel Boussaha, Nicolas Hernandez, Christine Jacquin, Emmanuel Morin, DSTC, 2019
Enhanced Sequential Representation Augmented with Utterance-level Attention for Response Selection
Taesun Whang, Dongyub Lee, Chanhee Lee, Heuiseok Lim, DSTC, 2019
Sequential Attention-based Network for Noetic End-to-End Response Selection
Qian Chen, Wen Wang, DSTC, 2019
Spatio-Temporal Matching Network for Multi-Turn Responses Selection in Retrieval-Based Chatbots
Junyu Lu, Zeying Xie, Guang Ling, Chao Zhou, Zenglin Xu, DSTC, 2019
Convolutional Neural Encoder for the 7th Dialogue System Technology Challenge
Mandy Korpusik, James Glass, DSTC, 2019
Learning Multi-Level Information for Dialogue Response Selection by Highway Recurrent Transformer
Ting-Rui Chiang, Chao-Wei Huang, Shang-Yu Su, Yun-Nung Chen, DSTC, 2019
Knowledge-incorporating ESIM models for Response Selection in Retrieval-based Dialog Systems
Jatin Ganhotra, Siva Sankalp Patel, Kshitij Fadnis, DSTC, 2019
End-to-End Question Answering Models for Goal-Oriented Dialog Learning
Jamin Shin, Andrea Madotto, Minjoon Seo, Pascale Fung, DSTC, 2019
Building Sequential Inference Models for End-to-End Response Selection
Jia-Chen Gu, Zhen-Hua Ling, Yu-Ping Ruan, Quan Liu, DSTC, 2019
Pre-Trained and Attention-Based Neural Networks for Building Noetic Task-Oriented Dialogue Systems
Jia-Chen Gu, Tianda Li, Quan Liu, Xiaodan Zhu, Zhen-Hua Ling, Yu-Ping Ruan, DSTC, 2020
End-to-End Transition-Based Online Dialogue Disentanglement
Hui Liu, Zhan Shi, Jia-Chen Gu, Quan Liu, Si Wei, Xiaodan Zhu, IJCAI, 2020
Detection of hidden feature requests from massive chat messages via deep siamese network
Lin Shi, Mingzhe Xing, Mingyang Li, Yawen Wang, Shoubin Li, Qing Wang, Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 2020
Multi-turn Response Selection using Dialogue Dependency Relations
Qi Jia, Yizhu Liu, Siyu Ren, Kenny Q. Zhu, Haifeng Tang, EMNLP, 2020
Response Selection for Multi-Party Conversations with Dynamic Topic Tracking
Weishi Wang, Shafiq Joty, Steven C.H. Hoi, EMNLP, 2020
Vertext: An End-to-end AI Powered Conversation Management System for Multi-party Chat Platforms
Omer Anjum, Chak Ho Chan, Tanitpong Lawphongpanich, Yucheng Liang, Tianyi Tang, Shuchen Zhang, Wen-mei Hwu, Jinjun Xiong, Sanjay Patel, CSCW: Demonstrations, 2020
Online Conversation Disentanglement with Pointer Networks
Tao Yu, Shafiq Joty, EMNLP, 2020
Abstract Dataset DOI ArXiv Citations (4)
In a corpus of data, outliers are either errors (counterproductive mistakes in the data) or unique samples (informative examples that improve model robustness). Identifying outliers can lead to better datasets by (1) removing noise in datasets and (2) guiding collection of additional data to fill gaps. However, the problem of detecting both outlier types has received relatively little attention in NLP, particularly for dialog systems. We introduce a simple and effective technique for detecting both erroneous and unique samples in a corpus of short texts using neural sentence embeddings combined with distance-based outlier detection. We also present a novel data collection pipeline built atop our detection technique to automatically and iteratively mine unique data samples while discarding erroneous samples. Experiments show that our outlier detection technique is effective at finding errors while our data collection pipeline yields highly diverse corpora that in turn produce more robust intent classification and slot-filling models.
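A minimal sketch of the distance-based part of the idea above: embed each short text and rank samples by how far they sit from the rest of their intent's data. The stand-in embedding (mean of cached random word vectors) and the Euclidean distance are illustrative choices, not necessarily those used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
_word_cache = {}

def embed(text, dim=50):
    """Stand-in sentence embedding: mean of (cached) random word vectors.

    A real system would use a trained sentence encoder; random vectors keyed
    by word are enough to illustrate the distance computation.
    """
    vectors = []
    for word in text.lower().split():
        if word not in _word_cache:
            _word_cache[word] = rng.normal(size=dim)
        vectors.append(_word_cache[word])
    return np.mean(vectors, axis=0)

def outlier_scores(texts):
    """Score each text by its mean Euclidean distance to the other texts."""
    X = np.stack([embed(t) for t in texts])
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return dists.sum(axis=1) / (len(texts) - 1)

intent_samples = [
    "set an alarm for 7 am",
    "wake me up at 6 tomorrow",
    "set an alarm for noon",
    "what is the capital of france",  # probably mislabeled for this intent
]
for text, score in sorted(zip(intent_samples, outlier_scores(intent_samples)),
                          key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {text}")
```

High-scoring samples are candidates for removal (errors) or, conversely, for targeted collection of more data like them (unique samples).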
User Utterance Acquisition for Training Task-Oriented Bots: A Review of Challenges, Techniques and Opportunities
Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, Fabio Casati, Moshe Chai Barukh, Shayan Zamanirad, IEEE Internet Computing, 2020
Treating Dialogue Quality Evaluation as an Anomaly Detection Problem
Rostislav Nedelchev, Ricardo Usbeck, Jens Lehmann, LREC, 2020
More Diverse Dialogue Datasets via Diversity-Informed Data Collection
Katherine Stasaski, Grace Hui Yang, Marti A. Hearst, ACL, 2020
Elimination of multidimensional outliers for a compression chiller using a support vector data description
Jae Min Kim, Cheol Soo Park, Science and Technology for the Built Environment, 2020
We examine a large dialog corpus obtained from the conversation history of a single individual with 104 conversation partners. The corpus consists of half a million instant messages, across several messaging platforms. We focus our analyses on seven speaker attributes, each of which partitions the set of speakers, namely: gender; relative age; family member; romantic partner; classmate; co-worker; and native to the same country. In addition to the content of the messages, we examine conversational aspects such as the time messages are sent, messaging frequency, psycholinguistic word categories, linguistic mirroring, and graph-based features reflecting how people in the corpus mention each other. We present two sets of experiments predicting each attribute using (1) short context windows; and (2) a larger set of messages. We find that using all features leads to gains of 9-14% over using message text only.
Extracting Personal Information from Conversations
Anna Tigunova, WWW, 2020
{CHARM}: Inferring Personal Attributes from Conversations,
Anna Tigunova, Andrew Yates, Paramita Mirza, Gerhard Weikum, EMNLP, 2020
Using Speaker-Aligned Graph Memory Block in Multimodally Attentive Emotion Recognition Network
Jeng-Lin Li, Chi-Chun Lee, Interspeech, 2020
Learning interaction dynamics with an interactive LSTM for conversational sentiment analysis
Yazhou Zhang, Prayag Tiwari, Dawei Song, Xiaoliu Mao, Panpan Wang, Xiang Li, Hari Mohan Pandey, Neural networks : the official journal of the International Neural Network Society, 2020
We explore the use of longitudinal dialog data for two dialog prediction tasks: next message prediction and response time prediction. We show that a neural model using personal data that leverages a combination of message content, style matching, time features, and speaker attributes leads to the best results for both tasks, with error rate reductions of up to 15% compared to a classifier that relies exclusively on message content and to a classifier that does not use personal data.
The Four Dimensions of Social Network Analysis: An Overview of Research Methods, Applications, and Software Tools
David Camacho, Angel Panizo-LLedot, Gema Bello-Orgaz, Antonio Gonzalez-Pardo, Erik Cambria, Information Fusion, 2020
A Quantum-Like multimodal network framework for modeling interaction dynamics in multiparty conversational sentiment analysis
Yazhou Zhang, Dawei Song, Xiang Li, Peng Zhang, Panpan Wang, Lu Rong, Guangliang Yu, Bo Wang, Information Fusion, 2020
Intent Classification for Dialogue Utterances
Jetze Schuurmans, Flavius Frasincar, IEEE Intelligent Systems, 2019
Learning interaction dynamics with an interactive LSTM for conversational sentiment analysis
Yazhou Zhang, Prayag Tiwari, Dawei Song, Xiaoliu Mao, Panpan Wang, Xiang Li, Hari Mohan Pandey, Neural Networks, 2020
Abstract Dataset Citations (13)
Goal-oriented dialogue in complex domains is an extremely challenging problem and there are relatively few datasets. This task provided two new resources that presented different challenges: one was focused but small, while the other was large but diverse. We also considered several new variations on the next utterance selection problem: (1) increasing the number of candidates, (2) including paraphrases, and (3) not including a correct option in the candidate set. Twenty teams participated, developing a range of neural network models, including some that successfully incorporated external data to boost performance. Both datasets have been publicly released, enabling future work to build on these results, working towards robust goal-oriented dialogue systems.
Extracting Dialog Structure and Latent Beliefs from Dialog Corpus
Aishwarya Chhabra, Pratik Saini, Chandrasekhar Anantaram, LaCATODA-BtG Workshop, 2019
Sequential Neural Networks for Noetic End-to-End Response Selection
Qian Chen, Wen Wang, CSL, 2020
End-to-End Response Selection Based on Multi-Level Context Response Matching
Basma El Amel Boussaha, Nicolas Hernandez, Christine Jacquin, Emmanuel Morin, Computer Speech & Language, 2020
Learning Multi-Level Information for Dialogue Response Selection by Highway Recurrent Transformer
Ting-Rui Chiang, Chao-Wei Huang, Shang-Yu Su, Yun-Nung Chen, Computer Speech and Language, 2020
Evaluating Dialogue Generation Systems via Response Selection
Shiki Sato, Reina Akama, Hiroki Ouchi, Jun Suzuki, Kentaro Inui, ACL, 2020
"None of the Above": Measure Uncertainty in Dialog Response Retrieval,
Yulan Feng, Shikib Mehri, Maxine Eskenazi, Tiancheng Zhao, ACL, 2020
Experiences and Insights for Collaborative Industry–Academic Research in Artificial Intelligence
Lisa Amini, Ching-Hua Chen, David Cox, Aude Oliva, Antonio Torralba, AI Magazine, 2020
Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring
Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, Jason Weston, ICLR, 2020
Challenges in the Evaluation of Conversational Search Systems
Gustavo Penha, Claudia Hauff, KDD Workshop on Conversational Systems Towards Mainstream Adoption, 2020
Distilling Knowledge for Fast Retrieval-based Chat-bots
Amir Vakili Tahami, Kamyar Ghajar, Azadeh Shakery, SIGIR, 2020
Automatic Evaluation of Non-task Oriented Dialog Systems by Using Sentence Embeddings Projections and Their Dynamics
Mario Rodríguez-Cantelar, Luis Fernando D’Haro, Fernando Matia, Conversational Dialogue Systems for the Next Decade, 2020
A Repository of Conversational Datasets
Matthew Henderson, Paweł Budzianowski, Iñigo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrkšić, Georgios Spithourakis, Pei-Hao Su, Ivan Vulić, Tsung-Hsien Wen, Workshop on NLP for Conversational AI, 2019
Response Selection for Multi-Party Conversations with Dynamic Topic Tracking
Weishi Wang, Shafiq Joty, Steven C.H. Hoi, EMNLP, 2020
Abstract Dataset ArXiv Citations (23)
This paper introduces the Seventh Dialog System Technology Challenges (DSTC), which use shared datasets to explore the problem of building dialog systems. Recently, end-to-end dialog modeling approaches have been applied to various dialog tasks. The seventh DSTC (DSTC7) focuses on developing technologies related to end-to-end dialog systems for (1) sentence selection, (2) sentence generation and (3) audio visual scene aware dialog. This paper summarizes the overall setup and results of DSTC7, including detailed descriptions of the different tracks and provided datasets. We also describe overall trends in the submitted systems and the key results. Each track introduced new datasets and participants achieved impressive results using state-of-the-art end-to-end technologies.
Multi-level Context Response Matching in Retrieval-Based Dialog Systems,
Basma El Amel Boussaha, Nicolas Hernandez, Christine Jacquin, Emmanuel Morin, DSTC, 2019
RAP-Net: Recurrent Attention Pooling Networks for Dialogue Response Selection,
Chao-Wei Huang, Ting-Rui Chiang, Shang-Yu Su, Yun-Nung Chen, DSTC, 2019
Enhanced Sequential Representation Augmented with Utterance-level Attention for Response Selection,
Taesun Whang, Dongyub Lee, Chanhee Lee, Heuiseok Lim, DSTC, 2019
Sequential Attention-based Network for Noetic End-to-End Response Selection,
Qian Chen, Wen Wang, DSTC, 2019
Spatio-Temporal Matching Network for Multi-Turn Responses Selection in Retrieval-Based Chatbots,
Junyu Lu, Zeying Xie, Guang Ling, Chao Zhou, Zenglin Xu, DSTC, 2019
Noetic End-to-End Response Selection with Supervised Neural Network Based Classifiers and Unsupervised Similarity Models,
Pawel Skorzewski, Weronika Sieinska, Marek Kubis, DSTC, 2019
Convolutional Neural Encoder for the 7th Dialogue System Technology Challenge,
Mandy Korpusik, James Glass, DSTC, 2019
Learning Multi-Level Information for Dialogue Response Selection by Highway Recurrent Transformer,
Ting-Rui Chiang, Chao-Wei Huang, Shang-Yu Su, Yun-Nung Chen, DSTC, 2019
End-to-end Gated Self-attentive Memory Network for Dialog Response Selection,
Shuo Sun, Yik-Cheung Tam, Jie Cao, Canxiang Yan, Zuohui Fu, Cheng Niu, Jie Zhou, DSTC, 2019
Entropy-Enhanced Multimodal Attention Model for Scene-Aware Dialogue Generation,
Kuan-Yen Lin, Chao-Chun Hsu, Yun-Nung Chen, Lun-Wei Ku, DSTC, 2019
Comparison of Transfer-Learning Approaches for Response Selection in Multi-Turn Conversations,
Jesse Vig, Kalai Ramea, DSTC, 2019
Top-K Attention Mechanism for Complex Dialogue System,
Chang-Uk Shin, Jeong-Won Cha, DSTC, 2019
Knowledge-Grounded Response Generation with Deep Attentional Latent-Variable Model,
Hao-Tong Ye, Kai-Ling Lo, Shang-Yu Su, Yun-Nung Chen, DSTC, 2019
The OneConn-MemNN System for Knowledge-Grounded Conversation Modeling,
Junyuan Zheng, Surya Kasturi, Mason Lin, Xin Chen, Onkar Salvi, Harry Jiannan Wang, DSTC, 2019
An Ensemble Dialogue System for Facts-Based Sentence Generation,
Ryota Tanaka, Akihide Ozeki, Shugo Kato, Akinobu Lee, DSTC, 2019
Conversing by Reading: Contentful Neural Conversation with On-demand Machine Reading,
Lianhui Qin, Michel Galley, Chris Brockett, Xiaodong Liu, Xiang Gao, Bill Dolan, Yejin Choi, Jianfeng Gao, ACL, 2019
WCIS 2019: 1st Workshop on Conversational Interaction Systems
Abhinav Rastogi, Alexandros Papangelis, Rahul Goel, Chandra Khatri, Proceedings of the 42Nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019
Context and Knowledge Aware Conversational Model and System Combination for Grounded Response Generation
Ryota Tanaka, Akihide Ozeki, Shugo Kato, Akinobu Lee, Computer Speech & Language, 2020
Sequential Neural Networks for Noetic End-to-End Response Selection
Qian Chen, Wen Wang, CSL, 2020
End-to-End Response Selection Based on Multi-Level Context Response Matching
Basma El Amel Boussaha, Nicolas Hernandez, Christine Jacquin, Emmanuel Morin, CSL, 2020
Distilling Knowledge for Fast Retrieval-based Chat-bots
Amir Vakili Tahami, Kamyar Ghajar, Azadeh Shakery, SIGIR, 2020
Is this Dialogue Coherent? Learning from Dialogue Acts and Entities
Alessandra Cervone, Giuseppe Riccardi, SigDial, 2020
Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring
Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, Jason Weston, ICLR, 2020
Abstract Code Dataset Poster DOI ArXiv Citations (56)
To be informative, an evaluation must measure how well systems generalize to realistic unseen data. We identify limitations of and propose improvements to current evaluations of text-to-SQL systems. First, we compare human-generated and automatically generated questions, characterizing properties of queries necessary for real-world applications. To facilitate evaluation on multiple datasets, we release standardized and improved versions of seven existing datasets and one new text-to-SQL dataset. Second, we show that the current division of data into training and test sets measures robustness to variations in the way questions are asked, but only partially tests how well systems generalize to new queries; therefore, we propose a complementary dataset split for evaluation of future work. Finally, we demonstrate how the common practice of anonymizing variables during evaluation removes an important challenge of the task. Our observations highlight key difficulties, and our methodology enables effective measurement of future development.
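To make the anonymization point concrete, the sketch below shows, with a made-up question/SQL pair rather than one from the released datasets, how replacing literal values with shared placeholder variables removes the need for a system to locate those values in the question:

```python
def anonymize(question, sql, values):
    """Replace literal values with shared placeholder variables.

    `values` pairs each SQL literal with the span that expresses it in the
    question; the example strings are invented for illustration.
    """
    for i, (sql_literal, question_span) in enumerate(values):
        var = f"var{i}"
        sql = sql.replace(f"'{sql_literal}'", var)
        question = question.replace(question_span, var)
    return question, sql

question = "what flights leave from Denver after 3pm?"
sql = "SELECT * FROM flights WHERE origin = 'Denver' AND departure > '15:00'"

print(anonymize(question, sql, [("Denver", "Denver"), ("15:00", "3pm")]))
# ('what flights leave from var0 after var1?',
#  'SELECT * FROM flights WHERE origin = var0 AND departure > var1')
```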
Dependency-based Hybrid Trees for Semantic Parsing,
Zhanming Jie, Wei Lu, EMNLP, 2018
Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation,
Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, Dongmei Zhang, ACL, 2019
Bootstrapping an End-to-End Natural Language Interface for Databases
Nathaniel Weir, Prasetya Utama, SIGMOD, 2019
Neural Semantic Parsing with Anonymization for Command Understanding in General-Purpose Service Robots
Nick Walker, Yu-Tang Peng, Maya Cakmak, RoboCup 2019: Robot World Cup XXIII, 2019
A cross-domain natural language interface to databases using adversarial text method
Wenlu Wang, The VLDB PhD Workshop, 2019
Don't paraphrase, detect! Rapid and Effective Data Collection for Semantic Parsing,
Jonathan Herzig, Jonathan Berant, EMNLP, 2019
SpatialNLI: A Spatial Domain Natural Language Interface to Databases Using Spatial Comprehension
Jingjing Li, Wenlu Wang, Wei-Shinn Ku, Yingtao Tian, Haixun Wang, ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2019
Editing-Based SQL Query Generation for Cross-Domain Context-Dependent Questions
Rui Zhang, Tao Yu, He Yang Er, Sungrok Shim, Eric Xue, Xi Victoria Lin, Tianze Shi, Caiming Xiong, Richard Socher, Dragomir Radev, EMNLP, 2019
Model-based Interactive Semantic Parsing: A Unified Formulation and A Text-to-SQL Case Study
Ziyu Yao, Yu Su, Huan Sun, Wen-tau Yih, EMNLP, 2019
Leveraging Adjective-Noun Phrasing Knowledge for Comparison Relation Prediction in Text-to-SQL,
Haoyan Liu, Lei Fang, Qian Liu, Bei Chen, Jian-Guang LOU, Zhoujun Li, EMNLP, 2019
Graph Enhanced Cross-Domain Text-to-SQL Generation,
Siyu Huo, Tengfei Ma, Jie Chen, Maria Chang, Lingfei Wu, Michael Witbrock, TextGraphs Workshop, 2019
A Comprehensive Exploration on Spider with Fuzzy Decision Text-to-SQL Model
Q. Li, L. Li, Q. Li, J. Hong, IEEE Transactions on Industrial Informatics, 2019
Domain Adaptation for Low-Resource Neural Semantic Parsing
Alvin Kennardi, Gabriela Ferraro, Qing Wang, ALTA, 2019
Measuring Compositional Generalization: a Comprehensive Method on Realistic Data,
Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, Dmitry Tsarkov, Xiao Wang, Marc van Zee, Olivier Bousquet, ICLR, 2020
Text-to-SQL Generation for Question Answering on Electronic Medical Records
Ping Wang, Tian Shi, Chandan K. Reddy, WWW, 2020
Syntactic Question Abstraction and Retrieval for Data-Scarce Semantic Parsing
Wonseok Hwang, Jinyeong Yim, Seunghyun Park, Minjoon Seo, AKBC, 2020
Recent Advances in SQL Query Generation: A Survey
Jovan Kalajdjieski, Martina Toshevska, Frosina Stojanovska, 17th International Conference on Informatics and Information Technologies, 2020
DBPal: A Fully Pluggable NL2SQL Training Pipeline
Nathaniel Weir, Prasetya Utama, Alex Galakatos, Andrew Crotty, Amir Ilkhechi, Shekar Ramaswamy, Rohin Bhushan, Nadja Geisler, Benjamin Hättasch, Steffen Eger, Ugur Çetintemel, Carsten Binnig, Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020
Efficient Deployment of Conversational Natural Language Interfaces over Databases
Anthony Colas, Trung Bui, Franck Dernoncourt, Moumita Sinha, Doo Soon Kim, ACL Workshop on NLI, 2020
Photon: A Robust Cross-Domain Text-to-SQL System,
Jichuan Zeng, Xi Victoria Lin, Steven C.H. Hoi, Richard Socher, Caiming Xiong, Michael Lyu, Irwin King, ACL: Demonstrations, 2020
SParC: Cross-Domain Semantic Parsing in Context,
Tao Yu, Rui Zhang, Michihiro Yasunaga, Yi Chern Tan, Xi Victoria Lin, Suyi Li, Heyang Er, Irene Li, Bo Pang, Tao Chen, Emily Ji, Shreya Dixit, David Proctor, Sungrok Shim, Jonathan Kraft, Vincent Zhang, Caiming Xiong, Richard Socher, Dragomir Radev, ACL, 2019
NL2pSQL: Generating Pseudo-SQL Queries from Under-Specified Natural Language Questions,
Fuxiang Chen, Seung-won Hwang, Jaegul Choo, Jung-Woo Ha, Sunghun Kim, EMNLP, 2019
SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task,
Tao Yu, Michihiro Yasunaga, Kai Yang, Rui Zhang, Dongxu Wang, Zifan Li, Dragomir Radev, EMNLP, 2018
Clause-Wise and Recursive Decoding for Complex and Cross-Domain Text-to-SQL Generation,
Dongjun Lee, EMNLP, 2019
Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task,
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, Dragomir Radev, EMNLP, 2018
A Pilot Study for Chinese SQL Semantic Parsing,
Qingkai Min, Yuefeng Shi, Yue Zhang, EMNLP, 2019
CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases,
Tao Yu, Rui Zhang, Heyang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li, Youxuan Jiang, Michihiro Yasunaga, Sungrok Shim, Tao Chen, Alexander Fabbri, Zifan Li, Luyao Chen, Yuwen Zhang, Shreya Dixit, Vincent Zhang, Caiming Xiong, Richard Socher, Walter Lasecki, Dragomir Radev, EMNLP, 2019
Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing,
Ben Bogin, Jonathan Berant, Matt Gardner, ACL, 2019
Good-Enough Compositional Data Augmentation,
Jacob Andreas, ACL, 2020
Bootstrapping a Natural Language Interface to a Cyber Security Event Collection System using a Hybrid Translation Approach,
Johann Roturier, Brian Schlatter, David Silva Schlatter, Proceedings of Machine Translation Summit XVII Volume 2: Translator, Project and User Tracks, 2019
Learning Semantic Parsers from Denotations with Latent Structured Alignments and Abstract Programs,
Bailin Wang, Ivan Titov, Mirella Lapata, EMNLP, 2019
Learning Programmatic Idioms for Scalable Semantic Parsing,
Srinivasan Iyer, Alvin Cheung, Luke Zettlemoyer, EMNLP, 2019
A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation,
Jan Deriu, Katsiaryna Mlynchyk, Philippe Schläpfer, Alvaro Rodrigo, Dirk von Grünigen, Nicolas Kaiser, Kurt Stockinger, Eneko Agirre, Mark Cieliebak, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020
RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers,
Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, Matthew Richardson, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020
Scalable Neural Methods for Reasoning With a Symbolic Knowledge Base
William W. Cohen, Haitian Sun, R. Alex Hofer, Matthew Siegler, ICLR, 2020
Zero-shot Text-to-SQL Learning with Auxiliary Task
Shuaichen Chang, Pengfei Liu, Yun Tang, Jing Huang, Xiaodong He, Bowen Zhou, AAAI, 2020
STNS-CSG: Syntax Tree Networks with Self-Attention for Complex SQL Generation
Miaomiao Hong, Bin Wu, Bai Wang, Pengpeng Zhou, IEEE Fourth International Conference on Data Science in Cyberspace, 2019
Generating SQL Statements from Natural Language Queries: A Multitask Learning Approach (S)
Chunqi Chen, Yunxiang Xiong, Beijun Shen, Yuting Chen, The 31st International Conference on Software Engineering & Knowledge Engineering, 2019
Natural language to SQL: Where are we today?
Hyeonji Kim, Byeong-Hoon So, Wook-Shin Han, Hongrae Lee, VLDB, 2020
Exploring Unexplored Generalization Challenges for Cross-Database Semantic Parsing
Alane Suhr, Ming-Wei Chang, Peter Shaw, Kenton Lee, ACL, 2020
RECPARSER: A Recursive Semantic Parsing Framework for Text-to-SQL Task
Yu Zeng, Yan Gao, Jiaqi Guo, Bei Chen, Qian Liu, Jian-Guang Lou, Fei Teng, Dongmei Zhang, IJCAI, 2020
Automatic Extraction of Legal Norms: Evaluation of Natural Language Processing Tools
Gabriela Ferraro, Ho-Pun Lam, Silvano Colombo Tosatto, Francesco Olivieri, Mohammad Badiul Islam, Nick van Beest, Guido Governatori, New Frontiers in Artificial Intelligence, 2019
Semantic Evaluation for Text-to-SQL with Distilled Test Suites
Ruiqi Zhong, Tao Yu, Dan Klein, EMNLP, 2020
ColloQL: Robust Cross-Domain Text-to-SQL Over Search Queries
Karthik Radhakrishnan, Arvind Srikantan, Xi Victoria Lin, EMNLP Workshop on Interactive and Executable Semantic Parsing, 2020
On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries
Tianze Shi, Chen Zhao, Jordan Boyd-Graber, Hal Daumé III, Lillian Lee, Findings of EMNLP, 2020
Benchmarking Meaning Representations in Neural Semantic Parsing.
Jiaqi Guo, Qian Liu, Jian-Guang Lou, Zhenwen Li, Xueqing Liu, Tao Xie, Ting Liu., EMNLP, 2020
Character-level Representations Improve DRS-based Semantic Parsing Even in the Age of BERT,
Rik van Noord, Antonio Toral, Johan Bos, EMNLP, 2020
Re-examining the Role of Schema Linking in Text-to-SQL,
Wenqiang Lei, Weixin Wang, Zhixin Ma, Tian Gan, Wei Lu, Min-Yen Kan, Tat-Seng Chua, EMNLP, 2020
DuSQL: A Large-Scale and Pragmatic Chinese Text-to-SQL Dataset,
Lijie Wang, Ao Zhang, Kun Wu, Ke Sun, Zhenghua Li, Hua Wu, Min Zhang, Haifeng Wang, EMNLP, 2020
"What Do You Mean by That?" A Parser-Independent Interactive Approach for Enhancing Text-to-SQL,
Yuntao Li, Bei Chen, Qian Liu, Yan Gao, Jian-Guang Lou, Yan Zhang, Dongmei Zhang, EMNLP, 2020
Sequence-Level Mixed Sample Data Augmentation,
Demi Guo, Yoon Kim, Alexander Rush, EMNLP, 2020
Benchmarking Meaning Representations in Neural Semantic Parsing,
Jiaqi Guo, Qian Liu, Jian-Guang Lou, Zhenwen Li, Xueqing Liu, Tao Xie, Ting Liu, EMNLP, 2020
GRAPPA: Grammar-Augmented Pre-Training for Table Semantic Parsing
Tao Yu, Chien-Sheng Wu, Xi Victoria Lin, Bailin Wang, Yi Chern Tan, Xinyi Yang, Dragomir Radev, Richard Socher, Caiming Xiong, EMNLP Workshop on Interactive and Executable Semantic Parsing, 2020
Did You Ask a Good Question? A Cross-Domain Question Intention Classification Benchmark for Text-to-SQL
Yusen Zhang, Xiangyu Dong, Shuaichen Chang, Tao Yu, Peng Shi, Rui Zhang, EMNLP Workshop on Interactive and Executable Semantic Parsing, 2020
Neural Approaches for Natural Language Interfaces to Databases: A Survey
Radu Cristian Alexandru Iacob, F. Brad, Elena-Simona Apostol, Ciprian-Octavian Truică, Ionel Alexandru Hosu, Traian Rebedea, CoLing, 2020
Tracking Interaction States for Multi-Turn Text-to-SQL Semantic Parsing
Run-Ze Wang, Zhen-Hua Ling, Jing-Bo Zhou, Yu Hu, AAAI, 2021
Abstract DOI ArXiv Citations (44)
Despite the recent popularity of word embedding methods, there is only a small body of work exploring the limitations of these representations. In this paper, we consider one aspect of embedding spaces, namely their stability. We show that even relatively high frequency words (100-200 occurrences) are often unstable. We provide empirical evidence for how various factors contribute to the stability of word embeddings, and we analyze the effects of stability on downstream tasks.
Subcharacter information in japanese embeddings: when is it worth it?
Marzena Karpinska, Bofang Li, Anna Rogers, Aleksandr Drozd, Proceedings of the Workshop on Relevance of Linguistic Structure in Neural Architectures for NLP (RELNLP), 2018
What's in Your Embedding, And How It Predicts Task Performance
Anna Rogers, Shashwath Hosur Ananthakrishna, Anna Rumshisky, CoLing, 2018
CluWords: Exploiting Semantic Word Clustering Representation for Enhanced Topic Modeling,
Felipe Viegas, Sergio D. Canuto, Christian Gomes, Washington Luiz, Thierson Rosa, Sabir Ribas, Leonardo C. da Rocha, Marcos Andre Goncalves, WSDM, 2019
Investigating the Stability of Concrete Nouns in Word Embeddings,
Bénédicte Pierrejean, Ludovic Tanguy, Proceedings of the 13th International Conference on Computational Semantics - Short Papers, 2019
Modeling Word Emotion in Historical Language: Quantity Beats Supposed Stability in Seed Word Selection,
Johannes Hellrich, Sven Buechel, Udo Hahn, Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 2019
Estimating Topic Modeling Performance with Sharma–Mittal Entropy
Sergei Koltcov, Vera Ignatenko, Olessia Koltsova, Entropy, 2019
A framework for anomaly detection using language modeling, and its applications to finance,
Armineh Nourbakhsh, Grace Bang, 2nd KDD Workshop on Anomaly Detection in Finance, 2019
Weighted posets: Learning surface order from dependency trees
William Dyer, Syntaxfest, 2019
Comparing the Performance of Feature Representations for the Categorization of the Easy-to-Read Variety vs Standard Language
Marina Santini, Benjamin Danielsson, Arne Jonsson, NoDaLiDa, 2019
A Metrological Framework for Evaluating Crowd-powered Instruments,
Chris Welty, Lora Aroyo, Praveen Paritosh, HComp, 2019
Ideological Drifts in the U.S. Constitution: Detecting Areas of Contention with Models of Semantic Change
Abdul Z. Abdulrahim, NeurIPS Joint Workshop on AI for Social Good, 2019
Learning Variable-Length Representation of Words
Debasis Ganguly, Pattern Recognition, 2020
Understanding the Downstream Instability of Word Embeddings
Megan Leszczynski, Avner May, Jian Zhang, Sen Wu, Christopher Aberger, Christopher Re, MLSys, 2020
Automated Event Identification from System Logs Using Natural Language Processing
Abhishek Dwaraki, Shachi Kumary, Tilman Wolf, International Conference on Computing, Networking and Communications, 2020
Revisiting the Context Window for Cross-lingual Word Embeddings
Ryokan Ri, Yoshimasa Tsuruoka, ACL, 2020
Towards Understanding the Instability of Network Embedding
Chenxu Wang, Wei Rao, Wenna Guo, Pinghui Wang, Jun Liu, Xiaohong Guan, Transactions on Knowledge and Data Engineering, 2020
Stolen Probability: A Structural Weakness of Neural Language Models
David Demeter, Gregory Kimmel, Doug Downey, ACL, 2020
On the Influence of Coreference Resolution on Word Embeddings in Lexical-semantic Evaluation Tasks
Alexander Henlein, Alexander Mehler, LREC, 2020
SAMPO: Unsupervised Knowledge Base Construction for Opinions and Implications
Nikita Bhutani, Aaron Traylor, Chen Chen, Xiaolan Wang, Behzad Golshan, Wang-Chiew Tan, AKBC, 2020
Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora,
Hila Gonen, Ganesh Jawahar, Djamé Seddah, Yoav Goldberg, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020
The Influence of Down-Sampling Strategies on SVD Word Embedding Stability,
Johannes Hellrich, Bernd Kampe, Udo Hahn, RepEval, 2019
Tkol, Httt, and r/radiohead: High Affinity Terms in Reddit Communities,
Abhinav Bhandari, Caitrin Armstrong, W-NUT, 2019
Writing habits and telltale neighbors: analyzing clinical concept usage patterns with sublanguage embeddings,
Denis Newman-Griffis, Eric Fosler-Lussier, Workshop on Health Text Mining and Information Analysis, 2019
Density Matching for Bilingual Word Embedding,
Chunting Zhou, Xuezhe Ma, Di Wang, Graham Neubig, NAACL, 2019
Transparent, Efficient, and Robust Word Embedding Access with WOMBAT,
Mark-Christoph Müller, Michael Strube, CoLing: Demonstration, 2018
Probing for Semantic Classes: Diagnosing the Meaning Content of Word Embeddings,
Yadollah Yaghoobzadeh, Katharina Kann, T. J. Hazen, Eneko Agirre, Hinrich Schütze, ACL, 2019
Characterizing the Impact of Geometric Properties of Word Embeddings on Task Performance,
Brendan Whitaker, Denis Newman-Griffis, Aparajita Haldar, Hakan Ferhatosmanoglu, Eric Fosler-Lussier, RepEval, 2019
Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research
Pedro Rodriguez, Arthur Spirling, Conference of the Society for Political Methodology, 2019
Enhancing Domain-Specific Supervised Natural Language Intent Classification with a Top-Down Selective Ensemble Model
Gard Jenset, Barbara McGillivray, Machine Learning and Knowledge Extraction, 2019
Analyzing Hypersensitive AI: Instability in Corporate-Scale Machine Learning
Michaela Regneri, Malte Hoffmann, Jurij Kost, Niklas Pietsch, Timo Schulz, Sabine Stamm, IJCAI/ECAI Workshop on Explainable Artificial Intelligence, 2018
Can prediction-based distributional semantic models predict typicality?
Tom Heyman, Geert Heyman, Quarterly Journal of Experimental Psychology, 2019
Data Shift in Legal AI Systems
Venkata Nagaraju Buddarapu, Arunprasath Shankar, Workshop on Automated Semantic Analysis of Information in Legal Texts co-located with the International Conference on Artificial Intelligence and Law, 2019
Word Embeddings: Reliability & Semantic Change
Johannes Hellrich, , 2019
Medical Information Extraction in the Age of Deep Learning
Udo Hahn, Michel Oleynik, Yearbook of Medical Informatics, 2020
Understanding the stability of medical concept embeddings
Grace E. Lee, Aixin Sun, Association for Information Science and Technology, 2020
Embedding Structured Dictionary Entries,
Steven Wilson, Walid Magdy, Barbara McGillivray, Gareth Tyson, EMNLP Workshop on Insights from Negative Results in NLP, 2020
Diachronic Embeddings for People in the News,
Felix Hennig, Steven Wilson, EMNLP Workshop on Natural Language Processing and Computational Social Science, 2020
Is Wikipedia succeeding in reducing gender bias? Assessing changes in gender bias in Wikipedia using word embeddings,
Katja Geertruida Schmahl, Tom Julian Viering, Stavros Makrodimitris, Arman Naseri Jahfari, David Tax, Marco Loog, EMNLP Workshop on Natural Language Processing and Computational Social Science, 2020
Measuring the Semantic Stability of Word Embedding
Zhenhao Huang, Chenxu Wang, Natural Language Processing and Chinese Computing, 2020
Short-term Semantic Shifts and their Relation to Frequency Change
Anna Marakasova, Julia Neidhardt, Probability and Meaning Conference, 2020
Visualizing and Quantifying Vocabulary Learning During Search
Nilavra Bhattacharya, Jacek Gwizdka, CIKM Workshop on Investigating Learning During (Web) Search, 2020
Detecting Different Forms of Semantic Shift in Word Embeddings via Paradigmatic and Syntagmatic Association Changes
Anna Wegmann, Florian Lemmerich, Markus Strohmaier, The Semantic Web, 2020
An Empirical Study of the Downstream Reliability of Pre-Trained Word Embeddings
Anthony Rios, Brandon Lwowski, CoLing, 2020
Comparing the performance of various Swedish BERT models for classification
Daniel Holmer, Arne Jonsson, Swedish Language Technology Conference, 2020
Most summarization research focuses on summarizing the entire given text, but in practice readers are often interested in only one aspect of the document or conversation. We propose ‘targeted summarization’ as an umbrella category for summarization tasks that intentionally consider only parts of the input data. This covers query-based summarization, update summarization, and a new task we propose where the goal is to summarize a particular aspect of a document. However, collecting data for this new task is hard because directly asking annotators (e.g., crowd workers) to write summaries leads to data with low accuracy when there are a large number of facts to include. We introduce a novel crowdsourcing workflow, Pin-Refine, that allows us to collect high-quality summaries for our task, a necessary step for the development of automatic systems.
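As a concrete illustration of the proposed task, the sketch below shows what a single aspect-targeted summarization instance might look like; the field names and example text are invented here for illustration, not taken from the dataset.

```python
# Hypothetical instance of aspect-targeted summarization: the full input,
# the aspect the reader cares about, and a summary restricted to that aspect.
instance = {
    "document": ("The council met on Tuesday. Members debated new bus routes, "
                 "approved the library budget, and postponed the zoning vote."),
    "target_aspect": "library funding",
    "summary": "The council approved the library budget.",
}

def covers_only_target(inst):
    # Toy check: the summary should mention the target aspect's key term
    # and omit facts that belong to other aspects of the document.
    return "library" in inst["summary"] and "zoning" not in inst["summary"]

print(covers_only_target(instance))  # True
```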
A Crowdsourcing Approach to Evaluate the Quality of Query-based Extractive Text Summaries
N. Iskender, A. Gabryszak, T. Polzehl, L. Hennig, S. Möller, Conference on Quality of Multimedia Experience, 2019
HEIDL: Learning Linguistic Expressions with Deep Learning and Human-in-the-Loop,
Prithviraj Sen, Yunyao Li, Eser Kogan, Yiwei Yang, Walter Lasecki, ACL (demo), 2019
Efficient Elicitation Approaches to Estimate Collective Crowd Answers,
John Joon Young Chung, Jean Y. Song, Sindhu Kutty, Sungsoo (Ray) Hong, Juho Kim, Walter S. Lasecki, CSCW, 2019
Towards Hybrid Human-AI Workflows for Unknown Unknown Detection
Anthony Z. Liu, Santiago Guerra, Isaac Fung, Gabriel Matute, Ece Kamar, Walter S. Lasecki, WWW, 2020
C-Reference: Improving 2D to 3D Object Pose Estimation Accuracy via Crowdsourced Joint Object Estimation
Jean Y. Song, John Joon Young Chung, David F. Fouhey, Walter S. Lasecki, CSCW, 2020
Finding Microaggressions in the Wild: A Case for Locating Elusive Phenomena in Social Media Posts,
Luke Breitfeller, Emily Ahn, David Jurgens, Yulia Tsvetkov, EMNLP, 2019
How we write with crowds
Molly Q. Feldman, Brian McInnis, CSCW, 2020
Abstract Video DOI Citations (11)
Industrial dialogue systems such as Apple Siri and Google Assistant require large-scale, diverse training data to enable their sophisticated conversation capabilities. Crowdsourcing is a scalable and inexpensive data collection method, but collecting high-quality data efficiently requires thoughtful orchestration of crowdsourcing jobs. Prior studies of the data collection process have focused on tasks with limited scope and performed intrinsic data analysis, which may not be indicative of the impact on trained model performance. In this paper, we present a study of crowdsourcing methods for a user intent classification task in one of our deployed dialogue systems. Our task requires classification over 47 possible user intents and contains many intent pairs with subtle differences. We consider different crowdsourcing job types and job prompts, quantitatively analyzing the quality of collected data and downstream model performance on a test set of real user queries from production logs. Our observations provide insight into how design decisions impact crowdsourced data quality, with clear recommendations for future data collection for dialogue systems.
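The evaluation loop described above can be pictured with a short sketch: train the same intent classifier on data collected under each crowdsourcing job design, then score every model on a fixed test set of real user queries. The scikit-learn pipeline, design names, utterances, and intents below are placeholders, not the paper's setup.

```python
# Minimal sketch: compare downstream accuracy of data from two job designs.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Hypothetical collections: one list of (utterance, intent) pairs per job design.
collections = {
    "scenario_prompt": [("i lost my card", "report_lost_card"),
                        ("book me a table for two", "restaurant_booking")],
    "paraphrase_prompt": [("my card is missing", "report_lost_card"),
                          ("reserve a table tonight", "restaurant_booking")],
}
# Fixed test set standing in for real queries from production logs.
test_x = ["i think my credit card is gone", "can you get me a dinner reservation"]
test_y = ["report_lost_card", "restaurant_booking"]

for design, data in collections.items():
    texts, labels = zip(*data)
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    print(design, accuracy_score(test_y, model.predict(test_x)))
```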
Enhancing Domain-Specific Supervised Natural Language Intent Classification with a Top-Down Selective Ensemble Model,
Gard B. Jenset, Barbara McGillivray, Machine Learning and Knowledge Extraction, 2019
A Study of Incorrect Paraphrases in Crowdsourced User Utterances,
Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, Moshe Chai Barukh, Shayan Zamanirad, NAACL, 2019
Data Collection Methods for Building a Free Response Training Simulation
Vaibhav Sharma, Beni Shpringer, Sung Min Yang, Martin Bolger, Sodiq Adewole, Dr. D. Brown, Erfaneh Gharavi, Systems and Information Engineering Design Symposium, 2019
Personalizing crowdsourced human-robot interaction through curiosity-driven learning
Phoebe Liu, Malcolm Doering, Dylan F. Glas, Takayuki Kanda, Dana Kulic, Hiroshi Ishiguro, Personalization in Long-Term Human-Robot Interaction, 2019
MA-DST: Multi-Attention-Based Scalable Dialog State Tracking
Adarsh Kumar, Peter Ku, Anuj Kumar Goyal, Angeliki Metallinou, Dilek Hakkani-tur, The 3rd NeurIPS workshop on Conversational AI: Today's Practice and Tomorrow's Potential, 2019
User Utterance Acquisition for Training Task-Oriented Bots: A Review of Challenges, Techniques and Opportunities
Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, Fabio Casati, Moshe Chai Barukh, Shayan Zamanirad, IEEE Internet Computing, 2020
Dynamic word recommendation to obtain diverse crowdsourced paraphrases of user utterances
Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, Fabio Casati, Moshe Chai Barukh, Shayan Zamanirad, IUI, 2020
Data Query Language and Corpus Tools for Slot-Filling and Intent Classification Data
Stefan Larson, Eric Guldan, Kevin Leach, LREC, 2020
More Diverse Dialogue Datasets via Diversity-Informed Data Collection
Katherine Stasaski, Grace Hui Yang, Marti A. Hearst, ACL, 2020
Optimizing the Design and Cost for Crowdsourced Conversational Utterances
Phoebe Liu, Joan Xiao, Tong Liu, Dylan F. Glas, KDD Workshop: Data Collection, Curation, and Labeling (DCCL) for Mining and Learning, 2019
Dialogue Act Classification for Virtual Agents for Software Engineers during Debugging
Andrew Wood, Zachary Eberhart, Collin McMillan, International Conference on Software Engineering Workshops, 2020
In this paper, we explore the role played by world knowledge in semantic parsing. We look at the types of errors that currently exist in a state-of-the-art Abstract Meaning Representation (AMR) parser, and explore the problem of how to integrate world knowledge to reduce these errors. We look at three types of knowledge: (1) WordNet hypernyms and super senses, (2) Wikipedia entity links, and (3) retraining a named entity recognizer to identify concepts in AMR. The retrained entity recognizer is not perfect and cannot recognize all concepts in AMR, so we examine the limitations of the named entity features using a set of oracles. The oracles show how performance increases when different subsets of AMR concepts can be recognized. These results show improvement on multiple fine-grained metrics, including a 6% increase in named entity F-score, and provide insight into the potential of world knowledge for future work in Abstract Meaning Representation parsing.
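A rough sketch of the first knowledge source mentioned above: looking up WordNet hypernyms and supersenses (lexicographer files) for a word, which could then be attached to the corresponding AMR concept as features. It uses NLTK's WordNet interface and naively takes the first, most frequent sense; this is an illustration, not the paper's feature extractor.

```python
# Requires nltk with the WordNet data downloaded: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def wordnet_features(word):
    synsets = wn.synsets(word)
    if not synsets:
        return {"supersense": None, "hypernyms": []}
    first = synsets[0]  # naive choice: the most frequent sense
    return {
        "supersense": first.lexname(),                       # e.g. "noun.animal"
        "hypernyms": [h.name() for h in first.hypernyms()],  # e.g. ["canine.n.02"]
    }

print(wordnet_features("dog"))
```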
Towards Turkish Abstract Meaning Representation,
Zahra Azin, Gulsen Eryigit, ACL: SRW, 2019
Abstract Code DOI Supplementary Material ArXiv Citations (10)
One weakness of machine-learned NLP models is that they typically perform poorly on out-of-domain data. In this work, we study the task of identifying products being bought and sold in online cybercrime forums, which exhibits particularly challenging cross-domain effects. We formulate a task that represents a hybrid of slot-filling information extraction and named entity recognition and annotate data from four different forums. Each of these forums constitutes its own ‘fine-grained domain’ in that the forums cover different market sectors with different properties, even though all forums are in the broad domain of cybercrime. We characterize these domain differences in the context of a learning-based system: supervised models see decreased accuracy when applied to new forums, and standard techniques for semi-supervised learning and domain adaptation have limited effectiveness on this data, which suggests the need to improve these techniques. We release a dataset of 1,938 annotated posts from across the four forums.
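The cross-domain effect described above can be made concrete with a toy experiment: fit a trivial lexicon-based product tagger on posts from one forum and measure how token-level F1 drops when it is applied to a different forum. The posts, annotations, and the tagger itself are invented stand-ins for the paper's supervised models.

```python
def train_lexicon(posts):
    # "Training" = remember every token annotated as a product.
    return {tok for tokens, labels in posts
            for tok, lab in zip(tokens, labels) if lab == "PRODUCT"}

def f1(posts, lexicon):
    tp = fp = fn = 0
    for tokens, labels in posts:
        for tok, lab in zip(tokens, labels):
            pred = "PRODUCT" if tok in lexicon else "O"
            tp += pred == "PRODUCT" and lab == "PRODUCT"
            fp += pred == "PRODUCT" and lab == "O"
            fn += pred == "O" and lab == "PRODUCT"
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Invented placeholder posts with token-level product annotations.
forum_a = [(["selling", "fresh", "cvv", "dumps"], ["O", "O", "PRODUCT", "PRODUCT"])]
forum_b = [(["need", "a", "crypter", "asap"], ["O", "O", "PRODUCT", "O"])]

lexicon = train_lexicon(forum_a)
print("in-domain F1:  ", f1(forum_a, lexicon))
print("cross-forum F1:", f1(forum_b, lexicon))
```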
You Are Your Photographs: Detecting Multiple Identities of Vendors in the Darknet Marketplaces
Xiangwen Wang, Peng Peng, Chun Wang, Gang Wang, ASIA CCS, 2018
Reading Thieves' Cant: Automatically Identifying and Understanding Dark Jargons from Cybercrime Marketplaces,
Kan Yuan, Haoran Lu, Xiaojing Liao, XiaoFeng Wang, USENIX, 2018
Automatically identifying the function and intent of posts in underground forums,
Andrew Caines, Sergio Pastrana, Alice Hutchings, Paula J. Buttery, Crime Science, 2018
Understanding and Predicting Private Interactions in Underground Forums
Zhibo Sun, Carlos E. Rubio-Medrano, Ziming Zhao, Tiffany Bao, Adam Doupe, Gail-Joon Ahn, Proceedings of the Ninth ACM Conference on Data and Application Security and Privacy, 2019
Casino Royale: A Deep Exploration of Illegal Online Gambling
Hao Yang, Kun Du, Yubao Zhang, Shuang Hao, Zhou Li, Mingxuan Liu, Haining Wang, Haixin Duan, Yazhou Shi, Xiaodong Su, Guang Liu, Zhifeng Geng, Jianping Wu, Proceedings of the 35th Annual Computer Security Applications Conference, 2019
On (The Lack Of) Location Privacy in Crowdsourcing Applications,
Spyros Boukoros, Mathias Humbert, Stefan Katzenbeisser, Carmela Troncoso, 28th USENIX Security Symposium (USENIX Security 19), 2019
Mapping the Underground: Supervised Discovery of Cybercrime Supply Chains
Rasika Bhalerao, Maxwell Aliapoulios, Ilia Shumailov, Sadia Afroz, Damon McCoy, IEEE APWG Symposium on Electronic Crime Research, 2019
Measuring eWhoring
Sergio Pastrana, Alice Hutchings, Daniel R. Thomas, Juan E. E. Tapiador, IMC, 2019
Proactively Identifying Emerging Hacker Threats from the Dark Web: A Diachronic Graph Embedding Framework (D-GEF)
Sagar Samtani, Hongyi Zhu, Hsinchun Chen, ACM Transactions on Privacy and Security (TOPS), 2020
Proactively Identifying Emerging Hacker Threats from the Dark Web
Sagar Samtani, Hongyi Zhu, Hsinchun Chen, ACM Transactions on Privacy and Security (TOPS), 2020
Abstract Dataset Video DOI PDF Slides ArXiv Citations (18)
Linguistically diverse datasets are critical for training and evaluating robust machine learning systems, but data collection is a costly process that often requires experts. Crowdsourcing the process of paraphrase generation is an effective means of expanding natural language datasets, but there has been limited analysis of the trade-offs that arise when designing tasks. In this paper, we present the first systematic study of the key factors in crowdsourcing paraphrase collection. We consider variations in instructions, incentives, data domains, and workflows. We manually analyzed paraphrases for correctness, grammaticality, and linguistic diversity. Our observations provide new insight into the trade-offs between accuracy and diversity in crowd responses that arise as a result of task design, providing guidance for future paraphrase generation procedures.
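One simple way to put a number on the linguistic diversity discussed above is the average pairwise Jaccard distance between the word sets of the paraphrases collected for a single prompt. The sketch below is an illustrative metric, not the paper's manual analysis.

```python
from itertools import combinations

def avg_jaccard_distance(paraphrases):
    # Higher values mean the collected paraphrases share fewer words.
    sets = [set(p.lower().split()) for p in paraphrases]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(1 - len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Placeholder paraphrases for one prompt.
collected = [
    "book a flight to boston",
    "i need a flight to boston",
    "please reserve a plane ticket to boston for me",
]
print(round(avg_jaccard_distance(collected), 3))
```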
Effective Crowdsourced Generation of Training Data for Chatbots Natural Language Understanding,
R. Bapat, P. Kucherbaev, A. Bozzon, ICWE, 2018
SPADE: Evaluation Dataset for Monolingual Phrase Alignment,
Yuki Arase, Junichi Tsujii, LREC, 2018
Crowdsourcing for Reminiscence Chatbot Design,
Svetlana Nikitina, Florian Daniel, Marcos Baez, Fabio Casati, HCOMP, 2018
Towards More Robust Speech Interactions for Deaf and Hard of Hearing Users,
Raymond Fok, Harmanpreet Kaur, Skanda Palani, Martez E. Mott, Walter S. Lasecki, ASSETS, 2018
A Study of Incorrect Paraphrases in Crowdsourced User Utterances,
Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, Moshe Chai Barukh, Shayan Zamanirad, NAACL, 2019
Optimizing the Design and Cost for Crowdsourced Conversational Utterances
Phoebe Liu, Joan Xiao, Tong Liu, Dylan F. Glas, Workshop on Data Collection, Curation, and Labeling (DCCL) for Mining and Learning, 2019
Personalizing crowdsourced human-robot interaction through curiosity-driven learning
Phoebe Liu, Malcolm Doering, Dylan F. Glas, Takayuki Kanda, Dana Kulic, Hiroshi Ishiguro, Personalization in Long-Term Human-Robot Interaction, 2019
PKU Paraphrase Bank: A Sentence-Level Paraphrase Corpus for Chinese
Bowei Zhang, Weiwei Sun, Xiaojun Wan, Zongming Guo, Natural Language Processing and Chinese Computing, 2019
Efficient Elicitation Approaches to Estimate Collective Crowd Answers
John Joon Young Chung, Jean Y. Song, Sindhu Kutty, Sungsoo (Ray) Hong, Juho Kim, Walter S. Lasecki, CSCW, 2019
Optimizing for Happiness and Productivity: Modeling Opportune Moments for Transitions and Breaks at Work
Harmanpreet Kaur, Alex C. Williams, Daniel McDuff, Mary Czerwinski, Jaime Teevan, Shamsi Iqbal, CHI, 2020
Dynamic word recommendation to obtain diverse crowdsourced paraphrases of user utterances
Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, Fabio Casati, Moshe Chai Barukh, Shayan Zamanirad, IUI, 2020
The Influence of Input Data Complexity on Crowdsourcing Quality
Christopher Tauchmann, Johannes Daxenberger, Margot Mieskes, IUI, 2020
User Utterance Acquisition for Training Task-Oriented Bots: A Review of Challenges, Techniques and Opportunities
Mohammad-Ali Yaghoub-Zadeh-Fard, Boualem Benatallah, Fabio Casati, Moshe Chai Barukh, Shayan Zamanirad, IEEE Internet Computing, 2020
Emotional Speech Corpus for Persuasive Dialogue System
Sara Asai, Koichiro Yoshino, Seitaro Shinagawa, Sakriani Sakti, Satoshi Nakamura, LREC, 2020
C-Reference: Improving 2D to 3D Object Pose Estimation Accuracy via Crowdsourced Joint Object Estimation
Jean Y. Song, John Joon Young Chung, David F. Fouhey, Walter S. Lasecki, CSCW, 2020
ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations,
John Wieting, Kevin Gimpel, ACL, 2018
Multilingual Whispers: Generating Paraphrases with Translation,
Christian Federmann, Oussama Elachqar, Chris Quirk, W-NUT, 2019
A chatbot response generation system
Jasper Feine, Stefan Morana, Alexander D. Maedche, Conference on Mensch und Computer, 2020
Underground forums are widely used by criminals to buy and sell a host of stolen items, datasets, resources, and criminal services. These forums contain important resources for understanding cybercrime. However, the number of forums, their size, and the domain expertise required to understand the markets make manual exploration of these forums unscalable. In this work, we propose an automated, top-down approach for analyzing underground forums. Our approach uses natural language processing and machine learning to automatically generate high-level information about underground forums, first identifying posts related to transactions, and then extracting products and prices. We also demonstrate, via a pair of case studies, how an analyst can use these automated approaches to investigate other categories of products and transactions. We use eight distinct forums to assess our tools: Antichat, Blackhat World, Carders, Darkode, Hack Forums, Hell, L33tCrew and Nulled. Our automated approach is fast and accurate, achieving over 80% accuracy in detecting post category, product, and prices.
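The two pipeline stages described above, flagging transaction posts and then extracting products and prices, can be sketched with a keyword cue list and a price regex; both are illustrative stand-ins for the learned classifiers and extractors used in the paper.

```python
import re

# Hypothetical transaction cues and a simple dollar-amount pattern.
TRANSACTION_CUES = {"selling", "buying", "wts", "wtb", "price", "$"}
PRICE_RE = re.compile(r"\$\s?\d+(?:\.\d{2})?")

def looks_like_transaction(post):
    text = post.lower()
    return any(cue in text.split() or cue in text for cue in TRANSACTION_CUES)

def extract_prices(post):
    return PRICE_RE.findall(post)

post = "WTS fresh accounts, $15 each or $120 for ten"
if looks_like_transaction(post):
    print(extract_prices(post))  # ['$15', '$120']
```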
Ethical issues of research using datasets of illicit origin,
Daniel R. Thomas, Sergio Pastrana, Alice Hutchings, Richard Clayton, Alastair R. Beresford, IMC, 2017
CrimeBB: Enabling Cybercrime Research on Underground Forums at Scale,
Sergio Pastrana, Daniel R. Thomas, Alice Hutchings, Richard Clayton, WWW, 2018
At-risk system identification via analysis of discussions on the darkweb,
Eric Nunes, Paulo Shakarian, Gerardo I. Simari, APWG Symposium on Electronic Crime Research (eCrime), 2018
Systematically Understanding the Cyber Attack Business: A Survey,
Keman Huang, Michael Siegel, Stuart Madnick, ACM Computing Surveys, 2018
Characterizing Eve: Analysing Cybercrime Actors in a Large Underground Forum
Sergio Pastrana, Alice Hutchings, Andrew Caines, Paula Buttery, International Symposium on Research in Attacks, Intrusions and Defenses, 2018
Automatically identifying the function and intent of posts in underground forums,
Andrew Caines, Sergio Pastrana, Alice Hutchings, Paula J. Buttery, Crime Science, 2018
Analyzing and Identifying Data Breaches in Underground Forums,
Yong Fang, Yusong Guo, Cheng Huang, Liang Liu, IEEE Access, 2019
Multistream Classification for Cyber Threat Data with Heterogeneous Feature Space,
Yi-Fan Li, Yang Gao, Gbadebo Ayoade, Hemeng Tao, Latifur Khan, Bhavani Thuraisingham, WWW, 2019
CARONTE: Crawling Adversarial Resources Over Non-Trusted, High-Profile Environments
M. Campobasso, P. Burda, L. Allodi, 2019 IEEE European Symposium on Security and Privacy Workshops, 2019
Chapter 3 - Challenges of using machine learning algorithms for cybersecurity: a study of threat-classification models applied to social media communication data
Andrei Queiroz Lima, Brian Keegan, , 2020
The Art and Craft of Fraudulent App Promotion in Google Play
Mizanur Rahman, Nestor Hernandez, Ruben Recabarren, Syed Ishtiaque Ahmed, Bogdan Carbunar, CCS, 2019
Casino Royale: A Deep Exploration of Illegal Online Gambling
Hao Yang, Kun Du, Yubao Zhang, Shuang Hao, Zhou Li, Mingxuan Liu, Haining Wang, Haixin Duan, Yazhou Shi, Xiaodong Su, Guang Liu, Zhifeng Geng, Jianping Wu, Proceedings of the 35th Annual Computer Security Applications Conference, 2019
Challenges Within the Industry 4.0 Setup
Akshi Kumar, Divya Gupta, , 2020
An Empirical Study of Malicious Threads in Security Forums,
Joobin Gharibshah, Zhabiz Gharibshah, Evangelos E. Papalexakis, Michalis Faloutsos, WWW, 2019
Mapping the Underground: Supervised Discovery of Cybercrime Supply Chains
Rasika Bhalerao, Maxwell Aliapoulios, Ilia Shumailov, Sadia Afroz, Damon McCoy, IEEE APWG Symposium on Electronic Crime Research, 2019
The Not Yet Exploited Goldmine of OSINT: Opportunities, Open Challenges and Future Trends
Javier Pastor-Galindo, Pantaleone Nespoli, Felix Gomez Marmol, Gregorio Martinez Perez, IEEE Access, 2020
REST: A thread embedding approach for identifying and classifying user-specified information in security forums
Joobin Gharibshah, Evangelos E. Papalexakis, Michalis Faloutsos, ICWSM, 2020
Cybercrimes: Critical Issues in a Global Context
Anita Lavorgna, Macmillan International Higher Education, 2020
A tight scrape: methodological approaches to cybercrime research data collection in adversarial environments
Kieron Turk, Sergio Pastrana, Ben Collier, Workshop on Attackers and Cyber-Crime Operations: IEEE European Symposium on Security and Privacy, 2020
Mining actionable information from security forums: the case of malicious IP addresses
Joobin Gharibshah, Tai Ching Li, Andre Castro, Konstantinos Pelechrinis, Evangelos E. Papalexakis, Michalis Faloutsos, Conference on Advances in Social Networks Analysis and Mining, 2019
iDetector: Automate Underground Forum Analysis Based on Heterogeneous Information Network
Yiming Zhang, Yujie Fan, Shifu Hou, Jian Liu, Yanfang Ye, Thirimachos Bourlai, IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2018
Extracting actionable information from Security Forums
Joobin Gharibshah, Michalis Faloutsos, WWW, 2019
RIPEx: Extracting malicious IP addresses from security forums using cross-forum learning
Joobin Gharibshah, Evangelos E. Papalexakis, Michalis Faloutsos, Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2018
You Are Your Photographs: Detecting Multiple Identities of Vendors in the Darknet Marketplaces
Xiangwen Wang, Peng Peng, Chun Wang, Gang Wang, Asia Conference on Computer and Communications Security, 2018
Economic Factors of Vulnerability Trade and Exploitation
Luca Allodi, ACM SIGSAC Conference on Computer and Communications Security, 2017
SourceFinder: Finding Malware Source-Code from Publicly Available Repositories in GitHub
Md Omar Faruk Rokon, Risul Islam, Ahmad Darki, Evangelos E. Papalexakis, Michalis Faloutsos, International Symposium on Research in Attacks, Intrusions and Defenses, 2020
A Framework for Analysis Attackers’ Accounts
Hossein Siadati, Jay Koven, Christian Felix da Silva, Markus Jakobsson, Enrico Bertini, David Maimon, Nasir Memon, Security, Privacy and User Interaction, 2020
Turning Up the Dial: the Evolution of a Cybercrime Market Through Set-up, Stable, and Covid-19 Eras
Anh V. Vu, Jack Hughes, Ildiko Pete, Ben Collier, Yi Ting Chua, Ilia Shumailov, Alice Hutchings, Internet Measurement Conference, 2020
HackerScope: The Dynamics of a Massive Hacker Online Ecosystem
Risul Islam, Md Omar Faruk Rokon, Ahmad Darki, Michalis Faloutsos, Conference on Advances in Social Networks Analysis and Mining, 2020
TenFor: A Tensor-Based Tool to Extract Interesting Events from Security Forums
Risul Islam, Md Omar Faruk Rokon, Evangelos E. Papalexakis, Michalis Faloutsos, Conference on Advances in Social Networks Analysis and Mining, 2020
Abstract Code Video DOI Interview ArXiv Citations (10)
General treebank analyses are graph structured, but parsers are typically restricted to tree structures for efficiency and modeling reasons. We propose a new representation and algorithm for a class of graph structures that is flexible enough to cover almost all treebank structures, while still admitting efficient learning and inference. In particular, we consider directed, acyclic, one-endpoint-crossing graph structures, which cover most long-distance dislocation, shared argumentation, and similar tree-violating linguistic phenomena. We describe how to convert phrase structure parses, including traces, to our new representation in a reversible manner. Our dynamic program uniquely decomposes structures, is sound and complete, and covers 97.3% of the Penn English Treebank. We also implement a proof-of-concept parser that recovers a range of null elements and trace types.
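The key structural restriction named above can be checked directly: a graph over a sentence is one-endpoint-crossing if, for every edge, all edges that cross it share a common endpoint. The sketch below implements that check for undirected spans (ignoring edge direction and labels), as a simplified illustration of the property rather than the paper's dynamic program.

```python
def crosses(e, f):
    # Edges are (i, j) pairs over word positions; they cross if their
    # endpoints strictly interleave.
    (a, b), (c, d) = sorted(e), sorted(f)
    return a < c < b < d or c < a < d < b

def is_one_endpoint_crossing(edges):
    for e in edges:
        crossing = [f for f in edges if f != e and crosses(e, f)]
        if not crossing:
            continue
        # Some single vertex must appear in every edge that crosses e.
        shared = set(crossing[0])
        for f in crossing[1:]:
            shared &= set(f)
        if not shared:
            return False
    return True

print(is_one_endpoint_crossing([(0, 3), (2, 5), (2, 6)]))  # True: crossers share vertex 2
print(is_one_endpoint_crossing([(0, 3), (1, 4), (2, 5)]))  # False
```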
Exploiting Structure in Parsing to 1-Endpoint-Crossing Graphs,
Robin Kurtz, Marco Kuhlmann, IWPT, 2017
The Interplay Between Loss Functions and Structural Restrictions in Semantic Dependency Parsing
Robin Kurtz, Marco Kuhlmann, Proceedings of the Seventh Swedish Language Technology Conference (SLTC), 2018
An Analysis of Plane Task Text Ellipticity and the Possibility of Ellipses Reconstructing Based on Cognitive Modeling Geometric Objects and Actions,
Xenia Naidenova, Sergei Kurbatov, Vjacheslav Ganapol'skii, Proceedings of Computational Models in Language and Speech Workshop, 2018
AGRR-2019: A Corpus for Gapping Resolution in Russian,
Maria Ponomareva, Kira Droganova, Ivan Smurov, Tatiana Shavrina, The 7th Workshop on Balto-Slavic Natural Language Processing, 2019
PTB Graph Parsing with Tree Approximation,
Yoshihide Kato, Shigeki Matsubara, ACL, 2019
Mind the Gap: Data Enrichment in Dependency Parsing of Elliptical Constructions,
Kira Droganova, Filip Ginter, Jenna Kanerva, Daniel Zeman, Workshop on Universal Dependencies, 2018
Generalized chart constraints for efficient PCFG and TAG parsing,
Stefan Grünewald, Sophie Henning, Alexander Koller, ACL, 2018
Sentences with Gapping: Parsing and Reconstructing Elided Predicates,
Sebastian Schuster, Joakim Nivre, Christopher D. Manning, NAACL, 2018
Global Transition-based Non-projective Dependency Parsing,
Carlos Gómez-Rodríguez, Tianze Shi, Lillian Lee, ACL, 2018
Semantic Role Labeling as Syntactic Dependency Parsing
Tianze Shi, Igor Malioutov, Ozan İrsoy, EMNLP, 2020
Representation of syntactic structure is a core area of research in Computational Linguistics, disambiguating distinctions in meaning that are crucial for correct interpretation of language. Development of algorithms and statistical models over the past three decades has led to systems that are accurate enough to be deployed in industry, playing a key role in products such as Google Search and Apple Siri. However, syntactic parsers today are usually constrained to tree representations of language, and performance is interpreted through a single metric that conveys no linguistic information regarding remaining errors.
In this dissertation, we present new algorithms for error analysis and parsing. The heart of our approach to error analysis is the use of structural transformations to identify more meaningful classes of errors, and to enable comparisons across formalisms. For parsing, we combine a novel dynamic program with careful choices in syntactic representation to create an efficient parser that produces graph structured output. Together, these developments allowed us to evaluate the outstanding challenges in parsing and to address a key weakness in current work.
First, we present a search algorithm that, given two structures, finds a sequence of modifications leading from one structure to the other. We applied this algorithm to syntactic error analysis, where one structure is the output of a parser, the other is the correct parse, and each modification corresponds to fixing one error. We constructed a tool based on the algorithm and analyzed variations in behavior between parsers, types of text, and languages. Our observations shine light on several assumptions about syntactic errors, showing some to be true and others to be false. For example, prepositional phrase attachment errors are indeed a major issue, while coordination scope errors do not hurt performance as much as expected.
Next, we describe an algorithm that builds a parse in one syntactic representation to match a parse in another representation. Specifically, we build phrase structure parses from Combinatory Categorial Grammar derivations. Our approach follows the philosophy of CCG, defining specific phrase structures for each lexical category and generic rules for combinatory steps. The new parse is built by following the CCG derivation bottom-up, gradually building the corresponding phrase structure parse. This produced significantly more accurate parses than past work, and enabled us to compare performance of several parsers across formalisms.
Finally, we address a weakness we observed in phrase structure parsers: the exclusion of syntactic trace structures for computational convenience. We present an efficient dynamic programming algorithm that constructs the graph structure that has the highest score under an edge-factored scoring function. We define a parse representation compatible with the algorithm, and show how certain linguistic distinctions dramatically impact coverage. We also show various ways to modify the algorithm to improve performance by exploiting properties of observed linguistic structure. This approach to syntactic parsing is the first to cover virtually all structure encoded in the Penn Treebank.
Abstract Code Poster DOI Citations (10)
Despite the convexity of structured max-margin objectives (Taskar et al., 2004; Tsochantaridis et al., 2004), the many ways to optimize them are not equally effective in practice. We compare a range of online optimization methods over a variety of structured NLP tasks (coreference, summarization, parsing, etc.) and find several broad trends. First, margin methods do tend to outperform both likelihood and the perceptron. Second, for max-margin objectives, primal optimization methods are often more robust and progress faster than dual methods. This advantage is most pronounced for tasks with dense or continuous-valued features. Overall, we argue for a particularly simple online primal subgradient descent method that, despite being rarely mentioned in the literature, is surprisingly effective in relation to its alternatives.
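The update the abstract argues for can be written in a few lines: for each example, run loss-augmented decoding and take a subgradient step on the regularized structured hinge loss in the primal. The sketch below uses a toy multiclass problem as a stand-in for structured outputs; the feature map, learning rate, and regularizer are illustrative choices, not the paper's experimental settings.

```python
import numpy as np

def feats(x, y, num_classes):
    # Block joint feature map: a copy of x in the slot for class y.
    f = np.zeros(num_classes * len(x))
    f[y * len(x):(y + 1) * len(x)] = x
    return f

def subgradient_step(w, x, gold, num_classes, lr=0.1, reg=0.01):
    # Loss-augmented decoding: argmax over outputs of score plus Hamming-style cost.
    scores = [w @ feats(x, y, num_classes) + (y != gold) for y in range(num_classes)]
    pred = int(np.argmax(scores))
    grad = reg * w
    if pred != gold:  # margin violated: hinge subgradient is non-zero
        grad += feats(x, pred, num_classes) - feats(x, gold, num_classes)
    return w - lr * grad

w = np.zeros(6)  # 2 features * 3 classes
for x, y in [(np.array([1.0, 0.0]), 0), (np.array([0.0, 1.0]), 2)] * 20:
    w = subgradient_step(w, x, y, num_classes=3)
print(w.round(2))
```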
Using accelerometers to remotely and automatically characterize behavior in small animals
Talisin T. Hammond, Dwight Springthorpe, Rachel E. Walsh, Taylor Berg-Kirkpatrick, Journal of Experimental Biology, 2016
Joint Models for Extracting Adverse Drug Events from Biomedical Text
Fei Li, Yue Zhang, Meishan Zhang, Donghong Ji, IJCAI, 2016
A Practical Perspective on Latent Structured Prediction for Coreference Resolution,
Iryna Haponchyk, Alessandro Moschitti, EACL, 2017
Fine-Grained Entity Typing with High-Multiplicity Assignments,
Maxim Rabinovich, Dan Klein, ACL, 2017
Learning-Based Single-Document Summarization with Compression and Anaphoricity Constraints,
Greg Durrett, Taylor Berg-Kirkpatrick, Dan Klein, ACL, 2016
Multi-Task Structured Prediction for Entity Analysis: Search-Based Learning Algorithms,
Chao Ma, Janardhan Rao Doppa, Prasad Tadepalli, Hamed Shahbazi, Xiaoli Fern, ACML, 2017
Post-Specialisation: Retrofitting Vectors of Words Unseen in Lexical Resources,
Ivan Vulić, Goran Glavaš, Nikola Mrkšić, Anna Korhonen, NAACL, 2018
DGeoSegmenter: A dictionary-based Chinese word segmenter for the geoscience domain,
Qinjun Qiu, Zhong Xie, Liang Wu, Wenjia Li, Computers and Geosciences, 2018
Neural Word Segmentation Learning for {C}hinese,
Deng Cai, Hai Zhao, ACL, 2016
Policy Shaping and Generalized Update Equations for Semantic Parsing from Denotations,
Dipendra Misra, Ming-Wei Chang, Xiaodong He, Wen-tau Yih, EMNLP, 2018
Abstract Code Slides PDF Slides Citations (34)
Coreference resolution metrics quantify errors but do not analyze them. Here, we consider an automated method of categorizing errors in the output of a coreference system into intuitive underlying error types. Using this tool, we first compare the error distributions across a large set of systems, then analyze common errors across the top ten systems, empirically characterizing the major unsolved challenges of the coreference resolution task.
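In the same spirit as the categorization described above, the sketch below compares gold and system entity clusters and reports two of the most intuitive error types, missing and extra mentions; the automated method described in the abstract distinguishes a richer set of underlying error types.

```python
def mention_errors(gold_clusters, system_clusters):
    # Clusters are sets of mention identifiers (here, plain strings).
    gold_mentions = {m for cluster in gold_clusters for m in cluster}
    sys_mentions = {m for cluster in system_clusters for m in cluster}
    return {
        "missing_mention": sorted(gold_mentions - sys_mentions),
        "extra_mention": sorted(sys_mentions - gold_mentions),
    }

gold = [{"Obama", "he", "the president"}, {"Michelle", "she"}]
system = [{"Obama", "he"}, {"Michelle", "she", "the first lady"}]
print(mention_errors(gold, system))
```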
Visualization, Search, and Error Analysis for Coreference Annotations
Markus Gartner, Anders Bjorkelund, Gregor Thiele, Wolfgang Seeker, Jonas Kuhn, ACL, 2014
Limited memory incremental coreference resolution
Kellie Webster, James Curran, CoLing, 2014
Linking people in videos with "their" names using coreference resolution
Vignesh Ramanathan, Armand Joulin, Percy Liang, Li Fei-Fei, ECCV, 2014
Solving Hard Coreference Problems
Haoruo Peng, Daniel Khashabi, Dan Roth, NAACL, 2015
Analyzing and Visualizing Coreference Resolution Errors
Sebastian Martschat, Thierry Göckel, Michael Strube, NAACL, 2015
Modeling the Lifespan of Discourse Entities with Application to Coreference Resolution
Marie-Catherine de Marneffe, Marta Recasens, Christopher Potts, JAIR, 2015
Distributional Semantics for Resolving Bridging Mentions
Tim Feuerbach, Martin Riedl, Chris Biemann, RANLP, 2015
Using Lexical and Encyclopedic Knowledge
Yannick Versley, Massimo Poesio, Simone Ponzetto, , 2015
Error analysis for anaphora resolution in Russian: new challenging issues for anaphora resolution task in a morphologically rich language
Svetlana Toldova, Ilya Azerkovich, Anna Roytberg, Alina Ladygina, Maria Vasilyeva, Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016), 2016
Visual Development & Analysis of Coreference Resolution Systems with CORVIDAE
Nico Moller, Gunther Heidemann, Visualization and Interaction for Ontologies and Linked Data, 2016
CORVIDAE: Coreference Resolution Visual Development & Analysis Environment
Nico Moller, Gunther Heidemann, International Conference on Semantic Systems, 2016
Richer Event Description: Integrating event coreference with temporal, causal and bridging annotation
Tim O'Gorman, Kristin Wright-Bettner, Martha Palmer, Proceedings of the 2nd Workshop on Computing News Storylines (CNS 2016), 2016
Multilingual coreference resolution
Sandra Kubler, Desislava Zhekova, Language and Linguistics Compass, 2016
A scaffolding approach to coreference resolution integrating statistical and rule-based models
Heeyoung Lee, Mihai Surdeanu, Dan Jurafsky, NLE, 2017
A method for in-depth comparative evaluation: How (dis)similar are outputs of pos taggers, dependency parsers and coreference resolvers really?,
Don Tuggener, EACL, 2017
An Active Learning Approach to Coreference Resolution,
Mrinmaya Sachan, Eduard Hovy, Eric P. Xing, IJCAI, 2015
Incorporating Structural Information for Better Coreference Resolution,
Kong Fang, Fu Jian, IJCAI, 2019
Automated Generation of Test Suites for Error Analysis of Concept Recognition Systems,
Tudor Groza, Karin Verspoor, ALTA, 2014
Graph-Based Lexicon Regularization for PCFG With Latent Annotations
Xiaodong Zeng, Derek F. Wong, Lidia S. Chao, Isabel Trancoso, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015
Evaluation Campaigns
Marta Recasens, Sameer Pradhan, Anaphora Resolution, 2016
The More Antecedents, the Merrier: Resolving Multi-Antecedent Anaphors,
Hardik Vala, Andrew Piper, Derek Ruths, ACL, 2016
Singleton Detection using Word Embeddings and Neural Networks,
Hessel Haagsma, ACL Workshop: SRW, 2016
Learning Global Features for Coreference Resolution,
Sam Wiseman, Alexander M. Rush, Stuart M. Shieber, NAACL, 2016
Learning Anaphoricity and Antecedent Ranking Features for Coreference Resolution,
Sam Wiseman, Alexander M. Rush, Stuart Shieber, Jason Weston, ACL, 2015
Enriching Basque Coreference Resolution System using Semantic Knowledge sources,
Ander Soraluze, Olatz Arregi, Xabier Arregi, Arantza Díaz de Ilarraza, Workshop on Coreference Resolution Beyond OntoNotes, 2017
Examining the Impact of Coreference Resolution on Quote Attribution,
Tim O'Keefe, Kellie Webster, James R. Curran, Irena Koprinska, ALTA, 2013
Latent Structures for Coreference Resolution,
Sebastian Martschat, Michael Strube, TACL, 2015
Recall Error Analysis for Coreference Resolution,
Sebastian Martschat, Michael Strube, EMNLP, 2014
A Dutch coreference resolution system with an evaluation on literary fiction
Andreas van Cranenburgh, Computational Linguistics in the Netherlands Journal, 2019
Improving mention detection for Basque based on a deep error analysis
Ander Soraluze, Olatz Arregi, Xabier Arregi, Arantza Diaz de Ilarraza, Natural Language Engineering, 2016
Different German and English Coreference Resolution Models for Multi-domain Content Curation Scenarios
Ankit Srivastava, Sabine Weber, Peter Bourgonje, Georg Rehm, International Conference of the German Society for Computational Linguistics and Language Technology, 2018
Conundrums in Entity Coreference Resolution: Making Sense of the State of the Art,
Jing Lu, Vincent Ng, EMNLP, 2020
Learning to Ignore: Long Document Coreference with Bounded Memory Neural Networks
Shubham Toshniwal, Sam Wiseman, Allyson Ettinger, Karen Livescu, Kevin Gimpel, EMNLP, 2020
A Benchmark of Rule-Based and Neural Coreference Resolution in Dutch Novels and News
Corben Poot, Andreas van Cranenburgh, Workshop on Computational Models of Reference, Anaphora and Coreference, 2020
Abstract Code Slides PDF Slides Citations (15)
Aspects of Chinese syntax result in a distinctive mix of parsing challenges. However, the contribution of individual sources of error to overall difficulty is not well understood. We conduct a comprehensive automatic analysis of error types made by Chinese parsers, covering a broad range of error types for large sets of sentences, enabling the first empirical ranking of Chinese error types by their performance impact. We also investigate which error types are resolved by using gold part-of-speech tags, showing that improving Chinese tagging only addresses certain error types, leaving substantial outstanding challenges.
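The gold-tag comparison described above amounts to running the same error analysis twice and differencing the per-type counts; the sketch below shows that bookkeeping step. The counts and error-type names are made-up illustrations, not results from the paper.

```python
def tagging_attributable(pred_tag_errors, gold_tag_errors):
    # Errors that disappear when gold POS tags are supplied can be
    # attributed to tagging rather than to the parser itself.
    return {
        err: pred_tag_errors[err] - gold_tag_errors.get(err, 0)
        for err in pred_tag_errors
    }

with_predicted_tags = {"NP-internal": 120, "Coordination": 80, "Unary": 45}
with_gold_tags      = {"NP-internal": 60,  "Coordination": 75, "Unary": 20}
print(tagging_attributable(with_predicted_tags, with_gold_tags))
```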
Chinese syntactic parsing based on linguistic entity-relationship model,
Dechun Yin, Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, 2013
A Hebrew verb–complement dictionary,
Hanna Fadida, Alon Itai, Shuly Wintner, LREC, 2013
Two Knives Cut Better Than One: Chinese Word Segmentation with Dual Decomposition,
Mengqiu Wang, Rob Voigt, Christopher D. Manning, ACL, 2014
Joint POS Tagging and Transition-based Constituent Parsing in Chinese with Non-local Features,
Zhiguo Wang, Nianwen Xue, ACL, 2014
Improved Parsing with Taxonomy of Conjunctions,
Dongchen Li, Xiantao Zhang, Xihong Wu, IEEE China Summit and International Conference on Signal and Information Processing, 2014
Parsing Chinese with a Generalized Categorial Grammar,
Manjuan Duan, William Schuler, Proceedings of the Grammar Engineering Across Frameworks (GEAF) 2015 Workshop, 2015
Research on Chinese Parsing Based on the Improved Compositional Vector Grammar,
Jingyi Li, Lingling Mu, Hongying Zan, Kunli Zhang, 16th Workshop, CLSW 2015, Revised Selected Papers, 2015
Does String-Based Neural MT Learn Source Syntax?,
Xing Shi, Inkit Padhi, Kevin Knight, EMNLP, 2016
A Semantic-Oriented Grammar for Chinese Treebanking,
Meishan Zhang, Yue Zhang, Wanxiang Che, Ting Liu, CICLing, 2016
Resolving Coordinate Structures for Chinese Constituent Parsing,
Yichu Zhou, Shujian Huang, Xinyu Dai, Jiajun Chen, Natural Language Processing and Chinese Computing, 2015
Non-Deterministic Segmentation for Chinese Lattice Parsing,
Hai Hu, Daniel Dakota, Sandra Kubler, Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, 2017
Test Sets for Chinese Nonlocal Dependency Parsing
Manjuan Duan, William Schuler, LREC, 2018
An Empirical Investigation of Error Types in Vietnamese Parsing,
Quy Nguyen, Yusuke Miyao, Hiroshi Noji, Nhung Nguyen, CoLing, 2018
Towards Replicability in Parsing,
Daniel Dakota, Sandra Kübler, RANLP, 2017
A note on constituent parsing for Korean
Mija Kim, Jungyeul Park, Natural Language Engineering, 2020
We present a catalogue of high-velocity clouds (HVCs) from the Galactic All Sky Survey (GASS) of southern-sky neutral hydrogen, which has 57 mK sensitivity and 1 km/s velocity resolution and was obtained with the Parkes Telescope. Our catalogue has been derived from the stray-radiation corrected second release of GASS. We describe the data and our method of identifying HVCs and analyse the overall properties of the GASS population. We catalogue a total of 1693 HVCs at declinations < 0 deg, including 1111 positive velocity HVCs and 582 negative velocity HVCs. Our catalogue also includes 295 anomalous velocity clouds (AVCs). The cloud line-widths of our HVC population have a median FWHM of ~19 km/s, which is lower than found in previous surveys. The completeness of our catalogue is above 95% based on comparison with the HIPASS catalogue of HVCs, upon which we improve with an order of magnitude in spectral resolution. We find 758 new HVCs and AVCs with no HIPASS counterpart. The GASS catalogue will shed an unprecedented light on the distribution and kinematic structure of southern-sky HVCs, as well as delve further into the cloud populations that make up the anomalous velocity gas of the Milky Way.
HI4PI: a full-sky H i survey based on EBHIS and GASS,
N. Ben Bekhti, L. Flöer, R. Keller, J. Kerp, D. Lenz, B. Winkel, J. Bailin, M. R. Calabretta, L. Dedes, H. A. Ford, B. K. Gibson, U. Haud, S. Janowiecki, P. M. W. Kalberla, F. J. Lockman, N. M. McClure-Griffiths, T. Murphy, H. Nakanishi, D. J. Pisano, L. Staveley-Smith, Astronomy and Astrophysics, 2016
Theoretical model of hydrodynamic jet formation from accretion disks with turbulent viscosity
E. Arshilava, M. Gogilashvili, V. Loladze, I. Jokhadze, B. Modrekiladze, N.L. Shatashvili, A.G. Tevzadze, Journal of High Energy Astrophysics, 2019
How runaway stars boost galactic outflows
Eric P. Andersson, Oscar Agertz, Florent Renaud, Monthly Notices of the Royal Astronomical Society, 2020
Abstract Code Slides PDF Slides Citations (2)
We propose an improved, bottom-up method for converting CCG derivations into PTB-style phrase structure trees. In contrast with past work (Clark and Curran, 2009), which used simple transductions on category pairs, our approach uses richer transductions attached to single categories. Our conversion preserves more sentences under round-trip conversion (51.1% vs. 39.6%) and is more robust. In particular, unlike past methods, ours does not require ad-hoc rules over non-local features, and so can be easily integrated into a parser.
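The conversion idea can be sketched as a lookup from individual CCG categories to small phrase-structure templates, applied bottom-up as the derivation is traversed. The categories, templates, and example fragment below are a tiny invented illustration, not the paper's rule set.

```python
# Hypothetical category-to-template table: each lexical CCG category maps to
# a small phrase-structure piece for the word it covers.
TEMPLATES = {
    "N": lambda word: ("NN", word),
    "NP/N": lambda word: ("DT", word),
    r"(S\NP)/NP": lambda word: ("VBD", word),
}

def lexical_node(category, word):
    return TEMPLATES[category](word)

def combine(label, left, right):
    # Generic combinatory step: wrap the two subtrees under a new constituent.
    return (label, left, right)

# Build "the cat chased a mouse" bottom-up, mirroring the CCG derivation.
subj = combine("NP", lexical_node("NP/N", "the"), lexical_node("N", "cat"))
obj = combine("NP", lexical_node("NP/N", "a"), lexical_node("N", "mouse"))
vp = combine("VP", lexical_node(r"(S\NP)/NP", "chased"), obj)
print(combine("S", subj, vp))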
A Machine Learning Approach to Convert CCGbank to Penn Treebank,
Xiaotian Zhang, Hai Zhao, Cong Hui, CoLing, 2012
Automatic Generation of High Quality {CCG}banks for Parser Domain Adaptation,
Masashi Yoshikawa, Hiroshi Noji, Koji Mineshima, Daisuke Bekki, ACL, 2019
Abstract Code Slides PDF Slides Citations (57)
Constituency parser performance is primarily interpreted through a single metric, F-score on WSJ section 23, that conveys no linguistic information regarding the remaining errors. We classify errors within a set of linguistically meaningful types using tree transformations that repair groups of errors together. We use this analysis to answer a range of questions about parser behaviour, including what linguistic constructions are difficult for state-of-the-art parsers, what types of errors are being resolved by rerankers, and what types are introduced when parsing out-of-domain text.
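Before any transformations are applied, the raw material for the analysis described above is the difference between the gold and system bracket sets; the short sketch below computes those missing and extra constituents for a toy pair of trees (a missing PP bracket), which the transformations would then group into error types.

```python
def brackets(tree, start=0):
    # tree = (label, children...) with plain strings as leaves;
    # returns ([(label, i, j), ...], next_index).
    label, children = tree[0], tree[1:]
    spans, i = [], start
    for child in children:
        if isinstance(child, str):
            i += 1
        else:
            child_spans, i = brackets(child, i)
            spans.extend(child_spans)
    spans.append((label, start, i))
    return spans, i

gold = ("S", ("NP", "the", "cat"), ("VP", "sat", ("PP", "on", ("NP", "the", "mat"))))
pred = ("S", ("NP", "the", "cat"), ("VP", "sat", "on", ("NP", "the", "mat")))

gold_spans, _ = brackets(gold)
pred_spans, _ = brackets(pred)
print("missing:", set(gold_spans) - set(pred_spans))  # the PP the parser failed to build
print("extra:  ", set(pred_spans) - set(gold_spans))
```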
Joint Apposition Extraction with Syntactic and Semantic Constraints,
Will Radford, James R. Curran, ACL, 2013
Parsing with Compositional Vector Grammars,
Richard Socher, John Bauer, Christopher D. Manning, Andrew Y. Ng, ACL, 2013
A Hebrew verb-complement dictionary,
Hanna Fadida, Alon Itai, Shuly Wintner, LREC, 2013
On the Elements of an Accurate Tree-to-String Machine Translation System,
Graham Neubig, Kevin Duh, ACL, 2014
Parser Evaluation Using Derivation Trees: A Complement to evalb,
Seth Kulick, Ann Bies, Justin Mott, Anthony Kroch, Mark Liberman, Beatrice Santorini, ACL, 2014
Joint RNN-Based Greedy Parsing and Word Composition,
Joel Legrand, Ronan Collobert, ICLR, 2015
Exploring Compositional Architectures and Word Vector Representations for Prepositional Phrase Attachment,
Yonatan Belinkov, Tao Lei, Regina Barzilay, Amir Globerson, TACL, 2014
Identifying Cascading Errors using Constraints in Dependency Parsing,
Dominick Ng, James R. Curran, ACL, 2015
It Depends: Dependency Parser Comparison Using A Web-based Evaluation Tool,
Jinho D. Choi, Joel Tetreault, Amanda Stent, ACL, 2015
Transition-based Neural Constituent Parsing,
Taro Watanabe, Eiichiro Sumita, ACL, 2015
What is hard in Universal Dependency Parsing?,
Angelika Kirilin, Yannick Versley, SPMRL, 2015
A Protocol for Annotating Parser Differences,
James V. Bruno, Aoife Cahill, Binod Gyawali, ETS Research Reports, 2016
Predicting the Performance of Parsing with Referential Translation Machines,
Ergun Bicici, The Prague Bulletin of Mathematical Linguistics, 2016
An Evaluation of Parser Robustness for Ungrammatical Sentences,
Homa B. Hashemi, Rebecca Hwa, EMNLP, 2016
Fine-grained parallelism in probabilistic parsing with Habanero Java,
Matthew Francis-Landau, Bing Xue, Jason Eisner, Vivek Sarkar, Workshop on Irregular Applications: Architectures and Algorithms, 2016
Old School vs. New School: Comparing Transition-Based Parsers with and without Neural Network Enhancement,
Miryam de Lhoneux, Sara Stymne, Joakim Nivre, International Workshop on Treebanks and Linguistic Theories, 2017
PP Attachment: Where do We Stand?,
Dani"{e}l de Kok, Jianqiang Ma, Corina Dima, Erhard Hinrichs, EACL, 2017
Deep Semantic Role Labeling: What Works and What's Next,
Luheng He, Kenton Lee, Mike Lewis, Luke Zettlemoyer, ACL, 2017
Breaking NLP: Using Morphosyntax, Semantics, Pragmatics and World Knowledge to Fool Sentiment Analysis Systems,
Taylor Mahler, Willy Cheung, Micha Elsner, David King, Marie-Catherine de Marneffe, Cory Shain, Symon Stevens-Guille, Michael White, EMNLP 2017 Workshop on Building Linguistically Generalizable NLP Systems, 2017
A Simple Method for Clarifying Sentences with Coordination Ambiguities,
Michael White, Manjuan Duan, David L. King, INLG, 2017
Does String-Based Neural MT Learn Source Syntax?,
Xing Shi, Inkit Padhi, Kevin Knight, EMNLP, 2016
Prepositional Phrase Attachment over Word Embedding Products,
Pranava Swaroop Madhyastha, Xavier Carreras, Ariadna Quattoni, IWPT, 2017
Improving Sequence-to-Sequence Constituency Parsing
Lemao Liu, Muhua Zhu, Shuming Shi, AAAI, 2018
Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples
Vidur Joshi, Matthew Peters, Mark Hopkins, ACL, 2018
Madly Ambiguous: A Game for Learning about Structural Ambiguity and Why It's Hard for Computers,
Ajda Gokcen, Ethan Hill, Michael White, NAACL (Demonstration), 2018
Parsing Speech: A Neural Approach to Integrating Lexical and Acoustic-Prosodic Information,
Trang Tran, Shubham Toshniwal, Mohit Bansal, Kevin Gimpel, Karen Livescu, Mari Ostendorf, NAACL, 2018
Automated Extraction of Semantic Legal Metadata Using Natural Language Processing,
Amin Sleimi, Nicolas Sannier, Mehrdad Sabetzadeh, Lionel Briand, John Dann, The 26th IEEE International Requirements Engineering Conference, 2018
An Empirical Investigation of Error Types in Vietnamese Parsing,
Quy T. Nguyen, Yusuke Miyao, Hiroshi Noji, Nhung T.H. Nguyen, CoLing, 2018
Sprucing up the trees - Error detection in treebanks,
Ines Rehbein, Josef Ruppenhofer, CoLing, 2018
Natural Language Parsing: Progress and Challenges,
Carlos Gomez-Rodriguez, Boletin de Estadistica e Investigacion Operativa, 2018
Constituent Parsing as Sequence Labeling,
Carlos Gomez-Rodriguez, David Vilares, EMNLP, 2018
The status of function words in dependency grammar: A critique of Universal Dependencies (UD),
Timothy Osborne, Kim Gerdes, Glossa, 2019
Visual Disambiguation of Prepositional Phrase Attachments: Multimodal Machine Learning for Syntactic Analysis Correction
Sebastien Delecraz, Leonor Becerra-Bonache, Alexis Nasr, Frederic Bechet, Benoit Favre, Advances in Computational Intelligence, 2019
On the Role of Style in Parsing Speech with Neural Models
Trang Tran, Jiahong Yuan, Yang Liu, Mari Ostendorf, Interspeech, 2019
Integration of a Multilingual Preordering Component into a Commercial SMT Platform
Anita Ramm, Riccardo Superbo, Dimitar Shterionov, Tony O'Dowd, Alexander Fraser, The Prague Bulletin of Mathematical Linguistics, 2017
Using Prosody to Improve Dependency Parsing
Hussein Ghaly, Michael Mandel, International Conference on Speech Prosody, 2020
Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers,
Graham Neubig, ACL, 2013
Behavior Analysis of NLI Models: Uncovering the Influence of Three Factors on Robustness,
Vicente Ivan Sanchez Carmona, Jeff Mitchell, Sebastian Riedel, NAACL, 2018
Better, Faster, Stronger Sequence Tagging Constituent Parsers,
David Vilares, Mostafa Abdou, Anders Sogaard, NAACL, 2019
Semi-supervised Relation Extraction from Monolingual Dictionary for Russian WordNet
Daniil Alexeyevsky, CICLing, 2017
A New Tool for Benchmarking and Assessing Arabic Syntactic Parsers
Younes Jaafar, Karim Bouzoubaa, ICALP: Arabic Language Processing: From Theory to Practice, 2017
Improving Shift‐Reduce Phrase‐Structure Parsing with Constituent Boundary Information
Wenliang Chen, Muhua Zhu, Min Zhang, Yue Zhang, Jingbo Zhu, Computational Intelligence, 2016
Unlexicalized Transition-based Discontinuous Constituency Parsing,
Maximin Coavoux, Benoît Crabbé, Shay B. Cohen, TACL, 2019
Neural Reranking Improves Subjective Quality of Machine Translation: NAIST at WAT2015
Graham Neubig, Makoto Morishita, Satoshi Nakamura, WAT, 2015
Modeling Selectional Preferences of Verbs and Nouns in String-to-Tree Machine Translation,
Maria Nădejde, Alexandra Birch, Philipp Koehn, CMT, 2016
Towards Replicability in Parsing,
Daniel Dakota, Sandra Kübler, RANLP, 2017
Automated Generation of Test Suites for Error Analysis of Concept Recognition Systems,
Tudor Groza, Karin Verspoor, ALTA, 2014
Graph-Based Lexicon Regularization for PCFG With Latent Annotations
Xiaodong Zeng, Derek F. Wong, Lidia S. Chao, Isabel Trancoso, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015
Embedding Syntax and Semantics of Prepositions via Tensor Decomposition,
Hongyu Gong, Suma Bhat, Pramod Viswanath, NAACL, 2018
Valency-Augmented Dependency Parsing,
Tianze Shi, Lillian Lee, EMNLP, 2018
Distributional regularities of verbs and verbal adjectives: Treebank evidence and broader implications,
Daniël de Kok, Patricia Fischer, Corina Dima, Erhard Hinrichs, IWTLT, 2017
Une note sur l'analyse du constituant pour le français
Jungyeul Park, TALN, 2018
An Empirical Exploration of Local Ordering Pre-training for Structured Prediction,
Zhisong Zhang, Xiang Kong, Lori Levin, Eduard Hovy, Findings of EMNLP, 2020
Unsupervised Parsing with S-DIORA: Single Tree Encoding for Deep Inside-Outside Recursive Autoencoders
Andrew Drozdov, Subendhu Rongal, Yi-Pei Chen, Tim O'Gorman, Mohit Iyyer, Andrew McCallum, EMNLP, 2020
Strongly Incremental Constituency Parsing with Graph Neural Networks
Kaiyu Yang, Jia Deng, NeurIPS, 2020
A note on constituent parsing for Korean
Mija Kim, Jungyeul Park, Natural Language Engineering, 2020
Parsers Know Best: German PP Attachment Revisited
Bich-Ngoc Do, Ines Rehbein, CoLing, 2020
Abstract Poster Citations (17)
Our submission was a reduced version of the system described in Haghighi and Klein (2010), with extensions to improve mention detection to suit the OntoNotes annotation scheme. Including exact matching mention detection in this shared task added a new and challenging dimension to the problem, particularly for our system, which previously used a very permissive detection method. We improved this aspect of the system by adding filters based on the annotation scheme for OntoNotes and analysis of system behavior on the development set. These changes led to improvements in coreference F-score of 10.06, 5.71, 6.78, 6.63, and 3.09 on the MUC, B3, Ceaf-e, Ceaf-m, and Blanc metrics, respectively, and a final task score of 47.10.
Improving mention detection for Basque based on a deep error analysis,
Ander Soraluze, Olatz Arregi, Xabier Arregi, Arantza Diaz de Ilarraza, Natural Language Engineering, 2016
Mention detection: First steps in the development of a Basque coreference resolution system,
Ander Soraluze, Olatz Arregi, Xabier Arregi, Klara Ceberio, Arantza Diaz de Ilarraza, Proceedings of KONVENS, 2012
Detecting Apposition for Text Simplification in Basque,
Itziar Gonzalez-Dios, Maria Jesus Aranzabe, Arantza Diaz de Ilarraza, Ander Soraluze, , 2013
Co-reference Resolution in Farsi Corpora,
Maryam Nazaridoust, Behrouz Minaie Bidgoli, Siavash Nazaridoust, , 2014
Removing the Training Wheels: A Coreference Dataset that Entertains Humans and Challenges Computers,
Anupam Guha, Mohit Iyyer, Danny Bouman, Jordan Boyd-Graber, NAACL, 2015
ARRAU: Linguistically-Motivated Annotation of Anaphoric Descriptions,
Olga Uryupina, Ron Artstein, Antonella Bristot, Federica Cavicchio, Kepa J. Rodriguez, Massimo Poesio, LREC, 2016
A scaffolding approach to coreference resolution integrating statistical and rule-based models,
Heeyoung Lee, Mihai Surdeanu, Dan Jurafsky, NLE, 2017
Apports des analyses syntaxiques pour la détection automatique de mentions dans un corpus de français oral,
Loic Grobol, Isabelle Tellier, Eric De La Clergerie, Marco Dinarelli, Frederic Landragin, Actes de la 24e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), 2017
Testing TileAttack with Three Key Audiences
Chris Madge, Massimo Poesio, Udo Kruschwitz, Jon Chamberlain, LREC, 2018
Mention Detection Using Pointer Networks for Coreference Resolution
Cheoneum Park, Changki Lee, Soojong Lim, ETRI Journal, 2017
Annotating a broad range of anaphoric phenomena, in a variety of genres: the ARRAU Corpus
Olga Uryupina, Ron Artstein, Antonella Bristot, Federica Cavicchio, Francesca Delogu, Kepa J. Rodriguez, Massimo Poesio, Natural Language Engineering, 2020
Detecting Non-reference and Non-anaphoricity
Olga Uryupina, Mijail Kabadjov, Massimo Poesio, Anaphora Resolution, 2016
Multilingual Mention Detection for Coreference Resolution,
Olga Uryupina, Alessandro Moschitti, IJCNLP, 2013
Mining the Biomedical Literature
Claudiu Mihaila, et al., Healthcare Data Analytics, 2015
Crowdsourcing and Aggregating Nested Markable Annotations,
Chris Madge, Juntao Yu, Jon Chamberlain, Udo Kruschwitz, Silviu Paun, Massimo Poesio, ACL, 2019
Evaluation Campaigns
Marta Recasens, Sameer Pradhan, Anaphora Resolution, 2016
Systems Architecture and Algorithm for Co-reference Resolution in Texts
N. Mostafavi, M. H. Sadredini, S. Rahmani, S. M. Fakhrahmad, International Journal of Electronics Communication and Computer Engineering, 2014
Abstract DOI ArXiv Citations (44)
We identify the pattern of microscopic dynamical relaxation for a two-dimensional glass-forming liquid. On short time scales, bursts of irreversible particle motion, called cage jumps, aggregate into clusters. On larger time scales, clusters aggregate both spatially and temporally into avalanches. This propagation of mobility takes place along the soft regions of the systems, which have been identified by computing isoconfigurational Debye-Waller maps. Our results characterize the way in which dynamical heterogeneity evolves in moderately supercooled liquids and reveal that it is astonishingly similar to the one found for dense glassy granular media.
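A schematic version of the first step in this kind of analysis, flagging cage jumps as displacements that exceed a particle's typical in-cage vibration amplitude, is sketched below with numpy; the synthetic trajectory, window length, and threshold factor are illustrative choices, not those used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 200, 3
traj = 0.05 * rng.standard_normal((T, N, 2))   # in-cage rattling around fixed positions
traj[120:, 0, :] += 1.0                        # particle 0 hops to a new cage at t = 120

def cage_jumps(traj, window=10, factor=4.0):
    # Displacement of each particle over a short time window.
    disp = np.linalg.norm(traj[window:] - traj[:-window], axis=-1)  # shape (T - window, N)
    baseline = np.median(disp, axis=0)          # per-particle typical (in-cage) displacement
    return disp > factor * baseline             # boolean jump mask per (time, particle)

jumps = cage_jumps(traj)
print("particles that jump:", np.where(jumps.any(axis=0))[0])  # expect particle 0
```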
Structural phases in non-additive soft-disk mixtures: Glasses, substitutional order, and random tilings
A. Widmer-Cooper, P. Harrowell, Journal of Chemical Physics, 2011
Transient slowing down relaxation dynamics of the supercooled dusty plasma liquid after quenching
Yen-Shuo Su, Chong-Wai Io, Lin I, Physical Review E, 2012
Finite Size Scaling of the Dynamical Free-Energy in a Kinetically Constrained Model
Thierry Bodineau, Vivien Lecomte, Cristina Toninelli, Journal of Statistical Physics, 2012
Manifestations of dynamical facilitation in glassy materials
Yael S. Elmatad, Aaron S. Keys, Physical Review E, 2012
Trajectory entanglement in dense granular materials
James G Puckett, Frederic Lechenault, Karen E Daniels, Jean-Luc Thiffeault, Journal of Statistical Mechanics: Theory and Experiment, 2012
Excitations Are Localized and Relaxation Is Hierarchical in Glass-Forming Liquids
Aaron S. Keys, Lester O. Hedges, Juan P. Garrahan, Sharon C. Glotzer, David Chandler, Physical Review, 2011
Perspective: Supercooled liquids and glasses
M. D. Ediger, Peter Harrowell, Journal of Chemical Physics, 2012
Gel formation and aging in weakly attractive nanocolloid suspensions at intermediate concentrations
Hongyu Guo, S. Ramakrishnan, James L. Harden, Robert L. Leheny, Journal of Chemical Physics, 2011
Mean-field cage theory for the random close packed state of a metastable hard-sphere glass
Xian-Zhi Wang, Physica A, 2012
Theoretical perspective on the glass transition and amorphous materials
Ludovic Berthier, Giulio Biroli, Reviews of Modern Physics, 2011
Dynamical facilitation decreases when approaching the granular glass transition
R. Candelier, O. Dauchot, G. Biroli, Europhysics Letters, 2010
Jammed Particle Configurations and Dynamics in High-Density Lennard-Jones Binary Mixtures in Two Dimensions
Hayato Shiba, Akira Onuki, Progress of Theoretical Physics Supplement, 2010
Dynamic heterogeneities, boson peak, and activation volume in glass-forming liquids
L. Hong, V. N. Novikov, A. P. Sokolov, Physical Review E, 2011
From Coupled Elementary Units to the Complexity of the Glass Transition
Christian Rehwald, Oliver Rubner, Andreas Heuer, Physical Review Letters, 2010
Dynamics of thermal vibrational motions and stringlike jump motions in three-dimensional glass-forming liquids
Takeshi Kawasaki, Akira Onuki, AIP: The Journal of Chemical Physics, 2013
Local elastic response measured near the colloidal glass transition
D. Anderson, D. Schaar, H. G. E. Hentschel, J. Hay, Piotr Habdas, Eric R. Weeks, AIP: The Journal of Chemical Physics, 2013
Dynamical Heterogeneities in Glasses, Colloids, and Granular Media
Ludovic Berthier, Giulio Biroli, Jean-Philippe Bouchaud, Luca Cipelletti, Wim van Saarloos, , 2011
Microscopic Picture of Cooperative Processes in Restructuring Gel Networks
Jader Colombo, Asaph Widmer-Cooper, Emanuela Del Gado, Physical Review Letters, 2013
Distribution of local relaxation events in an aging three-dimensional glass: Spatiotemporal correlation and dynamical heterogeneity
Anton Smessaert, Jorg Rottler, Physical Review E, 2013
Avalanches mediate crystallization in a hard-sphere glass
Eduardo Sanz, Chantal Valeriani, Emanuela Zaccarelli, Wilson C. K. Poon, Michael E. Cates, Peter N. Pusey, Proceedings of the National Academy of Sciences, 2013
Distributions of single-molecule properties as tools for the study of dynamical heterogeneities in nanoconfined water
G B Suffritti, P Demontis, J Gulin-Gonzalez, M Masia, Journal of Physics: Condensed Matter, 2014
Stress-induced microcracking and cooperative motion of cold dusty plasma liquids
Chi Yang, Lin I, Physical Review E, 2014
Dynamics in a tetrahedral network glassformer: Vibrations, network rearrangements, and diffusion
Takeshi Kawasaki, Kang Kim, Akira Onuki, The Journal of Chemical Physics, 2014
Order parameter for structural heterogeneity in disordered solids
Hua Tong, Ning Xu, Physical Review E, 2014
Relaxation pathway confinement in glassy dynamics
J. A. Rodriguez Fris, M. A. Frechero, G. A. Appignanesi, The Journal of Chemical Physics, 2014
Dynamical Heterogeneity in the Supercooled Liquid State of the Phase Change Material GeTe
Gabriele Cesare Sosso, Jader Colombo, Joerg Behler, Emanuela Del Gado, Marco Bernasconi, The Journal of Physical Chemistry B, 2014
Flexible confinement leads to multiple relaxation regimes in glassy colloidal liquids
Ian Williams, Erdal C. Oguz, Paul Bartlett, Hartmut Lowen, C. Patrick Royall, The Journal of Chemical Physics, 2015
Mutual information reveals multiple structural relaxation mechanisms in a model glass former
Andrew J. Dunleavy, Karoline Wiesner, Ryoichi Yamamoto, C. Patrick Royall, Nature Communications, 2015
The role of local structure in dynamical arrest
C. Patrick Royall, Stephen R. Williams, Physics Reports, 2015
Cooling the two-dimensional short spherocylinder liquid to the tetratic phase: Heterogeneous dynamics with one-way coupling between rotational and translational hopping
Yen-Shuo Su, Lin I, Physical Review E, 2015
Computer simulation studies of the influence of side alkyl chain on glass transition behavior of carbazole trimer
Chunyang Yu, Li Ma, Wei Huang, Yongfeng Zhou, Jingui Qin, Deyue Yan, Science China Chemistry, 2017
Structure-property relationships from universal signatures of plasticity in disordered solids
E. D. Cubuk, R. J. S. Ivancic, S. S. Schoenholz, D. J. Strickland, A. Basu, Z. S. Davidson, J. Fontaine, J. L. Hor, Y.-R. Huang, Y. Jiang, N. C. Keim, K. D. Koshigan, J. A. Lefever, T. Liu, X.-G. Ma, D. J. Magagnosc, E. Morrow, C. P. Ortiz, J. M. Rieser, A. Shavit, T. Still, Y. Xu, Y. Zhang, K. N. Nordstrom, P. E. Arratia, R. W. Carpick, D. J. Durian, Z. Fakhraai, D. J. Jerolmack, Daeyeon Lee, Ju Li, R. Riggleman, K. T. Turner, A. G. Yodh, D. S. Gianola, Andrea J. Liu, Science, 2017
Continuous-time random-walk approach to supercooled liquids: Self-part of the van Hove function and related quantities
J. Helfferich, J. Brisch, H. Meyer, O. Benzerara, F. Ziebert, J. Farago, J. Baschnagel, The European Physical Journal E, 2018
Heterogeneous Activation, Local Structure, and Softness in Supercooled Colloidal Liquids
Xiaoguang Ma, Zoey S. Davidson, Tim Still, Robert J. S. Ivancic, S. S. Schoenholz, A. J. Liu, A. G. Yodh, Physical Review Letters, 2019
The Glassy Dynamics Predicted by the Mutual Role of the Free and Activation Volumes
Wycliffe Kiprop Kipnusu, Mohamed Elsayed, Ciprian Iacob, Sebastian Pawlus, Reinhard Krause-Rehberg, Marian Paluch, Soft Matter, 2019
Multiscale Coherent Excitations in Microscopic Acoustic Wave Turbulence of Cold Dusty Plasma Liquids
Hao-Wei Hu, Wen Wang, Lin I, Physical Review Letters, 2019
Active particles sense micromechanical properties of glasses
Celia Lozano, Juan Ruben Gomez-Solano, Clemens Bechinger, Nature Materials, 2019
Attractive versus truncated repulsive supercooled liquids: The dynamics is encoded in the pair correlation function
Francois P. Landes, Giulio Biroli, Olivier Dauchot, Andrea Liu, David Reichman, Physical Review E, 2020
Elucidation of the Nature of Structural Relaxation in Glassy D-Sorbitol
Marcin Krynski, Felix C. Mocanu, Stephen R. Elliott, The Journal of Physical Chemistry B, 2020
Diffusion of Anisotropic Particles in Random Energy Landscapes—An Experimental Study
Juan Pablo Segovia-Gutiérrez, Manuel A. Escobedo-Sánchez, Erick Sarmiento-Gómez, Stefan U. Egelhaaf, Frontiers in Physics, 2020
Application of machine learning approach in disordered materials
JiaQi Wu, YiTao Sun, WeiHua Wang, MaoZhi Li, SCIENTIA SINICA Physica, Mechanica & Astronomica, 2020
Collective diffusion within the superionic regime of Bi2O3
Chris E. Mohn, Marcin Krynski, Physical Review B, 2020
Unveiling the predictive power of static structure in glassy systems
V. Bapst, T. Keck, A. Grabska-Barwinska, C. Donner, E. D. Cubuk, S. S. Schoenholz, A. Obika, A. W. R. Nelson, T. Back, D. Hassabis, P. Kohli, Nature Physics, 2020
Local Dynamics of Excitations in Glassy Liquids
H. T. Lee, J. Landy, J. U. Kim, Y. S. Jho, Journal of the Korean Physical Society, 2020
Because English is a low morphology language, current statistical parsers tend to ignore morphology and accept some level of redundancy. This paper investigates how costly such redundancy is for a lexicalised grammar such as CCG.

We use morphological analysis to split verb inflectional suffixes into separate tokens, so that they can receive their own lexical categories. We find that this improves accuracy when the splits are based on correct POS tags, but that errors in gold standard or automatically assigned POS tags are costly for the system. This shows that the parser can benefit from morphological analysis, so long as the analysis is correct.
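The suffix-splitting step described above can be pictured with a small sketch: given a POS tag, an inflected verb is rewritten as a stem token plus a separate suffix token that can then receive its own lexical category. The suffix table and the "+ed"-style token format below are hypothetical illustrations, not the morphological analyser used in the paper.

```python
# Illustrative sketch of POS-conditioned suffix splitting; the suffix
# table and token format are hypothetical, not the paper's analyser.
SUFFIX_BY_POS = {
    "VBD": "ed",   # simple past:        "walked"  -> "walk" "+ed"
    "VBG": "ing",  # present participle: "walking" -> "walk" "+ing"
    "VBZ": "s",    # 3sg present:        "walks"   -> "walk" "+s"
}

def split_verb(token, pos):
    """Split an inflected verb into stem and suffix tokens, given its POS tag."""
    suffix = SUFFIX_BY_POS.get(pos)
    if suffix and token.endswith(suffix) and len(token) > len(suffix):
        stem = token[: -len(suffix)]
        return [stem, "+" + suffix]
    return [token]  # leave other tokens (and tagging errors) untouched

print(split_verb("walked", "VBD"))  # ['walk', '+ed']
print(split_verb("walked", "NN"))   # ['walked']  (wrong tag -> no split)
```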
Morpho-syntactic Lexical Generalization for CCG Semantic Parsing,
Adrienne Wang, Tom Kwiatkowski, Luke Zettlemoyer, EMNLP, 2014
Wide-Coverage Parsing, Semantics, and Morphology,
Ruket Cakici, Mark Steedman, Cem Bozsahin, 2018
CCG Supertagging Using Morphological and Dependency Syntax Information,
Ngoc Luyen Le, Yannis Haralambous, International Conference on Computational Linguistics and Intelligent Text Processing, 2019
Abstract Code PDF Slides Citations (13)
We propose a novel self-training method for a parser which uses a lexicalised grammar and supertagger, focusing on increasing the speed of the parser rather than its accuracy. The idea is to train the supertagger on large amounts of parser output, so that the supertagger can learn to supply the supertags that the parser will eventually choose as part of the highest scoring derivation. Since the supertagger supplies fewer supertags overall, the parsing speed is increased. We demonstrate the effectiveness of the method using a CCG supertagger and parser, obtaining significant speed increases on newspaper text with no loss in accuracy. We also show that the method can be used to adapt the CCG parser to new domains, obtaining accuracy and speed improvements for Wikipedia and biomedical text.
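The training loop implied by this method can be sketched at a high level as follows; the parser and supertagger objects are hypothetical stand-ins rather than the real C&C interfaces. The key point is that the supertags appearing in the parser's best derivation become the training labels.

```python
# High-level sketch of the self-training loop described above.
# `parser`, `supertagger`, and the corpus reader are hypothetical
# stand-ins, not the C&C tools' actual APIs.

def self_train(parser, supertagger, raw_sentences, rounds=1):
    for _ in range(rounds):
        auto_data = []
        for sent in raw_sentences:
            derivation = parser.parse(sent)      # highest-scoring derivation
            if derivation is None:
                continue                         # skip parse failures
            # The supertags the parser actually used become "gold" labels.
            auto_data.append(list(zip(sent, derivation.supertags)))
        # Retrain the supertagger on parser output: it learns to propose the
        # categories the parser would have chosen anyway, so fewer categories
        # per word are needed and parsing gets faster.
        supertagger.train(auto_data)
    return supertagger
```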
Chart pruning for fast lexicalised-grammar parsing,
Yue Zhang, Byung-Gyu Ahn, Stephen Clark, Curt Van Wyk, James R. Curran, Laura Rimell, CoLing, 2010
A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing,
Michael Auli, Adam Lopez, ACL, 2011
Efficient CCG Parsing: A* versus Adaptive Supertagging,
Michael Auli, Adam Lopez, ACL, 2011
Exciting and interesting: issues in the generation of binomials
Ann Copestake, Aurelie Herbelot, Proceedings of the UCNLG+Eval: Language Generation and Evaluation Workshop, 2011
Automatic recognition of conceptualization zones in scientific articles and two life science applications,
Maria Liakata, Shyamasree Saha, Simon Dobnik, Colin Batchelor, Dietrich Rebholz-Schuhmann, Bioinformatics, 2012
Frontier Pruning for Shift-Reduce CCG Parsing,
Stephen Merity, James Curran, ALTA, 2011
Ubertagging: Joint Segmentation and Supertagging for English,
Rebecca Dridan, EMNLP, 2013
A* CCG Parsing with a Supertag-factored Model,
Mike Lewis, Mark Steedman, EMNLP, 2014
CCG Supertagging with a Recurrent Neural Network,
Wenduan Xu, Michael Auli, Stephen Clark, ACL, 2015
Imitation Learning of Agenda-based Semantic Parsers,
Jonathan Berant, Percy Liang, TACL, 2015
Shift-Reduce Constituent Parsing with Neural Lookahead Features,
Jiangming Liu, Yue Zhang, TACL, 2017
Syntax-aware Neural Semantic Role Labeling with Supertags,
Jungo Kasai, Dan Friedman, Robert Frank, Dragomir Radev, Owen Rambow, NAACL, 2019
Supertagging Combinatory Categorial Grammar with Attentive Graph Convolutional Networks
Yuanhe Tian, Yan Song, Fei Xia, EMNLP, 2020
Parsers are often the bottleneck for data acquisition, processing text too slowly to be widely applied. One way to improve the efficiency of parsers is to construct more confident statistical models. More training data would enable the use of more sophisticated features and also provide more evidence for current features, but gold standard annotated data is limited and expensive to produce.

We demonstrate faster methods for training a supertagger using hundreds of millions of automatically annotated words, constructing statistical models that further constrain the number of derivations the parser must consider. By introducing new features and using an automatically annotated corpus we are able to double parsing speed on Wikipedia and the Wall Street Journal, and gain accuracy slightly when parsing Section 00 of the Wall Street Journal.
Scalable syntactic processing will underpin the sophisticated language technology needed for next generation information access. Companies are already using NLP tools to create web-scale question answering and ‘semantic search’ engines. Massive amounts of parsed web data will also allow the automatic creation of semantic knowledge resources on an unprecedented scale. The web is a challenging arena for syntactic parsing, because of its scale and variety of styles, genres, and domains.

The goals of our workshop were to scale and adapt an existing wide-coverage parser to Wikipedia text; improve the efficiency of the parser through various methods of chart pruning; use self-training to improve the efficiency and accuracy of the parser; use the parsed wiki data for an innovative form of bootstrapping to make the parser both more efficient and more accurate; and finally use the parsed web data for improved disambiguation of coordination structures, using a variety of syntactic and semantic knowledge sources.

The focus of the research was the C&C parser (Clark and Curran, 2007c), a state-of-the-art statistical parser based on Combinatory Categorial Grammar (CCG). The parser has been evaluated on a number of standard test sets achieving state-of-the-art accuracies. It has also recently been adapted successfully to the biomedical domain (Rimell and Clark, 2009). The parser is surprisingly efficient, given its detailed output, processing tens of sentences per second. For web-scale text processing, we aimed to make the parser an order of magnitude faster still. The C&C parser is one of only very few parsers currently available which has the potential to produce detailed, accurate analyses at the scale we were considering.
Introducing More Features to Improve Chinese Shift-Reduce Parsing
Hongxian Wang, Qiang Zhou, Liou Chen, Asia Pacific Signal and Information Processing Association, Annual Summit and Conference, 2011
Probabilistic models of similarity in syntactic context
Diarmuid Seaghdha, Anna Korhonen, EMNLP, 2011
Evaluation Report of the third Chinese Parsing Evaluation: CIPS-SIGHAN-ParsEval-2012
Qiang Zhou, Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing, 2012
Evaluation Report of the third Chinese Parsing Evaluation: CIPS-SIGHAN-ParsEval-2014
Qiang Zhou, Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, 2014
Interpreting compound nouns with kernel methods
Diarmuid Seaghdha, Ann Copestake, Natural Language Engineering, 2013
CYK-based Decision and Forest Parsing Algorithm for Combinatory Categorial Grammar
Qingjiang Wang, Zhengping Wang, Lin Zhang, Journal of Information and Computational Science, 2014
A Streaming Dataflow Implementation of Parallel Cocke–Younger–Kasami Parser
D. Bojic, M. Bojovic, 2016
Leveraging a Semantically Annotated Corpus to Disambiguate Prepositional Phrase Attachment
Guy Emerson, Ann Copestake, Proceedings of the 11th International Conference on Computational Semantics, 2015
How important is syntactic parsing accuracy? An empirical evaluation on rule-based sentiment analysis,
Carlos Gomez-Rodriguez, Iago Alonso-Alonso, David Vilares, Artificial Intelligence Review, 2017
Statistical parsers are crucial for tackling the grand challenges of Natural Language Processing. The most effective approaches to these tasks are data driven, but parsers are too slow to be effectively used on large data sets. State-of-the-art parsers generally cannot process more than one sentence a second, and the fastest cannot process more than fifty sentences a second. The situation is even worse when they are applied outside of the domain of their training data. The fastest systems have two components: a parser, which has time complexity O(n³), and a supertagger, which has linear time complexity. By shifting work from the parser to the supertagger we dramatically improve speed.

This work demonstrates several major novel ideas that improve parsing efficiency. The core idea is that the tags chosen by the parser are gold standard data for its supertagger. This leads to the second surprising conceptual development, that decreasing tagging accuracy can improve parsing performance. To demonstrate these ideas required extensive development of the C&C supertagger, including implementation of more efficient estimation algorithms and parallelisation of the training process. This was particularly challenging as the C&C supertagger is a state-of-the-art high performance system designed with a focus on speed rather than flexibility.

I was able to significantly improve performance on the standard evaluation corpus by using the parser to generate extremely large new resources for supertagger training. I have also shown that these methods provide significant benefits on another domain, Wikipedia text, without the cost of generating human annotated data sets. These parsing performance gains occur while supertagging accuracy decreases.

Despite extensive use of supertaggers to improve parsing efficiency there has been no comprehensive study of the interaction between a supertagger and a parser. I present the first systematic exploration of the relationship, show the potential benefits of understanding it, and demonstrate a novel algorithm for optimising the parameters that define it.

I have constructed models that process newspaper text 86% faster than previously, and Wikipedia text 30% faster, without any loss in accuracy and without the aid of extra gold standard resources in either domain. This work will lead directly to improvements in a range of Natural Language Processing tasks by enabling the use of far more parsed data.
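One common way a supertagger constrains a parser, and a natural home for the parameter optimisation mentioned above, is multi-tagging with a probability cutoff: each word keeps every category whose probability is within a factor β of its best category. The sketch below illustrates that cutoff with made-up probabilities; the exact parameterisation used in this work may differ.

```python
# Sketch of the supertagger/parser trade-off referred to above: lowering
# beta passes more categories to the parser (slower, safer); raising it
# passes fewer (faster, riskier). The probabilities here are illustrative.

def multitag(category_probs, beta):
    """category_probs: dict mapping CCG category -> probability for one word."""
    best = max(category_probs.values())
    return {cat: p for cat, p in category_probs.items() if p >= beta * best}

word_probs = {r"(S\NP)/NP": 0.62, r"S\NP": 0.25, r"(S\NP)/PP": 0.09, r"N": 0.04}
print(multitag(word_probs, beta=0.1))  # three categories survive
print(multitag(word_probs, beta=0.5))  # only the best category survives
```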
Manually maintaining comprehensive databases of multi-word expressions, for example Verb-Particle Constructions (VPCs), is infeasible. We describe a new classifier for potential VPCs, which uses information in the Google Web1T corpus to perform a simple linguistic constituency test. Specifically, we consider the fronting test, comparing the frequencies of the two possible orderings of the given verb and particle. Using only a small set of queries for each verb-particle pair, the system was able to achieve an F-score of 78.4% in our evaluation while processing thousands of queries a second.
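A toy version of the frequency comparison described above might look like the sketch below; the n-gram counts, query forms, and threshold are illustrative placeholders rather than the actual Web1T queries used in the paper.

```python
# Illustrative sketch of a frequency-based fronting test for candidate
# verb-particle constructions (VPCs); counts and threshold are toy values.

def ngram_count(tokens):
    """Return the corpus frequency of a token sequence (e.g. from Web1T).

    Stubbed with toy numbers so the sketch runs standalone."""
    toy = {("gave", "up"): 300000, ("up", "gave"): 40,
           ("ran", "up"): 120000, ("up", "ran"): 5000}
    return toy.get(tuple(tokens), 0)

def looks_like_vpc(verb, particle, threshold=100.0):
    """Classify a verb-particle pair by comparing the two orderings' counts."""
    forward = ngram_count([verb, particle]) + 1   # add-one smoothing
    fronted = ngram_count([particle, verb]) + 1
    return forward / fronted > threshold

print(looks_like_vpc("gave", "up"))  # True: the fronted order is very rare
print(looks_like_vpc("ran", "up"))   # False with these toy counts
```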
New Tools for Web-Scale N-grams,
Dekang Lin, Kenneth Church, Heng Ji, Satoshi Sekine, David Yarowsky, Shane Bergsma, Kailash Patil, Emily Pitler, Rachel Lathbury, Vikram Rao, Kapil Dalwani, Sushant Narsale, LREC, 2010
Corpus-based Extraction of Japanese Compound Verbs,
James Breen, Timothy Baldwin, ALTA, 2009
Predicting the Semantic Compositionality of Prefix Verbs,
Shane Bergsma, Aditya Bhargava, Hua He, Grzegorz Kondrak, EMNLP, 2010
Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding,
Ching-Yun Chang, Stephen Clark, EMNLP, 2010
POS Tagging of English Particles for Machine Translation,
Jianjun Ma, Degen Huang, Haixia Liu, Wenfeng Sheng, MT Summit of the International Association for Machine Translation, 2011
Automatic Classification of German an Particle Verbs,
Sylvia Springorum, Sabine Schulte im Walde, Antje Roßdeutscher, LREC, 2012
Practical Linguistic Steganography using Contextual Synonym Substitution and a Novel Vertex Coding Method,
Ching-Yun Chang, Stephen Clark, CL, 2014
The Secret's in the Word Order: Text-to-Text Generation for Linguistic Steganography,
Ching-Yun Chang, Stephen Clark, CoLing, 2012
This paper considers the homogeneous packing of binary hard spheres in an equimolar stoichiometry, and postulates the densest packing at each sphere size ratio. Monte Carlo simulated annealing optimizations are seeded with all known atomic inorganic crystal structures, and the search is performed within the degrees of freedom associated with each homogeneous AB structure type. Structures isopointal to the FeB structure type are found to have the highest packing fraction at all sphere size ratios. The optimized structures match or improve on the best previously demonstrated packings of this type, and show that compound structures can pack more densely than segregated close-packed structures at all radius ratios less than 0.62.
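The Monte Carlo simulated annealing step can be pictured with the short sketch below, which anneals a set of structural parameters to maximise packing fraction; the move generator and packing-fraction evaluator are hypothetical placeholders for the structure-specific geometry, not the optimisation code used in the paper.

```python
import math
import random

# Minimal sketch of simulated annealing over a structure type's free
# parameters (cell shape, free coordinates), maximising packing fraction.
# `propose_move` and `packing_fraction` are placeholder callables.

def anneal(params, packing_fraction, propose_move,
           t_start=0.05, t_end=1e-4, steps=50_000, seed=0):
    rng = random.Random(seed)
    current, current_phi = params, packing_fraction(params)
    best, best_phi = current, current_phi
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / steps)   # exponential cooling
        candidate = propose_move(current, rng)
        phi = packing_fraction(candidate)
        # Accept denser packings always, looser ones with Boltzmann-like probability.
        if phi >= current_phi or rng.random() < math.exp((phi - current_phi) / t):
            current, current_phi = candidate, phi
            if phi > best_phi:
                best, best_phi = candidate, phi
    return best, best_phi

# Toy usage: a one-parameter "packing fraction" peaked at x = 0.3.
toy_phi = lambda x: math.exp(-50 * (x - 0.3) ** 2)
toy_move = lambda x, rng: min(1.0, max(0.0, x + rng.uniform(-0.05, 0.05)))
print(anneal(0.9, toy_phi, toy_move))  # converges near (0.3, ~1.0)
```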
Phase Diagram and Structural Diversity of the Densest Binary Sphere Packings
Adam B. Hopkins, Yang Jiao, Frank H. Stillinger, Salvatore Torquato, Physical Review Letters, 2011
Densest binary sphere packings
Adam B. Hopkins, Frank H. Stillinger, Salvatore Torquato, Physical Review E, 2012
Geometrical Frustration and Static Correlations in a Simple Glass Former
Benoit Charbonneau, Patrick Charbonneau, Gilles Tarjus, Physical Review Letters, 2012
Structural phases in non-additive soft-disk mixtures: Glasses, substitutional order, and random tilings
Asaph Widmer-Cooper, Peter Harrowell, Journal of Chemical Physics, 2011
Phase diagram of hard snowman-shaped particles
Matthew Dennison, Kristina Milinković, Marjolein Dijkstra, Journal of Chemical Physics, 2012
Dense Sphere Packing in the NaZn13 Structure Type
Toby S. Hudson, The Journal of Physical Chemistry C, 2010
Electrophoretic deposition of binary energetic composites
Kyle Thomas Sullivan, Marcus Andre Worsley, Joshua David Kuntz, Alex Eydmann Gash, Combustion and Flame, 2012
Multicomponent periodic nanoparticle superlattices
Paul Podsiadlo, Galyna Krylova, Arnaud Demortiere, Elena Shevchenko, Journal of Nanoparticle Research, 2011
Structural searches using isopointal sets as generators: densest packings for binary hard sphere mixtures
Toby S. Hudson, Peter Harrowell, Journal of Physics: Condensed Matter, 2011
Prediction of binary hard-sphere crystal structures
Laura Filion, Marjolein Dijkstra, Physical Review E, 2009
Efficient Method for Predicting Crystal Structures at Finite Temperature: Variable Box Shape Simulations
Laura Filion, Matthieu Marechal, Bas van Oorschot, Daniel Pelt, Frank Smallenburg, Marjolein Dijkstra, Physical Review Letters, 2009
Dense Packings of Hard Spheres of Different Sizes Based on Filling Interstices in Uniform Three-Dimensional Tilings
Toby S. Hudson, Peter Harrowell, The Journal of Physical Chemistry B, 2008
Crystal nucleation in binary hard-sphere mixtures: the effect of order parameter on the cluster composition
Ran Ni, Frank Smallenburg, Laura Filion, Marjolein Dijkstra, Molecular Physics, 2011
New High Density Packings of Similarly-Sized Binary Spheres
Patrick I. O'Toole, Toby S. Hudson, The Journal of Physical Chemistry C, 2011
Structural search for dense packing of concave and convex shapes in two dimensions
Nabiha T Elias, Toby S Hudson, Journal of Physics: Conference Series, 2012
On the Phase Behavior of Binary Mixtures of Nanoparticles
Avi Ben-Simon, Hagai Eshet, Eran Rabani, ACS Nano, 2013
Favoured Local Structures in Liquids and Solids: a 3D Lattice Model
Pierre Ronceray, Peter Harrowell, The Self Journal of Science, 2015
Packing concave molecules in crystals and amorphous solids: on the connection between shape and local structure
Cerridwen Jennings, Malcolm Ramsay, Toby Hudson, Peter Harrowell, Molecular Physics, 2015
Binary nanoparticle superlattices of soft-particle systems
Alex Travesset, Proceedings of the National Academy of Sciences, 2015
A Geometric-Structure Theory for Maximally Random Jammed Packings
Jianxiang Tian, Yaopengxiao Xu, Yang Jiao, Salvatore Torquato, Scientific Reports, 2015
Perspective: Basic understanding of condensed phases of matter via packing models,
Salvatore Torquato, The Journal of Chemical Physics, 2018
Using symmetry to elucidate the importance of stoichiometry in colloidal crystal assembly,
Nathan A. Mahynski, Evan Pretti, Vincent K. Shen, Jeetain Mittal, Nature Communications, 2019
The Influence of Softness on the Stability of Binary Colloidal Crystals
R. Allen LaCour, Carl Simon Adorf, Julia Dshemuchadse, Sharon C Glotzer, ACS Nano, 2019
Effect of surface texture, size ratio and large particle volume fraction on packing density of binary spherical mixtures
Chamod Hettiarachchi, W. K. Mampearachchi, Granular Matter, 2019
Observation of 9-Fold Coordinated Amorphous TiO2 at High Pressure
Yu Shu, Yoshio Kono, Itaru Ohira, Quanjun Li, Rostislav Hrubiak, Changyong Park, Curtis Kenney-Benson, Yanbin Wang, Guoyin Shen, The Journal of Physical Chemistry Letters, 2019
Phase Diagram and Structure Map of Binary Nanoparticle Superlattices from a Lennard-Jones Model
Shang Ren, Yang Sun, Feng Zhang, Alex Travesset, Cai-Zhuang Wang, Kai-Ming Ho, ACS Nano, 2020