Most work in NLP uses datasets with a diverse set of speakers. In practice, everyone speaks and writes slightly differently, and our models would be better if they accounted for that. This has been the motivation for a line of work by [Charlie Welch](http://cfwelch.com/) that I've been a collaborator on (in [CICLing 2019](https://www.jkk.name/publication/cicling19personal), [IEEE Intelligent Systems 2019](https://www.jkk.name/publication/ieee19personal/), [CoLing 2020](https://www.jkk.name/publication/coling20personal/), and this paper).
This paper explores two questions. First, what is the impact of a few key design decisions for word embeddings in language models? Second, based on the first answer, how can we improve results in the situation where we have more than 50 million words of text, but only a single GPU for training?
The simplest way to learn word vectors for rare words is to average the vectors of the words that appear in their contexts. Tweaking word2vec to make greater use of the context may do slightly better, but it's unclear whether the gain is real.
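
To make the averaging idea concrete, here is a minimal sketch. It is an illustration only, not the method from any of the papers: the function name `average_context_vector`, the `window` parameter, and the toy two-dimensional embeddings are all assumptions made for this example.

```python
import numpy as np


def average_context_vector(target, sentences, embeddings, window=2):
    """Estimate a vector for `target` as the mean of its context-word vectors.

    `sentences` is a list of token lists and `embeddings` maps known words
    to NumPy vectors. Both are hypothetical inputs for this sketch.
    """
    context_vectors = []
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != target:
                continue
            # Look at up to `window` tokens on each side of the rare word.
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                # Skip the rare word itself and words without a vector.
                if j != i and tokens[j] in embeddings:
                    context_vectors.append(embeddings[tokens[j]])
    if not context_vectors:
        return None  # The word never appeared next to a known word.
    return np.mean(context_vectors, axis=0)


# Toy data: "zyzzyva" is the rare word with no vector of its own.
embeddings = {
    "the": np.array([1.0, 0.0]),
    "cat": np.array([0.0, 1.0]),
    "sat": np.array([1.0, 1.0]),
}
sentences = [["the", "zyzzyva", "sat"], ["cat", "zyzzyva"]]
vec = average_context_vector("zyzzyva", sentences, embeddings)
```

Every context occurrence counts equally here. A real system would probably want to down-weight very frequent words such as "the", which is one way of making "greater use of the context".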