An Analysis of Neural Language Modeling at Multiple Scales (Merity et al., 2018)

Assigning a probability distribution over the next word or character in a sequence (language modeling) is a useful component of many systems, such as speech recognition and machine translation. Recently, neural networks have come to dominate in performance, with a range of clever innovations in network structure. This paper does not propose new models; rather, it examines current evaluation practice and how well carefully tuned baseline models can do.

The key observations for me were:

  • There are issues with the PTB dataset for character-level evaluation: it removes all punctuation, replaces numbers with 'N', and removes rare words (i.e. it is a character-level version of the token-level task). Given that the original Penn Treebank exists, I would have been interested to see a comparison with the PTB without any simplification. The other dataset, enwik8, makes sense as a testing ground for compression algorithms, but is a little odd for modeling language, since it is simply the first 100 million bytes of a Wikipedia XML dump. The paper does have another dataset, WikiText, which sounds like a good fit, but there is no character-level evaluation on it!
  • The LSTM is able to achieve roughly state-of-the-art results for character-level modeling. The key seems to be careful design of the softmax that produces the final probability distribution: (1) rare words are clustered and represented by a single value in the distribution calculation, and (2) word vectors are shared between the input embedding and the output softmax.
  • Dropout matters more than the network design, and the multiple forms of dropout should be tuned jointly. This comes from analysing a set of models trained with randomly varied hyperparameters.
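To make the second bullet's softmax tricks concrete, here is a minimal NumPy sketch, not the authors' implementation: the vocabulary split, sizes, and random parameters are all invented. Frequent words get their own output slots, all rare words share a single cluster slot, and the same matrix that embeds input words also scores the output distribution (tied weights).

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8            # hidden / embedding size (arbitrary)
n_frequent = 10  # frequent words each get their own output slot
n_rare = 5       # rare words share one cluster slot

# (2) tied weights: one matrix both embeds input words and scores outputs
E = rng.normal(size=(n_frequent, d))
cluster_vec = rng.normal(size=(d,))  # (1) a single vector stands in for all rare words

h = rng.normal(size=(d,))            # hidden state from the LSTM (random stand-in)

# first-stage distribution: frequent words plus a single "rare" bucket
logits = np.concatenate([E @ h, [cluster_vec @ h]])
logits -= logits.max()               # stabilise the exponentials
probs = np.exp(logits) / np.exp(logits).sum()

# probability of one specific rare word = P(rare bucket) * P(word | rare bucket)
E_rare = rng.normal(size=(n_rare, d))
rare_logits = E_rare @ h
rare_probs = np.exp(rare_logits - rare_logits.max())
rare_probs /= rare_probs.sum()
p_word = probs[-1] * rare_probs[0]
```

The payoff is that the expensive first-stage softmax is over `n_frequent + 1` entries rather than the full vocabulary; the second stage is only evaluated when a rare word's probability is actually needed.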



@article{merity2018analysis,
   author = {{Merity}, S. and {Shirish Keskar}, N. and {Socher}, R.},
    title = {An Analysis of Neural Language Modeling at Multiple Scales},
  journal = {ArXiv e-prints},
     year = {2018},
      url = {},
}

