
Improving Low Compute Language Modeling with In-Domain Embedding Initialisation

This repository contains code for language modeling experiments, as described in:

It is based on the original version of the AWD-LSTM language model (later versions of the code had slightly different performance, so we used the original to match the original paper). Most of this readme file is taken from that code.

Data preparation

This repository contains the code we used to pre-process the data. There are files for extraction:

And files for tokenising with Stanza and converting numbers:

We also include a script that reads the LDC PTB tgz file and produces our version of the PTB:

For example, to prepare the data in the same way we did, run these two commands (where treebank_3_LDC99T42.tgz must be downloaded from the LDC).

./data-preprocessing/make-non-unk-ptb.py --prefix ptb.std. treebank_3_LDC99T42.tgz
./data-preprocessing/make-non-unk-ptb.py --prefix ptb.rare. --no-unks treebank_3_LDC99T42.tgz

Changes to AWD-LSTM

The model code has been modified to support the experiments described in the paper. Specifically, we added:

These are controlled via command line options:

  --emsize EMSIZE       size of word embeddings
  --nout NOUT           size of output embedding. Must match emsize if tying
  --untied              Do not tie the input and output weights
  --random-in           Use random init for the input embeddings
  --random-out          Use random init for the output embeddings
  --freeze-in           Freeze the input embeddings
  --freeze-out          Freeze the output embeddings but not the bias vector
  --freeze-out-withbias
                        Freeze the output embeddings and the bias vector
  --embed EMBED         File with word embeddings
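
As a rough illustration of how these options fit together, the sketch below rebuilds the flags with argparse and checks the constraint that the output size must match emsize when the weights are tied. The function names, the default values, and the fallback of nout to emsize are assumptions made for this example and are not taken from the repository's main.py.

```python
import argparse


def build_parser():
    # Hypothetical reconstruction of the embedding options listed above.
    # Defaults are assumptions, not necessarily the repository's values.
    parser = argparse.ArgumentParser(description="Embedding options (sketch)")
    parser.add_argument("--emsize", type=int, default=400,
                        help="size of word embeddings")
    parser.add_argument("--nout", type=int, default=None,
                        help="size of output embedding. Must match emsize if tying")
    parser.add_argument("--untied", action="store_true",
                        help="Do not tie the input and output weights")
    parser.add_argument("--random-in", action="store_true",
                        help="Use random init for the input embeddings")
    parser.add_argument("--random-out", action="store_true",
                        help="Use random init for the output embeddings")
    parser.add_argument("--freeze-in", action="store_true",
                        help="Freeze the input embeddings")
    parser.add_argument("--freeze-out", action="store_true",
                        help="Freeze the output embeddings but not the bias vector")
    parser.add_argument("--freeze-out-withbias", action="store_true",
                        help="Freeze the output embeddings and the bias vector")
    parser.add_argument("--embed", type=str, default=None,
                        help="File with word embeddings")
    return parser


def check_embedding_args(args):
    # Assumption: when --nout is omitted, the output size follows --emsize.
    nout = args.emsize if args.nout is None else args.nout
    # Tied weights share one matrix, so both sizes have to agree.
    if not args.untied and nout != args.emsize:
        raise ValueError("--nout must match --emsize unless --untied is set")
    return nout


if __name__ == "__main__":
    parsed = build_parser().parse_args()
    print("output embedding size:", check_embedding_args(parsed))
```

Validating the sizes up front means a mismatched tied configuration fails before any training time is spent.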

AWD-LSTM Language Model

Averaged Stochastic Gradient Descent with Weight Dropped LSTM

This repository contains the code used for Salesforce Research’s Regularizing and Optimizing LSTM Language Models paper, originally forked from the PyTorch word level language modeling example. The model comes with instructions to train a word level language model over the Penn Treebank (PTB) and WikiText-2 (WT2) datasets, though the model is likely extensible to many other datasets.

If you use this code or our results in your research, please cite:

@article{merityRegOpt,
  title={{Regularizing and Optimizing LSTM Language Models}},
  author={Merity, Stephen and Keskar, Nitish Shirish and Socher, Richard},
  journal={arXiv preprint arXiv:1708.02182},
  year={2017}
}

Software Requirements

This codebase requires Python 3 and PyTorch 0.1.12_2. If you are using Anaconda, this can be achieved via: conda install pytorch=0.1.12 -c soumith.

Note the older version of PyTorch: upgrading to a later version would require minor updates and would prevent exact reproduction of the results below. Pull requests which update to later PyTorch versions are welcome, especially if they have baseline numbers to report too :)

Experiments

The codebase was modified during the writing of the paper, preventing exact reproduction due to minor differences in random seeds or similar. The guide below produces results largely similar to the numbers reported.

For data setup, run ./getdata.sh. This script collects the Mikolov pre-processed Penn Treebank and the WikiText-2 datasets and places them in the data directory.

Important: If you’re going to continue experimentation beyond reproduction, comment out the test code and use the validation metrics until reporting your final results. This is proper experimental practice and is especially important when tuning hyperparameters, such as those used by the pointer.

Penn Treebank (PTB)

The instructions below train a PTB model that achieves perplexities of 61.2 / 58.9 (validation / testing) without fine-tuning, 58.8 / 56.6 with fine-tuning, and 53.5 / 53.0 with the continuous cache pointer augmentation.

First, train the model:

python main.py --batch_size 20 --data data/penn --dropouti 0.4 --seed 28 --epoch 300 --save PTB.pt

The first epoch should result in a validation perplexity of 308.03.

To then fine-tune that model:

python finetune.py --batch_size 20 --data data/penn --dropouti 0.4 --seed 28 --epoch 300 --save PTB.pt

The validation perplexity after the first epoch should be 60.85.

Note: Fine-tuning modifies the original saved model in PTB.pt - if you wish to keep the original weights you must copy the file.
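
One way to keep the original weights is to copy the checkpoint before running finetune.py. The helper below is a minimal sketch using only the standard library; the function name backup_checkpoint and the ".pre-finetune" suffix are hypothetical choices for this example, not part of the repository.

```python
import shutil
from pathlib import Path


def backup_checkpoint(path, suffix=".pre-finetune"):
    """Copy a saved model file so fine-tuning cannot overwrite the only copy."""
    src = Path(path)
    # Keep the backup next to the original, e.g. PTB.pt -> PTB.pt.pre-finetune
    dst = src.with_name(src.name + suffix)
    shutil.copy2(src, dst)
    return dst


if __name__ == "__main__":
    # Run this before finetune.py; PTB.pt is the file written by main.py above.
    print("backup written to", backup_checkpoint("PTB.pt"))
```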

Finally, to run the pointer:

python pointer.py --data data/penn --save PTB.pt --lambdasm 0.1 --theta 1.0 --window 500 --bptt 5000

Note that the model in the paper was trained for 500 epochs and the batch size was 40, in comparison to 300 and 20 for the model above. The window size for this pointer is chosen to be 500 instead of 2000 as in the paper.

Note: BPTT just changes the length of the sequence pushed onto the GPU but won’t impact the final result.
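
To make the pointer options more concrete, here is a small pure-Python sketch of a continuous cache: past hidden states are scored against the current one, the scores are turned into a distribution over the words that followed them, and that distribution is mixed with the model's vocabulary distribution. The function name and argument layout are hypothetical, and the reading of --lambdasm as the mixing weight, --theta as the score scaling, and --window as the cache length is an assumption about pointer.py rather than a description of it.

```python
import math


def cache_pointer_probs(vocab_probs, hidden, past_hiddens, past_next_words,
                        lambdasm=0.1, theta=1.0, window=500):
    """Mix a vocabulary distribution with a cache distribution (sketch).

    vocab_probs:      model probabilities for the next word, one per vocab id
    hidden:           current hidden state (list of floats)
    past_hiddens:     earlier hidden states, oldest first
    past_next_words:  the word id that followed each earlier hidden state
    window is assumed to be a positive integer.
    """
    # Only the most recent `window` positions stay in the cache.
    past_hiddens = past_hiddens[-window:]
    past_next_words = past_next_words[-window:]
    if not past_hiddens:
        return list(vocab_probs)

    # Similarity between the current state and each cached state, scaled by theta.
    scores = [theta * sum(a * b for a, b in zip(hidden, h)) for h in past_hiddens]
    # Softmax over the cache, shifted by the max score for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)

    # Each cached position votes for the word that followed it.
    cache_probs = [0.0] * len(vocab_probs)
    for weight, word in zip(weights, past_next_words):
        cache_probs[word] += weight / total

    # Linear interpolation between the two distributions.
    return [(1.0 - lambdasm) * pv + lambdasm * pc
            for pv, pc in zip(vocab_probs, cache_probs)]
```

Because both inputs are probability distributions and the mixing weights sum to one, the result is again a distribution; words that recently followed similar hidden states gain probability mass.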

WikiText-2 (WT2)

The instructions below train a WT2 model that achieves perplexities of 69.1 / 66.1 (validation / testing) without fine-tuning, 68.7 / 65.8 with fine-tuning, and 53.6 / 52.0 (51.95 specifically) with the continuous cache pointer augmentation.

python main.py --seed 20923 --epochs 750 --data data/wikitext-2 --save WT2.pt

The first epoch should result in a validation perplexity of 629.93.

python -u finetune.py --seed 1111 --epochs 750 --data data/wikitext-2 --save WT2.pt

The validation perplexity after the first epoch should be 69.14.

Note: Fine-tuning modifies the original saved model in WT2.pt - if you wish to keep the original weights you must copy the file.

Finally, run the pointer:

python pointer.py --save WT2.pt --lambdasm 0.1279 --theta 0.662 --window 3785 --bptt 2000 --data data/wikitext-2

Note: BPTT just changes the length of the sequence pushed onto the GPU but won’t impact the final result.

Speed

All the augmentations to the LSTM, including our variant of DropConnect (Wan et al. 2013) termed weight dropping which adds recurrent dropout, allow for the use of NVIDIA’s cuDNN LSTM implementation. PyTorch will automatically use the cuDNN backend if run on CUDA with cuDNN installed. This ensures the model is fast to train even when convergence may take many hundreds of epochs.

The default speeds for the model during training on an NVIDIA Quadro GP100:

Speeds are approximately three times slower on a K80. On a K80 or other cards with less memory, you may wish to enable the cap on the maximum sampled sequence length to prevent out-of-memory (OOM) errors, especially for WikiText-2.

If speed is a major issue, SGD converges more quickly than our non-monotonically triggered variant of ASGD, though it achieves a worse overall perplexity.
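
The non-monotonic trigger can be sketched as a check on the validation history: training starts with plain SGD and switches to averaging once the latest validation loss is worse than the best loss seen more than a few checks ago. The function name, the list-based history, and the default of 5 for the non-monotone interval are assumptions made for this illustration and may differ from the training script.

```python
def should_switch_to_asgd(val_history, nonmono=5):
    """Return True when the non-monotonic trigger fires (sketch).

    val_history: validation losses so far, oldest first, latest last.
    nonmono:     number of most recent earlier checks to ignore when
                 looking for the best previous loss.
    """
    if not val_history:
        return False
    latest = val_history[-1]
    previous = val_history[:-1]
    # Not enough history yet to compare against.
    if len(previous) <= nonmono:
        return False
    # Best loss among checks older than the last `nonmono` ones.
    best_older = min(previous[:-nonmono])
    return latest > best_older


if __name__ == "__main__":
    history = []
    for loss in [10.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0]:
        history.append(loss)
        if should_switch_to_asgd(history):
            print("switch to ASGD after", len(history), "validation checks")
            break
```

Ignoring the most recent checks makes the trigger tolerant of short-lived fluctuations, so averaging begins only after the loss has stopped improving for a sustained stretch.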