Skip to the content.

This repository contains software to do several things related to syntax - parsing, format conversion, and evaluation. For a full definition of the parsing algorithm, proofs of its properties, and results on the standard metrics, see:

If you want to understand the algorithm, my thesis provides a better explanation (the TACL paper focuses on proving properties):

If you use this code in your own work, please cite the TACL paper:

    title = "Parsing with Traces: An ${O}(n^4)$ Algorithm and a Structural Representation",
    author = "Kummerfeld, Jonathan K.  and
      Klein, Dan",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "5",
    year = "2017",
    url = "",
    doi = "10.1162/tacl_a_00072",
    pages = "441--454",
    software  = "",

What is this work about?

In the Penn Treebank, parses have (1) a core projective tree structure and (2) traces that represent control structures, wh-movement and more. However, most parsers and the standard evaluation metric (evalb) ignore the traces and all null elements, focusing entirely on the tree structure. These aspects of syntax are excluded not because of disagreements regarding theory, but rather because of the computational challenge of including them.

This work is about a new inference algorithm that efficiently finds the maximum scoring graph parse for a sentence when using a first-order model. Our approach involves (1) the algorithm, which builds upon the non-projective tree parsing algorithm of Pitler et al. (2013), and (2) a syntactic representation, similar to several prior approaches, particularly Carreras et al. (2008). For more details, see the paper and dissertation above.

This repository contains:

All of the scripts are in python and can be run without downloading or installing any additional resources (note, all should work with python 2.7, only some work with python 3). The parser is in scala, but can be downloaded as a jar that runs with java. For pruning we use a tagger written in C++ that depends on several libraries (we provide a makefile and information about what else needs to be installed).

For detailed information on each component, see the file in each directory.

Reproducing the results in the paper

You will need:

Assuming you download the contents of this repo and have all of those file and put them in the same directory as this file, you will need to:

  1. Get just the text from the gold parses
  2. Add IDs to the sentences
  3. [Only if running the tagger] Simplify the text
  4. [Only if running the tagger] Run the spine tagger
  5. Run the parser
  6. Fix parses where we failed to produce output
  7. Convert to standard PTB
  8. Adjust the format of the gold data
  9. Evaluate traces and nulls
  10. Evaluse nulls only

Each of these steps is one command below:

python2 format-conversion/ -i p -o w < wsj23.mrg > test.tok
python3 parser/nn-tagger/ 200001 < test.tok > test-with-ids.tok
python3 parser/nn-tagger/ < test-with-ids.tok > test-with-ids.tok.simple
parser/nn-tagger/spine-tagger -word-dict dict.words -tag-dict dict.tags -model model.params -test test-with-ids.tok.simple -prefix wsj23.tagged
./ test.parser test-with-ids.tok wsj23.pos.stanford
python3 evaluation/ < test.parser.auto_all.localStageFinal > test.parser.shp
python2 format-conversion/ -i h -o o -e he < test.parser.shp > test.parser.ptb
python2 format-conversion/ -i p -o st -e fh < wsj23.mrg >
python2 evaluation/ test.parser.ptb | tail -n 1
python2 evaluation/ test.parser.ptb --null_only | tail -n 1

The final two commands give performance on (1) traces and other null items, (2) nulls only:

all 2621 3893 3524 74.3757094211 67.3259696892 70.6754752595
all 3153 3864 3522 89.5229982964 81.599378882 85.3777416734

These numbers are (count of matching, gold total, test total, precision, recall, f-score).

Note, the first line of results is slightly better than in the paper. Prof. Yoshihide Kato identified a bug in my conversion script that meant correlates and remnants had the “=” and “-“ indices reversed. This bug has now been fixed. The original values produced were:

all 2619 3893 3524 74.3189557321 67.2745954277 70.6215450991


If you find bugs or have questions about any of this software please either contact me ( or create an issue. Thank you!

Misc Notes

The dynamic program definitions in the TACL paper and my thesis are slightly different: