This paper proposes two techniques for speeding up neural network execution on GPUs:

- Reduce computation when doing matrix-multiply by removing rows.
- Reduce communication on the GPU by halving the number of bits used to represent numbers.

Either of these gives a speed up of ~1.5x and together they give ~2x, across a range of different computer vision tasks+models.

## Core ideas in detail

The first idea, reducing work by eliminating parts of the computation, has been considered before. In the past, however, the focus was on saving memory in models, and so the most common strategy was to move to a sparse matrix where weights close to zero are dropped. Here the focus is on speed and they show that while the sparse approach saves memory it can end up being slower because of hardware behaviour. Instead, they eliminate entire rows of the matrix, which means there is less computation, but it remains dense (and therefore fast). Rows are identified by measuring correlation between outputs and greedily eliminating rows that correlate highly with the rest of the output.

The natural question to ask is whether this hurts performance. First, they do two things to avoid problems, (1) a scale factor is used to make sure the outputs are of the same range that they would have been with the full matrix, and (2) they restart training to fine-tune the network once pruning is set up. With high enough pruning accuracy does fall, but speed ups can be gained before that is a problem (the exact point depends on the task).

The second idea relates to numerical representation, and is motivated by measurements of where the bottlenecks are in communication. Many AI researchers have tried switching to 16 bit representations to save space and time, but here they develop a different floating point encoding that gives more bits to the exponent, and fewer to the mantissa.

## Thoughts

- It would be interesting to see the interaction of this work with the investigation of networks without non-linear functions that can still learn non-linear behaviour because of numerical approximations.
- In the context of language, the weight reduction approach would be interesting to analyse. Specifically, what do we lose in our word vectors depending on the task?
- I’ve always had some interest in making things faster. It would be interesting to know where the remaining bottlenecks are (after applying these changes).

## Citation

```
@InProceedings{Hill:MICRO:2017,
author = {Parker Hill, Animesh Jain, Mason Hill1, Babak Zamirai, Chang-Hong Hsu, Michael A. Laurenzano, Scott Mahlke, Lingjia Tang and Jason Mars},
title = {DeftNN: Addressing Bottlenecks for DNN Execution on GPUs via Synapse Vector Elimination and Near-compute Data Fission},
booktitle = {The 50th Annual IEEE/ACM International Symposium on Microarchitecture},
year = {2017},
}
```