Mastering the game of Go without human knowledge (Silver et al., Nature 2017)
This paper extends the original AlphaGo work on using reinforcement learning to build a Go player. Interestingly, the changes simplify the overall model while also making it stronger than its predecessor, and it now needs no supervised training on human games.
One key change is that a single core neural network learns to represent the game state. On top of that shared representation sit two sets of layers: one produces an evaluation of the quality of a position, and the other places a distribution over moves. This ties in nicely with a lot of work happening at the moment on multi-task learning in NLP and elsewhere.
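As a rough illustration of the shared-trunk idea, here is a minimal sketch in plain Python. It is not the paper's architecture (which uses a deep residual convolutional network); the class name `TwoHeadedNet`, the single dense hidden layer, the layer sizes, and the toy 3x3 board encoding are all assumptions made for the example. The point it shows is only that one hidden representation feeds both a move distribution and a scalar position evaluation.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class TwoHeadedNet:
    """Hypothetical toy model: one shared trunk, a policy head, and a value head."""

    def __init__(self, n_inputs, n_hidden, n_moves, seed=0):
        rng = random.Random(seed)
        self.trunk = init_layer(rng, n_hidden, n_inputs)        # shared representation
        self.policy_head = init_layer(rng, n_moves, n_hidden)   # distribution over moves
        self.value_head = init_layer(rng, 1, n_hidden)          # position evaluation

    def forward(self, board):
        # Both heads read the same hidden features computed by the trunk.
        hidden = [math.tanh(h) for h in linear(*self.trunk, board)]
        move_probs = softmax(linear(*self.policy_head, hidden))
        value = math.tanh(linear(*self.value_head, hidden)[0])  # squashed to [-1, 1]
        return move_probs, value


# Toy 3x3 board (assumed encoding): 1 = own stone, -1 = opponent, 0 = empty.
# Ten moves = nine points plus a pass.
net = TwoHeadedNet(n_inputs=9, n_hidden=16, n_moves=10)
move_probs, value = net.forward([0.0, 1.0, -1.0, 0.0, 0.0, 1.0, 0.0, -1.0, 0.0])
```

Because the trunk is shared, any gradient from either head would update the same representation, which is the multi-task connection mentioned above.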
Getting into the details, they use Monte Carlo tree search (MCTS) to choose actions during training, then update the model to better match the outcomes observed. The model starts from a completely random initialisation, and the argument for why this works is that at every point in self-play the MCTS-informed outcomes are just slightly better than what the current model would produce on its own. That edge is enough to provide a useful signal, and because the two sides in self-play are closely matched, the signal never demands a drastic shift. Interestingly, while the model trained without supervision is worse at predicting what expert human players will do in a game, it is still better at predicting which player will win.
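To make the update step concrete, here is a small sketch of the training signal as the paper describes it: search visit counts are turned into a target move distribution, and the network is penalised for mispredicting both that distribution and the final game result, plus an L2 penalty on the weights. The function names, the toy numbers, and the flat `params` list are assumptions for illustration; this is not the authors' code.

```python
import math


def visit_counts_to_policy(visit_counts, temperature=1.0):
    """Target policy proportional to N(a) ** (1 / temperature)."""
    powered = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(powered)
    return [p / total for p in powered]


def training_loss(move_probs, value, search_policy, outcome, params=(), c=1e-4):
    """(z - v)^2 - pi . log(p) + c * ||theta||^2 for a single position.

    move_probs:    network's predicted move distribution p
    value:         network's predicted outcome v in [-1, 1]
    search_policy: MCTS-derived target distribution pi
    outcome:       game result z from the current player's view (+1 or -1)
    params:        flat iterable of weights (hypothetical representation)
    """
    value_loss = (outcome - value) ** 2
    # Cross-entropy between the search policy and the network policy;
    # moves the search never visited contribute nothing.
    policy_loss = -sum(
        pi * math.log(p) for pi, p in zip(search_policy, move_probs) if pi > 0
    )
    l2_penalty = c * sum(w * w for w in params)
    return value_loss + policy_loss + l2_penalty


# Toy position with three legal moves: the search visited them 10, 30 and 60 times.
search_policy = visit_counts_to_policy([10, 30, 60])
# An untrained network: uniform move probabilities and a neutral value estimate.
loss = training_loss([1 / 3, 1 / 3, 1 / 3], 0.0, search_policy, outcome=1.0)
```

Minimising this loss pulls the network's move distribution toward the search's slightly stronger one and its value estimate toward the eventual winner, which is the "slightly better than the current model" signal described above.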
Citation
@Article{AlphaGoZero,
author = {Silver, David and Schrittwieser, Julian and Simonyan, Karen and Antonoglou, Ioannis and Huang, Aja and Guez, Arthur and Hubert, Thomas and Baker, Lucas and Lai, Matthew and Bolton, Adrian and Chen, Yutian and Lillicrap, Timothy and Hui, Fan and Sifre, Laurent and van den Driessche, George and Graepel, Thore and Hassabis, Demis},
title = {Mastering the game of Go without human knowledge},
journal = {Nature},
year = {2017},
volume = {550},
issue = {7676},
pages = {354--359},
publisher = {Macmillan Publishers Limited, part of Springer Nature},
doi = {10.1038/nature24270},
url = {http://www.nature.com/nature/journal/v550/n7676/abs/nature24270.html#supplementary-information},
}