Diplomacy
No-Press / No communication
The first neural model was DipNet (Paquette et al., 2019), a project I contributed to. We trained it with supervised learning on online human games, using an encoder-decoder network. The system plays as well as a strong human player and was integrated into WebDiplomacy. We also tried applying reinforcement learning, without success.
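For concreteness, here is a minimal sketch (not the actual DipNet code) of what supervised training for this kind of encoder-decoder policy looks like: the encoder maps board features to per-location representations and the decoder predicts an order for each unit. All dimensions, layer choices, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderDecoderPolicy(nn.Module):
    """Toy encoder-decoder over board locations (sizes are illustrative)."""
    def __init__(self, board_feats=35, hidden=128, num_orders=13000):
        super().__init__()
        # Encoder: embed each location's features (the real systems use
        # graph/convolutional layers over the adjacency map; a MLP here).
        self.encoder = nn.Sequential(
            nn.Linear(board_feats, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # Decoder: a distribution over the order vocabulary per location.
        self.decoder = nn.Linear(hidden, num_orders)

    def forward(self, board):            # board: (batch, num_locs, board_feats)
        return self.decoder(self.encoder(board))   # (batch, num_locs, num_orders)

def supervised_step(model, optimizer, board, gold_orders, unit_mask):
    """One behaviour-cloning step on a human game position.
    gold_orders: (batch, num_locs) order indices; unit_mask: True where a unit sits."""
    logits = model(board)
    loss = nn.functional.cross_entropy(logits[unit_mask], gold_orders[unit_mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```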
Immediate follow-up work refined the architecture, filtered the training data, and found ways to make RL effective. Specifically, Anthony et al. (2020) changed the encoder and decoder and used a form of policy iteration in which a set of actions is sampled and the best of them is used to approximate the best-response action. Gray et al. (2021) improved the model further and proposed another training approach: sample a set of candidate orders for every player, treat those candidates as the action space of a matrix game, and estimate regret over that game, with rollouts of 2-3 turns scored by the base model's value network. The final system plays the policy from the last regret matching iteration, doing better than prior work.
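To make the search step concrete, here is a hedged sketch of regret matching over sampled order sets, simplified to two players for readability (Diplomacy has seven). The payoff entries come from short rollouts scored by a value function; `rollout_value` and the candidate lists are assumed interfaces, not the paper's actual code.

```python
import numpy as np

def build_payoff(state, candidates, rollout_value):
    """Fill 2-player payoff matrices: entry (i, j) is the value, from a short
    rollout, of player p playing candidate i while the other plays candidate j."""
    n0, n1 = len(candidates[0]), len(candidates[1])
    payoff = [np.zeros((n0, n1)), np.zeros((n1, n0))]
    for i, a0 in enumerate(candidates[0]):
        for j, a1 in enumerate(candidates[1]):
            v0, v1 = rollout_value(state, (a0, a1))   # assumed: 2-3 turn rollout + value net
            payoff[0][i, j] = v0
            payoff[1][j, i] = v1
    return payoff

def regret_matching(payoff, iters=256):
    """Regret matching on the sampled-candidate game; returns the final-iterate
    policies, matching the 'play the last iteration' choice described above."""
    n = [payoff[p].shape[0] for p in range(2)]
    regrets = [np.zeros(n[p]) for p in range(2)]
    policy = [np.ones(n[p]) / n[p] for p in range(2)]
    for _ in range(iters):
        for p in range(2):
            # Expected value of each candidate against the opponent's current policy.
            action_values = payoff[p] @ policy[1 - p]
            ev = policy[p] @ action_values
            regrets[p] += action_values - ev
            pos = np.maximum(regrets[p], 0.0)
            policy[p] = pos / pos.sum() if pos.sum() > 0 else np.ones(n[p]) / n[p]
    return policy
```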
Building on the RL ideas in the work above, Bakhtin et al. (2021) explored training without any supervised learning from human games. To make this work, they further modified the architecture and applied an AlphaZero-style training approach. Interestingly, models trained this way win in a 6v1 setting (six self-trained agents vs one prior model) but do worse than expected in a 1v6 setting, suggesting the agent has converged to a different strategy, and the same pattern is observed across training runs with different seeds (i.e., each run converges to a different strategy). The paper does not analyse the actual games in detail, but one possibility is that the agents converge to highly coordinated strategies (e.g., Russia and Turkey working together).
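A rough sketch of the 6v1 / 1v6 comparison described above, where the self-play agent is evaluated both as the majority and as the minority of the seven powers; `play_game` and the agent objects are assumed interfaces, and the scoring is illustrative.

```python
import random

def evaluate_population(agent_a, agent_b, n_copies_a, n_games=100):
    """Average score of agent_a when playing n_copies_a seats against
    (7 - n_copies_a) copies of agent_b, with random power assignment."""
    totals = []
    for _ in range(n_games):
        lineup = [agent_a] * n_copies_a + [agent_b] * (7 - n_copies_a)
        random.shuffle(lineup)                      # random power assignment
        results = play_game(lineup)                 # assumed: per-seat final scores
        totals.append(sum(score for agent, score in zip(lineup, results)
                          if agent is agent_a) / n_copies_a)
    return sum(totals) / n_games

# 6v1: evaluate_population(self_play_agent, baseline_agent, 6)
# 1v6: evaluate_population(self_play_agent, baseline_agent, 1)
# A large gap between the two suggests the agent relies on conventions that
# only work when most other players share them.
```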
RL-trained models diverge from human styles of play, as observed for both DipNet and the Bakhtin et al. (2021) bot. To address this, Jacob et al. (2021) regularise the policy towards a human-style policy. In other words, there is a penalty for producing a distribution over moves that differs from a policy trained with supervised learning. This improves performance both in playing the game and in predicting what an expert player would do (in Chess and Go as well as Diplomacy).
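A minimal sketch of this kind of regularisation: under a KL penalty towards the supervised (human) policy, the resulting action distribution is proportional to the human policy weighted by exponentiated search values. The `lam` weight, the softmax form, and the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def regularised_policy(action_values, human_probs, lam=1.0):
    """Distribution proportional to human_prob * exp(value / lam), i.e. the
    solution to maximising expected value minus lam * KL(policy || human policy).
    Larger lam pulls play closer to the human imitation policy."""
    logits = action_values / lam + np.log(np.maximum(human_probs, 1e-12))
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```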
Press / With communication
Cicero (Bakhtin et al., 2022)
Structured Communication
DARPA recently announced the SHADE Program, which will explore bots that can communicate, though with a constrained communication language, rather than full natural language.
Language
Two studies have looked at the language used in human games, to see if there are markers of deception. Peskov et al. (2020) introduced a dataset in which players indicated whether they were lying as they played and recipients indicated whether they thought they were being lied to. This is nice because it is more reliable than post-hoc annotation. Both humans and machines are not very good at detecting lies (Lie F1 of at most 27), and while their errors are quite different, overall performance is fairly similar. However, the models only use a single feature of the game state and are only trained on the 9 labelled games (rather than pre-training or otherwise drawing on other resources).
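As a rough illustration of the kind of baseline involved, here is a hedged sketch of a lie detector trained on the labelled messages: a simple bag-of-words classifier scored with F1 on the rare "lie" class. The feature choices and field names are assumptions, not the paper's actual system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def train_lie_detector(train_msgs, train_labels, test_msgs, test_labels):
    """train_msgs/test_msgs: message strings; labels: 1 if the sender marked
    the message as a lie, 0 otherwise. Returns Lie F1 on the test split."""
    vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    X_train = vec.fit_transform(train_msgs)
    X_test = vec.transform(test_msgs)
    # class_weight="balanced" matters because true lies are a small minority.
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    clf.fit(X_train, train_labels)
    preds = clf.predict(X_test)
    return f1_score(test_labels, preds, pos_label=1)   # "Lie F1"
```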