The first major achievement of artificial intelligence was in chess. The game has a dizzying number of possible combinations, but it is relatively tractable because it is structured by a clear set of rules. An algorithm can always have perfect knowledge of the state of the game and know every possible move that it and its opponent can make. The state of the game can be assessed just by looking at the board.
But many other games are not so simple. Take something like Pac-Man: figuring out the ideal move would involve considering the shape of the maze, the locations of the ghosts, the location of any remaining dots to clear, the availability of power-ups, and so on, and the best plan can end in disaster if Blinky or Clyde makes an unexpected move. We have developed AIs that can handle these games too, but they had to take a very different approach from the one that conquered chess and Go.
At least, that was true until now. Today, however, Google's DeepMind division published a paper describing the structure of an AI that can handle both Atari classics and chess.
Reinforcement trees
The algorithms that worked in games like chess and Go do their planning using a tree-based approach, in which they look ahead through all the branches that result from the different actions available in the present. This approach is computationally expensive, and the algorithms depend on knowing the rules of the game, which let them project the current state of the game forward into possible future states.
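For illustration, here is a minimal sketch of what that kind of rules-based tree search looks like, using a toy take-the-last-token game instead of chess. The game, the function names, and the numbers are assumptions made for the example, not anything from DeepMind's work; the point is only that, because the rules are known, the algorithm can enumerate every future state exactly.

```python
# A minimal sketch of rules-based tree search, in the spirit of classical
# game-playing programs. The game here is a toy Nim variant (hypothetical,
# chosen only for illustration): players alternately take 1 or 2 tokens from
# a pile, and whoever takes the last token wins.

def legal_moves(pile):
    return [m for m in (1, 2) if m <= pile]

def minimax(pile, maximizing):
    # Terminal state: the previous player took the last token and won, so the
    # player to move here has lost (-1 for the maximizer) or won (+1).
    if pile == 0:
        return -1 if maximizing else 1
    scores = [minimax(pile - move, not maximizing) for move in legal_moves(pile)]
    return max(scores) if maximizing else min(scores)

def best_move(pile):
    # Look ahead through the full game tree and pick the move with the best
    # guaranteed outcome for the player to move.
    return max(legal_moves(pile), key=lambda m: minimax(pile - m, False))

if __name__ == "__main__":
    print(best_move(7))  # taking 1 leaves the opponent a losing pile of 6
```

Chess engines obviously cannot search the full tree the way this toy does, but the principle is the same: perfect knowledge of the rules lets the algorithm expand and evaluate future positions directly.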
Other games require algorithms that don't actually track the state of the game. Instead, these algorithms simply evaluate what they "see" – typically something like the positions of the pixels on the screen of an arcade game – and choose an action based on that. There is no internal model of the game's state, and the training process largely involves figuring out what response is appropriate given that information. There have been attempts to model a game's state from inputs such as pixel information, but they have not worked as well as the successful algorithms that simply respond to what is on the screen.
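To make the contrast concrete, here is a minimal sketch of that model-free approach: a tabular Q-learning agent that maps raw "pixel" frames directly to action values and learns only from reward feedback. The tiny random environment, the action names, and the hyperparameters are illustrative assumptions standing in for an Atari emulator, not part of any DeepMind system.

```python
# Model-free sketch: the agent never builds a model of the game's state or
# rules. It maps whatever it "sees" (a raw frame) straight to an action value
# and updates those values from the rewards it receives.

import random

ACTIONS = ["left", "right", "stay"]

class PixelEnv:
    """Hypothetical stand-in for an emulator: emits random 4-pixel frames."""
    def step(self, action):
        frame = bytes(random.randrange(4) for _ in range(4))
        reward = 1.0 if action == "right" else 0.0
        return frame, reward

class ModelFreeAgent:
    def __init__(self, epsilon=0.1, lr=0.5, gamma=0.9):
        self.q = {}  # (frame bytes, action) -> estimated value
        self.epsilon, self.lr, self.gamma = epsilon, lr, gamma

    def act(self, frame):
        # Epsilon-greedy choice based only on what is on the "screen".
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q.get((frame, a), 0.0))

    def learn(self, frame, action, reward, next_frame):
        # Standard Q-learning update: no game model, just reward feedback.
        best_next = max(self.q.get((next_frame, a), 0.0) for a in ACTIONS)
        old = self.q.get((frame, action), 0.0)
        self.q[(frame, action)] = old + self.lr * (reward + self.gamma * best_next - old)

if __name__ == "__main__":
    env, agent = PixelEnv(), ModelFreeAgent()
    frame, _ = env.step("stay")
    for _ in range(1000):
        action = agent.act(frame)
        next_frame, reward = env.step(action)
        agent.learn(frame, action, reward, next_frame)
        frame = next_frame
```

The real Atari-playing systems use deep networks rather than a lookup table, but the structure is the same: observation in, action out, with no explicit representation of the game's rules.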
The new system, which DeepMind calls MuZero, is based in part on DeepMind's AlphaZero AI, which taught itself to master rule-based games like chess and Go. But MuZero adds a new twist that makes it substantially more flexible.
That twist is called "model-based reinforcement learning." In a system that uses this approach, the software uses what it can see of a game to build an internal model of the game's state. Critically, that state is not pre-structured based on any understanding of the game – the AI has a great deal of flexibility regarding what information is or isn't included in it. The reinforcement-learning part refers to the training process, which allows the AI to learn to recognize when the model it is using is accurate and contains the information it needs to make decisions.
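As a rough illustration of that idea, the sketch below encodes a raw observation into a small internal state vector whose contents are determined entirely by learnable weights, not by any hand-designed notion of pieces, ghosts, or scores. The sizes, names, and linear projection are assumptions made for the example; MuZero's actual representation function is a deep neural network.

```python
# Sketch of a learned representation: raw observation in, internal state out.
# Nothing in this latent state is structured around the game's rules; the
# weights (random here, learned in a real system) decide what it keeps.

import random

LATENT_SIZE = 8

def make_encoder(obs_size, latent_size=LATENT_SIZE, seed=0):
    rng = random.Random(seed)
    # One weight per (observation pixel, latent dimension) pair.
    weights = [[rng.uniform(-1, 1) for _ in range(obs_size)]
               for _ in range(latent_size)]

    def encode(observation):
        # Linear projection of the raw pixels into the internal state.
        return [sum(w * p for w, p in zip(row, observation)) for row in weights]

    return encode

if __name__ == "__main__":
    encode = make_encoder(obs_size=16)
    frame = [random.randrange(4) for _ in range(16)]  # stand-in for screen pixels
    latent_state = encode(frame)                      # the model's own notion of "state"
    print(latent_state)
```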
Making predictions
The model it creates is used to make a number of predictions. These include the best possible move given the current state, and the state of the game that results from that move. Critically, these predictions are based on its internal model of the game state – not on the actual visual representation of the game, such as the locations of the chess pieces. The predictions themselves are made based on past experience, which is also subject to training.
Finally, the value of a move is evaluated using the algorithm's predictions of any immediate rewards gained from that move (the point value of a piece taken in chess, for example) and of the end state of the game, such as whether a game of chess is won or lost. This can involve the same searches through trees of potential game states done by earlier chess algorithms, but in this case, the trees consist of the AI's own internal game models.
If that's confusing, you can also think of it this way: MuZero runs three evaluations in parallel. One (the policy process) chooses the next move given the current model of the game state. A second predicts the resulting new state, along with any immediate rewards from the move. And a third draws on past experience to inform the policy decision. Each of these is the product of training, which focuses on minimizing the errors between these predictions and what actually happens in the game.
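A highly simplified sketch of those three evaluations, and of the kind of prediction errors that training would minimize, might look like the following. The linear "networks," the one-step-at-a-time lookahead, and every name below are illustrative assumptions, not MuZero's actual architecture.

```python
# Three parallel evaluations under simplified, assumed interfaces:
# `policy` picks a move from the internal state, `dynamics` predicts the next
# internal state plus the immediate reward, and `value` scores a state based
# on (stand-in) experience. Training would adjust all of them so that their
# predictions match what actually happens in the game.

import random

rng = random.Random(0)
LATENT, MOVES = 4, 3

def rand_vec(n):
    return [rng.uniform(-1, 1) for _ in range(n)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Illustrative stand-ins for the learned functions (random linear weights).
policy_w = [rand_vec(LATENT) for _ in range(MOVES)]      # state -> move scores
dyn_w    = [rand_vec(LATENT + 1) for _ in range(LATENT)]  # (state, move) -> next state
reward_w = rand_vec(LATENT + 1)                           # (state, move) -> reward
value_w  = rand_vec(LATENT)                               # state -> expected outcome

def policy(state):         return max(range(MOVES), key=lambda m: dot(policy_w[m], state))
def dynamics(state, move): return [dot(row, state + [move]) for row in dyn_w]
def reward(state, move):   return dot(reward_w, state + [move])
def value(state):          return dot(value_w, state)

def lookahead_score(state, move, depth=2):
    # Shallow search over the learned model rather than the real game rules:
    # expand with `dynamics`, accumulate predicted rewards, bottom out on `value`.
    if depth == 0:
        return value(state)
    nxt = dynamics(state, move)
    return reward(state, move) + max(lookahead_score(nxt, m, depth - 1) for m in range(MOVES))

def losses(state, move, observed_reward, observed_outcome, target_move):
    # Training minimizes the gap between each prediction and what the game
    # (or a deeper search) actually produced.
    reward_err = (reward(state, move) - observed_reward) ** 2
    value_err  = (value(state) - observed_outcome) ** 2
    policy_err = 0.0 if policy(state) == target_move else 1.0  # crude stand-in for cross-entropy
    return reward_err + value_err + policy_err

if __name__ == "__main__":
    s = rand_vec(LATENT)                        # internal state from the encoder
    m = policy(s)                               # the policy process picks a move
    print({mv: lookahead_score(s, mv) for mv in range(MOVES)})
    print(losses(s, m, observed_reward=1.0, observed_outcome=1.0, target_move=m))
```

In the real system, these functions are deep networks trained jointly by gradient descent, and the lookahead is a full Monte Carlo tree search over the learned model rather than the shallow recursion shown here.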
On top of its game
Obviously, the DeepMind folks wouldn't have a paper in Nature if this didn't work. MuZero took just under a million games against its predecessor AlphaZero to reach a similar level of performance in chess and shogi. For Go, it overtook AlphaZero after only half a million games. In all three cases, MuZero can be considered far superior to any human player.
But MuZero also excelled at a panel of Atari games, something that previously required a completely different AI approach. Compared to the previous best algorithm, which does not use an internal model, MuZero had a higher average and median score in 42 of the 57 games tested. So while there are still some circumstances in which it falls behind, it has now made model-based AI competitive in these games, while retaining the ability to handle rule-based games like chess and Go.
Overall, this is an impressive achievement and an indication of how AIs are growing in sophistication. A few years ago, training an AI on just one task, like recognizing a cat in photos, was an achievement. Now, we are able to train multiple aspects of an AI at the same time – here, the algorithm that created the model, the one that chose the move, and the one that predicted future rewards were all trained simultaneously.
In part, this is a product of the availability of greater processing power, which makes it possible to play millions of games of chess. But in part, it's an acknowledgment that this is what we need to do if an AI is ever going to be flexible enough to master multiple, loosely related tasks.
Nature, 2020. DOI: 10.1038/s41586-020-03051-4 (About DOIs).
Listing image by Richard Heaven / Flickr