Mastering Atari, Go, chess and shogi by planning with a learned model

  • 1

    Campbell, M., Hoane, A. J. Jr & Hsu, F.-h. Deep Blue. Artif. Intell. 134, 57–83 (2002).

  • 2

    Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).

  • 3

    Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).

  • 4

    Machado, M. et al. Revisiting the arcade learning environment: evaluation protocols and open problems for general agents. J. Artif. Intell. Res. 61, 523–562 (2018).

  • 5

    Silver, D. et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018).

  • 6

    Schaeffer, J. et al. A world championship caliber checkers program. Artif. Intell. 53, 273–289 (1992).

  • 7

    Brown, N. & Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359, 418–424 (2018).

  • 8

    Moravčík, M. et al. DeepStack: expert-level artificial intelligence in heads-up no-limit poker. Science 356, 508–513 (2017).

  • 9

    Vlahavas, I. & Refanidis, I. Planning and Scheduling. Technical Report (EETN, 2013).

  • 10

    Segler, M. H., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).

  • 11

    Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction 2nd ed. (MIT Press, 2018).

  • 12

    Deisenroth, M. & Rasmussen, C. PILCO: a model-based and data-efficient approach to policy search. In Proc. 28th International Conference on Machine Learning, ICML 2011 465–472 (Omnipress, 2011).

  • 13

    Heess, N. et al. Learning continuous control policies by stochastic value gradients. In NIPS’15: Proc. 28th International Conference on Neural Information Processing Systems Vol. 2 (eds Cortes, C. et al.) 2944–2952 (MIT Press, 2015).

  • 14

    Levine, S. & Abbeel, P. Learning neural network policies with guided policy search under unknown dynamics. Adv. Neural Inf. Process. Syst. 27, 1071–1079 (2014).

  • 15

    Hafner, D. et al. Learning latent dynamics for planning from pixels. Preprint at https://arxiv.org/abs/1811.04551 (2018).

  • 16

    Kaiser, L. et al. Model-based reinforcement learning for Atari. Preprint at https://arxiv.org/abs/1903.00374 (2019).

  • 17

    Buesing, L. et al. Learning and querying fast generative models for reinforcement learning. Preprint at https://arxiv.org/abs/1802.03006 (2018).

  • 18

    Espeholt, L. et al. IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In Proc. International Conference on Machine Learning, ICML Vol. 80 (eds Dy, J. & Krause, A.) 1407–1416 (2018).

  • 19

    Kapturowski, S., Ostrovski, G., Dabney, W., Quan, J. & Munos, R. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations (2019).

  • 20

    Horgan, D. et al. Distributed prioritized experience replay. In International Conference on Learning Representations (2018).

  • 21

    Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming 1st ed. (John Wiley & Sons, 1994).

  • 22

    Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games 72–83 (Springer, 2006).

  • 23

    Wahlström, N., Schön, T. B. & Deisenroth, M. P. From pixels to torques: policy learning with deep dynamical models. Preprint at http://arxiv.org/abs/1502.02251 (2015).

  • 24

    Watter, M., Springenberg, J. T., Boedecker, J. & Riedmiller, M. Embed to control: a locally linear latent dynamics model for control from raw images. In NIPS’15: Proc. 28th International Conference on Neural Information Processing Systems Vol. 2 (eds Cortes, C. et al.) 2746–2754 (MIT Press, 2015).

  • 25

    Ha, D. & Schmidhuber, J. Recurrent world models facilitate policy evolution. In NIPS’18: Proc. 32nd International Conference on Neural Information Processing Systems (eds Bengio, S. et al.) 2455–2467 (Curran Associates, 2018).

  • 26

    Gelada, C., Kumar, S., Buckman, J., Nachum, O. & Bellemare, M. G. DeepMDP: learning continuous latent space models for representation learning. In Proc. 36th International Conference on Machine Learning: Volume 97 of Proc. Machine Learning Research (eds Chaudhuri, K. & Salakhutdinov, R.) 2170–2179 (PMLR, 2019).

  • 27

    van Hasselt, H., Hessel, M. & Aslanides, J. When to use parametric models in reinforcement learning? Preprint at https://arxiv.org/abs/1906.05243 (2019).

  • 28

    Tamar, A., Wu, Y., Thomas, G., Levine, S. & Abbeel, P. Value iteration networks. Adv. Neural Inf. Process. Syst. 29, 2154–2162 (2016).

  • 29

    Silver, D. et al. The predictron: end-to-end learning and planning. In Proc. 34th International Conference on Machine Learning Vol. 70 (eds Precup, D. & Teh, Y. W.) 3191–3199 (JMLR, 2017).

  • 30

    Farahmand, A. M., Barreto, A. & Nikovski, D. Value-aware loss function for model-based reinforcement learning. In Proc. 20th International Conference on Artificial Intelligence and Statistics: Volume 54 of Proc. Machine Learning Research (eds Singh, A. & Zhu, J.) 1486–1494 (PMLR, 2017).

  • 31

    Farahmand, A. Iterative value-aware model learning. Adv. Neural Inf. Process. Syst. 31, 9090–9101 (2018).

  • 32

    Farquhar, G., Rocktaeschel, T., Igl, M. & Whiteson, S. TreeQN and ATreeC: differentiable tree planning for deep reinforcement learning. In International Conference on Learning Representations (2018).

  • 33

    Oh, J., Singh, S. & Lee, H. Value prediction network. Adv. Neural Inf. Process. Syst. 30, 6118–6128 (2017).

  • 34

    Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012).

  • 35

    He, K., Zhang, X., Ren, S. & Sun, J. Identity mappings in deep residual networks. In 14th European Conference on Computer Vision 630–645 (2016).

  • 36

    Hessel, M. et al. Rainbow: combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence (2018).

  • 37

    Schmitt, S., Hessel, M. & Simonyan, K. Off-policy actor-critic with shared experience replay. Preprint at https://arxiv.org/abs/1909.11583 (2019).

  • 38

    Azizzadenesheli, K. et al. Surprising negative results for generative adversarial tree search. Preprint at http://arxiv.org/abs/1806.05780 (2018).

  • 39

    Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).

  • 40

    OpenAI. OpenAI Five. OpenAI https://blog.openai.com/openai-five/ (2018).

  • 41

    Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019).

  • 42

    Jaderberg, M. et al. Reinforcement learning with unsupervised auxiliary tasks. Preprint at https://arxiv.org/abs/1611.05397 (2016).

  • 43

    Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).

  • 44

    Kocsis, L. & Szepesvári, C. Bandit based Monte-Carlo planning. In European Conference on Machine Learning 282–293 (Springer, 2006).

  • 45

    Rosin, C. D. Multi-armed bandits with episode context. Ann. Math. Artif. Intell. 61, 203–230 (2011).

  • 46

    Schadd, M. P., Winands, M. H., Van Den Herik, H. J., Chaslot, G. M.-B. & Uiterwijk, J. W. Single-player Monte-Carlo tree search. In International Conference on Computers and Games 1–12 (Springer, 2008).

  • 47

    Pohlen, T. et al. Observe and look further: achieving consistent performance on Atari. Preprint at https://arxiv.org/abs/1805.11593 (2018).

  • 48

    Schaul, T., Quan, J., Antonoglou, I. & Silver, D. Prioritized experience replay. In International Conference on Learning Representations (2016).

  • 49

    Cloud TPU. Google Cloud https://cloud.google.com/tpu/ (2019).

  • 50

    Coulom, R. Whole-history rating: a Bayesian rating system for players of time-varying strength. In International Conference on Computers and Games 113–124 (2008).

  • 51

    Nair, A. et al. Massively parallel methods for deep reinforcement learning. Preprint at https://arxiv.org/abs/1507.04296 (2015).

  • 52

    Lanctot, M. et al. OpenSpiel: a framework for reinforcement learning in games. Preprint at http://arxiv.org/abs/1908.09453 (2019).
