Campbell, M., Hoane, AJ Jr & Hsu, F.-h. Deep Blue. Artif. Intell. 134, 57–83 (2002).
Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
Bellemare, MG, Naddaf, Y., Veness, J. & Bowling, M. The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013).
Machado, M. et al. Revisiting the arcade learning environment: evaluation protocols and open problems for general agents. J. Artif. Intell. Res. 61, 523–562 (2018).
Silver, D. et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362, 1140–1144 (2018).
Schaeffer, J. et al. A world championship caliber checkers program. Artif. Intell. 53, 273–289 (1992).
Brown, N. & Sandholm, T. Superhuman AI for heads-up no-limit poker: Libratus beats the best professionals. Science 359, 418–424 (2018).
Moravčík, M. et al. DeepStack: expert-level artificial intelligence in heads-up no-limit poker. Science 356, 508–513 (2017).
Vlahavas, I. & Refanidis, I. Planning and Scheduling. Technical Report (EETN, 2013).
Segler, MH, Preuss, M. & Waller, MP Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
Sutton, RS & Barto, AG Reinforcement Learning: An Introduction 2nd ed. (MIT Press, 2018).
Deisenroth, M. & Rasmussen, C. PILCO: a model-based and data-efficient approach to policy search. In Proc. 28th International Conference on Machine Learning, ICML 2011 465–472 (Omnipress, 2011).
Heess, N. et al. Learning continuous control policies by stochastic value gradients. In NIPS’15: Proc. 28th International Conference on Neural Information Processing Systems Vol. 2 (eds Cortes, C. et al.) 2944–2952 (MIT Press, 2015).
Levine, S. & Abbeel, P. Learning neural network policies with guided policy search under unknown dynamics. Adv. Neural Inf. Process. Syst. 27, 1071–1079 (2014).
Hafner, D. et al. Learning latent dynamics for planning from pixels. Preprint at https://arxiv.org/abs/1811.04551 (2018).
Kaiser, L. et al. Model-based reinforcement learning for Atari. Preprint at https://arxiv.org/abs/1903.00374 (2019).
Buesing, L. et al. Learning and querying fast generative models for reinforcement learning. Preprint at https://arxiv.org/abs/1802.03006 (2018).
Espeholt, L. et al. IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In Proc. International Conference on Machine Learning, ICML Vol. 80 (eds Dy, J. & Krause, A.) 1407–1416 (2018).
Kapturowski, S., Ostrovski, G., Dabney, W., Quan, J. & Munos, R. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations (2019).
Horgan, D. et al. Distributed prioritized experience replay. In International Conference on Learning Representations (2018).
Puterman, ML Markov Decision Processes: Discrete Stochastic Dynamic Programming 1st ed. (John Wiley & Sons, 1994).
Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games 72–83 (Springer, 2006).
Wahlström, N., Schön, TB & Deisenroth, MP From pixels to torques: policy learning with deep dynamical models. Preprint at http://arxiv.org/abs/1502.02251 (2015).
Watter, M., Springenberg, JT, Boedecker, J. & Riedmiller, M. Embed to control: a locally linear latent dynamics model for control from raw images. In NIPS’15: Proc. 28th International Conference on Neural Information Processing Systems Vol. 2 (eds Cortes, C. et al.) 2746–2754 (MIT Press, 2015).
Ha, D. & Schmidhuber, J. Recurrent world models facilitate policy evolution. In NIPS’18: Proc. 32nd International Conference on Neural Information Processing Systems (eds Bengio, S. et al.) 2455–2467 (Curran Associates, 2018).
Gelada, C., Kumar, S., Buckman, J., Nachum, O. & Bellemare, MG DeepMDP: learning continuous latent space models for representation learning. Proc. 36th International Conference on Machine Learning: Volume 97 of Proc. Machine Learning Research (eds Chaudhuri, K. & Salakhutdinov, R.) 2170–2179 (PMLR, 2019).
van Hasselt, H., Hessel, M. & Aslanides, J. When to use parametric models in reinforcement learning? Preprint at https://arxiv.org/abs/1906.05243 (2019).
Tamar, A., Wu, Y., Thomas, G., Levine, S. & Abbeel, P. Value iteration networks. Adv. Neural Inf. Process. Syst. 29, 2154–2162 (2016).
Silver, D. et al. The predictron: end-to-end learning and planning. In Proc. 34th International Conference on Machine Learning Vol. 70 (eds Precup, D. & Teh, YW) 3191–3199 (JMLR, 2017).
Farahmand, AM, Barreto, A. & Nikovski, D. Value-aware loss function for model-based reinforcement learning. In Proc. 20th International Conference on Artificial Intelligence and Statistics: Volume 54 of Proc. Machine Learning Research (eds Singh, A. & Zhu, J.) 1486–1494 (PMLR, 2017).
Farahmand, A. Iterative value-aware model learning. Adv. Neural Inf. Process. Syst. 31, 9090–9101 (2018).
Farquhar, G., Rocktaeschel, T., Igl, M. & Whiteson, S. TreeQN and ATreeC: differentiable tree planning for deep reinforcement learning. In International Conference on Learning Representations (2018).
Oh, J., Singh, S. & Lee, H. Value prediction network. Adv. Neural Inf. Process. Syst. 30, 6118–6128 (2017).
Krizhevsky, A., Sutskever, I. & Hinton, GE ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012).
He, K., Zhang, X., Ren, S. & Sun, J. Identity mappings in deep residual networks. In 14th European Conference on Computer Vision 630–645 (2016).
Hessel, M. et al. Rainbow: combining improvements in deep reinforcement learning. In Thirty-second AAAI Conference on Artificial Intelligence (2018).
Schmitt, S., Hessel, M. & Simonyan, K. Off-policy actor-critic with shared experience replay. Preprint at https://arxiv.org/abs/1909.11583 (2019).
Azizzadenesheli, K. et al. Surprising negative results for generative adversarial tree search. Preprint at http://arxiv.org/abs/1806.05780 (2018).
Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
OpenAI. OpenAI Five. OpenAI https://blog.openai.com/openai-five/ (2018).
Vinyals, O. et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019).
Jaderberg, M. et al. Reinforcement learning with unsupervised auxiliary tasks. Preprint at https://arxiv.org/abs/1611.05397 (2016).
Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).
Kocsis, L. & Szepesvári, C. Bandit based Monte-Carlo planning. In European Conference on Machine Learning 282–293 (Springer, 2006).
Rosin, CD Multi-armed bandits with episode context. Ann. Math. Artif. Intell. 61, 203–230 (2011).
Schadd, MP, Winands, MH, Van Den Herik, HJ, Chaslot, GM-B. & Uiterwijk, JW Single-player Monte-Carlo tree search. In International Conference on Computers and Games 1–12 (Springer, 2008).
Pohlen, T. et al. Observe and look further: achieving consistent performance on Atari. Preprint at https://arxiv.org/abs/1805.11593 (2018).
Schaul, T., Quan, J., Antonoglou, I. & Silver, D. Prioritized experience replay. In International Conference on Learning Representations (2016).
Cloud TPU. Google Cloud https://cloud.google.com/tpu/ (2019).
Coulom, R. Whole-history rating: a Bayesian rating system for players of time-varying strength. In International Conference on Computers and Games 113–124 (2008).
Nair, A. et al. Massively parallel methods for deep reinforcement learning. Preprint at https://arxiv.org/abs/1507.04296 (2015).
Lanctot, M. et al. OpenSpiel: a framework for reinforcement learning in games. Preprint at http://arxiv.org/abs/1908.09453 (2019).