Since their breakthrough in computer Go, Monte Carlo tree search (MCTS) methods have sparked a revolution in game-playing agents: the artificial intelligence (AI) community has since developed an enormous number of MCTS variants and enhancements that have advanced the state of the art not only in games, but also in several other domains. Although MCTS methods combine the generality of random sampling with the precision of tree search, their convergence rate can be relatively low in practice, especially when not aided by additional enhancements. Practitioners therefore often combine them with expert or prior knowledge, heuristics, and handcrafted strategies. Despite outstanding results (such as the AlphaGo engine, which defeated the best human Go players, thereby overcoming this grand challenge of AI), such task-specific enhancements reduce the generality of many applied MCTS algorithms. Improving the performance of core MCTS methods while retaining their generality and scalability has proven difficult and remains an open research challenge. This thesis presents a new approach to the general improvement of MCTS methods and, at the same time, advances the fundamental theory behind MCTS by drawing inspiration from the older and well-established field of reinforcement learning (RL). The links between MCTS, commonly regarded as a search and planning framework, and RL theory have been outlined in the past; however, they have neither been thoroughly studied, nor have the existing studies significantly influenced the larger game AI community. Motivated by this, we re-examine in depth the close relation between the two fields, detailing not only their similarities but also identifying and emphasizing their differences.
We present a practical way of extending MCTS methods with RL dynamics: we develop the temporal-difference tree search (TDTS) framework, a novel class of MCTS-like algorithms that learn via temporal differences (TD) instead of Monte Carlo sampling. This can be understood both as a generalization of MCTS with TD learning and as an extension of traditional TD learning methods with tree search and novel MCTS concepts. Through TDTS we show that a straightforward adaptation of RL semantics within tree search can lead to a wealth of new algorithms, of which traditional MCTS is only one variant. We experiment with several such algorithms, focusing on an extension of the upper confidence bounds for trees (UCT) algorithm with Sarsa(\lambda), an on-policy TD learning method. Our evaluations confirm that MCTS-like methods inspired by RL dynamics achieve superior results on several classic board games and arcade video games: the new algorithms retain the robustness and computational complexity of their basic MCTS counterparts while consistently outperforming them. Our findings encourage cross-fertilization between the game AI and RL communities, hopefully narrowing the gap between them, and promote a unified view of search and learning methods as a promising research direction.
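The core distinction — backing up full Monte Carlo returns, as standard UCT does, versus backing up bootstrapped temporal-difference targets — can be illustrated with a minimal sketch. This is not the thesis's actual TDTS implementation: the function names, the dictionary-based value table, and the parameters `alpha`, `gamma`, and `lam` are assumptions chosen purely for illustration, and the TD variant shown is a simple backward pass over one episode using the \lambda-return recursion.

```python
# Illustrative sketch only (not the TDTS algorithms of the thesis):
# two backup rules applied along one simulated episode (a path of states).

def mc_backup(values, counts, path, ret):
    """Monte Carlo backup as in standard MCTS/UCT:
    every visited state incrementally averages the final episode return."""
    for s in path:
        counts[s] = counts.get(s, 0) + 1
        values[s] = values.get(s, 0.0) + (ret - values.get(s, 0.0)) / counts[s]

def td_lambda_backup(values, path, rewards, alpha=0.1, gamma=1.0, lam=0.8):
    """TD(lambda)-style backup: a backward pass over the episode in which
    each state is updated toward its lambda-return, bootstrapping on the
    current value estimate of the successor state."""
    g = 0.0  # lambda-return accumulated beyond the current state (0 past terminal)
    successors = path[1:] + [None]  # the terminal position has no successor
    for s, r, s_next in reversed(list(zip(path, rewards, successors))):
        v_next = values.get(s_next, 0.0) if s_next is not None else 0.0
        # Lambda-return recursion: G = r + gamma * ((1-lam)*V(s') + lam*G')
        g = r + gamma * ((1 - lam) * v_next + lam * g)
        values[s] = values.get(s, 0.0) + alpha * (g - values.get(s, 0.0))

# Hypothetical episode: three states, reward 1.0 on the final transition.
path, rewards = ["root", "a", "b"], [0.0, 0.0, 1.0]
mc_values, mc_counts, td_values = {}, {}, {}
mc_backup(mc_values, mc_counts, path, ret=sum(rewards))
td_lambda_backup(td_values, path, rewards)
```

With `lam=1.0` and `alpha` set to the inverse visit count, the TD backup collapses to the Monte Carlo average, which is one way of viewing standard MCTS as a special case within this family.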