Monte Carlo tree search (MCTS) became well known through its success in
the game of Go, where no computer had ever before won a game against a
human master player. Multiple variations of the algorithm have been
developed since; one of the best known is Upper Confidence Bounds for
Trees (UCT) by Kocsis and Szepesvári. Many enhancements to the basic
MCTS algorithm rely on domain-specific heuristics, which make the
algorithm less general.
The goal of this thesis is to investigate how to improve the MCTS algorithm
without compromising its generality. Temporal Difference (TD) learning, a
Reinforcement Learning (RL) paradigm, combines ideas from two methods:
Dynamic Programming (DP) and the Monte Carlo (MC) method. Our goal was to
incorporate the advantages of TD learning into the MCTS algorithm. The
main idea was to change how the reward of each node is calculated and
when it is updated.
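To make the contrast concrete, the following is a minimal sketch of how a
TD-style backup differs from the plain Monte Carlo averaging used in
standard MCTS backpropagation. It is not the thesis's actual Sarsa-TS(λ)
implementation; all names and parameters (Node, alpha, gamma) are
illustrative assumptions.

    class Node:
        def __init__(self):
            self.visits = 0
            self.value = 0.0  # estimated value of the node's state

    def mc_backup(path, reward):
        """Plain MCTS: every node on the visited path averages the
        final reward of the simulated game."""
        for node in path:
            node.visits += 1
            node.value += (reward - node.value) / node.visits

    def td_backup(path, reward, alpha=0.1, gamma=1.0):
        """TD-style: walking back from the leaf, each node moves toward
        the (discounted) estimate of its successor, so intermediate
        value estimates are used, not only the final reward."""
        target = reward
        for node in reversed(path):
            node.visits += 1
            node.value += alpha * (target - node.value)
            target = gamma * node.value  # bootstrap from the updated estimate

In the MC version a node's value changes only in response to the final
outcome, while in the TD version the update also depends on the current
estimates of nodes closer to the leaf, which is the bootstrapping idea
the thesis builds on.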
The results of the experiments show that combining the MCTS algorithm
with the TD learning paradigm is indeed worthwhile. The newly developed
Sarsa-TS(λ) shows a general improvement in performance. Since the games
used in our experiments differ considerably, the size of the improvement
varies from game to game.