A sparse sampling algorithm for near-optimal planning in large Markov decision processes
An issue that is critical for the application of Markov decision processes (MDPs) to realistic problems is how the complexity of planning scales with the size of the MDP. In stochas-tic environments with very large or even infi-nite state spaces, traditional planning and reinforcement learning algorithms are often in-applicable, since their running time typically scales linearly with the state space size. In this paper we present a new algorithm that, given only a generative model (simulator) for an arbitrary MDP, performs near-optimal planning with a running time that has no dependence onthe number of states. Although the running time is exponential in the horizon time (which depends only on the discount factor 7 and the desired degree of approximation to the optimal policy), our results establish for the first time that there are no theoretical barriers to computing near-optimal policies in arbitrarily large, unstructured MDPs.Our algorithm is based on the idea of sparse sampling . We prove that a randomly sampled look-ahead tree that covers only a vanishing fraction of the full look-ahead tree nevertheless suffices to compute near-optimal actions from any state of an MDP. Practical implementations of the algorithm are discussed, and we draw ties to our related recent results on finding a near best strategy from a given classof strategies in very large partially observable MDPs [KMN99].