
**Authors:** Michael Kearns, Yishay Mansour, Andrew Y. Ng (2000)

### Abstract

We consider the problem of reliably choosing a near-best strategy from a restricted class of strategies Π in a partially observable Markov decision process (POMDP). In particular, we are interested in what might be considered the sample complexity — that is, the amount of data or experience we must generate in the POMDP in order to choose a good strategy. We assume we are given the ability to simulate the behavior of the POMDP, and we provide methods for generating simulated experience sufficient to accurately approximate the expected return of any strategy in Π. We prove upper bounds on the amount of simulated experience our methods must generate in order to achieve such uniform approximation. These bounds have no dependence on the size or complexity of the underlying POMDP, depend only linearly on the complexity of the restricted strategy class Π, and depend exponentially on the horizon time. The main challenge in obtaining such bounds lies in generating trajectories that can be reused, in the sense that they simultaneously provide estimates of the return of many strategies in the class. Our methods can be viewed as generating a "small" set of trajectories that provide an accurate estimate of the value of any strategy in the class. They can thus be easily used with many standard approaches to search in strategy space, such as gradient descent and local search. For such algorithms, the exponential dependence on the horizon time can be replaced by a factor linear in the number of steps of search to be performed. Although we emphasize the use of entire (finite-horizon) trajectories for obtaining accurate value function estimates, we note that our methods can also be combined with standard TD(λ) updates and variants. Our measure of strategy class complexity generalizes the classical notion of VC dimension, and our methods develop connections between problems of current interest in reinforcement learning and well-studied issues in the theory of supervised learning.
We also discuss a number of practical planning algorithms for POMDPs that arise from our methods. *Contact author. Address: AT&T Labs, Room A235, 180 Park Avenue, Florham Park, New Jersey, 07932. E-mail: mkearns@research.att.com.
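To make the "reusable trajectories" idea above concrete, the following is a minimal sketch, not the paper's exact algorithm: generate trajectories by taking uniformly random actions in a simulated POMDP, then reuse the same trajectory set to estimate the expected return of every strategy in a class via importance weighting. The two-observation toy environment, the memoryless deterministic strategy class, and all function names below are invented for illustration.

```python
import random

ACTIONS = [0, 1]
HORIZON = 3

def simulate_step(obs, action, rng):
    """Toy POMDP dynamics (invented): reward for matching the observation,
    next observation drawn uniformly."""
    reward = 1.0 if action == obs else 0.0
    next_obs = rng.choice([0, 1])
    return next_obs, reward

def generate_trajectory(rng):
    """One H-step trajectory under the uniform-random behavior strategy."""
    obs = rng.choice([0, 1])
    traj = []
    for _ in range(HORIZON):
        a = rng.choice(ACTIONS)
        next_obs, r = simulate_step(obs, a, rng)
        traj.append((obs, a, r))
        obs = next_obs
    return traj

def estimate_return(policy, trajectories):
    """Importance-weighted return estimate for a deterministic memoryless
    policy (a dict: observation -> action). A trajectory contributes only
    if the policy agrees with every action taken; the weight |A|^H corrects
    for sampling actions uniformly. The same trajectories serve every
    policy in the class, which is the reuse the abstract refers to."""
    total = 0.0
    for traj in trajectories:
        if all(policy[o] == a for (o, a, _) in traj):
            total += len(ACTIONS) ** HORIZON * sum(r for (_, _, r) in traj)
    return total / len(trajectories)

rng = random.Random(0)
trajs = [generate_trajectory(rng) for _ in range(20000)]

# One shared trajectory set yields an estimate for every strategy.
for policy in [{0: 0, 1: 1}, {0: 1, 1: 0}, {0: 0, 1: 0}]:
    print(policy, round(estimate_return(policy, trajs), 2))
```

Note the trade-off the abstract's bounds describe: the weight |A|^H makes the estimator's variance grow exponentially with the horizon, which is where the exponential horizon dependence enters.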
