Policy invariance under reward transformations: Theory and application to reward shaping

Authors: Stuart Russell, Daishi Harada, and Andrew Y. Ng (2012)


This paper investigates conditions under which modifications to the reward function of a Markov decision process preserve the optimal policy. It is shown that, besides the positive linear transformation familiar from utility theory, one can add a reward for transitions between states that is expressible as the difference in value of an arbitrary potential function applied to those states. Furthermore, this is shown to be a necessary condition for invariance, in the sense that any other transformation may yield suboptimal policies unless further assumptions are made about the underlying MDP. These results shed light on the practice of reward shaping, a method used in reinforcement learning whereby additional training rewards are used to guide the learning agent. In particular, some well-known "bugs" in reward shaping procedures are shown to arise from non-potential-based rewards, and methods are given for constructing shaping potentials corresponding to distance-based and subgoal-based heuristics. We show that such potentials can lead to substantial reductions in learning time.
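
The invariance claim can be checked numerically on a toy problem. The sketch below is not taken from the paper: the five-state chain, the discount factor, the potential values in PHI, and the subgoal bonus are all illustrative assumptions. It assumes the discounted form of the shaping term, F(s, a, s') = gamma * PHI(s') - PHI(s), and compares greedy policies obtained by value iteration under the original reward, a potential-based shaped reward, and a naive subgoal bonus that is not potential-based.

```python
GAMMA = 0.9            # discount factor (assumed for this toy example)
N_STATES = 5           # states 0..4 on a chain; state 4 is the rewarding end
ACTIONS = (-1, +1)     # action index 0 = move left, 1 = move right


def step(s, a):
    """Deterministic transition: move left or right, clipped at the ends."""
    return min(max(s + ACTIONS[a], 0), N_STATES - 1)


def base_reward(s, a, s2):
    """Original reward: 1 for arriving at (or staying in) the last state."""
    return 1.0 if s2 == N_STATES - 1 else 0.0


# An arbitrary, deliberately unhelpful potential (hypothetical values).
PHI = [3.0, -1.0, 4.0, 1.0, -5.0]


def shaped_reward(s, a, s2):
    """Potential-based shaping: R + gamma * PHI(s') - PHI(s)."""
    return base_reward(s, a, s2) + GAMMA * PHI[s2] - PHI[s]


def naive_reward(s, a, s2):
    """Non-potential-based shaping: a flat bonus for reaching subgoal state 2."""
    return base_reward(s, a, s2) + (5.0 if s2 == 2 else 0.0)


def q_values(reward, iters=1000):
    """Synchronous Q-value iteration for the deterministic chain."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(iters):
        v = [max(row) for row in q]
        q = [[reward(s, a, step(s, a)) + GAMMA * v[step(s, a)]
              for a in range(2)]
             for s in range(N_STATES)]
    return q


def greedy(q):
    """Greedy action index in each state."""
    return [max(range(2), key=lambda a: row[a]) for row in q]


base_q = q_values(base_reward)
shaped_q = q_values(shaped_reward)
naive_q = q_values(naive_reward)

print("original policy:", greedy(base_q))
print("potential-based:", greedy(shaped_q))
print("naive bonus:    ", greedy(naive_q))
```

In this sketch the shaped Q-values differ from the original ones only by the state-dependent offset PHI(s), so the greedy action in each state is the same even though the potential points the wrong way. The flat subgoal bonus, by contrast, makes it worthwhile to move back toward state 2 from state 3 and collect the bonus repeatedly, which is the kind of shaping "bug" the abstract refers to.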
