Chapter 21

In the absence of feedback from a teacher, an agent can learn a transition model for its own moves and can perhaps learn to predict the opponent's moves, but without some feedback about what is good and what is bad, the agent will have no grounds for deciding which move to make. This kind of feedback is called a reward, or reinforcement.

Clearly, the passive learning task is similar to the policy evaluation task, part of the policy iteration algorithm described in Section 17.3. The main difference is that the passive learning agent does not know the transition model P(s′ | s, a), which specifies the probability of reaching state s′ from state s after doing action a; nor does it know the reward function R(s), which specifies the reward for each state.

A simple method for direct utility estimation was invented in the late 1950s in the area of adaptive control theory by Widrow and Hoff (1960). The utility of each state equals its own reward plus the expected utility of its successor states; that is, for a fixed policy π the utilities obey the Bellman equations

U^π(s) = R(s) + γ Σ_{s′} P(s′ | s, π(s)) U^π(s′),

where γ is the discount factor.
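As a concrete sketch of direct utility estimation, the following Python function estimates the utility of each state as the running average of the observed reward-to-go from each visit, assuming that each trial is a complete sequence of (state, reward) pairs generated by executing a fixed policy until a terminal state is reached. The function name and the two sample trials are illustrative, not taken from the text; the -0.04 step reward echoes the 4×3 grid world of Chapter 17.

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Estimate U(s) by averaging observed rewards-to-go.

    Each trial is a list of (state, reward) pairs produced by
    following a fixed policy until a terminal state is reached.
    """
    totals = defaultdict(float)   # sum of reward-to-go samples per state
    counts = defaultdict(int)     # number of samples per state
    for trial in trials:
        reward_to_go = 0.0
        # Walk the trial backwards so the reward-to-go can be
        # accumulated in one pass: G_t = r_t + gamma * G_{t+1}.
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Illustrative trials: state labels and trajectories are made up,
# with the -0.04 step reward of the textbook's 4x3 example world.
trials = [
    [("(1,1)", -0.04), ("(1,2)", -0.04), ("(1,3)", -0.04), ("(4,3)", 1.0)],
    [("(1,1)", -0.04), ("(2,1)", -0.04), ("(3,1)", -0.04), ("(4,2)", -1.0)],
]
print(direct_utility_estimation(trials))
```

Because each reward-to-go sample is treated independently, this estimator ignores the Bellman constraints linking the utility of a state to that of its successors, which is why direct utility estimation often converges slowly compared with methods that exploit those constraints.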