Week 7 - Reading
Chapter 20
- This kind of feedback is called a reward, or reinforcement. In games like chess, the reinforcement is received only at the end of the game. We call this a terminal state in the state history sequence.
- The agent can be a passive learner or an active learner. A passive learner simply watches the world going by, and tries to learn the utility of being in various states; an active learner must also act using the learned information, and can use its problem generator to suggest explorations of unknown portions of the environment.
- The agent learns an action-value function giving the expected utility of taking a given action in a given state. This is called Q-learning. (A small Q-learning sketch appears after these notes.)
- We define the reward-to-go of a state as the sum of the rewards from that state until a terminal state is reached. Given this definition, it is easy to see that the expected utility of a state is the expected reward-to-go of that state.
- A simple method for updating utility estimates was invented in the late 1950s in the area of adaptive control theory by Widrow and Hoff. (A reward-to-go / running-average sketch appears after these notes.)
- The actual utility of a state is constrained to be the probability-weighted average of its successors' utilities, plus its own reward.
- The process of solving the equations is therefore identical to a single value determination phase in the policy iteration algorithm.
- We will use the term adaptive dynamic programming (or ADP) to denote any reinforcement learning method that works by solving the utility equations with a dynamic programming algorithm. (A value-determination sketch appears below.)
- The key is to use the observed transitions to adjust the values of the observed states so that they agree with the constraint equations. (A one-line TD update is sketched below.)
- The prioritized-sweeping heuristic prefers to make adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates. (Sketched after these notes.)
- Obviously, we need an approach somewhere between wackiness and greediness. The agent should be more wacky when it has little idea of the environment, and more greedy when it has a model that is close to being correct. Can we be a little more precise than this? Is there an optimal exploration policy? It turns out that this question has been studied in depth in the subfield of statistical decision theory that deals with so-called bandit problems. (A simple decaying-epsilon compromise is sketched after these notes.)
- An action-value function assigns an expected utility to taking a given action in a given state; as mentioned earlier, such values are also called Q-values.
- The compression achieved by an implicit representation allows the learning agent to generalize from states it has visited to states it has not visited. (A linear-feature sketch appears after these notes.)
- The actions are usually discrete—jerk left or jerk right, the so-called bang-bang control regime.
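A few rough sketches of the ideas above, all in Python. These are my own illustrative code with assumed names and hyperparameters, not code from the book.

For the Q-learning note: a minimal tabular update that moves Q(state, action) toward the observed reward plus the best Q-value of the next state. The learning rate ALPHA and discount GAMMA are assumed constants.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9      # assumed learning rate and discount factor
Q = defaultdict(float)       # Q[(state, action)] -> current action-value estimate

def q_update(state, action, reward, next_state, next_actions):
    """One Q-learning backup after observing (state, action, reward, next_state)."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# e.g. after one observed step in a grid world:
q_update("s1", "right", -0.04, "s2", ["up", "down", "left", "right"])
```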
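For the reward-to-go and Widrow-Hoff notes: after each complete trial, the observed reward-to-go of every state in the sequence is one sample of that state's utility, and a running average pulls the estimate toward it. The trial format (a list of (state, reward) pairs ending at a terminal state) is an assumption.

```python
from collections import defaultdict

U = defaultdict(float)   # utility estimates
N = defaultdict(int)     # number of reward-to-go samples seen per state

def update_from_trial(trial):
    """trial: list of (state, reward) pairs, ending in a terminal state."""
    total = sum(r for _, r in trial)
    seen_so_far = 0.0
    for state, reward in trial:
        reward_to_go = total - seen_so_far                 # rewards from this state to the end
        N[state] += 1
        U[state] += (reward_to_go - U[state]) / N[state]   # running average of samples
        seen_so_far += reward
```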
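For the constraint-equation, value-determination, and ADP notes: once a transition model M and rewards R have been estimated from the observed transitions, the utilities of a fixed policy satisfy U = R + M U, a plain linear system. A NumPy sketch; the three-state example is made up, and it assumes every state eventually reaches a terminal state (whose row in M is all zeros).

```python
import numpy as np

def value_determination(M, R):
    """M[i, j] = estimated probability of moving from state i to state j under the
    fixed policy; R[i] = estimated reward of state i. Solves (I - M) U = R."""
    return np.linalg.solve(np.eye(len(R)) - M, R)

# e.g. a three-state chain s0 -> s1 -> s2, where s2 is terminal:
M = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
R = np.array([-0.04, -0.04, 1.0])
print(value_determination(M, R))   # [0.92, 0.96, 1.0]
```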
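For the observed-transitions note, this is the temporal-difference idea: after moving from one state to the next, nudge the utility of the state just left toward its own reward plus the utility of the state just entered, so the estimates drift into agreement with the constraint equation. ALPHA is an assumed learning rate.

```python
ALPHA = 0.1   # assumed learning rate

def td_update(U, state, reward, next_state):
    """Adjust U[state] toward what the constraint equation says it should be."""
    U[state] += ALPHA * (reward + U[next_state] - U[state])
```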
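For the prioritized-sweeping note: keep a priority queue of states whose estimates are probably stale, back up the most urgent ones first, and whenever a backup changes a value by a lot, queue that state's predecessors. The model and predecessor maps, the thresholds, and the fixed-policy setting are all assumptions.

```python
import heapq

def prioritized_sweep(U, R, model, predecessors, queue, theta=1e-3, budget=50):
    """model[s] -> list of (prob, next_state) under the fixed policy;
    predecessors[s] -> states that can transition into s;
    queue -> heap of (-priority, state) pairs. Does at most `budget` backups."""
    for _ in range(budget):
        if not queue:
            break
        _, s = heapq.heappop(queue)
        if s not in model:                        # terminal state: nothing to back up
            continue
        new_u = R[s] + sum(p * U[s2] for p, s2 in model[s])   # one Bellman backup
        change, U[s] = abs(new_u - U[s]), new_u
        if change > theta:                        # U[s] moved a lot, so its predecessors
            for prev in predecessors.get(s, ()):  # are now likely to need adjustment too
                heapq.heappush(queue, (-change, prev))
```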
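For the wacky-versus-greedy note, one simple compromise (not the chapter's bandit-theoretic answer): act randomly with a probability that decays as a state is visited more often, so the agent is wacky early and greedy once it knows the territory. The decay schedule and names are assumptions.

```python
import random

def epsilon_greedy(state, actions, Q, visits, epsilon0=1.0):
    """Explore with probability epsilon0 / (visits to state), otherwise act greedily."""
    visits[state] = visits.get(state, 0) + 1
    if random.random() < epsilon0 / visits[state]:
        return random.choice(actions)                              # wacky: explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # greedy: exploit
```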
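For the implicit-representation note: represent utility as a weighted sum of state features rather than a table, so one weight update generalizes to every state that shares those features. The feature vectors, step size, and discount are assumptions.

```python
def predict(weights, features):
    """Estimated utility of a state described by its feature vector."""
    return sum(w * f for w, f in zip(weights, features))

def td_weight_update(weights, features, features_next, reward, alpha=0.01, gamma=0.9):
    """Widrow-Hoff / gradient-style weight update from one observed transition."""
    error = reward + gamma * predict(weights, features_next) - predict(weights, features)
    return [w + alpha * error * f for w, f in zip(weights, features)]
```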