Week 9 - Reading

Chapter 21

  • In the absence of feedback from a teacher, an agent can learn a transition model for its own moves and can perhaps learn to predict the opponent’s moves, but without some feedback about what is good and what is bad, the agent will have no grounds for deciding which move to make.
  • This kind of feedback is called a reward, or reinforcement.
  • Clearly, the passive learning task is similar to the policy evaluation task, part of the policy iteration algorithm described in Section 17.3. The main difference is that the passive learning agent does not know the transition model P(s′ | s, a), which specifies the probability of reaching state s′ from state s after doing action a; nor does it know the reward function R(s), which specifies the reward for each state.
  • A simple method for direct utility estimation was invented in the late 1950s in the area of adaptive control theory by Widrow and Hoff (1960).
  • The utility of each state equals its own reward plus the expected utility of its successor states (written out as an equation after this list).
  • An adaptive dynamic programming (or ADP) agent takes advantage of the constraints among the utilities of states by learning the transition model that connects them and solving the corresponding Markov decision process using a dynamic programming method (a minimal sketch of such an agent appears after this list).
  • The first approach, Bayesian reinforcement learning, assumes a prior probability P(h) for each hypothesis h about what the true model is; the posterior probability P(h | e) is obtained in the usual way by Bayes’ rule given the observations to date.
  • The second approach, derived from robust control theory, allows for a set of possible  models H and defines an optimal robust policy as one that gives the best outcome in the worst case over H.
  • In the TD update rule (written out after this list), α is the learning rate parameter. Because this update rule uses the difference in utilities between successive states, it is often called the temporal-difference, or TD, equation.
  • Notice that TD does not need a transition model to perform its updates. The environment supplies the connection between neighboring states in the form of observed transitions.
  • The prioritized sweeping heuristic prefers to make adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates.
  • We call this agent the greedy agent. Repeated experiments show that the greedy agent very seldom converges to the optimal policy for this environment and sometimes converges to really horrendous policies.
  • The most obvious change from the passive case is that the agent is no longer equipped with a fixed policy, so, if it learns a utility function U, it will need to learn a model in order to be able to choose an action based on U via one-step look-ahead (written out after this list).
  • There is an alternative TD method, called Q-learning, which learns an action-utility representation instead of learning utilities (a tabular sketch appears after this list).
  • One of the key historical characteristics of much of AI research is its (often unstated) adherence to the knowledge-based approach.
  • One way to handle such problems is to use function approximation, which simply means using any sort of representation for the Q-function other than a lookup table (a linear-approximation sketch appears after this list).
  • The compression achieved by a function approximator allows the learning agent to generalize from states it has visited to states it has not visited.
  • The final approach we will consider for reinforcement learning problems is called policy search.
  • This process is not the same as Q-learning! In Q-learning with function approximation, the algorithm finds a value of θ such that Qˆθ is “close” to Q∗, the optimal Q-function.
  • An obvious solution is to generate a certain number of hands in advance and have each program play the same set of hands. In this way, we eliminate the measurement error due to differences in the cards received.
  • The actions are usually discrete: jerk left or jerk right, the so-called bang-bang control regime.
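The utility relation in the bullet above can be written out as an equation. This is my own rendering in the book's usual notation, assuming a fixed policy π and a discount factor γ:

    U^π(s) = R(s) + γ Σ_{s′} P(s′ | s, π(s)) U^π(s′)

That is, the utility of s is its reward plus the discounted, probability-weighted utilities of the states the policy can lead to next.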
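A minimal sketch of a passive ADP agent in Python. The class and method names (PassiveADPAgent, observe, etc.), the fixed policy dictionary, and the default γ are my own illustration, not code from the chapter:

    from collections import defaultdict

    class PassiveADPAgent:
        """Passive ADP: estimate P(s'|s,a) and R(s) from observed transitions,
        then re-solve the fixed-policy Bellman equations on the learned model."""

        def __init__(self, policy, gamma=0.9):
            self.policy = policy                                  # state -> action (fixed)
            self.gamma = gamma
            self.counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
            self.rewards = {}                                     # state -> observed reward
            self.U = defaultdict(float)                           # utility estimates

        def observe(self, s, a, s_next, r_next):
            """Record one transition and refresh the utility estimates."""
            self.counts[(s, a)][s_next] += 1
            self.rewards[s_next] = r_next
            self._policy_evaluation()

        def _policy_evaluation(self, sweeps=50):
            # Simplified iterative policy evaluation on the learned model.
            for _ in range(sweeps):
                for s, r in self.rewards.items():
                    outcomes = self.counts.get((s, self.policy.get(s)), {})
                    total = sum(outcomes.values())
                    expected = sum(n / total * self.U[s2] for s2, n in outcomes.items()) if total else 0.0
                    self.U[s] = r + self.gamma * expected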
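The TD update rule that the "α is the learning rate" bullet refers to can be written, in the book's notation (γ is the discount factor and s′ the observed successor of s):

    U^π(s) ← U^π(s) + α ( R(s) + γ U^π(s′) − U^π(s) )

The term in parentheses is the temporal difference: how far the current estimate of U^π(s) is from what the observed transition suggests it should be.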
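The one-step look-ahead mentioned in the active-learning bullet chooses, at each state, the action with the highest expected utility under the learned model; roughly:

    π(s) = argmax_a Σ_{s′} P(s′ | s, a) U(s′)

which is why an active agent that learns only U still needs a transition model, whereas a Q-learning agent does not.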
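A minimal tabular Q-learning sketch in Python. The function names (q_update, epsilon_greedy) and the default parameter values are my own, not the book's; only the update rule itself follows the standard Q-learning form:

    import random
    from collections import defaultdict

    Q = defaultdict(float)   # (state, action) -> estimated action utility

    def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        """One Q-learning step:
        Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
        best_next = max(Q[(s_next, a2)] for a2 in actions) if actions else 0.0
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    def epsilon_greedy(s, actions, epsilon=0.1):
        """Explore with probability epsilon, otherwise act greedily on Q."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])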
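A sketch of the simplest kind of function approximation: a linear Qˆθ whose parameters are adjusted by a gradient-style TD step. The feature function and all names here are my own illustration (the chapter discusses the idea, not this particular code):

    import numpy as np

    def linear_q(theta, features):
        """Qˆθ(s, a) = θ · f(s, a) for a feature vector f(s, a)."""
        return theta @ features

    def td_step(theta, feats, r, next_feats_per_action, alpha=0.01, gamma=0.9):
        """Move θ toward the one-step TD target.

        feats                 : feature vector f(s, a) for the action taken
        next_feats_per_action : list of f(s', a') vectors, one per action a'
        """
        target = r + gamma * max(linear_q(theta, f) for f in next_feats_per_action)
        error = target - linear_q(theta, feats)
        # For a linear approximator, the gradient of Qˆθ with respect to θ is f(s, a).
        return theta + alpha * error * feats

Because many states share features, one update to θ changes the estimates for states the agent has never visited, which is exactly the generalization described in the compression bullet above.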
