Chapter 20

This kind of feedback is called a reward, or reinforcement. In games like chess, the reinforcement is received only at the end of the game; we call this a terminal state in the state history sequence.

The agent can be a passive learner or an active learner. A passive learner simply watches the world going by and tries to learn the utility of being in various states; an active learner must also act using the learned information, and can use its problem generator to suggest explorations of unknown portions of the environment.

The agent can also learn an action-value function giving the expected utility of taking a given action in a given state. This is called Q-learning.

We define the reward-to-go of a state as the sum of the rewards from that state until a terminal state is reached. Given this definition, it is easy to see that the expected utility of a state is the expected reward-to-go of that state. A simple method for updating utility...
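As a rough illustration of these two ideas, here is a minimal Python sketch: a passive learner that averages observed reward-to-go values to estimate state utilities, and the tabular Q-learning update for an active, model-free learner. The names (reward_to_go, update_utilities, q_update) and the defaultdict tables are assumptions made for the example, not code from the chapter; the learning rate alpha and the undiscounted update use the simplest form of the rule.

from collections import defaultdict

def reward_to_go(rewards):
    # rewards[i] is the reward received at step i of one training sequence;
    # the reward-to-go of step i is the sum of rewards from i to the end.
    totals, running = [], 0.0
    for r in reversed(rewards):
        running += r
        totals.append(running)
    return list(reversed(totals))

def update_utilities(U, counts, states, rewards):
    # Passive learner: after each complete sequence, treat the observed
    # reward-to-go of every visited state as one sample of its utility
    # and keep a running mean of the samples seen so far.
    for s, rtg in zip(states, reward_to_go(rewards)):
        counts[s] += 1
        U[s] += (rtg - U[s]) / counts[s]

def q_update(Q, s, a, r, s_next, actions, alpha=0.1):
    # Active learner (Q-learning), assuming at least one available action:
    #   Q(a, s) <- Q(a, s) + alpha * (r + max_a' Q(a', s_next) - Q(a, s))
    best_next = max(Q[(a2, s_next)] for a2 in actions)
    Q[(a, s)] += alpha * (r + best_next - Q[(a, s)])

# Both tables can start as empty dictionaries of zeros:
U, counts = defaultdict(float), defaultdict(int)
Q = defaultdict(float)

The utility table only requires watching trials, which is all a passive learner can do; the Q table additionally lets an active learner rank its available actions in each state without needing a model of the environment.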
Chapter 14

Agents almost never have access to the whole truth about their environment. The right thing to do, the rational decision, therefore depends both on the relative importance of the various goals and on the likelihood that, and degree to which, they will be achieved.

Probability provides a way of summarizing the uncertainty that comes from our laziness and ignorance. Probability theory makes the same ontological commitment as logic, namely, that facts either do or do not hold in the world. Degree of truth, as opposed to degree of belief, is the subject of fuzzy logic.

Before any evidence is obtained, we speak of prior or unconditional probability; after the evidence is obtained, we speak of posterior or conditional probability.

An agent is rational if and only if it chooses the action that yields the highest expected utility, averaged over all the possible outcomes of the action. If Agent 1 expresses a set of degrees of belief that violate the axioms of probability theory, then there is a betting strategy for Agent 2 that guarantees Agent 1 will lose money.
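A minimal Python sketch of the two quantitative ideas above, assuming a discrete set of hypotheses: Bayes' rule turns a prior into a posterior once evidence arrives, and the maximum-expected-utility rule picks an action by weighting each outcome's utility by its probability. The function names and the cavity/toothache numbers are illustrative assumptions, not figures from the chapter.

def posterior(prior, likelihood):
    # Bayes' rule over a discrete set of hypotheses:
    #   P(h | e) is proportional to P(e | h) * P(h)
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    z = sum(unnormalized.values())
    return {h: p / z for h, p in unnormalized.items()}

def best_action(actions, outcomes, utility):
    # Maximum expected utility: choose the action whose utility,
    # averaged over the probabilities of its possible outcomes, is highest.
    # outcomes(a) returns (probability, outcome) pairs for action a.
    def expected_utility(a):
        return sum(p * utility(o) for p, o in outcomes(a))
    return max(actions, key=expected_utility)

# Toy update: prior belief in 'cavity', then the evidence 'toothache' arrives.
prior = {'cavity': 0.1, 'no_cavity': 0.9}
likelihood = {'cavity': 0.8, 'no_cavity': 0.05}   # assumed values of P(toothache | h)
print(posterior(prior, likelihood))               # posterior: cavity ~ 0.64, no_cavity ~ 0.36

The same posterior probabilities are what best_action would average over when scoring a treatment decision, which is exactly the sense in which rational choice combines degrees of belief with utilities.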