Week 6 - Reading
Chapter 18
18.1 - 18.6
- This chapter (and most of current machine learning research) covers inputs that form a factored representation—a vector of attribute values—and outputs that can be either a continuous numerical value or a discrete value.
- We say that learning a (possibly incorrect) general function or rule from specific input–output pairs is called inductive learning.
- In unsupervised learning the agent learns patterns in the input even though no explicit feedback is supplied.
- In reinforcement learning the agent learns from a series of reinforcements—rewards or punishments.
- In supervised learning the agent observes some example input–output pairs and learns a function that maps from input to output.
- In semi-supervised learning we are given a few labeled examples and must make what we can of a large collection of unlabeled examples.
- When the output y is one of a finite set of values (such as sunny, cloudy or rainy), the learning problem is called classification.
- How do we choose from among multiple consistent hypotheses? One answer is to prefer the simplest hypothesis consistent with the data.
- In general, there is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better.
- There is a tradeoff between the expressiveness of a hypothesis space and the complexity of finding a good hypothesis within that space.
- A decision tree represents a function that takes as input a vector of attribute values and returns a “decision”—a single output value.
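A minimal sketch of that idea: a decision tree stored as nested dicts, where internal nodes test one attribute and leaves hold the output value. The attributes and values here are illustrative, loosely modeled on the restaurant-waiting example from the chapter.

```python
def decide(tree, example):
    """Walk the tree: internal nodes test an attribute, leaves hold the decision."""
    if not isinstance(tree, dict):
        return tree  # leaf: the output value
    attribute, branches = tree["test"], tree["branches"]
    return decide(branches[example[attribute]], example)

# Hypothetical "should we wait for a table?" tree.
wait_tree = {
    "test": "patrons",
    "branches": {
        "none": False,
        "some": True,
        "full": {"test": "hungry", "branches": {True: False, False: True}},
    },
}

print(decide(wait_tree, {"patrons": "some"}))                  # True
print(decide(wait_tree, {"patrons": "full", "hungry": True}))  # False
```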
- We can answer this question by using a statistical significance test.
- You can think of the task of finding the best hypothesis as two tasks: model selection defines the hypothesis space and then optimization finds the best hypothesis within that space.
- In machine learning it is traditional to express utilities by means of a loss function.
- Traditional methods in statistics and the early years of machine learning concentrated on small-scale learning, where the number of training examples ranged from dozens to the low thousands.
- Any hypothesis that is seriously wrong will almost certainly be “found out” with high probability after a small number of examples, because it will make an incorrect prediction. Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: that is, it must be probably approximately correct.
- Any learning algorithm that returns hypotheses that are probably approximately correct is called a PAC learning algorithm.
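The chapter makes "sufficiently large" precise with a sample-complexity bound: for a consistent hypothesis to be probably (with probability at least $1-\delta$) approximately (error at most $\epsilon$) correct, it suffices to see

```latex
N \ge \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + \ln|\mathcal{H}|\right)
```

examples, where $|\mathcal{H}|$ is the size of the hypothesis space.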
- Many forms of learning involve adjusting weights to minimize a loss, so it helps to have a mental picture of what’s going on in weight space—the space defined by all possible settings of the weights.
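A minimal sketch of moving through weight space: batch gradient descent on squared loss for a univariate linear model h(x) = w1·x + w0. The data and learning rate are illustrative, not from the text.

```python
def gradient_step(w0, w1, data, alpha=0.01):
    # Partial derivatives of the summed squared loss with respect to each weight.
    g0 = sum(2 * (w0 + w1 * x - y) for x, y in data)
    g1 = sum(2 * (w0 + w1 * x - y) * x for x, y in data)
    # Move downhill in weight space.
    return w0 - alpha * g0, w1 - alpha * g1

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # exactly y = 2x
w0, w1 = 0.0, 0.0
for _ in range(2000):
    w0, w1 = gradient_step(w0, w1, data)
print(round(w0, 3), round(w1, 3))  # converges toward (0, 2)
```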
- Thus, it is common to use regularization on multivariate linear functions to avoid overfitting.
18.8 - 18.9
- This approach is called instance-based learning or memory-based learning.
- A more complex metric known as the Mahalanobis distance takes into account the covariance between dimensions.
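A minimal sketch for the special case of a diagonal covariance matrix: each squared difference is scaled by the inverse variance of its dimension, so high-variance dimensions count for less. (The general form uses the full inverse covariance matrix S⁻¹.) The numbers are illustrative.

```python
import math

def mahalanobis_diag(x, y, variances):
    """Mahalanobis distance when the covariance matrix is diagonal."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

# With unit variances it reduces to the Euclidean distance:
print(mahalanobis_diag([0, 0], [3, 4], [1.0, 1.0]))    # 5.0
# A high-variance first dimension contributes much less:
print(mahalanobis_diag([0, 0], [3, 4], [100.0, 1.0]))  # ~4.011
```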
- A balanced binary tree over data with an arbitrary number of dimensions is called a k-d tree, for k-dimensional tree.
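A minimal sketch of building a balanced k-d tree and running an exact nearest-neighbor query: at depth d the tree splits on dimension d mod k, and the query only descends into the far subtree when it could still contain a closer point. The sample points are illustrative.

```python
def build(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2  # median keeps the tree balanced
    return {"point": points[mid],
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def nearest(node, target, depth=0, best=None):
    if node is None:
        return best
    dist = lambda p: sum((a - b) ** 2 for a, b in zip(p, target))
    if best is None or dist(node["point"]) < dist(best):
        best = node["point"]
    axis = depth % len(target)
    diff = target[axis] - node["point"][axis]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    best = nearest(node[near], target, depth + 1, best)
    if diff ** 2 < dist(best):  # the far side could still hold a closer point
        best = nearest(node[far], target, depth + 1, best)
    return best

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(nearest(tree, (9, 2)))  # (8, 1)
```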
- The support vector machine or SVM framework is currently the most popular approach for “off-the-shelf” supervised learning: if you don’t have any specialized prior knowledge about a domain, then the SVM is an excellent method to try first.
- The margin is the width of the area bounded by dashed lines in the figure—twice the distance from the separator to the nearest example point.
- The data enter the expression only in the form of dot products of pairs of points.
- Optimal linear separators can be found efficiently in feature spaces with billions of (or, in some cases, infinitely many) dimensions.
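This works because of the kernel trick: since the data enter only as dot products, we can replace each dot product with a kernel function that equals a dot product in a much higher-dimensional feature space, without ever constructing that space. A minimal sketch with a polynomial kernel on 2-D inputs (the feature map phi here is the standard one for this kernel; the vectors are illustrative):

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y):
    # K(x, y) = (x . y)^2, computed entirely in the original 2-D space.
    return dot(x, y) ** 2

def phi(x):
    # Explicit feature map for this kernel: (x1^2, x2^2, sqrt(2)*x1*x2).
    return (x[0] ** 2, x[1] ** 2, math.sqrt(2) * x[0] * x[1])

x, y = (1.0, 2.0), (3.0, 4.0)
print(poly_kernel(x, y))    # 121.0
print(dot(phi(x), phi(y)))  # 121.0, the same value without forming phi explicitly
```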
Chapter 7
- The architecture we introduce is called a feed-forward network because the computation proceeds iteratively from one layer of units to the next.
- In later chapters we’ll introduce many other aspects of neural models, such as the recurrent neural network and the encoder-decoder model.
- At its heart, a neural unit is taking a weighted sum of its inputs, with one additional term in the sum called a bias term.
- The simplest activation function, and perhaps the most commonly used, is the rectified linear unit, also called the ReLU.
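The two bullets above can be sketched as one function: a unit computes a weighted sum of its inputs plus a bias, then applies an activation such as ReLU(z) = max(0, z). The weights below are illustrative.

```python
def relu(z):
    return max(0.0, z)

def unit(weights, bias, inputs, activation=relu):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # weighted sum + bias
    return activation(z)

print(unit([0.5, -1.0], 0.2, [2.0, 1.0]))  # relu(1.0 - 1.0 + 0.2) = 0.2
print(unit([0.5, -1.0], 0.2, [0.0, 1.0]))  # relu(-0.8) = 0.0
```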
- This line acts as a decision boundary in two-dimensional space in which the output 0 is assigned to all inputs lying on one side of the line, and the output 1 to all input points lying on the other side of the line.
- Let’s now walk through a slightly more formal presentation of the simplest kind of neural network, the feed-forward network.
- There is a convenient function for normalizing a vector of real values, by which we mean converting it to a vector that encodes a probability distribution.
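That function is softmax. A minimal sketch; subtracting the maximum before exponentiating is the standard trick for numerical stability and does not change the result.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print([round(p, 3) for p in probs])  # [0.09, 0.245, 0.665]
print(round(sum(probs), 6))          # 1.0
```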
- First, we’ll need a loss function that models the distance between the system output and the gold output, and it’s common to use the loss used for logistic regression, the cross-entropy loss.
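For a single example with a one-hot gold label, cross-entropy loss reduces to the negative log of the probability the model assigns to the correct class. A minimal sketch with an illustrative predicted distribution:

```python
import math

def cross_entropy(probs, gold_index):
    """Negative log probability of the gold class (one-hot cross-entropy)."""
    return -math.log(probs[gold_index])

probs = [0.7, 0.2, 0.1]  # model's predicted distribution (illustrative)
print(cross_entropy(probs, 0))  # ~0.357: confident and correct, low loss
print(cross_entropy(probs, 2))  # ~2.303: gold class got little mass, high loss
```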
- Second, to find the parameters that minimize this loss function, we’ll use the gradient descent optimization algorithm introduced in Chapter 5. There are some differences.
- Third, gradient descent requires knowing the gradient of the loss function, the vector that contains the partial derivative of the loss function with respect to each of the parameters.
- The cross-entropy loss used in neural networks is the same one we saw for logistic regression.
- Various forms of regularization are used to prevent overfitting. One of the most important is dropout: randomly dropping some units and their connections from the network during training.
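A minimal sketch of (inverted) dropout: during training each unit's output is zeroed with probability p, and the survivors are scaled by 1/(1-p) so the expected activation is unchanged; at test time the layer is left alone. Values are illustrative.

```python
import random

def dropout(activations, p=0.5, training=True):
    if not training:
        return list(activations)  # no dropout at test time
    # Zero each unit with probability p; rescale survivors by 1/(1-p).
    return [0.0 if random.random() < p else a / (1 - p) for a in activations]

random.seed(0)
h = [0.5, 1.2, -0.3, 0.8]
print(dropout(h, p=0.5))                  # some units zeroed, survivors doubled
print(dropout(h, p=0.5, training=False))  # unchanged at test time
```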
- The method of using another algorithm to learn the embedding representations we use for input words is called pretraining.