Week 6 - Reading

Chapter 18

18.1 - 18.6

  • This chapter (and most of current machine learning research) covers inputs that form a factored representation—a vector of attribute values—and outputs that can be either a continuous numerical value or a discrete value. 
  • We say that learning a (possibly incorrect) general function or rule from specific input–output pairs is called inductive learning. 
  • In unsupervised learning the agent learns patterns in the input even though no explicit feedback is supplied. 
  • In reinforcement learning the agent learns from a series of reinforcements—rewards or punishments. 
  • In supervised learning the agent observes some example input–output pairs and learns a function that maps from input to output.
  • In semi-supervised learning we are given a few labeled examples and must make what we can of a large collection of unlabeled examples. 
  • When the output y is one of a finite set of values (such as sunny, cloudy or rainy), the learning problem is called classification.
  • How do we choose from among multiple consistent hypotheses? One answer is to prefer the simplest hypothesis consistent with the data. 
  • In general, there is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better.
  • There is a tradeoff between the expressiveness of a hypothesis space and the complexity of finding a good hypothesis within that space. 
  • A decision tree represents a function that takes as input a vector of attribute values and returns a “decision”—a single output value.
  • To decide whether an apparent pattern in the data is real or just noise, we can use a statistical significance test.
  • You can think of the task of finding the best hypothesis as two tasks: model selection defines the hypothesis space and then optimization finds the best hypothesis within that space.
  • In machine learning it is traditional to express utilities by means of a loss function. 
  • Traditional methods in statistics and the early years of machine learning concentrated on small-scale learning, where the number of training examples ranged from dozens to the low thousands.
  • Any hypothesis that is seriously wrong will almost certainly be “found out” after a small number of examples, because it will make an incorrect prediction. Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: that is, it must be probably approximately correct.
  • Any learning algorithm that returns hypotheses that are probably approximately correct is called a PAC learning algorithm.
  • Many forms of learning involve adjusting weights to minimize a loss, so it helps to have a mental picture of what’s going on in weight space—the space defined by all possible settings of the weights.
  • Thus, it is common to use regularization on multivariate linear functions to avoid overfitting.
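
The regularization idea in the last bullet can be sketched concretely. Below is a minimal NumPy illustration of L2-regularized (ridge) linear regression, using the closed-form solution; the data values are made up for illustration:

```python
import numpy as np

# Ridge regression: minimize ||Xw - y||^2 + lam * ||w||^2.
# Closed-form solution: w = (X^T X + lam * I)^(-1) X^T y.
def ridge_fit(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Tiny example: noisy y ~ 2x. A larger lam shrinks the weight
# toward zero, trading training fit for a simpler hypothesis.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.1, 3.9, 6.0])
w_ols = ridge_fit(X, y, lam=0.0)     # ordinary least squares
w_reg = ridge_fit(X, y, lam=10.0)    # heavily regularized
```

Increasing lam penalizes large weights, which is one way of encoding the preference for simpler hypotheses mentioned above.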

18.8 - 18.9

  • This approach is called instance-based learning or memory-based learning.
  • A more complex metric known as the Mahalanobis distance takes into account the covariance between dimensions.
  • A balanced binary tree over data with an arbitrary number of dimensions is called a k-d tree, for k-dimensional tree.
  • The support vector machine or SVM framework is currently the most popular approach for “off-the-shelf” supervised learning: if you don’t have any specialized prior knowledge about a domain, then the SVM is an excellent method to try first. 
  • The margin is the width of the area bounded by dashed lines in the figure—twice the distance from the separator to the nearest example point.
  • The data enter the expression only in the form of dot products of pairs of points.
  • Optimal linear separators can be found efficiently in feature spaces with billions of (or, in some cases, infinitely many) dimensions.
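
The last two bullets describe the kernel trick: because the data enter only as dot products, a kernel function can compute the dot product in a high-dimensional feature space without ever constructing that space. A small NumPy check with the quadratic kernel (illustrative values, not from the reading):

```python
import numpy as np

# Quadratic kernel: K(x, z) = (x . z)^2, computed directly in 2-D.
def quad_kernel(x, z):
    return float(np.dot(x, z)) ** 2

# Explicit feature map phi for this kernel in 2-D:
# phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2], a 3-D feature space.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
k_direct = quad_kernel(x, z)               # computed in 2-D
k_feature = float(np.dot(phi(x), phi(z)))  # computed in 3-D
```

The two results agree, which is why SVMs can find separators in feature spaces with billions of dimensions while only ever computing low-dimensional dot products.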

Chapter 7

  • The architecture we introduce is called a feed-forward network because the computation proceeds iteratively from one layer of units to the next.
  • In later chapters we’ll introduce many other aspects of neural models, such as the recurrent neural network and the encoder-decoder model.
  • At its heart, a neural unit is taking a weighted sum of its inputs, with one additional term in the sum called a bias term.
  • The simplest activation function, and perhaps the most commonly used, is the rectified linear unit, also called the ReLU.
  • This line acts as a decision boundary in two-dimensional space in which the output 0 is assigned to all inputs lying on one side of the line, and the output 1 to all input points lying on the other side of the line.
  • Let’s now walk through a slightly more formal presentation of the simplest kind of neural network, the feed-forward network.
  • There is a convenient function for normalizing a vector of real values, by which we mean converting it to a vector that encodes a probability distribution.
  • First, we’ll need a loss function that models the distance between the system output and the gold output, and it’s common to use the loss used for logistic regression, the cross-entropy loss. 
  • Second, to find the parameters that minimize this loss function, we’ll use the gradient descent optimization algorithm introduced in Chapter 5, though with some differences: in particular, the loss surface of a multi-layer network is non-convex, so gradient descent may find a local rather than a global minimum.
  • Third, gradient descent requires knowing the gradient of the loss function, the vector that contains the partial derivative of the loss function with respect to each of the parameters.
  • The cross-entropy loss used in neural networks is the same one we saw for logistic regression.
  • Various forms of regularization are used to prevent overfitting. One of the most important is dropout: randomly dropping some units and their connections from the network during training.
  • The method of using another algorithm to learn the embedding representations we use for input words is called pretraining.
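
The building blocks named in these bullets (the weighted sum with a bias, ReLU, softmax normalization, and the cross-entropy loss) can be sketched in a few lines of NumPy. This is a toy forward pass with made-up weights, not the book's implementation:

```python
import numpy as np

def relu(x):
    # Rectified linear unit: max(0, x) elementwise.
    return np.maximum(0, x)

def softmax(z):
    # Normalize a real vector into a probability distribution.
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(probs, gold_index):
    # Negative log probability assigned to the gold (correct) class.
    return -np.log(probs[gold_index])

# One hidden layer: h = ReLU(Wx + b), output = softmax(Uh + c).
x = np.array([1.0, 0.5])
W = np.array([[1.0, -1.0], [0.5, 2.0]]); b = np.array([0.0, -0.5])
U = np.eye(2); c = np.zeros(2)
h = relu(W @ x + b)             # weighted sums plus bias, then activation
probs = softmax(U @ h + c)      # a probability distribution over classes
loss = cross_entropy(probs, gold_index=0)
```

Training would then use gradient descent on this loss, with backpropagation supplying the partial derivatives with respect to each weight.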
