Week 6 - Reading

Chapter 18

18.1 - 18.6

  • This chapter (and most of current machine learning research) covers inputs that form a factored representation—a vector of attribute values—and outputs that can be either a continuous numerical value or a discrete value. 
  • We say that learning a (possibly incorrect) general function or rule from specific input–output pairs is called inductive learning. 
  • In unsupervised learning the agent learns patterns in the input even though no explicit feedback is supplied. 
  • In reinforcement learning the agent learns from a series of reinforcements—rewards or punishments. 
  • In supervised learning the agent observes some example input–output pairs and learns a function that maps from input to output.
  • In semi-supervised learning we are given a few labeled examples and must make what we can of a large collection of unlabeled examples. 
  • When the output y is one of a finite set of values (such as sunny, cloudy or rainy), the learning problem is called classification.
  • How do we choose from among multiple consistent hypotheses? One answer is to prefer the simplest hypothesis consistent with the data. 
  • In general, there is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better.
  • There is a tradeoff between the expressiveness of a hypothesis space and the complexity of finding a good hypothesis within that space. 
  • A decision tree represents a function that takes as input a vector of attribute values and returns a “decision”—a single output value.
  • To decide whether an apparent pattern in the data is real or just noise, we can use a statistical significance test.
  • You can think of the task of finding the best hypothesis as two tasks: model selection defines the hypothesis space and then optimization finds the best hypothesis within that space.
  • In machine learning it is traditional to express utilities by means of a loss function. 
  • Traditional methods in statistics and the early years of machine learning concentrated on small-scale learning, where the number of training examples ranged from dozens to the low thousands.
  • Any hypothesis that is seriously wrong will almost certainly be “found out” after a small number of examples, because it will make an incorrect prediction. Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong: that is, it must be probably approximately correct.
  • Any learning algorithm that returns hypotheses that are probably approximately correct is called a PAC learning algorithm.
  • Many forms of learning involve adjusting weights to minimize a loss, so it helps to have a mental picture of what’s going on in weight space—the space defined by all possible settings of the weights.
  • Thus, it is common to use regularization on multivariate linear functions to avoid overfitting.
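
The regularization idea in the last bullet can be sketched concretely. Below is a minimal NumPy illustration of L2-regularized (ridge) linear regression, using the closed-form solution; the data values are made up for illustration:

```python
import numpy as np

# Ridge regression: minimize ||Xw - y||^2 + lam * ||w||^2.
# Closed-form solution: w = (X^T X + lam * I)^(-1) X^T y.
def ridge_fit(X, y, lam=1.0):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Tiny example: noisy y ~ 2x. A larger lam shrinks the weight
# toward zero, trading training fit for a simpler hypothesis.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.1, 3.9, 6.0])
w_ols = ridge_fit(X, y, lam=0.0)     # ordinary least squares
w_reg = ridge_fit(X, y, lam=10.0)    # heavily regularized
```

Increasing lam penalizes large weights, which is one way of encoding the preference for simpler hypotheses mentioned above.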

18.8 - 18.9

  • This approach is called instance-based learning or memory-based learning.
  • A more complex metric known as the Mahalanobis distance takes into account the covariance between dimensions.
  • A balanced binary tree over data with an arbitrary number of dimensions is called a k-d tree, for k-dimensional tree.
  • The support vector machine or SVM framework is currently the most popular approach for “off-the-shelf” supervised learning: if you don’t have any specialized prior knowledge about a domain, then the SVM is an excellent method to try first. 
  • The margin is the width of the area bounded by dashed lines in the figure—twice the distance from the separator to the nearest example point.
  • The data enter the expression only in the form of dot products of pairs of points.
  • Optimal linear separators can be found efficiently in feature spaces with billions of (or, in some cases, infinitely many) dimensions.
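
The last two bullets describe the kernel trick: because the data enter only as dot products, a kernel function can compute the dot product in a high-dimensional feature space without ever constructing that space. A small NumPy check with the quadratic kernel (illustrative values, not from the reading):

```python
import numpy as np

# Quadratic kernel: K(x, z) = (x . z)^2, computed directly in 2-D.
def quad_kernel(x, z):
    return float(np.dot(x, z)) ** 2

# Explicit feature map phi for this kernel in 2-D:
# phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2], a 3-D feature space.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
k_direct = quad_kernel(x, z)               # computed in 2-D
k_feature = float(np.dot(phi(x), phi(z)))  # computed in 3-D
```

The two results agree, which is why SVMs can find separators in feature spaces with billions of dimensions while only ever computing low-dimensional dot products.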

Chapter 7

  • The architecture we introduce is called a feed-forward network because the computation proceeds iteratively from one layer of units to the next.
  • In later chapters we’ll introduce many other aspects of neural models, such as the recurrent neural network and the encoder-decoder model.
  • At its heart, a neural unit is taking a weighted sum of its inputs, with one additional term in the sum called a bias term.
  • The simplest activation function, and perhaps the most commonly used, is the rectified linear unit, also called the ReLU.
  • This line acts as a decision boundary in two-dimensional space in which the output 0 is assigned to all inputs lying on one side of the line, and the output 1 to all input points lying on the other side of the line.
  • Let’s now walk through a slightly more formal presentation of the simplest kind of neural network, the feed-forward network.
  • There is a convenient function for normalizing a vector of real values, by which we mean converting it to a vector that encodes a probability distribution.
  • First, we’ll need a loss function that models the distance between the system output and the gold output, and it’s common to use the loss used for logistic regression, the cross-entropy loss. 
  • Second, to find the parameters that minimize this loss function, we’ll use the gradient descent optimization algorithm introduced in Chapter 5, though with some differences: in particular, the loss surface of a multi-layer network is non-convex, so gradient descent may find a local rather than a global minimum.
  • Third, gradient descent requires knowing the gradient of the loss function, the vector that contains the partial derivative of the loss function with respect to each of the parameters.
  • The cross-entropy loss used in neural networks is the same one we saw for logistic regression.
  • Various forms of regularization are used to prevent overfitting. One of the most important is dropout: randomly dropping some units and their connections from the network during training.
  • The method of using another algorithm to learn the embedding representations we use for input words is called pretraining.
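
The building blocks named in these bullets (the weighted sum with a bias, ReLU, softmax normalization, and the cross-entropy loss) can be sketched in a few lines of NumPy. This is a toy forward pass with made-up weights, not the book's implementation:

```python
import numpy as np

def relu(x):
    # Rectified linear unit: max(0, x) elementwise.
    return np.maximum(0, x)

def softmax(z):
    # Normalize a real vector into a probability distribution.
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(probs, gold_index):
    # Negative log probability assigned to the gold (correct) class.
    return -np.log(probs[gold_index])

# One hidden layer: h = ReLU(Wx + b), output = softmax(Uh + c).
x = np.array([1.0, 0.5])
W = np.array([[1.0, -1.0], [0.5, 2.0]]); b = np.array([0.0, -0.5])
U = np.eye(2); c = np.zeros(2)
h = relu(W @ x + b)             # weighted sums plus bias, then activation
probs = softmax(U @ h + c)      # a probability distribution over classes
loss = cross_entropy(probs, gold_index=0)
```

Training would then use gradient descent on this loss, with backpropagation supplying the partial derivatives with respect to each weight.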
