Week 10 - Reading
Chapter 2
- We’ll begin with the most important tool for describing text patterns: the regular expression.
- We’ll then turn to a set of tasks collectively called text normalization, in which regular expressions play an important part.
- One of the unsung successes in standardization in computer science has been the regular expression (RE), a language for specifying text search strings.
- As an example, consider the language consisting of strings with a b, followed by at least two a's, followed by an exclamation point. The set of operators that allows us to say things like "some number of a's" is based on the asterisk or *, commonly called the Kleene star.
- Anchors are special characters that anchor regular expressions to particular places in a string.
- Since we can’t use the square brackets to search for “cat or dog” (why can’t we say /[catdog]/?), we need a new operator, the disjunction operator, also called the pipe symbol |. The pattern /cat|dog/ matches either the string cat or the string dog.
- This idea that one operator may take precedence over another, requiring us to sometimes use parentheses to specify what we mean, is formalized by the operator precedence hierarchy for regular expressions.
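A minimal Python sketch of these operators using the standard `re` module; the example strings and patterns are just illustrations:

```python
import re

# Kleene operators: /baa+!/ matches a b, then at least two a's, then "!"
# ("a" followed by "a+" guarantees two or more a's).
assert re.search(r"baa+!", "baaaa!")
assert re.search(r"baa+!", "ba!") is None

# Anchors: ^ pins the match to the start of the string, $ to the end.
assert re.search(r"^The dog\.$", "The dog.")

# Disjunction: /cat|dog/ matches either string; /[catdog]/ would only
# match a single character from the set {c, a, t, d, o, g}.
assert re.search(r"cat|dog", "hot dog")

# Precedence: | binds more loosely than concatenation, so /gupp(y|ies)/
# needs parentheses to match "guppy" or "guppies" rather than
# "guppy" or "ies".
assert re.search(r"gupp(y|ies)", "guppies")
```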
- The process we just went through was based on fixing two kinds of errors: false positives, strings that we incorrectly matched like other or there, and false negatives, strings that we incorrectly missed, like The.
- An important use of regular expressions is in substitutions. For example, the substitution operator s/regexp1/pattern/ used in Python and in Unix commands like vim or sed allows a string characterized by a regular expression to be replaced by another string.
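In Python the equivalent of the s/regexp1/pattern/ operator is `re.sub`; a small sketch (example strings are my own):

```python
import re

# s/colour/color/ in sed or vim corresponds to re.sub in Python.
text = "the colour of the sky"
print(re.sub(r"colour", "color", text))   # the color of the sky

# Capture groups let the replacement refer back to what was matched:
# here \1 reuses the number captured by ([0-9]+).
print(re.sub(r"([0-9]+) dollars", r"\1 bucks", "50 dollars"))  # 50 bucks
```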
- Finally, there will be times when we need to predict the future: look ahead in the text to see if some pattern matches, but not advance the match cursor, so that we can then deal with the pattern if it occurs.
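In Python-style regular expressions this is written with the lookahead syntax `(?=...)` and its negative form `(?!...)`; a short sketch with made-up examples:

```python
import re

# (?=...) is zero-width: it tests that the pattern follows, but the
# match cursor does not advance past it.
# Here we match a word only if it is followed by a comma, without
# including the comma in the match.
m = re.search(r"\w+(?=,)", "However, we disagree.")
print(m.group())   # However

# Negative lookahead (?!...) succeeds only if the pattern does NOT follow:
# match lines that do not start with "Volcano".
print(bool(re.match(r"(?!Volcano)\w+", "People of the island")))  # True
print(bool(re.match(r"(?!Volcano)\w+", "Volcano eruptions")))     # False
```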
- This utterance has two kinds of disfluencies. The broken-off word main- is called a fragment. Words like uh and um are called fillers or filled pauses.
- Case folding is another kind of normalization. For tasks like speech recognition and information retrieval, everything is mapped to lower case.
- Lemmatization is the task of determining that two words have the same root, despite their surface differences.
- Stemming or lemmatizing has another side-benefit. By treating two similar words identically, these normalization methods help deal with the problem of unknown words.
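A small sketch of these normalization steps. The NLTK classes used below (`PorterStemmer`, `WordNetLemmatizer`) are one common choice, not something prescribed by the chapter, and assume the NLTK WordNet data has been downloaded:

```python
# Case folding is just lowercasing; stemming and lemmatization need a
# linguistic resource. NLTK is used here purely as an illustration
# (assumes: pip install nltk, plus nltk.download("wordnet")).
from nltk.stem import PorterStemmer, WordNetLemmatizer

tokens = ["Books", "Flies", "Running", "Shelves"]

# Case folding: map everything to lower case.
lowered = [t.lower() for t in tokens]

# Stemming: chop affixes with rule-based heuristics (the Porter algorithm),
# so the output need not be a real word.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in lowered])       # e.g. ['book', 'fli', 'run', 'shelv']

# Lemmatization: map each word to its dictionary form; the part of speech
# matters ("running" only reduces to "run" when treated as a verb).
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in lowered])  # e.g. ['book', 'fly', 'running', 'shelf']
print(lemmatizer.lemmatize("running", pos="v"))    # run
```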
- Sentence segmentation is another important step in text processing. The most useful cues for segmenting a text into sentences are punctuation, like periods, question marks, and exclamation points.
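A deliberately naive regex-based segmenter along these lines; real systems also need rules for abbreviations, decimal points, and so on:

```python
import re

# Split after sentence-final punctuation (., ?, !) followed by whitespace.
# The lookbehind keeps the punctuation attached to its sentence.
def naive_sentence_split(text):
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_sentence_split("Dr. Smith arrived. Did he stay? Yes!"))
# ['Dr.', 'Smith arrived.', 'Did he stay?', 'Yes!']
# <- the wrong split after "Dr." shows why abbreviations need special handling
```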
- Edit distance gives us a way to quantify both of these intuitions about string similarity. More formally, the minimum edit distance between two strings is defined as the minimum number of editing operations (operations like insertion, deletion, substitution) needed to transform one string into another.
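A sketch of the standard dynamic-programming solution; the printed values follow the usual intention/execution example, with substitution cost 2 in the Levenshtein variant and 1 otherwise:

```python
def min_edit_distance(source, target, sub_cost=2):
    """Minimum edit distance with insert/delete cost 1 and a configurable
    substitution cost (the Levenshtein variant uses 2)."""
    n, m = len(source), len(target)
    # D[i][j] = distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # delete everything from source
    for j in range(1, m + 1):
        D[0][j] = j                      # insert everything into target
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution (or copy)
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # 8
print(min_edit_distance("intention", "execution", sub_cost=1))  # 5
```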
Chapter 4
- We focus on one common text categorization task, sentiment analysis, the extraction of sentiment, the positive or negative orientation that a writer expresses toward some object.
- Spam detection is another important commercial application, the binary classification task of assigning an email to one of the two classes spam or not-spam. Many lexical and other features can be used to perform this classification.
- Most cases of classification in language processing are instead done via supervised machine learning, and this will be the subject of the remainder of this chapter.
- The multinomial naive Bayes classifier is so called because it is a Bayesian classifier that makes a simplifying (naive) assumption about how the features interact.
- This idea of Bayesian inference has been known since the work of Bayes (1763), and was first applied to text classification by Mosteller and Wallace (1964).
- The solution for such unknown words is to ignore them: remove them from the test document and not include any probability for them at all.
- This variant is called binary multinomial naive Bayes or binary NB.
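A compact, illustrative implementation of this classifier (not the book's reference code): multinomial naive Bayes with add-1 smoothing, unknown test words ignored as described above, and a `binary` flag that clips per-document counts to 1 for the binary NB variant. The tiny training set is made up:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, binary=False):
    """docs: list of (list_of_tokens, label). Returns log-priors,
    per-class log-likelihoods, and the vocabulary."""
    vocab = set()
    class_docs = defaultdict(int)
    class_words = defaultdict(Counter)
    for tokens, label in docs:
        class_docs[label] += 1
        if binary:                      # binary NB: at most one count per doc
            tokens = set(tokens)
        class_words[label].update(tokens)
        vocab.update(tokens)
    n_docs = len(docs)
    log_prior = {c: math.log(class_docs[c] / n_docs) for c in class_docs}
    log_like = {}
    for c, counts in class_words.items():
        total = sum(counts.values())
        # add-1 (Laplace) smoothing over the whole vocabulary
        log_like[c] = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                       for w in vocab}
    return log_prior, log_like, vocab

def classify_nb(tokens, log_prior, log_like, vocab):
    scores = {}
    for c in log_prior:
        s = log_prior[c]
        for w in tokens:
            if w in vocab:              # unknown words are simply ignored
                s += log_like[c][w]
        scores[c] = s
    return max(scores, key=scores.get)

docs = [("just plain boring".split(), "neg"),
        ("very powerful and fun".split(), "pos"),
        ("the most fun film of the summer".split(), "pos")]
log_prior, log_like, vocab = train_nb(docs)
print(classify_nb("predictable with no fun".split(), log_prior, log_like, vocab))  # pos
```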
- We also need to know whether the email is actually spam or not, i.e. the human-defined labels for each document that we are trying to match. We will refer to these human labels as the gold labels.
- To evaluate any system for detecting things, we start by building a contingency table.
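From the counts in such a table (true/false positives and negatives) we can compute precision, recall, and F-measure; a small sketch with hypothetical spam-detection counts:

```python
def precision_recall_f1(tp, fp, fn, beta=1.0):
    """Compute precision, recall, and F-beta from contingency-table counts:
    tp = true positives, fp = false positives, fn = false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# Hypothetical counts: 70 spam mails caught, 10 ham mails wrongly flagged,
# 30 spam mails missed.
print(precision_recall_f1(tp=70, fp=10, fn=30))
# (0.875, 0.7, 0.777...)
```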
- There are two kinds of multi-class classification tasks. In any-of or multi-label classification, each document or item can be assigned more than one label.
- More common in language processing is one-of or multinomial classification, in which the classes are mutually exclusive and each document or item appears in exactly one class.
- To do so, we’d need to reject the null hypothesis that A isn’t really better than B and that the observed difference δ(x) occurred purely by chance.
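One sampling-based way to estimate that probability is the paired bootstrap test; the sketch below, with made-up per-item correctness flags, follows the usual recipe of resampling test items with replacement:

```python
import random

def paired_bootstrap(correct_a, correct_b, n_samples=10_000, seed=0):
    """Paired bootstrap test: correct_a[i] and correct_b[i] are 1/0 flags
    saying whether systems A and B got test item i right. Returns an
    estimate of the p-value for the observed accuracy gap delta."""
    rng = random.Random(seed)
    n = len(correct_a)
    delta = (sum(correct_a) - sum(correct_b)) / n     # observed accuracy gap
    exceed = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]    # resample items with replacement
        d = sum(correct_a[i] - correct_b[i] for i in idx) / n
        if d >= 2 * delta:                            # surprise relative to H0
            exceed += 1
    return exceed / n_samples

# Toy example: A correct on 8/10 items, B on 6/10 (made-up flags).
a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
b = [1, 0, 1, 0, 1, 0, 0, 1, 1, 1]
print(paired_bootstrap(a, b))   # estimated p-value; reject H0 if below, e.g., 0.05
```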
- The regularization technique introduced in the previous section is one way to avoid overfitting; another is feature selection. Feature selection is a method of removing features that are unlikely to generalize well.
- A very common metric is information gain. Information gain tells us how many bits of information the presence of the word gives us for guessing the class; a sketch of the computation follows below.
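A hedged sketch of that computation, treating information gain as the mutual information between the word's presence and the class; the counts and variable names are my own illustration:

```python
import math

def information_gain(n_cw, n_c, n_w, n_docs):
    """Information gain (in bits) of a word for a set of classes.
    n_cw[c]: # docs of class c containing the word
    n_c[c]:  # docs of class c
    n_w:     # docs containing the word
    n_docs:  total # docs"""
    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    p_c = [n_c[c] / n_docs for c in n_c]                          # P(c)
    p_w = n_w / n_docs                                            # P(word present)
    p_c_given_w = [n_cw[c] / n_w for c in n_c]                    # P(c | word)
    p_c_given_notw = [(n_c[c] - n_cw[c]) / (n_docs - n_w) for c in n_c]

    # bits of uncertainty about the class removed by observing the word
    return entropy(p_c) - p_w * entropy(p_c_given_w) \
                        - (1 - p_w) * entropy(p_c_given_notw)

# Toy counts: 100 docs, 50 per class; "great" occurs in 40 pos and 5 neg docs.
n_c = {"pos": 50, "neg": 50}
n_cw = {"pos": 40, "neg": 5}
print(information_gain(n_cw, n_c, n_w=45, n_docs=100))   # ~0.4 bits
```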