CS 491 Natural Language Processing Module 2: Basic Text Processing
Regular Expressions
• A formal language for specifying text strings
• How can we search for any of these?
• woodchuck
• woodchucks
• Woodchuck
• Woodchucks
Regular Expressions: Disjunctions
• Letters inside square brackets []
Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any digit
• Ranges [A-Z]
Pattern    Matches                 Example
[A-Z]      An upper case letter    Drenched Blossoms
[a-z]      A lower case letter     my books
[0-9]      A single digit          Chapter 1: Down the Rabbit Hole
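A minimal sketch of bracket disjunctions and ranges, assuming Python's standard `re` module (the slides do not name a regex engine). The sample strings come from the tables above; the variable names are illustrative only.

```python
import re

text = "Chapter 1: Down the Rabbit Hole"

# [wW] accepts either case for the first letter.
woodchucks = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck?")

# re.search returns only the first match in the string.
first_upper = re.search(r"[A-Z]", text).group()
first_lower = re.search(r"[a-z]", text).group()
first_digit = re.search(r"[0-9]", text).group()

print(woodchucks)                             # ['Woodchuck', 'woodchuck']
print(first_upper, first_lower, first_digit)  # C h 1
```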
Regular Expressions: Negation in Disjunction
• Negations [^Ss]
• Caret means negation only when first in []
Pattern    Matches                     Example
[^A-Z]     Not an upper case letter    Oyfn pripetchik
[^Ss]      Neither ‘S’ nor ‘s’         I have no exquisite reason
[^e^]      Neither e nor ^             Look here
a^b        The pattern a caret b       Look up a^b now
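The negation rows can be checked with a short sketch, again assuming Python's `re` module. One engine-specific caveat: in Python an unescaped `^` outside brackets is an anchor, so the literal "a caret b" pattern is written `a\^b` here.

```python
import re

# Caret first inside brackets: negation.
not_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()          # 'y'
not_s = re.search(r"[^Ss]", "I have no exquisite reason").group()    # 'I'

# The second caret is not first, so it is a literal member of the set.
not_e_or_caret = re.search(r"[^e^]", "Look here").group()            # 'L'

# Outside brackets, Python treats a bare ^ as an anchor, so escape it.
literal = re.search(r"a\^b", "Look up a^b now").group()              # 'a^b'
unescaped = re.search(r"a^b", "Look up a^b now")                     # None
```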
Regular Expressions: More Disjunction
• Woodchuck is another name for groundhog!
• The pipe | for disjunction
Pattern                      Matches
groundhog|woodchuck
yours|mine                   yours, mine
a|b|c                        = [abc]
[gG]roundhog|[Ww]oodchuck
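A small sketch of the pipe operator, assuming Python's `re` module. The sample sentence and the word "cabbage" are made up for illustration.

```python
import re

text = "A Groundhog is a woodchuck; yours or mine?"

# Pipe separates whole alternatives; brackets handle the case variation.
animals = re.findall(r"[gG]roundhog|[Ww]oodchuck", text)
pronouns = re.findall(r"yours|mine", text)

# For single characters, a|b|c and [abc] find the same matches.
same = re.findall(r"a|b|c", "cabbage") == re.findall(r"[abc]", "cabbage")

print(animals)   # ['Groundhog', 'woodchuck']
print(pronouns)  # ['yours', 'mine']
print(same)      # True
```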
Regular Expressions: ? * + .
Pattern    Matches                       Example
colou?r    Optional previous char        color colour
oo*h!      0 or more of previous char    oh! ooh! oooh! ooooh!
o+h!       1 or more of previous char    oh! ooh! oooh! ooooh!
baa+                                     baa baaa baaaa baaaaa
beg.n                                    begin begun beg3n
The * and + operators are called Kleene * and Kleene +, after Stephen C Kleene.
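The operators in this table can be tried out with a sketch like the following, assuming Python's `re` module. `full_matches` is a hypothetical helper that keeps only the candidates matched in their entirety.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

optional = full_matches(r"colou?r", ["color", "colour", "colouur"])
star = full_matches(r"oo*h!", ["h!", "oh!", "ooh!", "oooh!"])
plus = full_matches(r"o+h!", ["h!", "oh!", "ooh!", "oooh!"])
sheep = full_matches(r"baa+", ["ba", "baa", "baaa"])
dot = full_matches(r"beg.n", ["begin", "begun", "beg3n", "begn"])

print(optional)  # ['color', 'colour']
print(star)      # ['oh!', 'ooh!', 'oooh!']
print(plus)      # ['oh!', 'ooh!', 'oooh!']
print(sheep)     # ['baa', 'baaa']
print(dot)       # ['begin', 'begun', 'beg3n']
```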
Examples
• What do the following match?
• [ab]*
• [NLP]*
• [NLPNLP]
• NLP.*NLP
• [0-9][0-9]*
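One way to explore these exercises is to test each pattern against a few candidate strings. This is a sketch assuming Python's `re` module; the candidate strings are invented for illustration and `results` is a hypothetical name.

```python
import re

patterns = [r"[ab]*", r"[NLP]*", r"[NLPNLP]", r"NLP.*NLP", r"[0-9][0-9]*"]
candidates = ["", "abba", "PLN", "N", "NLP is NLP", "2024", "x"]

# For each pattern, keep the candidates it matches in full.
results = {
    p: [s for s in candidates if re.fullmatch(p, s)]
    for p in patterns
}

for p, matched in results.items():
    print(p, matched)
```

Note that repeating a character inside brackets changes nothing: `[NLPNLP]` still matches exactly one character from the set N, L, P.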
Regular Expressions: Anchors ^ $
Pattern Matches
^[A-Z] Palo Alto
^[^A-Za-z] 1 “Hello”
\.$ The end.
.$ The end? The end!
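A short sketch of the two anchors, assuming Python's `re` module, where `^` and `$` anchor to the start and end of the string unless `re.MULTILINE` is set. Straight quotes replace the curly quotes of the table for convenience.

```python
import re

starts_upper = re.search(r"^[A-Z]", "Palo Alto").group()       # 'P'
starts_digit = re.search(r"^[^A-Za-z]", "1").group()           # '1'
starts_quote = re.search(r"^[^A-Za-z]", '"Hello"').group()     # '"'

# \. is a literal period; a bare . is any character.
ends_period = re.search(r"\.$", "The end.").group()            # '.'
no_period = re.search(r"\.$", "The end?")                      # None
ends_any = re.search(r".$", "The end?").group()                # '?'
```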
Example
• Find me all instances of the word “the” in a text.
the                             Misses capitalized examples
[tT]he                          Incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]
Returns instances with no alphabetic letters on either side of “the”
Misses instances where “the” starts the line, since no character precedes it
(^|[^a-zA-Z])[tT]he[^a-zA-Z]
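The four refinement steps can be replayed on a sample sentence. This is a sketch assuming Python's `re` module; the sentence is made up, and `re.finditer` is used for the last pattern because `re.findall` would return only the contents of the parenthesized group.

```python
import re

text = "The other day the theology student saw the cat."

v1 = re.findall(r"the", text)        # misses "The"
v2 = re.findall(r"[tT]he", text)     # also hits "other" and "theology"
v3 = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)  # misses line-initial "The"
v4 = [m.group(0)
      for m in re.finditer(r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]", text)]

print(len(v1))  # 4
print(len(v2))  # 5
print(v3)       # [' the ', ' the ']
print(v4)       # ['The ', ' the ', ' the ']
```

The matches in `v3` and `v4` include the neighboring non-alphabetic characters, since those are part of the pattern.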
Regular Expressions: Advanced Operators
• . = match a single character of any value, except end of
line.
• * (Asterisk) = match zero or more of the preceding character
or expression.
• {x,y} = match x to y occurrences of the preceding.
• {x} = match exactly x occurrences of the preceding.
• {x,} = match x or more occurrences of the preceding
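A sketch of the counters and of the dot's end-of-line exception, assuming Python's `re` module. The test strings are invented for illustration.

```python
import re

# {x}: exactly x occurrences.
exactly_3 = [s for s in ["12", "123", "1234"]
             if re.fullmatch(r"[0-9]{3}", s)]

# {x,y}: between x and y occurrences.
two_to_four = [s for s in ["1", "12", "1234", "12345"]
               if re.fullmatch(r"[0-9]{2,4}", s)]

# {x,}: x or more occurrences.
three_plus = [s for s in ["aa", "aaa", "aaaaa"]
              if re.fullmatch(r"a{3,}", s)]

# By default the dot does not match a newline.
dot_newline = re.search(r"a.b", "a\nb")

print(exactly_3)    # ['123']
print(two_to_four)  # ['12', '1234']
print(three_plus)   # ['aaa', 'aaaaa']
print(dot_newline)  # None
```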
Regular Expressions: Advanced Operators
• ^ (Caret) = match expression at the start of a line, as in ^A.
• $ (Dollar) = match expression at the end of a line, as in A$.
• \ (Back Slash) = turn off the special meaning of the next
character, as in \^.
• [ ] (Brackets) = match any one of the enclosed characters, as
in [aeiou]. Use Hyphen "-" for a range, as in [0-9].
• [^ ] = match any one character except those enclosed in
[ ], as in [^0-9].
Examples
• What do the following match?
• ^Language.$
• ^Language\.$
• Language\.$
• Language.
• [Language.]
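These five patterns can be compared by running each against the same set of lines. This is a sketch assuming Python's `re` module; the lines are invented and `hits` is a hypothetical name.

```python
import re

lines = ["Language.", "Language!", "A Language.", "Languages", "Language. Yes"]
patterns = [r"^Language.$", r"^Language\.$", r"Language\.$",
            r"Language.", r"[Language.]"]

# re.search finds the pattern anywhere in the line unless anchors forbid it.
hits = {p: [s for s in lines if re.search(p, s)] for p in patterns}

for p, matched in hits.items():
    print(p, matched)
```

The unescaped dot accepts any final character, so `^Language.$` also accepts "Language!" and "Languages". Inside brackets the dot is literal, and `[Language.]` matches any single one of the listed characters.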
Anchors \b and \B
• \b matches a word boundary
• Word: sequence of digits, underscores, letters
• \B matches a non-boundary
• \b99\b matches “was run out on 99” and “the
microwave costs $99” but not “was run out on
199”
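The three sentences from this slide can be checked directly. This is a sketch assuming Python's `re` module, where `\b` uses the same notion of a word (letters, digits, underscores).

```python
import re

pattern = re.compile(r"\b99\b")

plain = bool(pattern.search("was run out on 99"))          # True
# "$" is not a word character, so there is a boundary before 99.
price = bool(pattern.search("the microwave costs $99"))    # True
# "1" is a word character, so there is no boundary inside 199.
inside = bool(pattern.search("was run out on 199"))        # False

# \B requires a non-boundary, so it finds the 99 inside 199.
non_boundary = re.search(r"\B99\b", "was run out on 199").group()  # '99'
```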
Disjunction, Grouping, Precedence
• Disjunction operator “|”
• Search for cat or dog: cat|dog
• What would the following match?
• [catdog]
• guppy|ies
• gupp(y|ies)
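The difference between these patterns shows up when each is required to match a whole word. This is a sketch assuming Python's `re` module; the word list is illustrative.

```python
import re

words = ["guppy", "guppies", "ies", "cat", "c"]

# Without parentheses the alternatives are "guppy" and "ies".
ungrouped = [w for w in words if re.fullmatch(r"guppy|ies", w)]

# Parentheses limit the disjunction to the suffix.
grouped = [w for w in words if re.fullmatch(r"gupp(y|ies)", w)]

# Brackets match one character from the set, not the words cat or dog.
bracket = [w for w in words if re.fullmatch(r"[catdog]", w)]

print(ungrouped)  # ['guppy', 'ies']
print(grouped)    # ['guppy', 'guppies']
print(bracket)    # ['c']
```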
Disjunction, Grouping, Precedence
• Kleene * applies to a single character, not a sequence, unless the sequence is grouped in parentheses
• What do the following match?
• Column[0-9]*
• (Column[0-9])*
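A sketch contrasting the two Column patterns, assuming Python's `re` module. The candidate strings are invented for illustration.

```python
import re

candidates = ["", "Column", "Column1", "Column123", "Column1Column2"]

# The star applies only to [0-9]: "Column" followed by any number of digits.
digits_repeat = [s for s in candidates
                 if re.fullmatch(r"Column[0-9]*", s)]

# The star applies to the whole group: zero or more "Column<digit>" units.
group_repeat = [s for s in candidates
                if re.fullmatch(r"(Column[0-9])*", s)]

print(digits_repeat)  # ['Column', 'Column1', 'Column123']
print(group_repeat)   # ['', 'Column1', 'Column1Column2']
```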
Disjunction, Grouping, Precedence
• RE operator precedence (highest to lowest)
• Parentheses ()
• Counters * + ? { }
• Sequences and anchors ^, $
• Disjunction |
Errors
• The process we just went through was based on fixing
two kinds of errors
• Matching strings that we should not have matched (there,
then, other)
• False positives (Type I)
• Not matching things that we should have matched (The)
• False negatives (Type II)
Precision and Recall
Errors cont.
• In NLP we are always dealing with these kinds of errors.
• Reducing the error rate for an application often
involves two antagonistic efforts:
• Increasing accuracy or precision (minimizing false positives)
• Increasing coverage or recall (minimizing false negatives).
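The trade-off can be made concrete with the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). This is a sketch assuming Python's `re` module; the token list, the gold labels, and the helper `precision_recall` are all hypothetical.

```python
import re

def precision_recall(predicted, gold):
    """Compute (precision, recall) from sets of predicted and gold items."""
    tp = len(predicted & gold)   # true positives
    fp = len(predicted - gold)   # false positives (Type I)
    fn = len(gold - predicted)   # false negatives (Type II)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

tokens = ["the", "The", "other", "theology", "cat", "then"]
gold = {"the", "The"}  # assumed gold standard: tokens that are the word "the"

# A loose pattern finds every true instance but many false ones.
loose = {t for t in tokens if re.search(r"[tT]he", t)}
# A strict pattern returns no false instances but misses "The".
strict = {t for t in tokens if re.fullmatch(r"the", t)}

loose_pr = precision_recall(loose, gold)
strict_pr = precision_recall(strict, gold)

print(loose_pr)   # (0.4, 1.0)
print(strict_pr)  # (1.0, 0.5)
```

The loose pattern has perfect recall and poor precision; the strict pattern has the reverse.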
Summary
• Regular expressions play a surprisingly large role
• Sophisticated sequences of regular expressions are often the first model
for any text processing task
• For many hard tasks, we use machine learning classifiers
• But regular expressions are used as features in the classifiers
• Can be very useful in capturing generalizations