CS 491 Natural Language Processing Module 2: Basic Text Processing

Regular expressions are a formal language for specifying text strings. They allow searching for patterns using symbols like brackets, pipes, question marks and anchors. Key concepts covered include disjunctions, grouping, precedence, and reducing errors by increasing precision to minimize false positives and recall to minimize false negatives. Regular expressions are widely used in natural language processing as the first model for text processing or to provide features to machine learning classifiers.

Uploaded by

Sakshi Nijhawan

Regular Expressions
Regular expressions
• A formal language for specifying text strings
• How can we search for any of these?
• woodchuck
• woodchucks
• Woodchuck
• Woodchucks
Regular Expressions: Disjunctions
• Letters inside square brackets []
Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any digit

• Ranges [A-Z]
Pattern    Matches                 Example
[A-Z]      An upper case letter    Drenched Blossoms
[a-z]      A lower case letter     my books
[0-9]      A single digit          Chapter 1: Down the Rabbit Hole
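
The bracket patterns above can be tried out directly. This is a minimal sketch using Python's standard `re` module (the choice of Python and the variable names are assumptions, not part of the slides); the example strings come from the tables.

```python
import re

# A bracket expression matches exactly one character from the listed set.
woodchuck_hits = re.findall(r"[wW]oodchuck", "Woodchuck or woodchuck?")
uppercase_hits = re.findall(r"[A-Z]", "Drenched Blossoms")

# re.search returns the first (leftmost) match only.
first_lower = re.search(r"[a-z]", "my books").group()
first_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group()

print(woodchuck_hits)  # ['Woodchuck', 'woodchuck']
print(uppercase_hits)  # ['D', 'B']
print(first_lower)     # m
print(first_digit)     # 1
```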
Regular Expressions: Negation in Disjunction
• Negations [^Ss]
• Caret means negation only when first in []

Pattern    Matches                     Example
[^A-Z]     Not an upper case letter    Oyfn pripetchik
[^Ss]      Neither ‘S’ nor ‘s’         I have no exquisite reason
[^e^]      Neither e nor ^             Look here
a^b        The pattern a caret b       Look up a^b now
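
A short sketch of the negation rules, again assuming Python's `re` module. One flavor difference is worth flagging: in Python a caret outside brackets always acts as an anchor, so the literal pattern "a caret b" has to be written with a backslash there.

```python
import re

# [^A-Z]: first character that is not an upper case letter.
not_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()

# [^Ss]: first character that is neither 'S' nor 's'.
not_s = re.search(r"[^Ss]", "I have no exquisite reason").group()

# [^e^]: only the first caret negates; the second one is a literal '^'.
not_e_or_caret = re.findall(r"[^e^]", "e^x")

# Outside brackets, Python needs the caret escaped to match it literally.
literal_caret = re.search(r"a\^b", "Look up a^b now").group()
unescaped = re.search(r"a^b", "Look up a^b now")  # caret read as an anchor

print(not_upper)       # y
print(not_s)           # I
print(not_e_or_caret)  # ['x']
print(literal_caret)   # a^b
print(unescaped)       # None
```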
Regular Expressions: More Disjunction
• Woodchuck is another name for groundhog!
• The pipe | for disjunction
Pattern                      Matches
groundhog|woodchuck          groundhog, woodchuck
yours|mine                   yours, mine
a|b|c                        = [abc]
[gG]roundhog|[Ww]oodchuck    groundhog, Groundhog, woodchuck, Woodchuck
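
The pipe can be checked the same way. The sketch below assumes Python's `re` module and an invented sample sentence; it also confirms that a|b|c and [abc] find the same characters.

```python
import re

text = "The groundhog is also called a Woodchuck."  # invented sample sentence

# The lowercase-only disjunction misses the capitalized form.
lowercase_only = re.findall(r"groundhog|woodchuck", text)

# Combining brackets with the pipe covers both capitalizations.
either_case = re.findall(r"[gG]roundhog|[Ww]oodchuck", text)

# A disjunction of single characters behaves like a bracket expression.
pipe_chars = re.findall(r"a|b|c", "cabbage")
bracket_chars = re.findall(r"[abc]", "cabbage")

print(lowercase_only)               # ['groundhog']
print(either_case)                  # ['groundhog', 'Woodchuck']
print(pipe_chars == bracket_chars)  # True
```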
Regular Expressions: ? * + .

Pattern    Meaning                       Matches
colou?r    Optional previous char        color colour
oo*h!      0 or more of previous char    oh! ooh! oooh! ooooh!
o+h!       1 or more of previous char    oh! ooh! oooh! ooooh!
baa+                                     baa baaa baaaa baaaaa
beg.n                                    begin begun beg3n

Kleene *, Kleene + (Stephen C Kleene)
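
To see where each operator stops matching, it helps to test a few near misses as well. This is a sketch in Python; `fully_matched` is a hypothetical helper that keeps only the strings a pattern matches from start to end.

```python
import re

def fully_matched(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

optional_u = fully_matched(r"colou?r", ["color", "colour", "colouur"])
star = fully_matched(r"oo*h!", ["h!", "oh!", "ooh!", "oooh!"])
plus = fully_matched(r"o+h!", ["h!", "oh!", "ooh!"])
sheep = fully_matched(r"baa+", ["ba", "baa", "baaa"])
dot = fully_matched(r"beg.n", ["begin", "begun", "beg3n", "begn"])

print(optional_u)  # ['color', 'colour']
print(star)        # ['oh!', 'ooh!', 'oooh!']
print(plus)        # ['oh!', 'ooh!']
print(sheep)       # ['baa', 'baaa']
print(dot)         # ['begin', 'begun', 'beg3n']
```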
Examples
• What do the following match?
• [ab]*
• [NLP]*
• [NLPNLP]
• NLP.*NLP
• [0-9][0-9]*
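
One way to explore these exercises is to run each pattern against a few candidate strings. The sketch below assumes Python's `re` module; the candidate strings are made up, and `re.fullmatch` is used so that a pattern only counts when it covers the whole string.

```python
import re

# Pattern -> invented candidate strings to test against it.
candidates = {
    r"[ab]*": ["", "abba", "abc"],
    r"[NLP]*": ["", "PLN", "nlp"],
    r"[NLPNLP]": ["N", "NLP"],  # repeated letters in a set add nothing
    r"NLP.*NLP": ["NLPNLP", "NLP is not NLP", "NLP"],
    r"[0-9][0-9]*": ["7", "2024", ""],
}

results = {
    pattern: [s for s in strings if re.fullmatch(pattern, s) is not None]
    for pattern, strings in candidates.items()
}

for pattern, matched in results.items():
    print(pattern, "->", matched)
```

The starred sets accept the empty string, [NLPNLP] accepts only a single character, and [0-9][0-9]* needs at least one digit.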

Regular Expressions: Anchors ^ $

Pattern       Matches
^[A-Z]        Palo Alto
^[^A-Za-z]    1   “Hello”
\.$           The end.
.$            The end?   The end!
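
The anchors can be checked with a few one-line searches. This is a sketch assuming Python's `re` module; the variable names are illustrative and plain ASCII quotes stand in for the curly quotes on the slide.

```python
import re

# ^ anchors the match to the start of the string.
starts_upper = re.search(r"^[A-Z]", "Palo Alto").group()
starts_nonletter = re.search(r"^[^A-Za-z]", '"Hello"').group()

# \.$ requires a literal period at the end; .$ accepts any final character.
period_at_end = re.search(r"\.$", "The end.") is not None
question_as_period = re.search(r"\.$", "The end?") is not None
any_final_char = re.search(r".$", "The end?").group()

print(starts_upper)        # P
print(starts_nonletter)    # "
print(period_at_end)       # True
print(question_as_period)  # False
print(any_final_char)      # ?
```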
Example
• Find me all instances of the word “the” in a text.
the
  Misses capitalized examples
[tT]he
  Incorrectly returns other or theology
[^a-zA-Z][tT]he[^a-zA-Z]
  Returns instances with no alphabetic letters on either side of “the”
  Misses instances where “The” starts a line
(^|[^a-zA-Z])[tT]he[^a-zA-Z]
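
The four refinement steps can be replayed on a single sentence. The sketch below assumes Python's `re` module and a made-up test sentence; `all_matches` is a hypothetical helper that returns whole matches even when the pattern contains a group.

```python
import re

text = "The other day the theology student read the book."  # invented sentence

patterns = [
    r"the",
    r"[tT]he",
    r"[^a-zA-Z][tT]he[^a-zA-Z]",
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
]

def all_matches(pattern, s):
    # finditer + group(0) gives the full matched text for every match.
    return [m.group(0) for m in re.finditer(pattern, s)]

matches = {p: all_matches(p, text) for p in patterns}

for p in patterns:
    print(p, "->", matches[p])
# the                          -> 4 hits, two of them inside "other" and "theology"
# [tT]he                       -> 5 hits, now including "The"
# [^a-zA-Z][tT]he[^a-zA-Z]     -> [' the ', ' the ']
# (^|[^a-zA-Z])[tT]he[^a-zA-Z] -> ['The ', ' the ', ' the ']
```

The matches from the last two patterns include the surrounding non-letter characters, which is a side effect of checking context by consuming it.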
Regular Expressions: Advanced Operators
• . = match a single character of any value, except end of line.
• * (Asterisk) = match zero or more of the preceding character or expression.
• {x,y} = match x to y occurrences of the preceding.
• {x} = match exactly x occurrences of the preceding.
• {x,} = match x or more occurrences of the preceding.
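
A brief sketch of the counters and of the dot's end-of-line exception, assuming Python's `re` module with default flags. `fully_matched` is a hypothetical helper and the test strings are invented.

```python
import re

def fully_matched(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

two_to_three = fully_matched(r"a{2,3}", ["a", "aa", "aaa", "aaaa"])
exactly_four = fully_matched(r"[0-9]{4}", ["123", "1234", "12345"])
two_or_more = fully_matched(r"o{2,}h!", ["oh!", "ooh!", "ooooh!"])

# By default the dot does not match a newline character.
dot_across_newline = re.search(r"a.b", "a\nb")

print(two_to_three)        # ['aa', 'aaa']
print(exactly_four)        # ['1234']
print(two_or_more)         # ['ooh!', 'ooooh!']
print(dot_across_newline)  # None
```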

Regular Expressions: Advanced Operators
• ^ (Caret) = match expression at the start of a line, as in ^A.
• $ (Dollar) = match expression at the end of a line, as in A$.
• \ (Back Slash) = turn off the special meaning of the next character, as in \^.
• [ ] (Brackets) = match any one of the enclosed characters, as in [aeiou]. Use Hyphen "-" for a range, as in [0-9].
• [^ ] = match any one character except those enclosed in [ ], as in [^0-9].
Examples
• What do the following match?
• ^Language.$
• ^Language\.$
• Language\.$
• Language.
• [Language.]
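
These five patterns differ only in anchoring and escaping, so running them against the same strings makes the differences visible. The sketch assumes Python's `re` module; the test strings are invented.

```python
import re

patterns = [
    r"^Language.$",
    r"^Language\.$",
    r"Language\.$",
    r"Language.",
    r"[Language.]",
]
strings = ["Language.", "Languages", "Body Language.", "xyz"]

# For each pattern, keep the strings in which it matches somewhere.
found = {
    p: [s for s in strings if re.search(p, s) is not None]
    for p in patterns
}

for p in patterns:
    print(p, "->", found[p])
```

The unescaped dot accepts the s in "Languages", the escaped dot does not, and [Language.] matches any string containing at least one of the characters L, a, n, g, u, e or a period.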
Anchors \b and \B
• \b matches a word boundary
• Word: a sequence of digits, underscores, or letters
• \B matches a non-boundary
• \b99\b matches “was run out on 99” and “the microwave costs $99” but not “was run out on 199”
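
The three example strings can be checked directly. This is a sketch assuming Python's `re` module, where \b and \B follow the word definition given above.

```python
import re

pattern = r"\b99\b"

# A space or '$' before 99 is a non-word character, so a boundary exists.
after_space = re.search(pattern, "was run out on 99") is not None
after_dollar = re.search(pattern, "the microwave costs $99") is not None

# In 199 the digit 1 is a word character, so there is no boundary before 99.
inside_number = re.search(pattern, "was run out on 199") is not None

# \B requires the opposite: no boundary at that position.
non_boundary = re.search(r"\B99\b", "was run out on 199") is not None

print(after_space, after_dollar, inside_number, non_boundary)
# True True False True
```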
Disjunction, Grouping, Precedence
• Disjunction operator “|”
• Search for cat or dog: cat|dog (no spaces around the pipe, since a space inside a pattern matches a literal space)
• What would the following match?
• [catdog]
• guppy|ies
• gupp(y|ies)
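
A sketch of how these three patterns differ, assuming Python's `re` module; `fully_matched` is a hypothetical helper that keeps only whole-string matches.

```python
import re

def fully_matched(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Brackets match one character, not the words cat or dog.
bracket = fully_matched(r"[catdog]", ["c", "g", "cat", "dog"])

# Without parentheses the pipe splits the whole pattern: guppy OR ies.
ungrouped = fully_matched(r"guppy|ies", ["guppy", "guppies", "ies"])

# Parentheses limit the disjunction to the suffix.
grouped = fully_matched(r"gupp(y|ies)", ["guppy", "guppies", "ies"])

print(bracket)    # ['c', 'g']
print(ungrouped)  # ['guppy', 'ies']
print(grouped)    # ['guppy', 'guppies']
```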

Disjunction, Grouping, Precedence
• Kleene * applies to a single character, not a sequence, unless parentheses group the sequence
• What do the following match?
• Column[0-9]*
• (Column[0-9])*
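
The effect of the parentheses shows up when both patterns are run against the same strings. This sketch assumes Python's `re` module; the candidate strings are invented.

```python
import re

candidates = ["", "Column", "Column1", "Column123", "Column1Column2"]

def fully_matched(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# The star applies only to [0-9]: the word Column, then any number of digits.
star_on_digit = fully_matched(r"Column[0-9]*")

# The star applies to the group: zero or more repeats of Column plus one digit.
star_on_group = fully_matched(r"(Column[0-9])*")

print(star_on_digit)  # ['Column', 'Column1', 'Column123']
print(star_on_group)  # ['', 'Column1', 'Column1Column2']
```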

Disjunction, Grouping, Precedence
• RE operator precedence (highest to lowest)
• Parentheses ()
• Counters * + ? { }
• Sequences and anchors ^, $
• Disjunction |
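
Because disjunction has the lowest precedence, anchors attach to the neighboring sequence and not to the whole alternation. The sketch below assumes Python's `re` module and invented test words; it contrasts ^cat|dog$ with the parenthesized form.

```python
import re

# Parsed as (^cat) | (dog$): "starts with cat" OR "ends with dog".
loose = r"^cat|dog$"
loose_catalog = re.search(loose, "catalog").group()
loose_hotdog = re.search(loose, "hotdog").group()

# Parentheses bind first, so both anchors now apply to either word.
strict = r"^(cat|dog)$"
strict_catalog = re.search(strict, "catalog")
strict_dog = re.search(strict, "dog").group()

print(loose_catalog)   # cat
print(loose_hotdog)    # dog
print(strict_catalog)  # None
print(strict_dog)      # dog
```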

Errors
• The process we just went through was based on fixing two kinds of errors
• Matching strings that we should not have matched (there, then, other)
• False positives (Type I)
• Not matching things that we should have matched (The)
• False negatives (Type II)
Precision and Recall

Errors cont.
• In NLP we are always dealing with these kinds of errors.
• Reducing the error rate for an application often involves two antagonistic efforts:
• Increasing accuracy or precision (minimizing false positives)
• Increasing coverage or recall (minimizing false negatives).
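
Precision and recall can be computed for the “the” patterns on a small labeled sample. The sketch below uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The token list, the token-level scoring scheme, and the third pattern are assumptions made for illustration and are not taken from the slides.

```python
import re

# Hand-labeled toy data (an assumption): a token counts as a true instance
# when it is the word "the" in any capitalization.
tokens = ["The", "other", "day", "the", "theology", "student", "read", "the", "book"]
gold = [tok.lower() == "the" for tok in tokens]

def precision_recall(pattern):
    """Score a pattern by whether it fires anywhere inside each token."""
    predicted = [re.search(pattern, tok) is not None for tok in tokens]
    tp = sum(p and g for p, g in zip(predicted, gold))      # correct hits
    fp = sum(p and not g for p, g in zip(predicted, gold))  # false positives
    fn = sum(g and not p for p, g in zip(predicted, gold))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = {p: precision_recall(p) for p in [r"the", r"[tT]he", r"^[tT]he$"]}

for pattern, (precision, recall) in scores.items():
    print(f"{pattern:10s} precision={precision:.2f} recall={recall:.2f}")
# the        precision=0.50 recall=0.67
# [tT]he     precision=0.60 recall=1.00
# ^[tT]he$   precision=1.00 recall=1.00
```

Adding [tT] removes the false negative “The”, and the anchors remove the false positives “other” and “theology”.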
Summary
• Regular expressions play a surprisingly large role
• Sophisticated sequences of regular expressions are often the first model for any text processing task
• For many hard tasks, we use machine learning classifiers
• But regular expressions are used as features in the classifiers
• Can be very useful in capturing generalizations
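
As a rough illustration of regular expressions serving as classifier features, the sketch below turns a handful of patterns into binary features for a piece of text. The feature names and the `extract_features` helper are hypothetical and not part of the slides; a real system would pass such a dictionary on to whatever classifier it uses.

```python
import re

# Hypothetical feature set: each name maps to a compiled pattern.
FEATURE_PATTERNS = {
    "has_digit": re.compile(r"[0-9]"),
    "starts_upper": re.compile(r"^[A-Z]"),
    "ends_with_period": re.compile(r"\.$"),
    "mentions_the": re.compile(r"\b[tT]he\b"),
}

def extract_features(text):
    """Map each feature name to 1 if its pattern matches the text, else 0."""
    return {
        name: int(pattern.search(text) is not None)
        for name, pattern in FEATURE_PATTERNS.items()
    }

features = extract_features("Chapter 1: Down the Rabbit Hole")
print(features)
# {'has_digit': 1, 'starts_upper': 1, 'ends_with_period': 0, 'mentions_the': 1}
```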
