
NLP Part 1


Natural Language Processing
IFT6758 - Data Science

Sources:
http://demo.clab.cs.cmu.edu/NLP/

http://u.cs.biu.ac.il/~89-680/

And many more …


Announcements

• HM#1 grades are accessible in Gradescope!

• All the mid-term evaluation scores are on the scoreboard!

• HM#2 grades will be out on Gradescope on Thursday!

• Mid-term presentation grades will be announced on Gradescope on Thursday!

• Mid-term exam grades will be announced next Monday!

• Crash course on DL will be next Thursday!


!2
What is Natural Language Processing (NLP)?

• Automating the analysis, generation, and acquisition of human ("natural") language

• Analysis (or "understanding" or "processing" ...): input is language, output is some representation that supports useful action

• Generation: input is that representation, output is language

• Acquisition: obtaining the representation and necessary algorithms, from knowledge and data

!3
Applications of NLP
Supervised NLP Unsupervised NLP

!4
Machine translation

!5
Question answering

credit: ifunny.com

!6
Dialog system

!7
Sentiment/opinion Analysis

!8
Text classification

www.wired.com

!9
NLP Key tasks

NLP language model

!10
Why is NLP hard?

Representation

• The mappings between levels are extremely complex.

• Appropriateness of a representation depends on the application.

!11
Why is NLP hard?

• Morphology: Analysis of words into meaningful components

• Spectrum of complexity across languages:

• Analytic or isolating languages (e.g., English, Chinese)

• Synthetic languages (e.g., Finnish, Turkish, Hebrew)

!12
Tokenizing

• The most common task in NLP is tokenization.

• Tokenization is the process of breaking a document down into words, punctuation marks, numeric digits, etc.
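A minimal tokenization sketch with NLTK (assuming the punkt tokenizer models have been downloaded):

```python
import nltk
nltk.download("punkt")  # one-time download of the tokenizer models
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Tokenization breaks a document into words, punctuation marks, and digits. It is usually the first step."
print(sent_tokenize(text))   # list of sentences
print(word_tokenize(text))   # list of word/punctuation tokens
```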

!13
Text Normalization

• Stemming and lemmatization are text normalization (sometimes called word normalization) techniques in Natural Language Processing that are used to prepare text, words, and documents for further processing.

• The languages we speak and write are made up of many words that are often derived from one another.

• A language whose words change form as their use in speech changes is called an inflected language. The degree of inflection may be higher or lower in a given language.

!14
Stemming

• The stem (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as -ed, -ize, -s, de-, mis-. Stemming a word or sentence may therefore produce tokens that are not actual words. Stems are created by removing the suffixes or prefixes used with a word.

• E.g.,

• Python NLTK provides not only two English stemmers, PorterStemmer and LancasterStemmer, but also many non-English stemmers as part of SnowballStemmers, ISRIStemmer, and RSLPStemmer.
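A short sketch comparing NLTK's two English stemmers; the two algorithms apply different suffix-stripping rules, so their outputs can differ:

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

for word in ["running", "studies", "organization", "misunderstanding"]:
    # Lancaster is generally more aggressive than Porter
    print(word, "->", porter.stem(word), "/", lancaster.stem(word))
```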

!15
Lemmatization

• Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

• E.g., runs, running, and ran are all forms of the word run, therefore run is the lemma of all these words.

• Because lemmatization returns an actual word of the language, it is used where it is necessary to get valid words.
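A minimal lemmatization sketch with NLTK's WordNetLemmatizer (assuming the WordNet corpus has been downloaded); note that the part of speech has to be supplied to get the correct lemma:

```python
import nltk
nltk.download("wordnet")  # one-time download of the WordNet corpus
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # -> "run"
print(lemmatizer.lemmatize("ran", pos="v"))      # -> "run"
print(lemmatizer.lemmatize("running"))           # POS defaults to noun, so "running" is returned unchanged
```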

!16
Stemming vs. Lemmatization

• You may be asking yourself: when should I use stemming and when should I use lemmatization?

• Stemming and lemmatization both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a lemma is an actual word of the language.

• Stemming follows an algorithm with a fixed sequence of steps to perform on each word, which makes it faster. In lemmatization, you use the WordNet corpus (and a stop-word corpus as well) to produce the lemma, which makes it slower than stemming. You also have to specify the part of speech to obtain the correct lemma.

!17
Example (Tokenizing and Stemming)

!18
Stop words

• Stop words are words that do not carry significant meaning on their own. They are very common words in a language (e.g., a, an, the in English; 的, 了 in Chinese; え, も in Japanese). They do not help with most NLP problems such as semantic analysis, classification, etc.

• Usually, these words are filtered out from text because they add a vast amount of uninformative content.

• Each language has its own list of stop words to use; e.g., words that are commonly used in English are as, the, be, are.

• In NLTK, you can use the pre-defined stop words, or you can use lists defined by other parties such as Stanford NLP and Rank NL.

Stanford NLP: https://github.com/stanfordnlp/CoreNLP/blob/master/data/edu/stanford/nlp/patterns/surface/stopwords.txt

Rank NL: https://www.ranks.nl/stopwords

jieba: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
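A short sketch of removing NLTK's pre-defined English stop words from a tokenized sentence:

```python
import nltk
nltk.download("stopwords")
nltk.download("punkt")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("The dog ate the cat and the hat")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # stop words such as "the" and "and" are dropped
```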

!19
Why is NLP hard?

• Lexemes: Normalize and disambiguate words

• Words with multiple meanings: bank, mean (extra challenge: domain-specific meanings)

• Multi-word expressions: e.g., make ... decision, take out, make up, …

• For English, part-of-speech tagging is one very common kind of lexical analysis

• For others: super-sense tagging, various forms of word sense disambiguation, syntactic "supertags," …

!20
Part of Speech Tagging

• "Part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context — i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph."

• A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
!21
POS algorithms

• POS-tagging algorithms fall into two distinctive groups:

  Rule-based POS taggers

  Stochastic POS taggers

• Automatic part-of-speech tagging is an area of natural language processing where statistical techniques have been more successful than rule-based methods.

• Example of a rule: if an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective.

• Defining a set of rules manually is an extremely cumbersome process and is not scalable at all, so we need some automatic way of doing this.

• Automatic approaches: Brill's tagger is a rule-based tagger that goes through the training data and finds the set of tagging rules that best define the data and minimize POS-tagging errors. A Hidden Markov Model can be used as a simple stochastic POS-tagging approach.
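A minimal POS-tagging sketch using NLTK's off-the-shelf statistical tagger (assuming the tagger model has been downloaded):

```python
import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")  # NLTK's default statistical POS tagger

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))  # list of (token, POS tag) pairs, e.g. ('fox', 'NN')
```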

!22
NER

• Named entity recognition (NER), also known as entity identification (EI) or entity extraction, is the task that locates and classifies atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.

<ENAMEX TYPE="PERSON">John</ENAMEX> sold <NUMEX TYPE="QUANTITY">5</NUMEX> companies in <TIMEX TYPE="DATE">2002</TIMEX>.
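A short NER sketch using spaCy (an assumption: the slides do not prescribe a library; the small English model must first be installed with `python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John sold 5 companies in 2002.")
for ent in doc.ents:
    # each entity span carries a predicted label, e.g. ('John', 'PERSON'), ('2002', 'DATE')
    print(ent.text, ent.label_)
```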

!23
Example (POS/NER tagging)

!24
Why is NLP hard?

• Syntax: Transform a sequence of symbols into a hierarchical or compositional structure.

• Closely related to linguistic theories about what makes some sentences well-formed and others not.

• Ambiguities explode combinatorially
!25
Syntactic parsing (Constituency)

• Constituent-based grammars are used to analyze and determine the constituents of a sentence.

• These grammars can be used to model or represent the internal structure of sentences in terms of a hierarchically ordered structure of their constituents. Each word usually belongs to a specific lexical category and forms the head word of different phrases. These phrases are formed based on rules called phrase structure rules.

!26
Dependency parsing

• In dependency parsing, we try to use dependency-based grammars to analyze and infer both structure and semantic dependencies and relationships between tokens in a sentence.

• The word that has no dependency is called the root of the sentence.

• The verb is taken as the root of the sentence in most cases. All the other words are directly or indirectly linked to the root verb using links, which are the dependencies.

https://nlp.stanford.edu/software/dependencies_manual.pdf
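A minimal dependency-parsing sketch, again using spaCy as an illustrative (assumed) library:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The police officer detained the suspect at the crime scene.")
for token in doc:
    # each token points to its head; the root points to itself with dep_ == "ROOT"
    print(token.text, token.dep_, "->", token.head.text)
```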

!27
Why is NLP hard?

• Semantics: Mapping of natural language sentences into domain representations.

• E.g., a robot command language, a database query, or an expression in a formal logic.

• Scope ambiguities:

  a. Everyone loves SOMEone.
  b. EVERYone loves someone.

• Going beyond specific domains is a goal of Artificial Intelligence
!28
Semantic analysis

• Semantic role labeling is the process that assigns labels to words or phrases in a sentence that indicate their semantic role in the sentence, such as that of an agent, goal, or result.

• The suspect was detained by the police officer at the crime scene.
• The police officer detained the suspect at the scene of the crime.

!29
Why is NLP hard?

• Pragmatics: Study of the ways in which context contributes to meaning

• Any non-local meaning phenomena

• Discourse:

• Structures and effects in related sequences of sentences

• Texts, dialogues, multi-party conversations
!30
Co-reference resolution

• Coreference resolution is the task of finding all expressions that refer to the same entity in a text.

• It is an important step for many higher-level NLP tasks that involve natural language understanding, such as document summarization, question answering, and information extraction.

Christopher Robin is alive and well. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book.
!31
Complexity of NLP

• Linguistic representations are theorized constructs -> we cannot observe them directly.

• Data: Input is likely to be noisy.

• Ambiguity: Each string may have many possible interpretations at every level. The correct resolution of the ambiguity will depend on the intended meaning, which is often inferable from context.

• People are good at linguistic ambiguity resolution, but computers are not so good at it:

  Q: How do we represent sets of possible alternatives?

  Q: How do we represent context?

• Variability: There are many ways to express the same meaning, and immeasurably many meanings to express.

• Lots of words/phrases. Each level interacts with the others. There is tremendous diversity in human languages. Languages express the same kind of meaning in different ways. Some languages express some meanings more readily or more often than others.
!32
NLP challenges

• Data issues:

• A lot of data: In some cases, we deal with huge amounts of data.

  Need to come up with models that can process a lot of data efficiently.

• Lack of data: Many problems in NLP suffer from a lack of data:

• Non-standard platforms (code-switching)

• Expensive annotation (word-sense disambiguation, named-entity recognition)

• Need to use methods to overcome this challenge (semi-supervised learning, multi-task learning, …)

!33
NLP challenges

• Ambiguity challenge: Polysemy: one word, many meanings

!34
NLP challenges

• Ambiguity challenge: Syntactic Ambiguity: same words, many meanings

!35
NLP challenges

• Variability: different words, same meaning

!36
NLP framework

NLP language model

!37
Language Understanding

• It's all about how likely a sentence is...

!38
Probabilistic Language Models

• Goal: Compute the probability of a sentence or sequence of words

• Related task: probability of an upcoming word

• A model that computes either of the above is called a language model.
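In symbols, a standard formulation of these two quantities (reconstructed here, since the slide's own formulas did not survive extraction):

```latex
% probability of a whole word sequence
P(W) = P(w_1, w_2, \ldots, w_n)

% probability of an upcoming word given its history
P(w_n \mid w_1, w_2, \ldots, w_{n-1})
```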

!39
Language Understanding

!40
Recall (Probability 101)

!41
Recall (Probability 101)

Chain Rule:
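The chain rule in its standard form (reconstructed, as the slide's equation did not survive extraction):

```latex
P(w_1, w_2, \ldots, w_n)
  = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})
  = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```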

!42
Probabilistic Language Models

!43
N-gram Language Model

For Unigram:
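A standard statement of the n-gram approximation and its unigram special case (reconstructed; the slide's own equations did not survive extraction):

```latex
% n-gram model: condition each word on only the previous n-1 words
P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})

% unigram model (n = 1): words are treated as independent
P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i),
\qquad P(w_i) = \frac{\operatorname{count}(w_i)}{\sum_{j} \operatorname{count}(w_j)}
```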

!44
N-gram Example
For Unigram:

!45
Effects of n in performance

!46
Evaluation: How good is a language model?

• Extrinsic evaluation: Test a trained model on a test collection

  – Try to predict each word

  – The more precisely a model can predict the words, the better the model

• Intrinsic evaluation: Perplexity

  Perplexity is the inverse probability of the test set, normalized by the number of words:

  – Given P(w_i) and a test text of length N

  – Perplexity = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i)}

  – The lower the better!
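A tiny numeric sketch of the perplexity formula (the per-word probabilities below are made-up values for illustration):

```python
import math

# probabilities P(w_i) that some language model assigns to a 4-word test text (assumed values)
probs = [0.1, 0.2, 0.05, 0.1]
N = len(probs)

log2_sum = sum(math.log2(p) for p in probs)
perplexity = 2 ** (-log2_sum / N)   # lower is better
print(perplexity)
```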

!47
Parametric Language Model

!48
Parametric Language Model

!49
Embedding

• Embedding: A mapping from a space with one dimension per linguistic unit (character, morpheme, word, phrase, paragraph, sentence, document) to a continuous vector space of much lower dimension.

• Vector representation: the "meaning" of a linguistic unit is represented by a vector of real numbers.

• E.g., the word good is mapped to such a vector.

!50
One hot encoding

• Naive and simple word embedding technique: Map each word to a unique ID

• Typical vocabulary sizes will vary between 10k and 250k.

!51
One hot encoding

• Use the word ID to get a basic representation of the word via one-hot encoding of the ID.

• The one-hot vector of an ID is a vector filled with 0s, except for a 1 at the position associated with the ID. E.g., for vocabulary size D = 10, the one-hot vector of a word (w) with ID = 4 is

  e(w) = [ 0 0 0 1 0 0 0 0 0 0 ]

• A one-hot encoding makes no assumption about word similarity: all words are equally similar to/different from each other.
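A minimal one-hot encoding sketch in NumPy (the tiny vocabulary is made up for illustration):

```python
import numpy as np

# toy vocabulary: word -> ID (assumed example)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "hat": 4}

def one_hot(word: str, vocab: dict) -> np.ndarray:
    """Return a vector of zeros with a single 1 at the word's ID."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

print(one_hot("on", vocab))  # [0. 0. 0. 1. 0.]
```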

!52
One hot encoding

• Pros: Quick and simple

• Cons: Size of input vector scales with size of vocabulary. Must pre-determine vocabulary size.

• Cannot scale to large or infinite vocabularies (Zipf's law!)

• Computationally expensive: a large input vector results in far too many parameters to learn.

• "Out-of-vocabulary" (OOV) problem: How would you handle unseen words in the test set? (One solution is to have an "UNKNOWN" symbol that represents low-frequency or unseen words.)

• No relationship between words: each word is an independent unit vector

!53
Bag of words

• Bag of words (BoW): a document is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

• Vocabulary: the set of all the words in the corpus

  Document: the words in the document w.r.t. the vocabulary, with multiplicity

Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat"

Vocab = { the, cat, sat, on, hat, dog, ate, and }

Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Sentence 2: { 3, 1, 0, 0, 1, 1, 1, 1 }
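A quick bag-of-words sketch with scikit-learn (an assumption: the slides do not prescribe a library; note that CountVectorizer orders its vocabulary alphabetically, so the columns differ from the order shown above, and get_feature_names_out requires scikit-learn >= 1.0):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the hat",
        "The dog ate the cat and the hat"]

vectorizer = CountVectorizer()          # lowercases and tokenizes by default
X = vectorizer.fit_transform(docs)      # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # alphabetically ordered vocabulary
print(X.toarray())                         # per-document word counts
```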

!54
Bag of words

• Pros:

• Quick and simple

• Cons:

• Bag of words model is very high-dimensional and sparse

• Cannot capture semantics or morphology

• Orderless

!55
TF-IDF

• Raw counts are problematic: frequent words will dominate most documents -> not informative.

• Importance increases proportionally to the number of times a word appears in the document, but is inversely proportional to the frequency of the word in the corpus.

• TF-IDF (for term (t) – document (d)):

  TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

  IDF(t) = log (Total number of documents / Number of documents with term t in it)

!56
TF-IDF

• Document D1 contains 100 words.

• cat appears 3 times in D1

• TF(cat) = 3 / 100 = 0.03

• The corpus contains 10 million documents

• cat appears in 1,000 documents

• IDF(cat) = log (10,000,000 / 1,000) = 4

• TF-IDF(cat) = 0.03 * 4 = 0.12
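The same computation in a few lines of Python (using the base-10 logarithm, which matches the slide's IDF of 4):

```python
import math

tf = 3 / 100                            # term frequency of "cat" in D1 -> 0.03
idf = math.log10(10_000_000 / 1_000)    # inverse document frequency -> 4.0
tfidf = tf * idf                        # -> 0.12
print(tf, idf, tfidf)
```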

!57
TF-IDF

• Pros:

• Easy to compute

• Has some basic metric to extract the most descriptive terms in a document

• Thus, can easily compute the similarity between 2 documents using it, e.g., using cosine
similarity

• Cons:

• Based on the bag-of-words (BoW) model, therefore it does not capture position in text,
semantics, co-occurrences in different documents, etc. TF-IDF is only useful as a lexical level
feature.

• Cannot capture semantics (unlike word embeddings)

• Orderless

!58
N-gram

• Vocab = the set of all n-grams in the corpus

  Document = the n-grams in the document w.r.t. the vocabulary, with multiplicity

For bigrams:

Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat"

Vocab = { the cat, cat sat, sat on, on the, the hat, the dog, dog ate, ate the, cat and, and the }

Sentence 1: { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0 }
Sentence 2: { 1, 0, 0, 0, 1, 1, 1, 1, 1, 1 }
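The same bigram features with scikit-learn (same caveat as before: the columns come out in alphabetical order):

```python
from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))   # bigrams only
X = bigram_vectorizer.fit_transform(["The cat sat on the hat",
                                     "The dog ate the cat and the hat"])

print(bigram_vectorizer.get_feature_names_out())
print(X.toarray())
```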

!59
N-gram

• Pros:

• Incorporates order of words

• Simple and quick

• Cons:

• Very large vocabulary

• Data sparsity

!60
Data sparsity

!61
N-gram

• Pros:

• Incorporates order of words

• Simple and quick

• Cons:

• Very large vocabulary

• Data sparsity

• False independence assumption

!62
False independence assumption

!63
N-gram

• Pros:

• Incorporates order of words

• Simple and quick

• Cons:

• Very large vocabulary

• Data sparsity

• False independence assumption

• Cannot capture syntactic/semantic similarity

!64
The Distributional Hypothesis

• The Distributional Hypothesis: words that occur in the same contexts tend
to have similar meanings (Harris, 1954)

• “You shall know a word by the company it keeps” (Firth, 1957)

!65
The Distributional Hypothesis

!66
Learning Dense embeddings

Deerwester, Dumais, Landauer, Furnas, and Harshman, Indexing by latent semantic analysis, JASIS, 1990.
Pennington, Socher, and Manning, GloVe: Global Vectors for Word Representation, EMNLP, 2014.
Mikolov, Sutskever, Chen, Corrado, and Dean, Distributed representations of words and phrases and their compositionality, NIPS, 2013.
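A minimal sketch of training dense word vectors in the spirit of word2vec, using gensim (an assumption: the slides do not prescribe a library; gensim 4.x parameter names are used, and the toy corpus is far too small to learn meaningful vectors):

```python
from gensim.models import Word2Vec

# toy tokenized corpus (illustrative only)
sentences = [["the", "cat", "sat", "on", "the", "hat"],
             ["the", "dog", "ate", "the", "cat", "and", "the", "hat"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram

print(model.wv["cat"])               # 50-dimensional dense vector for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbours by cosine similarity
```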

!67
