NLP Part 1: Natural Language Processing
IFT6758 - Data Science
Sources:
http://demo.clab.cs.cmu.edu/NLP/
http://u.cs.biu.ac.il/~89-680/
!2
What is Natural Language
Processing (NLP)?
• Automating the analysis, generation, and
acquisition of human (“natural”) language
!3
Applications of NLP
• Supervised NLP
• Unsupervised NLP
!4
Machine translation
!5
Question answering
credit: ifunny.com
!6
Dialog system
!7
Sentiment/opinion Analysis
!8
Text classification
www.wired.com
!9
Key NLP tasks
!10
Why is NLP hard?
Representation
!11
Why is NLP hard?
• Morphology: Analysis of words into meaningful
components
!12
Tokenizing
!13
Text Normalization
!14
Stemming
• The stem (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as -ed, -ize, -s, de-, and mis-. Stemming a word or sentence may therefore produce tokens that are not actual words, since stems are created by removing the suffixes or prefixes used with a word.
• E.g. (see the sketch below):
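A minimal sketch with NLTK's PorterStemmer (the example words are my own, not the slide's; assumes NLTK is installed):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["running", "flies", "studies", "caresses"]:
        print(word, "->", stemmer.stem(word))
    # running -> run, flies -> fli, studies -> studi, caresses -> caress
    # note: "fli" and "studi" are not actual English words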
!15
Lemmatization
• e.g., runs, running, ran are all forms of the word run, therefore run is the
lemma of all these words.
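A comparable sketch with NLTK's WordNetLemmatizer (assumes the 'wordnet' data has been downloaded; lemmatize() treats words as nouns by default, so a POS hint is passed here):

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    for word in ["runs", "running", "ran"]:
        print(word, "->", lemmatizer.lemmatize(word, pos="v"))
    # all three forms map to the lemma "run"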
!16
Stemming vs. Lemmatization
• You may be asking yourself: when should I use stemming and when should I use lemmatization?
• Stemming and lemmatization both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a lemma is an actual word of the language.
!17
Example (Tokenizing and Stemming)
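A comparable example with NLTK (the sentence is my own; requires the 'punkt' tokenizer data):

    import nltk
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    sentence = "The cats were chasing mice in the garden."
    tokens = nltk.word_tokenize(sentence)   # ['The', 'cats', 'were', 'chasing', ...]
    stems = [stemmer.stem(t) for t in tokens]
    print(list(zip(tokens, stems)))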
!18
Stop words
• Stop words are words that carry little significance on their own. They are very common words in a language (e.g., a, an, the in English; 的, 了 in Chinese; え, も in Japanese) and do not help with most NLP problems such as semantic analysis or classification.
• Usually, these words are filtered out of the text because they contribute a large amount of uninformative content.
• Each language has its own list of stop words; e.g., words commonly used in English are as, the, be, are.
• In NLTK, you can use the pre-defined stop words, or lists defined by other parties such as Stanford NLP and Rank NL.
Stanford NLP: https://github.com/stanfordnlp/CoreNLP/blob/master/data/edu/stanford/nlp/patterns/surface/stopwords.txt
Rank NL: https://www.ranks.nl/stopwords
jieba: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
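A short sketch of stop-word removal with NLTK's pre-defined English list (assumes the 'stopwords' and 'punkt' data have been downloaded; the sentence is my own):

    import nltk
    from nltk.corpus import stopwords

    stop_set = set(stopwords.words("english"))
    tokens = nltk.word_tokenize("This is an example of a sentence with stop words")
    content = [t for t in tokens if t.lower() not in stop_set]
    print(content)   # roughly ['example', 'sentence', 'stop', 'words']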
!19
Why is NLP hard?
• Lexemes: Normalize and disambiguate words
!20
Part of Speech Tagging
!21
POS algorithms
• Automatic approaches: Brill's tagger is a rule-based tagger that goes through the training data, finds the set of tagging rules that best describe the data, and minimizes POS tagging errors. Hidden Markov Models can be used as a simple stochastic POS tagging approach.
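As an illustration, NLTK's default tagger (an averaged perceptron, a third alternative to the Brill and HMM taggers above) can be called directly; requires the 'punkt' and 'averaged_perceptron_tagger' data:

    import nltk

    tokens = nltk.word_tokenize("The police officer detained the suspect")
    print(nltk.pos_tag(tokens))
    # e.g. [('The', 'DT'), ('police', 'NN'), ('officer', 'NN'),
    #       ('detained', 'VBD'), ('the', 'DT'), ('suspect', 'NN')]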
!22
NER
• Named entity recognition (NER) (also known as entity identification (EI) and entity extraction) is the task of locating and classifying atomic elements in text into predefined categories such as names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.
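A small sketch with NLTK's ne_chunk (requires the 'maxent_ne_chunker' and 'words' data); spaCy or Stanford CoreNLP are common alternatives, and the sentence is my own:

    import nltk

    sent = "Barack Obama visited Montreal in 2016"
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
    print(tree)   # chunks spans with labels such as PERSON and GPE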
!23
Example (POS/NER tagging)
!24
Why is NLP hard?
• Syntax: Transform a sequence of symbols into a
hierarchical or compositional structure.
!25
Syntactic parsing (Constituency)
!26
Dependency parsing
• The word that has no dependency is called the root of the sentence.
• The verb is taken as the root of the sentence in most cases. All the other
words are directly or indirectly linked to the root verb using links, which are
the dependencies.
https://nlp.stanford.edu/software/dependencies_manual.pdf
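A dependency-parse sketch with spaCy (assumes the 'en_core_web_sm' model is installed; spaCy's label set differs slightly from the Stanford manual linked above):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The police officer detained the suspect at the crime scene")
    for token in doc:
        print(token.text, token.dep_, "<-", token.head.text)
    # the main verb "detained" is labelled ROOT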
!27
Why is NLP hard?
• Scope ambiguities:
!28
Semantic analysis
• The suspect was detained by the police officer at the crime scene
• The police officer detained the suspect at the scene of the crime.
!29
Why is NLP hard?
• Pragmatics: Study the ways in which context
contributes to meaning
• Discourse:
!30
Co-reference resolution
• Linguistic representations are theorized constructs: we cannot observe them directly.
• Ambiguity: each string may have many possible interpretations at every level. The correct resolution of the ambiguity depends on the intended meaning, which is often inferable from context.
• People are good at resolving linguistic ambiguity, but computers are not:
Q: How do we represent sets of possible alternatives?
Q: How do we represent context?
• Variability: there are many ways to express the same meaning, and immeasurably many meanings to express.
• There are lots of words/phrases, each level interacts with the others, and there is tremendous diversity in human languages: languages express the same kind of meaning in different ways, and some languages express some meanings more readily/often than others.
!32
NLP challenges
• Data issues:
!33
NLP challenges
!34
NLP challenges
!35
NLP challenges
!36
NLP framework
!37
Language Understanding
!38
Probabilistic Language Models
!39
Language Understanding
!40
Recall
(Probability 101)
!41
Recall
(Probability 101)
Chain Rule:
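Written out:

    P(w_1, \dots, w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2)\cdots P(w_n \mid w_1, \dots, w_{n-1})
                       = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})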
!42
Probabilistic Language Models
!43
N-gram Language Model
For Unigram:
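Written out, the general n-gram approximation conditions each word only on the previous n-1 words, and the unigram model drops the context entirely:

    P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})   % n-gram
    P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i)                                  % unigram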
!44
N-gram Example
For Unigram:
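For example, maximum-likelihood unigram estimates over a toy corpus (the corpus and numbers here are illustrative, not the slide's):

    from collections import Counter

    corpus = "the cat sat on the hat the dog ate the cat and the hat".split()
    counts = Counter(corpus)
    total = sum(counts.values())     # 14 tokens
    for w in ["the", "cat", "dog"]:
        print(f"P({w}) = {counts[w]}/{total} = {counts[w]/total:.3f}")
    # P(the) = 5/14 = 0.357, P(cat) = 2/14 = 0.143, P(dog) = 1/14 = 0.071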
!45
Effects of n in performance
!46
Evaluation: How good is a language
model?
• Intrinsic evaluation: test the trained model on a held-out test collection
– The more precisely a model can predict the words, the better the model
• Extrinsic evaluation: plug the model into a downstream task and measure how much it helps
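A standard intrinsic measure is perplexity on the test set (lower is better):

    \mathrm{PP}(w_1, \dots, w_N) = P(w_1, \dots, w_N)^{-1/N}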
!47
Parametric Language Model
!48
Parametric Language Model
!49
Embedding
• Embedding: a mapping from a space with one dimension per linguistic unit (character, morpheme, word, phrase, paragraph, sentence, document) to a continuous vector space of much lower dimension.
(e.g., the word “good” is mapped to a dense, low-dimensional vector)
!50
One hot encoding
• Naive and simple word embedding technique: Map each word to a unique ID
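A minimal one-hot encoder over a toy vocabulary (illustrative only, using NumPy):

    import numpy as np

    vocab = ["the", "cat", "sat", "on", "hat", "dog", "ate", "and"]
    word_to_id = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        vec = np.zeros(len(vocab))
        vec[word_to_id[word]] = 1.0
        return vec

    print(one_hot("cat"))   # [0. 1. 0. 0. 0. 0. 0. 0.]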
!51
One hot encoding
!52
One hot encoding
• Cons: the size of the input vector scales with the size of the vocabulary, and the vocabulary size must be pre-determined.
• Computationally expensive: a large input vector results in far too many parameters to learn.
• “Out-of-Vocabulary” (OOV) problem: how would you handle unseen words in the test set? (One solution is to have an “UNKNOWN” symbol that represents low-frequency or unseen words.)
!53
Bag of words
• Vocabulary: the set of all words in the corpus, e.g., {the, cat, sat, on, hat, dog, ate, and}
• Documents: words in each document w.r.t. the vocabulary, with multiplicity; e.g., for Sentence 1 = “the cat sat on the hat” and Sentence 2 = “the dog ate the cat and the hat”:
Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Sentence 2: { 3, 1, 0, 0, 1, 1, 1, 1 }
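The same two sentences vectorized with scikit-learn's CountVectorizer (the vectorizer sorts its vocabulary alphabetically, so the column order differs from the slide's):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the hat",
            "the dog ate the cat and the hat"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)
    print(vectorizer.get_feature_names_out())   # ['and' 'ate' 'cat' 'dog' 'hat' 'on' 'sat' 'the']
    print(X.toarray())                          # [[0 0 1 0 1 1 1 2]
                                                #  [1 1 1 1 1 0 0 3]]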
!54
Bag of words
• Pros:
• Cons:
• Orderless
!55
TF-IDF
• Raw counts are problematic: very frequent words dominate every document, so raw counts are not informative
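One common form of the weighting (the exact variant used may differ):

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}

where tf(t, d) is the count of term t in document d, df(t) is the number of documents containing t, and N is the total number of documents.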
!56
TF-IDF
!57
TF-IDF
• Pros:
• Easy to compute
• Has some basic metric to extract the most descriptive terms in a document
• Thus, can easily compute the similarity between 2 documents using it, e.g., using cosine
similarity
• Cons:
• Based on the bag-of-words (BoW) model, therefore it does not capture position in text,
semantics, co-occurrences in different documents, etc. TF-IDF is only useful as a lexical level
feature.
• Orderless
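The same toy documents weighted with scikit-learn's TfidfVectorizer (sklearn applies a smoothed IDF and L2 normalization, so the values differ slightly from the plain formula above):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the hat",
            "the dog ate the cat and the hat"]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)
    print(vectorizer.get_feature_names_out())
    print(X.toarray().round(2))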
!58
N-gram
For bigram:
Vocab = { the cat, cat sat, sat on, on the, the hat, the dog, dog ate, ate the, cat
and, and the}
Sentence 1: { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
Sentence 2 : { 1, 0, 0, 0, 0, 1, 1, 1, 1, 1}
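A comparable bigram feature matrix can be built by setting ngram_range on CountVectorizer (bigram columns come out alphabetically, not in the slide's order):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the hat",
            "the dog ate the cat and the hat"]
    bigram_vec = CountVectorizer(ngram_range=(2, 2))
    X = bigram_vec.fit_transform(docs)
    print(bigram_vec.get_feature_names_out())   # 'and the', 'ate the', 'cat and', ...
    print(X.toarray())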
!59
N-gram
• Pros:
• Cons:
• Data sparsity
!60
Data sparsity
!61
N-gram
• Pros:
• Cons:
• Data sparsity
!62
False independence assumption
!63
N-gram
• Pros:
• Cons:
• Data sparsity
• False independence assumption
!64
The Distributional Hypothesis
• The Distributional Hypothesis: words that occur in the same contexts tend
to have similar meanings (Harris, 1954)
!65
The Distributional Hypothesis
!66
Learning Dense embeddings
Deerwester, Dumais, Landauer, Furnas, and Harshman, Indexing by latent semantic analysis, JASIS, 1990.
Pennington, Socher, and Manning, GloVe: Global Vectors for Word Representation, EMNLP, 2014.
Mikolov, Sutskever, Chen, Corrado, and Dean, Distributed representations of words and phrases and their compositionality, NIPS, 2013.
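A small word2vec sketch with gensim (assumes gensim ≥ 4.0; the toy corpus is far too small for meaningful vectors and only illustrates the API):

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat", "on", "the", "hat"],
                 ["the", "dog", "ate", "the", "cat", "and", "the", "hat"]]
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
    print(model.wv["cat"][:5])            # first 5 dimensions of the "cat" vector
    print(model.wv.most_similar("cat"))   # nearest neighbours in the embedding space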
!67