
NLP Part 1


Natural Language Processing
IFT6758 - Data Science

Sources:
http://demo.clab.cs.cmu.edu/NLP/

http://u.cs.biu.ac.il/~89-680/

And many more …


Announcements

• HM#1 grades are accessible in Gradescope!

• All the mid-term evaluation scores are on the scoreboard!

• HM#2 grades will be out on Gradescope on Thursday!

• Mid-term presentation grades will be announced on Gradescope on Thursday!

• Mid-term exam grades will be announced next Monday!

• Crash course on DL will be next Thursday!


!2
What is Natural Language Processing (NLP)?

• Automating the analysis, generation, and acquisition of human ("natural") language

• Analysis (or "understanding" or "processing" ...): input is language, output is some representation that supports useful action

• Generation: input is that representation, output is language

• Acquisition: obtaining the representation and necessary algorithms, from knowledge and data

!3
Applications of NLP
Supervised NLP Unsupervised NLP

!4
Machine translation

!5
Question answering

credit: ifunny.com

!6
Dialog system

!7
Sentiment/opinion Analysis

!8
Text classification

www.wired.com

!9
NLP Key tasks

NLP language model

!10
Why is NLP hard?

Representation

• The mappings between levels are extremely complex.

• Appropriateness of a representation depends on the application.

!11
Why is NLP hard?

• Morphology: Analysis of words into meaningful components

• Spectrum of complexity across languages:

• Analytic or isolating languages (e.g., English, Chinese)

• Synthetic languages (e.g., Finnish, Turkish, Hebrew)

!12
Tokenizing

• The most common task in NLP is tokenization.

• Tokenization is the process of breaking a document down into words, punctuation marks, numeric digits, etc.
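A minimal tokenization sketch with NLTK (assuming the punkt tokenizer models have been downloaded):

```python
import nltk
nltk.download("punkt")  # one-time download of the tokenizer models
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Tokenization breaks a document into words, punctuation marks, and digits. It is usually the first step."
print(sent_tokenize(text))   # list of sentences
print(word_tokenize(text))   # list of word/punctuation tokens
```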

!13
Text Normalization

• Stemming and lemmatization are text normalization (sometimes called word normalization) techniques in Natural Language Processing that are used to prepare text, words, and documents for further processing.

• The languages we speak and write are made up of many words that are often derived from one another.

• A language whose words change form as their use in speech changes is called an inflected language. The degree of inflection may be higher or lower in a given language.

!14
Stemming

• The stem (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as -ed, -ize, -s, de-, mis-. Stemming a word or sentence may therefore produce tokens that are not actual words. Stems are created by removing the suffixes or prefixes used with a word.

• E.g.,

• Python NLTK provides not only two English stemmers, PorterStemmer and LancasterStemmer, but also many non-English stemmers as part of SnowballStemmers, ISRIStemmer, and RSLPStemmer.
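A short sketch comparing NLTK's two English stemmers; the two algorithms apply different suffix-stripping rules, so their outputs can differ:

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

for word in ["running", "studies", "organization", "misunderstanding"]:
    # Lancaster is generally more aggressive than Porter
    print(word, "->", porter.stem(word), "/", lancaster.stem(word))
```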

!15
Lemmatization

• Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization the root word is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words.

• E.g., runs, running, and ran are all forms of the word run, therefore run is the lemma of all these words.

• Because lemmatization returns an actual word of the language, it is used where it is necessary to get valid words.
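A minimal lemmatization sketch with NLTK's WordNetLemmatizer (assuming the WordNet corpus has been downloaded); note that the part of speech has to be supplied to get the correct lemma:

```python
import nltk
nltk.download("wordnet")  # one-time download of the WordNet corpus
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # -> "run"
print(lemmatizer.lemmatize("ran", pos="v"))      # -> "run"
print(lemmatizer.lemmatize("running"))           # POS defaults to noun, so "running" is returned unchanged
```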

!16
Stemming vs. Lemmatization

• You may be asking yourself: when should I use stemming and when should I use lemmatization?

• Stemming and lemmatization both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a lemma is an actual word of the language.

• Stemming follows an algorithm with a fixed sequence of steps to perform on each word, which makes it faster. In lemmatization, you use the WordNet corpus (and a stop-word corpus as well) to produce the lemma, which makes it slower than stemming. You also have to specify the part of speech to obtain the correct lemma.

!17
Example (Tokenizing and Stemming)

!18
Stop words

• Stop words are words that do not carry significant meaning on their own. They are very common words in a language (e.g., a, an, the in English; 的, 了 in Chinese; え, も in Japanese). They do not help with most NLP problems such as semantic analysis, classification, etc.

• Usually, these words are filtered out from text because they add a vast amount of uninformative content.

• Each language has its own list of stop words to use; e.g., words that are commonly used in English are as, the, be, are.

• In NLTK, you can use the pre-defined stop words, or you can use lists defined by other parties such as Stanford NLP and Rank NL.

Stanford NLP: https://github.com/stanfordnlp/CoreNLP/blob/master/data/edu/stanford/nlp/patterns/surface/stopwords.txt

Rank NL: https://www.ranks.nl/stopwords

jieba: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
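A short sketch of removing NLTK's pre-defined English stop words from a tokenized sentence:

```python
import nltk
nltk.download("stopwords")
nltk.download("punkt")
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("The dog ate the cat and the hat")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # stop words such as "the" and "and" are dropped
```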

!19
Why is NLP hard?

• Lexemes: Normalize and disambiguate words

• Words with multiple meanings: bank, mean (extra challenge: domain-specific meanings)

• Multi-word expressions: e.g., make ... decision, take out, make up, …

• For English, part-of-speech tagging is one very common kind of lexical analysis

• For others: super-sense tagging, various forms of word sense disambiguation, syntactic "supertags," …

!20
Part of Speech Tagging

• "Part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context — i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph."

• A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
!21
POS algorithms

• POS-tagging algorithms fall into two distinctive groups:

  Rule-based POS taggers

  Stochastic POS taggers

• Automatic part-of-speech tagging is an area of natural language processing where statistical techniques have been more successful than rule-based methods.

• Example of a rule: if an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective.

• Defining a set of rules manually is an extremely cumbersome process and is not scalable at all, so we need some automatic way of doing this.

• Automatic approaches: Brill's tagger is a rule-based tagger that goes through the training data and finds the set of tagging rules that best define the data and minimize POS-tagging errors. A Hidden Markov Model can be used as a simple stochastic POS-tagging approach.
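A minimal POS-tagging sketch using NLTK's off-the-shelf statistical tagger (assuming the tagger model has been downloaded):

```python
import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")  # NLTK's default statistical POS tagger

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))  # list of (token, POS tag) pairs, e.g. ('fox', 'NN')
```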

!22
NER

• Named entity recognition (NER), also known as entity identification (EI) or entity extraction, is the task that locates and classifies atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.

<ENAMEX TYPE="PERSON">John</ENAMEX> sold <NUMEX TYPE="QUANTITY">5</NUMEX> companies in <TIMEX TYPE="DATE">2002</TIMEX>.
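A short NER sketch using spaCy (an assumption: the slides do not prescribe a library; the small English model must first be installed with `python -m spacy download en_core_web_sm`):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John sold 5 companies in 2002.")
for ent in doc.ents:
    # each entity span carries a predicted label, e.g. ('John', 'PERSON'), ('2002', 'DATE')
    print(ent.text, ent.label_)
```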

!23
Example (POS/NER tagging)

!24
Why is NLP hard?

• Syntax: Transform a sequence of symbols into a hierarchical or compositional structure.

• Closely related to linguistic theories about what makes some sentences well-formed and others not.

• Ambiguities explode combinatorially
!25
Syntactic parsing (Constituency)

• Constituent-based grammars are used to analyze and determine the constituents of a sentence.

• These grammars can be used to model or represent the internal structure of sentences in terms of a hierarchically ordered structure of their constituents. Each word usually belongs to a specific lexical category and forms the head word of different phrases. These phrases are formed based on rules called phrase structure rules.

!26
Dependency parsing

• In dependency parsing, we try to use dependency-based grammars to analyze and infer both structure and semantic dependencies and relationships between tokens in a sentence.

• The word that has no dependency is called the root of the sentence.

• The verb is taken as the root of the sentence in most cases. All the other words are directly or indirectly linked to the root verb using links, which are the dependencies.

https://nlp.stanford.edu/software/dependencies_manual.pdf
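A minimal dependency-parsing sketch, again using spaCy as an illustrative (assumed) library:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The police officer detained the suspect at the crime scene.")
for token in doc:
    # each token points to its head; the root points to itself with dep_ == "ROOT"
    print(token.text, token.dep_, "->", token.head.text)
```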

!27
Why is NLP hard?

• Semantics: Mapping of natural language sentences into domain representations.

• E.g., a robot command language, a database query, or an expression in a formal logic.

• Scope ambiguities:

  a. Everyone loves SOMEone.
  b. EVERYone loves someone.

• Going beyond specific domains is a goal of Artificial Intelligence
!28
Semantic analysis

• Semantic role labeling is the process that assigns labels to words or phrases in a sentence that indicate their semantic role in the sentence, such as that of an agent, goal, or result.

• The suspect was detained by the police officer at the crime scene.
• The police officer detained the suspect at the scene of the crime.

!29
Why is NLP hard?

• Pragmatics: Study of the ways in which context contributes to meaning

• Any non-local meaning phenomena

• Discourse:

• Structures and effects in related sequences of sentences

• Texts, dialogues, multi-party conversations
!30
Co-reference resolution

• Coreference resolution is the task of finding all expressions that refer to the same entity in a text.

• It is an important step for many higher-level NLP tasks that involve natural language understanding, such as document summarization, question answering, and information extraction.

Christopher Robin is alive and well. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book.
!31
Complexity of NLP

• Linguistic representations are theorized constructs -> we cannot observe them directly.

• Data: Input is likely to be noisy.

• Ambiguity: Each string may have many possible interpretations at every level. The correct resolution of the ambiguity will depend on the intended meaning, which is often inferable from context.

• People are good at linguistic ambiguity resolution, but computers are not so good at it:

  Q: How do we represent sets of possible alternatives?

  Q: How do we represent context?

• Variability: There are many ways to express the same meaning, and immeasurably many meanings to express.

• Lots of words/phrases. Each level interacts with the others. There is tremendous diversity in human languages. Languages express the same kind of meaning in different ways. Some languages express some meanings more readily or more often than others.
!32
NLP challenges

• Data issues:

• A lot of data: In some cases, we deal with huge amounts of data.

  Need to come up with models that can process a lot of data efficiently.

• Lack of data: Many problems in NLP suffer from a lack of data:

• Non-standard platforms (code-switching)

• Expensive annotation (word-sense disambiguation, named-entity recognition)

• Need to use methods to overcome this challenge (semi-supervised learning, multi-task learning, …)

!33
NLP challenges

• Ambiguity challenge: Polysemy: one word, many meanings

!34
NLP challenges

• Ambiguity challenge: Syntactic Ambiguity: same words, many meanings

!35
NLP challenges

• Variability: different words, same meaning

!36
NLP framework

NLP language model

!37
Language Understanding

• It's all about how likely a sentence is...

!38
Probabilistic Language Models

• Goal: Compute the probability of a sentence or sequence of words

• Related task: probability of an upcoming word

• A model that computes either of the above is called a language model.
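In symbols, a standard formulation of these two quantities (reconstructed here, since the slide's own formulas did not survive extraction):

```latex
% probability of a whole word sequence
P(W) = P(w_1, w_2, \ldots, w_n)

% probability of an upcoming word given its history
P(w_n \mid w_1, w_2, \ldots, w_{n-1})
```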

!39
Language Understanding

!40
Recall (Probability 101)

!41
Recall (Probability 101)

Chain Rule:
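The chain rule in its standard form (reconstructed, as the slide's equation did not survive extraction):

```latex
P(w_1, w_2, \ldots, w_n)
  = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})
  = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```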

!42
Probabilistic Language Models

!43
N-gram Language Model

For Unigram:
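A standard statement of the n-gram approximation and its unigram special case (reconstructed; the slide's own equations did not survive extraction):

```latex
% n-gram model: condition each word on only the previous n-1 words
P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})

% unigram model (n = 1): words are treated as independent
P(w_1, \ldots, w_N) \approx \prod_{i=1}^{N} P(w_i),
\qquad P(w_i) = \frac{\operatorname{count}(w_i)}{\sum_{j} \operatorname{count}(w_j)}
```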

!44
N-gram Example
For Unigram:

!45
Effects of n in performance

!46
Evaluation: How good is a language model?

• Extrinsic evaluation: Test a trained model on a test collection

  – Try to predict each word

  – The more precisely a model can predict the words, the better the model

• Intrinsic evaluation: Perplexity

  Perplexity is the inverse probability of the test set, normalized by the number of words:

  – Given P(w_i) and a test text of length N

  – Perplexity = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i)}

  – The lower the better!
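A tiny numeric sketch of the perplexity formula (the per-word probabilities below are made-up values for illustration):

```python
import math

# probabilities P(w_i) that some language model assigns to a 4-word test text (assumed values)
probs = [0.1, 0.2, 0.05, 0.1]
N = len(probs)

log2_sum = sum(math.log2(p) for p in probs)
perplexity = 2 ** (-log2_sum / N)   # lower is better
print(perplexity)
```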

!47
Parametric Language Model

!48
Parametric Language Model

!49
Embedding

• Embedding: A mapping from a space with one dimension per linguistic unit (character, morpheme, word, phrase, paragraph, sentence, document) to a continuous vector space of much lower dimension.

• Vector representation: the "meaning" of a linguistic unit is represented by a vector of real numbers.

• E.g., the word good is mapped to such a vector.

!50
One hot encoding

• Naive and simple word embedding technique: Map each word to a unique ID

• Typical vocabulary sizes will vary between 10k and 250k.

!51
One hot encoding

• Use the word ID to get a basic representation of the word via one-hot encoding of the ID.

• The one-hot vector of an ID is a vector filled with 0s, except for a 1 at the position associated with the ID. E.g., for vocabulary size D = 10, the one-hot vector of a word (w) with ID = 4 is

  e(w) = [ 0 0 0 1 0 0 0 0 0 0 ]

• A one-hot encoding makes no assumption about word similarity: all words are equally similar to/different from each other.
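A minimal one-hot encoding sketch in NumPy (the tiny vocabulary is made up for illustration):

```python
import numpy as np

# toy vocabulary: word -> ID (assumed example)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "hat": 4}

def one_hot(word: str, vocab: dict) -> np.ndarray:
    """Return a vector of zeros with a single 1 at the word's ID."""
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

print(one_hot("on", vocab))  # [0. 0. 0. 1. 0.]
```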

!52
One hot encoding

• Pros: Quick and simple

• Cons: Size of input vector scales with size of vocabulary. Must pre-determine vocabulary size.

• Cannot scale to large or infinite vocabularies (Zipf's law!)

• Computationally expensive: a large input vector results in far too many parameters to learn.

• "Out-of-vocabulary" (OOV) problem: How would you handle unseen words in the test set? (One solution is to have an "UNKNOWN" symbol that represents low-frequency or unseen words.)

• No relationship between words: each word is an independent unit vector

!53
Bag of words

• Bag of words (BoW): a document is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

• Vocabulary: the set of all the words in the corpus

  Document: the words in the document w.r.t. the vocabulary, with multiplicity

Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat"

Vocab = { the, cat, sat, on, hat, dog, ate, and }

Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Sentence 2: { 3, 1, 0, 0, 1, 1, 1, 1 }
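A quick bag-of-words sketch with scikit-learn (an assumption: the slides do not prescribe a library; note that CountVectorizer orders its vocabulary alphabetically, so the columns differ from the order shown above, and get_feature_names_out requires scikit-learn >= 1.0):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the hat",
        "The dog ate the cat and the hat"]

vectorizer = CountVectorizer()          # lowercases and tokenizes by default
X = vectorizer.fit_transform(docs)      # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # alphabetically ordered vocabulary
print(X.toarray())                         # per-document word counts
```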

!54
Bag of words

• Pros:

• Quick and simple

• Cons:

• Bag of words model is very high-dimensional and sparse

• Cannot capture semantics or morphology

• Orderless

!55
TF-IDF

• Raw counts are problematic: frequent words will dominate most documents -> not informative.

• Importance increases proportionally to the number of times a word appears in the document, but is inversely proportional to the frequency of the word in the corpus.

• TF-IDF (for term (t) – document (d)):

  TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

  IDF(t) = log (Total number of documents / Number of documents with term t in it)

!56
TF-IDF

• Document D1 contains 100 words.

• cat appears 3 times in D1

• TF(cat) = 3 / 100 = 0.03

• The corpus contains 10 million documents

• cat appears in 1,000 documents

• IDF(cat) = log (10,000,000 / 1,000) = 4

• TF-IDF(cat) = 0.03 * 4 = 0.12
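The same computation in a few lines of Python (using the base-10 logarithm, which matches the slide's IDF of 4):

```python
import math

tf = 3 / 100                            # term frequency of "cat" in D1 -> 0.03
idf = math.log10(10_000_000 / 1_000)    # inverse document frequency -> 4.0
tfidf = tf * idf                        # -> 0.12
print(tf, idf, tfidf)
```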

!57
TF-IDF

• Pros:

• Easy to compute

• Has some basic metric to extract the most descriptive terms in a document

• Thus, can easily compute the similarity between 2 documents using it, e.g., using cosine
similarity

• Cons:

• Based on the bag-of-words (BoW) model, therefore it does not capture position in text,
semantics, co-occurrences in different documents, etc. TF-IDF is only useful as a lexical level
feature.

• Cannot capture semantics (unlike word embeddings)

• Orderless

!58
N-gram

• Vocab = the set of all n-grams in the corpus

  Document = the n-grams in the document w.r.t. the vocabulary, with multiplicity

For bigrams:

Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat"

Vocab = { the cat, cat sat, sat on, on the, the hat, the dog, dog ate, ate the, cat and, and the }

Sentence 1: { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0 }
Sentence 2: { 1, 0, 0, 0, 1, 1, 1, 1, 1, 1 }
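The same bigram features with scikit-learn (same caveat as before: the columns come out in alphabetical order):

```python
from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))   # bigrams only
X = bigram_vectorizer.fit_transform(["The cat sat on the hat",
                                     "The dog ate the cat and the hat"])

print(bigram_vectorizer.get_feature_names_out())
print(X.toarray())
```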

!59
N-gram

• Pros:

• Incorporates order of words

• Simple and quick

• Cons:

• Very large vocabulary

• Data sparsity

!60
Data sparsity

!61
N-gram

• Pros:

• Incorporates order of words

• Simple and quick

• Cons:

• Very large vocabulary

• Data sparsity

• False independence assumption

!62
False independence assumption

!63
N-gram

• Pros:

• Incorporates order of words

• Simple and quick

• Cons:

• Very large vocabulary

• Data sparsity

• False independence assumption

• Cannot capture syntactic/semantic similarity

!64
The Distributional Hypothesis

• The Distributional Hypothesis: words that occur in the same contexts tend
to have similar meanings (Harris, 1954)

• “You shall know a word by the company it keeps” (Firth, 1957)

!65
The Distributional Hypothesis

!66
Learning Dense embeddings

Deerwester, Dumais, Landauer, Furnas, and Harshman, Indexing by latent semantic analysis, JASIS, 1990.
Pennington, Socher, and Manning, GloVe: Global Vectors for Word Representation, EMNLP, 2014.
Mikolov, Sutskever, Chen, Corrado, and Dean, Distributed representations of words and phrases and their compositionality, NIPS, 2013.
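A minimal sketch of training dense word vectors in the spirit of word2vec, using gensim (an assumption: the slides do not prescribe a library; gensim 4.x parameter names are used, and the toy corpus is far too small to learn meaningful vectors):

```python
from gensim.models import Word2Vec

# toy tokenized corpus (illustrative only)
sentences = [["the", "cat", "sat", "on", "the", "hat"],
             ["the", "dog", "ate", "the", "cat", "and", "the", "hat"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram

print(model.wv["cat"])               # 50-dimensional dense vector for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbours by cosine similarity
```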

!67
