A TUTORIAL ON LINGUISTIC DATA ANALYSIS

PYTHON FOR LINGUISTIC DATA ANALYSIS
RUTU MULKAR-MEHTA, DATA SCIENTIST
@RUTUMULKAR
ME@RUTUMULKAR.COM
HTTP://RUTUMULKAR.COM
WHO AM I?
o Data Scientist at Moz
o Background:
o PhD in Natural Language Processing
o Computer Science: BS, MS, PhD
o Worked on:
o IBM Watson
o NLP in Healthcare
o NLP for SEO (Search Engine Optimization)
o Other Stuff: Sentiment Analysis, Question Answering, Natural Language Understanding, and more
GOALS
o Highly interactive
o Stop me to ask questions
o No question is silly
GOALS
o Code as you go
o The end goal is for
YOU to learn
TOOLS TO INSTALL
o Python Installation:
https://www.python.org/downloads/
o IPython Installation:
http://ipython.org/install.html
(sudo) pip install ipython
o Fork/Download this Repo:
https://github.com/rutum/pynlp
HANDY PYTHON TOOLS
o Requests:
sudo pip install requests
o BeautifulSoup:
sudo pip install beautifulsoup4
o Gensim:
sudo pip install gensim
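As a quick sanity check that the installs worked, here is a minimal sketch that fetches a page with requests and parses out its title with BeautifulSoup (the URL is just an example):

    # Fetch a web page with requests and parse it with BeautifulSoup.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get("http://rutumulkar.com")  # any public page works
    soup = BeautifulSoup(response.text, "html.parser")
    print(soup.title.string)  # the text of the page's <title> tag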
QUICK POLL
o Familiarity with python?
o Familiarity with NLP?
o Familiarity with Machine Learning?
o What do you want to do with NLP?
AN INTRODUCTION
SENTIMENT ANALYSIS
Product Reviews – Kindle Paperwhite
o Sharp screen resolution
o Low battery life
TEXT SUMMARIZATION
MACHINE TRANSLATION
QUESTION ANSWERING
TOPICS WE WILL BE COVERING TODAY
TALK OVERVIEW
OUTLINE
You have just been offered a job at an
innovative new startup in Seattle
called “NEW TECH STORIES INC.”
WE WANT TO KNOW…
FAIRY GODMOTHER SAYS
WHAT SHOULD WE EXPECT TO SEE
o Suggestions? (Audience Participation)
o data
o python
o machine learning
o nlp analysis
o tutorials
o microsoft
o anaconda
o code
WHERE DO WE START?
o Suggestions? (Audience Participation)
o frequency
o TF*IDF
o LSI
o Word2Vec
o categories
EXPERIMENTS WE WILL DO
o Word Frequencies
o TF*IDF : word importance
o Focus on Noun Phrases only?
WORD FREQUENCIES
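A minimal sketch of the word-frequency experiment, assuming the corpus is a plain-text file of talk descriptions (the filename is illustrative):

    # Count word frequencies in a plain-text corpus.
    import re
    from collections import Counter

    with open("talks.txt") as f:  # illustrative filename
        text = f.read().lower()

    tokens = re.findall(r"[a-z']+", text)  # crude tokenizer: runs of letters
    word_counts = Counter(tokens)

    # The top of this list is dominated by words like "the" and "a".
    for word, count in word_counts.most_common(20):
        print(word, count)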
FINDINGS
o Suggestions? (Audience Participation)
WHAT NEXT?
WHAT ARE STOP WORDS
o High-frequency function words (e.g. “the”, “a”, “of”, “is”) that carry
little topical content and are commonly filtered out before analysis
EXPERIMENTS REMOVING STOP WORDS
o StopWords file:
o stopwords.txt
o Load Stopwords (a stand-in sketch follows):
o stopwords = nlp_tools.load_stopwords(<filename>)
o Check if token is stopword:
o nlp_tools.is_stopword(token)
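The repo's nlp_tools module provides the helpers above; a minimal stand-in, just to show the idea, might look like this:

    # Stand-in versions of the stop word helpers (illustrative, not the
    # repo's actual implementation).
    def load_stopwords(filename):
        # stopwords.txt: one stop word per line
        with open(filename) as f:
            return set(line.strip().lower() for line in f if line.strip())

    stopwords = load_stopwords("stopwords.txt")

    def is_stopword(token):
        return token.lower() in stopwords

    # Filter stop words out before counting frequencies.
    tokens = ["the", "data", "is", "python"]  # e.g. output of the earlier sketch
    content_tokens = [t for t in tokens if not is_stopword(t)]
    print(content_tokens)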
FINDINGS
o Suggestions? (Audience Participation)
NEXT STEPS
TF*IDF BASICS
o Raw term frequency suffers from a critical
problem: all terms are considered equally
important
o Intuitive example : word “a” – high frequency, but
occurs in all documents
o Scale down the term weights by the number of
documents they occur in – DF (Document
Frequency)
IDF: INVERSE DOCUMENT FREQUENCY
o How is the document frequency DF of a term used to scale its weight?
o N = total number of documents
o t = current word or token
IDF_t = log(N / DF_t)
TF*IDF FORMULA
o TF = Term Frequency
o IDF = Inverse Document Frequency
TF*IDF_t,d = TF_t,d * IDF_t
EXPERIMENTS WITH TF*IDF
o Using TF*IDF
o Functions (a from-scratch sketch follows):
o nlp_tools.document_frequency(file_id, text, <df_dict>)
o nlp_tools.compute_tfidf(<wc_dict>, <df_dict>, count)
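The nlp_tools functions above handle the bookkeeping over the talk corpus; here is a minimal from-scratch sketch of the same computation (the toy documents are only to make it run):

    # Minimal TF*IDF over a list of tokenized documents.
    import math
    from collections import Counter

    documents = [["python", "nlp", "data"],
                 ["python", "web", "scraping"],
                 ["nlp", "machine", "learning", "nlp"]]

    N = len(documents)

    # DF: the number of documents each term appears in.
    df = Counter()
    for doc in documents:
        for term in set(doc):
            df[term] += 1

    # TF*IDF for each term in each document.
    for i, doc in enumerate(documents):
        tf = Counter(doc)
        tfidf = {t: tf[t] * math.log(N / df[t]) for t in tf}
        print(i, tfidf)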
FINDINGS
o Suggestions? (Audience Participation)
TF*IDF VS. TF
PARTS OF SPEECH
AUTOMATIC TAGGERS
o Almost all POS taggers use the Penn Treebank list of tags:
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
o Nouns: NN, NNS, NNP, NNPS
o Verbs: VB, VBD, VBG, VBN, VBP, VBZ
o Adjectives: JJ, JJR, JJS
o Adverbs: RB, RBR, RBS
o Prepositions: IN
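For example, NLTK's default tagger returns Penn Treebank tags (assuming nltk is installed; the tokenizer and tagger models need a one-time download):

    # POS tagging with NLTK's default tagger (Penn Treebank tag set).
    # One-time setup: nltk.download('punkt') and
    # nltk.download('averaged_perceptron_tagger')
    import nltk

    tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
    print(nltk.pos_tag(tokens))  # list of (word, tag) pairs, e.g. ('The', 'DT')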
HOW ARE POS TAGS EFFECTIVE?
POS TAGGING AND PARSING
OTHER LINGUISTIC FEATURES OF INTEREST
o The goal of both stemming and lemmatization is to reduce inflectional
forms, and sometimes derivationally related forms, of a word to a
common base form. E.g.:
o am, are, is → be
o car, cars, car’s → car
o Two approaches:
o Stemming
o Lemmatization
STEMMING AND LEMMATIZATION
o Stemming
o a crude heuristic process that chops off the ends of words
in the hope of achieving this goal
o Lemmatization
o uses a vocabulary and morphological analysis of words
o returns the base or dictionary form of a word, known as the lemma
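A minimal NLTK sketch contrasting the two (assuming nltk is installed; the lemmatizer needs nltk.download('wordnet')):

    # Stemming vs. lemmatization in NLTK.
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("achieving"))            # 'achiev' - crude chopping
    print(lemmatizer.lemmatize("cars"))         # 'car'    - dictionary form
    print(lemmatizer.lemmatize("am", pos="v"))  # 'be'     - needs the POS hint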
EXPERIMENTS WITH NPS (NOUN PHRASES)
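One minimal way to pull out noun phrases, assuming NLTK, is a regular-expression chunker over POS tags (the grammar below is a common textbook pattern, not the repo's):

    # Extract simple noun phrases with an NLTK regexp chunker.
    import nltk

    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"  # optional determiner, adjectives, nouns
    chunker = nltk.RegexpParser(grammar)

    tagged = nltk.pos_tag(nltk.word_tokenize("The Kindle has a sharp screen"))
    tree = chunker.parse(tagged)

    # Print every NP chunk the grammar found.
    for subtree in tree.subtrees():
        if subtree.label() == "NP":
            print(" ".join(word for word, tag in subtree.leaves()))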
RECAP
YOU JUST DID SOME
INFORMATION EXTRACTION
FROM TEXT!
OTHER WAYS FOR IE
colors such as red, blue and …
OTHER WAYS FOR IE
o cars such as toyota, honda and ?
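Patterns like "X such as A, B and C" (Hearst patterns) can be matched directly with a regular expression; a minimal sketch:

    # Minimal Hearst-pattern sketch: "<category> such as <members>".
    import re

    text = "I like colors such as red, blue and green."
    match = re.search(r"(\w+) such as (.+?)[.?]", text)
    if match:
        category, members = match.groups()
        members = [m.strip() for m in re.split(r",| and ", members)]
        print(category, "->", members)  # colors -> ['red', 'blue', 'green']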
OUTLINE
VERY COOL!
…one more thing…
When I type in this term, can you show me all the talks about it?
It will probably take you several weeks!
We should leave you alone…
THOUGHTS AND STRATEGIES?
o Suggestions? (Audience Participation)
o something based on TF*IDF
o Cosine Similarity
o Jaccard Index
o Bag of words
FAIRY GODMOTHER SAYS
INFORMATION RETRIEVAL BASICS
o Each document is represented as a vector of term counts
(e.g. d2 = [0, 7, 3, 1, 5, 3, 0, 0, 0])
VECTOR SPACE MODEL
DOT PRODUCT
o Quantifies the similarity between two documents
o Similarity = cosine similarity of their vector representations Vd1 and Vd2
o It compensates for the effect of document length
DOT PRODUCT
sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| * |V(d2)|)
INTUITION
Consider 3 vectors over the vocabulary
(data, python, nlp, theano, rnn, apple, banana, cup, elephant, tea):
V1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
V2 = [0, 0, 1, 0, 0, 1, 1, 0, 0, 0]
V3 = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
INTUITION
Computing similarity between V1 and V2:
V1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
V2 = [0, 0, 1, 0, 0, 1, 1, 0, 0, 0]
Numerator:
(1*0) + (1*0) + (1*1) + (1*0) + (0*0) + (0*1) + (0*1) + (0*0) + (0*0) + (0*0) = 1
Denominator:
sqrt(1² + 1² + 1² + 1²) * sqrt(1² + 1² + 1²) = 2 * 1.732 = 3.464
Similarity:
1 / 3.464 = 0.289
INTUITION
Computing similarity between V1 and V3:
V1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
V3 = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
Numerator:
(1*1) + (1*1) + (1*1) + (1*1) + (0*0) + (0*1) + (0*1) + (0*0) + (0*0) + (0*0) = 4
Denominator:
sqrt(1² + 1² + 1² + 1²) * sqrt(1² + 1² + 1² + 1² + 1² + 1²) = 2 * 2.449 = 4.899
Similarity:
4 / 4.899 = 0.8165
INTUITION
(Same 3 vectors as above.)
o Functions (a stand-in sketch follows):
o nlp_tools.build_search_model(<filename>)
o nlp_tools.search(<phrase>)
o nlp_tools.similarity(vector1, vector2)
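A minimal stand-in for the similarity function above (nlp_tools has the real one), which also verifies the arithmetic from the previous slides:

    # Cosine similarity between two term-count vectors.
    import math

    def cosine_similarity(v1, v2):
        dot = sum(a * b for a, b in zip(v1, v2))
        norm1 = math.sqrt(sum(a * a for a in v1))
        norm2 = math.sqrt(sum(b * b for b in v2))
        return dot / (norm1 * norm2)

    V1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    V2 = [0, 0, 1, 0, 0, 1, 1, 0, 0, 0]
    V3 = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

    print(cosine_similarity(V1, V2))  # ~0.289
    print(cosine_similarity(V1, V3))  # ~0.816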
FINDINGS
o Suggestions? (Audience Participation)
YOU JUST DID SOME
INFORMATION RETRIEVAL
FROM TEXT!
OUTLINE
WHOA!
This is awesome!
I need to learn more NLP technologies… FAST!
What is new in NLP?
WORD2VEC
o The word2vec tool takes a text corpus as input and
produces the word vectors as output.
o A simple way to investigate the learned
representations is to find the closest words for a
user-specified word.
KEYWORDS SIMILAR TO FRANCE
Word          Cosine similarity
spain         0.678515
belgium       0.665923
netherlands   0.652428
italy         0.633130
switzerland   0.622323
luxembourg    0.610033
portugal      0.577154
russia        0.571507
germany       0.563291
catalonia     0.534176
INTUITION BEHIND WORD2VEC
o Two Models:
o CBOW: predicts the probability of the target word given its context
o Skip Gram: predicts the probability of the context given the target word
[CBOW and Skip-gram model diagrams: slides by Kai Sasaki]
WORD2VEC DEMO
EXPERIMENTS WITH WORD2VEC
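A minimal gensim sketch (assuming gensim 4.x; the toy corpus here is only to make the snippet run - results like the "france" table require a large training corpus):

    # Train a tiny word2vec model with gensim and query similar words.
    from gensim.models import Word2Vec

    sentences = [["python", "nlp", "data", "analysis"],
                 ["machine", "learning", "with", "python"],
                 ["nlp", "and", "machine", "learning"]]

    # sg=0 trains CBOW; sg=1 trains skip-gram.
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

    print(model.wv.most_similar("python", topn=3))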
OUTLINE
CLOSING REMARKS
HOPE YOU ENJOYED LEARNING ABOUT NLP