A Tutorial On: Linguistic Data Analysis

This document provides an introduction to using Python for linguistic data analysis. It discusses the speaker's background and goals for the presentation, then outlines the topics to be covered, including information extraction, information retrieval, deep learning, and other areas of natural language processing. The presentation encourages participants to ask questions and works through code examples along the way to help attendees learn.


A TUTORIAL ON:
USING PYTHON FOR
LINGUISTIC DATA ANALYSIS

RUTU MULKAR-MEHTA, DATA SCIENTIST
@RUTUMULKAR
ME@RUTUMULKAR.COM
HTTP://RUTUMULKAR.COM
WHO AM I?
o Data Scientist at Moz
o Background:
o PhD in Natural Language Processing
o Computer Science: BS, MS, PhD
o Worked on:
o IBM Watson
o NLP in Healthcare
o NLP for SEO (Search Engine Optimization)
o Other Stuff: Sentiment Analysis, Question Answering,
Natural Language Understanding ++
2
GOALS

o Highly interactive
o Stop me to ask questions
o No question is silly

3
GOALS

o Code as you go
o The end goal is for YOU to learn

4
TOOLS TO INSTALL

o Python Installation:
https://www.python.org/downloads/
o IPython installation:
http://ipython.org/install.html
(sudo) pip install ipython
o Fork/Download this Repo:
https://github.com/rutum/pynlp

5
HANDY PYTHON TOOLS

o Requests:
sudo pip install requests
o BeautifulSoup:
sudo pip install beautifulsoup4
o Gensim:
sudo pip install gensim

6
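To sanity-check the installs, here is a minimal sketch (not part of the repo; the URL is a placeholder) that pulls a web page with Requests and extracts its visible text with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

# Placeholder URL – substitute any page you want to scrape for text
response = requests.get("https://seattle.pydata.org/schedule/")
soup = BeautifulSoup(response.text, "html.parser")

# Collapse the page to visible text, one block per line
text = soup.get_text(separator="\n", strip=True)
print(text[:500])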
QUICK POLL
o Familiarity with python?
• Quick Poll

7
QUICK POLL
o Familiarity with python?
o Familiarity with NLP?
• Quick Poll

8
QUICK POLL
o Familiarity with python?
o Familiarity with NLP?
o Familiarity with Machine Learning?
• Quick Poll

9
QUICK POLL
o Familiarity with python?
o Familiarity with NLP?
o Familiarity with Machine Learning?
o What do you want to do with NLP?
• Quick Poll

10
AN INTRODUCTION

SOME AREAS OF NLP

11
SENTIMENT ANALYSIS
Product Reviews – Kindle Paperwhite

Sharp screen resolution

Low battery life

12
TEXT SUMMARIZATION

13
MACHINE TRANSLATION

14
QUESTION ANSWERING

In 2011, Watson competed on Jeopardy! to become the first computer to defeat two human champions on the game show.
15
SIRI
• Information Extraction
• Topic Modeling
• Question Answering
• Text Classification
• Named Entity Recognition
• ….

16
17
TOPICS WE WILL BE COVERING TODAY

TALK OVERVIEW

18
OUTLINE

o Information Extraction: Extract “stuff” from unstructured text
o Information Retrieval: search across several documents
o Deep Learning: Converting words to semantic vectors

Peppered with some NLP “Folk Knowledge”, henceforth known as “Fairy Godmother”

19
You have just been offered a job at an innovative new startup in Seattle called “NEW TECH STORIES INC.”

You have accepted the position!

You are going to be working with linguistic data!

CONGRATULATIONS!!
20
WELCOME ABOARD!
… here, analyze this data ...
DATA ANALYSIS

Github repo: https://github.com/rutum/pynlp

Go to the folder “data collection”:

o data_file.txt
o 77 talks at PyData Seattle 2015

22
WE WANT TO KNOW…

o What are people talking about?

23
WE WANT TO KNOW…

o What are people talking about?
o What is the trend in python and data science these days in Seattle?

24
WE WANT TO KNOW…

o What are people talking about?
o What is the trend in python and data science these days in Seattle?

25
FAIRY GODMOTHER SAYS

ALWAYS KNOW HOW TO EVALUATE YOUR DATA

26
FAIRY GODMOTHER SAYS

ALWAYS KNOW HOW TO EVALUATE YOUR DATA


o What should we expect to find?
o If you know what to expect, you can inspect your results periodically and check if you are in line with your expectations

27
FAIRY GODMOTHER SAYS

ALWAYS KNOW HOW TO EVALUATE YOUR DATA


o What should we expect to find?
o If you know what to expect, you can inspect your results periodically and check if you are in line with your expectations
o How should we get there?
o What are the patterns that you see in your expected results? E.g.:
o Are all your expected “keywords” also keynote presentations?
o Are all your expected “keywords” scheduled in room 1?

28
WHAT SHOULD WE EXPECT TO SEE
o Suggestions? (Audience Participation)
o data
o python
o machine learning
o nlp analysis
o tutorials
o microsoft
o anaconda
o code
29
WHERE DO WE START?

o Suggestions? (Audience Participation)
o frequency
o TF*IDF
o LSI
o Word2Vec
o categories

30
EXPERIMENTS WE WILL DO

o Word Frequencies
o TF*IDF : word importance
o Focus on Noun Phrases only?

31
WORD FREQUENCIES

o Also known as TF – Term Frequency
o Most frequent concepts are trending concepts
o Functions:
o nlp_tools.word_count(text, <wc_dict>)
o nlp_tools.information_extraction(<filename>)

32
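The repo lists word_count and information_extraction in nlp_tools; the actual implementations may differ, but a minimal sketch of a word counter using collections.Counter could look like this (names and signature are assumptions based on the slide above):

import collections
import re

def word_count(text, wc_dict=None):
    # Tokenize crudely on word characters and count occurrences,
    # optionally continuing counts from an existing dictionary
    counts = collections.Counter(wc_dict or {})
    counts.update(re.findall(r"\w+", text.lower()))
    return counts

wc = word_count(open("data_file.txt").read())
print(wc.most_common(10))  # the most frequent terms = trending concepts?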
FINDINGS

o Suggestions? (Audience Participation)

33
WHAT NEXT?

o Frequencies alone are not enough. Words like “a” and “the” are the most common ones, but have the least impact
o Maybe we can remove stop words?

34
WHAT ARE STOP WORDS

o Extremely frequent words that don’t add any semantic value to the text

35
WHAT ARE STOP WORDS

o Extremely frequent words that don’t add any semantic value to the text
o Stop words are different for different domains, based on what the text is about

36
WHAT ARE STOP WORDS

o Extremely frequent words that don’t add any semantic value to the text
o Stop words are different for different domains, based on what the text is about
o Usually hand curated, based on the type of data we are looking at. E.g. in PyData-related talks, “python” and “data” are stopwords

37
EXPERIMENTS REMOVING STOP WORDS
o StopWords file:
o stopwords.txt
o Load Stopwords:
o nlp_tools.load_stopwords(<filename>)
o stopwords = load_stopwords(<filename>)
o Check if token is stopword:
o nlp_tools.is_stopword(token)

38
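Again, a sketch of what these helpers might look like (the repo’s actual code may differ; this reuses the wc counts from the word-frequency sketch above):

def load_stopwords(filename):
    # stopwords.txt: one stopword per line
    with open(filename) as f:
        return set(line.strip().lower() for line in f if line.strip())

stopwords = load_stopwords("stopwords.txt")

def is_stopword(token):
    return token.lower() in stopwords

# Drop stopwords from the word counts computed earlier
wc = {w: c for w, c in wc.items() if not is_stopword(w)}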
FINDINGS

o Suggestions? (Audience Participation)

39
NEXT STEPS

o We need a ratio of uncommon words that are just frequent enough

40
NEXT STEPS

o We need a ratio of uncommon words that are just frequent enough
o Let’s use TF*IDF

41
TF*IDF BASICS
o Raw term frequency suffers from a critical
problem: all terms are considered equally
important

42
TF*IDF BASICS
o Raw term frequency suffers from a critical
problem: all terms are considered equally
important
o Intuitive example: the word “a” – high frequency, but occurs in all documents

43
TF*IDF BASICS
o Raw term frequency suffers from a critical
problem: all terms are considered equally
important
o Intuitive example: the word “a” – high frequency, but occurs in all documents
o Scale down the term weights by the number of
documents they occur in – DF (Document
Frequency)

44
IDF: INVERSE DOCUMENT FREQUENCY
o How is the document frequency DF of a term used to scale its weight?
o N = total number of documents
o t = current word or token

IDF_t = log(N / DF_t)

45
TF*IDF FORMULA
o TF = Term Frequency
o IDF = Inverse Document Frequency

TF_t * IDF_t = TF_t * log(N / DF_t)

46
EXPERIMENTS WITH TF IDF

o Using TF*IDF
o Function:
o nlp_tools.document_frequency(file_id, text, <df_dict>)
o nlp_tools.compute_tfidf(<wc_dict>, <df_dict>, count)

47
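A sketch of the two steps behind the formula above (the repo’s signatures differ slightly, e.g. document_frequency takes a file_id and text; treat these names as assumptions):

import math

def document_frequency(docs):
    # docs: {file_id: list of tokens}; DF_t = number of documents containing t
    df = {}
    for tokens in docs.values():
        for t in set(tokens):
            df[t] = df.get(t, 0) + 1
    return df

def compute_tfidf(wc_dict, df_dict, count):
    # count = N, the total number of documents (an assumption about the
    # parameter's meaning); weight = TF_t * log(N / DF_t)
    return {t: tf * math.log(count / df_dict[t]) for t, tf in wc_dict.items()}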
FINDINGS

o (Audience Participation)

48
TF*IDF VS. TF (comparison of top-ranked terms)
49
PARTS OF SPEECH

o How about using POS tags?


o Nouns
o Verbs
o Adjectives
o Adverbs
o Prepositions

50
AUTOMATIC TAGGERS
o Almost all the POS taggers use the Penn-Treebank
list of tags
o https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

51
AUTOMATIC TAGGERS
o Almost all the POS taggers use the Penn-Treebank
list of tags
o https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
o Nouns : NN, NNS, NNP, NNPS
o Verbs: VB, VBD, VBG, VBN, VBP, VBZ
o Adjectives: JJ, JJR, JJS
o Adverbs: RB, RBR, RBS
o Prepositions: IN
52
HOW ARE POS TAGS EFFECTIVE?

Selected words based on POS Tag = Noun


We get candidates:
o code.org
o students
o computer
o science

53
HOW ARE POS TAGS EFFECTIVE?

Selected words based on POS Tag = Noun

We get candidates:
o code.org
o students
o computer
o science

We can post-process “computer” and “science” to the compound nominal: “Computer Science”

54
POS TAGGING AND PARSING

o Stanford Core NLP
o http://nlp.stanford.edu:8080/corenlp/
o NLTK
o Natural Language Toolkit
o You need to provide your own training data, and train models for NLTK to be effective

55
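That said, NLTK does ship a pretrained English tagger, so for quick experiments you can try POS tagging in a couple of lines (resource names as in recent NLTK releases; the tag output is illustrative):

import nltk
nltk.download("punkt")                       # tokenizer models
nltk.download("averaged_perceptron_tagger")  # pretrained English POS tagger

tokens = nltk.word_tokenize("Students learn computer science at code.org")
print(nltk.pos_tag(tokens))
# e.g. [('Students', 'NNS'), ('learn', 'VBP'), ('computer', 'NN'), ('science', 'NN'), ...]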
OTHER LINGUISTIC FEATURES OF INTEREST
o The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. E.g.:
o am, are, is → be
o car, cars, car’s → car
o Two approaches:
o Stemming
o Lemmatization

56
STEMMING AND LEMMATIZATION
o Stemming
o crude heuristic process
o chops off the ends of words
o hope of achieving this goal
o Lemmatization
o use of a vocabulary
o morphological analysis of words
o returns the base or dictionary form of a word
o base form is known as the lemma

57
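Both are available in NLTK; a small comparison of the two approaches (the WordNet download is needed for the lemmatizer):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("cars"))                  # 'car'   – suffix chopped off
print(stemmer.stem("studies"))               # 'studi' – crude: not a real word
print(lemmatizer.lemmatize("studies"))       # 'study' – dictionary base form
print(lemmatizer.lemmatize("are", pos="v"))  # 'be'    – needs the POS hint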
EXPERIMENTS WITH NPS

o Using NPs only
o Do you expect results to improve?

58
RECAP

o Tokenization (word tokens) + data cleanup


o Word Frequencies
o Word Vectors
o TF*IDF
o POS tags
o Lemmatization/Stemming

59
YOU JUST DID SOME INFORMATION EXTRACTION FROM TEXT!

60
OTHER WAYS FOR IE
colors such as red, blue and …

61
OTHER WAYS FOR IE
o cars such as toyota, honda and ?

62
OTHER WAYS FOR IE

Find different relations between 2 concepts:


Microsoft bought Farecast

63
OUTLINE

o Information Extraction: Extract key phrases and concepts from unstructured text
o Information Retrieval: search
o Deep Learning: Converting words to semantic vectors

64
VERY COOL!
..one more thing..
65
When I type in this term, can you show me
all the talks about it?
It will probably take you several weeks!
We should leave you alone..

66
THOUGHTS AND STRATEGIES?

o Suggestions? (Audience Participation)
o something based on TF*IDF
o Cosine Similarity
o Jaccard Index
o Bag of words

67
FAIRY GODMOTHER SAYS

Fancy algorithms will help you build products rapidly, but you need to know how they work to apply them well

o INFORMATION RETRIEVAL
o Vector Space Model using TF*IDF

68
INFORMATION RETRIEVAL BASICS

o Create a vector for each of the pages

d1  1 0 0 6 8 6 7 2 5
d2  0 7 3 1 5 3 0 0 0

o Find the similarity between the input phrase and the pages

phr 0 7 3 7 5 8 0 0 0

o The page with the highest similarity wins!

69
VECTOR SPACE MODEL

o Large sparse vectors of words in a page
o Each index in the vector represents a word

a ant car bug .. .. .. .. ..
d1 1 0 0 6 8 6 7 2 5

o The value of each index is the TF*IDF of that word

70
DOT PRODUCT
o Quantifies the similarity between two
documents
o Similarity = cosine similarity of their vector
representations Vd1 and Vd2
o It compensates for the effect of document
length

71
DOT PRODUCT

sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| · |V(d2)|)
72
INTUITION

1. Convert each document into a vector:

data python nlp theano rnn apple banana cup elephant tea

2. If a word exists in the document, set its value as the TF*IDF score of the word in the document
3. If the word does not exist in the document, set its value as zero

73
INTUITION
Consider 3 vectors:
1 data python nlp theano rnn apple banana cup elephant tea
1 1 1 1 0 0 0 0 0 0

2 data python nlp theano rnn apple banana cup elephant tea
0 0 1 0 0 1 1 0 0 0

3 data python nlp theano rnn apple banana cup elephant tea
1 1 1 1 0 1 1 0 0 0

Which vector is most similar to vector 1?


74
INTUITION
Computing similarity between V1 and V2:
1 data python nlp theano rnn apple banana cup elephant tea
1 1 1 1 0 0 0 0 0 0

2 data python nlp theano rnn apple banana cup elephant tea
0 0 1 0 0 1 1 0 0 0

75
INTUITION
Computing similarity between V1 and V2:
1 data python nlp theano rnn apple banana cup elephant tea
1 1 1 1 0 0 0 0 0 0

2 data python nlp theano rnn apple banana cup elephant tea
0 0 1 0 0 1 1 0 0 0
Numerator:
(1*0) + (1*0) + (1*1) + (1*0) + (0*0) + (0*1) + (0*1) + (0*0) + (0*0) + (0*0) = 1
Denominator:
sqrt(1² + 1² + 1² + 1²) * sqrt(1² + 1² + 1²) = 3.46
Similarity:

1/3.46 = 0.289

76
INTUITION
Computing similarity between V1 and V3:
1 data python nlp theano rnn apple banana cup elephant tea
1 1 1 1 0 0 0 0 0 0

3 data python nlp theano rnn apple banana cup elephant tea
1 1 1 1 0 1 1 0 0 0
Numerator:
(1*1) + (1*1) + (1*1) + (1*1) + (0*0) + (0*1) + (0*1) + (0*0) + (0*0) + (0*0) = 4
Denominator:
sqrt(1² + 1² + 1² + 1²) * sqrt(1² + 1² + 1² + 1² + 1² + 1²) = 4.899
Similarity:

4/4.899 = 0.8165

77
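A quick way to check the arithmetic above in plain Python (no repo code needed):

import math

def cosine(v1, v2):
    # cosine similarity = dot(v1, v2) / (|v1| * |v2|)
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

v1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
v2 = [0, 0, 1, 0, 0, 1, 1, 0, 0, 0]
v3 = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
print(cosine(v1, v2))  # ~0.289
print(cosine(v1, v3))  # ~0.816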
INTUITION
Consider 3 vectors:
1 data python nlp theano rnn apple banana cup elephant tea
1 1 1 1 0 0 0 0 0 0

2 data python nlp theano rnn apple banana cup elephant tea
0 0 1 0 0 1 1 0 0 0

3 data python nlp theano rnn apple banana cup elephant tea
1 1 1 1 0 1 1 0 0 0

The numbers tell us that V1 is most similar to V3 (0.8165 vs. 0.289)


78
EXPERIMENTS WITH SEARCH

o Functions:
o nlp_tools.build_search_model(<filename>)
o nlp_tools.search(<phrase>)
o nlp_tools.similarity(vector1, vector2)

79
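One possible shape for these functions, reusing tf-idf dicts as sparse vectors (a sketch under assumptions; the names mirror the slide but this is not the repo’s actual code):

import math

def similarity(vec1, vec2):
    # Cosine similarity over sparse {term: tfidf} vectors
    dot = sum(w * vec2.get(t, 0.0) for t, w in vec1.items())
    n1 = math.sqrt(sum(w * w for w in vec1.values()))
    n2 = math.sqrt(sum(w * w for w in vec2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def search(phrase_vec, doc_vecs):
    # doc_vecs: {doc_id: {term: tfidf}}; the page with the highest similarity wins
    return max(doc_vecs, key=lambda d: similarity(phrase_vec, doc_vecs[d]))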
FINDINGS

o Suggestions? (Audience Participation)

80
YOU JUST DID SOME INFORMATION RETRIEVAL FROM TEXT!

81
OUTLINE

o Information Extraction: Extract key phrases and concepts from unstructured text
o Information Retrieval: search
o Deep Learning: Converting words to semantic vectors

82
WHOA!
This is awesome!
83
I need to learn more NLP technologies… FAST!
What is new in NLP?

84
WORD2VEC

What if you could create a big model that knows meaning?

One that also lets you objectively compute:

King – Man + Woman = Queen

You can find how the keywords are related

85
WORD2VEC
o The word2vec tool takes a text corpus as input and
produces the word vectors as output.

86
WORD2VEC
o The word2vec tool takes a text corpus as input and
produces the word vectors as output.
o A simple way to investigate the learned
representations is to find the closest words for a
user-specified word.

87
KEYWORDS SIMILAR TO FRANCE
Word Cosine distance
spain 0.678515
belgium 0.665923
netherlands 0.652428
italy 0.633130
switzerland 0.622323
luxembourg 0.610033
portugal 0.577154
russia 0.571507
germany 0.563291
catalonia 0.534176

88
INTUITION BEHIND WORD2VEC
o Two Models:
o CBOW: predicting the probability of the target word given its context
o Skip Gram: predicting the probability of the context given the target word

89
Skip Gram Model (diagrams: slides by Kai Sasaki)

90–91

CBOW Model (diagrams: slides by Kai Sasaki)

92–93
WORD2VEC DEMO

o How to run the demo on your end
o Online Word2Vec demo

94
EXPERIMENTS WITH WORD2VEC

o Download and install gensim
o Train a model

95
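A minimal training sketch with gensim (gensim 4.x API; older releases used size instead of vector_size, and the line-per-talk whitespace tokenization here is a stand-in for real preprocessing):

from gensim.models import Word2Vec

# One talk description per line; naive whitespace tokenization
sentences = [line.lower().split() for line in open("data_file.txt")]

# sg=1 selects the skip-gram model, sg=0 (the default) selects CBOW
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1)

# Only works for words that appear at least min_count times in the corpus
print(model.wv.most_similar("python", topn=5))

# The King – Man + Woman = Queen style of query (needs a much larger corpus):
# model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)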
OUTLINE

o Information Extraction: Extract key phrases and concepts from unstructured text
o Information Retrieval: search
o Deep Learning: Converting words to semantic vectors

96
CLOSING REMARKS

o You already know:
o Information Extraction
o Information Retrieval
o Deep Learning (Word2Vec)
o NLP is very intuitive
o You need to know the answers to aim for before applying any algorithms

97
CLOSING REMARKS

o If you have any questions, please get in touch with me:
o email: me@rutumulkar.com
o twitter: @rutumulkar

98
HOPE YOU ENJOYED LEARNING ABOUT NLP

THANK YOU FOR LISTENING!

99
