
Module 4

TEXT ANALYTICS

Introduction

• Text mining, also known as text data mining, is the process of transforming unstructured text into a structured format to identify meaningful patterns and new insights.
• By applying advanced analytical techniques, such as Naïve Bayes, Support Vector Machines (SVM), and deep learning algorithms, companies can explore and discover hidden relationships within their unstructured data.

Data Organization
• Text is one of the most common data types within databases.
• Data can be organized as:
• Structured data
• Unstructured data
• Semi-structured data

Need for text mining
• Roughly 80% of data in the world resides in an unstructured format.
• Text mining is an extremely valuable practice within organizations.
• Text mining tools and NLP techniques transform unstructured
documents into a structured format to enable analysis and the
generation of high-quality insights.
• Improves the decision-making of organizations, leading to better
business outcomes.

Text Mining vs Text Analytics
• Text mining and text analysis identify textual patterns and trends within unstructured data through the use of machine learning, statistics, and linguistics.
• By transforming the data into a more structured format through text
mining and text analysis, more quantitative insights can be found
through text analytics.
• Data visualization techniques can then be harnessed to
communicate findings to wider audiences.
Text mining techniques
• The process of text mining comprises several activities that enable you to deduce
information from unstructured text data.
• Start with text processing (cleaning) before you apply different techniques.
• Use NLP techniques such as the following (sketched in code after this list):
• Identification
• Tokenization
• POS tagging
• Chunking
• Syntax Parsing
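
Here is a minimal sketch of such a preprocessing pipeline using the NLTK library (the sample sentence is invented, and syntax parsing is approximated by regular-expression chunking):

```python
import nltk

# One-time downloads of the required NLTK models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "The call center updated my address from Brooklyn to Manhattan."

# Tokenization: split the raw text into word tokens
tokens = nltk.word_tokenize(text)

# POS tagging: label each token with its part of speech
tagged = nltk.pos_tag(tokens)

# Chunking: group tagged tokens into noun phrases with a simple grammar
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"  # optional determiner, adjectives, nouns
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))
```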

Text mining techniques
• Information Retrieval
• Information retrieval (IR) returns relevant information or documents based on
a pre-defined set of queries or phrases.
• Tasks
• Tokenization
• Stemming

Text mining techniques
• NLP
• evolved from computational linguistics
• uses methods from various disciplines, such as computer science, artificial intelligence, linguistics, and data science
• enables computers to understand human language in both written and verbal forms

Text mining techniques
• Tasks
• Summarization
• Part-of-Speech (PoS) tagging (syntactic analysis)
• Text categorization (classification)
• Sentiment analysis

Text mining techniques
• Information extraction
• Information extraction (IE) surfaces the relevant pieces of data when searching
various documents.
• It also focuses on extracting structured information from free text and storing
these entities, attributes, and relationship information in a database.
• Tasks
• Feature selection (choosing a relevant subset of attributes)
• Feature extraction (deriving a reduced set of new features for dimensionality reduction)
• Named-entity recognition (NER) (entity identification)

Text mining techniques
• Data Mining
• Data mining is the process of identifying patterns and extracting useful
insights from big data sets.
• It evaluates both structured and unstructured data to identify new information, for example to analyze consumer behavior within marketing and sales.
• Text mining is essentially a sub-field of data mining.

https://www.elderresearch.com/wp-content/uploads/2020/10/Whitepaper_The_Seven_Practice_Areas_of_Text_Analytics_Chapter_2_Excerpt.pdf

Seven practice areas of Text Analytics


• Text mining can be divided into seven practice areas, based on the unique characteristics
of each area.
1. Search and information retrieval (IR)
2. Document clustering
3. Document classification
4. Web mining
5. Information extraction (IE)
6. Natural language processing (NLP)
7. Concept extraction

Seven practice areas of Text Analytics
• Search and information retrieval (IR)
• Storage and retrieval of text documents, including search engines and keyword search.
• Document clustering
• Grouping and categorizing terms, snippets, paragraphs, or documents, using data mining
clustering methods

• Document classification
• Grouping and categorizing snippets, paragraphs, or documents, using data mining classification methods, based on models trained on labeled examples.

Seven practice areas of Text Analytics
• Web mining
• Data and text mining on the Internet, with a specific focus on the scale and interconnectedness of the web.
• Information extraction (IE)
• Identification and extraction of relevant facts and relationships from unstructured text; the process of making
structured data from unstructured and semi-structured text.

• Natural language processing (NLP)


• Low-level language processing and understanding tasks
• Concept extraction
• Grouping of words and phrases into semantically similar groups

Seven practice areas of Text Analytics

[Figure: diagram of the seven practice areas]
Text mining: Applications and Use cases
• Risk Management
• Insufficient risk analysis is often a leading cause of failure in the financial and insurance industries.
• Adopting risk management software based on text mining technology can dramatically increase the ability to mitigate risk.
• Knowledge Management
• The ability to find important information quickly is essential.
• Healthcare industry organizations, for example, hold tremendous amounts of information.
Text mining: Applications and Use cases
• Cybercrime Prevention
• There is an increased risk of internet-based crimes.
• Text mining pinpoints real threats and limits the number of false positives created by keywords taken out of context.
• Enhanced Customer Service
• Solve customer problems through chatbots.
• Improve the customer experience by leveraging valuable information sources such as surveys, trouble tickets, and customer call notes to increase the quality, effectiveness, and speed of problem resolution.

Text mining: Applications and Use cases
• Contextual Advertising
• Compared to the traditional cookie-based approach, contextual advertising analyzes the text on a webpage to understand its content on a deeper level.
• E.g., an article about books might be paired with an ad for a Kindle.
• Business Intelligence
• BI tools help in the decision-making process.
• These tools enable you to identify patterns, trends, and opportunities in your data.
• Text mining converts unstructured data into structured data that these tools can analyze.
Text mining: Applications and Use cases
• Spam Filtering
• Spam is both an entry point for viruses and a detriment to productivity.
• Text mining techniques can be implemented to improve the effectiveness of statistical filtering methods by leveraging established prior knowledge.
• Social Media Data Analysis
• Social media is an important source of unstructured data.
• Social media is seen across the enterprise as a valuable source of market and customer intelligence.
• Text mining can extract the opinions, emotions, and sentiment that reveal the positives and negatives of consumer relationships with brands and products.
https://blog.aureusanalytics.com/blog/5-natural-language-processing-techniques-for-extracting-information

Extracting Meaningful Information - Techniques
• Language is considered one of the most significant achievements of humans and has accelerated the progress of humanity.
• Plenty of work is being done to integrate language into the field of artificial intelligence in the form of Natural Language Processing (NLP).
• NLP comprises
• NLU, Natural Language Understanding (human to machine)
• NLG, Natural Language Generation (machine to human)
• NLU aids in extracting valuable information from text such as social media data, customer surveys, and complaints.
Extracting Information - Techniques
• Named Entity Recognition
• Sentiment Analysis
• Text Summarization
• Aspect Mining
• Topic Modeling

Review of Auto Insurance Company
The customer service of Rocketz is terrible. I must call the call center multiple times
before I get a decent reply. The call center guys are extremely rude and totally
ignorant. Last month I called with a request to update my correspondence address
from Brooklyn to Manhattan. I spoke with about a dozen representatives – Lucas
Hayes, Ethan Gray, Nora Diaz, Sofia Parker to name a few. Even after writing multiple
emails and filling out numerous forms, the address has still not been updated. Even my
agent John is useless. The policy details he gave me were wrong. The only good
thing about the company is the pricing. The premium is reasonable compared to
the other insurance companies in the United States. There has not been any
significant increase in my premium since 2015.

Named Entity Recognition

• Extracting the entities in the text.
• Highlights the fundamental concepts and references in the text.
• Named Entity Recognition (NER) identifies entities such as people, locations, organizations, and dates in the text.
• NER is generally based on grammar rules and supervised models.
• There are NER platforms, such as Apache OpenNLP, that have pre-trained, built-in NER models.
Named Entity Recognition

NER output for the example text will typically be:


Person: Lucas Hayes, Ethan Gray, Nora Diaz, Sofia Parker, John
Location: Brooklyn, Manhattan, United States
Date: Last month, 2015
Organization: Rocketz
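
Output like this can be produced with an off-the-shelf NER model. A minimal sketch using the spaCy library (one common choice, not prescribed by the slides):

```python
import spacy

# Requires a one-time model download: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

review = ("The customer service of Rocketz is terrible. Last month I asked "
          "to update my address from Brooklyn to Manhattan.")

doc = nlp(review)
for ent in doc.ents:
    # ent.label_ is a type such as PERSON, GPE (location), ORG, or DATE
    print(ent.text, "->", ent.label_)
```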

Sentiment Analysis

• The most widely used technique in NLP.
• Most useful in cases such as customer surveys, reviews, and social media comments where people express their opinions and feedback.
• The simplest output of sentiment analysis is a 3-point scale: positive/negative/neutral.
• The output can also be a numeric score, bucketed into as many categories as required.

Sentiment Analysis

• Sentiment Analysis can be done using
• Supervised techniques (Naïve Bayes, Random Forest)
• Unsupervised techniques (lexicon-based methods)
• Most negative comment: The call center guys are extremely rude and totally
ignorant.
• Sentiment Score: -1.233288
• Most positive comment: The premium is reasonable compared to the other
insurance companies in the United States.
• Sentiment Score: 0.2672612
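
To illustrate the lexicon-based (unsupervised) approach, here is a minimal sketch using NLTK's VADER analyzer; its compound score will not reproduce the exact scores above, which came from a different tool:

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

analyzer = SentimentIntensityAnalyzer()

comments = [
    "The call center guys are extremely rude and totally ignorant.",
    "The premium is reasonable compared to the other insurance companies.",
]

for comment in comments:
    scores = analyzer.polarity_scores(comment)
    # "compound" is a normalized score in [-1, 1]; bucket it into 3 classes
    label = ("positive" if scores["compound"] > 0.05
             else "negative" if scores["compound"] < -0.05
             else "neutral")
    print(f"{label:>8} {scores['compound']:+.3f} {comment}")
```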
Text Summarization

• Summarize large chunks of text.


• Mainly used in cases such as news articles and research articles.
• Two approaches:
• Extractive: create a summary by extracting parts of the original text
• Abstractive: create a summary by generating fresh text that conveys the crux of the original text

Text Summarization

• There are various algorithms that can be used for text summarization:
• LexRank
• TextRank
• Latent Semantic Analysis
• The LexRank algorithm ranks the sentences using the similarity between them.
• A sentence is ranked higher when it is similar to more sentences, and those sentences are in turn similar to other sentences.
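
Here is a minimal sketch of the general extractive idea (a simplification, not full LexRank): score each sentence by its average TF-IDF cosine similarity to the other sentences and keep the top ones. NLTK and scikit-learn are assumed:

```python
import numpy as np
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def extractive_summary(text, n_sentences=2):
    sentences = sent_tokenize(text)
    # Represent each sentence as a TF-IDF vector
    tfidf = TfidfVectorizer().fit_transform(sentences)
    # Pairwise similarity of every sentence to every other sentence
    sim = cosine_similarity(tfidf)
    # Score each sentence by its average similarity to the rest
    scores = sim.mean(axis=1)
    top = np.argsort(scores)[-n_sentences:]
    # Return the top-scoring sentences in their original order
    return " ".join(sentences[i] for i in sorted(top))

review = ("The customer service of Rocketz is terrible. "
          "I have to call the call center multiple times before I get a decent reply. "
          "The premium is reasonable compared to other insurance companies.")
print(extractive_summary(review))
```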
Text Summarization

• Output of example:
I have to call the call center multiple times before I get a decent reply.
The premium is reasonable compared to the other insurance companies
in the United States.

Aspect Mining

• Aspect mining identifies the different aspects in the text.
• When used in conjunction with sentiment analysis, it extracts complete information from the text.
• It uses POS tagging to identify aspects.
• Output of example
Customer service – negative
Call center – negative
Agent – negative
Pricing/Premium – positive
Topic Modeling

• Identifying natural topics in the text.


• It is an unsupervised learning technique.
• Algorithms
• Latent Semantic Analysis (LSA)
• Probabilistic Latent Semantic Analysis (PLSA)
• Latent Dirichlet Allocation (LDA)
• Correlated Topic Model (CTM)

Topic Modeling

• The significance of LDA is that each text document comprises several topics and each topic comprises several words.
• The input required by LDA is a set of text documents and the expected number of topics.
• Given sample text and a set of topics, the topic modeling output identifies the words most strongly associated with each topic.
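
A minimal sketch with scikit-learn's LDA implementation (the tiny corpus is invented for illustration):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The premium and pricing of the policy are reasonable.",
    "The call center service was rude and unhelpful.",
    "The insurance premium increased after the policy renewal.",
    "Customer service never replies to emails or calls.",
]

# LDA works on raw term counts rather than TF-IDF weights
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# The expected number of topics is an input to LDA
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the top words for each discovered topic
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = weights.argsort()[-4:][::-1]
    print(f"Topic {i}:", ", ".join(terms[j] for j in top))
```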

Topic Modeling

[Figure: example topic modeling output]
Text Analysis
• Text analysis, sometimes called text analytics, refers to the representation,
processing, and modeling of textual data to derive useful insights.
• An important component of text analysis is text mining, the process of
discovering relationships and interesting patterns in large text collections.
• Text analysis suffers from:
• High dimensionality (every distinct term adds a dimension)
• Unstructured data

High Dimensionality
• Text analysis often deals with textual data that is far more complex.
• A corpus (plural: corpora) is a large collection of texts used for various
purposes in Natural Language Processing (NLP).
• The high dimensionality of text is an important issue, and it has a direct
impact on the complexities of many text analysis tasks.

Corpus

[Figure: example of a text corpus]

Unstructured Data

[Figure: example of unstructured data]
Text Analysis Steps
• Three important steps:
• Parsing
• Search and Retrieval
• Text Mining

Parsing
• Parsing is the process that takes unstructured text and imposes a structure
for further analysis.
• The unstructured text could be a plain text file, a weblog, an Extensible
Markup Language (XML) file, a HyperText Markup Language (HTML) file,
or a Word document.
• Parsing deconstructs the provided text and renders it in a more structured
way for the subsequent steps.

Search and Retrieval
• Search and retrieval is the identification of the documents in a corpus that
contain search items such as specific words, phrases, topics, or entities like
people or organizations.
• These search items are generally called key terms.
• Search and retrieval originated from the field of library science and is now
used extensively by web search engines.

Text Mining
• Text mining uses the terms and indexes produced by the prior two steps to
discover meaningful insights pertaining to domains or problems of interest.
• Text mining may utilize methods and techniques from various fields of study, such
as statistical analysis, information retrieval, data mining, and natural language
processing.
• Clustering (e.g., k-means) and classification (e.g., Naïve Bayes) can be adapted to text mining, as sketched below.
• To gain insights, all three steps can be used in any order, depending on the application context.
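
A minimal scikit-learn sketch of adapting classification to text, chaining TF-IDF vectorization with a Naïve Bayes classifier (the tiny labeled dataset is invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled examples: 1 = complaint, 0 = praise
texts = [
    "The call center guys are extremely rude.",
    "The premium is reasonable and the pricing is great.",
    "My address has still not been updated after many calls.",
    "Great service, my claim was settled quickly.",
]
labels = [1, 0, 1, 0]

# Pipeline: raw text -> TF-IDF features -> Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["The agent was rude and unhelpful."]))  # expect [1]
```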
Text Analysis Process for an organization (product review)

[Figure: end-to-end text analysis process for product reviews]
Collect raw text
• This corresponds to Phase 1 and Phase 2 of the Data Analytic Lifecycle.
• In this step, the Data Science team monitors websites for references to
specific products.
• The websites may include social media and review sites.
• The team could interact with social network application programming interfaces (APIs), process data feeds, or scrape pages, using product names as keywords to get the raw data.
• Regular expressions are commonly used in this case to identify text that matches certain patterns.
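
A minimal sketch of such pattern matching with Python's re module (the product names and posts are invented):

```python
import re

# Hypothetical product names to look for in scraped text
pattern = re.compile(r"\b(acme\s?phone|bphone)\b", re.IGNORECASE)

posts = [
    "Just got the new bPhone and I love the camera!",
    "My Acme Phone battery died after two days.",
    "Unrelated post about the weather.",
]

# Keep only the posts that mention one of the products
mentions = [post for post in posts if pattern.search(post)]
print(mentions)
```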
Represent text

• Convert each review into a suitable document representation with proper indices, and build a corpus based on these indexed reviews.
• This step corresponds to Phases 2 and 3 of the Data Analytic Lifecycle.

TF-IDF
• Compute the usefulness of each word in the reviews using methods such as TF-IDF.
• This step corresponds to Phases 3 through 5 of the Data Analytic Lifecycle.

Topic modeling
• Categorize documents by topics.
• This can be achieved through topic models such as Latent Dirichlet
Allocation.
• This step corresponds to Phases 3 through 5 of the Data Analytic Lifecycle.

Sentiment Analysis
• Determine the sentiments of the reviews: identify whether the reviews are positive or negative.
• Many product review sites provide ratings of a product with each review. If such information is not available, techniques like sentiment analysis can be used on the textual data to infer the underlying sentiments.
• People can express many emotions, which makes this inference challenging.

Gain Insights
• Review the results and gain greater insights (Section 9.8). This step corresponds to Phases 5 and 6 of the Data Analytic Lifecycle.
• Marketing gathers the results from the previous steps to find out what exactly makes people love or hate a product.
• Use one or more visualization techniques to report the findings.
• Test the soundness of the conclusions and operationalize the findings if
applicable.

POS Tagging
• Part of Speech Tagging (POS-Tag) is the labeling of the words in a text
according to their word types (noun, adjective, adverb, verb, etc.)
• It is the process of converting a sentence into a list of words and then into a list of tuples, where each tuple has the form (word, tag). The tag in this case is a part-of-speech tag that signifies whether the word is a noun, adjective, verb, and so on.

POS Tagging
• Noun (N)- Daniel, London, table, dog, teacher, pen, city, happiness, hope
• Verb (V)- go, speak, run, eat, play, live, walk, have, like, are, is
• Adjective(ADJ)- big, happy, green, young, fun, crazy, three
• Adverb(ADV)- slowly, quietly, very, always, never, too, well, tomorrow
• Preposition (P)- at, on, in, from, with, near, between, about, under
• Conjunction (CON)- and, or, but, because, so, yet, unless, since, if
• Pronoun(PRO)- I, you, we, they, he, she, it, me, us, them, him, her, this
• Interjection (INT)- Ouch! Wow! Great! Help! Oh! Hey! Hi!
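
A short sketch of POS tagging with NLTK, using its coarse "universal" tagset, which is close to the categories listed above (the sample sentence is invented):

```python
import nltk

# One-time downloads
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("universal_tagset")

sentence = "Daniel slowly walked to London because he was happy."
tokens = nltk.word_tokenize(sentence)

# tagset="universal" maps fine-grained Penn Treebank tags onto coarse
# categories such as NOUN, VERB, ADJ, ADV, ADP, CONJ, PRON
print(nltk.pos_tag(tokens, tagset="universal"))
# [('Daniel', 'NOUN'), ('slowly', 'ADV'), ('walked', 'VERB'), ...]
```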
Stemming
• Stemming is a technique used to extract the base form of the words by
removing affixes from them. It is just like cutting down the branches of a
tree to its stems.
• For example, the stem of the words eating, eats, eaten is eat.
• Search engines use stemming for indexing the words.
• PorterStemmer and SnowballStemmer are widely used stemming algorithms in the NLTK library.
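
A minimal sketch of both stemmers in NLTK:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

for word in ["eating", "eats", "eaten", "playing", "happiness"]:
    print(word, "->", porter.stem(word), "/", snowball.stem(word))

# eating -> eat, eats -> eat, playing -> play; note that "eaten" is left
# unchanged, since stemming is a crude, rule-based suffix removal
```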

Lemmatization
• Lemmatization is the process of finding the dictionary form (lemma) of a word.
• It is different from stemming and is computationally more involved.
• The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form.
• As opposed to stemming, lemmatization does not simply chop off inflections; instead, it uses lexical knowledge bases to get the correct base forms of words.
• The output of lemmatization is a 'lemma'. E.g., plays, play, playing, and played all have play as their lemma.
• NLTK provides the WordNetLemmatizer class, a thin wrapper around the WordNet corpus; it uses the morphy() function of the WordNet CorpusReader class to find a lemma.
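
A minimal sketch:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # one-time download of the WordNet corpus

lemmatizer = WordNetLemmatizer()

# pos="v" tells the lemmatizer to treat each word as a verb; the default
# pos is "n" (noun), which would leave these verb forms unchanged
for word in ["plays", "playing", "played", "eaten"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
# plays -> play, playing -> play, played -> play, eaten -> eat
```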
TF-IDF
• Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used
statistical method in natural language processing and information retrieval.
• It measures how important a term is within a document relative to a collection of
documents (i.e., relative to a corpus).
• Words within a text document are transformed into importance numbers by a text
vectorization process.
• There are many different text vectorization scoring schemes, with TF-IDF being
one of the most common.
• As its name implies, TF-IDF vectorizes/scores a word by multiplying the word’s
Term Frequency (TF) with the Inverse Document Frequency (IDF).
TF-IDF
• Term Frequency: TF of a term or word is the number of times the term
appears in a document compared to the total number of words in the
document.
• Inverse Document Frequency: IDF of a term reflects the proportion of
documents in the corpus that contain the term.
• Words unique to a small percentage of documents receive higher
importance values than words common across all documents

TF-IDF
• The TF-IDF of a term is calculated by multiplying TF and IDF scores.
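
One common formulation (exact variants differ across tools; scikit-learn, for example, uses a smoothed IDF), where f_{t,d} is the count of term t in document d and N is the number of documents in the corpus:

```latex
\mathrm{tf}(t,d)    = \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \qquad
\mathrm{idf}(t)     = \log \frac{N}{\lvert \{ d : t \in d \} \rvert} \qquad
\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t)
```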

TF-IDF
• Importance of a term is high when it occurs a lot in a given document and
rarely in others.
• In short, commonality within a document measured by TF is balanced by
rarity between documents measured by IDF.
• The resulting TF-IDF score reflects the importance of a term for a
document in the corpus.

TF-IDF
• TF-IDF is useful in many natural language processing applications.
• For example, search engines use TF-IDF to rank the relevance of a document for a query.
• TF-IDF is also employed in text classification, text summarization, and
topic modeling.

Example

[Figure: worked TF-IDF calculation on a small corpus]
TF-IDF
• Some popular Python libraries provide a function to calculate TF-IDF, for example the machine learning library scikit-learn (sklearn) with its TfidfVectorizer().
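
A minimal usage sketch (the three-document corpus is invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the premium is reasonable",
    "the call center is terrible",
    "the premium increased this year",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # rows = documents, columns = terms

# Terms shared by every document ("the") receive low weights, while
# rarer terms ("terrible", "call") receive higher ones
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```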

