Module 4
TEXT ANALYTICS
Introduction
Data Organization
• Text is one of the most common data types stored in databases.
• Data can be organized as:
• Structured data
• Unstructured data
• Semi-structured data
Need for text mining
• Roughly 80% of data in the world resides in an unstructured format.
• Text mining is an extremely valuable practice within organizations.
• Text mining tools and NLP techniques transform unstructured
documents into a structured format to enable analysis and the
generation of high-quality insights.
• Improves the decision-making of organizations, leading to better
business outcomes.
Text Mining vs Text Analytics
• Text mining and text analysis identify textual patterns and trends
within unstructured data through the use of machine learning,
statistics, and linguistics.
• By transforming the data into a more structured format through text
mining and text analysis, more quantitative insights can be found
through text analytics.
• Data visualization techniques can then be harnessed to
communicate findings to wider audiences.
Text mining techniques
• The process of text mining comprises several activities that enable you to deduce
information from unstructured text data.
• Start with text processing (cleaning) before you apply different techniques.
• Use NLP techniques
• Identification
• Tokenization
• POS tagging
• Chunking
• Syntax Parsing
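As a minimal illustration of the tokenization step listed above (a sketch only; libraries such as NLTK or spaCy provide far more robust tokenizers):

```python
import re

def tokenize(text):
    """Split raw text into word tokens using a simple regex.

    A minimal sketch: lowercases the text and keeps alphanumeric
    runs, discarding punctuation entirely.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("The customer service of Rocketz is terrible."))
# ['the', 'customer', 'service', 'of', 'rocketz', 'is', 'terrible']
```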
Text mining techniques
• Information Retrieval
• Information retrieval (IR) returns relevant information or documents based on
a pre-defined set of queries or phrases.
• Tasks
• Tokenization
• Stemming
Text mining techniques
• NLP
• evolved from computational linguistics
• uses methods from various disciplines, such as computer science, artificial
intelligence, linguistics, and data science
• enables computers to understand human language in both written and verbal forms
Text mining techniques
• Tasks
• Summarization
• Part-of-Speech (PoS) tagging (syntactic analysis)
• Text categorization (classification)
• Sentiment analysis
Text mining techniques
• Information extraction
• Information extraction (IE) surfaces the relevant pieces of data when searching
various documents.
• It also focuses on extracting structured information from free text and storing
these entities, attributes, and relationship information in a database.
• Tasks
• Feature selection (attribute selection)
• Feature extraction (deriving a reduced set of features for dimensionality reduction)
• Named-entity recognition (NER) (entity identification)
Text mining techniques
• Data Mining
• Data mining is the process of identifying patterns and extracting useful
insights from big data sets.
• Evaluates both structured and unstructured data to identify new information
• To analyze consumer behaviors within marketing and sales.
• Text mining is essentially a sub-field of data mining.
https://www.elderresearch.com/wp-content/uploads/2020/10/Whitepaper_The_Seven_Practice_Areas_of_Text_Analytics_Chapter_2_Excerpt.pdf
Seven practice areas of Text Analytics
• Search and information retrieval (IR)
• Storage and retrieval of text documents, including search engines and keyword search.
• Document clustering
• Grouping and categorizing terms, snippets, paragraphs, or documents, using data mining
clustering methods
• Document classification
• Grouping and categorizing snippets, paragraphs, or documents, using data mining
classification methods, based on models trained on labeled examples.
Seven practice areas of Text Analytics
• Web mining
• Data and text mining on the Internet, with a specific focus on the scale and interconnectedness of the web.
• Information extraction (IE)
• Identification and extraction of relevant facts and relationships from unstructured text; the process of making
structured data from unstructured and semi-structured text.
Text mining: Applications and Use cases
• Risk Management
• insufficient risk analysis is often a leading cause of failure in financial and insurance
industries.
• adoption of risk management software based on text mining technology can
dramatically increase the ability to mitigate risk.
• Knowledge Management
• ability to find important information quickly.
• healthcare industry organizations have tremendous amounts of information
Text mining: Applications and Use cases
• Cybercrime Prevention
• increased risk of internet-based crimes.
• Text mining pinpoints real threats and limits the number of false positives created by
keywords taken out of context.
• Enhanced Customer Service
• Solve customer problems through chatbots.
• improve the customer experience by leveraging valuable information sources such as
surveys, trouble tickets and customer call notes to improve the quality, effectiveness and
speed of problem resolution.
Text mining: Applications and Use cases
• Contextual Advertising
• Compared to the traditional cookie-based approach, contextual advertising analyzes the
text on a webpage to understand the content on a deeper level.
• E.g., reading about a book online can trigger an ad for a Kindle.
• Business Intelligence
• BI tools help in the decision-making process.
• These tools enable you to identify patterns, trends and opportunities from your data.
• Converts unstructured data to structured data.
Text mining: Applications and Use cases
• Spam Filtering
• Spam is both an entry point for viruses and a detriment to productivity.
• Text mining techniques can be implemented to improve the effectiveness of statistical
filtering methods by leveraging established prior knowledge.
• Social Media Data Analysis
• Social media – important source for unstructured data.
• Social media is seen across the enterprise as a valuable source of market and customer
intelligence.
• extract opinions, emotions and sentiment that reveal the positive and negative aspects of
consumer relationships with brands and products.
https://blog.aureusanalytics.com/blog/5-natural-language-processing-techniques-for-extracting-information
Review of Auto Insurance Company
The customer service of Rocketz is terrible. I must call the call center multiple times
before I get a decent reply. The call center guys are extremely rude and totally
ignorant. Last month I called with a request to update my correspondence address
from Brooklyn to Manhattan. I spoke with about a dozen representatives – Lucas
Hayes, Ethan Gray, Nora Diaz, Sofia Parker to name a few. Even after writing multiple
emails and filling out numerous forms, the address has still not been updated. Even my
agent John is useless. The policy details he gave me were wrong. The only good
thing about the company is the pricing. The premium is reasonable compared to
the other insurance companies in the United States. There has not been any
significant increase in my premium since 2015.
Named Entity Recognition
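Full NER relies on trained statistical models (e.g., in spaCy or NLTK). As a toy illustration of the idea on text like the review above, a capitalization heuristic can surface candidate entities; note that it would also wrongly flag capitalized sentence-initial words:

```python
import re

def naive_entities(text):
    """Return runs of capitalized words as entity candidates.

    A toy heuristic only: real NER uses trained models that
    consider context, and this regex would also match capitalized
    words at the start of a sentence.
    """
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)

print(naive_entities("I spoke with Lucas Hayes and Ethan Gray in Brooklyn."))
# ['Lucas Hayes', 'Ethan Gray', 'Brooklyn']
```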
Sentiment Analysis
Text Summarization
• There are various algorithms that can be used for text summarization
• LexRank
• TextRank
• Latent Semantic Analysis.
• LexRank algorithm ranks the sentences using similarity between them.
• A sentence is ranked higher when it is similar to more sentences, and these
sentences are in turn similar to other sentences.
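The ranking idea can be sketched with plain term-count vectors and cosine similarity, scoring each sentence by its total similarity to the others (a simplification: full LexRank applies PageRank-style iteration over the similarity graph):

```python
import math
import re
from collections import Counter

def sentence_scores(sentences):
    """Score each sentence by its summed cosine similarity to the rest.

    A simplified sketch of the LexRank idea; the full algorithm builds
    a similarity graph and iterates a PageRank-style computation on it.
    """
    vecs = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    return [sum(cosine(v, w) for j, w in enumerate(vecs) if j != i)
            for i, v in enumerate(vecs)]

sents = ["The call center guys are rude.",
         "The call center reply is rude.",
         "The premium is reasonable."]
scores = sentence_scores(sents)
# The first two sentences score highest because they resemble each other.
```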
Text Summarization
• Output of example:
I have to call the call center multiple times before I get a decent reply.
The premium is reasonable compared to the other insurance companies
in the United States.
Aspect Mining
Topic Modeling
Text Analysis
• Text analysis, sometimes called text analytics, refers to the representation,
processing, and modeling of textual data to derive useful insights.
• An important component of text analysis is text mining, the process of
discovering relationships and interesting patterns in large text collections.
• Text Analysis suffers from:
• High dimensionality (a very large number of distinct terms)
• Unstructured data
High Dimensionality
• Text analysis often deals with textual data that is far more complex than structured, numeric data.
• A corpus (plural: corpora) is a large collection of texts used for various
purposes in Natural Language Processing (NLP).
• The high dimensionality of text is an important issue, and it has a direct
impact on the complexities of many text analysis tasks.
Corpus
Unstructured Data
Text Analysis Steps
• 3 important steps
• Parsing
• Search and Retrieval
• Text Mining
Parsing
• Parsing is the process that takes unstructured text and imposes a structure
for further analysis.
• The unstructured text could be a plain text file, a weblog, an Extensible
Markup Language (XML) file, a HyperText Markup Language (HTML) file,
or a Word document.
• Parsing deconstructs the provided text and renders it in a more structured
way for the subsequent steps.
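As a small sketch of parsing, Python's standard-library HTMLParser can impose structure on an HTML page by separating markup from the visible text:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Minimal parser: strips HTML tags and collects the visible text."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        if data.strip():
            self.chunks.append(data.strip())

extractor = TextExtractor()
extractor.feed("<html><body><h1>Review</h1>"
               "<p>The premium is reasonable.</p></body></html>")
print(extractor.chunks)  # ['Review', 'The premium is reasonable.']
```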
Search and Retrieval
• Search and retrieval is the identification of the documents in a corpus that
contain search items such as specific words, phrases, topics, or entities like
people or organizations.
• These search items are generally called key terms.
• Search and retrieval originated from the field of library science and is now
used extensively by web search engines.
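A minimal sketch of search and retrieval with an inverted index, the core data structure behind keyword search (a simplification: real engines add ranking, stemming, and phrase handling):

```python
from collections import defaultdict

def build_index(corpus):
    """Build an inverted index mapping each key term to document ids."""
    index = defaultdict(set)
    for doc_id, text in enumerate(corpus):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, term):
    """Retrieve the ids of documents containing the key term."""
    return sorted(index.get(term.lower(), set()))

corpus = ["The premium is reasonable",
          "The call center is terrible",
          "My premium did not increase"]
index = build_index(corpus)
print(search(index, "premium"))  # [0, 2]
```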
Text Mining
• Text mining uses the terms and indexes produced by the prior two steps to
discover meaningful insights pertaining to domains or problems of interest.
• Text mining may utilize methods and techniques from various fields of study, such
as statistical analysis, information retrieval, data mining, and natural language
processing.
• Clustering (K-Means) and Classification (Naïve Bayes) can be adapted to text
mining.
• To gain insights, all three steps can be used in any order, depending on the
application under consideration.
Text Analysis Process for an organization
(product review)
Collect raw text
• This corresponds to Phase 1 and Phase 2 of the Data Analytic Lifecycle.
• In this step, the Data Science team monitors websites for references to
specific products.
• The websites may include social media and review sites.
• The team could interact with social network application programming
interfaces (APIs), process data feeds, or scrape pages, using product
names as keywords to get the raw data.
• Regular expressions are commonly used in this case to identify text that
matches certain patterns.
Represent text
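The "Represent text" step typically converts each document into a numeric vector. A minimal bag-of-words sketch (word counts over a shared vocabulary; an assumed illustration, since weighting schemes such as TF-IDF are also common):

```python
def bag_of_words(corpus):
    """Map each document to a vector of term counts over a shared vocabulary."""
    vocab = sorted({w for doc in corpus for w in doc.lower().split()})
    vectors = [[doc.lower().split().count(w) for w in vocab] for doc in corpus]
    return vocab, vectors

vocab, vectors = bag_of_words(["the agent is useless",
                               "the premium is reasonable"])
print(vocab)    # ['agent', 'is', 'premium', 'reasonable', 'the', 'useless']
print(vectors)  # [[1, 1, 0, 0, 1, 1], [0, 1, 1, 1, 1, 0]]
```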
TF-IDF
• Compute the usefulness of each word in the reviews using methods such
as TF-IDF.
• This step corresponds to Phases 3 through 5 of the Data Analytic Lifecycle.
Topic modeling
• Categorize documents by topics.
• This can be achieved through topic models such as Latent Dirichlet
Allocation.
• This step corresponds to Phases 3 through 5 of the Data Analytic Lifecycle.
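A short sketch with scikit-learn's LatentDirichletAllocation (the documents and the choice of two topics here are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the premium price is reasonable and the price is fair",
        "the call center service is terrible and the service is rude",
        "reasonable premium and fair price",
        "rude call center and terrible service"]

# Bag-of-words counts, then a 2-topic LDA model fitted on them.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each row of components_ holds one topic's word weights.
print(lda.components_.shape)  # (2, number of vocabulary terms)
```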
Sentiment Analysis
• Determine sentiments of the reviews.
• Identify whether the reviews are positive or negative.
• Many product review sites provide ratings of a product with each review.
If such information is not available, techniques like sentiment analysis can
be used on the textual data to infer the underlying sentiments.
• People can express many emotions.
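When ratings are not available, a toy lexicon-based scorer illustrates the idea of inferring sentiment from text (the word lists here are hypothetical; practical systems use large sentiment lexicons or trained classifiers):

```python
# Hypothetical mini-lexicons for illustration only.
POSITIVE = {"good", "reasonable", "decent", "great", "love"}
NEGATIVE = {"terrible", "rude", "ignorant", "useless", "wrong", "hate"}

def sentiment(text):
    """Score text as positive (>0), negative (<0), or neutral (0)."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment("the call center guys are rude and ignorant"))  # -2
print(sentiment("the premium is reasonable"))                   # 1
```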
Gain Insights
• Review the results and gain greater insights (Section 9.8). This step
corresponds to Phase 5 and 6 of the Data Analytic Lifecycle.
• Marketing gathers the results from the previous steps. Find out what
exactly makes people love or hate a product.
• Use one or more visualization techniques to report the findings.
• Test the soundness of the conclusions and operationalize the findings if
applicable.
POS Tagging
• Part of Speech Tagging (POS-Tag) is the labeling of the words in a text
according to their word types (noun, adjective, adverb, verb, etc.)
• It is the process of converting a sentence into a list of words and then into
a list of tuples, where each tuple has the form (word, tag). The tag is a
part-of-speech tag and signifies whether the word is a noun, adjective,
verb, and so on.
POS Tagging
• Noun (N)- Daniel, London, table, dog, teacher, pen, city, happiness, hope
• Verb (V)- go, speak, run, eat, play, live, walk, have, like, are, is
• Adjective(ADJ)- big, happy, green, young, fun, crazy, three
• Adverb(ADV)- slowly, quietly, very, always, never, too, well, tomorrow
• Preposition (P)- at, on, in, from, with, near, between, about, under
• Conjunction (CON)- and, or, but, because, so, yet, unless, since, if
• Pronoun(PRO)- I, you, we, they, he, she, it, me, us, them, him, her, this
• Interjection (INT)- Ouch! Wow! Great! Help! Oh! Hey! Hi!
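A toy dictionary-based tagger built from the word lists above shows the (word, tag) output format (real taggers such as nltk.pos_tag are statistical and use sentence context):

```python
# Mini-lexicon drawn from the word lists above; unknown words get 'UNK'.
LEXICON = {
    "N":   {"daniel", "london", "table", "dog", "teacher", "pen", "city"},
    "V":   {"go", "speak", "run", "eat", "play", "live", "walk", "is", "are"},
    "ADJ": {"big", "happy", "green", "young", "fun", "crazy"},
    "ADV": {"slowly", "quietly", "very", "always", "never", "well"},
    "P":   {"at", "on", "in", "from", "with", "near", "under"},
    "PRO": {"i", "you", "we", "they", "he", "she", "it"},
}

def pos_tag(sentence):
    """Return a list of (word, tag) tuples via dictionary lookup."""
    tags = []
    for word in sentence.lower().split():
        tag = next((t for t, words in LEXICON.items() if word in words), "UNK")
        tags.append((word, tag))
    return tags

print(pos_tag("The dog is happy"))
# [('the', 'UNK'), ('dog', 'N'), ('is', 'V'), ('happy', 'ADJ')]
```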
Stemming
• Stemming is a technique used to extract the base form of the words by
removing affixes from them. It is just like cutting down the branches of a
tree to its stems.
• For example, the stem of the words eating, eats, eaten is eat.
• Search engines use stemming for indexing the words.
• PorterStemmer and SnowballStemmer are widely used stemming algorithms
provided in the NLTK library.
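A crude suffix-stripping sketch conveys the idea (the real Porter algorithm applies ordered rule phases with additional conditions; use NLTK's PorterStemmer in practice):

```python
def simple_stem(word):
    """Strip a common suffix to approximate the stem.

    A crude sketch only: the Porter algorithm applies ordered rule
    phases with measure conditions rather than a single suffix pass.
    """
    for suffix in ("ing", "ed", "en", "s"):
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([simple_stem(w) for w in ("eating", "eats", "eaten")])
# ['eat', 'eat', 'eat']
```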
Lemmatization
• Lemmatization is the process of finding the dictionary (base) form of a word.
• It is different from stemming.
• It is computationally more expensive than stemming.
• The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form.
• As opposed to stemming, lemmatization does not simply chop off inflections.
• Instead, it uses lexical knowledge bases to get the correct base forms of words.
• The output of lemmatization is a 'lemma'. E.g., plays, play, playing, and played all have play as their lemma.
• NLTK provides the WordNetLemmatizer class, a thin wrapper around the WordNet corpus.
This class uses the morphy() function of the WordNet CorpusReader class to find a lemma.
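A toy lookup-based lemmatizer illustrates why lemmatization differs from stemming: it consults a dictionary (here a hypothetical mini-dictionary) rather than chopping suffixes, so it can handle irregular forms:

```python
# Hypothetical mini-dictionary; real lemmatizers such as NLTK's
# WordNetLemmatizer consult the full WordNet lexical database.
LEMMAS = {
    "plays": "play", "playing": "play", "played": "play",
    "went": "go", "better": "good", "mice": "mouse",
}

def lemmatize(word):
    """Look up the lemma; fall back to the lowercased word itself."""
    return LEMMAS.get(word.lower(), word.lower())

print(lemmatize("went"))     # go  (a suffix-stripping stemmer cannot recover this)
print(lemmatize("playing"))  # play
```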
TF-IDF
• Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used
statistical method in natural language processing and information retrieval.
• It measures how important a term is within a document relative to a collection of
documents (i.e., relative to a corpus).
• Words within a text document are transformed into importance numbers by a text
vectorization process.
• There are many different text vectorization scoring schemes, with TF-IDF being
one of the most common.
• As its name implies, TF-IDF vectorizes/scores a word by multiplying the word’s
Term Frequency (TF) with the Inverse Document Frequency (IDF).
TF-IDF
• Term Frequency: TF of a term or word is the number of times the term
appears in a document compared to the total number of words in the
document.
• Inverse Document Frequency: IDF of a term reflects how rare the term is
across the corpus; the smaller the proportion of documents containing it, the higher its IDF.
• Words unique to a small percentage of documents receive higher
importance values than words common across all documents
TF-IDF
• The TF-IDF of a term is calculated by multiplying TF and IDF scores.
TF-IDF
• Importance of a term is high when it occurs a lot in a given document and
rarely in others.
• In short, commonality within a document measured by TF is balanced by
rarity between documents measured by IDF.
• The resulting TF-IDF score reflects the importance of a term for a
document in the corpus.
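The computation can be sketched in plain Python, using one common weighting variant, tf-idf(t, d) = tf(t, d) × log(N / df(t)):

```python
import math

def tf_idf(corpus):
    """Compute TF-IDF scores for every term in every document.

    One common variant of the weighting:
        tf(t, d) = count of t in d / number of terms in d
        idf(t)   = log(N / df(t)), with N documents and df(t) of them containing t
    """
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    df = {}
    for terms in docs:
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    scores = []
    for terms in docs:
        tf = {t: terms.count(t) / len(terms) for t in set(terms)}
        scores.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return scores

scores = tf_idf(["the premium is reasonable",
                 "the service is terrible",
                 "the agent is useless"])
# "the" and "is" appear in every document, so their TF-IDF is 0;
# rarer terms such as "premium" get a positive score.
```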
TF-IDF
• TF-IDF is useful in many natural language processing applications.
• For example, Search Engines use TF-IDF to rank the relevance of a
document for a query.
• TF-IDF is also employed in text classification, text summarization, and
topic modeling.
TF-IDF
• Some popular python libraries have a function to calculate TF-IDF.
The popular machine learning library Sklearn and
TfidfVectorizer().