Text Analytics Basics
Pāṇini is known for his rules on linguistics, syntax, and semantics; he laid the foundation
of the Vyākaraṇa, which became the basis of modern linguistic applications of AI
(Russell & Norvig, Artificial Intelligence: A Modern Approach, Pearson, 2022)
History
• Context-free grammars – first discovered by Pāṇini (c. 350 BCE) while studying Sanskrit
• Reinvented by Noam Chomsky (1956) for the analysis of English, in his book Syntactic Structures
• Chomsky’s theory, based on syntactic models going back to the Indian linguist Pāṇini (c. 350 BCE), could explain
how children can understand and make up sentences they have never heard before, addressed the notion of
creativity, and was formal enough that it could in principle be programmed
• Natural Language Processing (NLP) – the science of computation whose subject matter is data structures
and algorithms for computer processing of human language
• Language Engineering – building NLP tools whose costs and outputs are measurable and predictable
• Text Analytics/Text Mining – the process of deriving meaningful, valuable, and usable insights from
textual data
Text Analytics Challenges
• Huge Data - Volumes
• Real-time - Velocity
• Largely Unstructured and noisy data - Variety
• Reliability of information source questionable
• Data accessibility and storage
• Ambiguity and context sensitivity (Source: https://web.stanford.edu/c)
• automobile = car = vehicle = Toyota
• Apple (the company) or apple (the fruit)
Text Analytics Process
[Process diagram: documents are collected from social media platforms, web properties, websites, and apps to build a corpus; documents then go through standardization and tagging/labelling, followed by exploratory analysis and visualization, with domain knowledge informing each stage.]
Data Cleaning
• Text cleaning is source/data dependent and can involve the following cleaning steps:
• Symbols, URLs, Special characters
• Case Normalization
• Meta Data
• White Spaces
Both Python in general and sklearn's CountVectorizer provide text cleaning as required by the analytics task.
For example, in the Document Clustering notebook you can see a function similar to the one below used for
text data cleaning.
import re
from nltk.corpus import stopwords

# Assumed stopword list; requires nltk.download('stopwords')
stop_words1 = set(stopwords.words('english'))

def text_cleaning(text):
    text = re.sub(r'\b\d+\b', '', text)  # Remove standalone numbers
    text = re.sub(r'\s+', ' ', text)     # Collapse extra whitespace
    text = text.lower()                  # Convert to lowercase
    text = [word for word in text.split() if word not in stop_words1]  # Remove stopwords
    return ' '.join(text)
Text Pre-processing
• Tokenization
• Breaking the text into sentences and words
• Token Feature - capitalization, inclusion of digits, punctuation, special characters etc.
• Challenges (illustrated in the tokenizer sketch after this list)
• Bank’s rescue <Banks>; <Bank, ’s>; <Bank’s>; <Bank>
• What’re, can’t <What, are>; <What, ’re>; <What’re>
• Not-for-profit
• Prime Minister
• Domain Knowledge based Normalization
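A minimal sketch of how a standard tokenizer handles the contraction examples above, using NLTK's word_tokenize (assumes nltk is installed and the 'punkt' resource is downloaded):

from nltk.tokenize import word_tokenize

print(word_tokenize("Bank's rescue"))   # ['Bank', "'s", 'rescue']
print(word_tokenize("What're, can't"))  # ['What', "'re", ',', 'ca', "n't"]

Note the Treebank-style split of "can't" into 'ca' and "n't"; whether such splits are acceptable depends on the downstream task and may call for domain-knowledge-based normalization.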
Text Pre-processing
• Stemming – extracting the small meaningful units that make up words, called stems (stems, affixes, prefixes); e.g., Porter's Algorithm
• Lemmatization – referring to a dictionary to understand the meaning of a word before reducing it to its root word, called a lemma
• You can add or delete and in general customize stopwords for your requirements (for example, in the Basics of
CountVectorizer notebook we use stop_words1 = ['kumar','jiter'] + list_stop_words)
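A minimal sketch contrasting the two approaches with NLTK (the word 'studies' is an illustrative input; the lemmatizer requires the 'wordnet' resource):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('studies'))           # 'studi' – crude suffix stripping
print(lemmatizer.lemmatize('studies'))   # 'study' – dictionary-based lemma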
Incorporating Domain Knowledge
• Linguistic Knowledge
• Dictionary
• WordNet – available through NLTK and other Python libraries
• https://wordnet.princeton.edu/
• Build your own domain dictionary
• Sentiment Lexicon
• VADER in nltk
• N-grams – bigram or 2-gram; both unigrams and bigrams can be used as features
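A minimal sketch of the resources above: looking up WordNet synonyms via NLTK (requires the 'wordnet' resource) and extracting unigram-plus-bigram features with sklearn's CountVectorizer; the example sentence is illustrative:

from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import CountVectorizer

# WordNet: synonyms of 'car' in its first sense
print(wn.synsets('car')[0].lemma_names())  # ['car', 'auto', 'automobile', ...]

# Unigram and bigram features
vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(['the quick brown fox'])
print(vec.get_feature_names_out())  # unigrams plus bigrams such as 'quick brown'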
Exploratory Analysis – Frequency Analysis
• Frequency analysis is useful for understanding vocabulary, data cleaning, text
summarization, and identifying key terms, among other uses.
• Frequency of words/tokens is an important aspect of carrying out exploratory analysis along with
wordclouds
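A minimal sketch of token frequency counting with Python's standard library (the sample text is illustrative):

from collections import Counter

tokens = 'the cat sat on the mat and the cat slept'.split()
print(Counter(tokens).most_common(3))  # [('the', 3), ('cat', 2), ...]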
Exploratory Analysis
• Wordclouds
• A wordcloud is a visual representation of the
most frequent words in a text corpus, where
the size of each word is proportional to its
frequency. It provides a quick and intuitive way
to understand the key themes, topics, or
sentiments present in the text data.
• Useful for Exploratory Data Analysis, Data
Presentation, Social Media Analysis, Text
Summary, Sentiment Analysis, Topic Modeling.
• While wordclouds provide an overview of the most
frequent words in a text corpus, they may not
be suitable for in-depth analysis or for capturing
the context of each word. For more advanced
text analysis, other techniques such as topic
modeling and sentiment analysis are needed.
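A minimal sketch using the third-party wordcloud package (assumed installed via pip install wordcloud) together with matplotlib; the text is illustrative:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = 'text analytics mining text data insights text frequency'
wc = WordCloud(width=400, height=200, background_color='white').generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()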
Text Analytics Core Tasks
• Information Extraction
• Text Classification
• Sentiment Analysis (using Lexicon or Text Classification)
• Document Clustering/Text Clustering
• Topic Detection
Information Extraction
• Information Extraction (IE) Systems
• Find and understand limited relevant part of text
• Gather information from many pieces of text
• Produce a structured representation of relevant information
• Concepts/Entities
• Relations
• A Knowledge base
• Goal: organize information in useable structure
[Diagram: product reviews feeding an information-extraction pipeline that outputs structured <Product, Aspect, Sentiment> records]
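A minimal sketch of entity extraction with NLTK's named-entity chunker (requires the 'punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', and 'words' resources; the sentence is illustrative):

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = 'Apple opened a new office in London.'
tree = ne_chunk(pos_tag(word_tokenize(sentence)))
print(tree)  # named-entity labels such as ORGANIZATION/GPE appear in the tree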
Sentiment Analysis
• aka Opinion Extraction, Opinion Mining, Sentiment Mining, Subjectivity Analysis
• Why Sentiment Analysis?
• Is this review positive or negative?
• Is the market sentiment positive or negative?
• What do people think about the new product?
• How is consumer confidence? Is despair increasing?
• What do people think about this issue?
• Predict market trends and sentiments!
• Challenges
• How to handle negation
• I didn’t like this product
• I didn’t say that I didn’t like this product
• Sarcasm
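A minimal sketch of lexicon-based sentiment scoring with NLTK's VADER (requires the 'vader_lexicon' resource), illustrating that VADER handles simple negation:

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores('I like this product'))         # positive compound score
print(sia.polarity_scores("I didn't like this product"))  # negation flips the score

Double negation and sarcasm, as noted above, remain hard for lexicon-based scoring.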
Text Classification
• Classification, or the supervised learning mechanism in machine learning, is the process of learning
concepts or patterns that generalize the relationship between the dependent and independent variables,
given labeled or annotated data.
• A typical text classification task can be understood with examples like
• To classify opinions (positive or negative)
• To classify documents by authors
• To classify documents by topics
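A minimal sketch of the opinion-classification example with a CountVectorizer + Naive Bayes pipeline in sklearn; the toy training data are illustrative:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ['great product, loved it', 'terrible, waste of money',
         'really good quality', 'awful experience']
labels = ['positive', 'negative', 'positive', 'negative']

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(['loved the quality']))  # ['positive']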
Topic Modeling using Latent Dirichlet Allocation
• Topic modeling is a form of unsupervised learning that identifies hidden relationships in data. Being unsupervised, it doesn’t need
labeled data and can be applied directly to a set of text documents to extract information. Topic modeling works in an exploratory manner, looking
for the themes (or topics) that lie within a set of text data, and discovers them using a probabilistic framework that infers the themes
based on the words observed in the documents.
• Topic modeling is a versatile way of making sense of an unstructured collection of text documents. It can be used to automate the process
of sifting through large volumes of text data and to help organize and understand them. Once key topics are discovered, text documents
can be grouped for further analysis, to identify trends, for instance, or as a form of classification.
• Unsupervised learning is extensively employed in the area of text data mainly because of the difficulty of
obtaining labeled text data. Unlike numerical data (such as house prices, stock data, and online click
streams), labeling text can sometimes be subjective, manual, and tedious. Unsupervised learning
algorithms that do not require labels become effective when it comes to mining text data.
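A minimal sketch of LDA topic discovery with sklearn; the four toy documents and the choice of two topics are illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ['cats and dogs are pets', 'dogs chase cats',
        'stocks and markets rise', 'markets fall on bad news']
vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)
terms = vec.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    print(f'Topic {idx}:', [terms[i] for i in topic.argsort()[-3:]])  # top 3 words per topic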
Retrieve Data from Social Media - Reddit
• Register the Reddit account
• pip install praw
• Create a reddit app
https://www.reddit.com/prefs/apps
Use the client_id='client_id' and
client_secret='client_secret' keys to retrieve data
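A minimal sketch of fetching posts with PRAW; the credential strings are placeholders for the values generated at the app-registration step, and the subreddit name is illustrative:

import praw

reddit = praw.Reddit(client_id='client_id',
                     client_secret='client_secret',
                     user_agent='text-analytics-demo')  # user_agent is required by the Reddit API

for post in reddit.subreddit('python').hot(limit=5):
    print(post.title)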
Retrieve Data from Social Media - Twitter
• As with Reddit, first create a developer
account and then an app to generate
credentials used to request data from the
API.
• Twitter provides Consumer Keys and
Access Tokens. API v2 also allows
fetching data using a bearer token
and client ID.
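A minimal sketch of the API v2 bearer-token flow using the tweepy library (an assumption; the slides do not name a client library), with a placeholder credential and an illustrative query:

import tweepy

client = tweepy.Client(bearer_token='bearer_token')  # placeholder credential
resp = client.search_recent_tweets(query='text analytics', max_results=10)
for tweet in resp.data or []:
    print(tweet.text)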
Use of Text Analytics for Social Media and Web
• Text Analytics for Social Media and Web data can help you
• Identify sentiments of users/customers/stakeholders
• Organize campaign/event/hashtag data using clustering
• Target audiences based on classification of the data
• Automate some of the frequent analyses to understand behavioral changes of users
• Identify the topics being discussed or mentioned
• Represent large information with visualization for quicker and effective information assimilation