
Text Analytics

FOR SOCIAL MEDIA AND WEB DATA


Image source: https://en.wikipedia.org/wiki/p%c4%81%e1%b9%87ini#/media/file:birch_bark_ms_from_kashmir_of_the_rupavatra_wellcome_l0032691.jpg

Pāṇini is known for his rules on linguistics, syntax and semantics; his foundational work on Vyākaraṇa (grammar) became the basis of modern linguistic applications of AI
(Russell & Norvig, Artificial Intelligence: A Modern Approach, Pearson, 2022)
History
• Context-free grammars – first discovered by Pāṇini (c. 350 BCE) while studying Sanskrit
• Reinvented by Noam Chomsky (1956) for the analysis of English in his book Syntactic Structures
• Chomsky's theory – based on syntactic models going back to the Indian linguist Pāṇini (c. 350 BCE) – could explain how children understand and make up sentences they have never heard before, addressed the notion of creativity, and was formal enough that it could in principle be programmed

• N-gram letter models – by Markov (1913)


• Shannon and Weaver (1949) – of communication-model fame – generated n-gram word models of English
• Linguist Zellig Harris (1954) – "language is not merely a bag of words but a tool with particular properties."
• David Blei, Andrew Ng and Michael I. Jordan (2003) applied the probabilistic Latent Dirichlet Allocation model to machine learning – a text model which views each document as a mixture of topics
• WordNet (Fellbaum, 2001) – a publicly available dictionary of about 100,000 words and phrases, categorized into parts of speech and linked by semantic relations such as synonym, antonym, and part-of
Natural Language Processing (NLP) notable tasks
• Speech recognition is the task of transforming spoken sound into text.
• Text-to-speech synthesis is the inverse process—going from text to sound.
• Machine translation transforms text in one language to another.
• Information extraction is the process of acquiring knowledge by skimming a text and looking for occurrences of particular classes of objects and for relationships among them.
• Information retrieval is the task of finding documents that are relevant and important for a given query. The Google and Baidu search engines perform this task billions of times a day.
• Question answering is the task where the query is a question and the response is an answer.
• Large Language Models (LLMs) work by taking an input text and repeatedly predicting the next token or word; they can make use of prompt engineering. Examples are GPT-3.5 and GPT-4 in ChatGPT, Google's PaLM (used in Bard), and Meta's LLaMA, as well as BLOOM, Ernie 3.0 Titan, and Claude.
Different terminologies
• Computational Linguistics – the scientific study of language from a computational perspective; in the modern sense the term has become nearly synonymous with NLP

• Natural Language Processing (NLP) – the science of computation whose subject matter is data structures and algorithms for computer processing of human language

• Language Engineering – building NLP tools whose costs and outputs are measurable and predictable

• Text Analytics/Text Mining – the process of deriving meaningful, valuable and useable insights from textual data
Text Analytics Challenges
• Huge Data - Volumes
• Real-time - Velocity
• Largely Unstructured and noisy data - Variety
• Reliability of information source questionable
• Data accessibility and storage
• Ambiguity and context sensitivity
  • automobile = car = vehicle = Toyota
  • Apple (the company) or apple (the fruit)
Source: https://web.stanford.edu/c
Text Analytics Process

Core Text Analytics Processes
[Flow diagram:]
Data Collection → Data Cleaning and Pre-processing → Text Vectorization → Exploratory Analysis and Visualization
Supporting inputs: Domain Knowledge, Corpus Creation
Text Data Collection
• Social Media platforms; Web properties, Websites, Apps
• External data: Blogs, Industry Reports, Policy documents, etc.
• Internal data: Annual reports, Newsletters, Emails, Whitepapers
• Text Data Quality and Cleaning
• Document Tagging/Labelling; Document Standardization

The notebook Building Corpus from documents.ipynb demonstrates how a pdf, a word document and a text file are combined to create a plain text corpus, which is then analyzed using nltk (natural language toolkit) in Python.
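A minimal sketch of such a corpus build (assuming the pypdf and python-docx packages; the file names are placeholders):

import re
from pypdf import PdfReader   # pip install pypdf
from docx import Document     # pip install python-docx

corpus_parts = []

# extract text from a pdf (placeholder file name)
reader = PdfReader('report.pdf')
corpus_parts.append(' '.join(page.extract_text() or '' for page in reader.pages))

# extract text from a word document (placeholder file name)
doc = Document('notes.docx')
corpus_parts.append(' '.join(p.text for p in doc.paragraphs))

# read a plain text file (placeholder file name)
with open('article.txt', encoding='utf-8') as f:
    corpus_parts.append(f.read())

plain_corpus = '\n'.join(corpus_parts)   # single plain-text corpus, ready for nltk analysis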
Data Cleaning
• Text cleaning is source/data dependent and can involve the following cleaning steps
• Symbols, URLs, Special characters
• Case Normalization
• Meta Data
• White Spaces

Both Python in general and sklearn's CountVectorizer provide text cleaning as required by the analysis. For example, in the Document Clustering notebook you can see a function similar to the one below used for cleaning the text data.

import re
from nltk.corpus import stopwords   # assumes the nltk stopwords corpus is downloaded

stop_words1 = set(stopwords.words('english'))   # extend with domain-specific words as needed

def text_cleaning(text):
    text = re.sub(r'\b\d+\b', '', text)   # remove standalone numbers
    text = re.sub(r'\s+', ' ', text)      # collapse extra whitespace
    text = text.lower()                   # convert to lowercase
    words = [word for word in text.split() if word not in stop_words1]   # remove stopwords
    return ' '.join(words)
Text Pre-processing
• Tokenization
• Breaking the text into sentences and words
• Token Feature - capitalization, inclusion of digits, punctuation, special characters etc.
• Challenges
  • Bank's rescue → <Banks>; <Bank, 's>; <Bank's>; <Bank>
  • What're, can't → <What, are>; <What, 're>; <What're>
  • Not-for-profit
  • Prime Minister
• Domain Knowledge based Normalization
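A minimal sketch of sentence and word tokenization with nltk (assuming the punkt tokenizer data has been downloaded):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt', quiet=True)   # tokenizer models, one-time download

text = "Bank's rescue was announced. What're the terms for the not-for-profit?"
print(sent_tokenize(text))   # break the text into sentences
print(word_tokenize(text))   # break into word tokens, e.g. "What're" -> ['What', "'re"]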
Text Pre-processing
• Stemming – extracting the small meaningful units that make up words, called stems
  • Stems, affixes, prefixes
  • Porter's algorithm

• Lemmatization – consulting a dictionary to understand the meaning of a word before reducing it to its root word, called the lemma

• In Python, using sklearn, we can define tokenizer classes and use them in CountVectorizer, as in the sketch below
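A minimal sketch (assuming nltk's punkt and wordnet data are available) of a lemmatizing tokenizer class plugged into sklearn's CountVectorizer, contrasted with Porter stemming:

import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('punkt', quiet=True)     # tokenizer models
nltk.download('wordnet', quiet=True)   # lemma dictionary

class LemmaTokenizer:
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()
    def __call__(self, doc):
        # look up each token's lemma in the WordNet dictionary
        return [self.lemmatizer.lemmatize(t) for t in nltk.word_tokenize(doc)]

docs = ["The cats were chasing mice in the gardens"]
vectorizer = CountVectorizer(tokenizer=LemmaTokenizer())
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())   # 'cats' -> 'cat', 'mice' -> 'mouse'

print(PorterStemmer().stem('running'))      # stemming instead clips 'running' to the stem 'run'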
Stopwords
• Stopwords are a set of common words that are often removed from a text corpus before performing
various natural language processing tasks. These words are generally considered to have little or no
informational value and do not contribute significantly to the meaning of the text. Removing stopwords
helps to reduce the dimensionality of the data and improves the efficiency and accuracy of text processing
tasks.
• For example, from the wordcloud library in Python we get STOPWORDS as follows:
{'a', 'about', 'above', 'after', 'again', 'against', 'all', 'also', 'am', 'an', 'and', 'any', 'are', "aren't", 'as', 'at', 'be', 'because', 'been',
'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', "can't", 'cannot', 'com', 'could', "couldn't", 'did', "didn't", 'do', 'does',
"doesn't", 'doing', "don't", 'down', 'during', 'each', 'else', 'ever', 'few', 'for', 'from', 'further', 'get', 'had', "hadn't", 'has', "hasn't",
'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'hence', 'her', 'here', "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how',
"how's", 'however', 'http', 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's", 'its', 'itself', 'just', 'k',
"let's", 'like', 'me', 'more', 'most', "mustn't", 'my', 'myself', 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'otherwise',
'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'r', 'same', 'shall', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't",
'since', 'so', 'some', 'such', 'than', 'that', "that's", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', "there's", 'therefore',
'these', 'they', "they'd", "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very', 'was',
"wasn't", 'we', "we'd", "we'll", "we're", "we've", 'were', "weren't", 'what', "what's", 'when', "when's", 'where', "where's", 'which', 'while',
'who', "who's", 'whom', 'why', "why's", 'with', "won't", 'would', "wouldn't", 'www', 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours',
'yourself', 'yourselves'}

• You can add or delete, and in general customize, stopwords for your requirements (for example, in Basics of CountVectorizer we use stop_words1 = ['kumar', 'jiter'] + list_stop_words)
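A minimal sketch of that customization (list_stop_words here stands in for any base list; the wordcloud STOPWORDS set is used as the base):

from wordcloud import STOPWORDS
from sklearn.feature_extraction.text import CountVectorizer

list_stop_words = list(STOPWORDS)                     # base stopword list
stop_words1 = ['kumar', 'jiter'] + list_stop_words    # add custom words

vectorizer = CountVectorizer(stop_words=stop_words1)
vectorizer.fit(['kumar and jiter are studying text analytics'])
print(vectorizer.get_feature_names_out())   # custom words are excluded from the vocabulary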
Incorporating Domain Knowledge
• Linguistic Knowledge
• Dictionary
• WordNet – available in NLTK, sklearn and other libraries in python
• https://wordnet.princeton.edu/
• Build your own domain dictionary
• Sentiment Lexicon
  • VADER in nltk
• List of punctuation of the language, list of stopwords
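A minimal sketch of querying WordNet's semantic relations through nltk (assuming the wordnet corpus has been downloaded):

import nltk
from nltk.corpus import wordnet

nltk.download('wordnet', quiet=True)   # one-time download

for syn in wordnet.synsets('car')[:2]:
    print(syn.name(), '-', syn.definition())                        # each word sense
    print('  synonyms:', syn.lemma_names())                         # e.g. 'auto', 'automobile'
    print('  part-of :', [m.name() for m in syn.part_meronyms()])   # parts of a car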


Text Vectorization
• Processing the unstructured body of text into a fixed-length structured format based on a set of features that capture the "valuable information", i.e. what is necessary and sufficient to conduct the given task
• For instance: The earliest known human remains in South Asia date to about 30,000 years
ago.[26] Nearly contemporaneous human rock art sites have been found in many
parts of the Indian subcontinent, including at the Bhimbetka rock shelters in
Madhya Pradesh.[27] After 6500 BCE, evidence for domestication of food crops and
animals, construction of permanent structures, and storage of agricultural surplus,
appeared in Mehrgarh and other sites in what is now Balochistan.[28] These
gradually developed into the Indus Valley Civilisation,[29][28] the first urban
culture in South Asia,[30] which flourished during 2500–1900 BCE in what is now
Pakistan and western India.[31] Centred around cities such as Mohenjo-daro,
Harappa, Dholavira, and Kalibangan, and relying on varied forms of subsistence,
the civilization engaged robustly in crafts production and wide-ranging trade

Vectorization → Vector Space Model:

term:   earliest  human  remains  SouthAsia  India  construct  Pakistan  …  …  …
count:  1         2      1        2          2      1          1         2  2  1
Text Vectorization
• Vectorization is the general process of turning a collection of text documents into numerical feature vectors.
• Frequency encoding – weights are term frequencies – also called the bag-of-words approach
• One-hot encoding – weights indicate the presence or absence of a term
• Tf-idf vectorization (term frequency – inverse document frequency)
  • Let there be N documents to be analyzed; then
  • tf-idf(t, d) = tf(t, d) × idf(t)
  • Term frequency tf(t, d) = the number of times a term t occurs in a given document d
  • Inverse document frequency idf(t) = log(N / df(t)) + 1, where df(t) is the number of documents containing t – rare terms can also be important!
• See more at https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency

       f1    f2    f3    …    fn
d1     w11   w12   w13   …    w1n
d2     w21
d3     w31
…      …
dn     wn1               …    wnn

• The terms are the features and the documents are the records
• This is also called a term-document or document-term matrix
• Sometimes it is also called a text representation
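A minimal sketch using sklearn's TfidfVectorizer (note: sklearn's default idf is a smoothed variant of the formula above):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "dogs and cats are pets"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)           # document-term matrix of tf-idf weights
print(vectorizer.get_feature_names_out())    # the terms, i.e. the features f1 .. fn
print(X.toarray().round(2))                  # rows d1..dN, columns = term weights w_ij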
Text Vectorization
• N-gram model
  • unigram or 1-gram
  • bigram or 2-gram
  • both unigrams and bigrams together, as in the sketch below
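A minimal sketch of all three settings via CountVectorizer's ngram_range parameter:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['text analytics for social media']

for name, rng in [('unigram', (1, 1)), ('bigram', (2, 2)), ('uni+bi', (1, 2))]:
    vec = CountVectorizer(ngram_range=rng).fit(docs)
    print(name, '->', list(vec.get_feature_names_out()))   # vocabulary of n-grams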
Exploratory Analysis – Frequency Analysis
• Frequency analysis is useful for understanding vocabulary, data cleaning, text summarization, and identifying key terms, among other uses.
• The frequency of words/tokens is an important aspect of exploratory analysis, alongside wordclouds.
Exploratory Analysis
• Wordclouds
• A wordcloud is a visual representation of the
most frequent words in a text corpus, where
the size of each word is proportional to its
frequency. It provides a quick and intuitive way
to understand the key themes, topics, or
sentiments present in the text data.
• Useful for exploratory data analysis, data presentation, social media analysis, text summarization, sentiment analysis and topic modeling.
• While a wordcloud provides an overview of the most frequent words in a text corpus, it may not be suitable for in-depth analysis or for capturing the context of each word. For more advanced text analysis, other techniques such as topic modeling and sentiment analysis are needed.
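A minimal sketch of frequency analysis plus a wordcloud (assuming the wordcloud and matplotlib packages; the text is a placeholder):

from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = 'text analytics turns text data into insight insight insight'
print(Counter(text.split()).most_common(3))   # token frequency analysis

wc = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.imshow(wc, interpolation='bilinear')   # word size proportional to frequency
plt.axis('off')
plt.show()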
Text Analytics Core Tasks
• Information Extraction
• Text Classification
• Sentiment Analysis ( Using Lexicon or Text Classification)
• Document Clustering/Text Clustering
• Topic Detection
Information Extraction
• Information Extraction (IE) systems
  • Find and understand the limited relevant parts of texts
  • Gather information from many pieces of text
  • Produce a structured representation of the relevant information:
    • Concepts/Entities
    • Relations
    • A knowledge base
  • Goal: organize information in a useable structure
• IE systems extract clear, factual information
  • Who did what to whom?
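A minimal sketch of extracting entities ("who" and "whom") with nltk's named-entity chunker (assuming the listed nltk resources are downloaded):

import nltk

for pkg in ['punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words']:
    nltk.download(pkg, quiet=True)   # one-time downloads

sentence = 'Sundar Pichai leads Google in California.'
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))   # tokens -> POS tags -> entity chunks

for subtree in tree.subtrees():
    if subtree.label() != 'S':
        entity = ' '.join(word for word, tag in subtree.leaves())
        print(subtree.label(), '->', entity)   # e.g. PERSON -> Sundar Pichai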
Sentiment Analysis
• Use the VADER (Valence Aware Dictionary and sEntiment Reasoner) lexicon in nltk for lexicon-based sentiment analysis, or alternatively the NRC Emotion Lexicon or the SentiWordNet lexicon

[Diagram: Product Reviews → Product Aspect Sentiment]
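A minimal sketch of lexicon-based scoring with VADER in nltk (assuming the vader_lexicon has been downloaded):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon', quiet=True)   # one-time download

sia = SentimentIntensityAnalyzer()
for review in ['I love this product!', "I didn't like this product"]:
    print(review, '->', sia.polarity_scores(review))   # neg/neu/pos plus compound score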
Sentiment Analysis
• aka Opinion Extraction, Opinion Mining, Sentiment Mining, Subjectivity Analysis
• Why Sentiment Analysis?
• Is this review positive or negative?
• Is the market sentiment positive or negative?
• What do people think about the new product?
• How is consumer confidence? Is despair increasing?
• What do people think about this issue?
• Predict market trends and sentiments!

• Challenges
• How to handle negation
• I didn’t like this product
• I didn’t say that I didn’t like this product
• Sarcasm
Text Classification
• Classification, or the supervised learning mechanism in machine learning, is the process of learning concepts or patterns that generalize the relationship between the dependent and independent variables, given labeled or annotated data.
• A typical text classification task can be understood with examples like
• To classify opinions (positive or negative)
• To classify documents by authors
• To classify documents by topics
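A minimal sketch of a supervised text classification pipeline in sklearn (the tiny labeled dataset is purely illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ['great product, loved it', 'terrible, waste of money',
         'excellent quality and service', 'awful, broke in a day']
labels = ['positive', 'negative', 'positive', 'negative']   # the annotated data

model = make_pipeline(TfidfVectorizer(), MultinomialNB())   # vectorize, then classify
model.fit(texts, labels)                                    # learn the label/feature relationship
print(model.predict(['great quality']))                     # predicted label for unseen text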
Topic Modeling using Latent Dirichlet allocation
• Topic modeling is a form of unsupervised learning that identifies hidden relationships in data. Being unsupervised, topic modeling doesn’t need
labeled data. It can be applied directly to a set of text documents to extract information. Topic modeling works in an exploratory manner, looking
for the themes (or topics) that lie within a set of text data. It discovers topics using a probabilistic framework to infer the themes within the data
based on the words observed in the documents. Topic modeling is a versatile way of making sense of an unstructured collection of text
documents. It can be used to automate the process of sifting through large volumes of text data and help to organize and understand it. Once key
topics are discovered, text documents can be grouped for further analysis, to identify trends, for instance, or as a form of classification.

Example of topic modelling using LDA

Documents:
• doc1: apple banana
• doc2: apple orange
• doc3: banana orange
• doc4: tiger cat
• doc5: tiger dog
• doc6: cat dog

LDA (K = 2) infers the word-topic distribution:

Word      Topic 1   Topic 2
Apple     33%       0%
Banana    33%       0%
Orange    33%       0%
Tiger     0%        33%
Cat       0%        33%
Dog       0%        33%
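A minimal sketch reproducing this toy example with sklearn's LatentDirichletAllocation (exact proportions vary between runs):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ['apple banana', 'apple orange', 'banana orange',
        'tiger cat', 'tiger dog', 'cat dog']

vec = CountVectorizer()
X = vec.fit_transform(docs)                                       # document-term counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)   # K = 2 topics
lda.fit(X)

words = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    probs = weights / weights.sum()                   # word distribution per topic
    top = sorted(zip(words, probs), key=lambda p: -p[1])[:3]
    print(f'Topic {k + 1}:', [(w, round(p, 2)) for w, p in top])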
Document/Text Clustering
• Unsupervised learning has the freedom to discover hidden information underneath input data.
• Unsupervised learning can include the following types:
• Clustering: This means grouping data based on commonality, which is often used for exploratory data analysis. Clustering techniques are widely used in customer segmentation or for grouping similar online behaviors for a marketing campaign.
• Association: This explores the co-occurrence of particular values of two or more features. Outlier detection (also
called anomaly detection) is a typical case, where rare observations are identified.
• Dimensionality Reduction : This maps the original feature space to a reduced dimensional space retaining or
extracting a set of principal variables.
• Topic Modeling

• Unsupervised learning is extensively employed in the area of text data mainly because of the difficulty of
obtaining labeled text data. Unlike numerical data (such as house prices, stock data, and online click
streams), labeling text can sometimes be subjective, manual, and tedious. Unsupervised learning
algorithms that do not require labels become effective when it comes to mining text data.
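A minimal sketch of document clustering with tf-idf features and k-means (the documents are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ['stocks rallied on strong earnings', 'markets fell amid rate fears',
        'the team won the championship', 'the star striker scored twice']

X = TfidfVectorizer(stop_words='english').fit_transform(docs)   # vectorize the documents
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)     # group into 2 clusters
for label, doc in zip(km.labels_, docs):
    print(label, '-', doc)   # documents grouped by vector similarity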
Retrieve Data from Social Media - Reddit
• Register a Reddit account
• pip install praw
• Create a Reddit app at https://www.reddit.com/prefs/apps
• Use the client_id and client_secret keys from the app to retrieve data, as in the sketch below
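A minimal sketch using praw (replace the placeholder credentials with your own app's values):

import praw

reddit = praw.Reddit(
    client_id='client_id',           # placeholder: from your Reddit app
    client_secret='client_secret',   # placeholder: from your Reddit app
    user_agent='text-analytics-demo by u/your_username'
)

# fetch titles of current hot posts from a subreddit
for submission in reddit.subreddit('datascience').hot(limit=5):
    print(submission.title)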
Retrieve Data from Social Media - Twitter
• As with Reddit, first create a developer account and then an app to generate credentials for requesting data from the API.
• Twitter provides Consumer Keys and Access Tokens. The newer API v2 allows fetching data using a bearer token and client ID, as in the sketch below.
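A minimal sketch using the tweepy library against API v2 (the bearer token is a placeholder):

import tweepy

client = tweepy.Client(bearer_token='bearer_token')   # placeholder credential

# search recent tweets matching a query (API v2 endpoint)
response = client.search_recent_tweets(query='text analytics', max_results=10)
for tweet in response.data or []:
    print(tweet.text)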
Use of Text Analytics for Social Media and Web
• Text Analytics for Social Media and Web data can help you
• To identify sentiments of users/customers/stakeholders
• Organize the campaign/event/hashtag data using clustering
• Target based on classification of data
• Automate some of the frequently repeated analyses to understand behavioral changes of users
• Identify the topics being discussed or mentioned
• Represent large information with visualization for quicker and effective information assimilation
