Pipeline
1. Data Collection:
This is the first step where you gather text data from various sources such as web scraping,
APIs, databases, or text files.
a. Data is available:
- Data is already on hand: carry on.
- Data is in some database: connect to the database and collect the data.
- Too little data: do data augmentation, e.g. synonym replacement, back translation (translate to
another language and translate back), or bigram flip (swapping the order of two adjacent words);
a small sketch follows this section.
b. Others:
- Public datasets, web scraping, APIs, or extracting text from PDFs, images, and audio (using tools).
c. Nothing:
- Contact the product team, run a survey to collect data, and label the data (e.g. for sentiment).
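A minimal sketch of the synonym-replacement idea mentioned above, using NLTK's WordNet; the sample sentence and the choice of how many words to replace are illustrative assumptions, and back translation would normally go through a separate translation library or API.

```python
# Minimal synonym-replacement augmentation sketch using NLTK WordNet.
import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def synonym_replace(sentence, n_replacements=1):
    """Replace up to n words in the sentence with a random WordNet synonym."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for idx in candidates[:n_replacements]:
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(words[idx])
                  for l in s.lemmas()} - {words[idx]}
        if lemmas:
            words[idx] = random.choice(sorted(lemmas))
    return " ".join(words)

print(synonym_replace("the movie was good and the acting was great"))
```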
2. Text Preprocessing:
a. Cleaning (a small sketch follows this list):
- Remove URLs and HTML tags using regex.
- Emojis: convert them to a machine-readable form in Python (Unicode/UTF-8 encoding) or strip them.
- Spelling correction (chat text such as "u" for "you") using TextBlob's correct().
- Expand short forms to full forms (chat-abbreviation dictionaries are available on GitHub).
- Remove punctuation with string.punctuation.
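A rough cleaning sketch covering the steps above (regex for URLs and HTML tags, punctuation removal, short-form expansion, TextBlob spelling correction); the example string and the tiny short-form dictionary are made up for illustration.

```python
# Rough text-cleaning sketch: URLs, HTML tags, punctuation, short forms, spelling.
import re
import string
from textblob import TextBlob  # assumes textblob is installed

SHORT_FORMS = {"u": "you", "gr8": "great"}  # tiny illustrative mapping

def clean(text):
    text = re.sub(r"https?://\S+", " ", text)          # remove URLs
    text = re.sub(r"<.*?>", " ", text)                 # remove HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    # emojis could be handled here too, e.g. by stripping or demojizing them
    words = [SHORT_FORMS.get(w.lower(), w) for w in text.split()]     # expand short forms
    return str(TextBlob(" ".join(words)).correct())    # spelling correction

print(clean("I watchd <b>this</b> movie, it was gr8! https://example.com"))
```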
b. Basic preprocessing:
i. Tokenization:
1. Sentence tokenization – using NLTK, split a paragraph into sentences based on ".".
2. Word tokenization – using NLTK, split a sentence into words based on " ".
ii. It can be done by splitting, but the problem is that, for example, "delhi!" cannot be split into
"delhi" and "!".
iii. It can also be done with regex, but a lot of patterns have to be written.
iv. The next option is NLTK, which is better, but spaCy does a lot better job on tricky tokens
(e.g. email addresses such as gmail IDs).
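A quick look at the NLTK tokenizers mentioned above; assumes the 'punkt' tokenizer models have been downloaded (newer NLTK versions may also want 'punkt_tab').

```python
# Sentence and word tokenization with NLTK.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)

para = "I live in Delhi! Delhi is crowded. I love it."
print(sent_tokenize(para))      # ['I live in Delhi!', 'Delhi is crowded.', 'I love it.']
print(word_tokenize("delhi!"))  # ['delhi', '!'] -- unlike a plain split()
```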
Why is tokenization important?
Text Segmentation: Tokenization segments continuous text into discrete units, allowing us to
process and analyze text at a more granular level.
Feature Extraction: Many NLP techniques and models, such as Bag-of-Words (BoW) and
Word Embeddings, require text to be represented in the form of individual tokens. Tokenization
enables the extraction of these features, making them more meaningful and interpretable.
Vocabulary Creation: Tokenization helps build a vocabulary or a list of unique words or
subwords present in the text corpus. The vocabulary is essential for various tasks, such as
creating word embeddings or mapping words to numerical representations.
Word-Level Analysis: Tokenization allows us to perform word-level analysis, such as word
frequency counting, sentiment analysis, or language model training.
Machine Learning Input: Many NLP models and algorithms require text data to be converted
into numerical representations. Tokenization is the initial step in preparing text data for input
into these models.
N-gram Generation: Tokenization can help generate N-grams, which are contiguous
sequences of tokens. N-grams are used in various NLP tasks, such as language modeling and
information retrieval.
Efficient Memory Usage: Tokenization helps manage memory usage when working with large
text datasets. By representing text as tokens, we reduce the memory footprint and make it easier
to process text data in chunks.
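For the N-gram Generation point above, a minimal sketch using nltk.util.ngrams; the sentence is illustrative.

```python
# Generating bigrams from tokens with NLTK.
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("I play football on weekends")
print(list(ngrams(tokens, 2)))
# [('I', 'play'), ('play', 'football'), ('football', 'on'), ('on', 'weekends')]
```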
i. Lowercasing: Convert all text to lowercase to ensure case-insensitivity during analysis.
ii. Stop Word Removal: Remove common words (e.g., "the," "is," "and") that do not carry
significant meaning in the context of the analysis.
iii. Stemming or Lemmatization: Reduce words to their base or root form. For example, stemming
would reduce "running" and "runs" to "run" (though it may leave "ran" as is), while lemmatization
with the right POS maps all three to "run."
a. Stemming
i. Stemming is used to reduce inflection (different grammatical forms, e.g. walk,
walked, walking); it is used, for example, in information retrieval systems.
ii. The Porter stemmer is used for English; for other languages, the Snowball
stemmer is used for stemming.
iii. It might give a word which has no meaning or is not a valid word.
b. Lemmatization (lemma)
i. Lemmatization will always return a valid word.
ii. It is comparatively slow because it is lexicon based and searches a dictionary,
unlike stemming, which is algorithm based.
iii. It also makes use of POS information.
iv. Used when we have to show the output to the user, e.g. in a chatbot.
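A sketch contrasting the steps above: lowercasing, stop-word removal, Porter stemming, and WordNet lemmatization with a verb POS hint; assumes the relevant NLTK corpora have been downloaded, and the sentence is illustrative.

```python
# Lowercase -> remove stop words -> compare stemming vs lemmatization.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

for pkg in ("stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

words = "He is Running and Ran while the others walked".lower().split()
words = [w for w in words if w not in stopwords.words("english")]

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(w) for w in words])                   # 'running' -> 'run', but 'ran' stays 'ran'
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # 'ran' -> 'run' with the verb POS hint
```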
c. Advanced preprocessing:
i. Part-of-speech (POS) tagging: tag each word with its POS for better
understanding.
ii. Parsing – to understand the syntactic structure of the sentence.
iii. Coreference resolution: linking mentions of the same entity, e.g. "I" refers to Mahesh
in "I am Mahesh".
NOTE: all of these are done when we have not removed stop words. Basically, this is for chatbot
applications.
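A hedged sketch of POS tagging and dependency parsing with spaCy, assuming the en_core_web_sm model is installed; coreference resolution is not part of the base spaCy pipeline and usually needs an extra component, so it is only noted in a comment.

```python
# POS tagging and dependency parsing on text that still keeps its stop words.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm
doc = nlp("I am Mahesh and I work in Delhi")

for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Coreference resolution (linking "I" to "Mahesh") needs an additional
# component/library on top of this pipeline; it is not shown here.
```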
3. Feature Extraction (text representation, text vectorization): extracting numbers from text, like counts
of positive, negative, and neutral words, etc.
Machine learning approach: we need to create the features ourselves. The advantage is that model
interpretability is not lost; the disadvantage is that we need domain knowledge, and the model can
go wrong if the features we have extracted are poor.
- Feature extraction is done to get valid output (garbage in, garbage out).
- It is difficult because we need to capture the semantic meaning of the text, unlike with images
and audio.
i. Techniques used
- OHE (one-hot encoding):
First collect the unique words of the corpus, i.e. the vocabulary, then convert each
word of the document into a V-dimensional vector.
- max_features: used to remove rare words; it selects features based on how frequently
words occur.
- N-grams: here we take more than one word per vocabulary entry, e.g. a bigram means a token is
formed by combining two consecutive words (a vectorization sketch follows below).
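A small sketch of count-based vectorization with scikit-learn that shows the max_features and n-gram ideas above; the three-document corpus is made up, and binary=True makes it behave like a per-document presence/one-hot encoding.

```python
# Bag-of-words style vectorization with a capped vocabulary and word bigrams.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the movie was good",
    "the movie was bad",
    "good acting good story",
]

cv = CountVectorizer(max_features=10,      # keep only the 10 most frequent terms
                     ngram_range=(1, 2),   # unigrams and bigrams, e.g. "good acting"
                     binary=True)          # presence/absence, i.e. one-hot per document
X = cv.fit_transform(corpus)

print(cv.get_feature_names_out())  # sklearn >= 1.0
print(X.toarray())
```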
Say we have to create word2vec from scratch; the idea here is to capture the
relationships between words.
Say we have a vocabulary like King, Queen, Man, Woman, Monkey.
Now we create our own features, like gender, wealth, etc., and build the
vectors manually so that they capture the meaning.
Once we have the vectors, we can do vector operations: king – man + woman gives a vector
close to queen. Similarly, we can derive the meaning of new words by doing such
operations and looking at the closest results.
This is how the model works and captures semantic relationships.
Here we have just 5 vocabulary words, but the pretrained gensim (Google News) model has
about 3 million (30 lakh) words and phrases with 300 features (columns), which cannot be hand-crafted.
It is done by a neural network, but we do not get interpretable feature names as in the
king/queen example; they are just f1, f2, ... with their values.
So this gensim model has learned 300 features for about 3 million words and phrases.
So now we can do vector operations and get the closest or most similar results.
Now, on what basis will deep learning create these features?
The underlying assumption of word2vec is that two words sharing similar
contexts also share a similar meaning and, consequently, a similar vector
representation in the model.
For example: "I play football", "I play hockey". The model learns that
these contexts are similar and forms similar vector representations.
- Two types of architecture in word2vec:
a. CBOW – Continuous Bag of Words
b. Skip-gram
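A sketch of both sides discussed above, assuming gensim 4.x: loading the pretrained Google News vectors (the roughly 3-million-word, 300-dimension model) for the king – man + woman example, and training a tiny CBOW/skip-gram model on a toy corpus. The toy sentences are illustrative, and fetching the pretrained vectors is a large one-off download.

```python
# Word2Vec two ways: pretrained Google News vectors, and a tiny model trained from scratch.
import gensim.downloader as api
from gensim.models import Word2Vec

# 1) Pretrained vectors (300 features, ~3 million words/phrases).
wv = api.load("word2vec-google-news-300")  # large download on first use
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# 2) Training from scratch on a toy corpus (gensim 4.x API).
sentences = [["i", "play", "football"],
             ["i", "play", "hockey"],
             ["king", "queen", "man", "woman"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=0)                       # sg=0 -> CBOW, sg=1 -> skip-gram
print(model.wv["football"][:5])              # first few learned feature values (f1, f2, ...)
```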