Pipeline
1. Data Collection:
This is the first step where you gather text data from various sources such as web scraping,
APIs, databases, or text files.
a. Data is available:
- Data is already on hand: carry on.
- Data is in some database: connect to the database and collect the data.
- Too little data: do data augmentation, e.g. synonym replacement, back translation (translate to
another language and translate back), or bigram flip (swapping the order of two adjacent words);
a small sketch follows this section.
b. Others:
- Public datasets, web scraping, APIs, or extracting text from PDFs, images, and audio (using tools).
c. Nothing:
- Contact the product team, run a survey to collect data, and label the data (e.g. for sentiment).
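A minimal sketch of the synonym-replacement idea mentioned above, using NLTK's WordNet; the sample sentence and the choice of how many words to replace are illustrative assumptions, and back translation would normally go through a separate translation library or API.

```python
# Minimal synonym-replacement augmentation sketch using NLTK WordNet.
import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def synonym_replace(sentence, n_replacements=1):
    """Replace up to n words in the sentence with a random WordNet synonym."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for idx in candidates[:n_replacements]:
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(words[idx])
                  for l in s.lemmas()} - {words[idx]}
        if lemmas:
            words[idx] = random.choice(sorted(lemmas))
    return " ".join(words)

print(synonym_replace("the movie was good and the acting was great"))
```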
2. Text Preprocessing:
a. Cleaning (a small sketch follows this list):
- Remove URLs and HTML tags using regex.
- Emojis: convert them to a machine-readable form in Python (Unicode/UTF-8 encoding) or strip them.
- Spelling correction (chat text such as "u" for "you") using TextBlob's correct().
- Expand short forms to full forms (chat-abbreviation dictionaries are available on GitHub).
- Remove punctuation with string.punctuation.
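A rough cleaning sketch covering the steps above (regex for URLs and HTML tags, punctuation removal, short-form expansion, TextBlob spelling correction); the example string and the tiny short-form dictionary are made up for illustration.

```python
# Rough text-cleaning sketch: URLs, HTML tags, punctuation, short forms, spelling.
import re
import string
from textblob import TextBlob  # assumes textblob is installed

SHORT_FORMS = {"u": "you", "gr8": "great"}  # tiny illustrative mapping

def clean(text):
    text = re.sub(r"https?://\S+", " ", text)          # remove URLs
    text = re.sub(r"<.*?>", " ", text)                 # remove HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    # emojis could be handled here too, e.g. by stripping or demojizing them
    words = [SHORT_FORMS.get(w.lower(), w) for w in text.split()]     # expand short forms
    return str(TextBlob(" ".join(words)).correct())    # spelling correction

print(clean("I watchd <b>this</b> movie, it was gr8! https://example.com"))
```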
b. Basic preprocessing:
i. Tokenization:
1. Sentence tokenization – using NLTK, split a paragraph into sentences based on ".".
2. Word tokenization – using NLTK, split a sentence into words based on " ".
ii. It can be done by splitting, but the problem is that, for example, "delhi!" cannot be split into
"delhi" and "!".
iii. It can also be done with regex, but a lot of patterns have to be written.
iv. The next option is NLTK, which is better, but spaCy does a lot better job on tricky tokens
(e.g. email addresses such as gmail IDs).
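A quick look at the NLTK tokenizers mentioned above; assumes the 'punkt' tokenizer models have been downloaded (newer NLTK versions may also want 'punkt_tab').

```python
# Sentence and word tokenization with NLTK.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)

para = "I live in Delhi! Delhi is crowded. I love it."
print(sent_tokenize(para))      # ['I live in Delhi!', 'Delhi is crowded.', 'I love it.']
print(word_tokenize("delhi!"))  # ['delhi', '!'] -- unlike a plain split()
```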
Why is tokenization important?
Text Segmentation: Tokenization segments continuous text into discrete units, allowing us to
process and analyze text at a more granular level.
Feature Extraction: Many NLP techniques and models, such as Bag-of-Words (BoW) and
Word Embeddings, require text to be represented in the form of individual tokens. Tokenization
enables the extraction of these features, making them more meaningful and interpretable.
Vocabulary Creation: Tokenization helps build a vocabulary or a list of unique words or
subwords present in the text corpus. The vocabulary is essential for various tasks, such as
creating word embeddings or mapping words to numerical representations.
Word-Level Analysis: Tokenization allows us to perform word-level analysis, such as word
frequency counting, sentiment analysis, or language model training.
Machine Learning Input: Many NLP models and algorithms require text data to be converted
into numerical representations. Tokenization is the initial step in preparing text data for input
into these models.
N-gram Generation: Tokenization can help generate N-grams, which are contiguous
sequences of tokens. N-grams are used in various NLP tasks, such as language modeling and
information retrieval.
Efficient Memory Usage: Tokenization helps manage memory usage when working with large
text datasets. By representing text as tokens, we reduce the memory footprint and make it easier
to process text data in chunks.
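For the N-gram Generation point above, a minimal sketch using nltk.util.ngrams; the sentence is illustrative.

```python
# Generating bigrams from tokens with NLTK.
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("I play football on weekends")
print(list(ngrams(tokens, 2)))
# [('I', 'play'), ('play', 'football'), ('football', 'on'), ('on', 'weekends')]
```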
i. Lowercasing: Convert all text to lowercase to ensure case-insensitivity during analysis.
ii. Stop Word Removal: Remove common words (e.g., "the," "is," "and") that do not carry
significant meaning in the context of the analysis.
iii. Stemming or Lemmatization: Reduce words to their base or root form. For example, stemming
would reduce "running" and "runs" to "run" (though it may leave "ran" as is), while lemmatization
with the right POS maps all three to "run."
a. Stemming
i. Stemming is used to reduce inflection (different grammatical forms, e.g. walk,
walked, walking); it is used, for example, in information retrieval systems.
ii. The Porter stemmer is used for English; for other languages, the Snowball
stemmer is used for stemming.
iii. It might give a word which has no meaning or is not a valid word.
b. Lemmatization (lemma)
i. Lemmatization will always return a valid word.
ii. It is comparatively slow because it is lexicon based and searches a dictionary,
unlike stemming, which is algorithm based.
iii. It also makes use of POS information.
iv. Used when we have to show the output to the user, e.g. in a chatbot.
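A sketch contrasting the steps above: lowercasing, stop-word removal, Porter stemming, and WordNet lemmatization with a verb POS hint; assumes the relevant NLTK corpora have been downloaded, and the sentence is illustrative.

```python
# Lowercase -> remove stop words -> compare stemming vs lemmatization.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

for pkg in ("stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

words = "He is Running and Ran while the others walked".lower().split()
words = [w for w in words if w not in stopwords.words("english")]

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(w) for w in words])                   # 'running' -> 'run', but 'ran' stays 'ran'
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # 'ran' -> 'run' with the verb POS hint
```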
c. Advanced preprocessing:
i. Part-of-speech (POS) tagging: tag each word with its POS for better
understanding.
ii. Parsing – to understand the syntactic structure of the sentence.
iii. Coreference resolution: linking mentions of the same entity, e.g. "I" refers to Mahesh
in "I am Mahesh".
NOTE: all of these are done when we have not removed stop words. Basically, this is for chatbot
applications.
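A hedged sketch of POS tagging and dependency parsing with spaCy, assuming the en_core_web_sm model is installed; coreference resolution is not part of the base spaCy pipeline and usually needs an extra component, so it is only noted in a comment.

```python
# POS tagging and dependency parsing on text that still keeps its stop words.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm
doc = nlp("I am Mahesh and I work in Delhi")

for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Coreference resolution (linking "I" to "Mahesh") needs an additional
# component/library on top of this pipeline; it is not shown here.
```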
3. Feature Extraction (text representation, text vectorization): extracting numbers from text, like counts
of positive, negative, and neutral words, etc.
Machine learning approach: we need to create the features ourselves. The advantage is that model
interpretability is not lost; the disadvantage is that we need domain knowledge, and the model can
go wrong if the features we have extracted are poor.
- Feature extraction is done to get valid output (garbage in, garbage out).
- It is difficult because we need to capture the semantic meaning of the text, unlike with images
and audio.
i. Techniques used
- OHE (one-hot encoding):
First collect the unique words of the corpus, i.e. the vocabulary, then convert each
word of the document into a V-dimensional vector.
- max_features: used to remove rare words; it selects features based on how frequently
words occur.
- N-grams: here we take more than one word per vocabulary entry, e.g. a bigram means a token is
formed by combining two consecutive words (a vectorization sketch follows below).
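A small sketch of count-based vectorization with scikit-learn that shows the max_features and n-gram ideas above; the three-document corpus is made up, and binary=True makes it behave like a per-document presence/one-hot encoding.

```python
# Bag-of-words style vectorization with a capped vocabulary and word bigrams.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the movie was good",
    "the movie was bad",
    "good acting good story",
]

cv = CountVectorizer(max_features=10,      # keep only the 10 most frequent terms
                     ngram_range=(1, 2),   # unigrams and bigrams, e.g. "good acting"
                     binary=True)          # presence/absence, i.e. one-hot per document
X = cv.fit_transform(corpus)

print(cv.get_feature_names_out())  # sklearn >= 1.0
print(X.toarray())
```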
Say we have to create word2vec from scratch; the idea here is to capture the
relationships between words.
Say we have a vocabulary like King, Queen, Man, Woman, Monkey.
Now we create our own features, like gender, wealth, etc., and build the
vectors manually so that they capture the meaning.
Once we have the vectors, we can do vector operations: king – man + woman gives a vector
close to queen. Similarly, we can derive the meaning of new words by doing such
operations and looking at the closest results.
This is how the model works and captures semantic relationships.
Here we have just 5 vocabulary words, but the pretrained gensim (Google News) model has
about 3 million (30 lakh) words and phrases with 300 features (columns), which cannot be hand-crafted.
It is done by a neural network, but we do not get interpretable feature names as in the
king/queen example; they are just f1, f2, ... with their values.
So this gensim model has learned 300 features for about 3 million words and phrases.
So now we can do vector operations and get the closest or most similar results.
Now, on what basis will deep learning create these features?
The underlying assumption of word2vec is that two words sharing similar
contexts also share a similar meaning and, consequently, a similar vector
representation in the model.
For example: "I play football", "I play hockey". The model learns that
these contexts are similar and forms similar vector representations.
- Two types of architecture in word2vec:
a. CBOW – Continuous Bag of Words
b. Skip-gram
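A sketch of both sides discussed above, assuming gensim 4.x: loading the pretrained Google News vectors (the roughly 3-million-word, 300-dimension model) for the king – man + woman example, and training a tiny CBOW/skip-gram model on a toy corpus. The toy sentences are illustrative, and fetching the pretrained vectors is a large one-off download.

```python
# Word2Vec two ways: pretrained Google News vectors, and a tiny model trained from scratch.
import gensim.downloader as api
from gensim.models import Word2Vec

# 1) Pretrained vectors (300 features, ~3 million words/phrases).
wv = api.load("word2vec-google-news-300")  # large download on first use
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# 2) Training from scratch on a toy corpus (gensim 4.x API).
sentences = [["i", "play", "football"],
             ["i", "play", "hockey"],
             ["king", "queen", "man", "woman"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=0)                       # sg=0 -> CBOW, sg=1 -> skip-gram
print(model.wv["football"][:5])              # first few learned feature values (f1, f2, ...)
```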