Natural Language Processing Notes
MODULE-1
1950s-1960s: The origins of NLP can be traced back to the mid-20th century. During
this time, researchers like Alan Turing and John McCarthy laid the groundwork for
artificial intelligence (AI) and computational linguistics. Turing proposed the famous
Turing Test in 1950, which became a benchmark for evaluating a machine's ability to
exhibit intelligent behavior indistinguishable from a human.
1960s-1970s: Early NLP efforts focused on rule-based systems and symbolic
approaches. An earlier pioneering system was the Georgetown-IBM Experiment of
1954, which translated Russian sentences into English using an IBM 701 computer. In
the 1960s and 1970s, researchers like Roger Schank and Terry Winograd worked on
natural language understanding and dialogue systems, leading to developments such
as the SHRDLU program by Winograd.
1980s-1990s: This era witnessed advancements in statistical NLP and machine
learning techniques. The advent of probabilistic models like Hidden Markov Models
(HMMs) and the use of corpora for training and evaluation marked a shift towards
data-driven approaches. This statistical turn laid the groundwork for later systems such
as IBM's Watson, showcasing the potential of NLP in real-world applications.
2000s-Present: The 2000s saw a surge in research and development in NLP, fueled
by the availability of large datasets, computational power, and algorithmic
improvements. Key developments include the rise of deep learning methods such as
Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks,
and Transformer models like BERT and GPT. These models revolutionized tasks like
machine translation, sentiment analysis, question answering, and more.
Recent Trends: In recent years, NLP has witnessed rapid progress in areas like
transfer learning, pre-trained language models, and multimodal AI, where systems can
understand and generate text, speech, and images. Models like GPT-3, developed by
OpenAI, have showcased the power of large-scale language models in performing
diverse NLP tasks with human-like fluency.
Need of Natural Language Processing
Natural Language Processing (NLP) is a crucial field within artificial intelligence and
computational linguistics, serving various purposes and addressing specific needs.
Natural Language Processing (NLP) is broadly divided into two main components: Natural
Language Understanding (NLU) and Natural Language Generation (NLG).
Natural Language Understanding (NLU)
Natural Language Understanding focuses on interpreting and extracting meaning from human
language. The key tasks and techniques in NLU include:
a. Tokenization
It is the process of breaking down a text into smaller units called tokens (e.g., words, phrases,
symbols). It facilitates further analysis by providing the basic units of language.
b. Part-of-Speech (POS) Tagging
It identifies the grammatical parts of speech (e.g., nouns, verbs, adjectives) for each token in
a sentence. It helps in understanding the syntactic structure and meaning of sentences.
c. Named Entity Recognition (NER)
It identifies and classifies named entities (e.g., people, organizations, locations) within text.
It extracts important information and entities from text for further analysis.
e. Semantic Analysis
It derives the meaning of words and sentences, going beyond their grammatical structure.
f. Coreference Resolution
It identifies when different words or phrases refer to the same entity in a text. It
maintains coherence and consistency in understanding text.
g. Sentiment Analysis
It determines the emotional tone (positive, negative, or neutral) expressed in a piece of text
(see the sketch after this list).
h. Discourse Analysis
It analyses the structure and coherence of larger text segments (e.g., paragraphs,
documents). It also captures how sentences connect to form a meaningful whole.
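A minimal sentiment analysis sketch using the TextBlob library; the example sentence below is
purely illustrative and not taken from the original material:
from textblob import TextBlob
# Polarity ranges from -1 (negative) to +1 (positive); subjectivity from 0 to 1
review = TextBlob("The movie was surprisingly good and the acting was excellent.")
print(review.sentiment)  # e.g., Sentiment(polarity=..., subjectivity=...)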
Natural Language Generation involves the creation of human-like text by machines. NLG
focuses on producing coherent, contextually appropriate, and grammatically correct text. The
key tasks and techniques in NLG include:
a. Content Determination
It decides which information should be included in the generated text.
c. Sentence Planning
It organizes the selected content into sentence-sized units and decides how they are ordered
and combined.
d. Lexicalization
It chooses the appropriate words and phrases to convey the intended meaning.
It enhances the fluency and naturalness of the generated text.
e. Surface Realization
It converts the planned sentence structure into grammatically correct, natural-sounding text.
f. Text Summarization
It produces a shorter version of a longer text while preserving its key information.
g. Dialogue Generation
It generates conversational responses for chatbots and virtual assistants.
h. Report Generation
It creates narratives and reports from structured data. It also automates the creation of
detailed and informative narratives.
While NLU focuses on understanding and interpreting text, NLG is concerned with
generating text. Both components often work together in applications such as:
Chatbots: NLU interprets user input, and NLG generates appropriate responses.
Machine Translation: NLU understands the source text, and NLG generates the
translated text in the target language.
Text Summarization: NLU identifies key information, and NLG produces a concise
summary.
Together, NLU and NLG enable a wide range of applications that require both understanding
and generating human language, making NLP a powerful tool for interacting with and
processing natural language data.
1. Lexical Analysis
Lexicon describes the understandable vocabulary that makes up a language. Lexical analysis
deciphers and segments language into units, or lexemes, such as paragraphs, sentences,
phrases, and words. NLP algorithms categorize words into parts of speech (POS) and split
lexemes into morphemes, meaningful language units that cannot be divided further. There are
2 types of morphemes: free morphemes, which can stand alone as words (e.g., "walk"), and
bound morphemes, which must attach to other morphemes (e.g., the suffix "-ing").
2. Syntactic Analysis
Syntax describes how a language's words and phrases are arranged to form sentences. Syntactic
analysis checks word arrangements for proper grammar.
For instance, the sentence "Dave wrote the paper" passes a syntactic analysis check because
it's grammatically correct. Conversely, a syntactic analysis categorizes a sentence like "Dave
do jumps" as syntactically incorrect.
3. Semantic Analysis
Semantics describes the meaning of words, phrases, sentences, and paragraphs. Semantic
analysis attempts to understand the literal meaning of individual language selections, not
syntactic correctness. However, a semantic analysis doesn't check language data before and
after a selection to clarify its meaning.
For instance, "Manhattan calls out to Dave" passes a syntactic analysis because it's a
grammatically correct sentence. However, it fails a semantic analysis. Because Manhattan is
a place (and can't literally call out to people), the sentence's meaning doesn't make sense.
4. Discourse Integration
Discourse integration analyses how the meaning of a sentence depends on the sentences that
come before it and influences the sentences that follow it.
For instance, if one sentence reads, "Manhattan speaks to all its people," and the following
sentence reads, "It calls out to Dave," discourse integration checks the first sentence for
context to understand that "It" in the latter sentence refers to Manhattan.
5. Pragmatic Analysis
Pragmatic analysis interprets language in terms of its real-world context and the speaker's
intended meaning, going beyond the literal content of the words.
2.2 Linguistic resources, Word Level Analysis, Regular Expression and its applications.
NLP Libraries
2. spaCy:
- Features pre-trained models for various languages, supporting tasks such as POS tagging,
named entity recognition (NER), and dependency parsing.
3. Gensim:
Implements algorithms like Latent Semantic Analysis (LSA) and Latent Dirichlet
Allocation (LDA).
4. Stanford NLP:
6. TextBlob:
7. AllenNLP:
Provides pre-built components for tasks like text classification, named entity
recognition, and machine reading comprehension.
8. CoreNLP (Stanford):
Data Types
Structured Data: Data organized in a predefined format, such as tables in relational
databases or spreadsheets, where fields and their types are fixed.
Unstructured Data: Free-form data without a predefined schema, such as articles, emails,
social media posts, and chat logs; most text processed in NLP is unstructured.
Speech and Audio Data: NLP extends beyond written text to include spoken
language. Transcriptions of speech, phone call recordings, or other audio data fall into
the category of unstructured data that requires processing to extract meaningful
information.
Linguistic resources:
Linguistic resources typically refer to various types of data or tools used in the study or
processing of language.
2. Corpora: Collections of texts or speech data used for linguistic analysis, including
written texts, transcripts, and audio recordings.
3. Thesauri: Resources providing synonyms and antonyms for words, often used for
expanding vocabulary or improving natural language processing tasks.
4. Word Lists: Lists of words categorized by various criteria, such as frequency, part of
speech, or semantic category.
8. Named Entity Recognizers (NER): Tools that identify and classify named entities
(e.g., persons, organizations, locations) in text data.
9. Semantic Networks: Graph-based representations of semantic relationships between
words or concepts, often used in natural language processing tasks.
10. Language Models: Statistical or machine learning models trained on large text
corpora to predict or generate text, commonly used in speech recognition, machine
translation, and text generation.
11. Parsing Tools: Software tools that analyze the grammatical structure of sentences,
often used in syntactic analysis and parsing.
12. Speech Recognition and Synthesis Systems: Tools for converting spoken language
into text and vice versa, commonly used in voice interfaces and speech-to-text
applications.
Word-level analysis in natural language processing (NLP) involves studying and processing
text data at the level of individual words. This type of analysis is fundamental to many NLP
tasks and applications. Some common techniques and tasks involved in word-level analysis include:
5. Named Entity Recognition (NER): NER is the task of identifying and classifying
named entities (e.g., persons, organizations, locations) mentioned in a text. Named
entities are often crucial for information extraction and knowledge discovery tasks.
6. Word Sense Disambiguation (WSD): WSD is the task of determining the correct
meaning of a word in context, particularly when a word has multiple possible
meanings. WSD is important for improving the accuracy of NLP applications such as
machine translation and question answering.
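A minimal word sense disambiguation sketch using NLTK's implementation of the Lesk
algorithm; the sentence and target word are illustrative choices:
import nltk
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
# Disambiguate "bank" in a financial context
sentence = word_tokenize("I deposited the cheque at the bank yesterday")
sense = lesk(sentence, 'bank')
print(sense, "-", sense.definition() if sense else "no sense found")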
Regular expressions (regex) are powerful tools used in Natural Language Processing (NLP)
for pattern matching and text manipulation.
Types of RE (illustrative patterns for each type are collected in the sketch after this list):
1. Exact Match:
Description: Matches the exact occurrence of the word "word" in the text.
2. Wildcards:
Description: The dot (.) represents any character, so this pattern matches words like "ward,"
"word," and "wind."
3. Character Classes:
Description: Matches any single vowel. Square brackets [ ] denote a character class, and the
pattern matches any character within the specified set.
4. Negated Character Classes:
Description: Matches any single character that is not a vowel. The caret (^) inside the
character class negates the set.
5. Quantifiers:
Description: Matches "gol," "gool," "gooool," and so on. The plus (+) indicates one or more
occurrences of the preceding character.
6. Optional Character:
Description: Matches both "color" and "colour." The question mark (?) makes the preceding
character optional.
7. Anchors:
Description: Matches the pattern only if it appears at the beginning of a line. The caret (^) is
an anchor for the start of a line.
8. Word Boundaries:
Description: Matches the whole word "word" and not a part of a larger word. The \b denotes
a word boundary.
9. Grouping:
Description: Matches either "redcar" or "bluecar." The pipe (|) acts as an OR operator within
the parentheses.
Description: Matches "aa," "aaa," or "aaaa." The curly braces {} specify a range for the
number of occurrences.
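The concrete patterns are not reproduced above, so the following sketch shows illustrative
patterns consistent with each description (the exact patterns used originally may differ):
import re
text = "The gooool was scored! My favourite colour is red. word wordsmith a e i"
print(re.findall(r"word", text))        # exact match: literal occurrences of "word"
print(re.findall(r"w..d", text))        # wildcard: . matches any character (ward, word, wind, ...)
print(re.findall(r"[aeiou]", text))     # character class: any single vowel
print(re.findall(r"[^aeiou ]", text))   # negated class: any character that is not a vowel (or space)
print(re.findall(r"go+l", text))        # quantifier: one or more "o" (gol, gool, gooool, ...)
print(re.findall(r"colou?r", text))     # optional character: matches both "color" and "colour"
print(re.findall(r"^The", text))        # anchor: only at the start of the line
print(re.findall(r"\bword\b", text))    # word boundary: the whole word "word" only
print(re.findall(r"(red|blue)car", "a redcar and a bluecar"))  # grouping with OR; findall returns the captured alternative
print(re.findall(r"a{2,4}", "aa aaa aaaaa"))  # repetition range: "aa", "aaa", or "aaaa"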
In NLP applications, regular expressions are used to find, extract, or replace specific
patterns in text. Common uses include:
1. Tokenization:
o Splitting Text: Separating text into words, sentences, or other meaningful
units.
import re
text = "Hello, world! How are you?"
tokens = re.findall(r'\b\w+\b', text)
2. Text Cleaning:
o Removing Punctuation: Cleaning text by removing punctuation marks (see the
combined sketch after this list).
3. Finding Patterns:
o Extracting Email Addresses: Identifying and extracting email addresses from
text.
4. Replacing Text:
o Censoring Words: Replacing specific words or phrases with asterisks.
5. Validation:
o Checking Phone Numbers: Validating phone numbers in a specific format.
phone = "123-456-7890"
valid = re.match(r'^\d{3}-\d{3}-\d{4}$', phone)
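A minimal sketch combining the cleaning, extraction, replacement, and validation ideas
above; the sample text, e-mail addresses, and censored word are illustrative assumptions:
import re
text = "Please email support@example.com or sales@example.org about the secret launch!"
# Text cleaning: strip punctuation (keeping @ and . so the e-mail example still works)
cleaned = re.sub(r"[^\w\s@.]", "", text)
# Finding patterns: extract e-mail addresses with a simplified pattern
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
# Replacing text: censor a specific word with asterisks
censored = re.sub(r"\bsecret\b", "******", text)
# Validation: check a phone number of the form ddd-ddd-dddd
valid = bool(re.match(r"^\d{3}-\d{3}-\d{4}$", "123-456-7890"))
print(cleaned)
print(emails)
print(censored)
print(valid)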
i) Search for words ending with "t"
To search for words ending with "t", you can use the following regular expression pattern:
import re
text = "This is a test text to find words that end with t, such as heat, light, and flight."
pattern = r'\b\w*t\b'
matches = re.findall(pattern, text)
print(matches)
ii) Search the string to see if it starts with "I" and ends with "Asia"
To check if a string starts with "I" and ends with "Asia", you can use the following regular
expression pattern:
import re
# Example input string
text = "India is a country in Asia"
# Regular expression to check if string starts with "I" and ends with "Asia"
pattern = r'^I.*Asia$'
match = re.search(pattern, text)
if match:
    print("The string starts with 'I' and ends with 'Asia'")
else:
    print("The string does not match the criteria")
MODULE-3
3.1 Dependency Grammar: Named Entity Recognition, Question Answer System, Co-
reference resolution, text summarization, text classification
3.3 Hidden Markov Models, Dependency Parsing, Corpus, Tokens and N-grams
Dependency Grammar:
Dependency Grammar is a class of syntactic theories that regards the structure of a sentence
as based on the dependency relations between words. Each word is connected to another
word in the sentence, establishing a "head-dependent" relationship. The primary focus is on
the direct relationships between words, rather than constituent structure.
Key Concepts
Head: The central word that determines the syntactic type of the phrase.
Dependent: The word that modifies or complements the head.
Dependency Relation: The link between a head and its dependent.
Example
In the sentence "She enjoys playing tennis", the verb "enjoys" is the head (root); "She"
depends on it as its subject and "playing" as its complement.
Named Entity Recognition (NER)
Named Entity Recognition identifies and classifies named entities in text into predefined
categories such as persons, organizations, locations, and dates.
Miscellaneous: Other types of entities like monetary values, percentages, product names, etc.
Process: The text is tokenized, candidate entity mentions are detected, and each mention is
classified into an entity type.
NER is significant in various text analysis tasks due to the following reasons:
Information Extraction:
Improved Search and Retrieval: By identifying named entities, search engines can
better understand the context of queries and documents, leading to more relevant
search results.
Content Summarization: NER helps in summarizing documents by extracting key
entities, providing a concise overview of the content.
Database Population: NER assists in populating knowledge bases and databases with
structured information extracted from unstructured text.
Entity Linking: It helps in linking entities in text to their corresponding entries in a
knowledge base, enhancing the richness of information.
Customer Insights:
Business Intelligence:
Challenges:
Ambiguity: Entities can be ambiguous (e.g., "Apple" can refer to the fruit or the
company).
Variability: Entities can have various forms and spellings (e.g., "USA", "United
States", "America").
Context-Dependence: The meaning of entities can depend on context (e.g., "Jordan"
can be a country or a person).
Python Code:
import spacy
nlp = spacy.load('en_core_web_sm')
# Input sentence
sentence = ("Apple Inc. was founded by Steve Jobs and Steve Wozniak in April 1976 in "
            "Cupertino, California.")
doc = nlp(sentence)
# Print each named entity with a human-readable type
for ent in doc.ents:
    if ent.label_ == "ORG":
        entity_type = "Organization"
    elif ent.label_ == "DATE":
        entity_type = "Date"
    elif ent.label_ == "GPE":
        entity_type = "Location"
    else:
        entity_type = ent.label_
    print(f"{ent.text} ({entity_type})")
Question answering builds systems that automatically answer questions posed by humans in
natural language. Such a system constructs its answers by querying a structured database of
knowledge or information, usually a knowledge base, or by searching unstructured collections
of natural language documents (e.g., reference texts, web pages, or an organization's internal
documents).
Question answering systems deal with fact, list, definition, how, why, hypothetical,
semantically constrained, and cross-lingual questions.
1. Closed-domain question answering: It deals with questions under a specific domain;
Natural Language Processing systems can exploit domain-specific knowledge frequently
formalized in ontologies.
2. Open-domain question answering: It deals with questions about nearly anything, and can
only rely on general ontologies and world knowledge.
Question answering systems include a question classifier module that determines the type of
question and the type of answer. The idea of data redundancy in massive collections, such as
the web, means that nuggets of information are likely to be phrased in many different ways in
differing contexts and documents, which leads to two benefits:
1. By having the right information appear in many forms, the burden on the Question
Answer system to perform complex NLP techniques to understand the text is lessened.
2. Correct answers can be filtered from false positives by relying on the correct answer to
appear more times in the documents than instances of incorrect ones.
For example, systems have been developed to automatically answer temporal and
multilingual questions, and questions about the content of audio, images, and video.
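As a sketch of how such a system can be used in practice, the Hugging Face transformers
pipeline provides pre-trained extractive question answering; the model name, question, and
context below are illustrative choices, not taken from the original material:
from transformers import pipeline
# Extractive QA over a small context paragraph
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
context = ("Apple Inc. was founded by Steve Jobs and Steve Wozniak in April 1976 "
           "in Cupertino, California.")
result = qa(question="Who founded Apple Inc.?", context=context)
print(result["answer"], "(score:", round(result["score"], 3), ")")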
Coreference Resolution:
Coreference resolution is the task of identifying all expressions in a text that refer to the
same real-world entity. Key ideas include:
1. Mention Detection:
o Identifying all the phrases or words (mentions) in a text that potentially refer
to entities.
2. Coreference Chains:
o Grouping mentions that refer to the same entity into clusters. Each cluster
represents one unique entity.
3. Types of Coreference:
o Pronouns: E.g., "John arrived. He was late." Here, "He" refers to "John".
o Proper Names: E.g., "Barack Obama" and "Obama".
o Common Nouns: E.g., "The president" and "Barack Obama".
o Synonyms and Hypernyms: E.g., "The car" and "The vehicle".
Steps in Coreference Resolution
1. Mention Extraction:
o Extract all potential mentions from the text. This can include noun phrases,
pronouns, and named entities.
2. Feature Extraction:
o Extract features for each pair of mentions. Features can include grammatical
roles, proximity, gender, number agreement, semantic compatibility, etc.
3. Mention Pair Classification:
o Classify whether pairs of mentions refer to the same entity using machine
learning models. Techniques can range from rule-based methods to neural
networks.
4. Cluster Formation:
o Form clusters of mentions that refer to the same entity based on the pairwise
classifications.
Approaches and Techniques
1. Rule-Based Methods:
o Utilize linguistic rules and heuristics. For example, resolving pronouns based
on grammatical and syntactic cues.
2. Machine Learning Approaches:
o Use supervised learning with annotated datasets. Common models include
decision trees, support vector machines, and neural networks.
o Feature-Based Models: Handcrafted features are used to train classifiers.
o Deep Learning Models: End-to-end models, such as those using LSTM or
transformers, which learn representations automatically from data.
3. Recent Advances:
o Neural Networks: Use of RNNs, LSTMs, and transformers to capture context
and dependencies better.
o Pre-trained Language Models: Models like BERT, GPT-3, and their
derivatives fine-tuned for coreference tasks have shown state-of-the-art
performance.
Datasets
1. OntoNotes:
o A large corpus annotated for coreference resolution, commonly used for
training and evaluating models.
2. ACE (Automatic Content Extraction):
o Another dataset used for coreference resolution among other NLP tasks.
Evaluation Metrics
Commonly used metrics include MUC, B-cubed (B³), and CEAF, often averaged into the
CoNLL score.
Challenges
1. Ambiguity:
o Determining the correct antecedent for pronouns and other ambiguous
mentions can be difficult.
2. Variability in Language:
o Differences in writing style, use of synonyms, and indirect references add
complexity.
3. Context Understanding:
o Requires deep understanding of context and world knowledge.
Applications
1. Information Extraction:
o Improves the extraction of structured information from unstructured text.
2. Text Summarization:
o Enhances the coherence and cohesion of summaries by correctly linking
entities.
3. Question Answering:
o Ensures accurate linking of entities in questions and answers for better
performance.
Text summarization and text classification are two fundamental tasks in natural language
processing (NLP), each with its distinct methodologies and applications.
Text Summarization
Text summarization involves creating a concise and coherent version of a longer document,
capturing its main points. There are two main types of text summarization:
1. Extractive Summarization:
Method: Selects important sentences, phrases, or sections directly from the source document
and combines them to form a summary.
Techniques: frequency- and TF-IDF-based sentence scoring, and graph-based ranking methods
such as TextRank and LexRank (a toy frequency-based sketch appears after this subsection).
2. Abstractive Summarization:
Method: Generates new sentences that capture the essence of the source text, rather than just
extracting parts of it.
Techniques: sequence-to-sequence neural models and pre-trained transformer models such as
T5, BART, and PEGASUS.
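A toy extractive summarizer, assuming simple sentence scoring by word frequency (a minimal
sketch of the extractive approach described above, not a production method; the sample text
is illustrative):
from collections import Counter
import re

def extractive_summary(text, n_sentences=2):
    # Split into sentences and words (very simple tokenization)
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'\w+', text.lower())
    freqs = Counter(words)
    # Score each sentence by the total frequency of its words
    scores = {s: sum(freqs[w] for w in re.findall(r'\w+', s.lower())) for s in sentences}
    # Keep the top-n sentences, preserving their original order
    top = sorted(sentences, key=lambda s: scores[s], reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in top)

text = ("Natural language processing is a field of artificial intelligence. "
        "It enables computers to understand human language. "
        "Applications include translation, summarization, and question answering.")
print(extractive_summary(text, n_sentences=2))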
Text Classification
Text classification involves assigning predefined categories or labels to a given piece of text.
It has various applications such as sentiment analysis, spam detection, topic categorization,
and more.
Text Preprocessing: lowercasing, tokenization, stop word removal, and stemming or
lemmatization.
Feature Extraction (a short sketch of the first two representations follows this list):
1. Bag-of-Words (BoW): Represents text as a set of word counts or binary indicators.
2. TF-IDF (Term Frequency-Inverse Document Frequency): Weighs words based on
their importance in the document and across the corpus.
3. Word Embeddings: Dense vector representations of words (e.g., Word2Vec, GloVe,
FastText).
4. Contextual Embeddings: Captures word meanings in context using models like
BERT, ELMo, and GPT.
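A minimal sketch of the Bag-of-Words and TF-IDF representations using scikit-learn; the
example documents are illustrative:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats make good pets"]

# Bag-of-Words: raw term counts per document
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# TF-IDF: counts reweighted by how rare each term is across the corpus
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))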
Model Training:
1. Naive Bayes:
A probabilistic classifier based on Bayes' theorem, often used for spam detection and
sentiment analysis.
Example:
Find the tag for the sentence “She sings very sweet” using Naive Bayes algorithm.
Given Data:
Sentence                                   Category
A Great medicine                           Medicine
She is having sweet voice                  Music
Tulsi is a medicinal plant                 Medicine
Ginger is good for health                  Medicine
Yaman is one of the sweet raga.            Music
Listening to music is good for health      Music
To classify the sentence "She sings very sweet" using the Naive Bayes algorithm, we will
follow these steps:
1. Prepare the Data: Create a dataset with the given sentences and their corresponding
categories.
2. Preprocess the Data: Tokenize the sentences and convert them into a suitable format
for the Naive Bayes classifier.
3. Train the Naive Bayes Classifier: Using the training data, train a Naive Bayes
classifier.
4. Classify the New Sentence: Use the trained classifier to predict the category of the
new sentence.
Python Implementation:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
# Training data
data = {
    'Sentence': [
        'A Great medicine',
        'She is having sweet voice',
        'Tulsi is a medicinal plant',
        'Ginger is good for health',
        'Yaman is one of the sweet raga.',
        'Listening to music is good for health'
    ],
    'Category': [
        'Medicine',
        'Music',
        'Medicine',
        'Medicine',
        'Music',
        'Music'
    ]
}
# Create a DataFrame
df = pd.DataFrame(data)
X_train = df['Sentence']
y_train = df['Category']
# Bag-of-words features + Multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
# Classify the new sentence
new_sentence = ["She sings very sweet"]
predicted_category = model.predict(new_sentence)
print(predicted_category[0])
3. Neural Networks:
Fine-tuning models like BERT, RoBERTa, and GPT for specific classification tasks,
leveraging their deep understanding of language context and semantics.
Applications
1. Text Summarization: news digests, executive summaries of long reports, and snippet
generation for search results.
2. Text Classification: spam filtering, sentiment analysis, topic labeling, and intent
detection in chatbots.
Both text summarization and text classification are integral to numerous real-world
applications, enabling better information retrieval, understanding, and decision-making
processes in various domains.
Tokenization
Tokenization is the process of breaking down a text into smaller units called tokens. These
tokens can be words, phrases, symbols, or other meaningful elements.
Tokenization is a fundamental step in text processing and analysis, as it converts raw text into
a structured form that can be easily analyzed.
Techniques
Common techniques include word tokenization, sentence tokenization, character tokenization,
and subword tokenization (e.g., Byte-Pair Encoding, WordPiece).
Example
Consider the sentence: "Apple Inc. was founded by Steve Jobs and Steve Wozniak in April
1976 in Cupertino, California."
Tokens:
"Apple"
"Inc."
"was"
"founded"
"by"
"Steve"
"Jobs"
"and"
"Steve"
"Wozniak"
"in"
"April"
"1976"
"in"
"Cupertino"
","
"California"
"."
import spacy
nlp = spacy.load("en_core_web_sm")
# Input sentence
sentence = ("Apple Inc. was founded by Steve Jobs and Steve Wozniak in April 1976 in "
            "Cupertino, California.")
doc = nlp(sentence)
tokens = [token.text for token in doc]
print("Tokens:", tokens)
Text Normalization
Text normalization is the process of transforming text into a standard format. It is a crucial
pre-processing step in natural language processing (NLP) to ensure that the text is consistent
and clean before further analysis or model training.
Example
Consider the sentence: "Running faster and faster, he couldn't believe it!"
Steps: convert the text to lowercase, remove punctuation, remove stop words, and reduce the
remaining words to their base form (lemmatization).
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import string
# Example sentence
sentence = "Running faster and faster, he couldn't believe it!"
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence.lower())
# Keep lemmas of tokens that are neither stop words nor punctuation
normalized = [t.lemma_ for t in doc if t.text not in STOP_WORDS and t.text not in string.punctuation]
print("Normalized tokens:", normalized)
Part of Speech (POS) tagging involves assigning each word in a sentence its
corresponding part of speech, such as noun, verb, adjective, etc. POS tagging is a
critical step in the syntactic analysis of a language, helping to understand the structure
and meaning of a sentence.
Lexical syntax refers to the structure and formation of tokens within a language,
covering both their identification and syntactic roles. POS tagging is a part of this
broader area, as it deals with the syntactic roles of words.
Key Concepts
Parts of Speech: the major word classes, including nouns, verbs, adjectives, adverbs,
pronouns, prepositions, conjunctions, and determiners.
Tagging Approaches:
Rule-Based Tagging: Uses a set of hand-crafted rules to assign POS tags based on
word patterns and context.
Statistical Tagging: Uses probabilistic models like Hidden Markov Models (HMM) to
assign POS tags based on the likelihood of sequences of tags.
Machine Learning Tagging: Utilizes supervised learning models such as Conditional
Random Fields (CRF) and neural networks to predict POS tags based on annotated
training data.
Example
Consider the sentence: "Apple Inc. was founded by Steve Jobs and Steve Wozniak in
April 1976 in Cupertino, California."
POS Tags:
"Apple" (NNP)
"Inc." (NNP)
"was" (VBD)
"founded" (VBN)
"by" (IN)
"Steve" (NNP)
"Jobs" (NNP)
"and" (CC)
"Steve" (NNP)
"Wozniak" (NNP)
"in" (IN)
"April" (NNP)
"1976" (CD)
"in" (IN)
"Cupertino" (NNP)
"," (,)
"California" (NNP)
"." (.)
import spacy
nlp = spacy.load("en_core_web_sm")
# Input sentence
sentence = ("Apple Inc. was founded by Steve Jobs and Steve Wozniak in April 1976 in "
            "Cupertino, California.")
doc = nlp(sentence)
for token in doc:
    print(f"{token.text} ({token.tag_})")
A Hidden Markov Model (HMM) is a statistical model used to describe systems that are
modeled by a Markov process with hidden states. HMMs are widely used in various fields
such as speech recognition, handwriting recognition, and natural language processing,
particularly in part-of-speech tagging, named entity recognition, and other sequence labeling
tasks.
Key Concepts
1. States: The possible conditions or positions that can be taken by the system (e.g.,
parts of speech in a sentence). In HMM, these states are not directly visible (hidden).
2. Observations: The data or outputs that can be directly observed (e.g., the words in a
sentence).
3. Transition Probabilities: The probabilities of transitioning from one state to another.
4. Emission Probabilities: The probabilities of an observation being generated from a
particular state.
5. Initial State Probabilities: The probabilities of the system starting in each state.
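Putting these pieces together, the joint probability that an HMM assigns to an observation
sequence o_1, ..., o_T and a hidden state sequence s_1, ..., s_T can be written (standard
formulation, stated here for completeness) as:
P(o_1, ..., o_T, s_1, ..., s_T) = \pi_{s_1} \, b_{s_1}(o_1) \prod_{t=2}^{T} a_{s_{t-1} s_t} \, b_{s_t}(o_t)
where \pi holds the initial state probabilities, a the transition probabilities, and b the
emission probabilities.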
Components of HMM
An HMM is fully specified by the set of hidden states, the set of possible observations, the
transition matrix A, the emission matrix B, and the initial state distribution π.
Applications in NLP
Typical applications include part-of-speech tagging, named entity recognition, chunking, and
speech recognition.
Example: consider the sentence "the cat sat on the mat". We aim to tag each word with its
part of speech using an HMM.
import numpy as np
from hmmlearn import hmm
states = ["DET", "NOUN", "VERB", "ADP"]            # hidden POS tags
observations = ["the", "cat", "sat", "on", "mat"]  # observed words
n_states = len(states)
n_observations = len(observations)
# Illustrative probabilities (in practice these are estimated from a tagged corpus)
pi = np.array([0.6, 0.2, 0.1, 0.1])  # initial state probabilities
A = np.array([[0.1, 0.7, 0.1, 0.1], [0.1, 0.2, 0.5, 0.2],            # transition matrix
              [0.3, 0.2, 0.1, 0.4], [0.6, 0.3, 0.05, 0.05]])
B = np.array([[0.9, 0.025, 0.025, 0.025, 0.025], [0.05, 0.5, 0.05, 0.05, 0.35],  # emission matrix
              [0.05, 0.05, 0.8, 0.05, 0.05], [0.05, 0.05, 0.05, 0.8, 0.05]])
model = hmm.CategoricalHMM(n_components=n_states)  # called MultinomialHMM in older hmmlearn releases
model.startprob_ = pi
model.transmat_ = A
model.emissionprob_ = B
obs_seq = np.array([[0, 1, 2, 3, 4]]).T  # Mapping "the cat sat on mat" to indices [0, 1, 2, 3, 4]
logprob, hidden_states = model.decode(obs_seq, algorithm="viterbi")
print("Predicted tags:", [states[i] for i in hidden_states])
Dependency Parsing
Dependency parsing is a type of syntactic parsing that focuses on the relationships between
words in a sentence. Unlike phrase structure parsing, which emphasizes constituent structures
(like noun phrases and verb phrases), dependency parsing is concerned with how words are
connected through grammatical relationships.
Key Concepts
Dependencies: directed relations in which one word (the head) governs another word (the
dependent).
Root:
The main verb or action in a sentence, which governs all other words either directly or
indirectly.
Arcs: the directed, labeled edges drawn from a head to each of its dependents (e.g., subject,
object, modifier).
Example
For a sentence such as "She enjoys playing tennis", the dependency tree is rooted at the verb
"enjoys", with the subject "She" and the complement "playing" attached to it as dependents.
Python Code:
import spacy
nlp = spacy.load("en_core_web_sm")
# Input sentence (illustrative)
sentence = "She enjoys playing tennis."
doc = nlp(sentence)
# Print each token with its dependency relation and its head
for token in doc:
    print(f"{token.text} --{token.dep_}--> {token.head.text}")
# Visualize the dependency tree in a browser
spacy.displacy.serve(doc, style="dep")
Corpus, Tokens and N-grams
1. Corpus:
A corpus (plural: corpora) is a large and structured set of texts. It serves as a dataset for
training and evaluating NLP models. A corpus can contain various types of text data such as
articles, books, emails, tweets, etc.
Example: the Brown Corpus (about one million words of American English text), a collection
of news articles, or a set of tweets gathered for sentiment analysis.
2. Tokens
Tokenization is the process of splitting text into individual pieces, called tokens. Tokens can
be words, subwords, or characters. Tokenization is one of the first steps in text processing.
Example: the sentence "NLP is fun" yields the tokens ["NLP", "is", "fun"].
3. N-grams
An n-gram is a collection of n successive items in a text document that may include words,
numbers, symbols, and punctuation. N-gram models are useful in many text analytics
applications where sequences of words are relevant, such as in sentiment analysis, text
classification, and text generation. N-gram modeling is one of the many techniques used to
convert text from an unstructured format to a structured format. An alternative to n-gram is
word embedding techniques, such as word2vec.
Types of N-grams: unigrams (n = 1), bigrams (n = 2), trigrams (n = 3), and higher-order n-grams.
Advantages: simple to compute, easy to interpret, and effective at capturing local word order.
Limitations: data sparsity as n grows, inability to capture long-range dependencies, and
rapidly growing feature spaces.
Language Modeling:
n-grams play a critical role in language modeling by predicting the likelihood of a word
given its preceding words.
Text Analysis:
In text analysis, n-grams assist in understanding the structure and frequency of phrases within
large text corpora:
Frequency Analysis: n-grams provide insights into common word combinations and
phrases by analyzing their frequency of occurrence. This is useful in various
applications like information retrieval, where identifying common phrases can
improve search results.
Contextual Understanding: By examining n-grams, one can better understand the
local context of words within a text, aiding tasks like sentiment analysis and keyword
extraction.
Python Code:
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from collections import Counter
nltk.download('punkt')
# Example text (illustrative)
text = "Natural language processing makes computers understand human language."
# Tokenize text
tokens = word_tokenize(text)
print("Tokens:", tokens)
# Generate Bigrams
bigrams = list(ngrams(tokens, 2))
print("Bigrams:", bigrams)
# Generate Trigrams
trigrams = list(ngrams(tokens, 3))
print("Trigrams:", trigrams)
# Count n-gram frequencies
bigram_freq = Counter(bigrams)
trigram_freq = Counter(trigrams)
print("Bigram Frequencies:", bigram_freq)
print("Trigram Frequencies:", trigram_freq)
Example2: Using Textblob object in Python construct trigrams for the following text
“ Nelson Rolihlahla Mandela was a South African anti-apartheid activist and politician
who served as the first president of South Africa”. Write suitable Python code for it.
text = "Nelson Rolihlahla Mandela was a South African anti-apartheid activist and politician
who served as the first president of South Africa."
blob = TextBlob(text)
trigrams = blob.ngrams(n=3)
print(trigram)
Stemming
Stemming is a method in text processing that eliminates prefixes and suffixes from words,
transforming them into their fundamental or root form. The main objective of stemming is to
streamline and standardize words, enhancing the effectiveness of natural language processing
tasks.
For example, the words "likes", "liked", "likely", and "liking" are all reduced to the common
stem "like".
Types of stemmer:
1.Porter Stemmer: Proposed in 1980 by Martin Porter, it's one of the most popular
stemming methods. It's fast and widely used in English-based applications like data mining
and information retrieval. However, it can produce stems that are not real words.
Example: EED -> EE means "if the word has at least one vowel and consonant plus EED
ending, change the ending to EE", as 'agreed' becomes 'agree'.
2.Lovins Stemmer: Developed by Lovins in 1968, this stemmer removes the longest suffix
from a word. It's fast and handles irregular plurals well, but it may not always produce valid
words from stems.
4.Krovetz Stemmer: Proposed in 1993 by Robert Krovetz, this stemmer converts plural
forms to singular and past tense to present, removing 'ing' suffixes. It's light and can be used
as a pre-stemmer, but it may be inefficient for large documents.
5.Xerox Stemmer: Capable of processing extensive datasets and generating valid words, but
it can over-stem due to reliance on lexicons, making it language-dependent.
Example:
6.N-Gram Stemmer: Breaks words into segments of length 'n' and applies statistical analysis
to identify patterns. It's language-dependent and requires space to create and index n-grams.
Example: 'INTRODUCTIONS' for n=2 becomes: *I, IN, NT, TR, RO, OD, DU, UC, CT,
TI, IO, ON, NS, S*
7.Snowball Stemmer (Porter2): Multi-lingual and more aggressive than Porter Stemmer, it's
faster and supports various languages. It's based on the Snowball programming language.
Example:
Input: running
Output: run
The Snowball Stemmer (for English) is based on the Porter Stemmer algorithm and operates
similarly by removing common suffixes. It recognizes "ing" as a suffix and removes it to
obtain the base form "run".
8.Lancaster Stemmer: More aggressive and dynamic, it's faster but can be confusing with
small words. It saves rules externally and uses an iterative algorithm.
Example:
Input: running
Output: run
The Lancaster Stemmer is more aggressive than the Porter Stemmer. It removes common
suffixes based on a different set of rules. In this case, it identifies "ing" as a suffix and
removes it to produce the base form "run".
9.Regexp Stemmer: Uses regular expressions to define custom rules for stemming. It offers
flexibility and control over the stemming process for specific applications.
Example:
Input: running
Output: run
The regex-based stemmer uses regular expressions to identify and remove specific suffixes.
In this example, it recognizes "ing" as a suffix and removes it, resulting in the base form
"run".
Stemming Operations in Python:
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer
nltk.download('punkt')
# Sample text
text = "The quick brown foxes were running and jumping over the lazy dogs."
words = nltk.word_tokenize(text)
porter, snowball = PorterStemmer(), SnowballStemmer("english")
lancaster, regexp = LancasterStemmer(), RegexpStemmer('ing$|s$|ed$', min=4)
print("Porter:   ", [porter.stem(w) for w in words])
print("Snowball: ", [snowball.stem(w) for w in words])
print("Lancaster:", [lancaster.stem(w) for w in words])
print("Regexp:   ", [regexp.stem(w) for w in words])
Lemmatization
Lemmatization reduces words to their dictionary (base) form, called the lemma, using
vocabulary and morphological analysis rather than simple suffix stripping.
Python Code:
import nltk
import spacy
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
# Sample text
text = "The quick brown foxes were running and jumping over the lazy dogs."
words = nltk.word_tokenize(text)
wordnet_lemmatizer = WordNetLemmatizer()
# Function to get wordnet POS tag
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN, "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)
print("NLTK lemmas:", [wordnet_lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in words])
# Lemmatization with spaCy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print("spaCy lemmas:", [token.lemma_ for token in doc])
Stop Word Removal
Stop words are common words such as "and", "the", "is", "in", etc., that are often removed
from text because they carry little meaningful information for tasks like text classification,
sentiment analysis, and more.
Python Code:
import nltk
import spacy
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
# Sample text
text = "The quick brown foxes were running and jumping over the lazy dogs."
words = nltk.word_tokenize(text)
nltk_stop_words = set(stopwords.words('english'))
filtered_nltk = [w for w in words if w.lower() not in nltk_stop_words]
print("NLTK (stop words removed):", filtered_nltk)
nlp = spacy.load('en_core_web_sm')
# Process the text with spaCy
doc = nlp(text)
filtered_spacy = [token.text for token in doc if not token.is_stop]
print("spaCy (stop words removed):", filtered_spacy)
Term Frequency (TF) and Inverse Document Frequency (IDF) are fundamental concepts in
information retrieval and text mining. They are used to evaluate how important a word is to a
document in a collection (or corpus). The combination of these two metrics forms the TF-IDF
score, which is widely used in search engines, text mining, and various NLP applications.
Term Frequency (TF)
Term Frequency measures how frequently a term occurs in a document.
Formula: TF(t, d) = (number of times term t appears in document d) / (total number of terms
in document d)
Python Code:
from collections import Counter

def compute_tf(doc):
    tf_dict = {}
    word_counts = Counter(doc)
    total_terms = len(doc)
    # Term frequency = count of the term / total number of terms in the document
    for word, count in word_counts.items():
        tf_dict[word] = count / total_terms
    return tf_dict

# Example document
doc = "the quick brown fox jumps over the lazy dog".split()
tf = compute_tf(doc)
print(tf)
Inverse Document Frequency (IDF)
Inverse Document Frequency measures how important a term is. While computing TF, all
terms are considered equally important. However, certain terms like "is", "of", and "that" may
appear frequently in many documents but have little importance. IDF diminishes the weight
of such common terms and increases the weight of rare terms.
Formula: IDF(t) = log(N / n_t), where N is the total number of documents and n_t is the
number of documents containing term t.
Python Code:
import math
def compute_idf(docs):
    idf_dict = {}
    N = len(docs)
    # IDF = log(total documents / documents containing the term)
    for term in set(t for d in docs for t in d):
        idf_dict[term] = math.log(N / sum(1 for d in docs if term in d))
    return idf_dict
# Example documents (tokenized; illustrative)
docs = [doc,
        "never jump over the lazy dog quickly".split(),
        "a quick brown dog outpaces a quick fox".split()]
idf = compute_idf(docs)
print(idf)
TF-IDF
The TF-IDF score of a term in a document is the product of the two measures:
TF-IDF(t, d) = TF(t, d) × IDF(t).
Python Code:
def compute_tfidf(tf, idf):
    tfidf_dict = {}
    for term, tf_value in tf.items():
        tfidf_dict[term] = tf_value * idf.get(term, 0.0)
    return tfidf_dict

tfidf_docs = []
for d in docs:
    tf = compute_tf(d)
    tfidf_docs.append(compute_tfidf(tf, idf))

print("TF-IDF:")
for i, tfidf in enumerate(tfidf_docs):
    print(f"Document {i+1}:")
    print(tfidf)
Significance:
Local Importance: TF captures how often a term occurs within a single document,
highlighting the terms that best characterize that particular document.
Use Cases:
Text Summarization: TF helps in identifying key terms that are important for
summarizing the content of a document.
Document Clustering: High TF values indicate terms that can be used for clustering
similar documents together.
Significance:
Global Importance: IDF measures the rarity of a term across all documents in the
corpus. Words that appear in many documents (common words) get lower IDF scores,
while words that appear in fewer documents (rare words) get higher IDF scores. This
helps in distinguishing important terms from common ones.
Balancing Commonality: By giving less weight to common words and more weight to
rare ones, IDF helps in reducing the noise caused by frequent but uninformative terms
(e.g., "the", "is", "and").
Enhancing Specificity: IDF boosts the significance of terms that are more informative
and specific to particular documents, making it easier to distinguish between
documents based on their unique content.
Use Cases: search engine ranking, keyword extraction, and filtering out uninformative
common words.
Significance of TF-IDF: By multiplying TF and IDF, the combined score rewards terms that
are frequent in a particular document but rare across the corpus, making it a simple yet
effective measure of term relevance for search, keyword extraction, and text classification
features.
Word embeddings are a type of word representation that allows words to be represented as
vectors in a continuous vector space. This representation helps capture the semantic meaning
of words based on their context in large corpora. Two popular models for generating word
embeddings are Word2Vec and FastText.
Word2Vec
1. Model Architecture:
Continuous Bag of Words (CBOW): In CBOW, the model predicts a target word
based on its context. The context words are used as input, and the target word is the
output. The model is trained to minimize the difference between the predicted and
actual target words.
Skip-gram: Conversely, the Skip-gram model predicts context words given a target
word. The target word serves as input, and the model aims to predict the words that
are likely to appear in its context. Like CBOW, the goal is to minimize the difference
between the predicted and actual context words.
2. Model Training:
Both CBOW and Skip-gram models leverage neural networks to learn vector representations.
The neural network is trained on a large text corpus, adjusting the weights of connections to
minimize the prediction error. This process places similar words closer together in the
resulting vector space.
3. Vector Representations:
Once trained, Word2Vec assigns each word a unique vector in the high-dimensional space.
These vectors capture semantic relationships between words. Words with similar meanings or
those that often appear in similar contexts have vectors that are close to each other, indicating
their semantic similarity.
4. Advantages and Disadvantages:
Advantages: produces dense, low-dimensional vectors that capture semantic similarity, and
trains efficiently on large corpora.
Disadvantages: assigns a single vector per word (so it cannot distinguish different senses of
the same word) and cannot produce vectors for out-of-vocabulary words.
Python Code:
from gensim.models import Word2Vec
# Sample corpus (tokenized sentences; illustrative)
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "chased", "the", "cat"],
          ["dogs", "and", "cats", "are", "popular", "pets"]]
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1)
similar_words = model.wv.most_similar('cat')
print(similar_words)
FastText
1. Subword Information:
FastText represents each word as a bag of character n-grams in addition to the whole word
itself. This means that the word “apple” is represented by the word itself and its constituent n-
grams like “ap”, “pp”, “pl”, “le”, etc. This approach helps capture the meanings of shorter
words and affords a better understanding of suffixes and prefixes.
2. Model Training:
Similar to Word2Vec, FastText can use either the CBOW or Skip-gram architecture.
However, it incorporates the subword information during training. The neural network in
FastText is trained to predict words (in CBOW) or context (in Skip-gram) not just based on
the target words but also based on these n-grams.
3. Handling Rare and Unseen Words:
A significant advantage of FastText is its ability to generate better word representations for
rare words or even words not seen during training. By breaking down words into n-grams,
FastText can construct meaningful representations for these words based on their subword
units.
Advantages: handles out-of-vocabulary and rare words through subword n-grams, and works
well for morphologically rich languages.
Python Code:
from gensim.models import FastText
# Example corpus (tokenized sentences; illustrative)
corpus = [["she", "ate", "two", "apples"],
          ["apples", "and", "oranges", "are", "fruits"]]
model = FastText(sentences=corpus, vector_size=50, window=3, min_count=1)
similar_words = model.wv.most_similar('apples')
for word, similarity in similar_words:
    print(f"{word}: {similarity}")
MODULE-5
Deep learning models are a class of machine learning algorithms that are capable of learning
complex patterns and representations from large amounts of data. They have revolutionized
many fields, including Natural Language Processing (NLP), computer vision, speech
recognition, and more. Following are some common deep learning models:
1. Convolutional Neural Networks (CNNs)
Key Features:
Consist of convolutional layers followed by pooling layers.
Learn hierarchical representations of image features.
Effective in tasks like image classification, object detection, and image segmentation.
2. Recurrent Neural Networks (RNNs)
Key Features:
Process sequential data by maintaining a hidden state that is updated at each time step.
Suited to variable-length inputs such as sentences and time series.
3. Long Short-Term Memory (LSTM) Networks
Key Features:
A type of RNN designed to address the vanishing gradient problem.
Maintain long-term dependencies in sequential data.
Include forget, input, and output gates to control information flow.
Example Applications: Language modeling, sentiment analysis, time series prediction, speech
recognition.
Key Features:
5. Transformer Models
Key Features:
Based on self-attention mechanisms rather than recurrence, processing all tokens in parallel.
Capture long-range dependencies effectively and scale well to large datasets.
Form the basis of pre-trained models such as BERT and GPT.
6. Generative Adversarial Networks (GANs)
Key Features:
Consist of a generator and a discriminator trained in competition with each other.
The generator learns to produce realistic samples while the discriminator learns to tell real
data from generated data.
Example Applications: Image generation (e.g., faces, artworks), data augmentation, anomaly
detection.
7. Variational Autoencoders (VAEs)
Key Features:
Encode data into a probabilistic latent space and decode samples from it to generate new data.
Trained with a reconstruction loss combined with a KL-divergence regularization term.
8. Deep Reinforcement Learning Models
Key Features:
An agent learns by interacting with an environment and maximizing cumulative reward, often
using deep networks to approximate value functions or policies.
Example Applications: Game playing (e.g., AlphaGo), robotics control, autonomous driving.
Each of these deep learning models has its strengths and is suited for specific tasks.
Understanding their principles and architectures can help in selecting the right model for a
given problem and designing effective machine learning solutions.
ELMo Overview:
ELMo (Embeddings from Language Models) produces contextualized word embeddings using
deep bidirectional LSTM language models, so the same word receives different vectors
depending on its sentence context.
Overview of BERT:
BERT (Bidirectional Encoder Representations from Transformers) is a transformer encoder
pre-trained with masked language modeling and next-sentence prediction on large corpora; it
is then fine-tuned for downstream tasks such as classification and question answering.
Python Code:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
# Example text (the classification head is untrained here, so the output is illustrative only)
text = "Natural language processing with BERT is powerful."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)
print(predictions)
Zipf's Law
Zipf's Law states that the frequency of a word in a corpus is inversely proportional to its rank.
In simpler terms, the most frequent word appears roughly twice as often as the second most
frequent word, three times as often as the third most frequent word, and so on. Mathematically,
Zipf's Law can be expressed as:
f(r) ∝ 1/r, i.e., f(r) = C / r
where f(r) is the frequency of the word of rank r and C is a constant.
Example:
The total number of words in a document is 80. The word "Commendable" appears 12 times
in the document. Using Zipf's Law, calculate the probability of the word "Commendable" at
rank 6.
Ans:
To calculate the probability of the word "Commendable" at rank 6 using
Zipf's Law, we can follow these steps:
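One common way this exercise is worked, assuming the convention that the rank-r probability
is the word's observed probability divided by its rank (this convention is an assumption; other
formulations of Zipf's law use an empirical constant instead):
P(Commendable) = 12 / 80 = 0.15
P(rank 6) ≈ 0.15 / 6 = 0.025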
Heap's Law
Heap's Law describes how the vocabulary size (the number of distinct words) grows with the
total number of words in a document or corpus. It is usually written as:
V = K × N^b
where V is the vocabulary size, N is the total number of words, and K and b are constants
(typically 10 ≤ K ≤ 100 and 0.4 ≤ b ≤ 0.6).
Example:
Estimate the vocabulary size of a document containing N = 625 words.
Ans:
For simplicity, let's assume K = 10 and b = 0.5, which are commonly used values within the
typical range for these constants.
Using these values, we can calculate the vocabulary size (V) for a document with N = 625
words:
V = 10 × 625^0.5
V = 10 × 25
V = 250