
NLP Notes For Students


CS8084 NATURAL LANGUAGE PROCESSING

Prepared by
Jeni Narayanan L A
Assistant Professor
UNIT I INTRODUCTION 9

Origins and challenges of NLP – Language Modeling: Grammar-based LM, Statistical LM -

Regular Expressions, Finite-State Automata – English Morphology, Transducers for lexicon

and rules, Tokenization, Detecting and Correcting Spelling Errors, Minimum Edit Distance

The origins of Natural Language Processing (NLP) can be traced back to the 1950s and 1960s
with the development of early computational linguistics and machine translation systems. Some

key milestones include the development of the Georgetown-IBM experiment in 1954, which
translated Russian sentences into English, and the creation of the first chatbot, ELIZA, in the
mid-1960s.

Challenges in NLP stem from the complexity and ambiguity inherent in natural language. Here
are some of the key challenges:

1. Ambiguity: Natural language is highly ambiguous, with words and phrases often having
multiple meanings depending on context. Resolving this ambiguity is a major challenge in
tasks such as parsing, word sense disambiguation, and machine translation.
2. Syntax and Semantics: Understanding the syntactic and semantic structure of sentences
is crucial for NLP tasks. However, natural language exhibits complex syntactic and
semantic patterns that can be difficult for machines to parse and understand accurately.
3. Context Dependency: The meaning of a word or phrase can vary depending on the
surrounding context. Capturing and modeling context dependencies is essential for tasks
like sentiment analysis, named entity recognition, and question answering.
4. Lack of Annotated Data: Many NLP tasks require large amounts of annotated data for
training machine learning models. However, creating high-quality annotated datasets can
be time-consuming and expensive, especially for languages with limited resources.
5. Domain Specificity: Natural language varies greatly across different domains and genres
(e.g., medical texts, legal documents, social media posts). Building NLP systems that
perform well across diverse domains is challenging due to the need for domain adaptation
and specialized knowledge.
6. Commonsense Reasoning: Understanding and reasoning about commonsense knowledge
is essential for many NLP tasks, such as language understanding and generation.
However, capturing and representing commonsense knowledge in a machine-readable
format is still an ongoing research challenge.
7. Ethical and Bias Concerns: NLP systems can inadvertently perpetuate biases present in
the data they are trained on, leading to issues such as algorithmic bias and fairness
concerns. Addressing these ethical considerations is crucial for the responsible
development and deployment of NLP technologies.
Despite these challenges, significant progress has been made in NLP in recent years, driven by
advances in machine learning, deep learning, and computational linguistics. Ongoing research
continues to push the boundaries of what is possible in natural language understanding and
generation.

Language Modeling

Language modeling is a fundamental task in natural language processing (NLP) that involves
predicting the next word in a sequence of words. The goal is to capture the statistical structure of
language and generate coherent and contextually relevant text.

Here's how language modeling typically works:

1. Input Sequence: A language model takes as input a sequence of words or tokens. This
sequence can be a sentence, paragraph, or longer text.
2. Context Encoding: The input sequence is encoded into a numerical representation that
can be processed by the language model. This encoding captures the contextual
information of the input, such as the meaning of words and their relationships within the
sequence.
3. Prediction: Based on the encoded context, the language model predicts the probability
distribution over the vocabulary of possible next words. This distribution indicates the
likelihood of each word occurring given the context provided by the input sequence.
4. Sampling: To generate text, the language model can either select the word with the
highest probability (greedy decoding) or sample from the probability distribution to
introduce randomness and generate diverse text.
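
As a concrete illustration of step 4, the following minimal Python sketch contrasts greedy decoding
with sampling over a made-up next-word distribution (the words and probabilities are illustrative,
not taken from any trained model):

import random

# Hypothetical distribution over the next word after "I am going to ..."
next_word_probs = {"store": 0.45, "park": 0.30, "gym": 0.15, "moon": 0.10}

# Greedy decoding: always pick the highest-probability word
greedy_choice = max(next_word_probs, key=next_word_probs.get)

# Sampling: draw a word in proportion to its probability
sampled_choice = random.choices(list(next_word_probs),
                                weights=list(next_word_probs.values()))[0]

print("greedy: ", greedy_choice)   # always "store"
print("sampled:", sampled_choice)  # varies from run to run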

Language modeling can be approached using different techniques:

1. Statistical Language Models: These models estimate the probability of word sequences
based on statistical analysis of large text corpora. Techniques such as n-grams and
smoothing methods are commonly used in statistical language modeling.
2. Neural Language Models: Neural network-based approaches, particularly recurrent
neural networks (RNNs) and more recently transformer-based architectures like GPT
(Generative Pre-trained Transformer), have become prevalent for language modeling.
These models learn distributed representations of words and capture long-range
dependencies in text.
3. Fine-tuning: Pre-trained language models can be fine-tuned on specific tasks or domains
to improve performance on downstream tasks like text generation, machine translation,
and sentiment analysis. This fine-tuning process adapts the pre-trained model to the
characteristics of the target dataset or task.

Language modeling has numerous applications in NLP, including:

 Text generation: Generating coherent and contextually relevant text, such as in chatbots,
language generation systems, and content creation tools.
 Machine translation: Modeling the probability of target language words given the source
language context.
 Speech recognition: Estimating the likelihood of spoken words or phrases given acoustic
features.

 Information retrieval: Ranking documents based on their relevance to a query by
modeling the likelihood of word sequences in documents.
 Summarization: Generating concise summaries of longer texts by predicting the most
important words or phrases.

Overall, language modeling plays a crucial role in various NLP tasks and continues to be an
active area of research and development.

Grammar-based language models (LMs) are a class of language models that rely on explicit
grammar rules to generate or understand natural language text. These models are based on
linguistic theories and formal grammars, which define the syntax and structure of a language.

Here's how grammar-based language models typically work:

1. Grammar Rules: Grammar-based LMs start with a set of grammar rules that describe
the syntactic structure of the language. These rules define how words and phrases can be
combined to form grammatically correct sentences.
2. Parsing: When generating or understanding text, the input is parsed according to the
grammar rules to identify the syntactic structure of the input sentence. This involves
breaking down the input into its constituent parts, such as words, phrases, and clauses,
and determining how they relate to each other.
3. Rule Application: The grammar rules are then applied to the parsed input to generate or
interpret text. These rules govern how words and phrases can be combined to form valid
sentences according to the grammar of the language.
4. Constraints: Grammar-based LMs may incorporate additional constraints to ensure that
the generated text adheres to specific criteria, such as style, domain-specific vocabulary,
or semantic coherence.
5. Evaluation: The generated text is evaluated based on its grammaticality and coherence
according to the rules of the grammar. This evaluation may involve checking for
violations of grammar rules, semantic inconsistencies, or other linguistic criteria.

Grammar-based language models have several advantages:

 Explicit Linguistic Knowledge: By encoding linguistic knowledge in the form of
grammar rules, these models can capture intricate syntactic structures and language
patterns.
 Interpretability: Since the grammar rules are explicitly defined, the behavior of
grammar-based LMs is often more interpretable compared to black-box models like
neural networks.

However, grammar-based LMs also have limitations:

 Scalability: Creating comprehensive grammar rules for natural languages can be
challenging, especially for languages with complex syntax and semantics. As a result,
grammar-based LMs may struggle to handle the full richness of natural language.
 Coverage: Grammar-based LMs may not capture all linguistic phenomena, leading to
gaps in coverage and potential errors in text generation or interpretation.

Overall, while grammar-based language models provide a principled approach to natural
language processing, they are often supplemented or replaced by data-driven approaches, such as
statistical language models and neural language models, which can learn patterns directly from
data without relying on explicit grammar rules.

Example: grammar-based language modeling using a context-free grammar (CFG). A probabilistic
context-free grammar (PCFG) extends this by attaching a probability to each production rule.

Consider the following CFG:

S -> NP VP
NP -> Det N
VP -> V NP | V NP PP
PP -> P NP
Det -> "the" | "a"
N -> "cat" | "dog" | "ball"
V -> "chased" | "ate"
P -> "on" | "under" | "with"
This grammar consists of rules for generating:

1. sentences (S),
2. noun phrases (NP),
3. verb phrases (VP),
4. prepositional phrases (PP),
5. determiners (Det),
6. nouns (N),
7. verbs (V),
8. prepositions (P).

We start with the start symbol "S" and recursively apply the production rules until we derive a complete
sentence:

1. S
2. NP VP (using the rule S -> NP VP)
3. Det N VP (using the rule NP -> Det N)
4. Det N V NP PP (using the rule VP -> V NP PP)
5. Det N V Det N PP (using the rule NP -> Det N)
6. Det N V Det N P NP (using the rule PP -> P NP)
7. Det N V Det N P Det N (using the rule NP -> Det N)
8. Replacing each remaining non-terminal with a word licensed by the lexical rules (Det, N, V, P)

Now we have a complete sentence: "the cat chased the dog on a ball."
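
The same grammar can be written and explored with NLTK. This is a minimal sketch and assumes
the nltk package is installed; nltk.parse.generate.generate enumerates sentences derivable from
the grammar:

from nltk import CFG
from nltk.parse.generate import generate

grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP | V NP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'cat' | 'dog' | 'ball'
V -> 'chased' | 'ate'
P -> 'on' | 'under' | 'with'
""")

# Print the first few sentences the grammar can generate
for sentence in generate(grammar, n=5):
    print(' '.join(sentence))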

Statistical Language Modeling

Statistical language modeling is a technique used to estimate the probability distribution of word
sequences in a language based on observed data. It forms the basis for many natural language
processing tasks, such as speech recognition, machine translation, and text generation.

Here's how statistical language modeling typically works:

1. Training Data: Statistical language models are trained on large amounts of text data,
known as a corpus. This corpus contains sequences of words along with their frequencies
of occurrence.
2. n-gram Models: One of the simplest approaches to statistical language modeling is the
n-gram model, where the probability of a word sequence is estimated based on the
frequencies of occurrence of n-length sequences of words (n-grams) in the training data.
For example, a bigram model (n=2) estimates the probability of a word given its
preceding word, while a trigram model (n=3) estimates the probability of a word given its
two preceding words.
3. Estimating Probabilities: Given a sequence of words w1, w2, ..., wn, the probability of
the entire sequence P(w1, w2, ..., wn) is decomposed by the chain rule into a product of
conditional probabilities:

P(w1, w2, ..., wn) = P(w1) * P(w2|w1) * P(w3|w1, w2) * ... * P(wn|w1, ..., wn−1)

An n-gram model approximates each conditional probability using only the preceding n−1
words (e.g., P(wn|wn−1) in a bigram model). These probabilities are estimated from the
frequencies of n-grams in the training data using techniques such as maximum likelihood
estimation (MLE) or smoothed estimation methods like add-one smoothing or Kneser-Ney
smoothing.
4. Backoff and Interpolation: To address data sparsity issues and improve the robustness
of n-gram models, techniques like backoff and interpolation are often employed. Backoff
involves using lower-order n-grams when higher-order n-grams have zero counts, while
interpolation combines probabilities from different n-gram orders to smooth the
probability estimates (a small interpolation sketch follows this list).
5. Application: Once trained, a statistical language model can be used for various NLP
tasks. For example, in speech recognition, the language model helps to recognize the
most likely sequence of words given the input speech signal. In machine translation, it
guides the generation of fluent and grammatically correct translations.
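
The sketch below illustrates linear interpolation in its simplest form. The lambda weights are
illustrative placeholders; in practice they are tuned on held-out data (e.g., by deleted
interpolation or EM):

def interpolated_prob(p_unigram, p_bigram, p_trigram, lambdas=(0.1, 0.3, 0.6)):
    # Combine MLE estimates of different orders; the weights must sum to 1.
    l1, l2, l3 = lambdas
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram

# Even if the trigram was never seen (MLE = 0), the interpolated estimate
# stays non-zero thanks to the lower-order models.
print(interpolated_prob(p_unigram=0.01, p_bigram=0.05, p_trigram=0.0))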

Statistical language modeling provides a simple yet effective framework for capturing the
statistical properties of natural language. However, it has limitations such as the inability to
capture long-range dependencies and the need for large amounts of training data to achieve good
performance. More sophisticated approaches, such as neural language models, have been
developed to address these limitations and achieve state-of-the-art results in many NLP tasks.

1. Speech Recognition: In speech recognition systems, statistical language models are used
to decode the most likely sequence of words given an input speech signal. The language
model helps to distinguish between alternative word sequences and improve the accuracy
of the recognized text. For example, given the audio signal "I want to eat," the language
model may help decide between "I want two eat" and "I want to eat."
2. Autocomplete and Text Prediction: Statistical language models power autocomplete
and text prediction features in applications such as search engines, messaging apps, and
word processors. These models suggest the most likely next word or phrase based on the
context of the input text. For example, when typing "I am going to," the language model
may suggest "the store" or "the park" as likely completions.
3. Machine Translation: In machine translation systems, statistical language models help
generate fluent and grammatically correct translations by estimating the probability of
different word sequences in the target language. The language model guides the selection
of the most likely translation given the source text. For example, given the source text "Je
suis content" in French, the language model may help choose between "I am happy" and
"I am satisfied" as translations.
4. Text Generation: Statistical language models can be used to generate coherent and
contextually relevant text for various applications, such as chatbots, content creation
tools, and language generation systems. These models estimate the probability of word
sequences and use sampling techniques to generate new text based on the learned
language patterns. For example, a language model trained on news articles may generate
new headlines or article summaries.
5. Spell Checking and Correction: Statistical language models are used in spell checking
and correction systems to identify and correct spelling errors in text. These models
estimate the likelihood of word sequences and suggest corrections based on the context of
the input text. For example, when detecting the misspelled word "recieve," the language
model may suggest "receive" as a likely correction based on its training data.

These examples demonstrate how statistical language modeling is applied in various NLP tasks
to improve the accuracy, fluency, and naturalness of text processing and generation.

Here is a simple example of statistical language modeling using a bigram model. Suppose we have a
small corpus consisting of the following sentences:

1. "I like to eat apples."


2. "Apples are delicious."
3. "I like to eat bananas."

We can use this corpus to build a bigram language model, which estimates the probability of
each word given its preceding word. Here's how we can do it:

1. Tokenization: First, we tokenize the sentences into individual words, removing
punctuation and converting everything to lowercase. This gives us the following
tokenized corpus:

["i", "like", "to", "eat", "apples"]
["apples", "are", "delicious"]
["i", "like", "to", "eat", "bananas"]

2. Counting Bigrams: Next, we count the occurrences of bigrams (pairs of consecutive words) in the
tokenized corpus:

("i", "like"): 2
("like", "to"): 2
("to", "eat"): 2
("eat", "apples"): 1
("apples", "are"): 1
("are", "delicious"): 1
("eat", "bananas"): 1

3. Estimating Probabilities: We calculate the probability of each word given its preceding word
using maximum likelihood estimation (MLE). Note that each denominator counts how often the
word occurs as the first element of a bigram, which is why Count("apples") is 1: its occurrence
at the end of the first sentence starts no bigram.

P("like" | "i") = Count("i like") / Count("i") = 2 / 2 = 1.0
P("to" | "like") = Count("like to") / Count("like") = 2 / 2 = 1.0
P("eat" | "to") = Count("to eat") / Count("to") = 2 / 2 = 1.0
P("apples" | "eat") = Count("eat apples") / Count("eat") = 1 / 2 = 0.5
P("are" | "apples") = Count("apples are") / Count("apples") = 1 / 1 = 1.0
P("delicious" | "are") = Count("are delicious") / Count("are") = 1 / 1 = 1.0
P("bananas" | "eat") = Count("eat bananas") / Count("eat") = 1 / 2 = 0.5
Now we have a bigram language model that can estimate the probability of word sequences. For
example, to compute the probability of the sentence "I like to eat bananas," we multiply the
probabilities of the bigrams (taking P("i") = 1.0 for simplicity; in practice the first word would
be conditioned on a sentence-start marker):
P("i") * P("like" | "i") * P("to" | "like") * P("eat" | "to") * P("bananas" | "eat")
= 1.0 * 1.0 * 1.0 * 1.0 * 0.5
= 0.5
This shows that, according to our bigram model, the probability of the sentence "I like to eat
bananas" is 0.5.

Regular expressions (regex) are powerful tools used in natural language processing (NLP) for
pattern matching and text processing tasks. They allow for efficient searching, extraction, and
manipulation of text based on specified patterns. Here are some common applications of regular
expressions in NLP:

1. Tokenization: Regular expressions can be used to split a text into tokens, such as words
or sentences. For example, \w+ matches one or more word characters, effectively
tokenizing words in a sentence.
2. Text Cleaning: Regular expressions are useful for cleaning and preprocessing text data
by removing unwanted characters, punctuation, or formatting. For instance, \W matches
any non-word character, which can be used to remove punctuation marks from text.
3. Pattern Matching: Regular expressions enable the extraction of specific patterns or
entities from text data. For example, \b\d{3}-\d{3}-\d{4}\b matches phone numbers
in the format XXX-XXX-XXXX.
4. Named Entity Recognition (NER): Regular expressions can be used as simple rules for
identifying named entities such as dates, emails, or URLs in text. For example, a regex
pattern can match strings that resemble email addresses
(\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b).
5. Information Extraction: Regular expressions can aid in extracting structured
information from unstructured text, such as dates, addresses, or numerical data. For
instance, \b\d{2}/\d{2}/\d{4}\b matches dates in the format MM/DD/YYYY.
6. Text Normalization: Regular expressions can be used to normalize text by converting it
to a standard format. For example, \b[A-Z]+\b matches all uppercase words, which can
be converted to lowercase for normalization.
7. Text Segmentation: Regular expressions can help in segmenting text into meaningful
units, such as paragraphs or sections. For example, \n\n matches two consecutive
newline characters, which can be used to split text into paragraphs.

While regular expressions are powerful, they also have limitations. They may not handle
complex patterns or variations in text well, and writing and maintaining complex regex patterns
can be challenging. Additionally, regular expressions are often not robust to noisy or ambiguous
text data. In such cases, more advanced techniques, such as rule-based systems or machine
learning models, may be more suitable.

Example

Here is a simple regular expression in Python that matches email addresses:
import re

# Sample text containing email addresses
text = "Contact us at info@example.com or support@company.co.uk for assistance."

# Regular expression pattern to match email addresses
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'

# Find all email addresses in the text
matches = re.findall(pattern, text)

# Print the matches
print(matches)

Output:

['info@example.com', 'support@company.co.uk']

In this example:

 The regular expression pattern r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'
consists of several components:
o \b: Word boundary anchor to ensure that the email address starts and ends with a
word boundary.
o [A-Za-z0-9._%+-]+: Matches one or more characters that can occur in the local
part of the email address (before the '@' symbol), including letters, digits, dots,
underscores, percent signs, plus signs, and hyphens.
o @: Matches the '@' symbol.
o [A-Za-z0-9.-]+: Matches one or more characters that can occur in the domain
part of the email address (after the '@' symbol), including letters, digits, dots, and
hyphens.
o \.: Matches a literal dot '.' (used to separate domain levels).
o [A-Z|a-z]{2,}: Matches two or more letters (upper or lowercase) representing
the top-level domain (TLD), such as 'com', 'net', or 'uk'. (Inside a character
class the '|' is treated as a literal pipe, so this is more conventionally written
[A-Za-z]{2,}.)
o \b: Word boundary anchor to ensure that the email address ends with a word
boundary.
 The re.findall() function is used to find all occurrences of the pattern in the given
text.
 The matches found are printed, which are the email addresses present in the text.
Finite State Automata (FSA), also called Finite State Machines (FSM), are models used in computer
science and mathematics to represent systems that can be in only one of a finite number of states at
any given time. These automata are widely used in various fields, including natural language
processing, compiler design, and digital circuit design.

1. Definition: A Finite State Automaton is defined by a finite set of states, a finite set of
input symbols, a transition function that describes how the automaton transitions between
states based on input symbols, a start state, and a set of accept states.
2. Types of FSAs:
o Deterministic Finite Automaton (DFA): In a DFA, for each state and input
symbol, there is exactly one transition leading to a next state. DFAs are
commonly used in lexical analysis and pattern matching.
o Nondeterministic Finite Automaton (NFA): In an NFA, there can be multiple
transitions for a given state and input symbol, or there can be ε-transitions
(transitions without consuming an input symbol). NFAs are often used in regular
expression matching.
3. Operations on FSAs:
o Union, intersection, and complementation of automata.
o Concatenation and Kleene star (closure) of automata.
o Minimization of DFAs to reduce the number of states while preserving the
language recognized by the automaton.
4. Applications:
o Regular expression matching: FSAs are used to implement regular expression
engines.
o Lexical analysis: DFAs are used to recognize tokens in programming languages.
o Pattern recognition: FSAs can be used to model and recognize patterns in data.
5. Limitations:
o FSAs are limited in their expressive power compared to more complex automata
models like pushdown automata and Turing machines.
o They can only recognize regular languages, which are a subset of the languages
recognized by context-free grammars.

Let's consider a simple example of a deterministic finite automaton (DFA) that recognizes strings
over the alphabet {0, 1} that end with "01".

Here's the DFA:

1. States: {q0, q1, q2}
2. Alphabet: {0, 1}
3. Start state: q0
4. Accept states: {q2}
5. Transition function:
o δ(q0, 0) = q1, δ(q0, 1) = q0
o δ(q1, 0) = q1, δ(q1, 1) = q2
o δ(q2, 0) = q1, δ(q2, 1) = q0

 The start state is q0.
 The accept state is q2.
 The DFA transitions from state to state based on the input symbols. For example, if it
reads a 0 in state q0, it moves to q1 (a possible start of "01"); if it reads a 1 in q0, it stays
in q0. From q1, reading a 1 leads to the accept state q2, while a 0 keeps it in q1, and so on.
 A string is accepted if processing ends in state q2, which happens exactly when the string
ends with "01".
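
A minimal Python simulation of this DFA (using the transition table defined above) might look
like the following:

# Transition table: (current state, input symbol) -> next state
TRANSITIONS = {
    ('q0', '0'): 'q1', ('q0', '1'): 'q0',
    ('q1', '0'): 'q1', ('q1', '1'): 'q2',
    ('q2', '0'): 'q1', ('q2', '1'): 'q0',
}

def accepts(string):
    state = 'q0'                      # start state
    for symbol in string:
        state = TRANSITIONS[(state, symbol)]
    return state == 'q2'              # q2 is the only accept state

print(accepts("1101"))  # True: the string ends with "01"
print(accepts("0110"))  # False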

English morphology is an essential aspect of natural language processing (NLP) that deals with the
structure and formation of words in the English language. It encompasses various morphological
processes, such as inflection, derivation, compounding, and others. Understanding English
morphology is crucial for tasks like tokenization, stemming, lemmatization, and part-of-speech
tagging. Here's a brief overview of some key concepts in English morphology and their
relevance in NLP:

1. Inflection: Inflection involves adding affixes (prefixes, suffixes, infixes) to a base word
to indicate grammatical features such as tense, aspect, mood, number, case, and gender.
For example:
o Walk (base form) -> walks, walked, walking (inflected forms)
o Cat (singular) -> cats (plural)
2. Derivation: Derivation involves forming new words by adding affixes to base words,
resulting in changes in meaning or word class. For example:
o Happy (adjective) -> happiness (noun)
o Nation (noun) -> national (adjective)
3. Compounding: Compounding involves combining two or more words to form a new
word with a new meaning. Compounds can be open (e.g., ice cream), hyphenated (e.g.,
well-being), or closed (e.g., keyboard).
4. Stemming: Stemming is the process of reducing words to their base or root forms by
removing affixes. It aims to normalize words so that different inflected forms map to the
same stem. For example:
o Running, runs -> run (stem)
o Cats, cat's -> cat (stem)
5. Lemmatization: Lemmatization is similar to stemming but considers the context of
words to determine their base forms (lemmas). It typically involves dictionary lookup and
morphological analysis to ensure that the lemma is a valid word. For example:
o Running, runs -> run (lemma)
o Better, best -> good (lemma)
6. Part-of-Speech Tagging: Part-of-speech tagging assigns grammatical categories (nouns,
verbs, adjectives, etc.) to words in a sentence. Morphological features play a significant
role in determining the part of speech of a word. For example:
o Running (verb)
o Cat (noun)
o Happy (adjective)
7. Morphological Analysis: Morphological analysis involves breaking down words into
their constituent morphemes (the smallest units of meaning). This process is essential for
understanding the internal structure of words and for various NLP tasks.

In NLP, algorithms and models are developed to handle these morphological processes
efficiently, enabling tasks such as text normalization, syntactic analysis, semantic analysis, and
more. Proper handling of English morphology enhances the accuracy and effectiveness of NLP
systems across a wide range of applications.

English morphology in natural language processing (NLP) involves analyzing the structure and
formation of words in the English language. Morphology deals with the internal structure of
words and how they are formed from smaller meaningful units called morphemes. Here's an
example illustrating English morphology:

Consider the word "unhappiness."

1. Root: The root or base of the word is "happy."
2. Affixes:
o "un-" is a prefix indicating negation or reversal.
o "-ness" is a suffix indicating the quality or state of being.
3. Morphemes:
o "un-" (prefix)
o "happi-" (root)
o "-ness" (suffix)
4. Morphological Analysis:
o Prefix: "un-": negation
o Root: "happy": feeling or showing pleasure
o Suffix: "-ness": state or quality
5. Word Formation: The word "unhappiness" is formed by combining the prefix "un-"
with the root "happy" and the suffix "-ness," resulting in the meaning "the state of not
being happy" or "lack of happiness."

In NLP, understanding English morphology is crucial for various tasks, including:

1. Tokenization: Breaking text into words or tokens, considering morphological
boundaries.
2. Stemming: Reducing words to their root form (stem) by removing affixes. A typical
suffix-stripping stemmer such as the Porter stemmer reduces "unhappiness" to "unhappi";
stripping the prefix as well would yield "happi."
3. Lemmatization: Similar to stemming but produces the base or dictionary form of a word
(lemma). Since "unhappiness" is itself a dictionary word, a lemmatizer typically leaves it
unchanged, though its morphological base is "happiness" (a small NLTK sketch follows at
the end of this section).
4. Part-of-Speech Tagging: Identifying the grammatical category of each word based on its
morphology. For example, "unhappiness" would be tagged as a noun.
5. Named Entity Recognition (NER): Identifying named entities like person names,
organization names, etc., which often have specific morphological patterns.

Understanding English morphology helps NLP systems better comprehend and process text,
enabling tasks such as sentiment analysis, machine translation, information retrieval, and more.
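
A minimal sketch of stemming and lemmatization with NLTK follows. It assumes nltk is installed
and that the WordNet data has been downloaded (e.g., via nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming: crude suffix stripping
for word in ["running", "runs", "unhappiness"]:
    print(word, "->", stemmer.stem(word))        # run, run, unhappi

# Lemmatization: dictionary-based and part-of-speech aware
print(lemmatizer.lemmatize("running", pos='v'))  # run (verb reading)
print(lemmatizer.lemmatize("cats"))              # cat (default noun reading)
print(lemmatizer.lemmatize("unhappiness"))       # unhappiness (already a lemma)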

Transducers for lexicon and rules

1. Lexical Transduction:
o Lexical transduction refers to the process of mapping words from one form to
another based on specific rules or patterns. This could involve transformations
such as stemming or lemmatization, where words are reduced to their base or
dictionary forms.
o For example, in English morphology, converting the word "running" to its base
form "run" involves a lexical transduction rule that removes the suffix "-ing."
2. Rules for Lexical Transduction:
o Lexical transduction rules are typically based on linguistic knowledge and
patterns observed in the language. These rules define how words are transformed
from one form to another.
o Rules can involve the application of affix stripping, suffix removal, or applying
irregular transformation patterns.
o Example lexical transduction rule: "If a word ends with '-ing', remove the suffix to
obtain the base form" (a small sketch of this rule follows this list).
3. Grammatical Transduction:
o Grammatical transduction refers to the process of transforming sentences or
phrases from one grammatical form to another. This could involve tasks such as
converting active voice to passive voice, changing tense, or altering sentence
structure.
o Example: Converting the sentence "The cat chased the mouse" from active voice
to passive voice results in "The mouse was chased by the cat."
4. Rules for Grammatical Transduction:
o Grammatical transduction rules are based on syntactic and grammatical structures.
These rules define how sentences or phrases are transformed while preserving
their meaning.
o Rules can involve rearranging word order, changing verb conjugation, or altering
grammatical features.
o Example grammatical transduction rule: "To convert active voice to passive
voice, move the object of the active sentence to the subject position and change
the verb form to the passive voice."
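
The "-ing" rule mentioned above can be sketched as a single rewrite, purely for illustration;
a real morphological transducer would combine many such rules with a lexicon of exceptions
(e.g., to map "runn" back to "run"):

import re

def strip_ing(word):
    # Toy lexical transduction rule: if a word ends with "-ing", remove the suffix.
    return re.sub(r'ing$', '', word)

print(strip_ing("walking"))  # "walk"
print(strip_ing("running"))  # "runn" -- a lexicon/rule set would correct this to "run"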

Tokenization is a fundamental task in natural language processing (NLP) that involves breaking
down a text into smaller units called tokens. These tokens can be words, phrases, symbols, or
other meaningful elements, depending on the context and the specific requirements of the task at
hand. Here's an overview of tokenization in NLP:

1. Word Tokenization:
o Word tokenization, also known as word segmentation or word splitting, involves
dividing a text into individual words based on whitespace or punctuation
boundaries.
o Example: The sentence "Tokenization is an important NLP task" can be tokenized
into ["Tokenization", "is", "an", "important", "NLP", "task"].
2. Sentence Tokenization:
o Sentence tokenization involves splitting a text into individual sentences based on
punctuation marks like periods, exclamation marks, and question marks.
o Example: The paragraph "This is the first sentence. This is the second sentence!
And this is the third sentence?" can be tokenized into ["This is the first sentence.",
"This is the second sentence!", "And this is the third sentence?"].
3. Subword Tokenization:
o Subword tokenization involves dividing words into smaller units, such as
morphemes or character n-grams. This approach is commonly used in languages
with complex morphology or for handling out-of-vocabulary words.
o Example: In subword tokenization, the word "tokenization" can be split into ["to",
"ken", "iza", "tion"] or ["token", "iza", "tion"].
4. Tokenization Challenges:
o Tokenization can be challenging for languages with complex word boundaries or
agglutinative morphology.
o Ambiguity in tokenization can arise due to punctuation marks, abbreviations,
contractions, and compound words.
5. Tokenization Libraries:
o Various NLP libraries provide built-in functions for tokenization, including
NLTK (Natural Language Toolkit), spaCy, and the tokenization module in the
TensorFlow and PyTorch frameworks.
6. Preprocessing:
o Tokenization is typically the first step in text preprocessing, followed by tasks
such as lowercasing, stemming, lemmatization, and stop word removal.
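
A minimal word- and sentence-tokenization sketch using NLTK (assumes nltk is installed and the
'punkt' models have been downloaded via nltk.download('punkt')):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Tokenization is an important NLP task. This is the second sentence!"

print(sent_tokenize(text))  # sentence tokens
print(word_tokenize(text))  # word tokens, with punctuation split off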
Detecting and correcting spelling errors is an important task in natural language processing
(NLP) and can significantly improve the accuracy and readability of text. Here's an overview of
how spelling errors are detected and corrected:

1. Spell Checking:
o Spell checking involves identifying words in a text that are not found in a
dictionary or known vocabulary.
o Spell checkers compare each word in the text against a dictionary or a list of
known words to determine if it is spelled correctly.
o Words that are not found in the dictionary are flagged as potential spelling errors.
2. Candidate Generation:
o Once spelling errors are detected, candidate words are generated as potential
replacements for the misspelled words.
o Candidate generation techniques may involve:
 Generating possible corrections by applying operations such as insertion,
deletion, substitution, or transposition of characters (see the sketch after this
list).
 Using statistical language models to suggest the most likely replacements
based on context.
3. Candidate Ranking:
o After generating candidate replacements, a ranking algorithm is applied to score
and rank the candidate corrections.
o Ranking algorithms consider factors such as:
 Edit distance: How many edits are required to transform the misspelled
word into each candidate.
 Language model probabilities: How likely each candidate is based on the
surrounding context.
 Frequency of occurrence: How frequently each candidate appears in a
large corpus of text.
4. Correction Selection:
o The correction selection process involves choosing the highest-ranked candidate
as the replacement for the misspelled word.
o In some cases, multiple candidate corrections may be suggested to the user for
manual selection.
5. Contextual Spelling Correction:
o Contextual spelling correction takes surrounding context into account when
detecting and correcting spelling errors.
o Contextual information, such as adjacent words, grammar, syntax, and semantics,
can help improve the accuracy of spelling correction.
6. Evaluation and Feedback:
o Spell checkers are often evaluated using manually annotated datasets or user
feedback to assess their accuracy and effectiveness.
o Continuous improvement based on user feedback helps refine and enhance
spelling correction algorithms over time.
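
As an illustration of candidate generation, the sketch below produces every string one edit away
from a word (in the spirit of Peter Norvig's well-known spelling corrector); a real system would
then keep only candidates found in a dictionary and rank them with a language model:

import string

def edits1(word):
    # All strings reachable from `word` by one deletion, transposition,
    # substitution, or insertion.
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

print("receive" in edits1("recieve"))  # True: one transposition away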

In natural language processing (NLP), the minimum edit distance (also known as Levenshtein
distance) is a metric used to quantify the similarity between two strings by measuring the
minimum number of single-character edits (insertions, deletions, or substitutions) required to
transform one string into the other. It's a fundamental concept used in various NLP tasks, such as
spell checking, text correction, and approximate string matching. Here's how it works:

1. Definition:
o Given two strings, A of length m and B of length n, the minimum edit distance
between them, denoted as D(A, B), is the minimum number of edits required to
transform string A into string B.
2. Operations:
o Insertion: Add a character to string A.
o Deletion: Remove a character from string A.
o Substitution: Replace a character in string A with another character.
3. Dynamic Programming Algorithm:
o The minimum edit distance can be efficiently computed using dynamic
programming.
o The algorithm fills in a matrix where each cell (i, j) represents the minimum edit
distance between the substrings A[0:i] and B[0:j].
o The algorithm iterates through each position in the matrix, updating the values
based on the minimum cost of the possible edit operations.
o The final value in the bottom-right corner of the matrix represents the minimum
edit distance between the two strings.
4. Applications:
o Spell Checking: Determine the closest words to a misspelled word by computing
the minimum edit distance between the misspelled word and all words in a
dictionary.
o Approximate String Matching: Find strings in a database that are similar to a
given query string by computing the minimum edit distance between the query
string and database strings.
o OCR (Optical Character Recognition): Correct errors in OCR output by
comparing the recognized text with the original text using minimum edit distance.
5. Example:
o For example, consider the strings "kitten" and "sitting":
 The minimum edit distance between them is 3.
 One possible sequence of edit operations is: substitute 'k' with 's',
substitute 'e' with 'i', and insert 'g' at the end.
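
A minimal dynamic-programming implementation (the classic Wagner-Fischer algorithm with unit
costs) reproduces the "kitten"/"sitting" result:

def min_edit_distance(a, b):
    m, n = len(a), len(b)
    # d[i][j] = minimum edits to turn a[:i] into b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                        # delete all characters of a[:i]
    for j in range(n + 1):
        d[0][j] = j                        # insert all characters of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]

print(min_edit_distance("kitten", "sitting"))  # 3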

UNIT II WORD LEVEL ANALYSIS 9

Unsmoothed N-grams, Evaluating N-grams, Smoothing, Interpolation and Backoff – Word
Classes, Part-of-Speech Tagging, Rule-based, Stochastic and Transformation-based
tagging, Issues in PoS tagging – Hidden Markov and Maximum Entropy models.

Unsmoothed N-grams

1. Definition:
o An n-gram is a contiguous sequence of n items (words, characters, or tokens)
within a larger sequence of text.
o Unsmoothed n-grams involve calculating the probability of observing each n-
gram in the training data directly from the counts of those n-grams, without any
adjustments for unseen or rare events.
2. Probability Estimation:
o Given a corpus of text, the probability of a word sequence is estimated by
counting the occurrences of each n-gram in the training data and dividing by the
total count of all n-grams.
o For example, the probability of observing the word sequence "the cat sat" using
trigrams would be estimated by counting the number of occurrences of the trigram
"the cat sat" and dividing by the total count of all trigrams in the corpus.
3. Challenges:
o Unsmoothed n-grams can suffer from data sparsity issues, especially for higher-
order n-grams or in corpora with limited training data.
o If an n-gram is not observed in the training data, its probability will be zero,
which can lead to severe underestimation of the likelihood of unseen word
sequences.
4. Usage:
o Despite their limitations, unsmoothed n-grams can still be useful in certain
contexts, particularly for small or specialized corpora where data sparsity is less
of an issue.
o Unsmoothed n-grams can serve as a baseline model for comparison with more
sophisticated language models that incorporate smoothing techniques.
5. Evaluation:
o The performance of unsmoothed n-gram models can be evaluated using standard
metrics such as perplexity or accuracy on a held-out test set.
o Perplexity measures how well the model predicts the test data and can indicate the
effectiveness of the language model in capturing the distribution of word
sequences in the training corpus.
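
The zero-probability problem described above can be seen directly in a tiny unsmoothed bigram
model (the toy corpus below is made up for illustration):

from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_mle(prev, word):
    # Unsmoothed maximum likelihood estimate of P(word | prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("the", "cat"))  # 2/3: the bigram "the cat" was observed twice
print(p_mle("the", "dog"))  # 0.0: unseen bigram, so any sentence containing it gets probability 0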
