unit1

Bhilai Institute of Technology, Durg
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
Natural Language Processing

UNIT 1: Introduction to NLP
Prepared By- Dr Shikha Pandey
Dr. Shikha Pandey, BIT, Durg Page 1

Syllabus
UNIT – I (CO1)
Introduction: A computational framework for natural language, description of English or
an Indian language in the frame work, lexicon, algorithms and data structures for
implementation of the framework, Finite state automata, the different analysis levels
used for NLP (morphological, syntactic, semantic, pragmatic, Recursive and augmented
transition networks. Applications like machine translations. [8Hrs]
UNIT – II (CO2)
Word level and syntactic analysis: Word Level Analysis: Regular Expressions, Finite-State
Automata, Morphological Parsing, Spelling Error Detection and correction, Words and Word
classes, Part-of Speech Tagging. Syntactic Analysis: Context-free Grammar, Constituency,
Parsing-Probabilistic Parsing. Machine readable dictionaries and lexical databases, RTN,
ATN [7Hrs]
UNIT – III (CO3)
Semantic analysis: Semantic Analysis: Meaning Representation, Lexical Semantics,
Ambiguity, Word Sense Disambiguation. Discourse Processing: cohesion, Reference
Resolution, Discourse Coherence and Structure. Knowledge Representation, reasoning.
[7Hrs]
UNIT – IV (CO4)
Natural Language generation: Natural Language Generation (NLG): Architecture of NLG
Systems, Generation Tasks and Representations, Application of NLG. Machine Translation:
Problems in Machine Translation, Characteristics of Indian Languages, Machine Translation
Approaches, Translation involving Indian Languages. [7Hrs]
Seth Balkrishan Memorial
Bhilai Institute of Technology, Durg
(An Autonomous Institute affiliated to CSVTU Bhilai)
SYLLABUS
B. Tech. (Computer Science & Engineering) Seventh Semester
July 2023 1.00 Applicable for
Chairman (AC) Chairman (BoS) Date of Release Version AY 2023-24 Onwards
UNIT – V (CO5)
Information retrieval and lexical resources Information Retrieval: Design features of
Information Retrieval Systems, Classical, Non-classical, Alternative Models of
Information Retrieval, valuation Lexical Resources: World Net, Frame Net, Stemmers, POS
Tagger. [7Hrs]
Books etc.
 Natural Language Understanding: James Allan

 Speech and NLP: Jurafsky and Martin
 Foundations of Statistical NLP: Manning and Schutze
 NLP a Paninian Perspective: Bharati, Cahitanya and Sangal

UNIT – I (CO1)
Introduction: A computational framework for natural language, description of English or an Indian
language in the frame work, lexicon, algorithms and data structures for implementation of the
framework, Finite state automata, the different analysis levels used for NLP (morphological,
syntactic, semantic, pragmatic, Recursive and augmented transition networks. Applications like
machine translations.
NLP is the branch of computer science focused on developing systems that allow
computers to communicate with people using everyday language.
NLP having 2 Goals
Science Goal : Understand the way language operates
 Engineering Goal : Build systems that analyse and generate language
reduce the man machine gap.
The primary objective of Natural Language Processing (NLP) in AI is to enable

computers to understand, interpret, and generate human language, such as text or
speech. NLP aims to bridge the gap between human communication and computer
understanding, allowing humans to interact with machines more naturally and
efficiently.

A COMPUTATIONAL FRAMEWORK FOR NATURAL LANGUAGE
A computational framework for natural language involves creating a system that can process and understand human
language through computational methods. This typically encompasses several components and methodologies, such
as:
Components
1. Text Preprocessing
o Tokenization: Splitting text into words, phrases, symbols, or other meaningful elements.
o Normalization: Converting text to a standard form, e.g., lowercasing, stemming, and lemmatization.
o Stop-word Removal: Filtering out common words that may not be useful in analysis, like "and",
"the", etc.
2. Syntax Analysis
o Part-of-Speech Tagging: Identifying the grammatical parts of speech in the text (nouns, verbs,
adjectives, etc.).
o Parsing: Analyzing the syntactic structure of sentences to understand the relationships between
words.
3. Semantics
o Named Entity Recognition (NER): Identifying and classifying entities such as names, dates, and
places.
o Semantic Role Labeling: Determining the roles that words play in a sentence.
o Word Sense Disambiguation: Determining which meaning of a word is being used in context.
4. Pragmatics
o Coreference Resolution: Identifying when different words refer to the same entity.
o Discourse Analysis: Understanding the flow of text and the relationships between sentences and
paragraphs.

LEXICON
A lexicon is a comprehensive collection of words and their meanings within a particular language or a subject area.
It can be thought of as a dictionary, but it may also include additional information beyond simple definitions, such
as grammatical properties, usage examples, and relationships between words. Here are some key aspects of a
lexicon:
Key Aspects of a Lexicon
1. Vocabulary List:
o A comprehensive list of words or lexical items in a language.
2. Definitions:
o Each word in a lexicon is typically accompanied by its meanings or definitions.
3. Grammatical Information:
o Information such as parts of speech (noun, verb, adjective, etc.), inflections (plural forms, tenses),
and syntactic roles.
4. Pronunciation:
o Many lexicons include phonetic transcriptions to indicate how words are pronounced.
5. Etymology:
o Historical origins and evolution of words.
6. Usage Examples:
o Sentences or phrases illustrating how words are used in context.
Types of Lexicons
1. General Lexicons:
o Comprehensive collections like dictionaries that cover the entire vocabulary of a language.
o Example: Oxford English Dictionary.
2. Specialized Lexicons:
o Focused on specific fields or domains, containing terms and jargon relevant to that area.
o Example: Medical lexicon, legal lexicon.
3. Computational Lexicons:
o Used in natural language processing (NLP) and include additional information to assist in
computational tasks.
o Example: WordNet, which provides synonyms, antonyms, and hierarchical relationships between
words.
Uses in Computational Linguistics
 Text Analysis: Lexicons help in understanding and analyzing text by providing meanings and context.
 Machine Translation: Essential for translating words and phrases between languages.
 Speech Recognition: Pronunciation data from lexicons aids in recognizing spoken words.
 Information Retrieval: Improve search algorithms by understanding synonyms and related terms.







SOME POINTS ABOUT INDIAN LANGUAGE:
India is a multi-lingual country with great linguistic and cultural diversities
 22 official languages mentioned in the Indian constitution
 However, Census of India in 2001 reported-
 122 major languages
 1,599 other regional languages
 2,371 scripts
 30 languages are spoken by more than one million native speakers
 122are spoken by more than 10,000 people
 20% understand English
 80%cannot understand
TDIL: Govt. of India
 Technology Development for Indian Languages (TDIL) Programme
Objective:
 developing Information Processing Tools and Techniques to facilitate human machine interaction without
language barrier;
 creating and accessing multilingual knowledge resources; and
 integrating them to develop innovative user products and services
TDIL: Some major initiatives
 Development of English to Indian Language Machine Translation

Anuvadaksh
English to Hindi/Marathi/Bangla/Oriya/Tamil/Urdu/ Gujrati /Bodo
 Development of English to Indian Language Machine Translation System with Angla Bharti Technology
English to Bangla/Punjabi/ Malaylam /Urdu/Hindi/Telugu
 Development of Indian Language to Indian Language Machine Translation System Sampark 18 pairs of
languages
Hindi to Bengali, Bengali to Hindi, Marathi to Hindi, Hindi to Marathi, Hindi to
Punjabi, Punjabi to Hindi, Hindi to Tamil, Tamil to Hindi, Hindi to Kannada, Kannada
to Hindi, Hindi to Telugu, Telugu to Hindi, Hindi to Urdu, Urdu Hindi, Malaylam to
Tamil, Tamil to Malaylam Tamil to Telugu, Telugu to Tamil

Govt. Portal: MyGov.in
 Citizen centric platform empowers people to connect with the Government contribute towards good
governance
 Unique first of its kind participatory governance initiative involving the common citizen at large
 Idea is to bring the government closer to the common man by the use of online platform creating an interface
for healthy exchange of ideas and views involving the common citizen and experts
 Ultimate goal is to contribute to the social and economic transformation of India
Was launched on July 26 2014 by the Hon’ble PM
DESCRIPTION OF ENGLISH OR AN INDIAN LANGUAGE IN THE

FRAME WORK
Describing English or an Indian language within a computational framework involves highlighting the unique
linguistic features and processing steps specific to that language. Let's take Hindi, one of the widely spoken Indian
languages, as an example and compare it to English within the computational framework.
Text Preprocessing
English:
 Tokenization: Straightforward, based on spaces and punctuation.

 Normalization: Lowercasing, handling contractions (e.g., "isn't" to "is not").
 Stop-word Removal: Using standard English stop-word lists.
Hindi:
 Tokenization: More complex due to the lack of spaces between words in certain contexts and the use of
compound words.
 Normalization: Handling variations in spellings, removing diacritics, and converting script variations.
 Stop-word Removal: Using lists specific to Hindi, which differ significantly from English.
Syntax Analysis
English:
 Part-of-Speech Tagging: Generally straightforward with well-defined rules.

 Parsing: Parsing structures are based on a relatively fixed word order (Subject-Verb-Object).
Hindi:
 Part-of-Speech Tagging: More complex due to rich morphology and inflectional variations.
 Parsing: Flexible word order (Subject-Object-Verb), requiring different parsing strategies and algorithms.
Semantics
English:

 Named Entity Recognition (NER): Standard models work well but need customization for different
dialects.
 Word Sense Disambiguation: Relies on context and often large corpora to distinguish meanings.
Hindi:
 Named Entity Recognition (NER): Requires models trained specifically on Hindi datasets.
 Word Sense Disambiguation: More challenging due to the richness of homonyms and polysemous words.
Pragmatics
English:
 Coreference Resolution: Well-studied, with many datasets and models available.

 Discourse Analysis: Extensive research and tools are available for understanding the flow of English text.
Hindi:
 Coreference Resolution: Limited resources and datasets, requiring more specialized approaches.
 Discourse Analysis: Still an emerging field with fewer tools and resources compared to English.
Tools and Libraries
 For English:
o NLTK: Comprehensive support for English.
o spaCy: Strong English support with pre-trained models.
o Transformers by Hugging Face: Pre-trained models like BERT, GPT,
etc.
 For Hindi:
o Indic NLP Library: Tools for text normalization, tokenization, and
more.
o iNLTK: Specifically designed for Indian languages, including Hindi.
o Transformers by Hugging Face: Multilingual models like mBERT,
XLM-R, and IndicBERT.

The grammatical structure of Hindi language is as follows:
1. Word order: Subject-Object-Verb (SOV)
2. Nouns:
- Gender: Masculine, Feminine, Neuter
- Number: Singular, Plural
- Case: Nominative, Accusative, Genitive, Dative, Ablative
3. Verbs:
- Tense: Present, Past, Future
- Aspect: Simple, Perfect, Continuous
- Mood: Indicative, Imperative, Subjunctive
- Voice: Active, Passive
4. Adjectives:
- Agree with the noun they modify in gender, number, and case
5. Pronouns:
- Personal: First person (I, we), Second person (you), Third person (he, she, it, they)
- Demonstrative: This, that, these, those
- Interrogative: Who, what, when, where, why
6. Postpositions:
- Used to indicate location, direction, time, etc.
- Examples: में (in), पर (on), से (from), तक (until)
7. Conjunctions:
- Used to connect words, phrases, or clauses
- Examples: और (and), ऱेककन (but), या (or)
8. Particles:
- Used to indicate various grammatical relationships
- Examples: ने (indicating the doer of the action), को (indicating the indirect object)

ALGORITHMS AND DATA STRUCTURES FOR IMPLEMENTATION OF THE
FRAMEWORK FOR NLP
Implementing a natural language processing (NLP) framework requires various algorithms and data structures to
handle text preprocessing, syntactic analysis, semantic analysis, and pragmatic analysis. Below are key algorithms
and data structures commonly used in NLP:
Text Preprocessing
1. Tokenization:-
o Algorithms: Regular expressions, NLTK's word_tokenize, spaCy's tokenizer.
o Data Structures: Lists (to store tokens), Trie (for efficient dictionary-based tokenization).
2. Normalization
o Algorithms: Lowercasing, stemming (Porter Stemmer, Snowball Stemmer), lemmatization
(WordNet Lemmatizer).
o Data Structures: Sets (for stop words), HashMap/Dictionary (for mapping original to normalized
forms).
3. Stop-word Removal
o Algorithms: Simple filtering using predefined lists.
o Data Structures: HashSet (to store stop words for O(1) lookups).
Syntax Analysis
1. Part-of-Speech Tagging
o Algorithms: Hidden Markov Model (HMM), Conditional Random Fields (CRF), Neural Networks
(LSTM, BiLSTM-CRF).
o Data Structures: Vectors (for feature representation), Transition Matrices (for HMM), Neural
Network layers.
2. Parsing
o Algorithms: Dependency Parsing (Transition-based parsers like MaltParser, Graph-based parsers),
Constituency Parsing (CKY algorithm, Earley parser).
o Data Structures: Parse Trees, Dependency Graphs, Stack (for transition-based parsing).
Semantics
1. Named Entity Recognition (NER)

o Algorithms: CRF, BiLSTM-CRF, Transformers (BERT, spaCy NER model).
o Data Structures: BIO tag sequences, Neural Network layers.
2. Word Sense Disambiguation
o Algorithms: Lesk algorithm, Knowledge-based methods, Supervised learning (SVM, Neural
Networks), Contextual embeddings (BERT).
o Data Structures: Graphs (for sense networks), Vectors (for word embeddings).
Pragmatics
1. Coreference Resolution
o Algorithms: Rule-based approaches, Machine Learning models (Neural Networks, mention-pair
models), End-to-end neural models (e.g., SpanBERT).
o Data Structures: Clusters (for coreference chains), Graphs (to represent entity relations).
2. Discourse Analysis
o Algorithms: Rhetorical Structure Theory (RST), Discourse Parsing, Neural models (BERT-based).

o Data Structures: Discourse Trees, Vectors (for embedding representations).
Feature Extraction
1. Algorithms: TF-IDF, Word Embeddings (Word2Vec, GloVe), Contextual Embeddings (BERT, GPT).
o Data Structures: Sparse Matrices (for TF-IDF), Dense Vectors (for embeddings), HashMaps (for
vocabulary mapping).
Machine Learning Models
1. Algorithms: Naive Bayes, Support Vector Machines (SVM), Decision Trees, Neural Networks (RNN,
LSTM, Transformer models).
o Data Structures: Matrices (for weight parameters), Tensors (for deep learning models), Graphs (for
neural network architectures).

FINITE STATE AUTOMATA
Finite State Automata (FSA) plays a significant role in Natural Language Processing (NLP) as a fundamental
computational model for processing regular languages. In NLP, FSA is used in various applications, including:
1. Tokenization: FSA is used to split text into individual words or tokens, based on patterns like whitespace,
punctuation, or special characters.
2. Lexical analysis: FSA helps analyze words into their constituent parts, such as roots, prefixes, and suffixes.
3. Pattern matching: FSA is used to match patterns in text, like identifying keywords, phrases, or sentences.
4. Text normalization: FSA helps normalize text data, converting all words to their base form (e.g., "running"
becomes "run").
5. Language modeling: FSA is used in language models to predict the next word in a sequence, based on the context.
6. Regular expression matching: FSA is the basis for regular expressions, which are used to match complex patterns
in text.
7. Part-of-speech tagging: FSA helps identify the grammatical category of words (e.g., noun, verb, adjective).
8. Named entity recognition: FSA is used to identify and extract specific entities like names, locations, and
organizations.
9. * Sentiment analysis*: FSA can be used to analyze text for sentiment, by matching patterns that indicate positive
or negative sentiment.
FSA's strengths in NLP include:
 Efficient processing of large datasets

 Ability to handle ambiguity and uncertainty
 Capacity to model complex patterns and relationships
 Simlicity
 flexibility
While FSA has limitations in handling context-free languages, it remains a fundamental building block for many
NLP applications, and its concepts are essential for understanding more advanced NLP techniques

The levels of linguistic analysis for any language are:
1. Phonetics: Study of speech sounds (phones)
- Phone recognition, transcription (e.g., IPA)
2. Phonology: Study of sound patterns and phoneme distribution
- Phoneme identification, syllable structure
3. Morphology: Study of word structure and formation
- Morpheme identification, part-of-speech tagging
4. Syntax: Study of sentence structure and grammar
- Parsing, syntactic tree representation
5. Semantics: Study of meaning and interpretation
- Word sense disambiguation, semantic role labeling
6. Pragmatics: Study of context-dependent meaning and usage
- Discourse analysis, implicature resolution
7. Discourse Analysis: Study of language in context
- Text structure, coherence, and cohesion
8. Sociolinguistics: Study of language in social contexts
- Language variation, dialectology, language policy
These levels are not mutually exclusive, and analysis often involves combining multiple levels to understand the
complexities of language.
By examining language at these various levels, linguists and NLP researchers can gain a deeper understanding of
language structure, meaning, and usage, enabling applications like language teaching, language translation, and
language generation.

THE DIFFERENT ANALYSIS LEVELS USED FOR NLP
In Natural Language Processing (NLP), different levels of analysis are used to process and understand text. Each
level focuses on a specific aspect of language, from basic text processing to complex semantic understanding. Here
are the primary levels of analysis in NLP:
1. Phonological Analysis: This level is applicable only if the text is generated from the speech and deals with
the interpretation of speech sounds within and across different words.
 Focus: Sound patterns in spoken language.

 Tasks:
o Phoneme recognition.
o Speech synthesis and recognition.
2. Morphological Analysis : Morphological analysis is a field of linguistics that studies the structure of words.
It identifies how a word is produced through the use of morphemes. A morpheme is a basic unit of the English
language. The morpheme is the smallest element of a word that has grammatical function and meaning.
 Focus: Structure of words.

 Tasks:
o Identifying and processing root words, prefixes, and suffixes.
o Handling inflections and derivations.
 Tools: Stemming algorithms (e.g., Porter Stemmer), lemmatization (e.g., WordNet Lemmatizer).
3. Lexical Analysis :
 Focus: Words and their meanings.

 Tasks:
o Tokenization: Breaking down text into words or tokens.
o Part-of-Speech Tagging: Assigning parts of speech to each word.
o Named Entity Recognition (NER): Identifying proper nouns and entities.
 Tools: NLTK, spaCy, Stanford NLP.
4. Syntactic Analysis (Parsing): This knowledge relates to how words are put together or structured of
form grammatically correct sentences in the language.
 Focus: Sentence structure and grammar.

 Tasks:
o Syntax Tree Construction: Building parse trees to represent sentence structure.
o Dependency Parsing: Identifying relationships between words in a sentence.
 Tools: Constituency parsers (e.g., Stanford Parser), dependency parsers (e.g., spaCy).
5. Semantic Analysis : This knowledge is concerned with the meanings of words and phrases and how
they combine to form sentence meanings.
 Focus: Meaning of words and sentences.

 Tasks:
o Word Sense Disambiguation: Determining the correct meaning of a word in context.
o Semantic Role Labeling: Assigning roles to words based on their semantic relationships.
o Building ontologies: Representing knowledge as a set of concepts and categories.
 Tools: WordNet, BERT, Word2Vec.

6. Discourse Analysis : By analyzing language in context, discourse analysis in NLP can reveal insights into
how language is used to communicate, persuade, and shape social relationships.
 Focus: Text beyond the sentence level.

 Tasks:
o Coreference Resolution: Identifying when different words refer to the same entity.
o Discourse Structure: Understanding how sentences and paragraphs relate to each other.
o Anaphora Resolution: Resolving references to earlier parts of the text.
 Tools: Stanford CoreNLP, AllenNLP.
7. Pragmatic Analysis: Pragmatic analysis in NLP is the process of analyzing the context and intent behind a
text or conversation to understand the implied meaning, beyond the literal interpretation of the words.
 Focus: Contextual meaning and language use.

 Tasks:
o Understanding implied meanings and intentions.
o Speech Act Recognition: Identifying the purpose of an utterance (e.g., request, command).
o Contextual Interpretation: Considering situational context to derive meaning.
 Tools: Contextual models like BERT, GPT.
Example:
Text: "Can you pass the salt?"

Literal meaning: A request to pass the salt.
Pragmatic analysis:
- Identifies the speaker's intention: The speaker wants to add salt to their food.
- Recognizes the implied request: The speaker is asking someone to perform an action (passing the salt).
- Understands the context: The conversation is likely taking place during a meal.
Pragmatic analysis considers factors like:
- Speaker's intention
- Implicature (implied meaning)
- Context (physical, social, and cultural)
- Inference (drawing conclusions based on the conversation)
Other examples:
- Text: "What's up?" (Pragmatic analysis: a greeting, not a literal question)

- Text: "I'm fine, thanks" (Pragmatic analysis: a polite response, not necessarily indicating actual well-being)
8. Sentiment Analysis
 Focus: Emotional tone and subjective information.

 Tasks:
o Identifying positive, negative, or neutral sentiment in text.
o Fine-grained sentiment analysis to detect specific emotions.
 Tools: VADER, TextBlob, Transformer-based models..
 WordNet: Lexical database for semantic analysis.

 CoreNLP: Comprehensive toolset for syntactic, discourse, and sentiment analysis.

RECURSIVE AND AUGMENTED TRANSITION NETWORKS IN
NLP
Recursive Transition Networks (RTNs) and Augmented Transition Networks (ATNs) are both graphical
representations used in Natural Language Processing (NLP) to model syntax and semantics. They are extensions of
Finite State Automata (FSA) and are used to handle more complex linguistic structures.
Recursive Transition Networks (RTNs):
RTNs are a generalization of FSAs, allowing for recursion and nested structures. They consist of:
- States (nodes)
- Transitions (edges)
- Actions (labels on edges)
RTNs can:
- Recognize context-free languages

- Model recursive structures (e.g., nested clauses)
- Perform parsing and semantic analysis
Augmented Transition Networks (ATNs):

ATNs are an extension of RTNs, adding:
- Registers (memory locations)

- Predicates (conditions on registers)
ATNs can:
- Handle more complex linguistic phenomena (e.g., long-distance dependencies)

- Perform deeper semantic analysis
- Support more efficient parsing algorithms
Both RTNs and ATNs are used in various NLP applications, including:
- Parsing and syntactic analysis

- Semantic role labeling, complier design
- Machine translation
- Text generation
Key benefits of RTNs and ATNs in NLP include:
- Ability to model complex linguistic structures

- Improved parsing accuracy
- Enhanced semantic analysis capabilities
While RTNs and ATNs are powerful tools in NLP, they have limitations, such as:
- Computational complexity
- Difficulty in handling ambiguity and uncertainty
Overall, RTNs and ATNs are important graphical representations in NLP, enabling the analysis and processing of
complex linguistic structures.

APPLICATION OF NLP
Natural Language Processing (NLP) has a wide range of applications across various domains. These applications
leverage the ability of NLP to understand, interpret, and generate human language. Here are some prominent
applications of NLP:
1. Machine Translation
 Description: Translating text or speech from one language to another.

 Examples:
o Google Translate.
o Microsoft Translator.
 Technologies: Neural machine translation models (e.g., Transformer models).
2. Sentiment Analysis
 Description: Identifying and extracting subjective information from text, determining whether the expressed
opinion is positive, negative, or neutral.
 Examples:
o Analyzing customer reviews.
o Social media monitoring.
 Technologies: Lexicon-based approaches, machine learning models, BERT.
3. Chatbots and Conversational Agents
 Description: Automated agents that can interact with users through natural language.
 Examples:
o Customer service bots.
o Virtual assistants like Siri, Alexa, and Google Assistant.
 Technologies: Dialog systems, intent recognition, response generation, Transformer models (e.g., GPT).
4. Text Summarization
 Description: Creating a concise summary of a longer text document.

 Examples:
o Summarizing news articles.
o Generating abstracts for academic papers.
 Technologies: Extractive and abstractive summarization models, BERT, GPT.
5. Information Retrieval
 Description: Retrieving relevant information from large datasets based on user queries.
 Examples:
o Search engines like Google.
o Enterprise document search.
 Technologies: Inverted indexes, ranking algorithms, BERT for query understanding.
6. Named Entity Recognition (NER)
 Description: Identifying and classifying named entities in text into predefined categories such as names of
people, organizations, locations, etc.
 Examples:
o Information extraction from legal documents.
o Annotating medical records.
 Technologies: CRF, BiLSTM-CRF, spaCy's NER.
7. Speech Recognition
 Description: Converting spoken language into written text.

 Examples:
o Voice input for mobile devices.
o Transcription services.
 Technologies: Hidden Markov Models (HMM), Deep Learning models, ASR systems.
8. Text Classification
 Description: Categorizing text into predefined labels or categories.

 Examples:
o Spam detection in emails.
o Topic classification for news articles.
 Technologies: Naive Bayes, SVM, Neural Networks, BERT.
9. Question Answering Systems
 Description: Providing accurate answers to user queries based on a given dataset.

 Examples:
o Customer support systems.
o Academic assistance tools.
 Technologies: BERT, GPT, QA datasets like SQuAD.
10. Optical Character Recognition (OCR)
 Description: Converting different types of documents, such as scanned paper documents, PDFs, or images
captured by a digital camera, into editable and searchable data.
 Examples:
o Digitizing printed texts.
o Extracting information from receipts and invoices.
 Technologies: Tesseract OCR, Deep Learning models.
11. Text-to-Speech (TTS)
 Description: Converting written text into spoken words.

 Examples:
o Audiobook generation.
o Assistive technologies for visually impaired individuals.
 Technologies: WaveNet, Tacotron, Transformer-based TTS models.
12. Document Summarization
 Description: Automatically shortening documents to create summaries.

 Examples:
o Summarizing legal documents.
o News article summarization.
 Technologies: Extractive and abstractive summarization models.

13. Autocorrect and Predictive Text
 Description: Automatically correcting typos and suggesting words or phrases as users type.
 Examples:
o Mobile keyboards.
o Word processors.
 Technologies: N-gram models, Neural Networks, Transformer models.
14. Language Modeling
 Description: Predicting the next word in a sequence of words.

 Examples:
o Text generation.
o Autocomplete features.
 Technologies: RNNs, LSTMs, Transformer models like GPT.
15. Plagiarism Detection
 Description: Identifying copied content in text.

 Examples:
o Academic plagiarism detection.
o Content originality verification.
 Technologies: String matching algorithms, machine learning models.
Summary of Technologies and Tools
 Transformers by Hugging Face: BERT, GPT, T5 for various NLP tasks.

 spaCy: Industrial-strength NLP library with models for NER, parsing, and more.
 NLTK: Toolkit for various NLP tasks, including tokenization and parsing.
 Scikit-learn: Machine learning library for text classification and clustering.
 TensorFlow/PyTorch: Deep learning frameworks for building custom models.
 Stanford NLP: Comprehensive tools for syntactic and semantic analysis.
 Tesseract OCR: Open-source OCR engine.
NLP applications are increasingly embedded in everyday technologies, enhancing user interactions and enabling
smarter, more intuitive interfaces.

WHAT IS MACHINE TRANSLATION (MT)?
Overview Machine Translation

• What is Machine Translation (MT)?
– Automated system Machine Translation (MT), is a sub
– Analyzes text from Source Language (SL) of computational linguistics that investi
– Produces “equivalent” text in Target Language (TL) the use of software to translate text or sp
– Ideally without human intervention from one language to another.
Source Target
Language Language Good Morning ात
• Automatic translation of all kinds of

documents at a quality equaling that of the Word Order
best human translators.
• Hindi is sometimes called an “SOV” language.
<subject> <object> <verb>

माया को म पस
• But typical word order of English sentences, is

• In any translation, meaning of the statement is SVO.
to be preserved. <subject> <verb> <object>
• Right words and order. Maya likes mangoes

• Pronoun Resolution: Their - उनका / उनकी /उनके
Word Sense
Idoms: नौ दो ग्यारह होना
A word may have more than one sense. Nine two becomes eleven
Choosing the appropriate target word according
To escape
to context is very important.
John went to office: जॉन चऱा गया के लऱए
कायााऱय

Indian Institutes with Major work in MT
• IIIT Hyderabad -Anusaaraka- Prof. Rajeev Sangal
• Centre for Development of Advanced Computing (CDAC), Pune- Mantra machine
translation system:
• IIT, Bombay- Prof. Pushpak Bhattacharyya working on machine translation system from
English to Marathi and Bengali using the UNL (universal networking languages-
interlingua) formalism
• Government of India, through its Technology Development in Indian Languages (TDIL)
Project
• IIT Kanpur – AnglaBharti (English to Indian Languages)
Machine Translation: India

Problem #1
Too many language pairs!
 Implication: Language Barrier will continue to be a problem.
Problem #2
Fragmentation of efforts
 No consolidated effort at solving MT problems
Problem #3
Lack of NLP tools
Lack of Corpora
Lack of standardized methods of evaluation, encoding, etc.
Highly Specialized
 Poor quality systems, No reusable components, No real
learning from each other’s work
Solutions!
Problem #1: Statistical Machine Translation
Problem #2: Collaborative work (2-3 teams)
Problem #3: Common Tools Framework plus Standards

SET OF QUESTIONS FOR UNIT 1
i. Describe the key features of the English language in the context of NLP.
ii. What is the primary goal of Natural Language Processing (NLP)?
iii. Describe a computational framework for processing natural language.
iv. How does NLP differ from human language processing?
v. How would you represent an Indian language like Hindi or Tamil in the NLP framework?
vi. What are the challenges in describing languages with complex grammatical structures?
vii. What is a lexicon in NLP, and what are its components?
viii. Describe an algorithm for tokenization and its implementation.
ix. What data structures are commonly used in NLP, and why?
x. Define Finite State Automata (FSA) and its components.
xi. How are FSAs used in NLP, and what are their limitations?
xii. Describe a simple FSA for a language recognition task.
xiii. What are the different levels of analysis in NLP (morphological, syntactic, semantic, pragmatic)?
xiv. Describe the tasks and techniques used at each level.
xv. How do these levels interact in the NLP framework?
xvi. Define Recursive Transition Networks (RTNs) and Augmented Transition Networks (ATNs).
xvii. How do RTNs and ATNs extend Finite State Automata?
xviii. Describe an application of RTNs or ATNs in NLP.
xix. What is machine translation, and how does it relate to NLP?
xx. Describe the challenges in machine translation and approaches to address them.
xxi. What are some other applications of NLP beyond machine translation?

unit1

Uploaded by

Copyright:

Available Formats

unit1

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

unit1

Uploaded by

Copyright:

Available Formats

Bhilai Institute of Technology, Durg

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Natural Language Processing

Prepared By- Dr Shikha Pandey

Dr. Shikha Pandey, BIT, Durg Page 1

 Natural Language Understanding: James Allan

Dr. Shikha Pandey, BIT, Durg Page 2

The primary objective of Natural Language Processing (NLP) in AI is to enable

Dr. Shikha Pandey, BIT, Durg Page 3

Dr. Shikha Pandey, BIT, Durg Page 4

Key Aspects of a Lexicon

Uses in Computational Linguistics

Dr. Shikha Pandey, BIT, Durg Page 5

TDIL: Govt. of India

 Technology Development for Indian Languages (TDIL) Programme

 creating and accessing multilingual knowledge resources; and

 integrating them to develop innovative user products and services

TDIL: Some major initiatives

 Development of English to Indian Language Machine Translation

Hindi to Bengali, Bengali to Hindi, Marathi to Hindi, Hindi to Marathi, Hindi to

Tamil, Tamil to Malaylam Tamil to Telugu, Telugu to Tamil

Dr. Shikha Pandey, BIT, Durg Page 6

Was launched on July 26 2014 by the Hon’ble PM

DESCRIPTION OF ENGLISH OR AN INDIAN LANGUAGE IN THE

 Tokenization: Straightforward, based on spaces and punctuation.

 Part-of-Speech Tagging: Generally straightforward with well-defined rules.

Dr. Shikha Pandey, BIT, Durg Page 7

 Coreference Resolution: Well-studied, with many datasets and models available.

Tools and Libraries

Dr. Shikha Pandey, BIT, Durg Page 8

Dr. Shikha Pandey, BIT, Durg Page 9

1. Named Entity Recognition (NER)

Dr. Shikha Pandey, BIT, Durg Page 10

Machine Learning Models

Dr. Shikha Pandey, BIT, Durg Page 11

FSA's strengths in NLP include:

 Efficient processing of large datasets

Dr. Shikha Pandey, BIT, Durg Page 12

Dr. Shikha Pandey, BIT, Durg Page 13

 Focus: Sound patterns in spoken language.

 Focus: Structure of words.

 Focus: Words and their meanings.

 Focus: Sentence structure and grammar.

 Focus: Meaning of words and sentences.

Dr. Shikha Pandey, BIT, Durg Page 14

 Focus: Text beyond the sentence level.

 Focus: Contextual meaning and language use.

Text: "Can you pass the salt?"

Pragmatic analysis considers factors like:

- Text: "What's up?" (Pragmatic analysis: a greeting, not a literal question)

 Focus: Emotional tone and subjective information.

 WordNet: Lexical database for semantic analysis.

Dr. Shikha Pandey, BIT, Durg Page 15

Recursive Transition Networks (RTNs):

- Recognize context-free languages

Augmented Transition Networks (ATNs):

- Registers (memory locations)

- Handle more complex linguistic phenomena (e.g., long-distance dependencies)

- Parsing and syntactic analysis

Key benefits of RTNs and ATNs in NLP include: