Unit 1
NLP is the branch of computer science focused on developing systems that allow
computers to communicate with people using everyday language.
NLP has two goals:
Science Goal: Understand the way language operates.
Engineering Goal: Build systems that analyse and generate language, reducing the man-machine gap.
Components
1. Text Preprocessing
o Tokenization: Splitting text into words, phrases, symbols, or other meaningful elements.
o Normalization: Converting text to a standard form, e.g., lowercasing, stemming, and lemmatization.
o Stop-word Removal: Filtering out common words that may not be useful in analysis, like "and",
"the", etc.
2. Syntax Analysis
o Part-of-Speech Tagging: Identifying the grammatical parts of speech in the text (nouns, verbs,
adjectives, etc.).
o Parsing: Analyzing the syntactic structure of sentences to understand the relationships between
words.
3. Semantics
o Named Entity Recognition (NER): Identifying and classifying entities such as names, dates, and
places.
o Semantic Role Labeling: Determining the roles that words play in a sentence.
o Word Sense Disambiguation: Determining which meaning of a word is being used in context.
4. Pragmatics
o Coreference Resolution: Identifying when different words refer to the same entity.
o Discourse Analysis: Understanding the flow of text and the relationships between sentences and
paragraphs.
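As an illustration, the components above can be exercised end to end with spaCy; the following is a minimal sketch, assuming spaCy and its small English model (en_core_web_sm) are installed:

import spacy

nlp = spacy.load("en_core_web_sm")  # pre-trained English pipeline
doc = nlp("Apple was founded by Steve Jobs in California.")

# Tokenization, lemmatization, and part-of-speech tagging
for token in doc:
    print(token.text, token.lemma_, token.pos_)

# Named entity recognition
for ent in doc.ents:
    print(ent.text, ent.label_)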
A lexicon is a comprehensive collection of words and their meanings within a particular language or a subject area.
It can be thought of as a dictionary, but it may also include additional information beyond simple definitions, such
as grammatical properties, usage examples, and relationships between words. Here are some key aspects of a
lexicon:
1. Vocabulary List:
o A comprehensive list of words or lexical items in a language.
2. Definitions:
o Each word in a lexicon is typically accompanied by its meanings or definitions.
3. Grammatical Information:
o Information such as parts of speech (noun, verb, adjective, etc.), inflections (plural forms, tenses),
and syntactic roles.
4. Pronunciation:
o Many lexicons include phonetic transcriptions to indicate how words are pronounced.
5. Etymology:
o Historical origins and evolution of words.
6. Usage Examples:
o Sentences or phrases illustrating how words are used in context.
Types of Lexicons
1. General Lexicons:
o Comprehensive collections like dictionaries that cover the entire vocabulary of a language.
o Example: Oxford English Dictionary.
2. Specialized Lexicons:
o Focused on specific fields or domains, containing terms and jargon relevant to that area.
o Example: Medical lexicon, legal lexicon.
3. Computational Lexicons:
o Used in natural language processing (NLP) and include additional information to assist in
computational tasks.
o Example: WordNet, which provides synonyms, antonyms, and hierarchical relationships between
words.
Uses of Lexicons
Text Analysis: Lexicons help in understanding and analyzing text by providing meanings and context.
Machine Translation: Essential for translating words and phrases between languages.
Speech Recognition: Pronunciation data from lexicons aids in recognizing spoken words.
Information Retrieval: Improve search algorithms by understanding synonyms and related terms.
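For example, WordNet can be queried through NLTK; a minimal sketch, assuming nltk is installed and the WordNet corpus has been downloaded (nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

# Each synset groups words that share one sense of "bank"
for syn in wn.synsets("bank")[:3]:
    print(syn.name(), "-", syn.definition())

# Synonyms (lemmas) and hypernyms of the first sense
first = wn.synsets("bank")[0]
print([lemma.name() for lemma in first.lemmas()])
print(first.hypernyms())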
Objective:
Developing information processing tools and techniques to facilitate human-machine interaction without a
language barrier.
Development of English to Indian Language Machine Translation Systems using AnglaBharti technology:
English to Bangla/Punjabi/Malayalam/Urdu/Hindi/Telugu.
Development of Indian Language to Indian Language Machine Translation Systems (Sampark) for 18 pairs of
languages: Hindi to Punjabi, Punjabi to Hindi, Hindi to Tamil, Tamil to Hindi, Hindi to Kannada, Kannada
to Hindi, Hindi to Telugu, Telugu to Hindi, Hindi to Urdu, Urdu to Hindi, Malayalam to Hindi, and other pairs.
Text Preprocessing
English:
Tokenization: Relatively straightforward, since words are usually separated by spaces and punctuation.
Normalization: Lowercasing, stemming, and lemmatization are well supported by mature tools.
Stop-word Removal: Standard, well-tested stop-word lists are widely available.
Hindi:
Tokenization: More complex due to the lack of spaces between words in certain contexts and the use of
compound words.
Normalization: Handling variations in spellings, removing diacritics, and converting script variations.
Stop-word Removal: Using lists specific to Hindi, which differ significantly from English.
Syntax Analysis
English:
Part-of-Speech Tagging: Mature taggers achieve high accuracy, helped by English's relatively simple morphology.
Parsing: The largely fixed Subject-Verb-Object word order simplifies parsing.
Hindi:
Part-of-Speech Tagging: More complex due to rich morphology and inflectional variations.
Parsing: Flexible word order (Subject-Object-Verb), requiring different parsing strategies and algorithms.
Semantics
English:
Named Entity Recognition (NER): Large annotated datasets and pre-trained models are readily available.
Word Sense Disambiguation: Well supported by resources such as WordNet.
Hindi:
Named Entity Recognition (NER): Requires models trained specifically on Hindi datasets.
Word Sense Disambiguation: More challenging due to the richness of homonyms and polysemous words.
Pragmatics
English:
Coreference Resolution: Well studied, with abundant datasets and off-the-shelf tools.
Discourse Analysis: Established corpora and tools exist.
Hindi:
Coreference Resolution: Limited resources and datasets, requiring more specialized approaches.
Discourse Analysis: Still an emerging field with fewer tools and resources compared to English.
For English:
o NLTK: Comprehensive support for English.
o spaCy: Strong English support with pre-trained models.
o Transformers by Hugging Face: Pre-trained models like BERT, GPT,
etc.
For Hindi:
o Indic NLP Library: Tools for text normalization, tokenization, and
more.
o iNLTK: Specifically designed for Indian languages, including Hindi.
o Transformers by Hugging Face: Multilingual models like mBERT,
XLM-R, and IndicBERT.
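As a small illustration of Hindi-specific tooling, the Indic NLP Library provides a basic tokenizer; a minimal sketch, assuming the indic-nlp-library package is installed and exposes trivial_tokenize as shown:

from indicnlp.tokenize import indic_tokenize

text = "मैं स्कूल जा रहा हूँ।"
# Splits on whitespace and Indic punctuation such as the danda (।)
tokens = indic_tokenize.trivial_tokenize(text, lang="hi")
print(tokens)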
Hindi Grammar
1. Word Order:
- Hindi typically follows Subject-Object-Verb (SOV) order, with some flexibility for emphasis
2. Nouns:
- Gender: Masculine, Feminine
- Number: Singular, Plural
- Case: Nominative, Accusative, Genitive, Dative, Ablative
3. Verbs:
- Tense: Present, Past, Future
- Aspect: Simple, Perfect, Continuous
- Mood: Indicative, Imperative, Subjunctive
- Voice: Active, Passive
4. Adjectives:
- Agree with the noun they modify in gender, number, and case
5. Pronouns:
- Personal: First person (I, we), Second person (you), Third person (he, she, it, they)
- Demonstrative: This, that, these, those
- Interrogative: Who, what, when, where, why
6. Postpositions:
- Used to indicate location, direction, time, etc.
- Examples: में (in), पर (on), से (from), तक (until)
7. Conjunctions:
- Used to connect words, phrases, or clauses
- Examples: और (and), लेकिन (but), या (or)
8. Particles:
- Used to indicate various grammatical relationships
- Examples: ने (indicating the doer of the action), को (indicating the indirect object)
Implementing a natural language processing (NLP) framework requires various algorithms and data structures to
handle text preprocessing, syntactic analysis, semantic analysis, and pragmatic analysis. Below are key algorithms
and data structures commonly used in NLP:
Text Preprocessing
1. Tokenization
o Algorithms: Regular expressions, NLTK's word_tokenize, spaCy's tokenizer.
o Data Structures: Lists (to store tokens), Trie (for efficient dictionary-based tokenization).
2. Normalization
o Algorithms: Lowercasing, stemming (Porter Stemmer, Snowball Stemmer), lemmatization
(WordNet Lemmatizer).
o Data Structures: Sets (for stop words), HashMap/Dictionary (for mapping original to normalized
forms).
3. Stop-word Removal
o Algorithms: Simple filtering using predefined lists.
o Data Structures: HashSet (to store stop words for O(1) lookups).
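The three preprocessing steps above can be chained together; a minimal NLTK sketch, assuming the punkt and stopwords data have been downloaded:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "The children were running quickly through the park."

tokens = word_tokenize(text.lower())          # tokenization + lowercasing
stop_words = set(stopwords.words("english"))  # HashSet gives O(1) lookups
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]

stemmer = PorterStemmer()
print([stemmer.stem(t) for t in filtered])    # e.g., "running" -> "run"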
Syntax Analysis
1. Part-of-Speech Tagging
o Algorithms: Hidden Markov Model (HMM), Conditional Random Fields (CRF), Neural Networks
(LSTM, BiLSTM-CRF).
o Data Structures: Vectors (for feature representation), Transition Matrices (for HMM), Neural
Network layers.
2. Parsing
o Algorithms: Dependency Parsing (Transition-based parsers like MaltParser, Graph-based parsers),
Constituency Parsing (CKY algorithm, Earley parser).
o Data Structures: Parse Trees, Dependency Graphs, Stack (for transition-based parsing).
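Dependency structures can be inspected directly with spaCy, whose pre-trained models include both a tagger and a transition-based parser; a minimal sketch:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat chased the mouse.")

# Each token stores its dependency relation and its head word;
# together these form the dependency graph of the sentence
for token in doc:
    print(token.text, token.dep_, token.head.text)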
Semantics
1. Named Entity Recognition
o Algorithms: Rule-based gazetteers, Conditional Random Fields (CRF), BiLSTM-CRF, Transformer-based models.
o Data Structures: Tries (for gazetteer lookup), tag sequences, dense vectors.
2. Word Sense Disambiguation
o Algorithms: Lesk algorithm, supervised classifiers, embedding-based similarity.
o Data Structures: Sense inventories (e.g., WordNet), embedding vectors.
Pragmatics
1. Coreference Resolution
o Algorithms: Rule-based approaches, Machine Learning models (Neural Networks, mention-pair
models), End-to-end neural models (e.g., SpanBERT).
o Data Structures: Clusters (for coreference chains), Graphs (to represent entity relations).
2. Discourse Analysis
o Algorithms: Rhetorical Structure Theory (RST), Discourse Parsing, Neural models (BERT-based).
Feature Extraction
1. Algorithms: TF-IDF, Word Embeddings (Word2Vec, GloVe), Contextual Embeddings (BERT, GPT).
o Data Structures: Sparse Matrices (for TF-IDF), Dense Vectors (for embeddings), HashMaps (for
vocabulary mapping).
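For instance, TF-IDF features can be computed with scikit-learn, which stores the result in a sparse matrix; a minimal sketch:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term matrix
print(vectorizer.get_feature_names_out())   # vocabulary mapping
print(X.toarray())                          # dense view, for inspection only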
Classification
1. Algorithms: Naive Bayes, Support Vector Machines (SVM), Decision Trees, Neural Networks (RNN,
LSTM, Transformer models).
o Data Structures: Matrices (for weight parameters), Tensors (for deep learning models), Graphs (for
neural network architectures).
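Feature extraction and classification are typically combined in a single pipeline; a minimal scikit-learn sketch using Naive Bayes, with toy training data invented for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set
texts = ["great product, loved it", "terrible service, very bad",
         "excellent quality", "awful experience"]
labels = ["pos", "neg", "pos", "neg"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["really bad quality"]))  # expected: ['neg']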
Applications of Finite State Automata (FSA) in NLP
1. Tokenization: FSA is used to split text into individual words or tokens, based on patterns like whitespace,
punctuation, or special characters.
2. Lexical analysis: FSA helps analyze words into their constituent parts, such as roots, prefixes, and suffixes (see the sketch after this list).
3. Pattern matching: FSA is used to match patterns in text, like identifying keywords, phrases, or sentences.
4. Text normalization: FSA helps normalize text data, converting all words to their base form (e.g., "running"
becomes "run").
5. Language modeling: FSA is used in language models to predict the next word in a sequence, based on the context.
6. Regular expression matching: FSA is the basis for regular expressions, which are used to match complex patterns
in text.
7. Part-of-speech tagging: FSA helps identify the grammatical category of words (e.g., noun, verb, adjective).
8. Named entity recognition: FSA is used to identify and extract specific entities like names, locations, and
organizations.
9. Sentiment analysis: FSA can be used to analyze text for sentiment, by matching patterns that indicate positive
or negative sentiment.
While FSA has limitations in handling context-free languages, it remains a fundamental building block for many
NLP applications, and its concepts are essential for understanding more advanced NLP techniques.
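To make the idea concrete, here is a minimal deterministic FSA in plain Python that accepts words ending in "ing", a toy stand-in for the suffix analysis in point 2; the state encoding is hypothetical:

def accepts_ing(word):
    # States: 0 = start, 1 = matched "i", 2 = matched "in", 3 = matched "ing" (accepting)
    state = 0
    for ch in word:
        if ch == "i":
            state = 1
        elif ch == "n" and state == 1:
            state = 2
        elif ch == "g" and state == 2:
            state = 3
        else:
            state = 0
    return state == 3

print(accepts_ing("running"))  # True
print(accepts_ing("ring"))     # True: ends in "ing"
print(accepts_ing("run"))      # False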
Levels of Analysis in NLP
In Natural Language Processing (NLP), different levels of analysis are used to process and understand text. Each
level focuses on a specific aspect of language, from basic text processing to complex semantic understanding.
These levels are not mutually exclusive, and analysis often combines several of them to capture the complexities
of language. By examining language at these levels, linguists and NLP researchers gain a deeper understanding of
language structure, meaning, and usage, enabling applications such as language teaching, machine translation, and
language generation. The primary levels of analysis in NLP are:
1. Phonological Analysis: This level applies only when the text is generated from speech, and deals with the
interpretation of speech sounds within and across different words.
2. Morphological Analysis: Morphological analysis is a field of linguistics that studies the structure of words.
It identifies how a word is produced through the use of morphemes; a morpheme is the smallest element of a
word that has grammatical function and meaning.
3. Lexical Analysis: This level divides the text into words and analyses the structure and meaning of each
individual word, typically with the help of a lexicon.
4. Syntactic Analysis (Parsing): This knowledge relates to how words are put together, or structured, to
form grammatically correct sentences in the language.
5. Semantic Analysis: This knowledge is concerned with the meanings of words and phrases and how
they combine to form sentence meanings.
6. Discourse Analysis: This level deals with units of text larger than a sentence, analyzing how the
surrounding sentences affect the interpretation of a given sentence.
7. Pragmatic Analysis: Pragmatic analysis in NLP is the process of analyzing the context and intent behind a
text or conversation to understand the implied meaning, beyond the literal interpretation of the words.
Pragmatic analysis considers factors such as:
- Speaker's intention
- Implicature (implied meaning)
- Context (physical, social, and cultural)
- Inference (drawing conclusions based on the conversation)
Recursive Transition Networks (RTNs) and Augmented Transition Networks (ATNs)
RTNs are a generalization of FSAs, allowing for recursion and nested structures. They consist of:
- States (nodes)
- Transitions (edges)
- Actions (labels on edges)
RTNs can:
- Represent recursive, nested structures (e.g., embedded clauses) that plain FSAs cannot
- Call sub-networks from within a network, mirroring the way phrases contain other phrases
ATNs can:
- Extend RTNs with registers, plus conditions and actions on arcs
- Record features such as number and gender to enforce agreement and build structural representations during parsing
Both RTNs and ATNs are used in various NLP applications, including syntactic parsing, grammar modeling, and
natural language understanding systems.
While RTNs and ATNs are powerful tools in NLP, they have limitations, such as:
- Computational complexity
- Difficulty in handling ambiguity and uncertainty
Overall, RTNs and ATNs are important graphical representations in NLP, enabling the analysis and processing of
complex linguistic structures.
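A toy illustration of the recursive idea, with two hypothetical networks (S and NP) encoded as Python dictionaries; a real RTN implementation would carry richer labels and, for an ATN, registers and tests:

# Each network maps a state to (label, next_state) arcs; "end" accepts.
# Uppercase labels call sub-networks; lowercase labels match word categories.
networks = {
    "S":  {0: [("NP", 1)], 1: [("verb", "end")]},
    "NP": {0: [("det", 1)], 1: [("noun", "end")]},
}
lexicon = {"the": "det", "dog": "noun", "barks": "verb"}

def traverse(net, state, words):
    """Return the words left over after `net` accepts a prefix, or None."""
    if state == "end":
        return words
    for label, nxt in networks[net].get(state, []):
        if label in networks:  # recursive call into a sub-network
            rest = traverse(label, 0, words)
            if rest is not None:
                out = traverse(net, nxt, rest)
                if out is not None:
                    return out
        elif words and lexicon.get(words[0]) == label:  # match a terminal
            out = traverse(net, nxt, words[1:])
            if out is not None:
                return out
    return None

print(traverse("S", 0, ["the", "dog", "barks"]) == [])  # True: accepted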
1. Machine Translation
Description: Automatically translating text from one language to another.
Examples:
o Google Translate.
o English to Indian language systems such as AnglaBharti and Sampark (discussed above).
Technologies: Rule-based and statistical MT, neural machine translation (seq2seq, Transformer models).
2. Sentiment Analysis
Description: Identifying and extracting subjective information from text, determining whether the expressed
opinion is positive, negative, or neutral.
Examples:
o Analyzing customer reviews.
o Social media monitoring.
Technologies: Lexicon-based approaches, machine learning models, BERT.
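A lexicon-based approach can be sketched in a few lines; the word scores below are hypothetical:

# Hypothetical sentiment lexicon mapping words to polarity scores
scores = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "hate": -2}

def sentiment(text):
    total = sum(scores.get(word, 0) for word in text.lower().split())
    return "positive" if total > 0 else "negative" if total < 0 else "neutral"

print(sentiment("The product is great but the packaging is bad"))  # positive (2 - 1)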
3. Chatbots and Virtual Assistants
Description: Automated agents that can interact with users through natural language.
Examples:
o Customer service bots.
o Virtual assistants like Siri, Alexa, and Google Assistant.
Technologies: Dialog systems, intent recognition, response generation, Transformer models (e.g., GPT).
4. Text Summarization
Description: Producing a shorter version of a text while preserving its key information.
Examples:
o Summarizing news articles.
o Condensing long reports.
Technologies: Extractive methods (e.g., TextRank), abstractive neural models (Transformer-based).
5. Information Retrieval
Description: Retrieving relevant information from large datasets based on user queries.
Examples:
o Search engines like Google.
o Enterprise document search.
Technologies: Inverted indexes, ranking algorithms, BERT for query understanding.
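The core data structure here, the inverted index, is straightforward to sketch in plain Python:

from collections import defaultdict

docs = {1: "the cat sat", 2: "the dog barked", 3: "cat and dog"}

# Inverted index: term -> set of ids of documents containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Boolean AND query: documents containing both "cat" and "dog"
print(index["cat"] & index["dog"])  # {3}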
6. Named Entity Recognition (NER)
Description: Identifying and classifying named entities in text into predefined categories such as names of
people, organizations, locations, etc.
Examples:
o Information extraction from legal documents.
o Annotating medical records.
Technologies: CRF, BiLSTM-CRF, spaCy's NER.
7. Speech Recognition
Description: Converting spoken language into written text.
Examples:
o Voice assistants and dictation software.
o Automated transcription of meetings and lectures.
Technologies: Acoustic models, Hidden Markov Models, end-to-end deep learning models.
8. Text Classification
Description: Assigning predefined categories to documents or passages of text.
Examples:
o Spam detection.
o Topic labeling of news articles.
Technologies: Naive Bayes, SVM, Neural Networks, Transformer models.
9. Optical Character Recognition (OCR)
Description: Converting different types of documents, such as scanned paper documents, PDFs, or images
captured by a digital camera, into editable and searchable data.
Examples:
o Digitizing printed texts.
o Extracting information from receipts and invoices.
Technologies: Tesseract OCR, Deep Learning models.
10. Autocorrect and Autocomplete
Description: Automatically correcting typos and suggesting words or phrases as users type.
Examples:
o Mobile keyboards.
o Word processors.
Technologies: N-gram models, Neural Networks, Transformer models.
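An n-gram suggestion model can be sketched with bigram counts over a toy corpus (the corpus below is invented for illustration):

from collections import Counter, defaultdict

corpus = "i am going home i am going out i am here".split()

# Bigram counts: previous word -> Counter of words that follow it
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def suggest(word):
    """Suggest the most frequent word following `word` in the corpus."""
    following = bigrams.get(word)
    return following.most_common(1)[0][0] if following else None

print(suggest("am"))  # "going" (follows "am" twice, "here" only once)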
NLP applications are increasingly embedded in everyday technologies, enhancing user interactions and enabling
smarter, more intuitive interfaces.
Example of a translation system: mapping text from a source language to a target language, e.g., English "Good
Morning" rendered in Hindi.
• IIT Bombay: Prof. Pushpak Bhattacharyya has worked on a machine translation system from
English to Marathi and Bengali using the UNL (Universal Networking Language, an
interlingua) formalism.