5.2 Natural Language Processing
INT404
Natural Language Processing
Language is meant for communicating about the world.
The problem: there are many ways to say the same thing:
Mary was born on October 11.
Mary’s birthday is October 11.
The good side: when you know a lot, facts imply each other. Language
is intended to be used by agents who know a lot.
Components of NLP
NLP has the following two components:
1. Natural Language Understanding (NLU)
• Natural Language Understanding (NLU) helps the
machine to understand and analyze human language by
extracting the metadata from content such as concepts,
entities, keywords, emotion, relations, and semantic roles.
• NLU is mainly used in business applications to understand
the customer's problem in both spoken and written
language.
NLU involves the following tasks:
• Mapping the given input into a useful
representation.
• Analyzing different aspects of the
language.
2. Natural Language Generation (NLG)
• Natural Language Generation (NLG) acts as a
translator that converts computerized data
into a natural language representation.
• It mainly involves text planning, sentence
planning, and text realization.
Difference between NLU and NLG
Phases of NLP
NLP has the following five phases:
1. Lexical and Morphological Analysis
• The first phase of NLP is lexical analysis. This
phase scans the input text as a stream of characters and
converts it into meaningful lexemes.
• It divides the whole text into paragraphs, sentences, and
words.
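The splitting step described above can be sketched as follows. This is a minimal illustration, not a full lexical analyzer: the helper names `split_sentences` and `split_words` are hypothetical, and the rule that a sentence ends at ".", "!" or "?" followed by whitespace is a simplifying assumption that fails on abbreviations such as "Dr.".

```python
import re

def split_sentences(text):
    # Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    # Keep runs of letters, digits and apostrophes; punctuation is dropped.
    return re.findall(r"[A-Za-z0-9']+", sentence)

text = "Mary was born on October 11. Mary's birthday is October 11."
sentences = split_sentences(text)
words = [split_words(s) for s in sentences]

for sentence, sentence_words in zip(sentences, words):
    print(sentence, "->", sentence_words)
```

Practical systems usually rely on a trained sentence splitter rather than a single regular expression, but the division into sentences and then words is the same idea.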
2. Syntactic Analysis (Parsing)
• Syntactic analysis checks grammar and word
arrangement, and shows the relationships among the
words.
Example: Agra goes to the Poonam
• In the real world, the sentence "Agra goes to the Poonam"
does not make any sense, so it is rejected by the
syntactic analyzer.
3. Semantic Analysis
• Semantic analysis is concerned with the meaning
representation. It mainly focuses on the literal meaning of
words, phrases, and sentences.
4. Discourse Integration
• In discourse integration, the meaning of a sentence depends
on the sentences that precede it and also influences the
meaning of the sentences that follow it.
5. Pragmatic Analysis
• Pragmatic analysis is the fifth and last phase of NLP. It helps you
to discover the intended effect by applying a set of rules
that characterize cooperative dialogues.
For example: "Open the door" is interpreted as a request
instead of an order.
Why is NLP difficult?
NLP is difficult because ambiguity and uncertainty exist
in the language.
Ambiguity
There are three types of ambiguity:
Lexical Ambiguity
• Lexical ambiguity exists when a single word has two or
more possible meanings.
Example:
• Manya is looking for a match.
• In the above example, the word "match" could mean that
Manya is looking for a partner or that Manya is looking for a
game (cricket or another match).
Syntactic Ambiguity
• Syntactic ambiguity exists when the structure of a
sentence allows two or more possible meanings.
Example:
• I saw the girl with the binoculars.
• In the above example, did I have the binoculars? Or did
the girl have the binoculars?
Referential Ambiguity
• Referential ambiguity exists when a pronoun could refer
to more than one thing.
Example: Kiran went to Sunita. She said, "I am hungry."
• In the above example, it is not clear who is
hungry: Kiran or Sunita.
Parse Tree
Tokenization
Tokenization is essentially splitting a phrase, sentence,
paragraph, or an entire text document into smaller units, such as
individual words or terms. Each of these smaller units is called a
token.
Why is Tokenization required in NLP?
• Before processing a natural language, we need to identify
the words that constitute a string of characters. That’s why
tokenization is the most basic step to proceed with NLP (text
data). This is important because the meaning of the text
could easily be interpreted by analyzing the words present
in the text.
• Let’s take an example. Consider the below string:
“This is a cat.”
What do you think will happen after we perform tokenization on
this string?
[‘This’, ‘is’, ‘a’, ‘cat’]
• This tokenized form has many uses. We can use
it to:
• Count the number of words in the text
• Count the frequency of the word, that is, the number of times a
particular word is present
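The two uses listed above can be sketched in a few lines. This is a minimal illustration that assumes whitespace tokenization with surrounding punctuation stripped; the `tokenize` helper is hypothetical and is not the behaviour of any particular NLP library.

```python
from collections import Counter

def tokenize(text):
    # Split on whitespace, then strip punctuation from the ends of each piece.
    pieces = [piece.strip(".,!?;:\"'") for piece in text.split()]
    return [piece for piece in pieces if piece]

tokens = tokenize("This is a cat.")
print(tokens)  # ['This', 'is', 'a', 'cat']

# Use 1: count the number of words in the text.
word_count = len(tokens)

# Use 2: count how many times each word occurs (case-insensitive).
frequencies = Counter(token.lower() for token in tokens)
print(word_count, frequencies["cat"])
```

Lowercasing before counting is a design choice: it treats "This" and "this" as the same word, which is usually what a frequency count wants.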
Bag of Words Model
“it”
“was”
“the”
“best”
“of”
“times”
“worst”
“age”
“wisdom”
“foolishness”
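One way to read the ten words above is as the vocabulary of a bag-of-words model: each document becomes a vector that counts how often each vocabulary word occurs, ignoring word order. The sketch below assumes the four short documents shown in the code, which are not given in the slide but produce exactly this vocabulary; the `bag_of_words` helper is hypothetical.

```python
# Vocabulary taken from the word list above, in the same order.
vocabulary = ["it", "was", "the", "best", "of", "times",
              "worst", "age", "wisdom", "foolishness"]

def bag_of_words(document, vocab):
    # Lowercase, drop commas, split on whitespace, then count each vocab word.
    words = document.lower().replace(",", "").split()
    return [words.count(term) for term in vocab]

# Assumed example documents (not part of the original slide).
documents = [
    "It was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

vectors = [bag_of_words(doc, vocabulary) for doc in documents]
for doc, vector in zip(documents, vectors):
    print(vector, doc)
```

Every vector has the same length as the vocabulary, so documents of different lengths can be compared position by position.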
• used in
– Word processing
– Character or text recognition
– Speech recognition and generation.
• Most available spell checkers focus on
processing isolated words and do not take into
account the context.
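An isolated-word checker of the kind described above can be sketched with Python's standard `difflib` module. This is only an illustration: the small `dictionary` list and the `suggest` helper are hypothetical, and similarity matching is one of several possible techniques, not the method of any specific spell checker.

```python
import difflib

# Hypothetical word list standing in for a real dictionary.
dictionary = ["best", "times", "worst", "age", "wisdom", "foolishness"]

def suggest(word, vocab, n=3):
    # A word found in the dictionary is accepted without looking at context.
    if word in vocab:
        return [word]
    # Otherwise return the closest dictionary words by string similarity.
    return difflib.get_close_matches(word, vocab, n=n)

print(suggest("wisdon", dictionary))  # closest match to the misspelling
print(suggest("times", dictionary))   # already a valid word
```

Because each word is checked on its own, a correctly spelled word used in the wrong place passes unnoticed, which is the limitation the text points out.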
G630