
NLP MID-1

Refer at your own discretion….


~ Dokja

SAQs
1. Define NLP.
NLP stands for Natural Language Processing, a field spanning Computer
Science, human language, and Artificial Intelligence.
It is a branch of Artificial Intelligence that helps computers understand,
interpret, and manipulate human language.

2. Define Parsing.
Parsing in Natural Language Processing (NLP) refers to the process of
analyzing the grammatical structure of a sentence to understand its
components and how they relate to each other syntactically.
This process involves breaking down a sentence into its constituent parts,
such as nouns, verbs, adjectives, and phrases, and determining the
relationships between them, such as subject-verb-object relationships.
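To make this concrete, here is a minimal parsing sketch using NLTK's chart
parser; the toy grammar and sentence are illustrative, not from the notes.

```python
# Minimal constituency-parsing sketch with NLTK (toy grammar, assumed example).
import nltk

grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the'
    N -> 'cat' | 'mouse'
    V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat chased the mouse".split()):
    tree.pretty_print()  # shows the subject-verb-object constituent structure
```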

3. What is lexeme?
Lexeme refers to the basic unit of vocabulary, typically corresponding to a
single word as it appears in a dictionary or lexicon.
However, it can also represent a base or root form of a word, from which
various inflected forms (such as different tenses, plurals, etc.) can be derived.
E.g., the word "run" can be a lexeme representing both the base form of the
verb ("run") and its inflected forms ("running," "ran"). Similarly, "cat" is a
lexeme representing both the singular noun form ("cat") and its plural form
("cats").

4. What is Morphology?

Morphology refers to the study of the internal structure of words and the
rules governing their formation.

It deals with how words are constructed from morphemes, which are the
smallest units of meaning in a language.

Morphemes can be roots, prefixes, suffixes, or infixes, and they combine
to form words.

5. What is Treebank?

A treebank is a corpus (a collection of written or spoken texts) of parsed
sentences where each sentence is annotated with a syntactic structure
represented as a tree.

These trees, often called parse trees or syntactic trees, depict the
grammatical structure of the sentences according to a predefined
formalism, such as constituency parsing or dependency parsing.

6. What are the rules of CNF?

Rules of Chomsky Normal Form (CNF):

1. All production rules must be of the forms:

A → BC (or) A → a

where A, B, and C are non-terminal symbols and ‘a’ is a terminal symbol.

2. There should be no ε-productions: an ε-production is a production rule
that generates the empty string ε.

3. There should be no unit productions: unit productions are rules where a
non-terminal directly produces another non-terminal without any
intervening terminals. For example, A → B is a unit production.

4. There should be no unreachable symbols: every non-terminal symbol
should be reachable from the start symbol of the grammar.

5. There should be no useless symbols: symbols (both terminals and
non-terminals) that cannot be reached from the start symbol or cannot
derive any terminal string should be eliminated.
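As a quick illustration, here is a small sketch that checks rules 1-3
above; the convention that non-terminals start with an uppercase letter
and terminals with a lowercase letter is an assumption of the sketch.

```python
# Sketch: check whether each rule has the CNF form A -> B C or A -> a.
# Assumption: non-terminals start uppercase, terminals start lowercase.
def is_cnf_rule(rhs):
    if len(rhs) == 2:                       # A -> B C: both must be non-terminals
        return all(sym[0].isupper() for sym in rhs)
    if len(rhs) == 1:                       # A -> a: must be a single terminal;
        return rhs[0][0].islower()          # a lone non-terminal is a unit production
    return False                            # epsilon or longer right-hand sides fail

rules = [("S", ("NP", "VP")), ("Det", ("the",)), ("A", ("B",)), ("B", ())]
for lhs, rhs in rules:
    print(lhs, "->", " ".join(rhs) or "ε", ":", is_cnf_rule(rhs))
```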

7. List the methods of Word Components.

Morphological Analysis: Breaking words into morphemes.

Stemming: Reducing words to their base or root form.

Lemmatization: Transforming words to their dictionary form (lemma).

Part-of-Speech Tagging: Assigning grammatical categories to words.
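A quick demonstration of three of these methods with NLTK (a sketch; it
assumes the wordnet and averaged_perceptron_tagger resources have been
downloaded):

```python
# Sketch: stemming, lemmatization, and POS tagging with NLTK.
# Assumes nltk.download('wordnet') and nltk.download('averaged_perceptron_tagger').
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["running", "cats", "studies"]

print([PorterStemmer().stem(w) for w in words])       # ['run', 'cat', 'studi']
print(WordNetLemmatizer().lemmatize("running", "v"))  # 'run' (dictionary form)
print(nltk.pos_tag(words))                            # e.g. [('running', 'VBG'), ...]
```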

8. List out the Morphological methods (models)?

1. Dictionary Lookup: Morphological info retrieved from dictionary.

2. Finite State Morphology: Uses finite state machines for morphological
processes.

3. Unification-based Morphology: Represents morphological info using
feature structures.

4. Functional Morphology: Models morphological processes using functional
programming concepts.

LAQs:
9. Explain the structure of documents.

The structure of documents in human language follows patterns, with words
combining to form meaningful grammatical units like statements, requests,
and commands.

Automatic extraction of document structure aids subsequent NLP tasks such
as parsing, machine translation, and semantic role labeling, all of which
rely on sentences as the basic processing unit.

Sentence boundary annotation is crucial for improving human readability in
automatic speech recognition (ASR) systems.

Sentence boundary detection involves deciding where sentences start and
end within a sequence of characters, often marked by punctuation like
periods, question marks, or exclamation points (a sketch follows at the
end of this answer).

Topic segmentation determines when a topic begins and ends in a sequence
of sentences.

Statistical classification methods are used to detect sentence and topic
boundaries using annotated training data, relying on features of the input
such as punctuation marks for prediction.

Effective feature design and selection are vital to prevent overfitting
and noise problems.

While statistical approaches are generally language-independent, each
language presents unique challenges.

For example, processing Chinese documents may require segmentation of
character sequences into words, since words are not typically separated by
spaces.

Similarly, morphologically rich languages may necessitate word structure
analysis to extract additional features.

Such processing usually occurs in a pre-processing step, where a sequence
of tokens is determined.

Tokens can be word or sub-word units depending on the task and language.

These algorithms are then applied to tokens to further analyze and extract
meaningful linguistic features.
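The sentence-splitting sketch referenced above, using NLTK's pretrained
Punkt model; the sample text is invented for illustration.

```python
# Sketch: sentence boundary detection with NLTK's Punkt model.
# Assumes nltk.download('punkt') has been run.
from nltk.tokenize import sent_tokenize

text = ("Dr. Smith arrived at 3 p.m. on Monday. Was the meeting over? "
        "Nobody knew!")
for i, sentence in enumerate(sent_tokenize(text), 1):
    print(i, sentence)  # Punkt learns that 'Dr.' and 'p.m.' do not end sentences
```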

10. Explain the Generative Sequence Classification method.

Generative Models:

Statistical models that learn the underlying probability distribution of
data.

Used in NLP to model the probability distribution of sequences of tokens.

Examples include Hidden Markov Models (HMMs), Generative Adversarial
Networks (GANs), and Variational Autoencoders (VAEs).

Sequence Classification:

Task of assigning labels to sequences of tokens (e.g., documents,
sentences).

Examples include document topic classification, sentiment analysis, and
intent classification.

Training Phase:

Model learns the probability distribution of sequences for different
classes.

Trained on labeled data where each sequence has an associated label.

Learns the likelihood of observing sequences given their labels.

Modeling Sequence Probability:

Calculates the probability of observing a sequence X given a class label
y: P(X | y).

Techniques include n-grams, recurrent neural networks (RNNs), or
transformers.

Classification Decision:

Uses Bayes' theorem to calculate the posterior probability of each class
given the sequence.

The class with the highest posterior probability is assigned as the
predicted label.

Bayesian Inference:

Bayes' theorem gives the probability of a label given a sequence:
P(y | X) = P(X | y) P(y) / P(X).

The posterior is linked to the evidence P(X | y) and the prior probability
P(y) of the label.

P(X) acts as a normalization constant and can be ignored for
classification.

Generative Sequence Classification leverages generative models to estimate
probabilities of sequences belonging to different classes, enabling
probabilistic sequence labeling and classification in NLP tasks. A minimal
sketch follows.
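A minimal sketch of the idea: a per-class unigram model with add-one
smoothing, classified with Bayes' rule. The training data and class names
are invented for illustration.

```python
# Sketch: generative sequence classification with per-class unigram models.
import math
from collections import Counter

train = [("good great film", "pos"), ("great acting good plot", "pos"),
         ("bad boring film", "neg"), ("boring bad plot", "neg")]

counts = {"pos": Counter(), "neg": Counter()}
priors = Counter()
for text, label in train:
    counts[label].update(text.split())   # learn P(X | y) from labeled data
    priors[label] += 1                   # learn the prior P(y)

vocab = {w for c in counts.values() for w in c}

def log_posterior(tokens, label):
    # log P(y) + sum_i log P(x_i | y), with add-one (Laplace) smoothing;
    # P(X) is omitted because it is constant across classes.
    total = sum(counts[label].values())
    lp = math.log(priors[label] / sum(priors.values()))
    for w in tokens:
        lp += math.log((counts[label][w] + 1) / (total + len(vocab)))
    return lp

tokens = "good film".split()
print(max(counts, key=lambda y: log_posterior(tokens, y)))  # -> 'pos'
```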

11. Explain Treebank method with example.

The Treebank method is a linguistic annotation technique used in Natural
Language Processing (NLP) to represent the syntactic structure of
sentences using parse trees.

These parse trees show the hierarchical relationships between words in a
sentence, helping in tasks like grammar analysis, syntactic parsing, and
information extraction.

Treebank Method Explanation:


1. Parse Trees:

Parse trees are graphical representations of the syntactic structure of
sentences. They show how words are grouped into phrases and how phrases
are related to each other.

Each node in the parse tree represents either a word (terminal node) or a
phrase (non-terminal node).

2. Treebank Annotation:

The Treebank method involves manually annotating sentences with parse
trees. Linguists or annotators analyze the sentence's syntax and create a
structured tree representation.

These annotated sentences form a corpus called a Treebank, which serves as
training data for developing and testing syntactic parsers and other NLP
models.

Example of Treebank Annotation:


Consider the sentence: "The cat chased the mouse."

1. Tokenization:

First, we break the sentence into tokens (words): "The", "cat", "chased",
"the", "mouse".

2. Part-of-Speech (POS) Tagging:

Next, we assign each token a part-of-speech tag based on its grammatical
role in the sentence:

"The" - Determiner (Det)

"cat" - Noun (NN)

"chased" - Verb (V)

"the" - Determiner (Det)

"mouse" - Noun (N)

3. Annotation Process:

The annotator analyzes the sentence's structure and creates a parse tree
based on grammatical rules.

Here's the parse tree for our example sentence:

                 S
          _______|_______
         NP              VP
       __|__         ____|____
      Det   N       V         NP
       |    |       |        __|__
      The  cat   chased    Det    N
                            |     |
                           the  mouse

In this tree:

S: Sentence

NP: Noun Phrase

VP: Verb Phrase

Det: Determiner

N: Noun
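The same tree in Penn-Treebank-style bracket notation, rendered here with
nltk.Tree:

```python
# The example parse tree in bracketed (Penn Treebank style) notation.
from nltk import Tree

tree = Tree.fromstring(
    "(S (NP (Det The) (N cat)) (VP (V chased) (NP (Det the) (N mouse))))")
tree.pretty_print()  # draws the same hierarchy shown above
```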

12. Construct Shift-Reduce Parsing:

N → N 'and' N
N → N 'or' N
N → 'a' | 'b' | 'c'
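The notes give no worked trace, so here is a minimal sketch using NLTK's
ShiftReduceParser on the grammar above; the input string "a and b" is an
assumed example, and trace=2 prints each shift and reduce step.

```python
# Sketch: shift-reduce parsing of the grammar above with NLTK.
import nltk

grammar = nltk.CFG.fromstring("""
    N -> N 'and' N
    N -> N 'or' N
    N -> 'a' | 'b' | 'c'
""")

parser = nltk.ShiftReduceParser(grammar, trace=2)  # trace=2 prints each step
for tree in parser.parse("a and b".split()):
    print(tree)  # (N (N a) and (N b))
```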

13. What are the Issues and Challenges of Morphology?

Irregularity: word forms are not described by a prototypical linguistic
model.

Ambiguity: word forms can be understood in multiple ways out of the
context of their discourse.

Productivity: is the inventory of words in a language finite, or is it
unlimited?

Morphological parsing tries to eliminate the variability of word forms to
provide higher-level linguistic units whose lexical and morphological
properties are explicit and well defined.

It attempts to remove unnecessary irregularity and give limits to
ambiguity, both of which are present inherently in human language.

By irregularity, we mean the existence of forms and structures that are
not described appropriately by a prototypical linguistic model.

Some irregularities can be understood by redesigning the model and
improving its rules, but other lexically dependent irregularities often
cannot be generalized.

Morphological parsing aims to reduce ambiguity by providing clearer
interpretations of word forms and structures.

Morphological modelling also faces the problem of productivity and
creativity in language, in which unconventional but perfectly meaningful
new words or new senses are coined.

14. Explain in detail about Morphological Methods.

1. Dictionary Lookup:

Definition: Dictionary lookup, also known as lexicon-based analysis,
involves referencing a dictionary or lexicon to identify and analyze word
forms.

Process: In this method, a pre-existing dictionary containing information
about words, such as their base forms, inflections, and meanings, is
consulted. When a word is encountered, it is searched in the dictionary to
retrieve its morphological information.

Application: Dictionary lookup is commonly used in morphological analysis
tasks, such as part-of-speech tagging, stemming, and lemmatization. It
provides a straightforward approach to morphological analysis,
particularly for languages with relatively simple morphological
structures.
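A minimal sketch of the idea, with a tiny hand-built lexicon (the entries
are invented for illustration):

```python
# Sketch: dictionary (lexicon) lookup for morphological analysis.
LEXICON = {
    "ran":  {"lemma": "run", "pos": "V", "tense": "past"},
    "cats": {"lemma": "cat", "pos": "N", "number": "plural"},
}

def analyze(word):
    # Unknown words fall through; a real system would back off to rules.
    return LEXICON.get(word.lower(), {"lemma": word, "pos": "UNK"})

print(analyze("ran"))   # {'lemma': 'run', 'pos': 'V', 'tense': 'past'}
print(analyze("dog"))   # {'lemma': 'dog', 'pos': 'UNK'}
```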

2. Finite State Morphology:

Definition: Finite state morphology (FSM) represents the morphological
rules of a language as finite state transducers (FSTs) or networks of
finite state machines.

Process: FSM models morphological processes as sequences of states and
transitions. Each state represents a linguistic unit (e.g., a morpheme),
and transitions between states correspond to morphological operations
(e.g., affixation, concatenation).

Application: FSM is widely used in natural language processing (NLP)
tasks, such as tokenization, stemming, and morphological analysis. It
offers a computationally efficient framework for capturing complex
morphological phenomena in a formal and expressive manner.
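A toy sketch of the idea: an explicit state-transition table that accepts
"cat" and its plural "cats" (the machine and its analyses are invented for
illustration):

```python
# Sketch: a tiny finite-state machine for the pattern stem + plural 's'.
TRANSITIONS = {
    # (state, input char) -> next state
    ("q0", "c"): "q1", ("q1", "a"): "q2", ("q2", "t"): "q3",  # spells 'cat'
    ("q3", "s"): "q4",                                        # plural 's'
}
ACCEPTING = {"q3": ("cat", "N.sg"), "q4": ("cat", "N.pl")}

def run(word):
    state = "q0"
    for ch in word:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return None          # no transition available: reject
    return ACCEPTING.get(state)  # analysis if we end in an accepting state

print(run("cat"), run("cats"), run("dog"))  # ('cat','N.sg') ('cat','N.pl') None
```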

3. Unification-based Morphology:

Definition: Unification-based morphology (UBM) is a linguistic framework
that employs unification, a process of merging and reconciling linguistic
features, to analyze word forms.

Process: UBM represents morphological rules and constraints as feature
structures, which encode linguistic properties such as agreement, tense,
and gender. These feature structures are unified or combined during the
analysis of word forms to generate morphological analyses.

Application: UBM is commonly used in computational linguistics and
artificial intelligence for parsing, generation, and translation tasks. It
provides a flexible and expressive formalism for modeling morphological
phenomena, particularly in languages with rich inflectional and
derivational systems.
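A minimal sketch with NLTK's feature structures; the agreement features
shown are illustrative, not from the notes:

```python
# Sketch: unifying feature structures with NLTK.
from nltk.featstruct import FeatStruct

noun  = FeatStruct(cat="N", agr=FeatStruct(num="pl"))
affix = FeatStruct(agr=FeatStruct(num="pl", per=3))

print(noun.unify(affix))   # compatible features merge into one structure

clash = FeatStruct(agr=FeatStruct(num="sg"))
print(noun.unify(clash))   # None: conflicting 'num' values fail to unify
```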

4. Functional Morphology:

Definition: Functional morphology focuses on the functional roles and
meanings of morphological structures within a language.

Process: In functional morphology, morphological analysis is guided by the
semantic, syntactic, and pragmatic functions of morphemes and word forms.
It considers how morphological structures contribute to the overall
meaning and communicative functions of linguistic expressions.

Application: Functional morphology is used in linguistic analysis,
language teaching, and language planning to explore the functional
motivations and effects of morphological phenomena. It provides insights
into the relationship between form and meaning in language, helping to
understand how morphological structures shape communication and
discourse.

15. Explain how the Morphological typology divides languages into groups.

16. Explain minimum spanning tree with an example dependency graph.

A minimum spanning tree (MST) is a subset of the edges of a connected,
undirected graph that connects all the vertices together without any
cycles and with the minimum possible total edge weight.

A dependency graph represents the syntactic relationships between words in
a sentence. Each word is a vertex, and the syntactic dependencies between
words are represented as edges.

Example: Let's create a dependency graph for the sentence "The cat chased
the mouse" and then find its minimum spanning tree.
Dependency Graph (head --> dependent):

"chased" --> "cat" (subject)

"chased" --> "mouse" (object)

"cat" --> "The" (determiner)

"mouse" --> "the" (determiner)

Now, let's illustrate this dependency graph:

         chased
        /      \
      cat      mouse
       |         |
      The       the

In this graph, "chased" is the main verb (the root), with "cat" as its
subject and "mouse" as its object; each noun is linked to its own
determiner.

To find the minimum spanning tree of this dependency graph, we select a
subset of edges that connects all the vertices with the least total edge
weight (taking the candidate dependency edges to carry scores as weights).
MST for the Dependency Graph:

"chased" --> "cat"

"chased" --> "mouse"

"cat" --> "The"

"mouse" --> "the"

         chased
        /      \
      cat      mouse
       |         |
      The       the

In this MST, we have selected the edges that maintain the syntactic
structure of the sentence while minimizing the total edge weight.

This tree connects all the words in the sentence without forming any
cycles and with the minimum possible number of edges.
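A minimal sketch with networkx, treating the graph as undirected and using
invented edge weights to stand in for dependency scores (real dependency
parsers use directed trees and the Chu-Liu/Edmonds algorithm):

```python
# Sketch: minimum spanning tree over the example words with networkx.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("chased", "cat", 1), ("chased", "mouse", 1),
    ("cat", "The", 1), ("mouse", "the", 1),
    ("chased", "The", 3),  # a spurious heavy edge the MST will drop
])

mst = nx.minimum_spanning_tree(G)
print(sorted(mst.edges(data="weight")))  # keeps the four weight-1 edges, no cycle
```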

17. Find out the probability for the grammar:


S → NP VP [0.80]
NP → Det N [0.3]
VP → V NP [0.20]
V → Includes [0.05]
Det → The [0.4]
Det → a [0.4]
N → meal [0.013]
N → flight [0.02]
for the input string: "The flight includes a meal".

Step-1: Number the word boundaries in order, starting from 0; here we have
5 words, so the positions run from 0 to 5.

Step-2: Create a 5*5 chart (the CYK table), where cell (i, j) holds the
non-terminals that can span words i+1 through j.

Note that the column side starts from 1, while the row side starts from 0.

Step-3: Take each word one by one.

Place the word's non-terminal, with its rule probability, in the cell
spanning just that word.

For example, "The" gives Det [0.4] in position (0,1).

Continue placing and annotating each word in the chart, then fill longer
spans by combining adjacent cells with the binary rules, multiplying the
probabilities.

Final calculated values (using the rule probabilities given above):
VP = 0.0000156
S = 0.000000029952 (or) 2.9952 * 10^(-8)
(Note: with N → meal [0.01] instead of [0.013], the result would be
2.304 * 10^(-8).)
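The result can be verified as a plain product of the rule probabilities
along the single parse tree:

```python
# Verify: probability of the parse of "The flight includes a meal"
# as the product of the probabilities of the rules used.
vp = 0.20 * 0.05 * (0.3 * 0.4 * 0.013)  # VP -> V NP over "includes a meal"
s  = 0.80 * (0.3 * 0.4 * 0.02) * vp     # S -> NP VP over the whole sentence
print(vp, s)                            # ≈ 1.56e-05, ≈ 2.9952e-08
```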

skip to 6:50 for solution…..

https://www.youtube.com/watch?v=SFQ-owZaU_s
