Lecture-3 (Words - Transducers)
@Copyrights: Natural Language Processing (NLP) Organized by Dr. Ahmad Jalal (http://portals.au.edu.pk/imc/)
1. Words and Transducers (Some Concepts)
Further, fish don’t usually change their form when they are
plural
1. Words and Transducers (Some Concepts) (Cont..)
1. Words and Transducers (Some Concepts) (Cont..)
But, for many NLP applications this isn’t possible because -ing is a
productive suffix.
This means that it applies to almost every verb.
Similarly, -s applies to almost every noun.
Productive suffixes even apply to new words; thus the new word fax can
automatically be used in the -ing form (faxing).
1. Words and Transducers (Some Concepts) (Cont..)
2. Survey of English Morphology
For example
- the word fox consists of a single morpheme (the morpheme fox).
- while the word cats consists of two: (i) the morpheme cat and (ii)
the morpheme -s.
2. Survey of English Morphology (Cont..)
The stem is the “main” morpheme of the word, supplying the main
meaning.
- example: in cats, cat is the stem.
2. Survey of English Morphology
2.1 Categories of Affixes
Affixes are divided into four types:
(1) prefixes, (2) suffixes, (3) infixes, and (4) circumfixes.
(1) Prefixes precede the stem,
e.g., The word unbuckle is composed of a stem buckle and the prefix un-.
2. Survey of English Morphology
2.1 Categories of Affixes (Cont..)
(4) Circumfixes do both: they precede and follow the stem.
- English doesn’t have any good examples of circumfixes, but many
other languages do. In German,
e.g., adding ge- to the beginning of the stem and -t to the end;
so the past participle of the verb sagen (to say) is gesagt (said).
3. Morphology to create Words
3. Morphology to create Words (Cont..)
1. Inflection
It is the combination of a word stem with a grammatical morpheme,
usually resulting in a word of the same class as the original stem,
and usually filling some syntactic function like agreement.
- English has the inflectional morpheme -s for marking the plural
on nouns, and
- the inflectional morpheme -ed for marking the past tense on verbs
3.1 Inflectional Morphology (a. Nouns)
English has a relatively simple inflectional system, covering:
(a) nouns,
(b) verbs, and
(c) sometimes adjectives.
3.1 Inflectional Morphology (b. Verbs)
3.1 Inflectional Morphology (b. Verbs) (Cont…)
3.1 Inflectional Morphology (b. Verbs) (Cont…)
Irregular verbs are those that have more or less idiosyncratic
forms of inflection.
Irregular verbs in English often have five different forms, but can have
as many as eight or as few as three (e.g. cut or hit).
Note that an irregular verb can inflect in the past form (also called the
preterite) by changing its vowel (eat/ate), or its vowel and some
consonants (catch/caught), or with no change at all (cut/cut).
3.1 Inflectional Morphology (b. Verbs) (Cont…)
For example, a single consonant letter is doubled before adding the -ing
and -ed suffixes (beg/begging/begged).
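For illustration, the doubling rule can be sketched in Python (the regular expression here is a rough approximation of our own; the real rule also depends on stress, so a sketch like this overgenerates for verbs such as visit):

```python
import re

def add_suffix(verb, suffix):
    """Rough sketch of the consonant-doubling spelling rule:
    a single final consonant after a single vowel is doubled
    before -ing or -ed (beg -> begging, begged)."""
    # one vowel followed by one final consonant (excluding w, x, y)
    if re.search(r"[aeiou][bcdfghj-np-tvz]$", verb):
        return verb + verb[-1] + suffix
    return verb + suffix

print(add_suffix("beg", "ing"))   # begging
print(add_suffix("walk", "ing"))  # walking
```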
3. Morphology to create Words (Cont..)
2. Derivation
It is the combination of a word stem with a grammatical morpheme,
- mainly dealing with adjectives, nouns, and verbs,
resulting in a word of a different class, often with a meaning hard to
predict exactly.
For example
the verb computerize can take the derivational suffix -ation to
produce the noun computerization.
3.2 Derivational Morphology
Case 1: Verb/Adjective to Noun:
While English inflection is relatively simple compared to other
languages, derivation in English is quite complex.
A very common kind of derivation in English is the formation of
new nouns, often from verbs or adjectives. This process is called
nominalization.
For Example:-
the suffix -ation produces nouns from verbs that often end in the suffix
-ize (computerize → computerization). Here are examples of some
particularly productive English nominalizing suffixes.
3.2 Derivational Morphology (Cont..)
Case 2: Verb/Noun to Adjective:
Adjectives can also be derived from nouns and verbs. Here are
examples of a few suffixes deriving adjectives from nouns or verbs.
3. Morphology to create Words (Cont..)
3. Cliticization
It is the combination of a word stem with a clitic.
A clitic is a morpheme that acts syntactically like a word, but is
reduced in form and attached (phonologically and sometimes
orthographically) to another word
For example
the English morpheme ’ve in the word “I’ve” is a clitic.
3.3 Cliticization Morphology
• Note that clitics in English are ambiguous: she’s can mean
she is or she has. Correctly segmenting off clitics in English is
simplified by the presence of the apostrophe (’).
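As a rough sketch (the clitic list below is illustrative, not exhaustive, and assumes straight ASCII apostrophes), apostrophe-based segmentation can be written as:

```python
import re

# Illustrative list of common English clitics; not exhaustive.
CLITIC_RE = re.compile(r"('ve|'re|'ll|'s|'d|'m|n't)$")

def split_clitic(word):
    """Split a clitic off at the apostrophe. Note that a form like
    she's stays ambiguous (she is / she has) after segmentation."""
    m = CLITIC_RE.search(word)
    if m:
        return [word[:m.start()], m.group(1)]
    return [word]

print(split_clitic("I've"))   # ['I', "'ve"]
```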
3. Morphology to create Words (Cont..)
4. Compounding
It is the combination of multiple word stems together.
For example
the noun doghouse is the concatenation of the morpheme
dog with the morpheme house.
4. Finite-State Morphological Parsing
4. Finite-State Morphological Parsing (Cont…)
The second column contains the stem of each word as well as
assorted morphological features. These features specify additional
information about the stem.
For example, the feature:
+N : means that the word is a noun;
+Sg : means it is singular;
+Pl : means it is plural;
+PresPart : means present participle (ending in -ing);
+PastPart : means past participle (ending in -ed).
Note that some of the input forms (like caught, goose, canto, or
vino) will be ambiguous between different morphological parses.
For now, we will consider the goal of morphological parsing merely
to list all possible parses.
4. Finite-State Morphological Parsing (Cont…)
In order to build a morphological parser, we’ll need at least the
following:
(1) Lexicon: the list of stems and affixes, together with basic information
about them (whether a stem is a Noun stem or a Verb stem, etc.).
(2) Morphotactics: the model of morpheme ordering that explains which
classes of morphemes can follow other classes of morphemes inside a word.
(3) Orthographic rules: these spelling rules are used to model the
changes that occur in a word, usually when two morphemes combine
(e.g., the y→ie spelling rule that changes city + -s to
cities rather than citys).
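A minimal sketch of two such spelling rules for the plural -s (the regular expressions below are our own simplifications; real orthographic rules have more exceptions):

```python
import re

def pluralize(noun):
    """Toy illustration of two orthographic rules for the plural -s:
    y -> ie after a consonant (city -> cities), and e-insertion
    after s, z, x, ch, sh (fox -> foxes)."""
    if re.search(r"[^aeiou]y$", noun):        # consonant + y
        return noun[:-1] + "ies"
    if re.search(r"(s|z|x|ch|sh)$", noun):    # sibilant endings
        return noun + "es"
    return noun + "s"

print(pluralize("city"))  # cities
print(pluralize("fox"))   # foxes
```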
4.1 Building a Finite-State LEXICON (Working For Words)
A lexicon is a repository for words.
The simplest possible lexicon would consist of an explicit list of
every word of the language (every word, i.e., including abbreviations
(“AAA”) and proper names (“Jane” or “Beijing”)) as follows:
a, AAA, AA, Aachen, aardvark, aardwolf, aba, abaca, aback, . . .
4.2 Building a Finite-State LEXICON (Reg/Irreg Noun)
Reg-noun: The FSA assumes that the
lexicon includes regular nouns (reg-noun) that
take the regular -s plural (e.g., cat, dog, fox,
aardvark).
This lexicon has three stem classes (reg-verb-stem, irreg-verb-stem, and
irreg-past-verb-form), plus four more affix classes (-ed past, -ed participle,
-ing participle, and third singular -s).
Table: Lexicon for finite-state
4.4 Building a Finite-State LEXICON (Example-1)
Problem Defined:
While this FSA will recognize all the adjectives, it will also recognize
ungrammatical forms like unbig, unfast, oranger, or smally. We need
to set up classes of roots and specify their possible suffixes.
- Thus adj-root1 would include adjectives that can occur with un-
and -ly (clear, happy, and real)
- while adj-root2 would include adjectives that can't (big, small).
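A rough sketch of this morphotactic check, with the two root classes hard-coded from the slide (the suffix list and the accepts function are our own illustration, and spelling changes such as happy → happier are ignored):

```python
ADJ_ROOT1 = {"clear", "happy", "real"}   # take un- and -ly
ADJ_ROOT2 = {"big", "small"}             # take neither un- nor -ly

def accepts(word):
    """Sketch of the adjective FSA: optional un- prefix, a root,
    and an optional suffix, with the root class gating which
    affixes are allowed."""
    prefix = ""
    if word.startswith("un"):
        prefix, word = "un", word[2:]
    for suffix in ("", "ly", "er", "est"):
        if suffix and not word.endswith(suffix):
            continue
        root = word[: len(word) - len(suffix)] if suffix else word
        if root in ADJ_ROOT1:
            return True                  # un- and -ly both allowed
        if root in ADJ_ROOT2:
            # big, small: comparatives are fine, but no un- or -ly
            return prefix == "" and suffix in ("", "er", "est")
    return False

print(accepts("unclear"))  # True
print(accepts("unbig"))    # False
```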
This FSA models a number of derivational facts, such as the well
known generalization that any verb ending in -ize can be followed by
the nominalizing suffix –ation.
CASE STUDY:
There is a word fossilize; we can predict the word fossilization by
following states q0, q1, and q2. Similarly, adjectives ending in -al or
-able at q5 (equal, formal, realizable) can take the suffix -ity, or
sometimes the suffix -ness to state q6 (naturalness, casualness).
4.4 Building a Finite-State LEXICON
(Class Participation)
Design and build a finite-state Lexicon of derivation in which
morphotactics of English adjectives and FSA of following
combinations are defined:
[Note: design single FSA for overall word].
4.4 Building a Finite-State LEXICON
(Assignments)
Consider the following FSA of English derivational morphology and
describe the following state combinations:
q0->q1->q2->q3
q0->q1->q2->q4
q0->q5->q6
q0->q5->q2->q3
q0->q5->q2->q4
q0->q5->q6
q0->q5->q9
q0->q10->q8->q9
q0->q8->q9
q0->q10->q8->q6
q0->q8->q6
q0->q11->q8->q9
q0->q7->q8->q9
q0->q11->q8->q6
q0->q10->q8->q6
5 Finite State Transducers [FST]
(Working For String/ Set of Strings)
We’ve now seen that FSAs can represent the morphotactic structure
of a lexicon, and can be used for word recognition.
A transducer maps between one representation and another;
Finite-state transducer or FST is a type of finite automaton which;
- maps between two sets of symbols. We can visualize an FST as a
two-tape automaton which recognizes or generates pairs of strings.
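A toy sketch of an FST as a two-tape machine (the one-state machine below, which swaps a and b, is our own example, not from the text):

```python
def transduce(transitions, start, finals, inp):
    """Run a deterministic FST; return the output string, or None
    if the input is rejected."""
    state, out = start, []
    for sym in inp:
        if (state, sym) not in transitions:
            return None
        state, o = transitions[(state, sym)]
        out.append(o)
    return "".join(out) if state in finals else None

# One-state machine mapping a -> b and b -> a: it recognizes
# (or generates) string pairs such as (aab, bba).
FLIP = {("q0", "a"): ("q0", "b"), ("q0", "b"): ("q0", "a")}

print(transduce(FLIP, "q0", {"q0"}, "aab"))  # bba
```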
[Figure: example FSTs with 3 states (Case 2) and 4 states (Case 3)]
5.1 Finite State Transducers (FST)
(a. Types of FST)
The FST has a more general function than an FSA;
- where an FSA defines a formal language by defining a set of
strings,
- an FST defines a relation between sets of strings.
Another way of looking at an FST is as a machine that reads one
string and generates another.
(1) FST as recognizer:
- A transducer that takes a pair of strings as input and outputs accept if
the string-pair is in the string-pair language, and reject if it is not.
(e.g., he go:goes to school. He goes to the bazaar.)
5.1 Finite State Transducers (FST)
(a. Types of FST) (Cont…)
(2) FST as generator:
- A machine that outputs pairs of strings of the language. Thus, the
output is a yes or no, and a pair of output strings.
(e.g., She likes a Mercedes car. His choice of car color is red [Yes/No].)
5.2 Finite State Transducers (FST)
(b. Inversion Vs Composition FST) (Cont…)
FSTs and regular relations are closed under union; in general, they are not
closed under difference, complementation, and intersection.
(e.g., older men (A) and a boy (Z) travel on a bus; he (Z) acts as a guide
to them (A) during the journey.)
5.2 Finite State Transducers (FST)
(b. Inversion Vs Composition FST) (Cont…)
(2) Composition: If T1 is a transducer from I1 to O1 and T2 a transducer from
O1 to O2, then T1 ◦ T2 maps from I1 to O2. example;
SYNTAX: T1> Input1: A – Output1: E T2> Output1: E – Output2: G
FST-based Composition
Composition is useful because it allows us to take two transducers that run
in series and replace them with one more complex transducer.
Composition works as in algebra; applying T1 ◦ T2 to an input sequence S is
identical to applying T1 to S and then T2 to the result; thus T1 ◦ T2(S) = T2(T1(S)).
(e.g., Ali (a) is married to Aliya (b), and Aliya (b) has two children (c).)
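Composition can be sketched with plain functions standing in for transducers (T1 and T2 below are arbitrary illustrations of our own):

```python
def compose(t1, t2):
    """T1 o T2: apply T1 first, then T2, matching the identity
    T1 o T2(S) = T2(T1(S))."""
    return lambda s: t2(t1(s))

t1 = str.upper              # stand-in for a transducer I1 -> O1
t2 = lambda s: s[::-1]      # stand-in for a transducer O1 -> O2
pipeline = compose(t1, t2)  # one machine replacing the series

print(pipeline("abc"))      # CBA
```

The composed pipeline behaves exactly like running the two stages in series, which is why composition lets us replace a cascade of transducers with one more complex transducer.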
5.3 Finite State Transducers (FST)
(c. Sequential Transducers and Determinism)
Sequential transducers, by contrast, are a subtype of transducers that are
deterministic on their input.
Sequential transducers are not necessarily sequential on their output.
6. FSTs For Morphological Parsing
• In the finite-state morphology paradigm, we represent a word as a
correspondence between a lexical level, which represents a
concatenation of morphemes making up a word, and
• the surface level, which represents the concatenation of letters
which make up the actual spelling of the word.
6. FSTs For Morphological Parsing (Cont…)
(Example)
The transducer will map plural nouns into the stem plus the
morphological marker +Pl, and singular nouns into the stem plus the
morphological marker +Sg.
For Example;
The surface form cats will map to cat +N +Pl. This can be viewed in
feasible-pair format as:
c:c a:a t:t +N:ε +Pl:^s#  [reg-noun]  (ε = nothing)
p:p e:e o:o p:p l:l e:e +N:ε +Sg:ε  [irreg-sg-noun]
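The feasible-pair notation can be read as a list of (lexical, surface) symbol pairs, with the empty string standing in for ε (a sketch; the marker-stripping step below is our own simplification):

```python
# (lexical symbol, surface symbol) pairs for cat +N +Pl <-> cats;
# "" plays the role of epsilon, ^ is a morpheme boundary, # word end.
REG_NOUN_CAT = [("c", "c"), ("a", "a"), ("t", "t"),
                ("+N", ""), ("+Pl", "^s#")]
IRREG_SG_PEOPLE = [("p", "p"), ("e", "e"), ("o", "o"), ("p", "p"),
                   ("l", "l"), ("e", "e"), ("+N", ""), ("+Sg", "")]

def lexical_tape(pairs):
    return " ".join(lex for lex, _ in pairs)

def surface_tape(pairs):
    # join surface symbols, then drop the boundary and end markers
    # to obtain the final spelling
    s = "".join(sur for _, sur in pairs)
    return s.replace("^", "").replace("#", "")

print(lexical_tape(REG_NOUN_CAT))  # c a t +N +Pl
print(surface_tape(REG_NOUN_CAT))  # cats
```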
6. FSTs For Morphological Parsing (Cont…)
7. Transducers and Orthographic Rules
7. Transducers and Orthographic Rules (Cont…)
We could write an E-insertion rule that performs the mapping from the
intermediate to surface levels shown.
Such a rule might say something like “insert an e on the surface tape just
when the lexical tape has a morpheme ending in (s, z, x, ch, sh etc.) and
the next morpheme is -s”.
Here’s a formalization of the rule
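The rule described above can also be approximated in code (a rough sketch of our own, using ^ for the morpheme boundary and # for the word end, as on the intermediate tape):

```python
import re

def e_insertion(intermediate):
    """Apply the e-insertion rule to an intermediate-tape string,
    e.g. "fox^s#" -> "foxes". Insert e between a sibilant-final
    stem and the -s morpheme, then strip the markers."""
    surface = re.sub(r"(s|z|x|ch|sh)\^s#", r"\1es", intermediate)
    return surface.replace("^", "").replace("#", "")

print(e_insertion("fox^s#"))  # foxes
print(e_insertion("cat^s#"))  # cats
```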
8. Combining FST Lexicon and Rules
The lexicon transducer maps between the lexical level, with its stems and
morphological features, and an intermediate level that represents a simple
concatenation of morphemes.
Then a host of transducers, each representing a single spelling rule
constraint, all run in parallel so as to map between this intermediate level
and the surface level.
8. Combining FST Lexicon and Rules (Cont…)
A trace of the system accepting the mapping from fox +N +Pl to foxes.
8. Combining FST Lexicon and Rules
(Class Participation)
Design architecture of 2nd level cascade of transducers by
considering combination of FST lexicon and rules :
[Note: Draw Lexical + Intermediate + surface] & [FST lexicon].
8. Combining FST Lexicon and Rules (Cont…)
9. Lexicon-Free FSTs: The Porter Stemmer
9. Lexicon-Free FSTs: The Porter Stemmer (Cont…)
10. Word and Sentence Tokenization
Word tokenization may seem very simple in a language like English that
separates words via a special ‘space’ character.
A closer examination will make it clear that whitespace is not sufficient by
itself.
For Example;
Consider the following sentences from a Wall Street Journal and a New York
Times article, respectively:
Sentence 1 (Wall Street Journal)
Mr. Sherwood said reaction to Sea Containers’ proposal has been "very positive." In
New York Stock Exchange composite trading yesterday, Sea Containers closed at
$62.625, up 62.5 cents.
Sentence 2 (New York Times)
‘‘I said, ‘what’re you? Crazy?’ ’’ said Sadowsky. ‘‘I can’t afford to do that.’’
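A regex tokenizer sketch that handles cases like these (the abbreviation list and token patterns are illustrative only, and it assumes straight ASCII apostrophes rather than the curly quotes above):

```python
import re

TOKEN_RE = re.compile(r"""
    \$?\d+(?:\.\d+)?      # numbers and prices: 62.5, $62.625
  | (?:Mr|Mrs|Dr|St)\.    # a few known abbreviations kept whole
  | \w+(?:'\w+)?          # words, with internal apostrophes: can't
  | [^\w\s]               # any other single punctuation mark
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens matched left to right."""
    return TOKEN_RE.findall(text)

print(tokenize("Sea Containers closed at $62.625, up 62.5 cents."))
```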
10. Word and Sentence Tokenization (Cont…)
10. Word and Sentence Tokenization
(Presentation of each candidate)
Solutions for word/sentence tokenization:
11. Minimum Edit Distance
The distance between two strings is a measure of how alike
the two strings are.
The minimum edit distance between two strings is the minimum number of
editing operations (insertion, deletion, substitution) needed to transform one
string into another.
For example, the minimum edit distance between the words intention and
execution is five operations.
11. Minimum Edit Distance (Cont…)
11. Minimum Edit Distance (Cont…)
Dynamic programming algorithms for sequence comparison work by
creating a distance matrix with one column for each symbol in the target
sequence and one row for each symbol in the source sequence (i.e., target
along the bottom, source along the side).
For minimum edit distance, this matrix is the edit-distance matrix. Each
cell edit-distance[i,j] contains the distance between the first i characters of
the target and the first j characters of the source.
Each cell can be computed as a simple function of the surrounding cells;
thus starting from the beginning of the matrix it is possible to fill every
entry.
The value in each cell is computed by taking the minimum of the three
possible paths through the matrix which arrive there.
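The algorithm described above, as a straightforward Python sketch (unit cost for each of insertion, deletion, and substitution):

```python
def min_edit_distance(source, target):
    """Fill the edit-distance matrix: cell d[i][j] holds the distance
    between the first i characters of source and the first j
    characters of target."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                       # i deletions from source
    for j in range(1, m + 1):
        d[0][j] = j                       # j insertions into source
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            # minimum of the three paths arriving at this cell
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[n][m]

print(min_edit_distance("intention", "execution"))  # 5
```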