Workshop on Language Technology for Normalisation of Less-Resourced Languages (SALTMIL8/AfLaT2012)
Learning Morphological Rules for Amharic Verbs
Using Inductive Logic Programming
Wondwossen Mulugeta (1) and Michael Gasser (2)
(1) Addis Ababa University, Addis Ababa, Ethiopia
(2) Indiana University, Bloomington, USA
E-mail: wondgewe@indiana.edu, gasser@cs.indiana.edu
Abstract
This paper presents a supervised machine learning approach to morphological analysis of Amharic verbs. We use Inductive Logic
Programming (ILP), implemented in CLOG. CLOG learns rules as a first order predicate decision list. Amharic, an under-resourced
African language, has very complex inflectional and derivational verb morphology, with four and five possible prefixes and suffixes
respectively. While the affixes are used to show various grammatical features, this paper addresses only subject prefixes and suffixes.
The training data used to learn the morphological rules are manually prepared according to the structure of the background
predicates used for the learning process. The training resulted in 108 stem extraction and 19 root template extraction rules from the
examples provided. After combining the various rules generated, the program has been tested using a test set containing 1,784
Amharic verbs. An accuracy of 86.99% has been achieved, encouraging further application of the method for complex Amharic
verbs and other parts of speech.
prepositions and conjunctions.
For Amharic, like most other languages, verbs have
the most complex morphology. In addition to the
affixation, reduplication, and compounding common to
other languages, in Amharic, as in other Semitic
languages, verb stems consist of a root + vowels +
template merger (e.g., sbr + ee + CVCVC, which leads
to the stem seber¹ ‘broke’) (Yimam, 1995;
Bender, 1968). This non-concatenative process makes
morphological analysis more complex than in languages
whose morphology is characterized by simple affixation.
The affixes also contribute to the complexity. Verbs can
take up to four prefixes and up to five suffixes, and the
affixes have an intricate set of co-occurrence rules.
In Amharic verbs, grammatical features are not
expressed by affixes alone. The intercalation pattern of the
consonants and the vowels that make up the verb stem
also determines various grammatical
features of the word. For example, the two
words in Figure 1 have the same prefixes and suffixes and the same
root, but the pattern in which the consonants and the
vowels are intercalated differs, resulting in different
grammatical information.
1. Introduction
Amharic is a Semitic language, related to Hebrew,
Arabic, and Syriac. After Arabic, it is the most widely
spoken Semitic language, with around 27 million
speakers (Sieber, 2005; Gasser, 2011). As the working
language of the Ethiopian Federal Government and
some regional governments in Ethiopia, most documents
in the country are produced in Amharic. There is also an
enormous production of electronic and online accessible
Amharic documents.
One of the fundamental computational tasks for a
language is analysis of its morphology, where the goal is
to derive the root and grammatical properties of a word
based on its internal structure. Morphological analysis,
especially for complex languages like Amharic, is vital
for development and application of many practical
natural language processing systems such as machine-readable dictionaries, machine translation, information
retrieval, spell-checkers, and speech recognition.
While various approaches have been used for other
languages, morphological analysis of Amharic has so far been
attempted using only rule-based methods. In this paper,
we apply machine learning to the task.
?-sebr-alehu     1st pers. sing. simplex imperfective
?-seber-alehu    1st pers. sing. passive imperfective
Figure 1: Stem template variation example
2. Amharic Verb Morphology
The different parts of speech and their formation
along with the interrelationships which constitute the
morphology of Amharic words have been more or less
thoroughly studied by linguists (Sieber, 2005;
Dawkins, 1960; Bender, 1968). In addition to lexical
information, the morphemes in an Amharic verb convey
subject and object person, number, and gender; tense,
aspect, and mood; various derivational categories such
as passive, causative, and reciprocal; polarity
(affirmative/negative); relativization; and a range of
In this second case, the difference in grammatical
feature is due to the affixes rather than the internal root
template structure of the word.
te-deres-ku 1st pers. sing. passive perfective
deres-ku 1st pers. sing. simplex perfective
Figure 2: Affix variation example
¹ Amharic is written in the Geez writing system. For our morphology learning
system we romanize Amharic orthography, and we cite these romanized forms in
this paper.
As in many other languages, Amharic morphology is
also characterized by alternation rules governing the
form that morphemes take in particular environments.
Alternation can occur either at stem-affix
boundaries or within the stem itself. Suffix-based
alternation is seen, for example, in the second person
singular feminine imperfect and imperative, shown in
Table 1. The first two examples in Table 1 show that
the second person singular feminine imperative marker
'i', when it follows the character 'l', merges with it to
yield 'y' (gdel-i becomes gdey). The
last two examples show that the same alternation rule
applies in the imperfect.
No.  Word            Root  Feature
1    gdel            gdl   2nd person sing. masc. imperative
2    gdey (gdel-i)   gdl   2nd person sing. fem. imperative
3    t-gedl-aleh     gdl   2nd person sing. masc. imperfect
4    t-gedy-alex     gdl   2nd person sing. fem. imperfect
Table 1: Example of Amharic Alternation Rule
4. ILP and Morphology Learning
Inductive Logic Programming (ILP) is a supervised
machine learning framework based on logic
programming. In ILP a hypothesis is drawn from
background knowledge and examples. The examples
(E), background knowledge (B) and hypothesis (H) all
take the form of logic programs. The background
knowledge and the final hypothesis induced from the
examples are used to evaluate new instances.
Since logic programming allows for the expression of
arbitrary relations between objects, ILP is more
expressive than attribute-value representations, enabling
flexible use of background knowledge (Bratko & King,
1994; Mooney & Califf, 1995). It also has advantages
over approaches such as n-gram models, Hidden
Markov Models, neural networks and SVM, which
represent examples using fixed length feature vectors
(Bratko & King, 1994). These techniques have difficulty
representing relations, recursion and unbounded
structural representation (Mooney, 2003). ILP, on the
other hand, employs a rich knowledge representation
language without length constraints. Moreover, the first
order logic used in ILP reduces the amount of
feature extraction that other approaches require.
In induction, one begins with some data during the
training phase, and then determines what general
conclusion can logically be derived from those data. For
morphological analysis, the learning data would be
expected to guide the construction of word formation
rules and interactions between the constituents of a
word.
There have been only a few attempts to apply ILP to
morphology, and most of these have dealt with
languages with relatively simple morphology involving
few affixes (Kazakov, 2000; Manandhar et al, 1998;
Zdravkova et al, 2005). The results are nevertheless
encouraging.
While we focus on Amharic verb morphology, our
goal is a general-purpose ILP morphology learner. Thus
we seek background knowledge that is plausible across
languages and that can be combined with language-specific
examples to yield rule hypotheses that generalize to new
examples in the language.
CLOG is a Prolog-based ILP system, developed by
Manandhar et al (1998)², for learning first order decision
lists (rules) on the basis of positive examples only. A
rule in Prolog is a clause with one or more conditions.
The right-hand side of the rule (the body) is a condition
and the left-hand side of the rule (the head) is the
conclusion. The operator between the left and the right
hand side (the sign ‘:-’) means if. The body of a rule is a
list of goals separated by commas, where commas are
understood as conjunctions. For a rule to be true, all of
its conditions/goals must be evaluated to be true. In the
expression below, p is true if q and r are true or if s and t are
3. Machine Learning of Morphology
Since Koskenniemi’s (1983) ground-breaking work on
two-level morphology, there has been a great deal of
progress in finite-state techniques for encoding
morphological rules (Beesley & Karttunen, 2003).
However, creating rules by hand is an arduous and time-consuming task, especially for a complex language like
Amharic. Furthermore, a knowledge-based system is
difficult to debug, modify, or adapt to other similar
languages. Our experience with HornMorpho (Gasser,
2011), a rule-based morphological analyser and
generator for Amharic, Oromo, and Tigrinya, confirms
this. For these reasons, there is considerable interest in
robust machine learning approaches to morphology,
which extract linguistic knowledge automatically from
an annotated or un-annotated corpus. Our work belongs
to this category.
Morphology learning systems may be unsupervised
(Goldsmith, 2001; Hammarström & Borin, 2011; De
Pauw & Wagacha, 2007) or supervised (Oflazer et al
2001; Kazakov, 2000). Unsupervised systems are trained
on unprocessed word forms and have the obvious
advantage of not requiring segmented data. On the other
hand, supervised approaches have important advantages
of their own: they are less dependent on large
corpora, require less human effort, are relatively fast (which
makes them scalable to other languages), and do not require
all rules in the language to be enumerated.
Supervised morphology learning systems are usually
based on two-level morphology. These approaches differ
in the level of supervision they use to capture the rules.
A weakly supervised approach uses word pairs as input
(Manandhar et al, 1998; Mooney & Califf, 1995;
Zdravkova et al, 2005). Other systems may require
segmentation of input words or an analysis in the form
of a stem or root and a set of grammatical morphemes.
true.
² CLOG is a freely available ILP system at:
http://www-users.cs.york.ac.uk/suresh/CLOG.html
p :- q, r.
p :- s, t.
a) Learning stem extraction:
The background predicate ‘set_affix’ uses a
combination of multiple ‘split’ operations to
identify the prefixes and suffixes attached to the input
word. This predicate is used to learn the affixes
from examples presented as in Figure 3 by taking
only the Word and the Stem (the first two arguments
of the example).
p ⇔ (q ∧ r) ∨ (s ∧ t)
Where q, r, s and t could be facts or predicates with any
arity and p is a predicate with any number of arguments.
CLOG relies on output completeness, which assumes
that every form of an object is included in the example
and everything else is excluded (Mooney & Califf,
1995). We preferred CLOG over other ILP systems
because it requires only positive examples and runs
faster than the other variants (Manandhar et al, 1998).
CLOG uses a hill climbing strategy to build the rules,
starting from a simple goal and iteratively adding more
rules to satisfy the goal until no further
improvement is possible. The rules generated by
the learner are evaluated using a gain function that
compares the number of positively and negatively
covered examples in the current and previous learning
stages (Manandhar et al, 1998).
set_affix(Word, Stem, P1, P2, S1, S2) :-
    split(Word, P1, W11),    % Word = P1 + W11
    split(Stem, P2, W22),    % Stem = P2 + W22
    split(W11, X, S1),       % W11  = X + S1
    split(W22, X, S2),       % W22  = X + S2 (same middle part X)
    not((P1=[], P2=[], S1=[], S2=[])).
Figure 4: Affix extraction predicate
The predicate makes all possible splits of Word and
Stem into three segments to identify the prefix and
suffix substitutions required to unify Stem with
Word. In this predicate, P1 and S1 are the prefix and
suffix of the Word, while P2 and S2 are the prefix
and suffix of the Stem respectively. For example, if
Word and Stem are tgedyalex and gedl respectively,
then the predicate will try all possible splits, and
one of these splits will result in P1=[t], P2=[],
S1=[y,a,l,e,x] and S2=[l]. That is, tgedyalex will be
associated with the stem gedl if the prefix P1 is
replaced with P2 and the suffix S1 is replaced with
S2.
The ultimate objective of this predicate is to identify
the prefix and suffix of a word and then extract the
valid stem (Stem) from the input string (Word).
Here, we have used the utility predicate ‘split’, which
segments any input string into all possible pairs of
substrings. For example, the string sebr could be
segmented as {([]-[sebr]), ([s]-[ebr]), ([se]-[br]),
([seb]-[r]), ([sebr]-[])}.
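The behaviour of ‘split’ and ‘set_affix’ described above can be mirrored outside Prolog. The following is a minimal Python sketch, not the authors' implementation: it works on plain strings instead of character lists, and the function names merely reuse the predicate names for readability.

```python
def split(seq):
    """Yield every (left, right) pair with left + right == seq."""
    for i in range(len(seq) + 1):
        yield seq[:i], seq[i:]


def set_affix(word, stem):
    """Yield (p1, p2, s1, s2) such that word = p1 + x + s1 and
    stem = p2 + x + s2 for some shared middle part x (cf. Figure 4)."""
    for p1, w11 in split(word):
        for p2, w22 in split(stem):
            for x, s1 in split(w11):
                # The remainder of the stem must start with the same x.
                if w22[:len(x)] != x:
                    continue
                s2 = w22[len(x):]
                if p1 or p2 or s1 or s2:  # reject the all-empty analysis
                    yield p1, p2, s1, s2


print(list(split("sebr")))
analyses = set(set_affix("tgedyalex", "gedl"))
print(("t", "", "yalex", "l") in analyses)  # True
```

As in the Prolog version, the search is exhaustive: the analysis with prefix t and the suffix substitution yalex/l is only one of many candidate splits, and it is the learner that has to decide which candidates generalize.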
5. Experiment Setup and Data
Learning morphological rules with ILP requires
preparation of the training data and background
knowledge. To handle a language of the complexity of
Amharic, we require background knowledge predicates
that can handle stem extraction by identifying affixes,
root and vowel identification and grammatical feature
association with constituents of the word.
The training data used during the experiment is of the
following form:
stem([s,e,b,e,r,k,u],[s,e,b,e,r],[s,b,r], [1,1]).
stem([s,e,b,e,r,k],[s,e,b,e,r],[s,b,r], [1,2]).
stem([s,e,b,e,r,x],[s,e,b,e,r],[s,b,r], [1,3]).
Figure 3: Sample examples for stem and root learning
The predicate 'stem' provides a word and its stem to
permit the extraction of the affixes and root template
structure of the word. The first three parameters specify
the input word, the stem of the word after affixes are
removed, and the root of the stem respectively. The
fourth parameter is the codification of the grammatical
features (tense-aspect-mood and subject) of the word.
Taking the second example in Figure 3, the word
seberk has the stem seber with the root sbr and is
perfective (the first element of the fourth parameter, which
is 1) with second person singular masculine subject (the
second element of the fourth parameter, which is 2).
We codified the grammatical features of the words
and made them parameters of the training data set rather
than representing the morphosyntactic description as
predicates as in approaches used for other languages
(Zdravkova et al, 2005).
The background knowledge also includes predicates
for string manipulation and root extraction. Both are
language-independent, making the approach adaptable
to other similar languages. We run three separate
training experiments to learn the stem extraction, root
patterns, and internal stem alternation rules.
b) Learning Roots:
The root extraction predicate, ‘root_vocal’, extracts
the Root and the Vowel sequence, in the right order, from the
Stem. This predicate learns the root from examples
presented as in Figure 3 by taking only the Stem and
the Root (the second and third arguments).
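root_vocal(Stem, Root, Vowel) :- merge(Stem, Root, Vowel).

% Base case: not shown in the original figure; added here as an
% assumption so that the recursion can terminate.
merge([], [], []).
merge([X,Y,Z|T], [X,Y|R], [Z|V]) :- merge(T, R, V).
merge([X,Y|T], R, [X,Y|V]) :- merge(T, R, V).
merge([X|Y], [X|Z], W) :- merge(Y, Z, W).
merge([X|Y], Z, [X|W]) :- merge(Y, Z, W).
Figure 5: Root template extraction predicate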
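As an illustration of what the predicate in Figure 5 computes, the Python sketch below enumerates the ways of assigning each character of a stem, in its original order, either to the root or to the vowel sequence, and keeps the assignments whose root matches the one given in the training example. It is a rendering under stated assumptions (strings instead of lists, no Prolog clause ordering), not the authors' code.

```python
def merge(stem):
    """Yield every (root, vowel) pair obtained by assigning each character
    of stem, in order, to either the root or the vowel sequence."""
    if not stem:
        yield "", ""
        return
    head, tail = stem[0], stem[1:]
    for root, vowel in merge(tail):
        yield head + root, vowel   # head goes to the root
        yield root, head + vowel   # head goes to the vowel sequence


def root_vocal(stem, root):
    """Return the vowel sequences left over when root is taken out of stem."""
    return sorted({vowel for r, vowel in merge(stem) if r == root})


print(root_vocal("seber", "sbr"))  # ['ee']
```

For a stem of five characters there are 32 candidate assignments; only one of them yields the root sbr, and its leftover vowel sequence is ee.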
The predicate ‘root_vocal’ performs unconstrained
permutation of the characters in the Stem until the
first part of the permuted string matches the Root
character pattern provided during training. The
goal of this predicate is to separate the vowels and
the consonants of a Stem. In this predicate we have
used the utility predicate ‘merge’ to perform the
permutation. For example, if Stem is seber and the
example associates this stem with the Root sbr, then
‘root_vocal’, using ‘merge’, will generate many
patterns, one of which is sbree. From this, the system
ultimately learns that the vowel pattern [e,e] is
valid within a stem.

‘feature’: used to associate the identified affixes
and root CV pattern with the known
grammatical features from the example. This
predicate uses a codified representation of the
eight subjects and four tense-aspect-mood
features (‘tam’) of Amharic verbs, which is also
encoded as background knowledge. This
predicate is the only language-dependent
background knowledge we have used in our
implementation.
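The codified feature lookup can be pictured as two small tables. The Python sketch below is only illustrative: it contains just the codes that the text and figures spell out (1 and 2 for tense-aspect-mood, 2 and 7 for subject), and the remaining entries of the full tables of four tense-aspect-mood values and eight subjects are deliberately left out rather than guessed.

```python
# Only codes stated in the paper; the full tables are not listed there.
TAM = {1: "perfective", 2: "imperfective"}
SUBJ = {2: "second person singular masculine", 7: "tppn"}


def feature(codes):
    """Map a [tam, subject] code pair to labels, or None if a code is unknown."""
    tam_code, subj_code = codes
    if tam_code not in TAM or subj_code not in SUBJ:
        return None
    return [TAM[tam_code], SUBJ[subj_code]]


print(feature([2, 7]))  # ['imperfective', 'tppn']
print(feature([1, 2]))  # ['perfective', 'second person singular masculine']
```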
c) Learning stem internal alternations:
Another challenge for Amharic verb morphology
learning is handling stem internal alternations. For
this purpose, we have used the background
predicate ‘set_internal_alter’:
feature([X,Y], [X1,Y1]) :-
    tam([X], X1),
    subj([Y], Y1).
Figure 9: Grammatical feature assignment predicate
set_internal_alter(Stem, Valid_Stem, St1, St2) :-
    split(Stem, P1, X1),
    split(Valid_Stem, P1, X2),
    split(X1, St1, Y1),
    split(X2, St2, Y1).
6. Experiments and Result
For CLOG to learn a set of rules, the predicate and
arity for the rules must be provided. Since we are
learning words by associating them with their stem, root
and grammatical features, we use the predicate schemas
rule(stem(_,_,_,_)) for set_affix and root_vocal, and
rule(alter(_,_)) for set_internal_alter. The training
examples are also structured according to these predicate
schemas.
The training set contains 216 manually prepared
Amharic verbs. The examples cover all possible
combinations of tense and subject features. Each word is
first romanized, then segmented into the stem and
grammatical features, as required by the ‘stem’ predicate
in the background knowledge. When the word results
from the application of one or more alternation rules, the
stem appears in the canonical form. For example, for the
word gdey, the stem specified is gdel (see the second
example in Table 1).
Characters in the Amharic orthography represent
syllables, hiding the detailed interaction between the
consonants and the vowels. For example, the masculine
imperative verb ‘ግደል’ gdel can be made feminine by
adding the suffix ‘i’ (gdel-i). But, in Amharic, when the
dental ‘l’ is followed by the vowel ‘i’, it is palatalized,
becoming ‘y’. Thus, the feminine form would be written
‘ግደይ’, where the character ‘ይ’ ‘y’ corresponds to the
sequence ‘l-i’.
To perform the romanization, we have used our own
Prolog script which maps Amharic characters directly to
sequences of roman consonants and vowels, using the
familiar SERA transliteration scheme. Since the
mapping is reversible, it is straightforward to convert
extracted forms back to Amharic script.
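The idea of a reversible character mapping can be sketched as follows. This Python fragment is not the authors' Prolog script: the table holds only the four characters that occur in the gdel/gdey example above, and the greedy longest-match inversion is an assumption that happens to work for this toy table; a full SERA table would need more careful handling of ambiguous sequences.

```python
# Four characters taken from the gdel / gdey example; a real table would
# cover the whole syllabary.
TO_ROMAN = {"ግ": "g", "ደ": "de", "ል": "l", "ይ": "y"}
TO_GEEZ = {roman: geez for geez, roman in TO_ROMAN.items()}


def romanize(word):
    """Map each Geez character to its romanized consonant-vowel sequence."""
    return "".join(TO_ROMAN[ch] for ch in word)


def to_script(roman):
    """Invert romanize() by greedy longest match over the table."""
    out, i = [], 0
    longest = max(len(key) for key in TO_GEEZ)
    while i < len(roman):
        for size in range(longest, 0, -1):
            chunk = roman[i:i + size]
            if chunk in TO_GEEZ:
                out.append(TO_GEEZ[chunk])
                i += size
                break
        else:
            raise ValueError(f"no mapping for {roman[i:]!r}")
    return "".join(out)


print(romanize("ግደል"))   # gdel
print(to_script("gdey"))  # ግደይ
```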
Figure 6: Stem internal alternation extractor
This predicate works much like the ‘set_affix’
predicate except that it replaces a substring
found in the middle of Stem with another substring
from Valid_Stem. In order to learn stem alternations,
we require a different set of training data showing
examples of stem internal alternations. Figure 7
shows some of the examples used for learning
such rules.
alter([h,e,d],[h,y,e,d]).
alter([m,o,t],[m,e,w,o,t]).
alter([s,a,m],[s,e,?,a,m]).
Figure 7: Examples for internal stem alternation learning
The first example in Figure 7 shows that for the
words hed and hyed to unify, the e in the first
argument should be replaced with ye.
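The search performed by ‘set_internal_alter’ can be sketched in Python as below. This is an illustrative rendering on plain strings, not the authors' Prolog: it enumerates every way of writing the two stems with a shared beginning and a shared ending, and reports the differing middle parts.

```python
def set_internal_alter(stem, valid_stem):
    """Yield (st1, st2) pairs such that stem = p + st1 + y and
    valid_stem = p + st2 + y for a shared prefix p and shared ending y."""
    for i in range(len(stem) + 1):
        p = stem[:i]
        if not valid_stem.startswith(p):
            continue
        x1, x2 = stem[i:], valid_stem[i:]
        for j in range(len(x1) + 1):
            st1, y = x1[:j], x1[j:]
            if x2.endswith(y):
                # Slice by length so that an empty y leaves x2 whole.
                st2 = x2[:len(x2) - len(y)]
                yield st1, st2


pairs = set(set_internal_alter("hed", "hyed"))
print(("e", "ye") in pairs)  # True
```

The substitution of e by ye mentioned above is one of several candidates the search produces (inserting y after h is another); choosing among them is left to the learner.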
Along with the three experiments for learning various
aspects of verb morphology, we have also used two
utility predicates to support the integration of the
learned rules and to include some language-specific
features. These predicates are ‘template’ and ‘feature’:
‘template’: used to extract the valid template for
Stem. The predicate manipulates the stem to
identify positions for the vowels. This predicate
uses the list of vowels (vocal) in the language to
assign ‘0’ for the vowels and ‘1’ for the
consonants.
template([], []).
template([X|T1], [Y|B]) :-
    template(T1, B),
    (vocal(X) -> Y = 0 ; Y = 1).
Training the program on the example set
took around 58 seconds and yielded 108 rules for affix
extraction, 18 rules for root template extraction and 3
rules for internal stem alternation. A
sample rule generated for affix identification and
for associating the word constituents with the grammatical
features is shown in Figure 10.
Figure 8: CV pattern decoding predicate
For the stem seber this predicate tries each
character separately and finally generates the
pattern [1,0,1,0,1] and for the stem sebr, it
generates [1,0,1,1] to show the valid template of
Amharic verbs.
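The CV pattern decoding can be reproduced with a few lines of Python. This is a sketch rather than the authors' predicate, and the vowel inventory is an assumption: the paper does not list its `vocal` facts, so the set below is only meant to cover the romanized examples used here.

```python
# Assumed vowel inventory for the romanized forms; the paper's `vocal`
# facts are not listed, so this set is illustrative.
VOWELS = set("aeiouE")


def template(stem):
    """Return the CV pattern of stem: 0 for a vowel, 1 for a consonant."""
    return [0 if ch in VOWELS else 1 for ch in stem]


print(template("seber"))  # [1, 0, 1, 0, 1]
print(template("sebr"))   # [1, 0, 1, 1]
```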
stem(Word, Stem, [2, 7]) :-
    set_affix(Word, Stem, [y], [], [u], []),
    feature([2, 7], [imperfective, tppn]),
    template(Stem, [1, 0, 1, 1]).
The example in Figure 13 shows that the suffix that needs to
be stripped off is [k,u] and that an alternation
rule changes ‘a’ to ‘?,a’ at the beginning of the
word.
Figure 10: Learned affix identification rule example
InputWord: [t, k, e, f, y, a, l, e, x]
Stem: [k, e, f, l]
Template: [1,0, 1, 1]
Root: [k, f, l]
GrammaticalFeature: [imperfective, spsf*]
The rule in Figure 10 declares that, if the word starts with y
and ends with u, and if the stem extracted from the word
after stripping off the affixes has a CVCC ([1,0,1,1])
pattern, then that word is imperfective with third person
plural neuter subject (tppn).
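Applied to a concrete word, the learned rule of Figure 10 behaves roughly like the Python sketch below. The sketch is an assumption-based illustration: the test word ysebru is constructed here from the stem sebr used elsewhere in the paper, and the vowel set is assumed because the paper does not list it.

```python
VOWELS = set("aeiouE")  # assumed vowel inventory, not given in the paper


def template(stem):
    """CV pattern of stem: 0 for a vowel, 1 for a consonant."""
    return [0 if ch in VOWELS else 1 for ch in stem]


def imperfective_tppn(word):
    """Return (stem, features) if word fits the Figure 10 rule, else None."""
    if len(word) < 3 or not (word.startswith("y") and word.endswith("u")):
        return None
    stem = word[1:-1]                     # strip prefix y and suffix u
    if template(stem) != [1, 0, 1, 1]:    # the stem must be CVCC
        return None
    return stem, ["imperfective", "tppn"]


print(imperfective_tppn("ysebru"))   # ('sebr', ['imperfective', 'tppn'])
print(imperfective_tppn("yseberu"))  # None
```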
Figure 14: Sample Test Result (Internal alternation)
*spsf: second person singular feminine
alter(Stem, Valid_Stem) :-
    set_internal_alter(Stem, Valid_Stem, [o], [e, w, o]).
The example in Figure 14 shows that the prefix and suffix
that need to be stripped off are [t] and [a,l,e,x]
respectively, and that an alternation rule
changes ‘y’ to ‘l’ at the end of the stem after the
suffix is removed.
The system correctly analyzes 1,552 of the 1,784 test words,
resulting in 86.99% accuracy. Given the small set of
training data, the result is encouraging, and we believe
that performance will improve with more
training examples of various grammatical combinations.
The wrong analyses and the test cases that the program
does not handle are attributed to the absence of such
examples in the training set and to an inappropriate
alternation rule that results in multiple analyses of a single
test word.
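The accuracy figure follows directly from the two counts. The short Python check below recomputes it; it suggests that the reported 86.99% corresponds to truncating the percentage to two decimals, since ordinary rounding of the same ratio gives 87.00%.

```python
import math

correct, total = 1552, 1784
accuracy = 100 * correct / total

print(f"{accuracy:.4f}")                 # 86.9955
print(math.floor(accuracy * 100) / 100)  # 86.99 (truncated)
print(round(accuracy, 2))                # 87.0 (rounded)
```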
Figure 11: Learned internal alternation rule example
The rule in Figure 11 substitutes the vowel o,
in specific circumstances (which are included in the
program), with ewo to transform the initial stem into a
valid stem in the language. For example, if the Stem is
zor, then o will be replaced with ewo to give zewor.
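The effect of this substitution on a stem can be sketched in Python. The conditions under which the learned rule fires are not given in the text, so the sketch below simply rewrites the first o it finds; that simplification is an assumption made for illustration.

```python
def apply_alter(stem, old="o", new="ewo"):
    """Replace the first occurrence of old in stem with new, if any."""
    return stem.replace(old, new, 1)


print(apply_alter("zor"))  # zewor
print(apply_alter("mot"))  # mewot
```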
The other part of the program handles formation of
the root of the verb by extracting the template and the
vowel sequence from the stem. A sample rule generated
to handle the task looks like the following:
root(Stem, Root) :-
    root_vocal(Stem, Root, [e, e]),
    template(Stem, [1, 0, 1, 0, 1]).
Test Word
[s,e,m,a,c,h,u]
[s,e,m,a,c,h,u]
[l,e,g,u,m,u]
Figure 12: Learned root-template extraction rule example
The rule in Figure 12 declares that, as long as the consonant-vowel
sequence of a stem is CVCVC and both vowels
are e, the stem is a possible valid verb stem. Our current
implementation does not use a dictionary to validate
whether the verb is an existing word in Amharic.
Finally, we have combined the background predicates
used for the three learning tasks and the utility
predicates. We have also integrated all the rules learned
in each experiment with the background predicates. The
integration involves the combination of the predicates in
the appropriate order: stem analysis followed by internal
stem alternation and root extraction.
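The order of the integrated steps can be illustrated with the analysis reported in Figure 13. The Python sketch below hard-codes one affix rule (suffix ku) and one boundary alternation (word-initial a becomes ?a), both taken from that example; reading the root off as the non-vowel characters is a shortcut assumed here in place of the learned root rules, and the vowel set is likewise assumed.

```python
VOWELS = set("aeiouE")  # assumed vowel inventory


def analyze(word):
    """Toy pipeline for the Figure 13 example: affix stripping,
    alternation, then root and template extraction."""
    if not word.endswith("ku"):
        return None
    stem = word[:-2]                    # 1. strip the suffix ku
    if stem.startswith("a"):
        stem = "?" + stem               # 2. alternation: a -> ?a word-initially
    root = [ch for ch in stem if ch not in VOWELS]       # 3. root consonants
    pattern = [0 if ch in VOWELS else 1 for ch in stem]  # CV template
    return {"stem": list(stem), "template": pattern,
            "root": root, "features": ["perfective", "fpsn"]}


print(analyze("atemku"))
```

For atemku this yields the stem [?,a,t,e,m], the template [1,0,1,0,1] and the root [?,t,m], matching the values listed in Figure 13.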
After building the program, to test the performance of
the system, we started with verbs in their third person
singular masculine form, selected from the list of verbs
transcribed from the appendix of Armbruster (1908)³.
We then inflected the verbs for the eight subjects and
four tense-aspect-mood features of Amharic, resulting in
1,784 distinct verb forms. Figures 13 and 14 show sample
analyses by the program of new verbs that are not part of
the training set.
Test Word          Stem           Root       Feature
[s,e,m,a,c,h,u]    [s,e,m,a,?]    [s,m,?]    perfective, sppn
[s,e,m,a,c,h,u]    [s,e,y,e,m]    [s,y,m]    gerundive, sppn
[l,e,g,u,m,u]      [l,e,g,u,m]    NA         NA
Table 2: Example of wrong analysis
Table 2 shows some of the wrong analyses and some words
that are not analyzed at all. The second example shows
that an alternation rule has been applied to the stem,
resulting in a wrong analysis (the stem should have been
the one in the first example). The last example generated
a stem with the vowel sequence ‘eu’, which is not found
in any of the training examples, so the word falls into the
not-analyzed category.
7. Future work
ILP has proven to be applicable for word formation
rule extraction for languages with simple rules like
English. Our experiment shows that the approach can
also be used for complex languages with more
sophisticated background predicates and more examples.
While Amharic has more prefixes and suffixes for
various morphological features, our system is limited to
subject markers only. Moreover, all possible
combinations of subject and tense-aspect-mood have
been provided in the training examples.
This approach is not practical if all the prefixes and
suffixes are to be included in the learning process.
One of the limitations observed in ILP for
morphology learning is the inability to learn rules from
incomplete examples. In languages such as Amharic,
there is a range of complex interactions among the
InputWord: [a, t, e, m, k, u]
Stem: [?, a, t, e, m]
Template: [1,0, 1, 0, 1]
Root: [?, t, m]
GrammaticalFeature: [perfective, fpsn*]
Figure 13: Sample Test Result (with boundary alternation)
*fpsn: first person singular neuter
Stem
[s,e,m,a,?]
[s,e,y,e,m]
[l,e,g,u,m]
³ Available online at: http://nlp.amharic.org/resources/lexical/word-lists/verbs/ch-armbruster-initia-amharica/ (accessed February 12, 2012).
Generative Approach. Ph.D. thesis, Graduate School
of Texas.
Bratko, I. and King, R. (1994). Applications of Inductive
Logic Programming. SIGART Bull. 5, 1, 43-49.
Dawkins, C. H., (1960). The Fundamentals of Amharic.
Sudan Interior Mission, Addis Ababa, Ethiopia.
De Pauw, G. and P.W. Wagacha. (2007). Bootstrapping
Morphological Analysis of Gĩkũyũ Using Unsupervised Maximum Entropy Learning. Proceedings of the
Eighth INTERSPEECH Conference, Antwerp, Belgium.
Gasser, M. (2011). HornMorpho: a system for morphological processing of Amharic, Oromo, and Tigrinya. Conference on Human Language Technology
for Development, Alexandria, Egypt.
Goldsmith, J. (2001). The unsupervised learning of
natural language morphology. Computational Linguistics, 27: 153-198.
Hammarström, H. and L. Borin. (2011). Unsupervised
learning of morphology. Computational Linguistics,
37(2): 309-350.
Kazakov, D. (2000). Achievements and Prospects of
Learning Word Morphology with ILP, Learning Language in Logic, Lecture Notes in Computer Science.
Kazakov, D. and S. Manandhar. (2001). Unsupervised
learning of word segmentation rules with genetic algorithms and inductive logic programming. Machine
Learning, 43:121–162.
Koskenniemi, K. (1983). Two-level Morphology: a General Computational Model for Word-Form Recognition and Production. Department of General Linguistics, University of Helsinki, Technical Report No. 11.
Manandhar, S. , Džeroski, S. and Erjavec, T. (1998).
Learning multilingual morphology with CLOG. Proceedings of Inductive Logic Programming. 8th International Workshop in Lecture Notes in Artificial Intelligence. Page, David (Eds) pp.135–44. Berlin:
Springer-Verlag.
Mooney, R. J. (2003). Machine Learning. Oxford Handbook of Computational Linguistics, Oxford University Press, pp. 376-394.
Mooney, R. J. and Califf, M.E. (1995). Induction of first-order decision lists: results on learning the past tense
of English verbs, Journal of Artificial Intelligence Research, v.3 n.1, p.1-24.
Oflazer, K., M. McShane, and S. Nirenburg. (2001).
Bootstrapping morphological analyzers by combining
human elicitation and machine learning. Computational Linguistics, 27(1):59–85.
Sieber, G. (2005). Automatic Learning Approaches to
Morphology, University of Tübingen, International
Studies in Computational Linguistics.
Yimam, B. (1995). Yamarigna Sewasiw (Amharic
Grammar). Addis Ababa: EMPDA.
Zdravkova, K., A. Ivanovska, S. Dzeroski and T. Erjavec, (2005). Learning Rules for Morphological
Analysis and Synthesis of Macedonian Nouns. In
Proceedings of SIKDD 2005, Ljubljana.
different morphemes, but we cannot expect every one of
the thousands of morpheme combinations to appear in
the training set. When examples are limited to only
some of the legal morpheme combinations, CLOG is
inadequate because it is not able to use variables as part
of the body of the predicates to be learned.
An example of a rule that could be learned from
partial examples is the following: “if a word has the
prefix 'te', then the word is passive no matter what the
other morphemes are”. This rule (not learned by our
system) is shown in Figure 15.
stem(Word, Stem, Root, GrmFeatu) :-
    set_affix(Word, Stem, [t,e], [], S, []),
    root_vocal(Stem, Root, [e, e]),
    template(Stem, [1, 0, 1, 0, 1]),
    feature(GrmFeatu, [Ten, passive, Sub]).
Figure 15: Possible stem analysis rule with partial feature
That is, S is one of the valid suffixes, Ten is the Tense,
and Sub is the subject, which can take any of the
possible values.
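What such a partially specified rule would return can be sketched in Python, with None standing in for the unbound variables Ten and Sub. This is a hypothetical illustration of the rule in Figure 15, not something the system learns: the stem is taken to be the five characters after the prefix, the vowel set is assumed, and the test word is the form te-deres-ku from Figure 2.

```python
VOWELS = set("aeiouE")  # assumed vowel inventory, not listed in the paper


def passive_rule(word):
    """Return a partial analysis if word is te + CeCeC stem + any suffix."""
    if not word.startswith("te"):
        return None
    stem, suffix = word[2:7], word[7:]
    consonants, vowels = stem[0::2], stem[1::2]
    if len(stem) != 5 or vowels != "ee" or any(c in VOWELS for c in consonants):
        return None
    # Tense and subject are left open (None), like Ten and Sub in Figure 15.
    return {"stem": stem, "root": consonants, "suffix": suffix,
            "features": [None, "passive", None]}


print(passive_rule("tederesku"))
```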
Moreover, as shown in section 2, in Amharic verbs
some grammatical information is expressed by
combinations of affixes. The constraints on the
co-occurrence of affixes are another problem that needs
to be tackled. For example, the 2nd person masculine
singular imperfective suffix aleh can only co-occur with
the 2nd person prefix t, in words like t-sebr-aleh. At the
same time, the same prefix can occur with the suffix
alachu for the 2nd person plural imperfective form. To
represent these constraints, we apparently need explicit
predicates that are specific to the particular affix
relationship. However, CLOG is limited to learning only
the predicates that it has been provided with.
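One way to picture such an explicit affix-relationship predicate is as a table of permitted prefix and suffix pairs. The Python sketch below is hypothetical and lists only the pairings mentioned in this paper (the two above and the feminine form t-gedy-alex from Table 1); it is not a description of Amharic co-occurrence constraints in general.

```python
# Prefix/suffix pairs attested in the text; everything else is unknown here.
ALLOWED = {
    ("t", "aleh"): "2nd person masc. sing. imperfective",
    ("t", "alex"): "2nd person fem. sing. imperfective",
    ("t", "alachu"): "2nd person plural imperfective",
}


def cooccur(prefix, suffix):
    """Return the feature licensed by this affix pair, or None if not listed."""
    return ALLOWED.get((prefix, suffix))


print(cooccur("t", "aleh"))  # 2nd person masc. sing. imperfective
print(cooccur("y", "aleh"))  # None
```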
We are currently experimenting with genetic
programming as a way to learn new predicates based on
the predicates that are learned using CLOG.
8. Conclusion
We have shown in this paper that ILP can be used to
fast-track the process of learning morphological rules of
complex languages like Amharic with a relatively small
number of examples. Our implementation goes beyond
simple affix identification and confronts one of the
challenges in template morphology by learning root-template
extraction as well as stem-internal alternation
rules of the kind exhibited in Amharic and other
Semitic languages. Our implementation also succeeds in
learning to relate grammatical features with word
constituents.
9. References
Armbruster, C. H. (1908). Initia Amharica: an Introduction to Spoken Amharic. Cambridge: Cambridge University Press.
Beesley, K. R. and L. Karttunen. (2003). Finite State
Morphology. Stanford, CA, USA: CSLI Publications.
Bender, M. L. (1968). Amharic Verb Morphology: A