Satellite Workshop
On
Language, Artificial Intelligence
and Computer Science
for Natural Language Processing Applications (LAICS-NLP)
October 19, 2006
Department of Computer Engineering
Faculty of Engineering
Kasetsart University, Bangkok, Thailand.
http://naist.cpe.ku.ac.th/LAICS-NLP/
Organising Chair:
Dr. Bali Ranaivo
Penang, Malaysia.
On behalf of the organising committee, I would like to express a very warm welcome
to all the participants of the Workshop organised within the LAICS-NLP Summer School
2006. For the first time, a Summer School on "Language, Artificial Intelligence and
Computer Science for Natural Language Processing applications (LAICS-NLP)" could take
place in South-East Asia. It was made possible thanks to the initiative of researchers involved
in the STIC-ASIA project "Multilingual Language Processing".
During this one-week Summer School, a one-day workshop has been dedicated to
students, researchers and practitioners from Asia, to give them the opportunity to present
their work. The objective is to make possible meetings and discussions between
participants who are involved directly or indirectly in the fields of computational linguistics
and natural language processing. We can say that it is a success, as the papers submitted and
accepted come from Nepal, India, Bangladesh, Thailand, Malaysia, Singapore and Indonesia.
The variety of the topics to be presented shows that Asian countries are interested in the
development of language engineering systems, and are therefore ready to participate in
international projects.
A one-day workshop is not sufficient given the number of papers submitted. The
organising committee had to choose papers that were original in their content or that reflected
advanced research. A very hard task was to choose the three best papers, submitted by
undergraduate and postgraduate students, whose first presenters received free registration to
attend the LAICS-NLP Summer School. I would like to personally congratulate the authors of
the three best papers, which come from India, Malaysia, and Indonesia. I wish them the best
for their careers as researchers.
This event could not have been possible without the international collaboration between
France, India, Malaysia, and Thailand. I would like to express my gratitude and appreciation
to all members of the organising committee, the authors, and the wonderful NAIST team
(Specialty Research Unit in Natural Language Processing and Intelligent Information System
Technology), which had the hard task of managing most of the administrative matters. Because
most events cannot be realised without valuable sponsorships, I would like to sincerely thank
the French Ministry of Foreign Affairs, the French Embassies in India and Thailand, CNRS (the
French National Center for Scientific Research), INRIA (the French National
Institute for Research in Computer Science and Control), NECTEC (the Thai National
Electronics and Computer Technology Center), and Kasetsart University for their financial
support of, and trust in, this Summer School and workshop.
I really hope that this will not be the first and last time that this kind of event is
organised in Asia. International collaborations are needed if we want to move one step ahead
and be part of the development of the world.
I wish to have another “Monsoon School” next year!
Ranaivo-Malançon Bali
Chairwoman
LAICS-NLP Summer School Workshop
Kasetsart University, Bangkok, Thailand
19 October 2006
Organising Chair:
• Bali Ranaivo, USM, Penang, Malaysia
Program Committee:
• Asanee Kawtrakul, Kasetsart Univ., Bangkok, Thailand
• Claire Gardent, LORIA, Nancy, France
• Eli Murguia, IRIT, Toulouse, France
• Farah Benamara, IRIT, Toulouse, France
• Leila Amgoud, IRIT, Toulouse, France
• Monojit Choudhury, IIT, Kharagpur, India
• Patrick Saint-Dizier, IRIT, Toulouse, France
• Sudeshna Sarkar, IIT, Kharagpur, India
• Tang Enya Kong, USM, Penang, Malaysia
• Thanaruk Theeramunkong, SIIT, Bangkok, Thailand
• Thierry Poibeau, LIPN, Paris, France
• Virach Sornlertlamvanich, TCL, Bangkok, Thailand
Local Organisers:
• Asanee Kawtrakul
• Mukda Suktarachan
• Patcharee Varasrai
• Achara Napachot
• Areerat Thongbai
• Taddao Kleepikul
• Chaikorn Yingsaree
• Vasuthep Khunthong
• Phukao Soraprasert
• Aurawan Imsombut
• Sutee Sudprasert
• Chalathip Thumkanon
• Thana Sukwaree
• Chalermpon Sirigayon
• Chaveevan Pechsiri
• Vee Satayamas
• Rapepun Piriyakul
Main Objectives
Language processing is now a major field in computer science, with large-scale
applications that are still under intense research, such as question answering, machine
translation, automatic summarization, etc., most of them in a multilingual setting. Language
processing technology requires knowledge from a large variety of disciplines: applied
linguistics, computer science and artificial intelligence, ergonomics and the science of
interaction, psychology, etc.
The goal of this summer school is to present, in a short period of time, the foundations and
the most recent advances in the different topics of interest to any language processing
practitioner, with a view to the development of well-targeted applications. This summer school
is more application-oriented than most western summer schools, most notably ESSLLI.
Besides courses, a one-day workshop is organized in the middle of the school, where
groups or individuals attending have the opportunity to present their work. The objective
is to enhance cooperation and to get a better view of what is being done in Asia in
computational linguistics.
Workshop Schedule

8.30 - 8.45    Opening
8.45 - 9.05    Indra Budi: Information Extraction for the Indonesian Language
9.05 - 9.25    Chaveevan Pechsiri, Asanee Kawtrakul: Causality Knowledge Extraction based on Causal Verb Rules
9.25 - 9.45    Asif Ekbal: Named Entity Recognition in Bengali
9.45 - 10.05   Lim Lian Tze, Nur Hussein: Fast Prototyping of a Malay WordNet System
10.05 - 10.25  Aurawan Imsombut, Asanee Kawtrakul: Taxonomic Ontology Learning by using Item List on the Basis of Text Corpora in Thai
10.25 - 10.45  Ong Siou Chin, Narayanan Kulathuramaiyer, Alvin W. Yeo: Discovery of Meaning from Text
10.45 - 11.00  Tea/Coffee Break
11.00 - 11.20  Swati Challa, Shourya Roy, L. Venkata Subramaniam: Analysis of agents from call transcriptions of a car rental process
11.20 - 11.40  Loh Chee Wyai, Alvin W. Yeo, Narayanan K.: Integration Techniques for multimodal Speech and Sketch map-based system
11.40 - 12.00  Vishal Chourasia: Phonological rules of Hindi and Automatic Generation of Pronunciation Dictionary for Speech Recognition
12.00 - 12.20  Ayesha Binte Mosaddeque: Rule based Automated Pronunciation Generator
12.20 - 12.40  Sourish Chaudhuri: Transliteration from Non-Standard Phonetic Bengali to Standard Bengali
12.40 - 13.00  Bal Krishna Bal: The Structure of Nepali Grammar
13.00 - 14.00  Lunch
14.15 - 14.35  Rahul Malik, L. Venkata Subramaniam, Saroj Kaushik: Email Answering Assistant for Contact Centers
14.35 - 14.55  Shen Song, Yu-N Cheah: Extracting Structural Rules for Matching Questions to Answers
14.55 - 15.15  Rapepun Piriyakul, Asanee Kawtrakul: "Who" Question Analysis
15.15 - 15.30  Tea/Coffee Break
15.30 - 15.50  Stephane Bressan, Mirna Adriani, Zainal A. Hasibuan, Bobby Nazief: Mind Your Language: Some Information Retrieval and Natural Language Processing Issues in Development of an Indonesian Digital Library
15.50 - 16.10  Suhaimi Ab. Rahman, Normaziah Abdul Aziz, Abdul Wahab Dahalan: Searching Method for English-Malay Translation Memory Based on Combination and Reusing Word Alignment Information
16.10 - 16.30  Sudip Kumar Naskar: A Phrasal EBMT System for Translating English to Bengali
16.30 - 16.50  Zahrah Abd Ghafur: Prepositions in Malay: Instrumentality
16.50 - 17.10  Patrick Saint-Dizier: "Multilingual Language Processing" (STIC-Asia project)
Contents

Information Extraction for the Indonesian Language ... 1
    INDRA BUDI
Causality Knowledge Extraction based on Causal Verb Rules ... 5
    CHAVEEVAN PECHSIRI, ASANEE KAWTRAKUL
Named Entity Recognition in Bengali ... 9
    ASIF EKBAL
Fast Prototyping of a Malay WordNet System ... 13
    LIM LIAN TZE, NUR HUSSEIN
Taxonomic Ontology Learning by using Item List on the Basis of Text Corpora in Thai ... 17
    AURAWAN IMSOMBUT, ASANEE KAWTRAKUL
Discovery of Meaning from Text ... 21
    ONG SIOU CHIN, NARAYANAN KULATHURAMAIYER, ALVIN W. YEO
Analysis of agents from call transcriptions of a car rental process ... 25
    SWATI CHALLA, SHOURYA ROY, L. VENKATA SUBRAMANIAM
Integration Techniques for multimodal Speech and Sketch map-based system ... 29
    LOH CHEE WYAI, ALVIN W. YEO, NARAYANAN K.
Phonological rules of Hindi and Automatic Generation of Pronunciation Dictionary for Speech Recognition ... 33
    VISHAL CHOURASIA
Rule based Automated Pronunciation Generator ... 37
    AYESHA BINTE MOSADDEQUE
Transliteration from Non-Standard Phonetic Bengali to Standard Bengali ... 41
    SOURISH CHAUDHURI
The Structure of Nepali Grammar ... 45
    BAL KRISHNA BAL
Email Answering Assistant for Contact Centers ... 49
    RAHUL MALIK, L. VENKATA SUBRAMANIAM, SAROJ KAUSHIK
Extracting Structural Rules for Matching Questions to Answers ... 53
    SHEN SONG, YU-N CHEAH
"Who" Question Analysis ... 57
    RAPEPUN PIRIYAKUL, ASANEE KAWTRAKUL
Mind Your Language: Some Information Retrieval and Natural Language Processing Issues in Development of an Indonesian Digital Library ... 61
    STEPHANE BRESSAN, MIRNA ADRIANI, ZAINAL A. HASIBUAN, BOBBY NAZIEF
Searching Method for English-Malay Translation Memory Based on Combination and Reusing Word Alignment Information ... 65
    SUHAIMI AB. RAHMAN, NORMAZIAH ABDUL AZIZ, ABDUL WAHAB DAHALAN
A Phrasal EBMT System for Translating English to Bengali ... 69
    SUDIP KUMAR NASKAR
Prepositions in Malay: Instrumentality ... 73
    ZAHRAH ABD GHAFUR
Information Extraction for the Indonesian Language
Indra Budi
Faculty of Computer Science, University of Indonesia
Kampus UI Depok 16424
Email: indra@cs.ui.ac.id
Abstract

A modern digital library should provide effective integrated access to disparate information sources. The extraction of semi-structured or structured information - for instance in XML format - from free text is therefore one of the great challenges in its realization. The work presented here is part of a wider initiative aiming at the design and development of tools and techniques for an Indonesian digital library. In this paper, we present the blueprint of our research on information extraction for the Indonesian language. We report the first results of our experiments on named entity recognition and co-reference resolution.

1. Introduction

The purpose of information extraction (IE) is to locate and extract specific data and relationships from texts and to represent them in a structured form [7, 8]. XML is a particularly suitable candidate for the target data model thanks to its flexibility in representing data and relationships and to its suitability to modern Internet applications.

Indeed, IE is potentially at the heart of numerous modern applications. For instance, and to name a few, IE can be used in software engineering to generate test cases from use case scenarios; in database design, IE can be used to generate entity-relationship diagrams from analysis cases; in the legal domain, IE can be used to extract patterns from legal proceedings. The list is open-ended; however, for this article and for the sake of simplicity, we choose a domain that can be apprehended by the non-expert: we try to extract information about events that are meetings from news articles. A meeting is an event for which we wish to identify the location (place, city and country), the date (day, month, year) and the list of participants (name, quality and nationality). Fig 1.1 illustrates the expected output corresponding to the following sample text.

Menteri Luar Negeri Inggris Mike O'Brien1 kemarin berada di Jakarta. Dia2 bertemu dengan Megawati Soekarnoputri3 di Istana Negara. Megawati4 adalah wanita pertama yang menjadi presiden di Indonesia.

(British Foreign Office Minister Mike O'Brien was in Jakarta yesterday. He held a meeting with Megawati Soekarnoputri at the State Palace. Megawati is the first woman to become president of Indonesia.)

<meeting>
  <date>05/12/2003</date>
  <location>
    <name>Istana Negara</name>
    <city>Jakarta</city>
    <country>Indonesia</country>
  </location>
  <participants>
    <person>
      <name>Megawati Soekarnoputri</name>
      <quality>Presiden</quality>
      <country>Indonesia</country>
    </person>
    <person>
      <name>Mike O'Brien</name>
      <quality>Menteri Luar Negeri</quality>
      <country>Inggris</country>
    </person>
  </participants>
</meeting>

Fig 1.1 Structured information in XML

The components highlighted in italic in Fig 1.1 require global, ancillary, or external knowledge. We need models and techniques to recognize named entities and their relationships. There are two main approaches to building rules and patterns for the information extraction task, namely knowledge engineering and machine learning [1]. In a knowledge engineering approach, experts handcraft an instance of a generic model and technique. In a machine learning approach, the instance of the model and technique is learned from examples, with or without training and feedback.

Following [5], we consider that the information extraction process requires the following tasks to be completed: named entity recognition, co-reference resolution, template element extraction and scenario template extraction. Named entity recognition (NER) identifies names of and references to persons, locations, dates and organizations in the text, while co-reference resolution resolves references and synonymies. Template element extraction completes the description of each entity by adding, in our example, quality and nationality to persons, for instance. Finally, scenario template extraction associates the different entities, for instance, the different elements composing an event in our example. The extraction tasks usually leverage several features, the most essential of which are linguistic. These include morphology, part of speech of terms, and their classification and associations in thesauri and dictionaries. They also leverage the context in which terms are found, such as neighboring terms and structural elements of the syntactical units (propositions, sentences, and paragraphs, for instance). Clearly, because of the morphological and grammatical differences between languages, the useful and relevant combinations of the above features may differ significantly from one language to another. Techniques developed for the English language need to be adapted to indigenous linguistic peculiarities. It is also possible that entirely new and specific techniques need to be designed.

Our research is concerned with the design and implementation of a suite of information extraction tools and techniques for the Indonesian language, and with the study of the genericity and peculiarities of the task in various domains. We report in this paper our first results in the comparison of knowledge-based and machine-learning-based named entity recognition, as well as a first attempt at co-reference resolution.

2. Named Entity Recognition (NER)

In our running example, the NER task should identify that Mike O'Brien and Megawati Soekarnoputri are persons, that Istana Negara and Jakarta are locations (and possibly that the former is a place and the latter a city), and that Presiden and Menteri Luar Negeri are qualities of persons.

2.1 Approaches

Approaches to named entity recognition (see [1]) can be classified into two families: knowledge engineering approaches and machine learning approaches.

Knowledge engineering approaches are expert-crafted instances of generic models and techniques to recognize named entities in the text. Such approaches are typically rule-based. In a rule-based approach, the expert designs rules to be used by a generic inference engine. The rule syntax allows the expression of grammatical, morphological and contextual patterns. The rules can also include dictionary and thesauri references. For example, the following rule contributes to the recognition of persons.

If a proper noun is preceded by a title then the proper noun is a name of person

We have asked educated native speakers to design rules combining contextual, morphological, and part of speech features that assign classes to terms and groups of terms in the text. They based their work on the analysis of a training corpus.

In machine learning approaches, a generic computer program learns to recognize named entities with or without training and feedback. General machine learning models exist that do not necessitate the mobilization of expensive linguistic expert knowledge and resources.

Using a training corpus, in which terms and groups of terms are annotated with the class they belong to, and a generic association rule mining algorithm, we extracted association rules combining the identified contextual, morphological, and part of speech features. For example, if a sequence of terms <t1, t2> occurs in the training corpus, where f2 is an identified feature of t2 and nc2 is the name class of t2, we obtain a rule of the following form, where support and confidence are computed globally.

<t1, f2> => nc2, (support, confidence)

If the training corpus contains the sentence "Prof. Hasibuan conducted a lecture on information retrieval", in which the term "Hasibuan" is of class person, we produce a rule of the following form.

〈Prof., Capitalized_word(X)〉 => person_named(X)

The rule support and confidence depend on the occurrences of the expression "Prof. X", with X a person or not, in the training corpus. The complete list of rule forms and features used can be found in [2].
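The mining step just described can be made concrete with a short sketch. The following Python fragment is a minimal illustration, not the authors' implementation; the corpus encoding and the function name are assumptions made here, with support counted as an absolute frequency and confidence as the usual ratio.

    from collections import Counter

    def mine_ner_rules(corpus, min_support=5, min_confidence=0.8):
        """Mine <t1, f2> => nc2 rules from an annotated corpus.

        corpus: list of sentences, each a list of
        (token, feature, name_class) triples; name_class is None
        for tokens outside any named entity.
        """
        lhs_counts = Counter()   # occurrences of the pattern <t1, f2>
        rule_counts = Counter()  # co-occurrences of <t1, f2> with nc2
        for sentence in corpus:
            for (t1, _, _), (_, f2, nc2) in zip(sentence, sentence[1:]):
                lhs_counts[(t1, f2)] += 1
                if nc2 is not None:
                    rule_counts[(t1, f2, nc2)] += 1
        rules = []
        for (t1, f2, nc2), support in rule_counts.items():
            confidence = support / lhs_counts[(t1, f2)]
            if support >= min_support and confidence >= min_confidence:
                rules.append(((t1, f2), nc2, support, confidence))
        return rules

With enough sentences in which "Prof." precedes a capitalized word tagged as a person, the miner emits a rule of the form (("Prof.", "Capitalized_word"), "person", support, confidence), mirroring the example above.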
In both approaches, the rules produced are then used for NER. The NER process consists of the following stages: tokenization, feature assignment, rule assignment and name tagging. The left-hand side of a rule is the pattern; the right-hand side is the identified named entity class. The following is an example of an actual rule as encoded in the implemented testbed.

IF Token[i].Kind="WORD" and Token[i].OPOS and
   Token[i+1].Kind="WORD" and Token[i+1].UpperCase and Token[i+1].OOV
THEN Token[i+1].NE = "ORGANIZATION"

The tokenization process identifies tokens (words, punctuation and other units of text such as numbers, etc.) in the input sentence. The feature assignment component labels the tokens with their features: the basic contextual features (for instance identifying prepositions, days, or titles), the morphological features, as well as the part of speech classes. See [3] for a complete list of features and details of the labeling process. The rule assignment component selects the candidate rules for each identified token in the text.

The rules are then applied, and terms and groups of terms are annotated with XML tags. The syntax of the tags follows the recommendations of MUC [9]. The following is the output of the system for the second sentence in our running example.

<ENAMEX TYPE="PERSON">Megawati</ENAMEX> adalah wanita pertama yang menjadi presiden di <ENAMEX TYPE="LOCATION">Indonesia</ENAMEX>.

2.2 Performance Evaluation

We comparatively evaluate the performance of the two approaches with a corpus consisting of 1,258 articles from the online versions of two mainstream Indonesian newspapers, Kompas (kompas.com) and Republika (republika.co.id). The corpus includes 801 names of persons, 1,031 names of organizations, and 297 names of locations.

In order to measure and compare the effectiveness of the two approaches, we use the recall, precision and F-measure metrics as defined in [6] as a reference for the Message Understanding Conference (MUC).
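For concreteness, the following sketch computes the three metrics over sets of entity annotations. It is a simplification assuming exact-match scoring, whereas the MUC scorer [6] also credits partial matches.

    def evaluate(gold, predicted):
        """gold, predicted: sets of (start, end, type) annotations."""
        correct = len(gold & predicted)
        recall = correct / len(gold) if gold else 0.0
        precision = correct / len(predicted) if predicted else 0.0
        f_measure = (2 * recall * precision / (recall + precision)
                     if recall + precision else 0.0)
        return recall, precision, f_measure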
On the one hand, our results confirm and quantify the expected fact that the knowledge engineering approach performs better than the machine learning approach (see Table 2.1). Of course, this comes at the generally high cost of gathering, formalizing and validating expert knowledge. The machine learning approach, on the other hand, yields respectable performance with minimum expert intervention (it only requires the annotation of the training corpus).

Table 2.1. Knowledge engineering versus machine learning method for NER

Method                   Recall    Precision   F-Measure
Knowledge Engineering    63.43%    71.84%      67.37%
Machine Learning         60.16%    58.86%      59.45%

A finer-grained analysis of the results, manually going through the correct, partial, possible, and actual named entities (see [3] for definitions), seems to indicate that the machine learning approach induces more partial recognitions. These are avoided by the knowledge engineering approach, which allows a more effective usage of the variety of features available.

3. Co-reference Resolution

Co-reference resolution attempts to cluster terms or phrases that refer to the same entity (markables) [7, 9]. The terms or phrases are pronouns or entities that have been recognized in a named entity recognition phase.

In our running example, the co-reference resolution process should identify that Dia2 refers to Mike O'Brien1 and that Megawati4 refers to Megawati Soekarnoputri3. In other words, the system should produce two clusters: {Mike O'Brien1, Dia2} and {Megawati Soekarnoputri3, Megawati4}.

3.1 Approaches

Our first attempt at implementing co-reference algorithms aimed at comparing two machine learning methods: an original method based on association rules and a state-of-the-art method based on decision trees. Both methods use the same nine features; see [4] for details.

The state-of-the-art method [10] is based on decision trees and the C4.5 algorithm. Each node of the tree corresponds to a decision about a particular feature, and the leaves are Booleans representing co-reference decisions.

We also devised an original association rule method. We mine association rules that are capable of testing the pairwise equivalence of markables. The association rules are obtained from a training corpus and selected because their support and confidence are above given thresholds. The rules have the form X ⇒ Y, where X represents the features of a pair of markables and Y is a Boolean indicating whether the markables are co-referent or not. Each feature corresponds to an attribute. The association rules therefore have the following form.

<attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8, attr9> ⇒ <isEquiv>

The left-hand side (LHS) of the rule is the list of attribute values for a pair of markables. The right-hand side (RHS) is the variable isEquiv, which is true if the markables are equivalent and false otherwise. Indeed, we also consider negative rules, which indicate that pairs of markables are not co-referent.

Fig 3.1 illustrates the general architecture of the system in the testing phase. It is assumed that pronoun tagging and named entity recognition have been done and that the association rules are readily available from the training phase. For a pair of markables, if several association rules are applicable, the rule with the highest confidence is applied. If the RHS of the rule is true, the markables are marked equivalent, and not equivalent otherwise. After all pairs of markables have been checked, we group the markables that are equivalent. We randomly choose a representative for each class. We can then output a document in which markables are tagged with the representative of their class.

[Fig 3.1. The association rule co-reference resolution system architecture: a document with NE and PRONOUN tags, together with the set of rules from training, is fed to co-reference resolution, followed by grouping of markables and co-reference tagging, yielding a document with co-reference tags.]
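The testing phase just described can be summarized in a short sketch. The encodings of markables, pair features and the rule store are assumptions made for illustration; the grouping step uses a union-find structure, one straightforward way to realize the clustering of equivalent markables.

    def resolve_coreference(markables, rules, pair_features):
        """Cluster markables with mined association rules.

        markables: list of markable identifiers.
        pair_features: function mapping a markable pair to the
        tuple of nine attribute values forming a rule's LHS.
        rules: dict mapping an LHS tuple to a list of
        (is_equiv, confidence) pairs from the training phase.
        """
        parent = {m: m for m in markables}  # union-find forest

        def find(m):
            while parent[m] != m:
                parent[m] = parent[parent[m]]  # path halving
                m = parent[m]
            return m

        for i, a in enumerate(markables):
            for b in markables[i + 1:]:
                candidates = rules.get(pair_features(a, b), [])
                if not candidates:
                    continue
                # If several rules apply, use the most confident one;
                # a negative rule (is_equiv False) keeps the pair apart.
                is_equiv, _ = max(candidates, key=lambda r: r[1])
                if is_equiv:
                    parent[find(a)] = find(b)

        clusters = {}
        for m in markables:
            clusters.setdefault(find(m), []).append(m)
        # Each cluster root serves as the arbitrarily chosen
        # representative of its class, as in the paper.
        return list(clusters.values())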
3.2 Performance Evaluation

We use a corpus of 100 articles from the Indonesian online newspaper Republika (republika.co.id). The articles contain 43,234 words and 5,783 markables, consisting of 3,383 named entities (persons, locations, organizations) and 2,400 pronouns.

In order to measure and compare the effectiveness of the two approaches, we again use the recall, precision and F-measure metrics as defined in [6] as a reference for the Message Understanding Conference (MUC).

Table 3.1 shows the results for both methods. It shows that association rules yield performance comparable to that of the state-of-the-art method based on decision trees. This result is to be put in the perspective of the previous section, in which we saw that the association rule based method performed respectably. This opens the way for a generic association rule based information extraction system.

Table 3.1. Association rules versus decision tree method for co-reference resolution

Method              Recall    Precision   F-Measure
Association Rules   74.38     93.17       82.70
Decision Tree       74.31     93.05       82.60

4. Conclusions and Future Work

We have presented the blueprint of our research project on information extraction for the Indonesian language, together with our preliminary results for the tasks of named entity recognition and co-reference resolution. In particular, we have compared several techniques based on either knowledge engineering or machine learning, with the objective of finding the most economical yet effective and efficient solution to the design of the necessary tools specific to the Indonesian language and to the selected application domain.

While we explore this compromise and look for a generic and adaptive solution for the Indonesian language, we continue developing the next components of a complete information extraction system, starting with ad hoc and state-of-the-art solutions (possibly designed for English or other foreign languages). We also expect to devise and evaluate novel methods based on association rules, which would give us, if effective and efficient, a uniform framework for all information extraction tasks. The reader should note that these methods do not exclude the utilization of global, ancillary, and external knowledge such as gazetteers and the document's temporal and geographical context.

References:

[1] Appelt, Douglas E. and Israel, David J., "Introduction to Information Extraction Technology", tutorial at IJCAI-99.
[2] Budi, I. and Bressan, S., "Association Rules Mining for Named Entity Recognition", in Proceedings of the WISE Conference, Rome, 2003.
[3] Budi, I. et al., "Named Entity Recognition for the Indonesian Language: Combining Contextual, Morphological and Part-of-Speech Features into a Knowledge Engineering Approach", in Proceedings of the 8th International Conference on Discovery Science, Singapore, October 2005.
[4] Budi, I., Nasrullah and Bressan, S., "Co-reference Resolution for the Indonesian Language Using Association Rules", submitted to IIWAS 2006.
[5] Cunningham, Hamish, "Information Extraction - a User Guide (Second Edition)", http://www.dcs.shef.ac.uk/~hamish/IE/userguide/, accessed 5th March 2003.
[6] Douthat, A., "The Message Understanding Conference Scoring Software User's Manual", in Proceedings of the 7th Message Understanding Conference (MUC-7), 1998.
[7] Grishman, Ralph, "Information Extraction: Techniques and Challenges", Lecture Notes in Computer Science, Vol. 1299, Springer-Verlag, 1997.
[8] Huttunen, Silja, Yangarber, Roman, and Grishman, Ralph, "Diversity of Scenarios in Information Extraction", in Proceedings of the Third International Conference on Language Resources and Evaluation, Las Palmas, Canary Islands, Spain, 2002.
[9] MUC, "MUC-7 Co-reference Task Definition", in Proceedings of the Seventh Message Understanding Conference, 1998.
[10] Soon, W.M., Yong Lim, D.W., and Ng, H.T., "A Machine Learning Approach to Co-reference Resolution of Noun Phrases", Computational Linguistics, Vol. 27 (2001), 521-544.
Causality Knowledge Extraction based on Causal Verb Rules
Chaveevan Pechsiri, Asanee Kawtrakul
Department of Computer Engineering,
Kasetsart University
Phaholyothin Rd., Bangkok, Thailand 10900
Tel.: +662-942-8555, Fax: +622-579-0358
E-mail: itdpu@hotmail.com, ak@ku.ac.th
Abstract

The aim of this paper is to automatically extract causality knowledge from documents, as a contribution to the knowledge sources of question-answering systems. The paper is concerned only with extracting causality knowledge from a single sentence or EDU (Elementary Discourse Unit), facing the two problems of causality identification and zero anaphora. We propose the use of causal verb rules mined by ID3 from a specified sentence pattern to extract the causality knowledge of single EDUs. Our intra-causal EDU extraction model shows 0.87 precision and 0.73 recall.

1 Introduction

Causality knowledge extraction from textual data is an important task for gaining useful expressions of Know-Why for question answering systems. There are various forms of causality or cause-effect expression, such as the intra-NP, inter-NP, and inter-sentence forms [Chang and Choi, 2004]. In our research, we separate causality knowledge into two groups based on the elementary discourse unit (EDU) as defined by [Carlson et al., 2003]. Our EDU is often expressed as a simple sentence or clause. These EDUs form the causality relation, which is expressed in two forms: the intra-causal EDU and the inter-causal EDU. We define the intra-causal EDU as an expression within one simple EDU, with or without an embedded EDU; this is equivalent to the intra-NP or inter-NP forms of [Chang and Choi, 2004]. The inter-causal EDU is defined as an expression spanning more than one simple EDU; hence it is equivalent to the inter-sentence form of [Chang and Choi, 2004]. This paper is part of our research on causality knowledge extraction and works only on intra-causal EDU extraction.

Several techniques [Marcu, 1997; Girju and Moldovan, 2002; Girju, 2003; Inui et al., 2004; Chang and Choi, 2004] have been used for extracting cause-effect information, varying in scope from one sentence to two adjacent sentences. Recent research [Girju, 2003] uses the causal verbs from the lexico-syntactic pattern 'NP1 cause-verb NP2' to identify causal questions, but this has a problem with verb ambiguity, which is solved by a learning technique. Later, Chang and Choi [2004] used the NP pair from this lexico-syntactic pattern to extract causality from one sentence. In our work, we aim to extract intra-causal EDUs from Thai documents by using causal verb rules mined from the specified sentence pattern "NP1 Verb NP2 Preposition NP3", where only NP2 can have a null value. The reason we use this specified pattern is that, in our corpus behavior study, about 50% of the intra-causal EDUs occur within this pattern.

Thai has specific characteristics, such as zero anaphora and nominal anaphora. These characteristics are involved in the two main problems of causality extraction: the first is how to identify the interesting causality in documents, and the second is zero anaphora. Given these problems, we need to develop a framework which combines NLP techniques to form the EDUs for mining the specified sentence pattern.

In conclusion, unlike other methods whose emphasis is on the lexico-syntactic pattern [Girju and Moldovan, 2002; Chang and Choi, 2004], our research uses causal verb rules based on the specified sentence pattern to identify causality for intra-causal EDU extraction.
The paper is organized into six sections. In section 2, related work is summarized. Problems in causality mining from Thai documents are described in section 3, and section 4 presents our framework for causality extraction. In section 5, we evaluate our proposed model, and section 6 concludes.

2 Related Work

Girju's work [Girju and Moldovan, 2002] consists in finding patterns to be used to extract causal relations from a learning corpus, where the aim of extracting causal relations from the inter-noun phrase is to aid question identification, as in [Girju, 2003]. In their research [Girju and Moldovan, 2002], causal verbs were observed through the pattern <NP1 verb NP2> in documents, where the NP pairs were causative relationships referenced by WordNet. The causal verbs were used to extract the causal relation with a particular kind of NP, e.g. a phenomenon NP, as in "An earthquake generates a tsunami". The problem in this research was that the causal verb can be ambiguous; e.g. "Some fungi produce the Alfa toxin in peanut" is a causal sentence, while "The Century Fox produces movies" is a non-causal sentence. Girju and Moldovan [2002] solved the problem by using a C4.5 decision tree to learn from an annotated corpus with syntactic and semantic constraints. The precision of their causality extraction was 73.91%. However, some of the causal verbs in our intra-causal EDUs are expressed as a general verb followed by a preposition, such as 'เปน..จาก/be..from', 'ไดรับ..จาก/get..from', etc., which the lexico-syntactic patterns cannot cover; for example: "ใบเปนแผลจากเชื้อรา /the leaf has scars from fungi".

Chang and Choi [2004]'s work aimed to extract only causal relations between two events expressed by a lexical pair of NPs and a cue phrase, with the problem of causality identification; a Naive Bayes classifier was used to solve this problem. They defined the cue phrase used in their work as "a word, a phrase, or a word pattern which connects one event to the other with some relation", e.g. "caused by", "because", "as the result of", "thus", etc. Their lexical pair was a pair of causative noun phrase and effective noun phrase that must occur explicitly within one sentence. They obtained 81% precision for causality extraction. However, more than two NPs are contained in our intra-causal EDUs, and this extra NP cannot be held as part of a cue phrase. Hence, we aim at learning causal verb rules from the specified sentence pattern by using ID3, with all five features from the sentence pattern, to extract intra-causal EDU expressions.

3 Problems in Causality Extraction

To extract cause-effect expressions, two main problems must be solved: identifying interesting cause-effect events in Thai documents, and resolving implicit noun phrases.

3.1 Causality identification

As in many languages, identifying causality expressions in Thai relies on an explicit cue phrase [Chang and Choi, 2004] to connect cause and effect expressions. In order to avoid unnecessary whole-text analysis, the causal verb, which is the linking verb between the causative NP and the effective NP, is used to indicate the cause-effect expression. Although the causal verb is used to identify whether an expression is causal or non-causal, we still have the problem of causal verb ambiguity. For example:

Causality:
a. "ใบพืช/Plant leaf มี/has จุดสีน้ำตาล/brown spots จาก/from เชื้อรา/fungi"
b. "คนไข/The patient ตาย/dies ดวย/with โรคมะเร็ง/cancer"

Non-causality:
c. "ใบพืช/Plant leaf มี/has จุดสีน้ำตาล/brown spots จาก/from โคนใบ/the leaf base"
d. "คนไข/The patient ตาย/dies ดวย/with ความสงสัย/suspicion"

This problem of causal verb ambiguity can be solved by learning the EDUs of the specified sentence pattern with ID3. The result of the ID3 learning is a set of causal verb rules, which need to be verified before being used to identify intra-causal EDUs.

3.2 Zero anaphora or implicit noun phrases

A noun phrase in an intra-causal EDU may be implicit, as in zero anaphora. For example:

"โรคไขหวัดนก /The bird flu disease เปน /is โรคที่สําคัญโรคหนื่ง /an important disease. Φ เกิด /occurs จาก /from ไวรัส H1N5/the H1N5 virus."

where Φ is the zero anaphora = the bird flu disease. This problem can be solved by using the heuristic rule that the subject of the preceding EDU fills the elided noun phrase.
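The heuristic can be stated compactly in code. The sketch below is illustrative only; the dictionary encoding of an EDU is an assumption made here.

    def resolve_zero_anaphora(edus):
        """Fill an EDU's missing subject with the subject of the
        nearest preceding EDU (the heuristic of Section 3.2)."""
        last_subject = None
        for edu in edus:  # each edu is a dict, e.g. {"subject": ...}
            if edu.get("subject") is None:
                edu["subject"] = last_subject  # e.g. "bird flu disease"
            else:
                last_subject = edu["subject"]
        return edus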
4 A Framework for Causality Extraction

There are three steps in our framework: a corpus preparation step, followed by the causality learning and causality extraction steps, as shown in Figure 1.

[Figure 1. The framework for causality extraction: text passes through corpus preparation (supported by WordNet) into causality learning, which produces the causal verb rules used by causality extraction to build the causality relation knowledge base.]

4.1 Corpus Preparation

This step prepares the corpus in the form of EDUs from text. It involves using Thai word segmentation tools to determine Thai word boundaries and tag parts of speech [Sudprasert and Kawtrakul, 2003], together with named entity recognition [Chanlekha and Kawtrakul, 2004] and word-formation recognition [Pengphon et al., 2002] to determine the boundaries of Thai named entities and noun phrases. After word segmentation is achieved, EDU segmentation is dealt with. Following Charoensuk et al. [2005], EDU segmentation is performed and the result is kept as an EDU corpus for the next step of learning.

4.2 Causality learning

Two processes are involved in this step: feature annotation for learning, and rule mining with verification.

4.2.1 Feature annotation for learning

Due to the problems in intra-causal EDU identification, the causal verb is used as a feature in this process to extract causality. Because some causal verbs are ambiguous, we have to learn this causal verb feature along with the other four features, NP1, NP2, Preposition, and NP3, from the specified sentence pattern <NP1 Verb NP2 Preposition NP3>. We manually annotate these five features, specifying "causality/non-causality", together with their concepts from WordNet after Thai-to-English translation, to solve word-sense ambiguity and the variety of surface forms of words with the same concept. If an NP has a modifier, only the head noun is assigned the concept. And if the NP denotes a "symptom", e.g. 'ใบเหลือง/yellow leaf', 'จุดสีน้ำตาล/brown spot', etc., we assign its concept as 'symptom'. The annotation of an intra-causal EDU is shown in the following example:

<EDU><NP1 concept=plant_organ>ใบพืช</NP1><Verb concept=have>มี</Verb><NP2 concept=symptom>จุดสีน้ำตาล</NP2><Preposition>จาก</Preposition><NP3 concept=fungi>เชื้อรา</NP3><causality></EDU>

4.2.2 Rule mining

This step mines the causal verb rules from the annotated corpus of intra-causal/non-causal EDUs by using the ID3 implementation from Weka (http://www.cs.waikato.ac.nz/ml/weka/). From this mining step, 30 causal verb rules were obtained from 330 EDUs of the specified sentence pattern, as shown in Table 1.

4.2.3 Verifying

This step verifies the rules before they are used to identify causal EDUs. Some rules having the same general concept can be combined into one rule, as in the following example:

R1: IF <NP1=*> ^ <Verb=be> ^ <NP2=*> ^ <Prep=จาก/from> ^ <NP3=fungi> then causality
R2: IF <NP1=*> ^ <Verb=be> ^ <NP2=*> ^ <Prep=จาก/from> ^ <NP3=bacteria> then causality
R3: IF <NP1=*> ^ <Verb=be> ^ <NP2=*> ^ <Prep=จาก/from> ^ <NP3=pathogen> then causality

The R3 rule is the general-concept rule of R1 and R2. After this step, we have 25 rules to be verified. The test corpus, from the agricultural and health news domains, consists of 2000 EDUs and contains 102 EDUs of the specified sentence pattern, of which only 87 EDUs are causality.
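To illustrate how the verified rules might be applied, the sketch below matches an annotated EDU against rules, walking up a toy concept hierarchy so that a general rule such as R3 also covers fungi and bacteria. The encodings and the tiny hierarchy are assumptions made for illustration, not the authors' implementation.

    HYPERNYMS = {"fungi": "pathogen", "bacteria": "pathogen"}  # toy hierarchy

    def generalize(concept):
        """Return the concept and its ancestors, as when R1 and R2
        are subsumed by the more general rule R3."""
        chain = [concept]
        while concept in HYPERNYMS:
            concept = HYPERNYMS[concept]
            chain.append(concept)
        return chain

    def is_causal(edu, rules):
        """edu: dict of the five features, e.g. {"NP1": "plant organ",
        "Verb": "be", "NP2": "symptom", "Prep": "from", "NP3": "fungi"};
        rules: list of dicts with "*" as a wildcard value."""
        slots = ("NP1", "Verb", "NP2", "Prep", "NP3")
        return any(all(rule[s] == "*" or rule[s] in generalize(edu[s])
                       for s in slots)
                   for rule in rules)

    r3 = {"NP1": "*", "Verb": "be", "NP2": "*", "Prep": "from",
          "NP3": "pathogen"}
    edu = {"NP1": "plant organ", "Verb": "be", "NP2": "symptom",
           "Prep": "from", "NP3": "fungi"}
    print(is_causal(edu, [r3]))  # True: fungi generalizes to pathogen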
5 Evaluation

In this research, we have used documents containing 6000 EDUs from the agricultural and health news domains to extract causal relations. We divided this corpus into two parts: one part for learning to determine the intra-causal EDUs, and the other part for evaluating the performance of the causality extraction, with the following precision and recall, where R is the causality relation:

Recall = (# of samples correctly extracted as R) / (# of all samples holding the target relation R)

Precision = (# of samples correctly extracted as R) / (# of all samples extracted as being R)

The precision and recall results are evaluated by a human. The precision of the extracted causality of the specified sentence pattern is 87%, while the recall is 73%.

6 Conclusion

We have proposed the use of causal verb rules, mined by ID3 from the specified sentence pattern, to extract intra-causal EDUs, obtaining 0.87 precision and 0.73 recall. Our model will be very beneficial for causal question answering and for causal generalization in knowledge discovery.

Acknowledgement

The work described in this paper has been supported by the grant of NECTEC No. NT-B-22-14-12-46-06 and partially supported by a grant from FAO.

Table 1. Causal verb rules, where * means 'any'

Causal verb rule: IF <NP1=*> ^ <Verb=be> ^ <NP2=*> ^ <Prep=จาก/from> ^ <NP3=pathogen> then causality
Example: พืช/Plant เปน/is โรค/disease จาก/from ไวรัส/virus (The plant gets a disease from a virus)

Causal verb rule: IF <NP1=*> ^ <Verb=have> ^ <NP2=*> ^ <Prep=จาก/from> ^ <NP3=insect> then causality
Example: ใบ/Leaf มี/has ตําหนิ/defect จาก/from เพลี้ย/aphid (The leaf has a defect from an aphid)

Causal verb rule: IF <NP1=*> ^ <Verb=have> ^ <NP2=*> ^ <Prep=จาก/from> ^ <NP3=toxicant food> then causality
Example: ผูปวย/Patient มี/has อาการทองเสีย/diarrhea symptom จาก/from อาหารเปนพิษ/food poisoning (The patient has diarrhea symptoms from food poisoning)

Causal verb rule: IF <NP1=*> ^ <Verb=occur> ^ <NP2=*> ^ <Prep=จาก/from> ^ <NP3=*> then causality
Example: โรค/Disease เกิด/occurs จาก/from แบคทีเรีย/bacteria (The disease is caused by bacteria)

Causal verb rule: IF <NP1> ^ <ติดเชื้อ/infect> ^ <NP2=*> ^ <Prep=จาก/from> ^ <NP3=contact> then causality
Example: เด็ก/Kid ติดเชื้อ/is infected จาก/from การสัมผัส/contact (The kid is infected through contact)

References

1. Daniel Marcu. 1997. The Rhetorical Parsing of Natural Language Texts. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL'97/EACL'97), Madrid, Spain.
2. Du-Seong Chang and Key-Sun Choi. 2004. Causal Relation Extraction Using Cue Phrase and Lexical Pair Probabilities. IJCNLP 2004, Hainan Island, China.
3. George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine Miller. 1993. Introduction to WordNet: An On-line Lexical Database.
4. Hutchatai Chanlekha and Asanee Kawtrakul. 2004. Thai Named Entity Extraction by Incorporating Maximum Entropy Model with Simple Heuristic Information. IJCNLP 2004, Hainan Island, China.
5. Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. 1995. Centering: A Framework for Modelling the Local Coherence of Discourse. Computational Linguistics 21(2), June 1995, pp. 203-225.
6. Jirawan Charoensuk, Tana Sukvakree and Asanee Kawtrakul. 2005. Elementary Discourse Unit Segmentation for Thai Using Discourse Cues and Syntactic Information. NCSEC 2005, Thailand.
7. Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2003. Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory. In Current Directions in Discourse and Dialogue.
8. Marilyn A. Walker, Aravind K. Joshi, and Ellen F. Prince. 1998. Centering in Naturally Occurring Discourse: An Overview. In Centering Theory in Discourse, Oxford: Clarendon Press.
9. Nattakan Pengphon, Asanee Kawtrakul and Mukda Suktarachan. 2002. Word Formation Approach to Noun Phrase Analysis for Thai. SNLP 2002, Thailand.
10. Roxana Girju and Dan Moldovan. 2002. Mining Answers for Question Answering. In Proceedings of the AAAI Symposium on Mining Answers from Texts and Knowledge Bases.
11. Sutee Sudprasert and Asanee Kawtrakul. 2003. Thai Word Segmentation Based on Global and Local Unsupervised Learning. NCSEC 2003, Chonburi, Thailand.
12. Takashi Inui, K. Inui and Y. Matsumoto. 2004. Acquiring Causal Knowledge from Text Using Connective Markers. Journal of the Information Processing Society of Japan, 45(3).
Named Entity Recognition in Bengali
Asif Ekbal
Department of Computer Science and Engineering
Jadavpur University, Kolkata, India
Email: ekbal_asif12@yahoo.co.in/asif.ekbal@gmail.com
Abstract

A tagged Bengali news corpus, developed from the web, has been used in this work for the recognition of named entities (NEs) in the Bengali language. A supervised learning method has been adopted to develop two different models of a Named Entity Recognition (NER) system, one (Model A) without using any linguistic features and the other (Model B) incorporating linguistic features. The different tags in the news corpus help to identify the seed data. The training corpus is initially tagged against the different seed data, and a lexical contextual seed pattern is generated for each tag. The entire training corpus is shallow parsed to identify the occurrences of these initial seed patterns. At a position where the context or a part of a seed pattern matches, the systems predict the boundary of a named entity, and further patterns are generated through bootstrapping. Patterns that occur in the entire training corpus above a certain threshold frequency are considered the final set of patterns learnt from the training corpus. The test corpus is shallow parsed to identify the occurrences of these patterns and estimate the named entities. The models have been tested with two news documents (Gold Standard Test Sets) and their results have been compared in terms of evaluation parameters.

1. Introduction

Named Entity Recognition (NER) is an important tool in almost all Natural Language Processing (NLP) application areas. NER's main role is to identify expressions such as the names of people, locations and organizations, as well as date, time and monetary expressions. Such expressions are hard to analyze using traditional NLP because they belong to the open class of expressions, i.e., there is an infinite variety and new expressions are constantly being invented.

The problem of correct identification of NEs is specifically addressed and benchmarked by the developers of Information Extraction systems, such as the GATE system [1] and the multipurpose MUSE system [2]. Morphological and contextual clues for identifying NEs in English, Greek, Hindi, Romanian and Turkish have been reported in [3]. The shared task of CoNLL-2003 [4] was concerned with language-independent NER. An unsupervised learning algorithm for automatic discovery of NEs in a resource-free language has been presented in [5]. A framework to handle the NER task for long NEs with many labels has been described in [6]. For learning generalized names in text, an algorithm, NOMEN, has been presented in [7]. NOMEN uses a novel form of bootstrapping to grow sets of textual instances and their contextual patterns. A joint inference model has been presented in [8] to improve Chinese name tagging by incorporating feedback from subsequent stages in an information extraction pipeline: name structure parsing, cross-document co-reference, semantic relation extraction and event extraction. It has been shown in [9] that a simple two-stage approach to handling non-local dependencies in NER can outperform existing approaches that handle non-local dependencies, while being much more computationally efficient. But in Indian languages, no work in this area has been carried out as yet.

The rest of the paper is organized as follows. Section 2 deals with the NER task in Bengali. Section 3 shows the evaluation techniques and results. Finally, the conclusion is drawn in Section 4.

2. Named Entity Recognition in Bengali

Bengali is the fifth most widely used language in the world, second in India, and the national language of Bangladesh. NER in Indian languages (ILs) in general, and in Bengali in particular, is difficult and challenging. In English, an NE always appears with capitalized letters, but there is no concept of capitalization in Bengali. In the present work, a supervised learning system based on pattern-directed shallow parsing has been used to identify named entities in Bengali using a tagged Bengali news corpus. The corpus has been developed from a widely read Bengali newspaper available on the web, and at present it contains around 34 million wordforms. The location, reporter, agency and date tags in the tagged corpus help to identify the location, person, organization and miscellaneous names respectively, and these serve as the seed data of the systems. In addition to these, the most frequent NEs are collected from the different domains of the newspaper and used as seed data. The systems have been trained on a
part of the developed corpus. The training corpus is partially tagged with elements from the seed lists, which serve as the gazetteer. The initial contextual lexical seed patterns that are learnt using the seed data, and that constitute a partial named entity grammar, identify the external evidences of NEs in the training corpus. These evidences are used to shallow parse the training corpus and estimate possible NEs, which are manually checked. These NEs in turn help to identify further patterns. The training document is thus partially segmented into NEs and their context patterns. The context patterns that appear in the training document above a certain threshold frequency are retained and are expected to be applicable to the test documents as well, in line with the maximum likelihood estimate. Initially, the NER system was developed using only the lexical contextual patterns learned from the training corpus (the NER system without linguistic features, i.e. Model A); linguistic features were then used along with the same set of lexical contextual patterns to develop the NER system with linguistic features, i.e. Model B. The performance of the two systems has been compared using three evaluation parameters, namely Recall, Precision and F-Score.

2.1. Tagging with Seed Lists and Clue Words

The tagger places the left and right tags around each occurrence of the named entities of the seed lists in the corpus, for example: <person> sonia Gandhi </person>, <loc> kolkata </loc> and <org> jadavpur viswavidyalya </org> (the corpus itself is in Bengali script; transliterations are shown here).

For Model A, the training corpus is tagged only with the help of the different seed lists. In the case of Model B, after tagging the entire training corpus with the named entities from the seed lists, the algorithm starts tagging with the help of different internal and external evidences that help to identify different NEs. It uses clue words such as surnames (e.g., mitra, dutta), middle names (e.g., Chandra, nath), prefix words (e.g., sriman, sree, srimati) and suffix words (e.g., -babu, -da, -di) for person names. A list of common words (e.g., neta, sangsad, kheloar) that often indicate the presence of person names has also been kept. The algorithm considers the different affixes (e.g., -land, -pur, -lia) that may occur with location names, and several clue words that are helpful in detecting organization names (e.g., kong, limited). The tagging algorithm also uses a list of words (e.g., kabita, kar, dhar) that may appear as part of a named entity as well as being common words. These clue words are kept in order to tag more and more NEs during the training of the system. As a result, more potential patterns are generated in the lexical pattern generation phase.
2.2. Lexical Seed Pattern Generation from the Training Corpus

For each tag T inserted in the training corpus, the algorithm generates a lexical pattern p using a context window of maximum width 4 (excluding the tagged NE) around the left and right tags, i.e., p = [l-2 l-1 <T> l+1 l+2], where the l±i are the context of p. Any of the l±i may be a punctuation symbol; in such cases, the width of the lexical pattern varies.

The lexical patterns are generalized by replacing the tagged elements by their tags. These generalized patterns form the set of potential seed patterns, denoted by P. The patterns are stored in a Seed Pattern table with four fields: pattern id (identifying the pattern), pattern, type (Person name / Location name / Organization name / Miscellaneous name) and frequency (the number of times the pattern appears in the entire training corpus).
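A compact sketch of this pattern generation step might look as follows; the token and NE-span encodings are assumptions made here for illustration.

    def seed_patterns(tokens, ne_spans, tag, width=2):
        """Generate patterns p = [l-2 l-1 <T> l+1 l+2] around each
        tagged NE, generalizing the NE itself to its tag <T>.

        tokens: list of word/punctuation tokens;
        ne_spans: list of (start, end) token index pairs of NEs;
        tag: the NE tag, e.g. "<person>".
        """
        patterns = []
        for start, end in ne_spans:
            left = tokens[max(0, start - width):start]
            right = tokens[end:end + width]
            # Near the start or end of a document the window is
            # narrower, so patterns of varying width arise.
            patterns.append(tuple(left) + (tag,) + tuple(right))
        return patterns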
2.3. Generation of New Patterns through Bootstrapping

Every pattern p in the set P is matched against the entire training corpus. At a place where the context of p matches, the system predicts where one boundary of a name in the text would occur. The system considers all possible noun, verb and adjective inflections during matching. At present, there are 214 different verb inflections and 27 noun inflections in the systems. During pattern checking, the maximum length of a named entity is considered to be six words. Each named entity so obtained in the training corpus is manually checked for correctness. The training corpus is further tagged with these newly acquired named entities to identify further lexical patterns. Bootstrapping is applied to the training corpus until no new patterns can be generated. The patterns are added to the pattern set P, with the 'type' and 'frequency' fields set properly, if they are not already in the pattern set P with the same 'type'. Any particular pattern in the set of potential patterns P may occur many times, but with different 'type' and with equal or different 'frequency' values. For each pattern of the set P, the probabilities of its occurrence as Person, Location, Organization and Miscellaneous names are calculated.

For candidate pattern acquisition under each type, a particular threshold value of probability is chosen. All the acquired patterns form the set of accepted patterns, denoted Accept Pattern.

Any particular pattern may appear more than once with different types in the Accept Pattern set. So, while testing the NER systems, some identified NEs may be assigned more than one named entity category (type). Model A cannot cope with this NE-classification disambiguation problem at present. Model B uses different linguistic features, as identified in Section 3.1, to deal with this NE-classification disambiguation problem.
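The overall bootstrapping loop can be sketched as follows. This is illustrative only: match_patterns and extract_patterns stand in for the matching and pattern generation steps described above, and the manual checking of newly found NEs is omitted.

    def bootstrap(corpus, seed_patterns, match_patterns, extract_patterns,
                  min_freq=2):
        """Alternate between finding NEs with the current patterns and
        inducing new patterns around those NEs, until a fixpoint."""
        accepted = set(seed_patterns)
        while True:
            # Shallow-parse the corpus with the current patterns to
            # predict named entity boundaries.
            new_nes = match_patterns(corpus, accepted)
            # Generate candidate patterns around the newly tagged NEs;
            # extract_patterns returns a dict mapping pattern -> frequency.
            candidates = extract_patterns(corpus, new_nes)
            fresh = {p for p, freq in candidates.items()
                     if freq >= min_freq and p not in accepted}
            if not fresh:
                return accepted  # no new patterns: fixpoint reached
            accepted |= fresh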
have been ordered in order to make them available for
inclusion in the training set. The two different test sets can
be ordered in 2 different ways. Out of these 2 different
combinations, a particular combination has been
considered in the present work. It may be interesting to
consider the other combination and observe whether the
results vary.
Each pattern of the Accept Pattern set is matched
against the first Test corpus (Test Set1) according to the
pattern matching process described in Section 2 and the
identified NEs are stored in the appropriate NE category
tables. Any particular pattern of the Accept Pattern set
may assign more than one NE categories to any identified
NE of the test set. This is known as the NE-Classification
disambiguation problem. The identified NEs, assigned
more than one NE categories, should be further verified
for the correct classification. Model A cannot cope with
this situation and always assigns the highest probable NE
category to the identified NE. One the other hand different
linguistic patterns, used as the clue words for the
identification of different types of NEs (in Section 2), are
used in order to assign the actual categories (NE type) to
the identified NEs in Model B. Once the actual category
of a particular NE is explored, it is removed from the
other NE category tables.
The same procedures described in Section 2 are
performed for this test set (Test Set1) in order to include it
in to the training set. Now, the resultant Accept Pattern is
formed by taking the union of the initial Accept Pattern
set and the Accept Pattern set of this test corpus. This
resultant Accept Pattern set is used in evaluating the NER
models with the next test corpus (Test Set 2 in the order)..
So in each run, some new patterns may be added to the set
of Accept Pattern. As a result, the performance of the
NER systems (models) gradually improves since all the
test sets have been collected from a particular news topic.
The Bengali date, Day and English date can be recognized
from the different date tags in the corpus. Some person
names, location names and organization names can be
identified from the reporter, location and agency tags in
the corpus.
3. Evaluation and Results
The set of accepted patterns is applied on a test set. The
process of pattern matching can be considered as a
shallow parsing process.
3.1. Training and Test Set
A supervised learning method has been followed to develop the two different models of the NER system. The systems have been trained on a portion of the tagged Bengali news corpus. Some statistics of the training corpus are as follows: total number of news documents = 1819; total number of sentences in the corpus = 44432; average number of sentences in a document = 25; total number of wordforms in the corpus = 541171; average number of wordforms in a document = 298; total number of distinct wordforms in the corpus = 45626.
This training set is initially tagged against the different seed lists used in the system, and the lexical pattern generation, pattern matching, new pattern generation and candidate pattern acquisition procedures are performed sequentially.
Two manually tagged test sets (Gold Test Sets) have been
used to evaluate the models of the NER system. Each test
corpus has been collected from a particular news topic
(i.e. international, national or business).
3.2. Evaluation Parameters
The models have been evaluated in terms of Recall, Precision and F-Score, defined as follows:
Recall (R) = (no. of tagged NEs / total no. of NEs present in the corpus) × 100%
Precision (P) = (no. of correctly tagged NEs / no. of tagged NEs) × 100%
F-Score (FS) = (2 × R × P) / (R + P)
The three evaluation parameters are computed for each individual NE category, i.e. for person names, location names, organization names and miscellaneous names.
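For concreteness, the three parameters can be computed from raw counts as in the following short Python sketch (an illustration of the definitions above, not the authors' evaluation code):

def ner_scores(n_tagged, n_correct, n_total):
    # Recall and Precision in %, F-Score as their harmonic mean
    recall = n_tagged / n_total * 100.0
    precision = n_correct / n_tagged * 100.0
    f_score = 2.0 * recall * precision / (recall + precision)
    return recall, precision, f_score

# e.g. a category with R = 69.1% and P = 73.2% yields FS ≈ 71.09,
# matching the Model A person-name row of Table 1.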
3.3. Evaluation Method
The actual number of each type of NE present in each test corpus (Gold Test Set) is known in advance and noted. A test corpus may also be used in generating new patterns, i.e. it may be utilized in training the models after the models have been evaluated on it. The test sets have been ordered to make them available for inclusion in the training set. The two test sets can be ordered in two different ways; of these two combinations, one particular combination has been considered in the present work. It may be interesting to consider the other combination and observe whether the results vary.
3.4. Results and Discussion
The performance of the systems on the two news documents (test sets), collected from a particular news topic, is presented in Tables 1 and 2. The following abbreviations are used: person name (PN), location name (LOC), organization name (ORG) and miscellaneous (MISC).
Table 1: Results for Test Set 1

Model     NE category    R       P       FS
Model B   PN             71.6    79.2    75.30
          LOC            66.2    74.2    69.71
          ORG            64.7    73.4    68.8
          MISC           33.2    98.3    49.63
Model A   PN             69.1    73.2    71.09
          LOC            64.7    67.1    65.87
          ORG            62.3    63.7    62.68
          MISC           33.2    98.3    49.63
Table 2: Results for Test Set 2

Model     NE category    R       P       FS
Model B   PN             72.8    81.2    76.80
          LOC            67.9    76.5    71.96
          ORG            66.3    75.1    70.40
          MISC           37.2    99.1    54.09
Model A   PN             69.9    73.8    71.8
          LOC            65.3    68.1    66.67
          ORG            63.6    64.1    63.84
          MISC           37.2    99.1    54.09
It is observed from Tables 1 and 2 that the NER system with linguistic features (Model B) outperforms the NER system without linguistic features (Model A) in terms of Recall, Precision and F-Score. Linguistic knowledge plays the key role in the better performance of Model B compared to Model A. The improvement in Precision, Recall and F-Score values on Test Set 2 occurs because Test Set 1 is included as part of the training corpus in that case. Whenever a pattern of the set of accepted patterns (Accept Pattern) produces more than one NE category (type) for an NE identified from the test corpus, Model A always assigns the particular NE category (type) that has the maximum probability value for that pattern. This often produces errors in assigning NE categories to the identified NEs, so the precision values diminish and, as a result, the F-Score values are affected. Model B solves this problem with the help of linguistic knowledge, so its precision and F-Score values are better than those of Model A. At present, the systems can only identify the various date expressions; they cannot identify other miscellaneous NEs such as monetary and time expressions.
4. Conclusion and Future Work
Experimental results show that the performance of a NER system employing a machine learning technique can be improved significantly by incorporating linguistic features. The lexical patterns can be generalized further by replacing each word of a pattern p with its lexical category (i.e. each word is replaced by its part-of-speech information), so a part-of-speech (POS) tagger could be effective in generating more potential general patterns. In the presence of a POS tagger, the potential NEs can be identified by a regular expression over nouns (i.e. an NE is always a sequence of noun words). Currently, we are working to include an HMM-based part-of-speech tagger and a rule-based chunker in the systems. More linguistic knowledge could help with the NE-classification disambiguation problem and, as a result, the precision values of the different NE categories would increase. Observing the results with the various orderings of the test sets would also be an interesting experiment.
References
[1] H. Cunningham, GATE, a general architecture for text engineering, Computers and the Humanities, 2001.
[2] D. Maynard, V. Tablan, H. Cunningham, and Y. Wilks, MUSE: a multi-source entity recognition system, Computers and the Humanities, 2003.
[3] S. Cucerzan and D. Yarowsky, Language independent named entity recognition combining morphological and contextual evidence, Proceedings of the 1999 Joint SIGDAT Conference on EMNLP and VLC, 1999.
[4] Erik F. Tjong Kim Sang and Fien De Meulder, Introduction to the CoNLL-2003 Shared Task: Language Independent Named Entity Recognition, Proceedings of CoNLL-2003, Edmonton, Canada, 2003, pp. 142-147.
[5] A. Klementiev and D. Roth, Weakly Supervised Named Entity Transliteration and Discovery from Multilingual Comparable Corpora, Proceedings of COLING-ACL 2006, Sydney, Australia, 17-21 July, pp. 817-824.
[6] D. Okanohara, Y. Miyao, Y. Tsuruoka and J. Tsujii, Improving the Scalability of Semi-Markov Conditional Random Fields for Named Entity Recognition, Proceedings of COLING-ACL 2006, Sydney, Australia, 17-21 July, pp. 465-472.
[7] R. Yangarber, W. Lin and R. Grishman, Unsupervised Learning of Generalized Names, Proceedings of the 19th International Conference on Computational Linguistics (COLING-2002).
[8] Heng Ji and Ralph Grishman, Analysis and Repair of Name Tagging Errors, Proceedings of COLING-ACL 2006, Sydney, Australia, 17-21 July, pp. 420-427.
[9] Vijay Krishnan and Christopher D. Manning, An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition, Proceedings of COLING-ACL 2006, Sydney, Australia, 17-21 July, pp. 1121-1128.
Fast Prototyping of a Malay WordNet System
LIM Lian Tze and Nur HUSSEIN
Computer Aided Translation Unit
School of Computer Sciences
Universiti Sains Malaysia
Penang, Malaysia
{liantze,hussein}@cs.usm.my
ABSTRACT
This paper outlines an approach to produce a prototype
WordNet system for Malay semi-automatically, by using
bilingual dictionary data and resources provided by the original English WordNet system. Senses from an English-Malay
bilingual dictionary were first aligned to English WordNet senses, and a set of Malay synsets was then derived. Semantic relations between the English WordNet synsets were
extracted and re-applied to the Malay synsets, using the
aligned synsets as a guide. A small Malay WordNet prototype with 12429 noun synsets and 5805 verb synsets was thus
produced. This prototype is a first step towards building a
full-fledged Malay WordNet.
KEYWORDS
WordNet, Malay, lexical knowledge base, fast prototyping
1  INTRODUCTION
Traditional dictionaries compile lexical information about
word meanings by listing them alphabetically by the headwords. While this arrangement is convenient for a human
reader who wants to look up the meaning of a word, it
does not provide much information about explicit semantic relations between words, besides the usual synonyms and
antonyms.
WordNet [6, 8] is a lexical database system for English
words, designed based on psycholinguistic principles. It organises word meanings (senses) on a semantic basis, rather
than by the surface morphological forms of the words. This
is done by grouping synonyms into sets, and then defining
various relations between the synonym sets (synsets). Some
examples of the semantic relations defined include hypernymy (the is-a relation) and meronymy (the part-of relation).
Armed with such semantic relations, WordNet became an
invaluable resource for natural language processing (NLP)
researchers in tackling problems like information retrieval,
word sense disambiguation, and question answering. As the
original WordNet contains only English words, there have been efforts to create WordNet-like systems for other languages; see the Global WordNet Association's website [4] for a list of such projects.
Currently, no WordNet-like lexical database system exists for the Malay language. Such a resource would be very useful for NLP research involving Malay texts. While the construction of a complete WordNet-like system is a daunting undertaking that requires lexicographic expertise, it is possible to build a prototype system semi-automatically using resources accessible at our site. The prototype Malay WordNet system and data can then be further scrutinised, fine-tuned and improved by human lexicographers.
The main aim of developing this prototype was to explore the design of and tools available in a WordNet system, rather than to make a full attempt at developing high-quality Malay WordNet data. Therefore, the methods we adopted are not as extensive as other efforts in constructing non-English WordNets, such as the work reported in [1, 2].

2  METHODOLOGY
We describe how a prototype Malay WordNet can be constructed semi-automatically using an English–Malay bilingual dictionary, the original English WordNet, and alignments between the two resources.
The developers of the English WordNet, the Cognitive Science Laboratory at Princeton University, have made available some useful tools that allow the custom development of
WordNet-like systems [7]. They include:
• English WordNet database files,
• WordNet Browser, a GUI front-end for searching and
viewing WordNet data,
• WordNet database search functions (as C library functions),
• GRIND, a utility tool for converting lexicographer input
files into WordNet database files.
If lexicographer input files for Malay words can be created following the required syntax, GRIND can be used to
process them to produce Malay WordNet database files, to
be viewed using the WordNet browser. This can be done
© Copyright is held by the authors.
by first establishing a set of Malay word synsets and the
semantic relations between them, and then generating the
lexicographer files.
2.1  Malay Synsets

Kamus Inggeris Melayu Dewan (KIMD) [5] is an English–Malay bilingual dictionary that provides Malay equivalent words or phrases for each English word sense. Linguists at our research group had previously aligned word senses from KIMD and WordNet 1.6. Not all KIMD and WordNet 1.6 senses were included; only the more common ones were processed. Here are some example alignments for some senses of dot, consolidation and integration:

Listing 1: Aligned senses of dot
kimd(dot, n, 1, 0, [small round spot, small circular shape], <titik, bintik>).
wordnet(110025218, 'dot', n, 1, 0, [a very small circular shape]).

Listing 2: Aligned senses of consolidation
kimd(consolidation, n, 1, 0, [act of combining, amalgamating], <penggabungan, penyatuan>).
wordnet(105491124, 'consolidation', n, 1, 0, [combining into a solid mass]).
wordnet(100803600, 'consolidation', n, 2, 0, [the act of combining into an integral whole]).

Listing 3: Aligned senses of integration
kimd(integration, n, 1, c, [act of c. (combining into a whole)], <penyepaduan, pengintegrasian>).
wordnet(100803600, 2, 'integration', n, 2, 0, [the act of combining into an integral whole]).

(The 9-digit number in each English WordNet sense above is a unique identifier of the synset it belongs to.)

A set of Malay synsets may be approximated from the KIMD–WordNet alignment using Algorithm 1.

Algorithm 1 Constructing Malay synsets
for all English synset es do
  ms-equivs ⇐ empty  //list of Malay equivalent words
  ms ⇐ null  //equivalent Malay synset
  for all s ∈ {KIMD senses aligned to es} do
    add Malay equivalent(s) of s to ms-equivs
  end for
  ms ⇐ new synset containing ms-equivs
  set ms to be the equivalent Malay synset of es
end for

Following this algorithm, the following Malay synsets are derived from the sense alignments in Listings 1–3. The corresponding English WordNet synsets are also shown:

• (titik, bintik)
  (110025218: point, dot; [a very small circular shape])
• (penggabungan, penyatuan)
  (105491124: consolidation; [combining into a solid mass])
• (penggabungan, penyatuan, penyepaduan, pengintegrasian)
  (100803600: consolidation, integration; [the act of combining into an integral whole])

2.2  Synset Relations

For this fast prototyping exercise, we have decided to create semantic relations between the Malay synsets based on the existing relations between their English equivalents. Algorithm 2 shows how this can be done.

Algorithm 2 Creating relations between Malay synsets
Require: lookup_ms(es): returns the Malay synset equivalent to English synset es
Require: lookup_es(ms): returns the English synset equivalent to Malay synset ms
Require: get_target(R, es): returns the target (English) synset of English synset es for relation R
for all Malay synset ms do
  es ⇐ lookup_es(ms)
  for all relation R with a pointer from es do
    ms′ ⇐ null
    es′ ⇐ es
    if R is transitive then
      repeat
        es′ ⇐ get_target(R, es′)
        ms′ ⇐ lookup_ms(es′)
      until es′ = null or ms′ ≠ null
    else
      es′ ⇐ get_target(R, es)
      ms′ ⇐ lookup_ms(es′)
    end if
    if ms′ ≠ null then
      add (R, ms′) to the list of relations that apply to ms
    end if
  end for
end for

As an example, the hypernymy relation holds between the English synsets (point, dot) and (disk, disc, saucer). Therefore, a hypernymy relation is established between the corresponding Malay synsets (bintik, titik) and (ceper, piring). However, while searching for target synsets of a relation R, it is always possible that there is no Malay equivalent for an English synset. If R is transitive, as are hypernymy and meronymy, we continue to search for the next target synset in the transitive relation chain, until we reach the last English synset in the chain.

To illustrate, consider the English and Malay synsets in Figure 1. The English synset (disk, disc, saucer) has the hypernym (round shape), which in turn has the hypernym (shape, form). While (round shape) does not have a corresponding Malay synset in our data, (shape, form) does have one, (bentuk, corak). Therefore, a hypernymy relation is established between (ceper, piring) and (bentuk, corak).
Figure 1: English and Malay synsets forming a hypernymy chain
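Algorithm 2's chain-following behaviour can be condensed into a few lines of Python. The following toy sketch is illustrative only, not the LEXGEN code: the dictionaries stand in for the KIMD–WordNet alignment lookups (lookup_ms/lookup_es) and for the English relation pointers (get_target), with invented synset identifiers.

# Toy data: alignment lookups and English relation pointers.
es_of = {"ms_dot": "en_dot"}                       # Malay -> English synset
ms_of = {es: ms for ms, es in es_of.items()}       # English -> Malay synset
ms_of["en_shape"] = "ms_bentuk"                    # e.g. (bentuk, corak)
targets = {("hypernym", "en_dot"): "en_round_shape",
           ("hypernym", "en_round_shape"): "en_shape"}
transitive = {"hypernym", "meronym"}

relations = {}                                     # Malay synset -> [(R, ms')]
for ms, es in es_of.items():
    for rel in {r for (r, s) in targets if s == es}:
        es2 = targets.get((rel, es))
        ms2 = ms_of.get(es2)
        # if the relation is transitive, walk the chain until a Malay
        # equivalent is found or the chain ends
        while rel in transitive and es2 is not None and ms2 is None:
            es2 = targets.get((rel, es2))
            ms2 = ms_of.get(es2)
        if ms2 is not None:
            relations.setdefault(ms, []).append((rel, ms2))

print(relations)        # {'ms_dot': [('hypernym', 'ms_bentuk')]}

Here (round shape) has no Malay equivalent, so the hypernym pointer skips along the chain and lands on the synset aligned to (shape, form), mirroring the example in Figure 1.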
2.3  Lexicographer Files

WordNet systems organise synsets of different syntactic categories, i.e. nouns, verbs, adjectives and adverbs, separately. In addition, the English WordNet also assigns semantic fields to the synsets, such as noun.location, noun.animal and verb.emotion. Synsets of different categories are stored in separate lexicographer files, the names of which correspond to their semantic fields.

For each Malay synset identified in Section 2.2, we look up f, the semantic field of its equivalent English synset. The Malay synset, together with its relations and target synsets, is then appended to the lexicographer file f.

3  IMPLEMENTATION

The procedures described in Sections 2.2 and 2.3 were implemented as a suite of tools called LEXGEN in C and Java. As a first step, only noun and verb synsets were processed with LEXGEN. Since KIMD does not provide Malay glosses, LEXGEN reuses glosses from the English WordNet. The resulting lexicographer files were then put through GRIND, producing a small Malay WordNet system.

4  RESULTS

The prototype Malay WordNet system currently contains 12429 noun synsets and 5805 verb synsets. Its small coverage of the English WordNet (81426 noun synsets and 13650 verb synsets) is understandable, as only a subset of KIMD and WordNet senses was used in the earlier alignment work. The prototype also includes the hypernymy, hyponymy, troponymy, meronymy, holonymy, entailment and causation relations. Figure 2 shows the Malay synset (bintik, titik) and its hypernyms as viewed in the WordNet Browser.

Figure 2: Malay WordNet as viewed in Browser

5  DISCUSSION

The Malay WordNet prototype is adequate for demonstrating what a WordNet system has to offer for Malay. This is especially helpful for giving a quick preview to users who are not yet familiar with WordNet or the lexical sense organisation paradigm. However, as acknowledged at the very beginning, its current quality is far from satisfactory.

Part of the problem lies in the dictionary used. The KIMD–WordNet alignment work was part of a project to collect glosses for English word senses from different dictionaries. As such, the suitability of the Malay equivalents as lemmas was not the main concern: all Malay equivalents were simply retained in the alignment files. This leads to unsuitable Malay WordNet synset members in some cases: since KIMD is a unidirectional English-to-Malay dictionary, not all Malay equivalents it provides can stand as valid lemmas. For example, KIMD provides orang, anggota, dan lain-lain yang tidak hadir (literally 'person, member, etc. who are not present') as the Malay equivalent for English absentee. While this is valid as a Malay gloss or description for the synset, it is unsuitable as a member lemma of a synset.

In addition, we also lack Malay gloss information for the Malay synsets, as glosses were not provided in KIMD. The prototype Malay WordNet is therefore forced to use English text as glosses instead of Malay glosses.

We also noted that the English WordNet provides verb frames, e.g. Somebody ----s something for a sense of the verb run. The first problem is that we have yet to establish a list of verb frames for Malay. Secondly, even if there were one, there is not necessarily a one-to-one mapping between English and Malay verb frames. Thirdly, as the English verb frames are hard-coded into GRIND and WordNet, extensive re-programming would be required to use these utilities for different languages. Therefore, we have not attempted to handle Malay verb frames in this prototype.

GRIND imposes a maximum of sixteen senses per word form in each lexicographer file. This might be a problem if there are Malay words that are very polysemous. Possible alternatives are to:

• further split the senses into different lexicographer files so that each file would not contain more than sixteen senses of the same word,
• aim for coarser sense distinctions, or
• re-program GRIND.

Finally, the derivation of Malay synsets from the KIMD–WordNet alignments may be flawed, because multiple KIMD senses may be aligned to one WordNet sense, and vice versa. Referring back to Listing 2 and the list of Malay synsets at the end of Section 2.1, we see that the Malay words penggabungan and penyatuan from one KIMD sense now appear in two synsets. To non-lexicographers, such as the authors of this paper, it is unclear how this situation should be handled. Are there now two senses of penyatuan and penggabungan, or should the Malay synsets (penggabungan, penyatuan) and (penggabungan, penyatuan, penyepaduan, pengintegrasian) be merged? Since there are opinions that the English WordNet is too fine-grained, the synsets could perhaps be merged to avoid the problem for Malay WordNet. Nevertheless, we think a lexicographer would be better qualified to make this decision.

6  CONCLUSION

Creating a new set of WordNet lexicographer files from scratch for a target language is a daunting task. A lot of work needs to be done in compiling the lexicographer input files and identifying relations between synsets in the language. However, we have been successful in rapidly constructing a prototype Malay WordNet by bootstrapping the synset relations off the English WordNet. Hopefully, this will lay the foundation for the creation of a more complete Malay WordNet system.

7  FUTURE WORK

The aim of the work on the prototype Malay WordNet was only to explore the architecture and software tools required in a WordNet system. Future work will focus on systematically compiling lexical data for a Malay WordNet system by lexicographers and linguistic experts. We highlight some issues of interest here.

• A Malay monolingual lexicon or dictionary should be used to determine the Malay synsets, the gloss text for each synset, as well as each synset's semantic field.
• The semantic fields are hard-coded into GRIND and WordNet. Therefore, if we are to have localised semantic fields in Malay, e.g. noun.orang (noun.person) and noun.haiwan (noun.animal), or to add new fields, GRIND and WordNet will need to be modified.
• Semantic relations need to be defined between the Malay synsets. This may be aided by machine learning strategies, such as those used in [1], besides human efforts.
• A list of Malay verb frames needs to be drawn up and assigned to each verb sense.
• Currently, the Malay word senses are ordered at random. Ideally, the senses should be numbered to reflect their usage frequency in natural texts. A sense-tagged Malay corpus will help in this, as was done for the English WordNet [7, p. 112].
• It would also be interesting to align the Malay WordNet to EuroWordNet [3], which contains wordnets for several European languages. As EuroWordNet is aligned to English WordNet 1.5, some re-mapping would have to be performed if we wish to re-use the KIMD–WordNet alignment, or the prototype, as a rough guide.

REFERENCES

[1] J. Atserias, S. Climent, X. Farreres, G. Rigau, and H. Rodríguez. Combining multiple methods for the automatic construction of multilingual wordnets. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP'97), Tzigov Chark, Bulgaria, 1997.

[2] I. Azarova, O. Mitrofanova, A. Sinopalnikova, M. Yavorskaya, and I. Oparin. RussNet: Building a lexical database for the Russian language. In Proceedings of the Workshop on WordNet Structures and Standardisation and How these Affect WordNet Applications and Evaluation, pages 60-64, 2002.

[3] EuroWordNet. EuroWordNet: Building a multilingual database with wordnets for several European languages, 2006. URL http://www.illc.uva.nl/EuroWordNet/. Last accessed September 15, 2006.

[4] Global WordNet Assoc. Wordnets in the world, 2006. URL http://www.globalwordnet.org/gwa/wordnet_table.htm. Last accessed September 15, 2006.

[5] A. H. Johns, editor. Kamus Inggeris Melayu Dewan. Dewan Bahasa dan Pustaka, Kuala Lumpur, Malaysia, 2000.

[6] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography (special issue), 3(4):235-312, 1990.

[7] R. I. Tengi. Design and implementation of the WordNet lexical database and searching software. In C. Fellbaum, editor, WordNet: An Electronic Lexical Database, chapter 4, pages 105-127. MIT Press, Cambridge, Massachusetts, 1998.

[8] WordNet. WordNet: a lexical database for the English language, 2006. URL http://wordnet.princeton.edu/. Last accessed September 15, 2006.
Taxonomic Ontology Learning by using Item List
on the Basis of Text Corpora in Thai
Aurawan Imsombut
Asanee Kawtrakul
The Specialty Research Unit of Natural Language Processing
and Intelligent Information System Technology
Department of Computer Engineering,
Kasetsart University, Bangkok, Thailand
{g4685041,ak}@ku.ac.th
Abstract
Ontologies are an essential resource to enhance the performance of an information processing system, as they enable the re-use and sharing of knowledge in a formal, homogeneous and unambiguous way. We propose here a method for the automatic learning of a taxonomic ontology based on Thai text corpora. To build the ontology we extract terms and relations. For the former we use a shallow parser, while for the extraction of taxonomic relations we use item lists, i.e. bullet lists and numbered lists. To deal with this, we first need to identify which lists contain an ontological term and to solve the problems of list embedding and list boundaries. Then, we extract the hypernym term of the item lists by using the lexicon and the contextual features of the candidate hypernym terms. The accuracy of the ontology extraction from item lists is 0.72.
1 Introduction
Ontology is a well-known term in the field of AI and
knowledge engineering. Numerous definitions have
been offered, and a common acceptance of the term is
to consider it as “an explicit specification of a
conceptualization.” [4]. We define ontology here as
“a general principle of any system to represent
knowledge for a given domain, with information from
multiple, heterogeneous sources. Information can be
represented by concepts and semantic relationships
between them.” An ontology can be used for many purposes. In Natural Language Processing it can be used to enhance the performance of machine translation, text summarization, information extraction and document clustering.
The building of an ontology by an expert is an expensive task. It is also a never-ending process, because knowledge increases all the time in the real world, especially in the area of science. Hence we suggest building ontologies automatically.
Texts are a valuable resource for extracting ontologies, as they contain a lot of information concerning concepts and their relationships. We can classify the expression of ontological relationships in texts into explicit and implicit cues. In the presence of an explicit cue, an ontological element can be detected by using the cue, i.e. lexico-syntactic patterns [7] or an item list (bullet list or numbered list). Implicit cues do not have any concrete word hinting at the relationship [6]. In this work, we focus on extracting hypernym-hyponym (taxonomic) relations, because they are the most important relations in an ontology and form the skeleton of the knowledge. To deal with this, we use item lists as hints for taxonomic relations. We propose a method for detecting ontological item lists and for extracting the hypernym class of list items. The system selects the appropriate hypernym term from a list of candidates, choosing the most likely one (the one with the highest probability) according to the lexicon and some contextual features. We tested the system on Thai corpora in the domain of agriculture.
The remainder of this paper is organized as follows. Section 2 presents related work on ontology extraction from unstructured text. Section 3 describes crucial problems for the extraction of an ontology by using item lists. In Section 4, we propose a methodology for automatically extracting the hypernym relations of an ontology on the basis of corpora. The experimental results and conclusions are given in Sections 5 and 6, respectively.
2 Related Works
There are a number of proposals to build ontologies from unstructured text. The first to propose the extraction of semantic relations by using lexico-syntactic patterns in the form of regular expressions was Hearst [5]. Secondly, statistical techniques have often been used for the same task. Indeed, many researchers [1], [2], [8], [10] have used clustering techniques to extract ontologies from corpora, with different features for different clustering algorithms. This approach makes it possible to process a huge set of data and a lot of features, but it needs an expert to label each cluster node and each relationship name. Another approach is hybrid: Maedche and Staab [9] proposed an algorithm based on statistical techniques and the association rules of data mining technology to detect relevant relationships between ontological concepts. Shinzato and Torisawa [12] presented an automatic method for acquiring hyponymy relations from itemizations and listings in HTML documents, using statistical measures and some heuristic rules. In this paper, we suggest extracting the ontology from itemized lists in plain text, i.e. without the HTML markup used in most of the previous work, and selecting the appropriate hypernym of the list items.
3 Crucial Problems for the Extraction of Hypernym Relations by Using Item Lists

When using item lists as cues that signal a hypernym relation, we first need to identify which lists contain an ontological term and whether the lists are coherent. Since the input of our system is plain text, we do not have any markup symbols showing the position and the boundaries of a list. This is why we use bullet symbols and numbers to indicate the lists, which is not without posing certain problems.

Embedding of lists. Some lists may have long descriptions and some of them can contain another list. This causes an identification problem. We solve it by detecting each list through the same bullet symbol or the same order of numbering. Despite that, there are cases where an embedded list may have continuous numbering. In this case, we assume that different lists talk about different topics; hence we need to identify the meaning of each one of them.

Long boundaries of description in each list item. Since some lists may have long descriptions, it is difficult to decide whether the item in focus is meant to continue the previous list or to start a new one.

Non-ontological list items. Quite often, authors express procedures and descriptions in list form. But procedure list items are not the domain's ontological terms, and some description list items may not be ontology terms at all; hence the system needs to detect the ontological term lists. Figure 1 illustrates the problems of item list identification.

Important pest of cabbage
1. Diamondback moth is a caterpillar …
   …
   Treatment and protection   (long boundary description)
   1. ...
   2. ...
2. Cut worm is usually found at …
   …
   Treatment and protection   (embedding of lists)
   1. ...
   2. ...
3. Cabbage webworm will destroy the cabbage …
   …
   Treatment and protection   (non-ontological list items)
   1. ...
   2. ...

Fig. 1. Example of problems of item list identification.

4 Methodology for Automatically Extracting the Hypernym Relations of an Ontology

The method proposed in this paper for taxonomic-ontology extraction from itemized lists is a hybrid approach combining natural language processing with rule-based and statistics-based techniques. We decompose the task into three steps: morphological analysis and noun phrase chunking, item list identification, and extraction of the hypernym term of the list items.

Morphological analysis and noun phrase chunking. Just as in many other Asian languages, there are no delimiters (blank spaces) in Thai to signal word boundaries: texts are a single stream of words. Hence, word segmentation and part-of-speech (POS) tagging [13] are necessary for identifying a term unit and its syntactic category. Once this is done, documents are chunked into phrases [11] to identify shallow noun phrase boundaries within a sentence. In this paper, the parser relies on NP rules, word formation rules and lexical data. In Thai, NPs sometimes have sentence-like patterns; this is why it is not always easy to identify the boundary of NPs composed of several words including a verb. A candidate NP might then be signaled by another occurrence of the same sequence in the same document: the sentence-like NPs to be selected should be those occurring more than once.

Item list identification. Since an author can use item lists to describe objects, procedures and the like, some lists are non-ontological. Hence we use only object lists, because for these we can be sure that they contain an ontological term. In consequence, the system classifies list types by considering only item lists that fulfill the following constraints:
• the items of the list are marked as the same Named Entity (NE) class [3];
• the list is composed of more than one item.
This works well here because, in the domain of agriculture, named entities are not only names at the leaf level. For example, rice is considered a plant named entity, as are the varieties of rice, and rice does not occur at the leaf level of the plant ontology. This being so, our method can efficiently identify ontological terms. Moreover, this phenomenon is very similar in other domains such as bioinformatics. In our study we classified lists into bullet lists and numbered lists. We require that a bullet list contain the same bullet symbol and the same NE class throughout; the system recognizes a numbered list by the ordering of its numbers, making sure that the items belong to the same NE class.
This technique can solve the problem of embedded lists: since we assume that different lists talk about different topics and have different NE classes, the method can distinguish between different item lists. Moreover, it works for list items whose boundaries are as long as a paragraph.
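As a rough Python illustration of the two constraints above, an item list can be accepted only when all of its items carry one shared, defined NE class; the ne_class function below is a hypothetical stand-in for the named entity tagger of [3], with invented tags.

def ne_class(term):
    # hypothetical stand-in for the NE tagger of [3]
    toy_tags = {"Diamondback moth": "PEST", "Cut worm": "PEST",
                "Cabbage webworm": "PEST"}
    return toy_tags.get(term)

def is_ontological(items):
    # more than one item, and all items share one defined NE class
    classes = {ne_class(item) for item in items}
    return len(items) > 1 and len(classes) == 1 and None not in classes

print(is_ontological(["Diamondback moth", "Cut worm", "Cabbage webworm"]))  # True
print(is_ontological(["Diamondback moth", "spray weekly"]))                 # False

A change in NE class (or an item with no NE class at all, such as a procedure step) breaks the list, which is how embedded and non-ontological lists are separated out.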
Extraction of the hypernym term of list items. Having identified the item lists by considering bullets or numbering, the system discovers the hypernym term from a set of candidates by using lexical and contextual features. In order to generate the list of candidate hypernym terms, the system considers only the terms that occur in the paragraphs preceding the item list; the one closest to the first item term of the list is considered first. Next, the most likely hypernym value (MLH) of each term in the candidate list is computed on the basis of an estimation function taking lexical and contextual features into account. Let h ∈ H, where H is the set of candidate hypernym terms, and let t ∈ T, where T is the set of terms in the item list. The estimation function for computing the most likely hypernym term is defined as follows:

    MLH(h, t) = α1·f1 + α2·f2 + ... + αn·fn

where αk is the weight of feature fk and n is the total number of features (= 5). f1-f4 are lexical features and f5 is the contextual feature. In our experiment we set all weights αk to the same value (= 1/n). The system selects the candidate term with the maximum MLH value as the hypernym of the item list terms.

Lexical features. These have binary values. The features are:

f1: Head word compatible. This feature considers whether the head word of the candidate term is compatible with the list item term:
    f1(h, t) = 1 if h is compatible with the head word term of t; 0 otherwise.

f2: Same NE class. This feature considers whether the candidate hypernym term belongs to the same NE class as the list item term:
    f2(h, t) = 1 if h belongs to the same NE class as t; 0 otherwise.

f3: Different NE class. This feature considers whether the candidate hypernym term belongs to a different NE class than the list item term:
    f3(h, t) = 1 if h belongs to a different NE class than t; 0 otherwise.

Comment: concerning NEs, we distinguish between two features (f2 and f3) since a candidate hypernym term may or may not have an NE class. The case where the candidate term has no NE class is possible, especially when it occurs at a high level of the taxonomy, e.g. /phuet trakun thua/ (pulse crops). Consequently, when we compare the NE classes of two terms there are three possible values: 'same NE class', 'different NE class' and 'cannot be determined' (the latter occurs if the candidate term has no NE class). Hence we use these two features to represent all the possible values.

f4: Topic term. This feature considers whether the candidate term is a topic term of the document (for a short document) or a topic term of the paragraph (for a long document). Topic terms are computed using tf/idf:
    f4(c, h) = 1 if h is a topic term of the document (short document) or a topic term of the paragraph (long document); 0 otherwise.

Contextual feature. This feature has a value between 0 and 1. It is the similarity between the word co-occurrence vectors of a candidate term and of an item list term. Each component of a word co-occurrence vector corresponds to a context of the sentences in which the word occurs. We select the 500 most frequent terms (l) in the domain of agriculture as word co-occurrence features, and represent each candidate hypernym term (h) and each list item term (t) with this set of term features. Each value of the vector is the frequency of co-occurrence of the feature term (l) with the term under consideration (h or t). We compute the similarity between h and t by using the cosine coefficient [19]:

    f5(h, t) = cos(h, t) = (Σl hl·tl) / (√(Σl hl²) · √(Σl tl²))
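A minimal Python sketch of the scoring follows, under stated assumptions: the four lexical feature values are passed in as precomputed binary values, the co-occurrence vectors are plain dictionaries, and the equal weights αk = 1/n match the experiment above. This is an illustration, not the authors' system.

import math

def cosine(h_vec, t_vec):
    # f5: cosine coefficient of two word co-occurrence vectors (dicts)
    dot = sum(v * t_vec.get(k, 0.0) for k, v in h_vec.items())
    norm = (math.sqrt(sum(v * v for v in h_vec.values()))
            * math.sqrt(sum(v * v for v in t_vec.values())))
    return dot / norm if norm else 0.0

def mlh(lexical_feats, h_vec, t_vec):
    # equal weights alpha_k = 1/n over the five features
    feats = list(lexical_feats) + [cosine(h_vec, t_vec)]   # f1..f4 then f5
    return sum(feats) / len(feats)

# toy candidate: f1..f4 = (1, 1, 0, 1), plus a contextual similarity
score = mlh((1, 1, 0, 1), {"rice": 3, "farm": 1}, {"rice": 2, "pest": 1})
print(round(score, 3))

The candidate h with the highest MLH value over the list items would then be selected as the hypernym.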
5 Experimental Results

The evaluation of our system was based on test cases in the domain of agriculture. The system's performance is measured by precision and recall, comparing the outputs of the system with results produced by experts. Precision gives the number of correctly extracted results divided by the total number of extracted results, while recall gives the number of correctly extracted results divided by the total number of correct results.

From a corpus of about 15,000 words, the system extracted 284 concepts and 232 relations. The accuracy of the hypernyms obtained by using the different features is shown in Table 1. The results indicate that the contextual feature is more effective than the lexical features, and that the lexical features yield a lower recall than the contextual feature, since some hypernym terms do not share certain lexical features such as the NE class. The precision of the system using both kinds of features is 0.72 and its recall is 0.71. Errors are caused by item lists that are composed of two classes, for example disease and pest; the system cannot detect such item lists.
Table 1. The evaluation results of ontology extraction from item lists

                                       precision   recall
Only lexical features                  0.47        0.46
Only contextual feature                0.56        0.55
Both lexical and contextual features   0.72        0.71
6 Conclusion

In this article we presented and evaluated a hybrid methodology, i.e. rule-based and learning, for the automatic building of an ontology, composed of term extraction and hypernym relation extraction. A shallow parser is used for term extraction, and item lists (bullet lists and numbered lists) are used for hypernym relation extraction. We extract the hypernym term of the item lists by using the lexicon and the contextual features of the candidate hypernym terms.
We consider our results to be quite good, given that the experiment is preliminary, but the vital limitation of our approach is that it works well only for documents that contain many cue words. Based on our error analysis, the performance of the system can be improved and the methodology can be extended to other sets of semantic relations.
Acknowledgments. The work described in this paper has been supported by NECTEC grant No. NT-B-22-14-12-46-06. It was also funded in part by KURDI, the Kasetsart University Research and Development Institute.
References
1. Agirre, E., Ansa, O., Hovy, E., Martinez, D.: Enriching very large ontologies using the WWW. In Proceedings of the Workshop on Ontology Construction of the European Conference of AI (ECAI-00) (2000)
2. Bisson, G., Nedellec, C., Cañamero, D.: Designing Clustering Methods for Ontology Building - The Mo'K Workbench. In Proceedings of the Workshop on Ontology Learning, 14th European Conference on Artificial Intelligence, ECAI'00, Berlin, Germany (2000)
3. Chanlekha, H., Kawtrakul, A.: Thai Named Entity Extraction by incorporating Maximum Entropy Model with Simple Heuristic Information. In Proceedings of IJCNLP 2004, Hainan Island, China (2004)
4. Gruber, T. R.: A Translation Approach to Portable Ontology Specifications. (1993)
5. Hearst, M.: Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics (1992)
6. Imsombut, A., Kawtrakul, A.: Semi-Automatic Semantic Relations Extraction from Thai Noun Phrases for Ontology Learning. The Sixth Symposium on Natural Language Processing (SNLP 2005), Chiang Rai, Thailand (2005)
7. Kawtrakul, A., Suktarachan, A., Imsombut, A.: Automatic Thai Ontology Construction and Maintenance System. Workshop on OntoLex, LREC Conference, Lisbon, Portugal (2004)
8. Lin, D., Pantel, P.: Concept Discovery from Text. In Proceedings of the International Conference on Computational Linguistics, Taipei, Taiwan (2002) 577-583
9. Maedche, A., Staab, S.: Ontology Learning for the Semantic Web. IEEE Intelligent Systems, vol. 16, no. 2 (2001)
10. Nedellec, C.: Corpus-based learning of semantic relations by the ILP system ASIUM. In Learning Language in Logic, Lecture Notes in Computer Science, vol. 1925, Springer-Verlag, June (2000) 259-278
11. Pengphon, N., Kawtrakul, A., Suktarachan, M.: Word Formation Approach to Noun Phrase Analysis for Thai. In Proceedings of SNLP 2002, Thailand (2002)
12. Shinzato, K., Torisawa, K.: Acquiring Hyponymy Relations from Web Documents. In Proceedings of HLT-NAACL 2004, Boston, USA, May (2004)
13. Sudprasert, S., Kawtrakul, A.: Thai Word Segmentation Based on Global and Local Unsupervised Learning. In Proceedings of NCSEC 2003, Chonburi, Thailand (2003)
Discovery of Meaning from Text
Ong Siou Chin
Narayanan Kulathuramaiyer
Alvin W. Yeo
Faculty of Computer Science and Information Technology,
Universiti Malaysia Sarawak,
Kota Samarahan, Malaysia.
{scong, nara, alvin}@fit.unimas.my
Abstract

This paper proposes a novel method to disambiguate important words from a collection of documents. The hypothesis that underlies this approach is that there is a minimal set of senses that are significant in characterizing a context. We extend Yarowsky's one sense per discourse [13] to a collection of related documents rather than a single document. We perform distributional clustering on a set of features representing each of the top ten categories of documents in the Reuters-21578 dataset. Groups of terms that have a similar term distribution pattern across documents are identified. A WordNet-based similarity measurement is then computed for terms within each cluster. Aggregating the associations in WordNet employed to ascertain term similarity within clusters provides a means of identifying the clusters' root senses.
1. Introduction

Word sense disambiguation (WSD) is a two-step process: first, identifying the possible senses of a candidate word, and then selecting the most probable sense for the candidate word according to its context. Proposed methods are divided into corpus-based and dictionary-based approaches.

The corpus-based unsupervised approach proposed by Yarowsky [13] disambiguates word senses by exploiting two decisive properties of human language: one sense per collocation and one sense per discourse. One sense per collocation indicates that words in the same collocation provide a strong indication of the correct sense of a target word, while one sense per discourse picks up word senses of target words that are consistent throughout a document [13].

WordNet is a lexical database that comprises English nouns, verbs, adjectives and adverbs. Entries in WordNet are represented as synonym sets (synsets), and the linkages between synsets are hierarchical. For instance, noun synsets in WordNet are organized into an IS-A hierarchy representing the hyponymy/hypernymy relationship. Due to its wide coverage, WordNet is used as the knowledge structure in most dictionary-based WSD approaches; its synsets and IS-A hierarchy are the parts of WordNet most used in WSD. There are works ([1], [3]) that adopted the conceptual density in WordNet's IS-A hierarchy to achieve WSD. These researchers made use of sets of words that co-occur within a window of context in a single document. Basili [3] further incorporated "natural" corpus-driven empirical estimation of lexical and contextual probabilities for semantic tagging into [1].

Patwardhan [9] discussed the use of a semantic relatedness formula, namely the Adapted Lesk Algorithm, which considers the overlaps of the glosses¹ of the words to be disambiguated. They obtained the relatedness between candidate senses in the same collocation, taking into consideration only nouns within the window of context.
Pantel and Lin [8] presented a combination of corpus-based and dictionary-based approaches. They introduced Clustering by Committee (CBC), which discovers clusters of words sharing a similar sense. Initially, they form a tight set of clusters (committees), each with its centroid represented as a feature vector. Subsequently, candidate words are assigned to these feature vectors accordingly. Overlapping feature vectors are removed to avoid discovering similar senses. They employed the cosine coefficient of words' mutual information as a similarity measure [8]. Pantel and Lin further employed these clusters of words as a means of WSD for a corpus. They explore words that are commonly collocated with words belonging to a cluster, and suggest that words co-occurring with clustered words can be seen as belonging
to the same context. They then employ one sense per collocation as the means of WSD.

¹ Gloss is the definition and/or example sentences for a synset [8].
Our work, on the other hand, is not corpus-specific in the WSD process. We identify the concept relatedness of terms solely based on semantic relationships in WordNet. Our prime motivation has been the discovery of structures with deeper meaning: these structures are compact representations of the input documents. A comparison of related works and our proposed word sense disambiguation approach is summarized in Table 1.

Table 1: Summary of related works, comparing each approach (Agirre & Rigau; Basili et al.; Chua & Kulathuramaiyer; Pantel & Lin; Patwardhan et al.; Yarowsky; and this paper) by supervision, use of clustering, domain/context (collocation, discourse or global), use of semantic similarity, and use of WordNet.
2. Word Sense Disambiguation
We extend Yarowsky’s one sense per discourse [13] to
a set of documents that represents a category.
We
disambiguate important words from the collection of
documents. Our approach of word sense disambiguation
tries to identify a set of senses that are significant in
characterizing a context. The context here is represented
in a cluster. There are three phases in our sense
disambiguation algorithm: Feature Selection, Clustering
and Semantic Similarity.
2.1 Phase I: Feature Selection

Important words are extracted from the input documents using feature selection. Feature selection is a type of dimensionality reduction technique that involves the removal of non-informative features and the formation of a subset of features from the original set. This subset is a significant and representative set of the original feature set. The feature selection scheme used is information gain (IG). Debole and Sebastiani [5] define IG as how much information a word contains about a category; a higher IG score shows that a word and a category are more dependent, and thus that the word is more informative.

The top ten categories of Reuters-21578 are used as the dataset. We considered only nouns in our word sense disambiguation, using WordNet for filtering.
2.2 Phase II: Clustering

The goal of clustering is to produce a distinct, intrinsic grouping of data, such that similar data are assigned to the same cluster. Distributional clustering is used to find the structure in the set of words formed in Phase I. Distributional clustering [2] groups words into clusters based on the probability distribution of each word across the different class labels. The probability distribution for word W in category C is P(C|W), the probability that category C will occur given word W; P(W) is the probability that word W occurs in the corpus. The assumptions made are that the words W are mutually exclusive and that P(W) is an independent event. The algorithm used is from [2]; however, a small modification is made in order to obtain n clusters. The resulting algorithm is:

1. Sort the words (obtained from Phase I) by their IG scores.
2. Assign n + 1 clusters with the top n + 1 words (as singletons).
3. Loop until all words have been assigned:
   a. Merge the 2 clusters which are most similar (based on Kullback-Leibler divergence to the mean).
   b. Create a new singleton from the next word in the list.
4. Merge the 2 clusters which are most similar (based on KL divergence to the mean), resulting in n clusters.

The probabilistic framework used is Naïve Bayes and the measure of distance employed is Kullback-Leibler (KL) Divergence to the Mean. KL Divergence to the Mean [2] is the average of the KL Divergence of each distribution to their mean distribution. It improves on KL Divergence's odd properties: it is not symmetric, and it takes an infinite value if an event has zero probability in one of the distributions.
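The divergence measure and the greedy merge can be sketched in a few lines of Python. This is a self-contained illustration under stated assumptions, not the paper's implementation: each word or cluster is represented by its distribution P(C|·) over categories as a list, and merging by unweighted averaging is a simplification.

import math

def kl(p, q):
    # Kullback-Leibler divergence D(p || q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_to_mean(p, q):
    # KL Divergence to the Mean: average divergence of p and q to their mean
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return (kl(p, m) + kl(q, m)) / 2.0

def merge_closest(clusters):
    # greedily merge the most similar pair (unweighted average, a simplification)
    pairs = [(i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    i, j = min(pairs, key=lambda ij: kl_to_mean(clusters[ij[0]], clusters[ij[1]]))
    merged = [(a + b) / 2.0 for a, b in zip(clusters[i], clusters[j])]
    return [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

clusters = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]    # toy P(C|W) rows
print(merge_closest(clusters))                     # merges the first two

Note that because the mean distribution is strictly positive wherever either input is, the divergence to the mean avoids the infinite values that make plain KL divergence awkward.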
2.3 Phase III: Semantic Similarity

Unlike [7], which employed the co-occurrence of words in documents after clustering, we explore the use of semantic similarity as a quantitative indicator for sense disambiguation. We propose that words with a similar distribution pattern in a corpus are likely to share a similar set of synsets. Therefore, the words in the same cluster as a target word should be able to provide a strong and consistent clue for disambiguating the target word in context. For each pair of words in a cluster, the semantic similarity (SS) of their most similar WordNet senses, ranging from 0 to 1, is obtained. A candidate sense i of a word W is a sense that has semantic similarity with senses of other words in the same cluster. The semantic similarity value is taken as a score, and the sense with the highest accumulated score is selected as the most probable sense for the target word in the cluster:
1. For each cluster C:
   a. For each word-pair in cluster C:
      i. Calculate the semantic similarity for all possible senses and return the senses with the highest semantic similarity.
2. For each cluster C:
   a. For each word W in cluster C:
      i. For each sense i of word W, add SS to SSi.
   b. Return the sense i with the highest SSi as the most likely sense for word W.
The semantic similarity of two concepts depends on their shared information [10], that is, the Most Specific Common Abstraction (MSCA), and on information content (IC). The MSCA is the closest node that subsumes both concepts in the taxonomy, while IC indicates how informative a concept is. Seco [11] deduced that a concept in WordNet that has more hyponyms conveys less information than a leaf concept; therefore, concepts that are leaves in WordNet are the most specific. He also gives the IC of a concept based on WordNet, icwn, as

    icwn(c) = 1 − log(hypo(c) + 1) / log(maxwn)

where hypo(c) returns the number of hyponyms of concept c and maxwn is the number of concepts available in WordNet. We used WordNet 2.0 for the implementation; there are 79689 nouns in WordNet 2.0.

In previous research by Seco, in which 12 semantic relatedness algorithms were benchmarked, an improved version of the Jiang and Conrath similarity measure (with icwn) had the highest correlation with the human judgements provided by Miller and Charles [11]. Therefore, this formula of semantic similarity is used here:

    sim(c1, c2) = 1 − (icwn(c1) + icwn(c2) − 2 × icwn(MSCA(c1, c2))) / 2

where icwn(c) is the IC of the concept based on WordNet.
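The two formulas above are easy to express directly in code. The following is a minimal Python sketch of the intrinsic IC and the normalized similarity (the formulas are taken from the text above; the MSCA lookup itself, which requires traversing WordNet, is assumed to be done elsewhere):

import math

MAX_WN = 79689          # noun concepts in WordNet 2.0, as noted above

def ic_wn(n_hyponyms, max_wn=MAX_WN):
    # intrinsic information content: more hyponyms, less information
    return 1.0 - math.log(n_hyponyms + 1) / math.log(max_wn)

def sim_jc(ic1, ic2, ic_msca):
    # normalized Jiang-Conrath similarity over the intrinsic IC values
    return 1.0 - (ic1 + ic2 - 2.0 * ic_msca) / 2.0

leaf = ic_wn(0)                         # a leaf concept is maximally informative
print(leaf, sim_jc(leaf, leaf, leaf))   # 1.0 1.0

As a sanity check, a leaf concept (no hyponyms) has IC 1.0, and a concept compared against itself yields a similarity of exactly 1.0.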
3. Evaluation and results

Two evaluations of the accuracy of the generated results were undertaken: a qualitative approach based on manual inspection, and automatic text categorization.

As an example, we provide a cluster in the category Earn, namely earnC2, to illustrate our algorithm. The members of the cluster earnC2 are:
{record, profits, loss, jan, split, income, sales, note, gain, results, th, vs, cts, net, revs, quarter, dividend, pay, sets, quarterly, profit, tax, prior, earnings}.
In WordNet, the target word loss has eight senses:
sense 1: the act of losing;
sense 2: something that is lost;
sense 3: loss, red ink, red -- the amount by which the cost of a business exceeds its revenue;
sense 4: gradual decline in amount or activity;
sense 5: loss, deprivation -- the disadvantage that results from losing something;
sense 6: personnel casualty, loss -- military personnel lost by death or capture;
sense 7: the experience of losing a loved one;
sense 8: passing, loss, departure, exit, expiration, going, release -- euphemistic expressions for death.
Based on the words co-occurring in earnC2, sense 3 of the word loss is selected by our algorithm, and it is the closest to the meaning of the context of earnC2.

3.1 Qualitative approach

In the qualitative approach, we examined the accuracy of the results produced by the algorithm by providing nine human judges with four clusters. The possible senses of each word, extracted from WordNet, were provided as well. The human judges were not informed of the categories the clusters represented. Using the other words in the same cluster as the only clues, the judges were asked to select the most appropriate sense for each target word. The results from each human judge were compared with the results generated by our algorithm. The score for each cluster was then normalized according to the number of words in the cluster. The average scores obtained by the nine human judges are shown in Table 2.

Despite being provided with a set of terms corresponding to a cluster of related terms, the human subjects chose senses that represent the typical meanings of these words. For example, the Internet sense of the word 'Net' was chosen rather than the financial sense. We repeated the same evaluation with a human judge who has knowledge of the dataset used, and in that case the financial sense was chosen. This study has therefore highlighted the need for human subjects with a deeper understanding of the domain to conduct the evaluation.

Table 2: Qualitative and baseline approach experimental results

Cluster    Accuracy
           Qualitative (without knowledge)   Qualitative (with knowledge)
earnC2     0.708                             0.875
crudeC2    0.635                             0.857
cornC2     0.642                             0.935
tradeC2    0.583                             0.667
Average    0.642                             0.837
3.2 Automatic Text Categorization

Based on the results of WSD, the words within each cluster that have a semantic similarity above 0.45 were identified. These terms were then used as the feature set of the document category. We compared the text categorization results of using the semantically related terms (employing WSD) with the original results of feature selection using Information Gain (without WSD). The experiment was carried out using WEKA (Waikato Environment for Knowledge Analysis) by applying the multinomial Naïve Bayes classifier. The experimental setup employed is the same as that used in [4]. The f-measure for each category is shown in Table 3.
Table 3: Automatic text categorization experimental results (F-measure)

Category    Without WSD        Employing WSD
            Accuracy    #      Accuracy    #
acq         0.942       50     0.883       30
corn        0.314       50     0.356       30
crude       0.734       50     0.751       35
earn        0.964       50     0.958       30
grain       0.603       50     0.621       35
interest    0.565       50     0.551       35
money-fx    0.603       50     0.649       30
ship        0.603       50     0.641       40
trade       0.569       50     0.642       35
wheat       0.362       50     0.381       30

(# = number of features)
The results highlight that the set of features employing WSD was not only able to capture the inherent semantics of the entire feature set, it was also able to remove noise: performance was better for seven out of ten categories. The results also demonstrate the ability of this reduced feature set to represent the context of the documents. The newly formed feature sets have been reduced to 30 to 40 (about 60% to 80% of the original 50) semantically related features per category.
4. Conclusion

In this paper, we presented a word sense disambiguation algorithm based on semantic similarity using WordNet, applied to collections of documents. The results of this algorithm are promising, as it is able to capture the root meanings of document collections. The results from text categorization also highlight the ability of our approach to capture the contextual meanings of words from document collections.
5. References

[1] Agirre, E. & Rigau, G., "A Proposal for Word Sense Disambiguation Using Conceptual Distance", Proceedings of Recent Advances in NLP (RANLP95), Tzigov Chark (Bulgaria), 1995, pp. 258-264.
[2] Baker, L. D. & McCallum, A. K., "Distributional Clustering of Words for Text Classification", Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 2002.
[3] Basili, R., Cammisa, M. & Zanzotto, F. M., "A Similarity Measure for Unsupervised Semantic Disambiguation", Proceedings of the Language Resources and Evaluation Conference, Lisbon, Portugal, 2004.
[4] Chua, S. & Kulathuramaiyer, N., "Semantic Feature Selection Using WordNet", 2004 IEEE/WIC/ACM International Conference on Web Intelligence, Beijing, China, 2004, pp. 166-172.
[5] Debole, F. & Sebastiani, F., "Supervised Term Weighting for Automated Text Categorization", Proceedings of SAC-03, 18th ACM Symposium on Applied Computing, Melbourne, ACM Press, New York, US, 2003, pp. 784-788.
[6] Jiang, J. J. & Conrath, D. W., "Semantic Similarity based on Corpus Statistics and Lexical Taxonomy", Proceedings of the International Conference on Research in Computational Linguistics (ROCLING X), Taiwan, 1997.
[7] Lin, D., "Using syntactic dependency as a local context to resolve word sense ambiguity", Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, Madrid, 1997.
[8] Pantel, P. & Lin, D., "Discovering Word Senses from Text", Proceedings of the ACM SIGKDD 02 International Conference on Knowledge Discovery & Data Mining, Edmonton, Alberta, Canada, 2002.
[9] Patwardhan, S., Banerjee, S. & Pedersen, T., "Using Measures of Semantic Relatedness for Word Sense Disambiguation", Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, 2003.
[10] Resnik, P., "Using Information Content to Evaluate Semantic Similarity in a Taxonomy", Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), 1995, pp. 448-453.
[11] Seco, N., Veale, T. & Hayes, J., "An Intrinsic Information Content Metric for Semantic Similarity in WordNet", Proceedings of ECAI'2004, the 16th European Conference on Artificial Intelligence, Valencia, Spain, 2004.
[12] WordNet, "Glossary of terms", http://wordnet.princeton.edu/gloss, 2005.
[13] Yarowsky, D., "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods", Proceedings of ACL'95, 1995.
ANALYSIS OF AGENTS FROM CALL TRANSCRIPTIONS OF A CAR RENTAL
PROCESS
Swati Challa
Shourya Roy, L. Venkata Subramaniam
Dept. of Computer Science & Engg.,
IIT Madras,
Chennai-600036, India.
swati.iitm@gmail.com
IBM India Research Lab,
IIT Delhi, Block-1,
New Delhi-110016, India.
{rshourya,lvsubram}@in.ibm.com
ABSTRACT
Telephonic conversations with call center agents
follow a fixed pattern, commonly known as call
flow. Each call flow is a sequence of states such as
greet, gather details, provide options, confirm details, conclude. We present a mechanism for segmenting the calls into these states using transcription of the calls. We also evaluate the quality of
segmentation against a hand tagged corpus. The
information about how the agents are performing
in their calls is crucial to the call center operations. In this paper we also show how the call segmentation can help in automating the monitoring
process thereby increasing the efficiency of call
center operations.
1. INTRODUCTION
Call centers provide dialog-based support from
specialized agents. A typical call center agent
handles a few hundred calls per day. While handling the calls, the agents typically follow a well-defined call flow. This call flow specifies how
an agent should proceed in a call or handle customer objections or persuade customers. Within
each state the agent is supposed to ask certain key
questions. For example, in a car rental call center,
before making a booking an agent is supposed to
confirm if the driver has a valid license or not.
Call centers constantly monitor these calls to improve the way agents function and also to analyze how customers perceive their offerings. In
this paper, we present techniques to automatically
monitor the calls using transcriptions of the telephonic conversations. Using NLP techniques, we
automatically dissect each call into parts, corresponding to the states mentioned in call flow. We
provide a quantitative measure to evaluate how
well the call flow has been followed. We also propose a simple technique to identify if key questions are being asked within each segment or not.
Using this automatic technique, we show that we
are able to identify lapses on the part of the agent.
This information is crucial to the call center management as it allows them to identify good and
bad agents and train them accordingly. We evaluate the performance of our technique and show
that it achieves good accuracy.
2. BACKGROUND AND RELATED WORK
2.1. Text Segmentation
Automatically partitioning text into coherent segments has been studied extensively. In [5], segmentation is done based on the similarity of the chunks of words appearing to the left and right of a candidate boundary. This approach, called TextTiling, can be used to identify subtopics within a text. In [2], a statistical model is presented for text segmentation.
2.2. Key Phrase Extraction
Extracting sentences and phrases that contain important information from a document is called key
phrase extraction. Key phrase extraction is an
important problem that has been studied extensively. For example, in [4] the key phrases are
learnt based on a tagged corpus. Extraction of key
phrases from noisy transcribed calls has also been
studied. For manually transcribed calls in [7] a
phrase level significance estimate is obtained by
combining word level estimates that were computed by comparing the frequency of a word in
a domain-specific corpus to its frequency in an
open-domain corpus. In [9] phrase level significance was obtained for noisy transcribed data
where the phrases are clustered and combined
into finite state machines.
Figure 1: Micro and macro segmentation accuracy using different methods
2.3. Processing of Call Center Dialogs
A lot of work on automatic call type classification for categorizing calls [8], call routing [6], obtaining call log summaries [3], agent assisting and
monitoring [7] has appeared in the past. In [1] call
center dialogs have been clustered to learn about
dialog traces that are similar.
3. CALL SEGMENTATION
A call can typically be broken up into call segments based on the particular action being performed in that part of the call. Here we present
a mechanism for automatically segmenting the
calls and evaluating the quality of segmentation
against a hand tagged corpus. The calls are in
XML format with relevant portions marked. Any
given call can be divided into a maximum of nine
segments. They are Greeting, Pickup-return details, Membership, Car options and rates, Customer objection and objection handling, Personal
details, Confirm specifications, Mandatory checks
and details and Conclusion. From the training set
of documents which are segmented manually we
extracted two sets of keywords for each segment:
• Frequent keywords, obtained by taking the trigrams and bigrams with the highest frequency in each segment. Unigrams are avoided because most of the high-frequency words are stopwords (like 'the', 'is', etc.).
• Discriminating keywords, obtained by taking the ratio of the frequency of the frequent phrases (unigrams, bigrams and trigrams) in a particular segment to their frequency in the whole corpus, with preference given to trigrams. The top 10 or 20 words are chosen as keywords for each segment.
Using these keywords, we segment the booked and unbooked calls automatically, with knowledge of the call flow, by marking the beginning and end of each segment with the corresponding XML tags; a sketch of the keyword extraction follows.
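The sketch below is a minimal reading of this keyword extraction, assuming tokenised turns are available per segment; the tokenisation and the top-k cut-off are placeholders.

```python
# Minimal sketch: "frequent" and "discriminating" keywords per segment,
# as described above. Tokenisation and the top-k cut-off are placeholders.
from collections import Counter

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def segment_keywords(segment_tokens, corpus_tokens, top_k=20):
    # Frequent keywords: highest-frequency bigrams and trigrams (no
    # unigrams, which are mostly stopwords).
    seg = Counter(g for n in (2, 3) for g in ngrams(segment_tokens, n))
    frequent = [g for g, _ in seg.most_common(top_k)]
    # Discriminating keywords: ratio of in-segment frequency to
    # whole-corpus frequency.
    corpus = Counter(g for n in (2, 3) for g in ngrams(corpus_tokens, n))
    discriminating = sorted(seg, key=lambda g: seg[g] / max(corpus[g], 1),
                            reverse=True)[:top_k]
    return frequent, discriminating
```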
Accuracy

To evaluate the accuracy of this segmentation, we compare its performance with the manual segmentation. The accuracy metrics used are:

• Micro Efficiency, computed as
  microEff = (1/n) Σ_calls (turnsMatch / turnsCount)

• Macro Efficiency, computed as
  macroEff = (1/n) Σ_calls [ (Σ_segments turnsMatchInASegment / turnsInASegment) / segmentCount ]

where
turnsMatch = number of turns (one continuous line/sentence spoken by agent/customer) where the automatically assigned segment is the same as the manually assigned segment;
turnsCount = total number of turns in a call;
n = total number of calls in the test corpus;
turnsMatchInASegment = number of turns within a segment where the automatically assigned segment is the same as the manually assigned segment;
turnsInASegment = number of turns in the corresponding correct manual segment;
segmentCount = number of correct segments in the call.

Figure 2: Segmentation accuracy on agent data and combined agent+customer transcriptions
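A minimal sketch of these two metrics follows, assuming each call is given as a list of (automatic, manual) segment labels, one pair per turn.

```python
# Minimal sketch of the micro and macro efficiency defined above.
# `calls` is a list of calls; each call is a list of (auto, manual) labels.
def micro_eff(calls):
    return sum(sum(a == m for a, m in call) / len(call)
               for call in calls) / len(calls)

def macro_eff(calls):
    per_call = []
    for call in calls:
        stats = {}                       # manual segment -> (matches, turns)
        for a, m in call:
            hit, total = stats.get(m, (0, 0))
            stats[m] = (hit + (a == m), total + 1)
        per_call.append(sum(h / t for h, t in stats.values()) / len(stats))
    return sum(per_call) / len(per_call)
```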
From Figure 1 we can see that the segmentation
accuracy for manually chosen keywords is almost
same as that of discriminating keywords.
4. SEGMENTATION ON AGENT DATA
ALONE
The transcribed data is very noisy in nature. Since it is spontaneous speech, there are repeats, false starts and a lot of pause-filling words such as um and uh. Further, there is no punctuation, and there are about 5-10% transcription errors. The ASR (Automatic Speech Recognition) system used in practice gives a transcription accuracy of about 60-70%. The number of agents in a call center is limited, so we can train the ASR system on the agents' voices to increase the transcription accuracy to 85-90%. Hence, if we do the segmentation on agent data alone, the accuracy will be much higher compared to agent+customer data, because of the low transcription accuracy on the customer side. Here we extract only the agent conversation part from the corpus and repeat the above segmentation process to get the efficiency. From the results in Figure 2 we can see that the segmentation efficiency is almost equal to the efficiency obtained using the original call transcription with both agent and customer. This is the case for manually transcribed calls.
5. EVALUATION OF AGENTS
In this section we show how call segmentation can help in automating the monitoring process. To see how effectively we can perform using segmentation, we take a set of 60 calls and divide them manually into two sets, depending on whether the call contains the key task or not. Now, for each key task, we look for its specific keywords in the positive and negative instances of the key task separately. For example, to check if the agent has confirmed that the credit card is not a check/debit card, we can look for the keywords 'check', 'cheque', 'debit', 'which is not', 'that is not'. We search for the corresponding keywords in the particular segment where the key task is supposed to be present (for example, confirming that the customer has a clean driving record should be present in the mandatory checks and details segment) and compare the result with the keyword matches in the entire call. The comparison results are shown below for the following key tasks:
1. Ask for sale
2. Confirm if the credit card is not a check/debit
card
3. Ask the customer for future reservations
4. Confirm if the customer is above 25 years of age
5. Confirm if the customer has a major credit
card in his own name
Table 1: Statistics for positive instances

Key Task      | #1 | #2 | #3 | #4 | #5
No. of Calls  | 19 | 28 | 22 | 40 | 12
With Seg.     | 18 | 28 | 22 | 40 | 12
Without Seg.  | 18 | 28 | 22 | 40 | 12

Table 2: Statistics for negative instances

Key Task      | #1 | #2 | #3 | #4 | #5
No. of Calls  | 41 | 32 | 38 | 20 | 48
With Seg.     | 38 | 32 | 38 | 20 | 48
Without Seg.  | 35 | 19 |  1 | 12 |  1

From the statistics for negative instances we can see that there are a large number of instances which are wrongly detected as containing the key task without segmentation, because the keywords are likely to occur in other segments also. For example, consider key task
#3, i.e. the agent asking the customer for future reservations: we look for keywords like 'anything else', 'any other', 'help', 'assist'. These are likely to occur in other segments also, such as the greeting. So, by looking at the entire call, it is not possible to capture whether the agent has performed a particular key task or not. Hence, by automating the agent monitoring process we can increase the efficiency of call center operations.
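A minimal sketch of this check is given below. The keyword lists follow the examples above, but the segment assignments are illustrative assumptions, not the paper's agent monitoring form.

```python
# Minimal sketch: search a key task's keywords only inside the segment
# where the task is supposed to occur. Segment assignments here are
# assumed for illustration.
KEY_TASKS = {
    "card is not check/debit": (
        ["check", "cheque", "debit", "which is not", "that is not"],
        "mandatory checks and details"),
    "ask for future reservations": (
        ["anything else", "any other", "help", "assist"],
        "conclusion"),                 # assumed segment for this task
}

def key_task_performed(call_segments, task):
    # call_segments: dict mapping segment name -> transcribed text
    keywords, segment = KEY_TASKS[task]
    text = call_segments.get(segment, "").lower()
    return any(k in text for k in keywords)
```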
6. CONCLUSIONS
We have tried different approaches for automatically segmenting a call and obtained good segmentation accuracy. We showed that we can achieve the same segmentation accuracy using agent data alone, which greatly reduces the effect of transcription errors. We also showed that segmentation helps in automating the agent evaluation process, thus increasing the efficiency of call center operations.
7. FUTURE WORK
In the future, we plan to explore other segmentation techniques, such as Hidden Markov Models, for automatically capturing the state information and thus automatically extracting the call flow. We intend to reduce the effect of transcription errors on segmentation by using spell-checking techniques. So far we have hand-coded the key tasks from the agent monitoring form; we also hope to automate this process.
8. REFERENCES
[1] F. Bechet, G. Riccardi, and D. Hakkani-Tur. Mining spoken dialogue corpora for system evaluation and modeling. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, July 2004.
[2] D. Beeferman, A. Berger, and J. Lafferty. Statistical models for text segmentation. Machine Learning, 34:177–210, 1999.
[3] S. Douglas, D. Agarwal, T. Alonso, R. M. Bess, M. Gilbert, D. F. Swayne, and C. Volinsky. Mining customer care dialogs for "daily news". IEEE Transactions on Speech and Audio Processing, 13(5):652–660, 2005.
[4] E. Frank, G. W. Paynter, I. H. Witten, C. Gutwin, and C. G. Nevill-Manning. Domain-specific keyphrase extraction. In Proc. Sixteenth International Joint Conference on Artificial Intelligence, pages 668–673, San Francisco, CA, 1999.
[5] M. A. Hearst. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33–64, 1997.
[6] H.-K. J. Kuo and C.-H. Lee. Discriminative training of natural language call routers. IEEE Transactions on Speech and Audio Processing, 11(1):24–35, 2003.
[7] G. Mishne, D. Carmel, R. Hoory, A. Roytman, and A. Soffer. Automatic analysis of call-center conversations. In Proceedings of the Conference on Information and Knowledge Management, Bremen, Germany, October 31–November 5, 2005.
[8] M. Tang, B. Pellom, and K. Hacioglu. Call-type classification and unsupervised training for the call center domain. In Proceedings of the Automatic Speech Recognition and Understanding Workshop, St. Thomas, US Virgin Islands, November 30–December 4, 2003.
[9] J. Wright, A. Gorin, and G. Riccardi. Automatic acquisition of salient grammar fragments for call-type classification. In Proceedings of the Automatic Speech Recognition and Understanding Workshop, Rhodes, Greece, September 1997.
Integration Techniques for a Multimodal Speech and Sketch Map-based System
Loh Chee Wyai
cheewyai@gmail.com
Alvin W. Yeo
alvin@fit.unimas.my
Narayanan K.
nara@fit.unimas.my
Faculty of Computer Science and Information Technology
Universiti Malaysia Sarawak
Malaysia
Abstract
As described by Oviatt et al. (1997), when two or more modalities work together, the integration technique used for combining the different modalities into a whole system is very important. The integration technique is the main determinant in guiding the design of the multimodal system. For a map-based multimodal system, we propose to use natural language processing (NLP) to identify and describe the syntax and the semantics of prepositions within the spoken speech, in particular when it relates to maps or directions. From the results, the syntactic and semantic behaviours of the prepositions will be used for integration with the sketch. This integration technique is expected to produce a better solution for map-based multimodal systems. One of the possible frameworks to be used for processing the speech text is PrepNet.
1. Introduction
Figure 1: Integration technique using time interval adopted
from (Lee and Yeo, 2005).
Speech and sketch are two modalities that humans naturally
use to communicate with each other especially when they
want to relate certain items like locating a place in the map or
passing some information, which requires sketching a diagram
for better understanding. Sketching with pen and paper comes
naturally to most of us (Li et al., 2005) and speech is a main
medium in our daily human communication (Atsumi et al.,
2004). The combination of these two modalities allows much
more precise and robust recognition (Zenka and Slavík, 2004).
Yet the integration technique in determining the correct pair of
multimodal inputs remains a problem in multimodal fusion of
both speech and sketch (Oviatt et al., 1997).
For a map-based system, overlapping events occur frequently. For example, in Figure 1, inputs from one event interfere with the time slot allocated to another event: the speech input of event 1 falls into event 2's time slot. This leads to the wrong pairing of the mode inputs. For instance, the pen gesture input for event 1 ends up with no pairing, as its speech input did not occur during the time interval, while the pen gesture input for event 2 could be paired with the speech inputs of both event 1 and event 2, since they fall within the same time interval. Thus, the inputs are paired wrongly.
2. Related Work

• INTEGRATION TECHNIQUE WITH TIME INTERVAL

Currently there are a few methods used to integrate speech with sketch in multimodal systems. The commonly used method is the Unification-based Multimodal Integration Technique for resolving multimodal fusion (Oviatt et al., 2000; Oviatt, 1999; Oviatt and Olsen, 1994). In this integration technique, a temporal constraint is used as the integration parameter. In order to use this temporal constraint, the speech and sketch inputs need to be time-stamped to mark their beginning and their end. The main drawback here is the use of a time interval as the integration parameter: the continuous stream of spoken and sketch inputs would consist of several sentences and sketched objects that might not be arranged in the correct order.

In addition, correct inputs would be discarded when the second input complementing the first input of the same event does not occur within the preset time interval. Normally, the time interval used is 4 seconds after the end of the first input, as in the unification integration architecture. If a sketch input occurs first, the system waits 4 seconds to capture the corresponding speech to complete the event, and vice versa. If no input is detected within that time interval, the whole event (at that point containing only one input) is discarded. This condition is shown in event 1 of Figure 1: although there is a pair of speech and pen gesture inputs for event 1, the spoken input did not occur within the time interval, so the pen gesture event would be cancelled. This leads to correct inputs being discarded whenever an input actually occurs outside the time interval. Four conditions that fail to fulfil time-interval integration for a map-based system are shown in Figure 2.
Figure 2: Different types of conditions occurring in speech and sketch when using a multimodal map-based system (adopted from Lee and Yeo, 2005).

The first condition (Condition 1) is the absence of a speech event for an object. This condition occurs when users do not talk about the object. The second condition (Condition 2) is where the speech event occurs before the onset of the sketch event for the same object. In this condition, based on the time-interval integration technique, only a speech event occurring after the user's sketch event would be accepted for the object. Therefore, this speech event is not found even though it actually occurred, and the failure to accept the speech input leads to the corresponding sketch input being discarded. The third condition (Condition 3) occurs when the wrong pair of speech and sketch events is integrated. This normally happens when users describe more than one object while performing a sketch event. The last condition (Condition 4) is where the speech or sketch event for an object does not occur within the time-out interval (4 seconds); such an event is directly discarded by this integration technique, which is based on a time-out interval.

• INTEGRATION TECHNIQUE WITH SPATIAL QUERY

Lee and Yeo (2005) propose a technique using spatial integration to match the speech and sketch inputs in a map retrieval system. In this integration technique, an Object Identification process is used to identify the occurrences of spatial objects within sentences. If an object name is found within a sentence, the name is captured and stored as a new spatial object, and the sentences are then broken down into words. Language parsing rules from natural language processing (NLP) are adapted to identify the occurrence of the objects within the sentences.

However, the grammar parsing rule adapted by Lee and Yeo (2005) is only a basic parsing rule, in which a simple syntactic approach is used to interpret the speech inputs. The rule is mainly used to extract the objects from the sentences, and only the preposition of location is accepted as an element of spatial information. The preposition is then checked to describe the relationships between objects in terms of topology, direction and distance.

The sketch inputs for the system were limited to spatial objects with the polygon data type in the spatial scene. Topological, relative directional and relative distance relations are taken into account, as shown in Table 1 below.

Table 1: Object relations using topological, relative directional and relative distance relations.

Reference Object | Object ID | Topology | Direction | Distance (map unit)
1                | 2         | Disjoint | Southwest | More than 0.00
2                | 1         | Disjoint | Northeast | More than 0.00

Then, the preposition relationships between objects in terms of topology, direction and distance were used to integrate the sketch inputs of topology, relative direction and relative distance, as shown in Figure 3 below.

Figure 3: Topological model (adopted from Lee and Yeo, 2005).

Based on Table 2, the success rate from Lee and Yeo (2005) using this integration technique is around 52%, compared to the Unification-based Integration Technique.
Table 2: Summary of results obtained with the Unification-based and Spatial Information integration techniques.

Analysed items                 | Unification-based Integration Technique, success rate (%) | Spatial Information Integration Technique, success rate (%)
Integration accuracy           | 36 | 52
Integrated spatial object      | 36 | 50
Integrated spatial description | 63 | 100

3. Proposed Integration Technique
We propose that prepositions be used because the prepositions used in a map-based system are very useful for tasks such as knowledge extraction, since they convey basic meanings of much interest, like localisations, approximations, means, etc. Thus, a semantic preposition grammar-parsing rule can be applied to meet the needs of interpreting speech inputs. By categorising the different senses that prepositions may have in a map-based system, a correct integration becomes more likely. PrepNet is one of the frameworks that can be used to describe the syntax and semantics of the prepositions used to interpret speech inputs in a map-based system.
As for Lee and Yeo's (2005) spatial query technique, the limitation is that the semantics of the words was not taken into consideration.
For example, for the preposition 'next to', the result obtained from PrepNet contains a facet, a gloss, and syntactic and semantic frames. The facet and gloss from PrepNet define 'next to' as a precise position in which A is in contact with B. The syntactic frame is 'A next to B' and the semantic frame is 'next to (D, B)', where D is a location and B is a place or thing. With this extra information extracted, the sketched objects can be related to the speech more reliably, which suggests a higher chance of accurate integration.
As for the sketch, users normally use lines to represent roads, rivers or railway tracks, and polygons to represent regions, boundaries or buildings (Blaser, 2000). A semantic sketch recogniser, Vectractor, is used to identify the lines and polygons that are available on a map. Using Vectractor, all the identified lines and polygons are represented as buildings, roads and rivers in Scalable Vector Graphics (SVG) format. These can be matched with the preposition results from the processed speech input as a possible integration technique.
For example, consider an object A next to an object B. By using SVG, the objects are not only identified as polygons or lines; the relative distance between the objects is also identified, by calculating it from the coordinates available in the SVG format. If the relative distance between object A and object B falls into the category of 'next to', in which A is in contact with B within a relative distance, then the result will be used to match against the results from the speech.
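As a rough illustration of this matching step, the sketch below reduces sketched objects to centroids computed from SVG coordinates and the preposition to a distance threshold; both the threshold and the function names are assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed names and threshold): decide whether the spatial
# relation named by a preposition holds between two sketched SVG objects.
import math

NEXT_TO_MAX_DISTANCE = 1.0   # assumed threshold in map units

def centroid(points):
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)

def next_to(object_a_points, object_b_points):
    # Relative distance between centroids computed from SVG coordinates.
    return math.dist(centroid(object_a_points),
                     centroid(object_b_points)) <= NEXT_TO_MAX_DISTANCE

# A spoken "A is next to B" would be integrated with the sketch only when
# next_to(points_of_a, points_of_b) also holds.
```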
4. References

[1] Atsumi Imamiya, Kentaro Go, Qiaohui Zhang, Xiaoyang Mao (2004). Overriding Errors in a Speech and Gaze Multimodal Architecture. Department of Computer and Media, University of Yamanashi, Japan. Proceedings of IUI 2005, Madeira, Funchal, Portugal.
[2] Blaser, A. (2000). Sketching Spatial Queries. PhD Thesis. National Center of Geographic Information and Analysis, Orono, University of Maine.
[3] Lee, B., Yeo, A. (2005). Integrating Sketch and Speech Inputs using Spatial Information. Proceedings of ICMI 2005, Trento, Italy.
[4] Lee, B., Yeo, A. (2005). Multimodal Spatial Query. Master Thesis. Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, Sarawak, Malaysia.
[5] Li, J., Dai, G., Xiang, A., Xiwen Zhang (2005). Sketch Recognition with Continuous Feedback Based On Incremental Intention Extraction. Intelligence Engineering Lab & Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China. Proceedings of IUI 2005, San Diego, California, USA.
[6] Oviatt, S. L. and Olsen, E. (1994). Integration Themes in Multimodal Human-Computer Interaction. In Shirai, Furui, and Kakehi (eds.), Proceedings of the International Conference on Spoken Language Processing, 2, pp. 551-554. Acoustical Society of Japan.
[7] Oviatt, S. L., DeAngeli, A., and Kuhn, K. (1997). Integration and Synchronization of Input Modes During Multimodal Human-Computer Interaction. Proceedings of the Conference on Human Factors in Computing Systems (CHI'97), pp. 415-422. New York: ACM Press.
[8] Oviatt, S. L. (1999). Mutual Disambiguation of Recognition Errors in a Multimodal Architecture. Proceedings of the Conference on Human Factors in Computing Systems: CHI '99, New York, NY: ACM Press, pp. 576-583.
[9] Oviatt, S. L., Cohen, P. R., Wu, L., Vergo, J., Duncan, L., Suhm, B., Bers, J., Holzman, T., Winograd, T., Landay, J., Larson, J. & Ferro, D. (2000). Designing the User Interface for Multimodal Speech and Pen-based Gesture Applications: State-of-the-Art Systems and Future Research Directions. Human Computer Interaction, Vol. 15, No. 4, pp. 263-322.
[10] Zenka, R., Slavik, P. (2004). Supporting UI Design by Sketch and Speech Recognition. Czech Technical University in Prague, Czech Republic. Proceedings of TAMODIA 2004, Prague, Czech Republic.
Rule based Automated Pronunciation Generator
Ayesha Binte Mosaddeque
Department of Computer Science & Engineering
BRAC University
66 Mohakhali, Dhaka-1212.
Bangladesh
Email: lunaticbd@yahoo.com
Abstract
This paper presents a rule based pronunciation generator for Bangla words. It takes a
word and finds the pronunciations for the graphemes of the word. A grapheme is a
unit in writing that cannot be analyzed into smaller components. Resolving the
pronunciation of a polyphone grapheme (i.e. a grapheme that generates more than one
phoneme) is the major hurdle that the Automated Pronunciation Generator (APG)
encounters. Bangla is partially phonetic in nature, thus we can define rules to handle
most of the cases. Besides, up till now we lack a balanced corpus which could be used
for a statistical pronunciation generator. As a result, for the time being a rule-based
approach towards implementing the APG for Bangla turns out to be efficient.
1 Introduction
Based on the number of native speakers, Bangla (also known as Bengali) is the fourth most
widely spoken language in the world [1]. It is the official language in Bangladesh and one of
the official languages in the Indian states of West Bengal and Tripura. In recent years, Bangla websites and portals have become more and more common. As a result, it has become essential to develop Bangla from a computational perspective. Furthermore, Bangla has Hindi, Assamese and Oriya among its sister languages, as they have all descended from Indo-Aryan, with Sanskrit as one of the temporal dialects. Therefore, a suitable implementation of an APG for Bangla would also help advance knowledge of these other languages.
The Bangla script is not completely phonetic since not every word is pronounced according to
its spelling (e.g.,
/bɔd͍dʰ͍ o,
/mod͍d͍ʰo, /ɔkʰon, /ekʰane). These cases can be
handled by rules of pronunciation. Therefore, we need some pre-defined rules to handle the general cases and some case-specific rules to handle exceptions. These issues are discussed in more detail later on.
2 Previous Work
A paper about grapheme-to-phoneme mapping for Hindi [2] provided the concept that an APG for Bangla that maps graphemes to phonemes can be rule-based. No such work has yet been made available for Bangla.
Although Bangla does have pronunciation dictionaries, these are not equipped with automated generators and, more importantly, they are not even digitized. However, the pronunciation dictionary by Bangla Academy provided us with a significant number of the phonetic rules [3], and the phonetic encoding part of the open source transliteration software 'pata' [4] provided a basis.
3 Methodology
In the web version of the APG, queries are taken in Bangla text and the phonetic form of the given word is generated using IPA (International Phonetic Alphabet) transcription. Furthermore, there is another version of the system which takes a corpus (a text file) as input and outputs another file containing the input words tagged with their corresponding pronunciations. This version can be used in a TTS (Text To Speech) system for Bangla.
In generating the pronunciation of Bangla graphemes, a number of problems were encountered. Consonants (except for ‘/ʃ' and ‘ /') that have vocalic allographs (with the exception of ‘◌') are considerably easy to map. However, there are a number of issues. Firstly, the real challenge for a Bangla pronunciation generator is to distinguish the different vowel pronunciations. Not all vowels, however, are polyphonic: ‘a/ɔ’ and ‘e/e' have polyphones (‘/ɔ’ can be pronounced as [o] or [ɔ], and ‘e/e' can be pronounced as [e] or [æ], depending on the context), and dealing with their polyphonic behaviour is the key problem. Secondly, the consonants that do not have any vocalic allograph have the same trouble, as the pronunciation of the inherent vowel may vary. Thirdly, the two consonants ‘/ʃ’ and ‘ /’ also show polyphonic behaviour. And finally, the ‘consonantal allographs’ ( /ɟ, ◌/r, ◌/b, ◌/m ) and the grapheme ‘/j' complicate the pronunciation system further.
4 Rules
The rule-based automated pronunciation generator produces the pronunciation of any word using rules. As explained earlier, the Bangla script is not completely phonetic, since not every word is pronounced in accordance with its spelling. For example, the words ‘ ে/ɔnek’ and ‘ ি/ot̼i’ both start with ‘ /ɔ’ but their pronunciations are [ɔnek] and [ot̼i] respectively. These changes in the pronunciation of ‘ ’ are captured by the phonetic rules:
+ C + ◌ ( ) >
+ / C + ◌ ( ) > , where C= Consonant
An additional rule related to the pronunciation of ‘a/ɔ’ is that if ‘a/ɔ’ is followed by ‘/n’
without any vocalic allographs then ‘a/ɔ’ is pronounced as [ɔ]. For example, ‘ /ɔnol’,
‘ /ɔnto ’.
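A minimal sketch of how such ordered rewrite rules can be applied is shown below; it uses a romanised stand-in for the Bangla graphemes, so the rule patterns are illustrative assumptions rather than the APG's actual rule set.

```python
# Minimal sketch of rule-based grapheme-to-phoneme rewriting. The rules
# below are romanised stand-ins for the Bangla rules described above
# (e.g. inherent vowel before 'n' -> [ɔ]); they are not the APG's rules.
import re

RULES = [
    (re.compile(r"a(?=n)"), "ɔ"),  # assumed: inherent vowel before 'n'
    (re.compile(r"a"), "o"),       # assumed default for the inherent vowel
]

def g2p(word):
    phonemes = []
    while word:
        for pattern, phoneme in RULES:
            match = pattern.match(word)
            if match:
                phonemes.append(phoneme)
                word = word[match.end():]
                break
        else:
            phonemes.append(word[0])  # copy graphemes no rule covers
            word = word[1:]
    return "".join(phonemes)

print(g2p("anol"))  # -> "ɔnol"
```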
Another polyphonic grapheme is ‘e/e’; it has two different pronunciations, [e] and [æ]. For example, ‘ি/eki’ and ‘া/æka’. This change of pronunciation is captured by the following pronunciation rules:
◌ / + C + / ◌ / / ◌ / / ◌ / / ◌ / / ◌ / / ◌ >
◌ / + ◌ > , where C = Consonant
There are some rules that have been developed by observing general patterns, e.g., if the
length of the word is three full graphemes (e.g. /kɔlom,
/kʰɔbor, /bad͍la,
/kolmi etc.) then the inherent vowel of the medial grapheme (without any vocalic
allograph) tends to be pronounced as [o], provided the final grapheme is devoid of vocalic
allographs (e.g., /kɔlom,
/kʰɔbor). When the final grapheme has adjoining vocalic
allographs, the inherent vowel of the medial grapheme (e.g. /bad͍la, /kolmi) tends
to be silent (i.e., silenced inherent vowels can be overtly marked by attaching the diacritic
‘◌ ' ).
Hypothetically, all pronunciations are supposed to be resolved by the existing phonetic rules. In fact they are not; some of them require heuristic assumptions. Apart from the rules found in the pronunciation dictionary by Bangla Academy [3], some heuristic rules are used in the APG. They were formulated while implementing the system. Most of them serve the purpose of generating the pronunciation for some specific word pattern. All the rules are available at http://student.bu.ac.bd/~u02201011/RAPG1 .
5 Implementation
The APG has been implemented in Java (jdk1.5.0_03). The web version of the APG contains a Java applet that can be used with any web client that supports applets. The other version of the APG is also implemented in Java. Both versions generate the pronunciation on the fly; to be precise, no look-up file is involved. Figure 1 illustrates the user interface of the web version and Figure 2 illustrates the output format of the other version.
Figure 1: The web interface of the APG, showing an input word and its generated pronunciation.
Figure 2: The output file generated by the plug-in version of the APG.
6 Results
The performance of the rule-based APG proposed in this paper is challenged by the partially phonetic nature of the Bangla script. The accuracy of the proposed APG for Bangla was evaluated on two different corpora collected from a Bangla newspaper. The accuracy rates observed are shown in Table 1:
Table 1
Number of words | Accuracy Rate (%)
736             | 97.01
8399            | 81.95
The high accuracy rate on the 736-word corpus is explained by the fact that the patterns of the words in this corpus were used for generating the heuristic rules. The words in the other corpus were chosen randomly. The error analysis was done manually by matching the output against the Bangla Academy pronunciation dictionary.
7 Conclusion
The proposed APG for Bangla has been designed to generate the pronunciation of a given Bangla word using a rule-based approach. The main challenge in implementing the APG was to deal with the polyphonic graphemes. Due to the lack of a balanced corpus, we had to select the rule-based approach for developing the APG. However, a possible future upgrade is a hybrid approach comprising both a rule-based and a statistical grapheme-to-phoneme converter. Also, including a look-up file would increase the efficiency of the current version of the APG immensely. This would allow the system to access a database for look-up: any given word would first be looked up in the database (where the correct pronunciation is stored); if the word is there, the corresponding pronunciation goes to the output; otherwise, the pronunciation is deduced using the rules.
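A minimal sketch of that look-up-first behaviour, with hypothetical names:

```python
# Minimal sketch of the proposed upgrade: consult a stored pronunciation
# dictionary first and fall back to the rule-based generator otherwise.
def pronounce(word, dictionary, rule_based_g2p):
    pron = dictionary.get(word)        # e.g. digitized dictionary entries
    return pron if pron is not None else rule_based_g2p(word)
```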
References
[1] The Summer Institute for Linguistics (SIL) Ethnologue Survey (1999).
[2] Monojit Choudhury, “Rule-based Grapheme to Phoneme Mapping for Hindi Speech
Synthesis”. Proceedings of the International Conference On Knowledge-Based Computer
Systems, Vikas Publishing House, Navi Mumbai, India, pp. 343 – 353, 2002. Available online
at: http://www.mla.iitkgp.ernet.in/papers/G2PHindi.pdf
[3] Bangla Uchcharon Obhidhan, Bangla Academy, Dhaka, Bangladesh.
[4] Transliteration Software - Pata, developed by Naushad UzZaman, CRBLP, BRAC University. Available online at: http://student.bu.ac.bd/~naushad/pata.html
Transliteration from Non-Standard Phonetic Bengali to
Standard Bengali
Sourish Chaudhuri (sourish@iitkgp.ac.in)
Supervisor: Monojit Choudhury (monojit@cse.iitkgp.ernet.in)
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur,
WB, India-721302.
ABSTRACT
In this paper, we deal with transliterations from non-standard
forms of Bengali words written in English to their standard forms.
Familiarity of users with standard English keyboards makes it
easy for them to represent Bengali words with English letters (i.e.
Roman Script). Starting from a list of commonly used Bengali
words compiled from a corpus, we obtain a pronouncing
dictionary using a Grapheme-to-Phoneme converter. We propose
a novel method based on heuristic search techniques over the
pronouncing dictionary for transliteration of a word written in
Bengali using Roman script to its standard Bengali form.
Applications of this technique include design of phonetic
keyboards for Bengali, automatic correction of casually written
Bengali in Roman English, query correction for Bengali search
over the web, and searching loosely transliterated named entity.
The techniques used are generic and can be readily adapted to
other languages.
1. INTRODUCTION
Bengali is the language of more than 270 million speakers, spoken primarily in Bangladesh and the eastern part of India. Nevertheless, there is no single standard keyboard for Bengali that is used globally to input Bengali text into electronic media and devices such as computers, cell phones and palm-tops. Besides, the number of letters in the Bengali alphabet is considerably larger than that of the English alphabet. Therefore, common Bengali speakers, who are familiar with standard English keyboards like QWERTY, find it convenient to transliterate Bengali into Roman script while typing on an electronic medium or device. In this work we propose a technique to transliterate a Bengali word written in English (i.e. Roman script, henceforth RB) to the standard Bengali form (henceforth SB).

Given an input word, the decoder should be able to generate the corresponding standard form. We model the problem as a noisy channel process: we assume that the standard word is distorted to the corresponding RB form while being transmitted over the noisy channel. The channel is modeled using a G2P mapper coupled with statistical methods. The novelty of the work lies in the use of efficient data structures and the application of heuristic search techniques for fast retrieval of the possible SB forms. Although we report our work for Bengali, the technique is generic and easily adaptable to other languages.
There are certain phonetically based standardized encodings for Bengali characters using the Roman script [1]. However, the average user, when writing transliterated text messages, rarely sticks to the encoding scheme. Her transliterations are based on the phonetic transcription of the word, and hence we encounter situations where the same English letter can represent different Bengali letters.

A transliteration scheme that efficiently converts noisy text to standard forms would find application in a number of fields, such as:

1. Information retrieval for the Bengali language
2. Chat/SMS in Bengali
3. Automatic text-to-speech synthesis from transliterated texts like chats, SMS, blogs, emails, etc.
4. Automatic correction tools for text documents
5. Design of a phonetic keyboard or an interface for entering Bengali text using a QWERTY keyboard

The transliteration scheme might especially help in web searches for named entities. It is quite likely that the name of a person may be spelt differently by different people who are unaware of the exact spelling. In that case, a technique that can recover the actual name, overcoming spelling variations, would greatly improve the results. For example, the name "Saurav Ganguly" might be spelt by different sources/users as "Sourav Ganguly", "Saurabh Ganguly" or even "Sourabh Ganguly". If all these representations can be mapped to the same name, the efficiency of the search can be further increased.
2. BACKGROUND
There are several techniques to carry out transliteration and back-transliteration [2-6]. Previously, researchers have built transliteration engines between English and Japanese [2], Chinese [3], Bengali [4], Arabic [5], etc. Most of these works model the problem as a noisy channel process [4]. Phonetic information [3] and other representational information [6] are also commonly used. However, most of these methods are confined to letter- or syllable-level phonetic transcriptions.

As we shall see shortly, such methods fail to elicit the correct transliteration in several cases that are encountered quite commonly. N-gram statistics applied over letters or syllables can alleviate the problem to some extent, but statistical methods call for a large amount of parallel data (transliterated words and their standard counterparts), which is difficult to acquire. Moreover, the accuracy of these models depends on the training data. This results in poor system performance whenever the training and the test data are from different sources or exhibit different distributional patterns. We make use of an accurate word-level G2P converter for Bengali to circumvent the aforementioned problems.

There are several hard issues that need to be solved for transliterating RB to SB. All of these issues crop up due to the many-to-many mapping between the RB and SB units. We find that there are cases where one Bengali character may be represented by more than one English character, and also cases where one English character can stand for more than one Bengali character. For instance, the English 'a' might be used to represent both the Bengali letters 'a' and 'A', e.g. 'jala' (water) and 'jAla' (net) might both be written in RB as 'jal'. Similarly, the Bengali letter 'a' might be represented using both 'a' and 'o' from the English alphabet. (In this paper, we use ITRANS [1] to represent the standard Bengali forms; owing to the phonemic nature of Bengali, the pronunciations are also written following the same convention.)

Take the example of the Bengali word 'jala' meaning water. There are two letters in the Bengali alphabet that are pronounced 'ja'. The ITRANS representation for one is 'j' while that for the other is 'y'. Thus, for the word, we might have any of the following representations: 'jala', 'jAla', 'yala', 'yAla'. We can use a lexicon to eliminate the possibilities 'yala' and 'yAla'. However, to disambiguate between the other two options, we need to know the context. To distinguish between such forms, we require context information in some cases, while in others the disambiguation can be carried out without any context information. In this work, we deal with transliterations only at the word level, and hence context-based disambiguation is beyond the scope of this paper.

Further, there is an inherent Bengali vowel 'a' at the end of most Bengali words that do not end with some other vowel. This vowel is silent in most cases - a phenomenon known as schwa deletion. The user, while transliterating a word in RB, relies on the phonetic transcription of the word and might omit it. In cases where more complex letter combinations are used in Bengali (especially the conjugates), letter-to-letter transliterations may not be applicable. This also leads to a large Levenshtein distance between the word in SB and the word in RB. For example, (non-std) khoma → kShama (std). Further sources of error may be unintentional misspellings.

3. NOISY CHANNEL FORMULATION

In this section, we formally define the problem as a noisy channel process. Let S be a Bengali word in the standard form. When S is transmitted through the noisy channel, it is converted to T, the corresponding word in RB. The noisy channel is characterized by the conditional probability Pr(T|S). In order to decode T to the corresponding standard form δ(T), where δ is the decoder model, we need to determine Pr(S|T), which can be specified in terms of the noisy channel model as follows:

δ(T) = argmax_S Pr(T|S) Pr(S)    (1)

The channel can be conceived as a sequence of two sub-channels. First the SB form S is converted to the corresponding phonetic form P, which is then converted to the RB form by the second channel. This is illustrated below:

S → P = p1p2p3...pr → T = t1t2t3...tn

The motivation behind this is as follows: when the user thinks of a word in SB which he wants to represent in RB, it is the phonetic transcription of the word that he represents in the RB form.

Given the noisy text T, if we want to produce the source word S, we need to reverse the above process. Thus, the expression for the decoder model would be:

δ(T) = argmax_S Σ_P [Pr(T|P) Pr(P|S) Pr(S)]    (2)

In the case of Bengali, most words have only one phonetic representation, which implies that Pr(P|S) is 1 for a particular P* = G2P(S) and 0 for all other phonetic strings. Here, G2P represents the grapheme-to-phoneme mapping. Therefore, we can rewrite Eq. 2 as

δ(T) = argmax_S [Pr(T|G2P(S)) Pr(S)]    (3)

In the subsequent sections, we propose a framework based on the above assumption to carry out the transliteration process efficiently.
4. PROPOSED FRAMEWORK
In order to compute δ(T), we need to compute Pr(T|G2P(S)) and Pr(S). The latter quantity is the unigram probability of the word S and can be estimated from a corpus. In order to compute the former quantity, we need a G2P converter (or a pronouncing dictionary) and a model that computes the probability of mapping an arbitrary phonetic string to a string in Roman script. It is interesting to note here that, though the probability of mapping a phonetic string into Roman script seems to be independent of the source language (here Bengali), in reality it is hardly so. For example, the phonemes /T/ (retroflex, or the hard t) and /t/ (dental, or the soft t) are both transliterated as "t" by Bengali speakers, whereas Telugu and Tamil speakers use "th" to represent /t/ and "t" to represent /T/.
Fig. 1 shows the basic architecture of the system. A list of Bengali words (shown as the lexicon in the figure) is converted to its phonetic counterpart using a G2P. As the preprocessing step, a forward trie is built using the phonetically transliterated words. A probabilistic finite state machine (PFSM) is constructed for T (the RB word) that represents Pr(T|P). The PFSM and the trie are searched simultaneously to find a P* = G2P(S) such that the probability Pr(T|G2P(S)) Pr(S) is maximized over all S.

Fig. 1: Basic Architecture. (PLG: Phonetic Lexicon Generator; SBG: Standard Bengali Generator; FSMG: Finite State Machine Generator.)

4.1. Grapheme-to-Phoneme Conversion

The first stage of the noisy channel formulation is simulated by a grapheme-to-phoneme converter that gives the phonetic form of a given Bengali word using IPA notation. The G2P used for this work is described in [7]. It is a rule-based G2P that also uses morphological analysis and an extensive list of exceptions. These features make the accuracy of the G2P quite high (around 95% at the word level).

4.2. Resources

A lexicon containing around 13000 of the most frequent Bengali words and their unigram frequencies has been obtained from the CIIL corpus. Each word is passed through the G2P and its phonetic transliteration is obtained. Thus, we obtain the phonetic lexicon consisting of the words, their phonetic representations and their frequencies.

A modest-size transliterated corpus is required in order to learn the probabilities Pr(e|p), where e is an English letter and p is a phoneme used in Bengali. These probabilities are calculated for each of the phonemes that are present in the lexicon. However, for this work we manually assign these probability values.

4.3. Representing the lexicon as a trie

The trie [8] is built from the phonetic lexicon consisting of the phonetic forms of the words and their frequencies. Starting with a root node, transitions are defined using phonemes. Each node N of the trie is associated with a (possibly empty) list of Bengali words W(N) = {w1, w2, ..., wk} such that the unique phonetic string P represented by the path from the root to N is the phonetic transliteration of {w1, w2, ..., wk}. That is,

P = G2P(w1) = G2P(w2) = ... = G2P(wk)
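A minimal sketch of this data structure follows; the names are illustrative.

```python
# Minimal sketch of the phonetic-lexicon trie: transitions are phonemes and
# each node stores W(N), the words whose phonetic form ends at that node.
class TrieNode:
    def __init__(self):
        self.children = {}   # phoneme -> TrieNode
        self.words = []      # W(N)

def build_trie(phonetic_lexicon):
    # phonetic_lexicon: iterable of (word, phoneme_sequence) pairs from the G2P
    root = TrieNode()
    for word, phonemes in phonetic_lexicon:
        node = root
        for p in phonemes:
            node = node.children.setdefault(p, TrieNode())
        node.words.append(word)
    return root
```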
4.4. FSM for the RB Word

Every English letter can represent one or more phonemes in Bengali. The probabilities of Bengali phonemes mapping to certain English graphemes can be learnt from a corpus. These values are then used to construct a PFSM for the RB word. Transitions in this PFSM are defined on the Bengali phonemes that might be represented by an English letter. A traversal of this PFSM to its final state gives the possible phonetic alternatives of the Bengali word.

Figure 2: FSM for 'cha' (states 1 and 4 are the initial and final states respectively).

Figure 2 shows the PFSM for the input RB word 'cha'. The first two letters 'ch' may cause a transition from state 1 to 3. Alternatively, the machine may transition from state 1 to 2 on the possible Bengali phonemes corresponding to the letter 'c', and then to state 3 on the phonemes corresponding to 'h'. It may then transition to state 4 on the possible phonemes for the letter 'a'.

Note that every path from the start state to the final state is a possible phonetic transliteration of the RB string, the probability of which is given by the product of the probabilities of the edges in the path.
5. HEURISTIC SEARCH

Argmax searches have been implemented using A* algorithms [9]. In this section, we describe the method used in this case to implement A*.

Let T be the word in RB which we want to decode. We construct the PFSM MT corresponding to T. Let P = p1p2...pn be a string of phonemes associated with the transitions e1, e2, ..., en in MT such that the transitions, in that order, denote a path from the start state of MT to the final state. Therefore, the probability Pr(T|P) is by definition Π(i = 1 to n) Pr(ei), where Pr(ei) is the transition probability of the edge (transition) ei.

In order to compute Pr(S), we need to know the set of words {w1, w2, ..., wk} such that G2P(wi) = P. This can be computed by searching for the string P in the trie. If NP is the node in the trie representing the string P, then we define c(NP) = Pr(S) = Σ(w ∈ W(NP)) Pr(w), where Pr(w) is the unigram probability of the word w.

In order to search for the node NG such that the product of the two probabilities is maximized, we simultaneously explore MT and the trie. We tie up the states of MT with the nodes of the trie by marking them with the same labels. Note that a node in the trie gets a unique state label. We define the cost of reaching a node N in the trie as follows:

g(N) = 1, if N is the root
g(N) = g(par(N)) × Pr(par(N) → N), otherwise

where par(N) is the parent of node N and Pr(par(N) → N) is the probability associated with the transition in MT from the state tied up with par(N) on the phoneme that connects par(N) to N in the trie.

We define the heuristic function h(N) as follows:

h(N) = c(N), if N is a leaf node
h(N) = Σ(X ∈ Ch(N)) c(X) + c(N), otherwise

where Ch(N) is the set of children of N.

We apply A* search based on the priority of a path assigned using the following function:

f(N) = g(N) × h(N)

Unlike traditional A*, here the path costs are obtained as products rather than sums. Moreover, our aim is to maximize f(N) rather than minimize it. However, note that we can carry out the computations in the logarithmic domain, such that

f*(N) = log(g(N)) + log(h(N))

Moreover, since all probability values are less than 1, the logarithms are less than 0. Thus, we can take absolute values while defining all the aforementioned functions. This in turn transforms the maximization problem into a minimization problem. It is easy to see that the heuristic function defined here is an over-estimate.
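The log-domain trick can be sketched as follows, assuming each trie node carries a precomputed h(N) value in an `h` attribute and the PFSM is exposed through an `edge_prob` function; this is an illustration of the search order, not the authors' implementation.

```python
# Minimal sketch of the best-first search in the log domain: minimising
# f*(N) = |log g(N)| + |log h(N)| mirrors maximising g(N) x h(N).
# Assumes all probabilities are strictly between 0 and 1, as in the paper.
import heapq
import math

def decode(root, edge_prob, unigram_prob):
    tie = 0                                  # keeps heap entries comparable
    frontier = [(abs(math.log(root.h)), tie, 0.0, root)]
    while frontier:
        _, _, log_g, node = heapq.heappop(frontier)
        if node.words:                       # phonetic string with W(N) words
            return max(node.words, key=unigram_prob)
        for phoneme, child in node.children.items():
            p = edge_prob(node, phoneme)     # PFSM transition probability
            if p > 0.0:
                g = log_g + abs(math.log(p))         # |log g(child)|
                tie += 1
                heapq.heappush(frontier,
                               (g + abs(math.log(child.h)), tie, g, child))
    return None
```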
6. CONCLUSION

This paper only deals with the proposed theory for obtaining transliterations at the word level. Once implemented, it will be possible to identify further parameters which can be used as part of the heuristics to generate better results. Further, developing a system for generating transliterations at the sentence level is a natural progression of this work.

7. REFERENCES

[1] Chopde, A. ITRANS (version 5.30), 2001, http://www.aczoom.com/itrans/
[2] Bilac, S., and Tanaka, H. A Hybrid Back-Transliteration System for Japanese. Proceedings of the 20th International Conference on Computational Linguistics: COLING, pp. 597–603, 2004.
[3] Chen, H.H., and Lin, W.H. Backward Machine Transliteration by Learning Phonetic Similarity. 6th Conference on Natural Language Learning, 2002.
[4] Bandyopadhyay, S., Ekbal, A., and Naskar, S. A Modified Joint Source Channel Model for Transliteration. COLING-ACL, 2006.
[5] Al-Onaizan, Y., and Knight, K. Machine Transliteration of Names in Arabic Text. ACL Workshop on Computational Approaches to Semitic Languages.
[6] Bilac, S., and Tanaka, H. Improving Back-Transliteration by Combining Information Sources. IJCNLP, 2004.
[7] Mukherjee, A., Chakraborty, S., Choudhury, M., Lahiri, A., Dey, S., and Basu, A. Shruti: An Embedded Text to Speech System for Indian Languages. IEE Proceedings on Software Engineering, 2006.
[8] Fredkin, E. Trie Memory. Communications of the ACM, 3(9):490–499, 1960.
[9] Och, F.J., Zens, R., and Ney, H. Efficient Search for Interactive Statistical Machine Translation, 2003.
The Structure of Nepali Grammar
Bal Krishna Bal, Madan Puraskar Pustakalaya, Nepal.
bal@mpp.org.np
Abstract
This paper attempts to provide basic insight into the structure of Nepali grammar. It takes a tour of the writing system, the parts of speech, the phrase and clausal structure, and finally the sentential structure of the language. Nepali is highly inflectional and derivational across a wide range of grammatical aspects, such as tense, gender and pronouns, and research on its grammar is incomplete without taking these special characteristics into consideration. So, wherever possible and deemed necessary, illustrations are provided. In the discussion and conclusion section, the document also presents a brief overview of the design and implementation of the Nepali Grammar Checker, to be developed as part of the Natural Language Processing Applications Development under the PAN Localization Project, Madan Puraskar Pustakalaya, Nepal. The findings of this study are believed to be an invaluable resource and base document for the development of the Nepali Grammar Checker.
Introduction
The formulation of the first Nepali grammar dates back some eighty years, to the "Gorkha Bhasha Chandrika Vyakaran", found to have been written long before the actual study of the Nepali language started (Adhikari Hemang Raj, 2005). The Nepali language, and consequently its grammar, has evolved a lot in this period. Numerous books and writings on Nepali grammar have come out in the meantime. Inconsistencies and differences of opinion on several grammatical issues, too, have continued to exist. Nevertheless, there are also aspects on which the grammarians have a common meeting point.
Research methodology and objectives
The research methodology devised for this research work is basically a qualitative approach. Secondary data, i.e. the information available from different sources about Nepali grammar, has been compiled and analyzed. Besides, primary data collection methods and mechanisms, such as active brainstorming sessions and consultations with experts, were also exercised. The findings of this research work do not in any sense capture all the aspects of the Nepali grammar structure. Furthermore, the findings of the study presented might be subject to changes and corrections as newer concepts and ideologies emerge. The primary objective of this research work is to initiate a base document for further research.
Results
The results section summarizes the basic structure of Nepali grammar. This includes the writing system, the form classes (lexicon), phrase structure and clause analysis, and a brief overview of the sentential structure of the Nepali language. Peculiar cases and characteristics of the Nepali language and grammar are also noted.
Writing System of Nepali
Nepali is written in the Devanagari script. Although the statistics vary, the Nepali language basically has 11 vowels and 33 consonants. Debate prevails on whether to include in the list of vowels and consonants the alphabets which exist in pronunciation but not in writing, and vice versa. The pronunciation closely resembles the writing system, and hence the script is highly phonetic. The script is written from left to right, with an additional line on top of each word known as "dika" in Nepali. Without the dika, a word is considered grammatically incomplete, although it can still be read. There is no provision for capital and small letters in the script. The alphabets are written in two separate groups, namely the vowels and the consonants, as shown in the table below.
Table 1. Alphabets of the Nepali Language
Vowels: अ, आ, इ, ई, उ, ऊ, ऋ, ए, ऐ, ओ, औ
Consonants: क, ख, ग, घ, ङ, च, छ, ज, झ, ञ, ट, ठ, ड, ढ, ण, त, थ, द, ध, न, प, फ, ब, भ, म, य, र, ल, व, श, ष, स, ह
The three alphabets क्ष, त्र and ज्ञ are regarded as special clusters or conjuncts, and hence are formed as a combination of one or more consonants and special symbols. We discuss them a bit later.
In addition to the alphabets mentioned above, the following signs and symbols exist in written
Nepali as shown in the table below.
Table 2. Additional symbols in the Nepali language
Candrabindu
Anusvar or
cirabindu
Vowel signs
Visarga
Viram or
halanta
ँँ
ँं
ँा,िँ,ँी,ँु,ँू,ँृ,ँे,ँै,ँो,ँौ
:
ँ्
The vowel signs ा, ि, ी, ु, ू, ृ, े, ै, ो, ौ correspond to the vowels आ, इ, ई, उ, ऊ, ऋ, ए, ऐ, ओ, औ respectively. The alphabets categorized under the vowels are often called the free forms of the vowels, whereas the vowel signs are called the conjunct vowels. The text below illustrates the order of the writing system of some of the vowel signs.
The vowel sign ि, although pronounced after the consonant, is written before it in normal written practice: िब = ि placed before the consonant ब.
The vowel sign ी follows the consonant: सी = ी after the consonant स.
The vowel signs ु and ू are written at the foot of the consonant: लु = ल + ु, नू = न + ू.
When joined to र, the vowel signs ु and ू are written as रु and रू.
The three special clusters क्ष, त्र and ज्ञ are formed by the combination of other consonants, with the viram or halanta playing a significant role in the combination, as shown below:
क्ष = क + ् + ष
त्र = त + ् + र
ज्ञ = ज + ् + ञ
Form classes (Lexicon)
The Nepali Grammar consists of both inflected and uninflected forms, sometimes also known as the open and closed classes respectively. These constitute the parts of speech of the Nepali Grammar. The open class includes the noun, adjective, verb and adverb, whereas the pronoun, coordinating conjunction, postposition, interjection, vocative and nuance particle come under the closed class. In addition to the two form classes mentioned above, the Nepali Grammar has yet another class, named the substitute form class. The major substitute forms in Nepali are the K-forms, or interrogatives, the J-forms, or subordinators, and the D-forms, or demonstratives.
Nominal structure
Nominal structures in Nepali include the common-noun phrase, proper-noun phrase, pronoun phrase and dependent nominals functioning as modifiers in larger nominals.
Verbal forms
The nonfinite verbal forms are:
i) infinitives, marked by the infinitive suffix -na or -nu (jaana or jaanu 'to go');
ii) participles, marked by the suffixes -eko, -ne, -dai, -tai, -yera, -i, -ikana (gareko 'done', garne 'doing', gardai 'doing', garikana 'having done');
iii) conditionals, marked by the suffix -ye (gaye 'if go', khaaye 'if eat', gare 'if do').
Verb conjugation types
The verb stems in Nepali are grouped into three types:
i) 1st conjugation: verbs with bases ending in consonants, e.g. gara 'do', basa 'sit', dagura 'run'.
ii) 2nd conjugation: verbs with bases ending in the vowels ii and aa, with the single exception of jaa 'go', e.g. di 'give', li 'take', khaa 'eat', birsi 'forget'.
iii) 3rd conjugation: verbs with bases ending in the vowels aau, a and u, and aa in the single case of jaa 'go'.
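For illustration, the grouping above can be captured procedurally. The following is a minimal sketch (not part of the paper's system) over romanized verb stems; the treatment of the inherent final 'a' in consonant-final bases such as gara is our own simplifying assumption.

# Illustrative sketch of the three conjugation groups described above.
# Assumption: romanized consonant-final bases (gara 'do', basa 'sit')
# carry an inherent final 'a', which we read as a consonant-final base.

def conjugation_class(stem: str) -> int:
    if stem == "jaa":                     # the stated exception: jaa 'go'
        return 3
    if stem.endswith(("aau", "u")):       # 3rd conjugation vowel endings
        return 3
    if stem.endswith(("ii", "aa", "i")):  # 2nd conjugation: di, li, khaa, birsi
        return 2
    return 1                              # consonant-final bases: gara, basa, dagura

for stem in ["gara", "basa", "dagura", "di", "li", "khaa", "birsi", "jaa"]:
    print(f"{stem}: conjugation {conjugation_class(stem)}")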
Sentential structure of Nepali
Nepali sentences follow the Subject, Object, Verb (SOV) pattern, as opposed to English, which follows the Subject, Verb, Object (SVO) pattern. For example:

Sentence type | English | Nepali
Declarative | I eat rice. (Subject, Verb, Object) | Ma bhaat khaanchhu. (Subject, Object, Verb)
Interrogative | Do you eat rice? (Subject, Verb, Object) | Ke timi bhaat khanchhou? (Subject, Object, Verb)
Imperative | You go home. (Subject, Verb, Object) | Timi ghara jaaun. (Subject, Object, Verb)
Discussion
Keeping in view the high degree of derivation and inflection, as well as the different sentential structure (Subject, Object, Verb) of the Nepali language, the development of natural language processing applications in Nepali, such as a Spell Checker, Grammar Checker or Machine Translation system, requires a dedicated parsing engine as well as a morpho-syntactic analyzer. An in-depth, analytical study of the prerequisites of a Nepali computational grammar is hence the need of the day. These prerequisites refer to both linguistic resources and natural language processing tools.
Recent updates in the research and development of the Nepali Grammar Checker include the conceptualization of the general architecture of the system. In general, the Grammar Checker aims to check grammatical errors such as nominal and verbal agreement, parts of speech inflections, and whether the SOV pattern is observed or not.
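To make the SOV check concrete, here is a minimal sketch, assuming POS-tagged input with hypothetical tags N (nominal) and V (verb); a real checker would rely on the Chunker and Parser modules mentioned in the conclusion.

# Minimal sketch of an SOV-order check over (word, tag) pairs.
# The tags N (nominal) and V (verb) are hypothetical placeholders.

def violates_sov(tagged):
    # Flag a sentence whose verb precedes a following nominal (object).
    verbs = [i for i, (_, t) in enumerate(tagged) if t == "V"]
    nouns = [i for i, (_, t) in enumerate(tagged) if t == "N"]
    if not verbs or not nouns:
        return False
    return verbs[0] < nouns[-1]

# "Ma bhaat khaanchhu" (I rice eat) is SOV: not flagged.
print(violates_sov([("ma", "N"), ("bhaat", "N"), ("khaanchhu", "V")]))  # False
# An SVO rendering ("Ma khaanchhu bhaat") is flagged.
print(violates_sov([("ma", "N"), ("khaanchhu", "V"), ("bhaat", "N")]))  # True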
Conclusion
The findings of this preliminary research work do not in any sense capture all the aspects of the Nepali Grammar structure. Furthermore, the findings of the study might be subject to changes as newer concepts and ideologies emerge. However, this research work can serve as a strong base document for further research. Besides, the results of the grammar checker research and development will surely serve as a milestone for computational linguistic work on the Nepali language. The specific modules to be developed for the system, viz. the Stemmer Module, POS Tagger Module, Chunker and Parser Module, Grammatical Relation Finder Module etc., are all being developed for the first time for the Nepali language. The development of these modules should open doors for further research on the Nepali language and Grammar.
Acknowledgement
This research work is supported by the International Development Research Centre (IDRC), Canada under its PAN Localization Project, Nepal Component.
References
1) http://www.thdl.org/education/nepali/
2) http://www.omniglot.com/writing/nepali.htm
3) Teach Yourself Nepali by Dr. Michael Hutt and Professor Abhi Subedi (Hodder & Stoughton, first published in 1999).
4) Basic Course in Spoken Nepali by Tika Bahadur Karki and Chij Kumar Shrestha (Kathmandu, multiple editions).
5) A Course in Nepali by David Matthews. School of Oriental and African Studies, University of London, 1992.
6) A Descriptive Grammar of Nepali and an Analyzed Corpus by Jayaraj Acharya. Georgetown University Press, 1991.
7) CoGrOO: a Brazilian-Portuguese Grammar Checker based on the CETENFOLHA Corpus. Jorge Kinoshita, Lais do Nascimento Salvador, Carlos Eduardo Dantas de Menezes. Universidade de Sao Paulo (USP), Escola Politecnica, Sao Paulo-SP-Brasil; Unifacs – Universidade de Salvador, Nuperc, Salvador-Bahia-Brasil; Centro Universitario SENAC, Sao Paulo, SP-Brasil.
Email Answering Assistant for Contact Centers
Rahul Malik, L. Venkata Subramaniam+ and Saroj Kaushik
Dept. of Computer Science and Engineering,
Indian Institute of Technology, Delhi,
Hauz Khas, New Delhi, India.
{csd02442,saroj}@cse.iitd.ernet.in
+IBM India Research Lab,
Block I, IIT-Delhi, New Delhi, India
lvsubram@in.ibm.com
Abstract
A contact centre is a centralized office used for the purpose of receiving and transmitting a large volume of customer care requests. Nowadays, customer care in the technical domain is mostly e-mail based, and replying to so many e-mails is time consuming. We present a technique to automatically reply to customer e-mails by selecting the appropriate response template. The system can have a great impact in contact centers. The system has been evaluated and achieves good performance.
1 Introduction
A contact centre is a centralized office used for the
purpose of receiving and transmitting a large volume of customer care requests. Most major businesses use contact centers to interact with their
customers. Examples include utility companies,
mail order catalogue firms, and customer support for computer hardware and software. Some
businesses even service internal functions through
contact centers. Examples of this include help
desks and sales support.
The queries asked in a given domain are fixed, and customers usually ask from a standard set of queries only. The contact centre agents are typically provided with many templates that cover the different queries asked. When a query email comes, it is first triaged and sent to the appropriate agent for response. The agent selects the appropriate template and fills it in to compose the reply. Selection of the template is a time consuming task, as the agent has to search through a lot of templates to select the correct one.
Query: Please provide the current status of the rebate reimbursement for my phone purchases.

Response: I understand your concern regarding the mail-in rebate. For mail-in rebate reimbursement, please allow 8-14 weeks to receive it.

Template: I understand your concern regarding the mail-in rebate. For mail-in rebate reimbursement, please allow <replace this> (***weeks or months***) </replace this> to receive it.

Figure 1: A Typical Query, Response/Template Pair
The e-mails are unstructured text, and automatic extraction of the portions of an e-mail that require a response is a difficult task. In this paper, we propose a system to automatically answer customer e-mails by:
1. Extracting, both from the customer query mail Q and the response mail R, the set of questions qi and the set of responses rj into which the query mail and the response mail respectively decompose.
2. Matching each question qi with its relevant response rj.
3. Finally, answering new questions by comparing them to previously studied questions.
2 Related Work
Little work has been done in the field of contact centres emails. (Nenkova and Bagga, 2003),
(Busemann et al., 2000) learn a classifier based on
existing emails using features such as words, their
parts of speech, etc. When new queries come in
they are automatically routed to the correct agent.
Not much work has been done on automatically
answering the email queries from customers. In
(Scheffer, 2004) a classifier is learnt to map questions to a set of answer templates. Our work in this
paper describes methods to automatically answer
customer queries.
Extracting words and phrases that contain important information from a document is called key phrase extraction. Key phrase extraction based on learning from a tagged corpus has been widely explored (Frank et al., 1999). (Turney, 1999) describes a system for key phrase extraction, GenEx, based on a set of parameterized heuristic rules that are fine-tuned using a genetic algorithm. (Frank et al., 1999) use a set of training documents and extract key phrases of length up to 3 words to generate a naive Bayes model. The model is used to find key phrases in a new document. However, key phrase extraction techniques have mainly been applied in document summarization and topic search. We mention this body of work in the context of this paper because, by extracting key phrases from an email, we identify the key questions and responses.
Text similarity has been used in information retrieval to determine documents similar to a query. Typically, similarity between two text segments is measured based on the number of similar lexical units that occur in both text segments (Salton and Lesk, 1971). However, lexical matching methods fail to take into account the semantic similarity of words. In (Wu and Palmer, 1994) the similarity of words is measured based on the depth of the two concepts relative to each other in WordNet (http://wordnet.princeton.edu/). In this paper we need to identify similar questions. Also we need to match a question to its answer.
3 Proposed Algorithm
Here, we describe the approach taken by us in
building the system.
3.1 E-mail triaging
In a contact center, there are different agents who look after different aspects. So, the emails are manually triaged and forwarded to the right agent, who responds to them. We replace this step with automatic triaging. We use clustering to first identify the different classes and then learn a classifier based on these classes. We classified the emails from the larger pool into an equal number of query and response clusters using text clustering by repeated bisection using cosine similarity. An
SVM classifier was then learnt on the classes created by the clustering. This classifier is then used
to triage the emails.
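A minimal sketch of this triaging step, assuming scikit-learn is available: the paper clusters by repeated bisection with cosine similarity, for which k-means over L2-normalised tf-idf vectors is used here as a rough stand-in; the email list is hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

emails = [                                  # hypothetical training pool
    "my rebate has not arrived yet",
    "please check my rebate status",
    "how do I reset my phone",
    "phone reset instructions needed",
]

vec = TfidfVectorizer()
X = vec.fit_transform(emails)               # tf-idf vectors (L2-normalised)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
clf = LinearSVC().fit(X, labels)            # SVM learnt on the cluster labels
print(clf.predict(vec.transform(["where is my rebate cheque"])))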
3.2 Key-phrase Identification
Key phrase identification is an important step in identifying the questions and answers, as these can be identified by the presence of key phrases in them. Key phrase extraction is a classification task where each potential phrase can be a key phrase or not. First of all, candidate phrases are identified. The following rules are applied to identify the candidate phrases: all contiguous unigrams, bigrams and trigrams are identified, and a candidate phrase cannot begin or end with a stopword. We use a 452-word list of stopwords. The identified phrases are passed through the Porter Stemmer (http://www.tartarus.org/~martin/PorterStemmer) to obtain the root. The next step is to determine the feature vector for the training and testing phases. The following features are used: TF*IDF, a measure of a phrase's frequency in a document compared to its rarity in general use; whether a proper noun is present in the phrase or not; first_occurrence, the distance into the document of the phrase's first occurrence; and num_of_words, which is simply the word count of the phrase.
For the training phase, the key phrases are marked in training query and response emails. These are used to generate a model, which is then used in prediction. We use Naive Bayes as the machine learning scheme. Once the model is learned, it is used to extract the key phrases from the testing emails.
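For concreteness, a sketch of the candidate-phrase and feature computation described above; the stopword list here is a small stand-in for the 452-word list, and the tf-idf score is assumed to be supplied by the indexer. Stemming of the candidates (Porter) is omitted for brevity.

import re

STOPWORDS = {"the", "a", "of", "to", "in", "for", "is", "my"}   # stand-in list

def candidates(text, max_n=3):
    # Contiguous uni-, bi- and trigrams that neither start nor end with a stopword.
    words = re.findall(r"[A-Za-z0-9']+", text)
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0].lower() not in STOPWORDS and gram[-1].lower() not in STOPWORDS:
                grams.append((" ".join(gram), i))
    return grams

def phrase_features(gram, position, doc_len, tfidf_score):
    # The four features listed above, for one candidate phrase.
    return {
        "tf_idf": tfidf_score,                                   # assumed given
        "has_proper_noun": any(w.istitle() for w in gram.split()),
        "first_occurrence": position / max(1, doc_len),
        "num_of_words": len(gram.split()),
    }

These feature vectors then train the Naive Bayes model described above.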
3.3 Identification of Question and Answers
The questions and answers are identified by the presence of key phrases in them. If a key phrase occurs in multiple sentences in the document, then the sentence with the maximum number of key phrases is selected. In case of a tie, the first occurrence is chosen. In this manner, we identify the questions and answers in emails.
3.4 Mapping Questions to Answers
Once the questions and responses have been identified, we need to map each question to its corresponding response. To accomplish this mapping, we first partition each extracted sentence into its list of tokens, remove the stop words, and pass the remaining words through a stemmer (using Porter's stemming algorithm) to get the root form of every word. The tokens include nouns, verbs, adjectives and adverbs. In addition, we also keep the cardinals, since numbers also play an important role in understanding the text. We then form a matrix in the following manner: the rows of the matrix are the tokens from one sentence and the columns are the tokens from the second sentence. Each entry Sim(s, t) of the matrix denotes the similarity of that token pair, obtained in the manner described below. If a word s or t does not exist in the dictionary, then we use the edit distance similarity between the two words.
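A minimal sketch of this matrix construction, assuming NLTK with WordNet and the Brown information-content file installed; the concept similarity used inside is the Jiang-Conrath measure CSim defined just below, and the 1/(1+edit distance) fallback for out-of-dictionary words is an illustrative stand-in for the edit distance similarity.

from nltk import edit_distance
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")

def word_sim(s, t):
    # Sim(s, t): best Jiang-Conrath score over noun senses, else edit distance.
    ss, ts = wn.synsets(s, pos=wn.NOUN), wn.synsets(t, pos=wn.NOUN)
    if not ss or not ts:                       # out-of-dictionary word
        return 1.0 / (1.0 + edit_distance(s, t))
    best = 0.0
    for a in ss:
        for b in ts:
            try:
                best = max(best, a.jcn_similarity(b, brown_ic))
            except Exception:                  # senses with no common subsumer
                pass
    return best

def sentence_sim(si, sj):
    # SS(Si, Sj): best-match word similarities averaged in both directions,
    # following the sentence similarity formula given below.
    ms_i = sum(max(word_sim(s, t) for t in sj) for s in si)
    ms_j = sum(max(word_sim(t, s) for s in si) for t in sj)
    return (ms_i + ms_j) / (len(si) + len(sj))

print(sentence_sim(["rebate", "status"], ["rebate", "reimbursement"]))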
The similarity between two concepts is given as (Jiang and Conrath, 1997):

CSim(s, t) = 1 / (IC(s) + IC(t) - 2 x IC(LCS))

where IC is defined as IC(c) = -log P(c), and P(c) is the probability of encountering an instance of concept c in a large corpus. Also, LCS is the least common subsumer of the two concepts. We are using WordNet.

The similarity between two sentences is determined as follows:

SS(Si, Sj) = (Σ_{s∈Si} MS(s, Sj) + Σ_{s∈Sj} MS(s, Si)) / (|Si| + |Sj|)

where MS(s, Sj) is the highest semantic similarity between the word s in Si and any word in Sj.

In addition, we used a heuristic that if a question is asked in the beginning, then the chances are higher that its response would also be in the beginning. So, the expression for the score becomes:

score(q, r) = SS(q, r) x (1 - |pos(q)/N - pos(r)/M|)

where pos(q) is the position of the question in the set of questions of that query email, N is the number of questions in that query email, and M is the number of answers in that response email.

Each answer is then mapped to a template. This is done by simply matching the answer with sentences in the templates. Multiple questions can match the same template, because different customers may ask the same question in different ways. So, we prune the set of questions by removing questions that have a very high similarity score between them.

3.5 Answering new Questions
When we get a new query, we first triage it using the SVM classifier that was described in Section 3.1. Next, we identify the questions in it using the procedure described in Section 3.3. Each of these questions now needs to be answered. For a new question that comes in, we need to determine its similarity to a question we have seen earlier and for which we know the template. The new question is mapped to the template for the existing question to which it is similar. Using the above sentence similarity criterion, we compare the new question with the questions seen earlier and return its template.

4 Evaluation
We evaluate the system on the Pine-Info discussion list web archive (http://www.washington.edu/pine/pine-info). It contains emails of users reporting problems and responses from other users offering solutions and advice. The questions that users ask are about problems they face in using pine. Other users offer solutions and advice to these problems.

The Pine-Info dataset is arranged in the form of threads in which users ask questions and replies are made to them. This forms a thread of discussion on a topic. We choose the first email of the thread as the query email, as it contains the questions asked, and the second email as the response, as it contains responses to that email. It may not contain answers to all the questions asked, as they may be answered in subsequent mails of the thread. We randomly picked a total of 30 query-response pairs from Pine-Info. The question sentences and answer sentences in these were marked, along with the mappings between them.

On the average a query email contains 1.43 questions, the first response email contains 1.3 answers, and there are 1.2 question-answer pairs. We show a query and response pair from Pine-Info in Figure 2. The actual question and answer that have been marked by hand are shown in bold. In the example shown, one question has been marked in the query email and one answer is marked in the response email.

When printing with PINE I always lose a character or two on the left side margin. How can I get PINE to print four or five spaces to the margin? Printer works fine with all other applications.

Use the utility 'a2ps' to print messages with a nice margin. See links for more information.

Figure 2: Pine-Info Query, Response Pair

In pine, there are no templates. So, we are effectively checking whether we are able to map the manually extracted questions and answers correctly. For evaluation purposes, we use two criteria.
In the first case we say the system is correct only if it generates the exact answer as the agent.
In the second case we allow fractional correctness.
That is if the query contains two questions and
the system response matched the agent response
in one of them then the system is 50% correct.
As already mentioned we are looking for an answer in the first response to the query. Hence, all
questions in the query may not get addressed and
it may not be possible to get all the mappings. In
Table 1 we show the numbers obtained. Out of a
total 43 questions only 36 have been mapped to
answers in the manual annotation process. Using
the method presented, we are able to find 28 of
these maps correctly.
Table 1: Results on Pine-Info Dataset for Question-Answer Mapping

total mails | total qns | total ans | actual maps | correct maps | % correct maps
30 | 43 | 39 | 36 | 28 | 77.78%

Without partial correctness, the system achieves 77.78% correctness. When we consider partial correctness also, it increases to 84.3%.

We also tested the system on real life query-response emails. We used 1320 email pairs, out of which 920 were used for training the system and 400 were used for testing. We had 570 sample templates. Without partial correctness, the classification accuracy achieved was 79%, and when we consider partial correctness, it increases to 85%.

5 Conclusion and Future Work
In this paper, we have presented a technique for automatically composing the response email to a query mail in a contact centre. In the training phase, the system first extracts relevant questions and responses from the emails and matches each question to its correct response. When a new question comes, it triages it to the correct class and matches its questions to existing questions in the pool. It then composes the response from existing templates.

The given system can improve the efficiency of contact centers, where the communication is largely email based and the emails are unstructured text. The system has been tested thoroughly and it performs well on both the Pine-Info dataset and real life customer query-response emails.

In future, we plan to improve our system to handle questions for which there are no predefined templates. We would also like to fill in some of the details in the templates so that the agent's work can be reduced. Also, we would like to add some semantic information while extracting questions and answers to improve the efficiency.

References
Stephan Busemann, Sven Schmeier, and Roman G. Arens. 2000. Message classification in the call center. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 158-165, Seattle, Washington, April 29-May 4.
Eibe Frank, Gordon W. Paynter, Ian H. Witten, Carl Gutwin, and Craig G. Nevill-Manning. 1999. Domain-specific keyphrase extraction. In Proc. Sixteenth International Joint Conference on Artificial Intelligence, pages 668-673, San Francisco, CA.
J. Jiang and D. Conrath. 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics.
Ani Nenkova and Amit Bagga. 2003. Email classification for contact centers. In Proceedings of the 2003 ACM Symposium on Applied Computing, pages 789-792, Melbourne, FL, USA, March.
G. Salton and M. E. Lesk. 1971. Computer Evaluation of Indexing and Text Processing. Prentice Hall, Englewood Cliffs, NJ.
Tobias Scheffer. 2004. Email answering assistance by semi-supervised text classification. Journal of Knowledge and Process Management, 13(2):100-107.
P. Turney. 1999. Learning to extract keyphrases from text. Technical Report ERB-1057, National Research Council, Institute for Information Technology.
Z. Wu and M. Palmer. 1994. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133-138, Las Cruces, New Mexico, June 27-30.
Extracting Structural Rules for Matching Questions to Answers
Shen Song and Yu-N Cheah
School of Computer Sciences
Universiti Sains Malaysia, 11800 USM Penang
summerysmile@hotmail.com and yncheah@cs.usm.my
ABSTRACT
Rule-based Question Answering (QA) systems require a comprehensive set of rules to guide their search for the right answer. Presently, many rules for QA systems are derived manually. This paper presents a question answering methodology and proposes an automatic rule extraction methodology to obtain sufficient rules to guide the matching process between the question and potential answers in the repository.
1. Introduction
Matching is an essential part of a QA system. It decides whether the final answer is reasonable and accurate. In the past, manually extracted rules were popularly employed to support the matching function of QA systems. However, it is difficult to find a sufficient number of rules to suit the various kinds of question-answer structures [1].

In this paper, we introduce a proposed QA methodology as well as a simple methodology for automatic rule extraction, based on clustering, to obtain rules for matching questions to the right answers.

2. Our QA Methodology
The overview of our QA approach is as shown in Figure 1. At the heart of our methodology lies a Response Generator that executes the QA methodology.

Figure 1: Overview of our QA approach

In our QA system, answers are obtained from semantically annotated text repositories. These text documents are tagged for parts-of-speech (POS). A question analysis component is also included to identify keywords and question goals (via Wh term analysis).

Our QA methodology consists of 4 steps: (1) ontology-based question and repository understanding, (2) matching, (3) consistency checking, and (4) answer assembly.

2.1 Ontology-based question and repository understanding
The domain ontology is a basic but important component in our methodology. We use the ontology to understand the words or phrases of the tagged (or analysed) question and the tagged document fragments stored in the repository. From the question point of view, the ontology facilitates query formulation and expansion. Given a question, the question analyser will analyse and tag the question with the relevant POS as well as details from the ontology. The question's type and syntax are then determined by the analyser.

2.2 Matching
After obtaining the ontology-based understanding of the question (question's goal, keywords, etc.) and the repository content, the response generator searches for relevant document fragments in the repository of semantically annotated documents. The response generator employs a variety of matching rules to select the candidate responses. We have presently identified two matching rules:
1. Structural (or syntactic) matching rules: this is facilitated by the POS tagging of the question and of the document fragments in the repository. The use of structural matching rules is based on the assumption that the answer may have a similar sentence structure to the question [2].
2. Wh analysis rules: this is based on the idea that certain questions are answered in a particular way. For example, 'how' questions may typically contain preposition-verb structures in potential answers; or 'why' questions may typically contain the word 'because' in the answers.
We later explore the possibility of automatically extracting structural rules for this purpose.

2.3 Consistency checking
Usually, more than one document fragment matches the question. However, not all may be consistent with each other, i.e. they may have minor conflicting information. So, we explore the use of constraints to maintain the answer's consistency. After consistency checking, we would then have a list of consistent matching document fragments.

2.4 Answer assembly
In this step, we analyse the matching fragments to eliminate redundant information and combine the remaining document fragments. Then, we compare the question words with the matching answer and check the quantification, tense and negation relationships. The semantic structure between the question and the answer is also checked to make sure that the question is explained sufficiently in the answer.

3. Automated Rule Extraction for Question Answering
For the matching phase of our QA approach, we have previously identified structural matching rules and Wh analysis rules for matching questions to their potential answers. However, in the past, these rules were induced manually by analysing a limited number of common question-answer structures. They are not sufficient to solve a wider range of question-answer matching problems. We therefore propose a methodology to automatically induce rules to match questions and answers. Here, we focus on extracting structural matching rules only.

Our proposed methodology consists of three phases: (1) compilation of question-answer pairs, (2) analysis of question-answer pairs, and (3) rule extraction via clustering.

3.1 Compilation of question-answer pairs
We need a mass of question-answer pairs to support our rule extraction. Collecting question-answer pairs from the Internet is a good choice for us due to the redundancy of information on the Internet. Alternatively, sample answers for comprehension tests may also be used. As an example, let us suppose a sample of the question-answer pairs obtained is as shown in Table 1.

Table 1: Question-answer pairs

No | Question | Answer
1 | How do I get from Kuala Lumpur to Penang? | To travel from Kuala Lumpur to Penang, you can take a bus.
2 | Where is Kuala Lumpur? | Kuala Lumpur is in Malaysia.
3 | How can I travel to Kuala Lumpur from Penang? | You can travel to Kuala Lumpur from Penang by bus.
4 | Where is the location of Penang? | Penang is located north of Peninsular Malaysia.
5 | What can I do to get to Kuala Lumpur from Penang? | You can get to Kuala Lumpur from Penang by bus.
6 | What can I do in Langkawi? | You can go scuba diving in Langkawi.

3.2 Analysis of question-answer pairs
The question-answer pairs are analysed for their structure and are tagged accordingly. The syntactic (analysed) notations of the question-answer pairs form the dataset for our rule extraction. Based on Table 1, we produce our dataset as shown in Table 2.

Table 2: Analysed question-answer pairs

No | Question | Answer
1 | How vb pron vb prep location prep location | Prep vb prep location prep location, pron vb vb art vehicle
2 | Where vb location? | Location vb prep location.
3 | How vb pron vb prep location prep location? | Pron vb vb prep location prep location prep vehicle.
4 | Where vb art n prep location? | Location vb vb adj prep adj location.
5 | What vb pron vb prep vb prep location prep location? | Pron vb vb prep location prep location prep vehicle.
6 | What vb pron vb prep location? | Pron vb vb adj n prep location
3.3 Rule extraction via clustering
We propose three ways of clustering the analysed dataset for rule extraction [3]:
1. Cluster only the question part
2. Cluster only the answer part
3. Cluster both the question and answer parts together.
In this paper, we describe the method of clustering the question part only.

Firstly, we cluster the question part of our analysed dataset. For this purpose, we check for similarity in the structure of the question part. From Table 2, let us assume three clusters are obtained: Cluster A consists of rows 1, 3 and 5; Cluster B consists of rows 2 and 4; and Cluster C consists of row 6 only. This is because the question structures in each cluster are deemed to be similar enough (see Table 3). Each cluster would then need to have a representative question structure. For our purpose, the most popular question structure within each cluster is selected. For example, for Cluster A, the representative question structure would be How vb pron vb prep location prep location.

Next, within each cluster, we analyse the respective answer parts and choose the most popular structure. For example, in Cluster A, rows 3 and 5 have similar answer structures, which is therefore the most popular answer structure among the three rows in Cluster A. We therefore conclude that the question structure for Cluster A would result in answers that have the structure Pron vb vb prep location prep location prep vehicle. The rule that can be extracted from this may be of the form:

IF How vb pron vb prep location prep location
THEN Pron vb vb prep location prep location prep vehicle
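As a sketch of phase (3), the following groups question structures, picks the most frequent answer structure per group, and emits an IF-THEN rule. Exact-match grouping stands in here for the similarity-based clustering, so the three pairs reuse Cluster A's representative question structure; the data otherwise mirrors Cluster A of Table 3.

from collections import Counter, defaultdict

pairs = [   # (question structure, answer structure) for rows 1, 3 and 5
    ("How vb pron vb prep location prep location",
     "Prep vb prep location prep location, pron vb vb art vehicle"),
    ("How vb pron vb prep location prep location",
     "Pron vb vb prep location prep location prep vehicle"),
    ("How vb pron vb prep location prep location",
     "Pron vb vb prep location prep location prep vehicle"),
]

clusters = defaultdict(list)
for q, a in pairs:
    clusters[q].append(a)                       # cluster on the question part only

for q, answers in clusters.items():
    a, _ = Counter(answers).most_common(1)[0]   # most popular answer structure
    print("IF   " + q)
    print("THEN " + a)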
4. Using the Extracted Rules: An
Example
Following the extraction of rules, the matching
process can then be carried out by the Response
Generator. The rules basically guide the
matching process to find an answer in the
repository that is able to answer a given question,
at least from a structural point of view (semantic
details would be resolved via the ontology). Here,
the issue of similarity between the rules’
specification and the structure present in the
question and answer needs to be addressed [4].
Table 3: Clustered question-answer pairs

Cluster | No | Question | Question Structure | Answer | Answer Structure
A | 1 | How do I get from Kuala Lumpur to Penang? | How vb pron vb prep location prep location? | To travel from Kuala Lumpur to Penang, you can take a bus. | Prep vb prep location prep location, pron vb vb art vehicle
A | 3 | How can I travel to Kuala Lumpur from Penang? | How vb pron vb prep location prep location? | You can travel to Kuala Lumpur from Penang by bus. | Pron vb vb prep location prep location prep vehicle.
A | 5 | What can I do to get to Kuala Lumpur from Penang? | What vb pron vb prep vb prep location prep location? | You can get to Kuala Lumpur from Penang by bus. | Pron vb vb prep location prep location prep vehicle.
B | 2 | Where is Kuala Lumpur? | Where vb location? | Kuala Lumpur is in Malaysia. | Location vb prep location.
B | 4 | Where is the location of Penang? | Where vb art n prep location? | Penang is located north of Peninsular Malaysia. | Location vb vb adj prep adj location.
C | 6 | What can I do in Langkawi? | What vb pron vb prep location? | You can go scuba diving in Langkawi. | Pron vb vb adj n prep location
As an example of a matching process, let us
assume we would like to answer the question,
“How do I get from Kuala Lumpur to Penang?”
Firstly, the question will be analysed and converted into the following form: How vb pron vb prep location prep location?
Based on the rule extracted above, we know this
kind of structure belongs to Cluster A and the
corresponding answer’s structure should be: Pron
vb vb prep location prep location prep vehicle.
Finally, from the repository, we select answers
with this answer structure. Likely answers are
therefore: “You can travel to Kuala Lumpur from
Penang by bus” or “You can get to Kuala Lumpur
from Penang by bus”.
5. Concluding Remarks
In this paper, we introduced our proposed QA
methodology and a methodology for automatic rule
extraction to obtain matching rules based on
clustering. Our research is still in the initial stages.
We need to improve the rule extraction
methodology by (1) improving the analysis and
tagging of the question-answer pairs; (2) designing
an efficient algorithm to cluster the dataset; and (3)
developing a better method to aggregate similar
question or answer structures into a single
representative structure.
6. References
[1] Riloff, E., Thelen, M., A Rule-based Question Answering System for Reading Comprehension Tests, ANLP/NAACL-2000 Workshop on Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems, 2000.
[2] Li, W., Srihari, R.K., Li, X., Srikanth, M., Zhang, X., Niu, C., Extracting Exact Answers to Questions Based on Structural Links, Proceedings of the 2002 Conference on Multilingual Summarization and Question Answering, Taipei, Taiwan, 2002.
[3] Lin, D., Pantel, P., Discovery of Inference Rules for Question Answering, Natural Language Engineering, 7(4), 2001, pp. 343-360.
[4] Jeon, J., Croft, W.B., Lee, J.H., Finding Semantically Similar Questions Based On Their Answers, Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005), Salvador, Brazil, 2005, pp. 617-618.
“Who” Question Analysis
Rapepun Piriyakul and Asanee Kawtrakul
Department of Computer Engineering
Kasetsart University
Bangkok, Thailand
rapepunnight@yahoo.com ak@.ku.ac.th
Abstract
The purpose of this research is to automatically analyse the "Who" question for tracking the expert's knowledge. There are two main problems involved in "Who" question analysis. The first problem is to identify the question type, which is complicated by cue ambiguity and syntactic ambiguity. The second is to identify the question focus, which is complicated by syntactic and semantic ambiguity. We propose mining features for question identification, and the usage of focus rules for focus identification. This "Who" question analysis achieves 80% precision and 78% recall.

1. Introduction
People demand information as a response to their questions about certain facts. In the past, Information Retrieval was used to assist people in retrieving information. The way Information Retrieval works is that the system looks up the frequencies of the essential words that the person is looking for over databases of information. Information Retrieval, however, fails to take into consideration the desired context of the individual, which can be gained through the understanding of the true question asked by the individual. To allow this, the 'Question Answering System' was introduced. A Question Answering System is a type of Information Retrieval with the purpose of retrieving answers to the questions posed, done by applying techniques that enable the system to understand natural language (http://www.wikipedia.org). The Question Answering (QA) system consists of two sub-systems: the first is question analysis and the second is the answering system. These two sub-systems are significantly related, since question analysis is the front end of the QA system. Error analysis of open domain QA systems found that 36.4% of inaccurate answers came from wrong question analysis [T. Solorio et al. 2005].

However, this paper concerns only the "Who" question in the Thai language, because it lets us keep track of the expert's knowledge for solving problems. We confront many characteristics of Thai questions, such as the implicit question word; the question word can be placed at any position in the question text; and the appearance of a question word does not always make the text a question. After the question type identification, the next step of question analysis is the focus analyzer, which is based on syntactic and semantic analysis. It is necessary to determine the question focus to gain the correct answer. To fulfill the complete question representation, we also propose to construct the Extended Feature Knowledge (EFK) to enhance the Answering System in a cooperative way.

This paper has 6 sections. We begin with the introduction; we then discuss the problems in section 2 and related works in section 3; the framework is presented in section 4. We evaluate in section 5 and plan our future study in section 6.
2. Crucial Problems
There are two main problems: the identification of the "Who" question, and the identification of the question focus.

2.1 Question identification
The objective of question identification is to ensure that the question text is a "Who" question and a true question. Since Thai questions carry no question marker "?", we use a set of cues to identify the question type "Who", for example: {"khrai", "dai", "a rai", ...}. There are three problems in question identification: the movement of the question cue, question cue ambiguity, and syntactic problems.

2.1.1 Movement of question cue
A question cue can occur at any place in the interrogative sentence, as shown in the following examples:
a. /Khrai// khue//Elvis Pressley/ (Who was Elvis Pressley?)
b. /Elvis Pressley/ /Khrai// khue/ (Who was Elvis Pressley?)
Questions a and b have the same meaning.

2.1.2 Question cue ambiguity
The presence of a question cue, e.g. "khrai", does not always make the sentence a question. For example:
c. /Khrai/ /pen/ /na-yok/ /rat-ta-montri/ /khong/ /pra-thet/ /Thai/ (Who is the prime minister of Thailand?)
d. /Khrai/ /pen/ /na-yok/ /rat-ta-montri/ /khong/ /pra-thet/ /Thai/ /ko/ /kong/ /me/ /pun-ha/ (Anyone who is a prime minister of Thailand will meet the problem.)
Example c is a question, whereas d is a narrative sentence, as a result of the word /ko/ (/ko/ is a conjunction).

2.1.3 Syntactic problems
The above examples a-d do not specify tense, because the Thai language has no verb derivation specifying tense. Thai also lacks verb derivation specifying number. These syntactic problems cause difficulty in Thai QA because the answer can represent both an individual and a list of persons. For example:
/Khrai// rien// NLP/ (Who is/are studying NLP? Who was/were studying NLP?)
The answer can be an individual, such as "A is studying NLP", a list of persons, such as "A, B, C and D are studying NLP", or a group of persons, such as "The second year students are studying NLP".

2.2 Question Focus identification
Identifying the question focus is important to achieve a precise answer. There are many types of focus with respect to the "Who" question, i.e. a person's description, an organization's definition, a person's or organization's name, or a person's properties. The following examples show the pattern of the question:
e. /Khrai// sang //tuk// World Trade/ (Who built the World Trade building?)
f. /Khrai// khue//Elvis Pressley/ (Who was Elvis Pressley?)
In question e the focus is the name of a person or organization, but in question f the focus is the property of Elvis Pressley. Automatic focus identification is based on syntactic and semantic analysis and on world knowledge. The efficiency of the Answering System rests on the strength of the question analysis.

3. Related Work
Most approaches to question answering systems focus on how to select a precise answer. [Luc 2002] developed a question analysis phase to determine the expected type of the answer to a particular question, based on extraction functions with parameters. For instance, the question focus of "Which radioactive substance was Eda Charlton injected with in 1915?" is substance. Machine learning techniques have been used to tackle the problem of question classification [Solorio et al., 2005]. [Hacioglu et al., 2003] used as the first step in statistical QC (Question Classification) the design of a taxonomy of question types. One can distinguish between taxonomies having a flat or a hierarchical structure, and between taxonomies having a small (10-30) or a large (above 50) number of categories. YorkQA [Alfonseca et al., 2001] took inspiration from the generic question answering algorithm presented in [Simmons 1973], which was similar to the basic algorithms used in the Question Answering systems built for TREC. The algorithm carries out three procedural steps: the first is accumulating a database of semantic structures representing sentence meanings; the second is selecting a set of answers that appear relevant to the question. Relevance was measured by the number of lexical concepts in common between the proposed answer and the question. [Alfonseca and Marco, 2001] extended some procedures, i.e. the question analyzer used pattern matching based on Wh-words and simple part-of-speech information, combined with the use of semantic information provided by WordNet, to determine question types. Webclopedia [Hovy et al., 2004] is a QA system that used the CONTEXT parser to parse and analyze the question: the input question is parsed using CONTEXT to obtain a semantic representation of the question. The phrases/words from syntactic analysis were assigned significance scores according to the frequency of their type in the Webclopedia question corpus (a collection of 27,000+ questions and answers), secondarily by their length, and finally by their significance scores derived from word frequencies in the question corpus.
Our work, in contrast, is a deep analysis of the "Who" question. Our research is adapted from TREC QA, but we have modified and extended some parts to be applicable to Thai QA.

4. Framework for Question Analysis
Our work is based on preprocessing of the question text: word segmentation, POS tagging, and NE (named entity) recognition. The classification of Wh questions is based on the syntactic analysis of the interrogative pronoun and the Wh words in Table 1, and the posterior Bayes' probability is used to confirm the accuracy of the classification. The set of cues in Table 1 is used as a first coarse classifier for "Who".

Table 1 The set of cue words
Who, Whom: /khrai//khue/, /khue//khrai/, //phu//dai//, //tan//dai//, //boog-khon//nai// (Remark: /khrai/, //tan//dai//, //boog-khon//dai// are common nouns)
Which: /khon/dai/, /khon//nai/ (Remark: for "which one" we must combine two words together)
What: /a-rai/ (Remark: /choe/a-rai/ = "what name")

We use the posterior Bayes' probability to determine the classification of the Wh type:

Pr(q_type | Wh_word) = Pr(q_type ∩ Wh_word) / Pr(Wh_word)

The posterior Bayes' probability for a question given the cue "/khrai/" in our study domain is 0.7. This probability value is used to make the decision in the first step of question classification. After classifying the question, non-question text is pruned using a set of cues. These cues act as guards that let only true questions pass to the next step.
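A minimal sketch of this first coarse step, with hypothetical co-occurrence counts chosen so that the cue /khrai/ yields the 0.7 posterior cited above:

counts = {                 # hypothetical counts: cue -> (as "Who" question, total)
    "khrai": (70, 100),    # gives the 0.7 posterior cited above
    "dai":   (22, 80),
}

def posterior(cue):
    who, total = counts.get(cue, (0, 1))
    return who / total     # Pr(q_type = "Who" | cue)

def coarse_who(cue, threshold=0.5):
    # Accept the "Who" reading when the posterior clears the threshold.
    return posterior(cue) >= threshold

print(posterior("khrai"), coarse_who("khrai"))   # 0.7 True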
Based on our observations, we found beneficial characteristics of Thai questions, as in the following examples:
g. /Khrai// ko//dai/chuay/dua/ (Anyone can help me.)
h. /Khrai/ /pen/ /na-yok/ /rat-ta-montri/ /khong/ /pra-thet/ /Thai/ /ko/ /kong/ /me/ /pun-ha/ (Anyone who is a prime minister of Thailand will meet the problem.)
Statements g and h are not questions, as determined by /ko/.
i. /Na-yok/ /rat-ta-montri/ /khong/ /Thai//khon/ti/laew/khue/ /Khrai/ (Who was the previous prime minister of Thailand?)
j. /Na-yok/ /rat-ta-montri/ /khong/ /Thai//khon/ti/laew/laew//khue/ /Khrai/ (Who were the previous prime ministers of Thailand?)
With the word /laew/, questions i and j are marked as singular past (question i) and plural past (question j).

In case the verb of the question is an "is-a" verb, the focus is an NE. Syntactic and semantic analysis alone cannot identify the type of focus, so the system is supported by world knowledge to identify the NE as a Person or an Organization. From Figure 1, we can classify the verbs of "Who" questions into four groups:
Group1 = {/pen/ 'is a'}
Group2 = {/sang/ 'built', /kit-khon/ 'invent', /patana/ 'develop', /tum/ 'do'}
Group3 = {/khue/ 'is a'}
Group4 = {/khao//kai/ 'be qualified', /me/ 'has'}
The verb /me/ 'has, have' in Group4 must be followed by a constraint word, such as /me//sit-ti/ 'has the right' or /me/aum-nat/ 'has power or authority'.

Figure 1 The patterns of Who questions

From each verb group we can conclude four rules.
Rule 1: If the verb is in Group1, then the focus is "Person" or "Organization" and the answer is an NE.
If <IP /khrai/> <is a /pen/> <NP> Then <Focus is NP> and <Answer = NE>, where IP = interrogative pronoun.
Rule 2: If the verb is in Group2 + NP, then the focus is the NP and the answer is an NE.
If <IP /khrai/> <VP = V/verb group2/ + NP> Then <Focus is NP> and <Answer = NE>
Rule 3: If the verb is in Group3, then the focus is "Person" or "Organization" and the answer is a Description or Definition.
If <IP /khrai/> <is a /khue/> <NP> Then <Focus is NP> and <Answer = Description or Definition>
Rule 4: If the verb is in Group4, then the focus is the VP and the answer is a list of properties.
If <IP /khrai/> <VP = V/verb group4/ + NP> Then <Focus is NP> and <Answer = list of properties>

To enhance the answering system for a precise and concise answer, we mine extended features to detect lexical terms such as descriptions for Who (person) and definitions for Who (organization). We examine the feature space of description properties (Gender, Spouse, Location, Nationality, Occupation, Award, Education, Position, Expertise) from a sample of personal profiles. To simplify, these features are represented by a set of features (x1, x2, ..., x9), where xi is any feature in the feature space. The mining of features is based on the proportion test with threshold 0.05.

Table 2 The experiment results
Feature | Freq | Occ | P
X1 | 20 | 0 | 1
X2 | 12 | 8 | 0.6
X3 | 1 | 19 | .05
X4 | 2 | 18 | 0.1
X5 | 1 | 19 | .05
X6 | 15 | 5 | 0.75
X7 | 3 | 17 | 0.15
X8 | 14 | 6 | 0.7
X9 | 1 | 19 | .05

Table 2 shows the proportions of the 9 features. The features retained by the mining are Position, Nationality, Occupation, Location, Expertise and Award. These extended features are stored in the knowledge base for access by the system. Figure 2 shows the question analysis system.

Figure 2 The Question Analysis System

The procedure of question analysis is:
1. Input text "/sun//thon//phu//khue//khrai/"
2. Preprocessing
3. Identify the question type and prune
4. Identify the focus by semantic analysis, world knowledge and Rule 3
5. Build the query representation = question type + focus + list of extended features.

Figure 3 is a prototype of the query representation.

(Q focus NE Answer => Description
NE [Person : Description of NE]
Class X (Person) : Evaluation criteria
{ subclass {Person : (Description :
Name - Title [...] <Rank>
- Location, Experience, Achievement, ...
Publication - Books, Research, Innovation }}}

Figure 3 Query representation

5. Evaluation
The precision of our experiment in classifying the question type using the question word with the posterior Bayes' technique is 80%, and the recall is 78%. So far we solve the problems by applying the appropriate rule to each individual problem. The accuracy of the question recognizer is 75%, measured by comparing QA pairs (examined by an expert).

6. Future Works
We will collect the features of other Wh questions. We will also append the synsets and update the Question Ontology, and enhance question analysis by combining reasoning, constraints and suggestions for an optimum-size answering system.

References
1. Alfonseca, Enrique, Marco De Boni, José-Luis Jara-Valencia, Suresh Manandhar. 2001. A prototype Question Answering system using syntactic and semantic information for answer retrieval. Proceedings of the TREC-9 Conference.
2. E. Hovy, L. Gerber, U. Hermjakob, C. Lin. 2002. Question Answering in Webclopedia. Proceedings of the TREC-10 Conference.
3. Hacioglu, Kadri and Ward, Wayne. 2003. Question Classification with Support Vector Machines and Error Correcting Codes. Proceedings of NAACL/HLT-2003.
4. Luc Plamondon and Leila Kosseim. 2002. QUANTUM: A Function-Based Question Answering System. In Robin Cohen and Bruce Spencer (editors), Advances in Artificial Intelligence, 15th Conference of the Canadian Society for Computational Studies of Intelligence, AI 2002, Calgary, Canada.
5. T. Solorio, Manuel Pérez-Coutiño, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda and Aurelio López-López. 2005. A Language Independent Method for Question Classification. Computational Linguistics and Intelligent Text Processing: 6th International Conference, CICLing 2005, Mexico City, p. 291.
Mind Your Language: Some Information Retrieval and
Natural Language Processing Issues in the Development of an
Indonesian Digital Library
Stéphane Bressan
National University of Singapore
steph@nus.edu.sg
Mirna Adriani, Zainal A. Hasibuan, Bobby Nazief
University of Indonesia
{mirna,zhasibua,nazief}@cs.ui.ac.id
1. Introduction
In 1928, the vernacular Malay language was proclaimed by the Youth Congress, an Indonesian nationalist movement, the national language of Indonesia and renamed "Bahasa Indonesia", the Indonesian language. The Indonesian language is now the official language of the Republic of Indonesia, the fourth most populated country in the world. Although several hundred regional languages and dialects are used in the Republic, the Indonesian language is spoken by an estimated 228 million people, not counting an additional 20 million Malay speakers who can understand it. For a nation composed of several thousand islands and for its diasporas of students and professionals, the Internet and the applications it supports, such as the World Wide Web, email, discussion groups, and digital libraries, are essential media for cultural, economical, and social development. At the same time, the development of the Internet and its applications can be either a threat to the survival of indigenous languages or an opportunity for their development. The choice between cultural diversity and linguistic uniformity is in our hands, and the outcome depends on our capability to devise, design, and use tools and techniques for the processing of natural languages. Unfortunately, natural language processing requires extensive expertise and large collections of reference data.

Linguistics is anything but a prescriptive science. The rules underlying a language and its usages come from observation. Furthermore, speakers continuously modify existing rules and internalize new rules under the influence of socio-linguistic factors, not the least of which is the penetration of foreign words. The Indonesian language is a particularly vivid example of a living language in constant evolution. It includes vocabulary and constructions from a variety of other languages, from Javanese to Arabic, English, and Dutch. It comprises an unusual variety of idioms, ranging from a respected literary style to numerous regional dialects (e.g. Betawi) and slangs (e.g. Bahasa Gaul). Linguistic rules and data collections (dictionaries, grammars, etc.) are the foundation of computational linguistics and information retrieval. But their acquisition requires the convergence of significant amounts of effort and competence that smaller or economically challenged communities cannot afford. This compels semi-automatic or automatic and adaptive methods.

The project we are presenting in this paper is a collaboration between the National University of Singapore and the University of Indonesia. The research conducted in this project is concerned with the economical, and therefore semi-automatic or automatic, acquisition and processing of the linguistic information necessary for the development of other-than-English indigenous and multilingual information systems. The practical objective is to provide better access to the wealth of information and documents in the Indonesian language available on the World Wide Web and to technically sustain the development of an Indonesian digital library [25].

In this paper we present an overview of the issues we have met and addressed in the design and development of tools and techniques for the retrieval of information and the processing of text in the Indonesian language. We illustrate the need for adaptive methods by reporting the main results of four experiments in the identification of Indonesian documents, the stemming of Indonesian words, the tagging of parts of speech, and the extraction of named entities, respectively.

2. Identifying Indonesian Documents
The Indonesian Web, or the part of the World Wide Web containing documents primarily in the Indonesian language, is not an easily identifiable component. By the very nature of the Web itself, it is dynamic. Formally, using methods such as the one described in [14], or informally, one can safely estimate the size of the Indonesian Web to be several millions of documents. Web pages in Indonesian link to documents in English, Dutch, Arabic, or any other language. As we only wish to index Indonesian web pages, a language identification system that can tell whether a given document is written in Indonesian or not is needed.

Methods available for language identification [15] yield near perfect performance. However, these methods require a training set of documents in the languages to be discriminated. This setting is unrealistic in the context of the Web, as one can neither know in advance nor predict the languages to be discriminated.
We devised a method [24] that can learn from a training set of documents in the language to be distinguished only. To put it in Machine Learning terms, we devised an algorithm that learns from positive examples only. As its predecessor, our method is based on trigrams. The effectiveness of our method relies on the specificity of the trigram frequencies for a given language. The comparative performance evaluation shows a precision of 92% at a recall close to 100%. Figure 2.1 illustrates the performance of the initial method after learning from iteratively larger sets of positive examples.
[Figure: recall and precision curves, from 50% to 100%, over training sets from the initial set to the 10th set.]
Figure 2.1: Language Identification Performance
Yet this performance is still lower than that of the algorithms based on discriminating corpuses. To improve the initial performance and to make the solution adaptive to changes in the language and its usage, we devised a continuously learning method that uses the documents labeled as Indonesian by the algorithm to further train the algorithm itself. The performance evaluation of this Continuous-Learning Language Distinction quickly converged toward total recall and precision for random samples from the Web. The method even performs well under harsh conditions: it has for instance been able to distinguish Indonesian documents from documents in morphologically similar languages such as Tagalog and even Malay at very respectable levels of precision.
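To make the approach concrete, the following Python sketch illustrates trigram-profile learning from positive examples only, with the continuous-learning feedback loop described above. The coverage scoring and the 0.5 threshold are illustrative assumptions, not the published algorithm.

    # Minimal sketch: learn a trigram profile from Indonesian-only documents
    # and accept a new document when enough of its trigrams are covered.
    from collections import Counter

    def trigrams(text):
        text = " " + text.lower() + " "
        return [text[i:i + 3] for i in range(len(text) - 2)]

    def learn_profile(positive_docs):
        profile = Counter()
        for doc in positive_docs:
            profile.update(trigrams(doc))
        return profile

    def is_target_language(doc, profile, threshold=0.5):
        grams = trigrams(doc)
        covered = sum(1 for g in grams if g in profile)
        return covered / max(len(grams), 1) >= threshold

    def continuous_learning(stream, profile):
        # documents labeled positive are fed back into the profile,
        # in the spirit of the Continuous-Learning Language Distinction
        for doc in stream:
            if is_target_language(doc, profile):
                profile.update(trigrams(doc))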
3. Stemming

One of the basic tools for textual information indexing and retrieval is the word stemmer. Yet effective stemming algorithms are difficult to devise as they require a sound and complete knowledge of the morphology of the language.

In [18], we proposed a morphology-based stemmer for the Indonesian language. An evaluation using inflectional words from an Indonesian Dictionary [23] has shown that the algorithm achieves over 90% correctness in identifying root words [7, 18, 21]. The use of the algorithm improved the retrieval effectiveness of Indonesian documents [18]. In comparison with this morphology-based stemmer, a Porter stemmer and a corpus-based stemmer have been developed for Bahasa Indonesia [7]. However, the evaluation using inflectional words from an Indonesian Dictionary [23] showed that the morphology-based algorithm performed better in identifying root words [7, 18, 21]. Applying a root-word dictionary to all of the stemmer algorithms improved the identification of root words further [7].

In evaluating the effectiveness of the stemmer algorithms, we applied the stemmers to the retrieval of Indonesian documents using an information retrieval system. The result shows that the performance of the Porter and corpus-based stemmer algorithms for Bahasa Indonesia is comparable to that of the morphology-based algorithm.

In the field of information retrieval [25], stemming is used to abstract keywords from morphological idiosyncrasies, in the hope of improving retrieval performance. We noticed however a lower than expected retrieval performance after stemming (independently of the stemming algorithm). We explain this phenomenon by the fact that Indonesian morphology is essentially derivational (conceptual variations) as opposed to the morphologies of languages such as French or Slovene [19], which are primarily inflectional (grammatical variations). This result refines the conclusion of [19] that the effectiveness of stemming is commensurate to the degree of morphological complexity: we showed that it also depends on the nature of the morphology.

In a recent development of our research we have devised and evaluated a method for the mining of stemming rules from a training corpus of documents [11, 12]. The method induces prefix and suffix rules (and possibly infix rules, although this feature is computationally intensive). The method achieves from 80% to 90% accuracy (i.e. it is able to induce rules 80% to 90% of which are correct stemmings) from corpuses as short as 10000 words. In the experiments above, we have successfully applied the method to the Indonesian and Italian languages as well as to Tagalog.
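As an illustration of affix stripping combined with the root-word dictionary check discussed above, consider the following Python sketch. The rule lists and the dictionary are illustrative placeholders, not the mined rules or the actual stemmer of [18].

    # Minimal sketch: strip candidate prefixes/suffixes and prefer a
    # candidate confirmed by a root-word dictionary.
    PREFIXES = ["meng", "men", "mem", "me", "ber", "di", "ter", "pe"]
    SUFFIXES = ["kan", "an", "i"]
    ROOTS = {"ajar", "makan", "baca"}  # stand-in for a root-word dictionary

    def stem(word):
        candidates = [word]
        for p in PREFIXES:
            if word.startswith(p):
                candidates.append(word[len(p):])
        more = []
        for c in candidates:
            for s in SUFFIXES:
                if c.endswith(s):
                    more.append(c[:-len(s)])
        # a root confirmed by the dictionary is preferred, which the
        # evaluation above found to improve root identification
        for c in candidates + more:
            if c in ROOTS:
                return c
        return word

    print(stem("mengajarkan"))  # -> "ajar"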
4. Part of Speech Tagging

The Indonesian language is a morphologically rich language. There are around 35 standard affixes (prefixes, suffixes, circumfixes, and some infixes inherited from the Javanese language) (see [6]). Affixes can virtually be attached to any word and they can be iteratively combined. The wide use of affixes seems to have created a trend among Indonesian speakers to invent new affixes and affixation rules. This trend is discussed and documented in [23]. We refer to this set of affixes, which includes the standard set, as extended.

Part-of-speech tagging is the task of assigning the correct class (part-of-speech) to each word in a sentence. A part-of-speech can be a noun, verb, adjective, adverb, etc. Different word classes may occupy the same position, and similarly, a part-of-speech can take on different roles in a sentence. Automatic part-of-speech tagging is therefore the assignment of a part-of-speech class (or tag) to terms in a document.
In [11] and [12] we present several methods for the fully automatic acquisition of the knowledge necessary for part-of-speech tagging. The methods follow and extend the ideas in [21]. In particular they use various clustering algorithms. The methods we have devised neither use a tagged training corpus, as does the method in [3], nor consider a predefined set of tags, as does the method in [13]. Our evaluation of the effectiveness of the proposed methods, using the Brown corpus [5] tagged by the Penn Treebank Project [16], shows that the best of our methods achieves a consistent improvement over all other methods to which we compared it, with more than 80% of the words in the tested corpus being correctly tagged. The detailed results are given in Table 4.1. The table reports the average precision, recall and percentage of correctly tagged words for several methods based on trigrams (Trigram 1, 2 and 3), the state of the art methods (Schutze 1 and 2), and our proposed method (Extended Schutze).
Table 4.1: Part of Speech Tagging Performance

Method               Average Precision   Average Recall   % Correct
Trigram 1            0.70                0.60             64%
Trigram 2            0.74                0.62             66%
Trigram 3            0.76                0.62             67%
Extended Schutze's   0.90                0.72             81%
Schutze 1            0.53                0.52             65%
Schutze 2            0.78                0.71             80%
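The sketch below (assuming NumPy is available) illustrates the distributional idea behind these clustering methods, in the spirit of Schutze: each word is represented by the counts of its left and right neighbours among the most frequent words, and the resulting vectors are clustered, with cluster labels serving as induced tags. It is a sketch of the general technique, not the authors' implementation.

    from collections import Counter, defaultdict
    import numpy as np

    def context_vectors(tokens, n_features=50):
        # features are the most frequent words; each word gets a vector of
        # left-neighbour and right-neighbour counts over those features
        features = [w for w, _ in Counter(tokens).most_common(n_features)]
        index = {w: i for i, w in enumerate(features)}
        vec = defaultdict(lambda: np.zeros(2 * n_features))
        for i, w in enumerate(tokens):
            if i > 0 and tokens[i - 1] in index:
                vec[w][index[tokens[i - 1]]] += 1
            if i + 1 < len(tokens) and tokens[i + 1] in index:
                vec[w][n_features + index[tokens[i + 1]]] += 1
        return vec

    def kmeans(vectors, k=5, iters=20, seed=0):
        # plain k-means over the context vectors (needs len(vectors) >= k);
        # the resulting cluster ids play the role of induced POS tags
        words = list(vectors)
        X = np.array([vectors[w] for w in words], dtype=float)
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
            for j in range(k):
                if (labels == j).any():
                    centroids[j] = X[labels == j].mean(axis=0)
        return dict(zip(words, labels))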
A particularly striking result is the appearance of finer granularity clusters of words which are not only of the same part of speech but also share the same affixes (e.g. “menangani”, “mengatasi”, “mengulangi” share the circumfix “me-i”), the same semantic category (e.g. “Indonesia”, “Jepang”, “Eropa”, “Australia” are names of geo-political entities) or both (e.g. “mengatakan”, “menyatakan”, “mengungkapkan”, “menegaskan” are synonyms meaning “to say”). Indeed, the Indonesian language has not only a derivational morphology, but also, as most languages, a concordance of the paradigmatic and the syntagmatic components.
The last remark suggests that a similar approach to the one we have used for part-of-speech tagging can be applied to a mainly paradigmatic tagging and therefore to the extraction of information.

5. Named Entity Extraction

To illustrate our objective, let us consider the motivating example from which we wish to extract an XML document describing the meeting taking place:

“British Foreign Office Minister O'Brien (right) and President Megawati pose for photographers at the State Palace.”

Figure 5.1 contains the manually constructed XML we hope to obtain in fine. In italic are highlighted the components that require global, ancillary, or external knowledge. Indeed, although we expect that similar methods (association rules, maximum entropy) can be used to learn the model of combination of elementary entities into complex elements, we also expect that global, ancillary, and external knowledge will be necessary, such as lists of names of personalities (Mike O'Brien, Megawati Sukarnoputri), gazetteers (Jakarta is in Indonesia), document temporal and geographical context (Jakarta, 05/06/2003), etc.

<meeting>
  <date format="europe">05/06/2003</date>
  <location>
    <name>State Palace</name> <city>Jakarta</city>
    <country>Indonesia</country>
  </location>
  <participants>
    <person>
      <name>Megawati Soekarnoputri</name>
      <quality>President</quality>
      <country>Indonesia</country>
    </person>
    <person>
      <name>Mike O'Brien</name>
      <quality>Foreign Office Minister</quality>
      <country>Britain</country>
    </person>
  </participants>
</meeting>

Figure 5.1: Sample XML extracted from a text

In [8, 9, 10] we present our preliminary results in an effort to extract structured information in the form of an XML document from texts. We believe this is possible under some ontological hypothesis for a given and well identified application domain. Our preliminary results are only concerned with the individual tagging of named entities such as locations, person names, and organizations. Table 5.1 illustrates the performance of an association rule based technique on a corpus of 1,258 articles from the online versions of two mainstream Indonesian newspapers, Kompas (kompas.com) and Republika (republika.co.id).

Table 5.1: Named Entity Recognition Performance

Recall      Precision   F-Measure
60.16%      58.86%      59.45%

On the corpuses to which we have applied it, our method outperforms state of the art techniques such as [4].

We applied the named entity tagger that identifies persons, organizations, and locations [10] to an information retrieval task, a question answering task for Indonesian documents [17]. The limited success of the experiment compels further research in this domain.
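The following Python sketch illustrates how mined association rules of this kind might be applied at tagging time. The rules, features, and confidences are invented for illustration and are not the rule sets of [8, 9, 10].

    # Minimal sketch: each rule is (conditions on contextual/morphological
    # features, tag, confidence); the most confident matching rule fires.
    RULES = [
        ({"prev_word": "di", "capitalised": True}, "LOCATION", 0.80),
        ({"prev_word": "presiden", "capitalised": True}, "PERSON", 0.85),
        ({"suffix": ".com", "capitalised": False}, "ORGANIZATION", 0.70),
    ]

    def features(tokens, i):
        w = tokens[i]
        return {
            "prev_word": tokens[i - 1].lower() if i > 0 else "",
            "capitalised": w[:1].isupper(),
            "suffix": w[-4:],
        }

    def tag(tokens):
        out = []
        for i, w in enumerate(tokens):
            f = features(tokens, i)
            matches = [(c, t) for cond, t, c in RULES
                       if all(f.get(k) == v for k, v in cond.items())]
            out.append((w, max(matches)[1] if matches else "O"))
        return out

    print(tag("Presiden Megawati berada di Jakarta".split()))
    # -> Megawati tagged PERSON, Jakarta tagged LOCATION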
6. Conclusion

While attempting to design and implement tools and techniques for the processing of documents in the Indonesian language on the Web and for the construction of an Indonesian digital library, we were faced with the unavailability of linguistic data and knowledge as well as with the prohibitive cost of data and knowledge collection.

This situation compelled the design and development of semi-automatic or automatic techniques for, or ancillary to, tasks as varied as language identification, stemming, part-of-speech tagging and information extraction. The dynamic nature of languages in general, and of the Indonesian language in particular, also compelled adaptive methods. We have summarized in this paper the main results we have obtained so far.

Our work continues in the same philosophy while addressing new tasks such as spelling error correction, structured information extraction as mentioned above, or phonology for text-to-speech and speech-to-text conversion.

References

[1] Adriani, Mirna and Rinawati. Finding Answers to Indonesian Questions from English Documents. In Working Notes of the Workshop of the Cross-Language Evaluation Forum (CLEF), Vienna, September 2005.

[2] Bressan, S., and Indradjaja, L., Part-of-Speech Tagging without Training. In Proc. of Intelligence in Communication Systems, IFIP International Conference, INTELLCOMM (2004).

[3] Brill, E. Automatic Grammar Induction and Parsing Free Text: A Transformation-based Approach. In Proceedings of ACL 31, Columbus OH (1993).

[4] Chieu, H.L., and Hwee Tou Ng, Named Entity Recognition: A Maximum Entropy Approach Using Global Information. In Proceedings of the 19th International Conference on Computational Linguistics (2002).

[5] Francis, W.N. and Kucera, F. Frequency Analysis of English Usage. Houghton Mifflin, Boston (1982).

[6] Harimurti Kridalaksana. Pembentukan Kata Dalam Bahasa Indonesia. P.T. Gramedia, Jakarta (1989).

[7] Ichsan, Muhammad. Pemotong Imbuhan Berdasarkan Korpus Untuk Kata Bahasa Indonesia. Tugas Akhir S-1, Fakultas Ilmu Komputer, Universitas Indonesia (2005). (in Indonesian)

[8] Indra Budi, Bressan, S., and Hasibuan, Z., Pencarian Association Rules untuk Pengenalan Entitas Nama. In Proc. of the Seminar on Bringing Indonesian Language toward Globalization through Language, Information and Communication Technology (2003). (in Indonesian)

[9] Indra Budi and Bressan, S., Association Rules Mining for Name Entity Recognition. In Proc. of the Conference on Web Information Systems Engineering (WISE) (2003).

[10] Indra Budi, Stéphane Bressan, Gatot Wahyudi, Zainal A. Hasibuan, Bobby Nazief. Named Entity Recognition for the Indonesian Language: Combining Contextual, Morphological and Part-of-Speech Features into a Knowledge Engineering Approach. Discovery Science (2005).

[11] Indradjaja, L. and Bressan, S., Automatic Learning of Stemming Rules for the Indonesian Language. In Proc. of the 17th Pacific Asia Conference on Language, Information and Computation (2003).

[12] Indradjaja, L. and Bressan, S., Penemuan Aturan Pengakaran Kata secara Otomatis. In Proc. of the Seminar on Bringing Indonesian Language toward Globalization through Language, Information and Communication Technology (2003). (in Indonesian)

[13] Jelinek, F. Robust Part-of-speech Tagging using a Hidden Markov Model. Technical Report, IBM T.J. Watson Research Center (1985).

[14] Lawrence, Steve and Giles, C. Lee. Searching the World Wide Web. Science, Vol. 280 (1998).

[15] Lazzari, G., et al. Speaker-language identification and speech translation. In Multilingual Information Management: Current Levels and Future Abilities, delivered to US Defense ARPA, April 1999.

[16] Marcus, M., Kim, G., Marcinkiewicz, M., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., and Schasberger, B. The Penn Treebank: Annotating Predicate Argument Structure. In ARPA Human Language Technology Workshop (1994).

[17] Natalia, Dessy. Penemuan Jawaban Pada Dokumen Berbahasa Indonesia. Tugas Akhir S-1, Fakultas Ilmu Komputer, Universitas Indonesia (2006). (in Indonesian)

[18] Nazief, Bobby and Adriani, Mirna. A Morphology-Based Stemming Algorithm for Bahasa Indonesia. Technical Report, Faculty of Computer Science, Universitas Indonesia (1996).

[19] Popovic, Mirko and Willett, Peter. The effectiveness of stemming for natural-language access to Slovene textual data. Journal of the American Society for Information Science, Vol. 43, June 1992, pp. 384-390.

[20] Schutze, Hinrich (1999). Distributional Part-of-speech Tagging. In EACL 7, pages 141-148.

[21] Siregar, Neil Edwin F. Pencarian Kata Berimbuhan Pada Kamus Besar Bahasa Indonesia dengan menggunakan algoritma stemming. Tugas Akhir S-1, Fakultas Ilmu Komputer, Universitas Indonesia (1995). (in Indonesian)

[22] Tim Penyusun Kamus. Kamus Besar Bahasa Indonesia, 2nd ed. Balai Pustaka (1999).

[23] Vinsensius, V. and Bressan, S., Continuous-Learning Weighted-Trigram Approach for Indonesian Language Distinction: A Preliminary Study. In Proceedings of the 19th International Conference on Computer Processing of Oriental Languages (2001).

[24] Vinsensius, V. and Bressan, S., Temu-Kembali Informasi untuk Dokumen-dokumen dalam Bahasa Indonesia. In Electronic Proceedings of the Indonesia DLN Seminar (2001). (in Indonesian)

[25] Baeza-Yates, R. and Ribeiro-Neto, B., Modern Information Retrieval. ACM Press, New York (1999).
Searching Method for English-Malay Translation Memory
Based on Combination and Reusing Word Alignment
Information
Suhaimi Ab. Rahman, Normaziah Abdul Aziz, Abdul Wahab Dahalan
Knowledge Technology Lab
MIMOS,
Technology Park Malaysia,
57000 Kuala Lumpur, Malaysia.
smie@mimos.my, naa@mimos.my, wahab@mimos.my
Abstract
This paper describes the searching method used in a Translation Memory (TM) for translating English to the Malay language. It applies a phrase look-up matching technique: the system locates translation fragments in several examples, where the longer each fragment is, the better the match. The system then generates the translation suggestion by combining these translation fragments with the assistance of word alignment information.
1 Introduction

The purpose of a Translation Memory (TM) system is to assist human translation by re-using pre-translated examples. Several works have been done in this area, such as (Hua et al., 2005; Simard and Langlais, 2001; Macklovitch and Russell, 2000), among others. A TM system has three parts: a) the translation memory itself, which records example translation pairs (together with word alignment information); b) the search engine, which retrieves related examples from the translation memory; and c) the on-line learning mechanism, which learns newly translated translation pairs. When translating a sentence, the TM provides the translation of the best-matched pre-translated example as the translation suggestion.

We developed a TM as an additional tool on top of our existing Machine Translation system (1) to translate documents from English to Malay.

(1) The present MT is an EBMT, a project we embarked on with Universiti Sains Malaysia, available for usage at www.terjemah.net.my

2 English-Malay TM System - Basic Principle

Zerfass (2002) describes that the text to be translated consists of smaller units like headings, sentences, list items, index entries and so on. These text components are called segments. Figure 1 shows the overall process of segment lookup using phrase look-up matching.

[Figure 1. An overall process of lookup segment using phrase look-up matching: an input sentence undergoes phrase look-up matching against the TM database; the translation of an identical segment is filled in and a translation is offered; the translation accepted by translators/users is saved as a new translation to the database.]
3 Phrase Look-up Matching

Phrase look-up matching is used to find a suggested meaning for a phrase by parsing the phrase into sub-phrases, finding a meaning for these sub-phrases, and combining the results to get the final output. Figure 2 shows an example of the phrase look-up matching process.

Input sentence:
Selected planting materials are picked when they are 30-60 cm high at about 4 months before harvesting

Split into:
Selected planting materials are picked when they are 30-60 cm high at about 4 months before | harvesting

Process the left part; no result found, split further
...
Selected planting materials are picked when they are 30-60 cm high | at about 4 months before harvesting

Result found; applying algorithm, add result to the output

Basic output:
“Bahan-bahan tanaman terpilih dipetik apabila ianya mencapai ketinggian 30-60 sm”

Process the right side:
at about 4 months before harvesting

Result found; applying algorithm, add result to previous output

Basic output:
“Bahan-bahan tanaman terpilih dipetik apabila ianya mencapai ketinggian 30-60 sm” + “pada kira-kira 4 bulan sebelum penuaian”

Figure 2. Example of the Phrase Look-Up Matching using a Bi-Section Algorithm
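A minimal Python sketch of the bi-section look-up of Figure 2 is given below. The dictionary tm is an illustrative stand-in for the SAT/WAT-backed example lookup of the actual system.

    # Minimal sketch: try the longest prefix found in the memory,
    # translate it, then process the remainder recursively.
    def translate(tm, words):
        if not words:
            return []
        for cut in range(len(words), 0, -1):  # longest fragment first
            fragment = " ".join(words[:cut])
            if fragment in tm:
                return [tm[fragment]] + translate(tm, words[cut:])
        return [words[0]] + translate(tm, words[1:])  # no match: keep word

    tm = {
        "selected planting materials are picked when they are 30-60 cm high":
            "bahan-bahan tanaman terpilih dipetik apabila ianya mencapai "
            "ketinggian 30-60 sm",
        "at about 4 months before harvesting":
            "pada kira-kira 4 bulan sebelum penuaian",
    }
    sentence = ("selected planting materials are picked when they are "
                "30-60 cm high at about 4 months before harvesting")
    print(" ".join(translate(tm, sentence.split())))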
3.1 Repetition Avoidance

The basic output has no problems in structure, but it may contain repeated words. These repeated words are generated because of the way we deal with the source-target pairs. We implemented the repetition avoidance algorithm to solve this problem. Let us consider the example shown in Figure 3, obtained for the following input.

Input Sentence:
If you choose a non-clinical program.

Example retrieved from the Sentence Alignment Table (SAT):
Source sentence (E): If you choose a non-clinical program you have a greater responsibility for monitoring your own health
Target sentence (M): Jika anda memilih suatu program bukan klinikal anda mempunyai tanggungjawab yang lebih besar untuk memantau kesihatan anda

Basic Output:
“Jika anda memilih suatu program bukan klinikal anda anda”

Figure 3. An Example of Basic Output with a Repetition of Word

The target word “anda” for the source words you and your is repeated three times in the output, although your is mentioned only once in the source sentence. Although you and your are two different words, the program treats them as repeated similar words: since both you and your mean “anda” in Malay, the repetition avoidance algorithm considers the three occurrences of “anda” in the output to be repetitions.

To determine which occurrences of the word “anda” need to be selected from the above basic output, we use a mathematical model, the inter-phrase word-to-word distance summation.
3.2 Inter-Phrase Word-to-Word Distance Summation
This algorithm computes, for each word, a summation value that gives a clue as to which word is repeated and needs to be omitted and which is not. Each word that belongs to the basic output has one summation value, which we call dj. A word with a large dj value is more likely to be a repeated word. To obtain the summation value dj of a word, we sum the word-to-word distance values between that word and the rest of the words in the basic output. The word-to-word distance is the number of words that separate one word from another in the original SAT entry. The selection and combination of words from the SAT is based on the aligned words retrieved from the Word Alignment Table (WAT).
To determine the distance value between a word in the basic output (at location loci) and a word in the SAT's target sentence (at location locj), we use the formula:

di = | locj - loci |,  where i, j = 0, 1, ..., n

loci : location value in the basic output
locj : location value in the SAT's target sentence

By applying this formula, the distance value di is obtained by subtracting the two values locj and loci. This process continues until all word locations have been subtracted properly. Table 1 shows the values of loci and locj for each word in the basic output and in the target sentence from the SAT, while Table 2 shows the matrix of all distance values di.
Running the Inter-Phrase Word-to-Word Distance Summation algorithm first gives us the total summation of all distance calculation results. The value dj is the summation of all the word distance values generated from di:

dj = Σ (i = 0 to n) | locj - loci |

That is, dj is the sum of the absolute values of the differences between the location of word locj and each other single entry loci in the basic output. The value of this summation is the main source of judgment for the choice between repeated words.

Table 2 gives the details of the summation of word-to-word distances. The dj summation values are used as judgment values to omit the extra “anda”: we cross out the row values belonging to the conflicting words (for example the word “anda”), and then sum the rest to get the dj of each word. Since there are two occurrences of you in the source sentence from the SAT but only one in the input, one of them must be omitted. Figure 4 depicts the plot of the summation values (dj) against the basic output positions (loci).
idx   Basic output   loci   |   SAT           locj
0     jika           0      |   Jika          0
1     anda           1      |   anda          1
2     memilih        2      |   memilih       2
3     suatu          3      |   suatu         3
4     program        4      |   program       4
5     bukan          5      |   bukan         5
6     klinikal       6      |   klinikal      6
7     anda           7      |   anda          7
8     anda           8      |   mempunyai     8
                            |   ...
                            |   lebih besar   10
                            |   ...
                            |   anda          14

Table 1. The location of each word in the basic output and in the SAT.
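The following Python sketch computes the summation dj for each candidate occurrence of the repeated word and keeps the occurrences with the smallest values; the locations follow Table 1, while the helper names are illustrative.

    # Minimal sketch of the inter-phrase word-to-word distance summation:
    # dj = sum over i of | locj - loci |; occurrences of the repeated word
    # with the largest dj are dropped.
    def dj(loc_j, output_locs):
        return sum(abs(loc_j - loc_i) for loc_i in output_locs)

    def select_occurrences(candidate_locs, output_locs, keep_n):
        # keep the keep_n occurrences with the smallest summation values
        return sorted(candidate_locs, key=lambda lj: dj(lj, output_locs))[:keep_n]

    output_locs = list(range(8))   # jika ... klinikal, anda (Table 1)
    candidates = [1, 7, 14]        # locations of "anda" in the SAT target
    print(select_occurrences(candidates, output_locs, keep_n=1))
    # -> [1]: the occurrences at locations 7 and 14 have larger dj and are
    # dropped, giving "Jika anda memilih suatu program bukan klinikal"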
The thick dotted line represents the margin between accepted and non-accepted words, i.e. everything to the left of the thick dotted line is accepted and the rest is not. After removing the dropped words from the basic output, we have the final output: “Jika anda memilih suatu program bukan klinikal”.
4 Result

We have tested this technique on our 7,000 English-Malay bilingual sentences and found that phrase look-up matching with the Inter-Phrase Distance Summation technique can reduce important errors arising from the repetition of the same word meaning from the target sentence.
5 Conclusion

This paper describes a translation memory system using a phrase look-up matching technique. This technique generates translation suggestions through the word alignment information in the pre-translated examples. We have also implemented a mathematical model that is used to make logical judgments that help in maintaining the accuracy of the output sentence structure. The accuracy and the quality of the translation depend on the number of examples in our TM database, i.e. we can improve the quality by increasing the number of examples in the translation memory and word alignment information database.
[Table 2. Summation of word-to-word distances: the matrix of distance values di = | locj - loci | with the dj row sums; the numeric entries are not reproduced here.]

[Figure 4. Relation between the index and the Inter-Phrase Distance Summation: summation values (dj) plotted against the basic output positions (loci), with the accepted area to the left of the margin.]
Acknowledgement

We would like to acknowledge our research assistant, Ahmed M. Mahmood from the International Islamic University Malaysia, who also contributed his ideas to this work.
References

Angelika Zerfass. 2002. Evaluating Translation Memory Systems. In Proc. of the LREC 2002 Workshop “Annotation Standards for Temporal Information in Natural Language”, Las Palmas, Canary Islands, Spain.

Atril – Déjà Vu. http://www.atril.com

Elliott Macklovitch and Graham Russell. 2000. What's been Forgotten in Translation Memory. In Proc. of the 4th Conference of the Association for Machine Translation in the Americas (AMTA-2000), pages 137-146, Mexico.

Michel Simard and Philippe Langlais. 2001. Sub-Sentential Exploitation of Translation Memories. In Proc. of the 8th Machine Translation Summit (MT Summit VIII), pages 331-339, Santiago de Compostela, Galicia, Spain.

Trados – Translator's Workbench. http://www.trados.com/

WU Hua, WANG Haifeng, LIU Zhanyi, TANG Kai. 2005. Improving Translation Memory with Word Alignment Information. In Proc. of the 10th Machine Translation Summit (MT Summit X), pages 364-371, Phuket, Thailand.
A Phrasal EBMT System for Translating English to Bengali
Sudip Kumar Naskar
Comp. Sc. & Engg. Dept.
Jadavpur University
Kolkata, India 700032
sudip.naskar@gmail.com
Abstract
The present work describes a hybrid MT system from English to Bengali that uses the TnT tagger to assign POS categories to the tokens, identifies phrases through a shallow analysis, retrieves the target phrases using a Phrasal Example Base, and finally assigns a meaning to the sentence as a whole by combining the target language translations of the constituent phrases.
1 Introduction

Bengali is the fifth language in the world in terms of the number of native speakers and is an important language in India. But to date there is no English-Bengali machine translation system available (Naskar and Bandyopadhyay, 2005b).
2 Translation Strategy

In order to translate from English to Bengali (Naskar and Bandyopadhyay, 2005a), the tokens identified from the input sentence are POS tagged using the hugely popular TnT tagger (Brants, 2000). The TnT tagger identifies the syntactic category the tokens belong to in the particular context. The output of the TnT tagger is filtered to identify multiword expressions (MWEs) and the basic POS of each word / term, along with additional information from WordNet (Fellbaum, 1998). During morphological analysis, the root words / terms (including idioms, named entities, abbreviations, acronyms), along with associated syntactico-semantic information, are extracted.

Based on the POS tags assigned to the words / terms, a rule-based chunker (shallow parser) identifies the constituent chunks (basic non-recursive phrase units) of the source language sentence and tags them to encode all relevant information that might be needed to translate the phrase and perhaps resolve ambiguities in other phrases. A DFA has been written for identifying each type of chunk: NP, VP, PP, ADJP and ADVP. The verb phrase (VP) translation scheme is rule based and uses Morphological Paradigm Suffix Tables. The rest of the phrases (NP, PP, ADJP and ADVP) are translated using example bases of syntactic transfer rules. A phrasal Example Base is used to retrieve the target language phrase structure corresponding to each input phrase. Each phrase is translated individually to the target language (Bengali) using Bengali synthesis rules (Naskar and Bandyopadhyay, 2005c). Finally, those target language phrases are arranged using some heuristics, based on the word ordering rules of Bengali, to form the target language representation of the source language sentence. Named Entities are transliterated using a modified joint source-channel model (Ekbal et al., 2006).

The structures of NP, ADJP and ADVP are somewhat similar in both English and Bengali. But the VP and PP constructions differ markedly between English and Bengali. First of all, in Bengali, there is no concept of preposition. English prepositions are handled in Bengali using inflexions on the reference objects (i.e., the noun that follows a preposition in a PP), and / or post-positional words after them (Naskar and Bandyopadhyay, 2006). Moreover, inflexions in Bengali get attached to the reference objects and relate them to the main verb of the sentence in case or karaka relations. An inflexion has no existence of its own in Bengali, nor does it have any meaning of its own, whereas English prepositions have their own existence, i.e., they are separate words. Verb phrases in both English and Bengali depend on the person and number information of the subject and the tense and aspect information of the verb. But for any particular root
verb, there are only a few verb forms in English, whereas Bengali shows a lot of variations.
3 POS Tagging and Morphological Analysis

The input text is first segmented into sentences, and each sentence is tokenized into words. The tokens identified at this stage are then subjected to the TnT tagger, which assigns a POS tag to every word. The HMM-based TnT tagger (Brants, 2000) is at par with other state-of-the-art POS taggers.
The output of the TnT tagger is filtered to identify MWEs using WordNet and additional resources like lists of acronyms, abbreviations, named entities, idioms, figures of speech, phrasal adjectives, and phrase prepositions.

Although the freely available WordNet (version 2.0) package provides a set of programs for accessing and integrating WordNet, we have developed our own interface to integrate WordNet into our system, implementing the particular set of functionalities required by our system. In addition to the eight noun suffixes existing in WordNet, we have added three more noun suffixes – “ 's ”, “ ' ”, “ s' ” – to the noun suffix set.

Multiword expressions or terms are identified in this phase and are treated as a single token. These include multi-word nouns, verbs, adjectives, adverbs, phrase prepositions, phrase adjectives, idioms etc. Sequences of digits and certain types of numerical expressions, such as dates and times, monetary expressions, and percentages, are also treated as a single token; they can appear in many different forms and variations.
4 Syntax Analysis

In this module, a rule-based shallow parser (chunker) has been developed that identifies and extracts the various chunks (basic non-recursive phrase units) from a sentence and tags them. A sentence can have different types of phrases: NP, VP, PP, ADJP and ADVP. We have defined a formal grammar for each of them that identifies the phrase structure based on the POS information of the tokens (words / terms). For example, the system chunks the sentence “Teaching history gave him a special point of view toward current events” as given below:

[NP Teaching history] [VP gave] [NP him] [NP a special point of view] [PP toward current events].

5 Parallel Example Base

The tables containing the proper nouns, acronyms, abbreviations, and figures of speech in English and the corresponding Bengali translations are the literal example bases. The phrasal templates for the NPs, PPs, ADJPs, and ADVPs store the part of speech of the constituent words along with necessary semantic information. The source and the target phrasal templates are stored in example bases, expressed as context-sensitive rewrite rules, using semantic features. These translation rules are effectively transfer rules. Some of the MWEs in WordNet represent pattern examples (e.g., make up one's mind, cool one's heels, get under one's skin; one's representing a possessive pronoun).

6 Translating NPs and PPs

NPs and PPs are translated using the phrasal example base and bilingual dictionaries. Some examples of transfer rules are given below for NPs:

<det & a> <n & singular, human, nom> → <ekjon> <n′>
<det & a> <adj> <n & singular, inanimate> → <ekti> <adj′> <n′>
<prn & genitive> <n & plural, human, nom> → <prn′> <n′> <-era/ra>

Below are some examples of transfer rules for PPs:

<prep & with/by> <n & singular, instrument> → <n′> <diye>
<prep & with> <n & singular, person> → <n′> <-yer/er/r> <songe>
<prep & before> <n & artifact> → <n′> <-yer/er/r> <samne>
<prep & before> <n & !artifact> → <n′> <-yer/er/r> <age>
<prep & till> <n & time/place> → <n′> <porjonto>
<prep & in/on/at> <n & singular, place> ↔ <n′> <-e/te/y>

Using the transfer rules, we can translate the following NPs:

<det & a> <n & man (sng, human, nom)> ↔ <ekjon> <chele>
<det & a> <n & book (sng, inanimate, acc)> ↔ <ekti> <boi>
<prn & my (gen)> <n & friends (plr, human, nom)> ↔ <amar> <bondhura>
<n & Ram’s (sng, gen)> <n & friends (plr, human, dat)> ↔ <ramer> <bondhuderke>

Similarly, below are some candidate PP translations:

<prep & with> <prn & his (gen)> <n & friends (plr, human, nom)> ↔ <tar> <bondhuder> <sathe>
<prep & in> <n & school (sng, inanimate, loc)> ↔ <bidyalaye>
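A minimal Python sketch of how such transfer rules might be matched and rewritten is given below. The rule encoding, the lexicon, and the function name are illustrative simplifications of the example base described above, not the system's actual data structures.

    # Minimal sketch: a source pattern (POS plus semantic features) is
    # matched and rewritten into a Bengali template; "n'" marks the slot
    # filled with the translated noun.
    RULES = [
        (["det:a", "n:singular,human,nom"], ["ekjon", "n'"]),
        (["det:a", "n:singular,inanimate"], ["ekti", "n'"]),
    ]
    LEXICON = {"man": "chele", "book": "boi"}

    def translate_np(det, noun, features):
        pattern = [f"det:{det}", f"n:{features}"]
        for src, tgt in RULES:
            if src == pattern:
                return [LEXICON.get(noun, noun) if t == "n'" else t
                        for t in tgt]
        return None

    print(translate_np("a", "man", "singular,human,nom"))   # ['ekjon', 'chele']
    print(translate_np("a", "book", "singular,inanimate"))  # ['ekti', 'boi']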
7 Translating VPs
Bengali verbs have to agree with the subject in person and formality. Bengali verb phrases are formed by appending appropriate suffixes to the root verb. Some verbs in English are translated into Bengali using a combination of a semantically ‘light’ verb and another meaning unit (a noun, generally) to convey the appropriate meaning. In the English-to-Bengali context this phenomenon is very common, e.g., to swim – santar (swimming) kata (cut), to try – chesta (try) kara (do).

Bengali verbs are morphologically very rich. A single verb root has many morphological variants. The Bengali representation of the ‘be’ verb is formed by suffixing the present root ach, the past root chil and the future root thakb with the appropriate tense and person information. The negative form of the ‘be’ verb in the present tense is nei for any person, and in the past and future tenses it is formed by simply adding the word na postpositionally after the corresponding assertive form.

Root verbs in Bengali can be classified into different groups according to their spelling pattern. All the verbs belonging to the same spelling-pattern category take the same suffix for the same person, tense and aspect information. These suffixes also change from the Classical to the Colloquial form of Bengali. There are separate morphological paradigm suffix tables for the verb stems that have the same spelling pattern. There are some exceptions to these rules.

The negative forms are formed by adding na or ni postpositionally. Other verb forms (gerund-participle, dependent gerund, conjunctive participle, infinitive-participle etc.) are also taken care of in the same way by adding appropriate suffixes from a suffix table. Further details can be found in (Naskar and Bandyopadhyay, 2004).
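The following Python sketch illustrates the paradigm-table idea: verb roots are mapped to a spelling-pattern group whose table selects a suffix by person and tense, with negation added postpositionally. The groups and suffixes shown are illustrative placeholders, not the actual tables.

    # Minimal sketch of a morphological paradigm suffix table.
    SUFFIX_TABLES = {
        "group1": {("3rd", "present"): "e", ("3rd", "past"): "lo"},
    }
    VERB_GROUP = {"kar": "group1"}  # root -> spelling-pattern group

    def synthesise(root, person, tense, negative=False):
        suffix = SUFFIX_TABLES[VERB_GROUP[root]][(person, tense)]
        form = root + suffix
        # negatives are formed postpositionally with "na" / "ni"
        return form + " na" if negative else form

    print(synthesise("kar", "3rd", "present"))              # kare
    print(synthesise("kar", "3rd", "past", negative=True))  # karlo na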
8 Word Sense Disambiguation
The word sense disambiguation algorithm is based on eXtended WordNet (version 2.0-1.1) (Harabagiu et al., 1999). The algorithm takes a global approach where all the words in the context window are simultaneously disambiguated in a bid to get the best combination of senses for all the words in the window instead of only a single word. The context window is made up of all the WordNet word tokens present in the current sentence under consideration. A word bag is constructed for each sense of every content word. The word bag for a word-sense combination contains synonyms and content words from the associated tagged glosses of the synsets that are related to the word-sense through various WordNet relationships for different parts of speech.

Each word (say Wi) in the context is compared with every word in the gloss-bag of every sense (say Sj) of every other word (say Wk) in the context. If a match is found, the words are checked further for a part-of-speech match. If the words match in part-of-speech as well, a score is assigned to both words: the word being matched (Wi) and the word whose gloss-bag contains the match (Wk). This matching event indicates mutual confidence towards each other, so both words are rewarded by scoring for this event. A word-sense pair thus gets scores from two different sources: when disambiguating the word itself and when disambiguating neighboring words. Finally, these two scores are combined to arrive at the combination score for a word-sense pair. The sense of a word for which maximum overlap is obtained between the context and the word bag is identified as the disambiguated sense of the word.

The baseline algorithm is modified to include more context. Increasing the context size, by adding the previous and next sentences to the context, resulted in much better performance: 61.77% precision and 85.9% recall, tested on the first 10 SemCor 2.0 files.
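The following Python sketch illustrates the maximum-overlap sense selection described above, in a simplified form (it omits the mutual two-source scoring and the part-of-speech check). The gloss-bags are toy data, not eXtended WordNet.

    # Minimal sketch: pick, for each context word, the sense whose
    # gloss-bag overlaps most with the rest of the context.
    GLOSS_BAGS = {
        ("bank", "finance"): {"money", "deposit", "loan"},
        ("bank", "river"): {"water", "shore", "slope"},
        ("loan", "credit"): {"money", "bank", "interest"},
    }

    def disambiguate(context):
        best = {}  # word -> (sense, overlap score)
        for (word, sense), bag in GLOSS_BAGS.items():
            if word not in context:
                continue
            overlap = sum(1 for w in context if w != word and w in bag)
            if overlap > best.get(word, ("", -1))[1]:
                best[word] = (sense, overlap)
        return {w: s for w, (s, _) in best.items()}

    print(disambiguate(["bank", "loan", "money"]))
    # -> {'bank': 'finance', 'loan': 'credit'}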
9 Resources
WordNet (version 2.0) is the main lexical resource used by the system. We have a separate non-content word dictionary. An English-Bengali dictionary has been developed which maps WordNet English synsets to Bengali synsets. For the actual translation, the first Bengali word (synonym) in the synset is always taken by the system, so there is no scope for lexical choice in this work. But during dictionary development, the Bengali word that is most used by native speakers is kept at the beginning of the Bengali synset. So effectively, the most frequently used Bengali synonyms are picked up by the system during dictionary look-up.

Figure of Speech expressions in English have been paired with their corresponding counterparts in the target language and these pairs have been stored in a separate Figure of Speech Dictionary. Idioms are also translated using a direct example base. The morphological suffix paradigm tables are maintained for all verb groups; they help in translating VPs.

Named Entities are transliterated. If there is any acronym or abbreviation within the named entity, it is translated. For this purpose, the system uses an acronym / abbreviation dictionary that includes the different acronyms / abbreviations occurring in the news domain and their corresponding representations in Bengali. The transliteration scheme is knowledge-based. It has been trained on a bilingual proper name example base containing more than 6000 parallel names of Indian origin. The transliteration process uses a modified joint source-channel approach. The transliteration mechanism (especially the chunking of transliteration units) is linguistically motivated and makes use of a linguistic knowledge base.

For sense disambiguation we make use of eXtended WordNet (version 2.0-1.1).
10 Conclusion

Anaphora resolution has not been considered by the system, since it is required only for proper translation of personal pronouns. Only second and third person personal pronouns have honorific variants in Bengali; pronouns can be translated assuming a default highest honor.

The system has not been evaluated yet, as some parts (especially the dictionary creation) are under development. We intend to evaluate the MT system using the BLEU metric (Papineni et al., 2002).
References

Asif Ekbal, Sudip Kumar Naskar and Sivaji Bandyopadhyay. 2006. A Modified Joint Source-Channel Model for Transliteration. In the Proceedings of COLING-ACL 2006, Sydney, Australia.

Fellbaum, C., ed. 1998. WordNet – An Electronic Lexical Database. MIT Press.

Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. IBM Research Division Technical Report, RC22176 (W0190-022), Yorktown Heights, N.Y.

S. Harabagiu, G. Miller, D. Moldovan. 1999. WordNet2 – a Morphologically and Semantically Enhanced Resource. In Proceedings of SIGLEX-99, pages 1-8, University of Maryland.

Sudip Kumar Naskar and Sivaji Bandyopadhyay. 2006. Handling of Prepositions in English to Bengali Machine Translation. In the Proceedings of the Third ACL-SIGSEM Workshop on Prepositions, EACL 2006, Trento, Italy.

Sudip Kumar Naskar and Sivaji Bandyopadhyay. 2005a. A Phrasal EBMT System for Translating English to Bengali. In the Proceedings of MT SUMMIT X, Phuket, Thailand.

Sudip Kumar Naskar and Sivaji Bandyopadhyay. 2005b. Use of Machine Translation in India: Current Status. In the Proceedings of MT SUMMIT X, Phuket, Thailand.

Sudip Kumar Naskar and Sivaji Bandyopadhyay. 2005c. Using Bengali Synthesis Rules in an English-Bengali Machine Translation System. In the Proceedings of the Workshop on Morphology-2005, 31st March, 2005, IIT Bombay.

Sudip Kumar Naskar and Sivaji Bandyopadhyay. 2005d. Transliteration of Indian Names for English to Bengali. In the Proceedings of the Platinum Jubilee International Conference of the Linguistic Society of India, University of Hyderabad, December 6-8, 2005.

Sudip Kumar Naskar and Sivaji Bandyopadhyay. 2004. Translation of Verb Phrases from English to Bengali. In the Proceedings of CODIS 2004, Kolkata, India.

Thorsten Brants. 2000. TnT – a statistical part-of-speech tagger. In Proceedings of the 6th Applied NLP Conference, pages 224-231.
PREPOSITIONS IN MALAY: Instrumentality
Zahrah Abd Ghafur
Universiti Kebangsaan Malaysia, Kuala Lumpur
Abstract
This paper examines how the Malay language manifests instrumentality. Two ways are shown to be the case: by introducing the notion with the preposition dengan, and by verbalisation of the instrument itself. The same preposition seems to carry other notions too when translated into English. But, thinking in Malay, there is only one underlying notion, i.e. the prepositional phrase occurs at the same time as the action verb.
Instrumentality

Languages have peculiar ways of introducing instruments. Malay has at least two ways of expressing this notion: one is by using a preposition and the other is by verbalising the instruments themselves.

A. Preposition that introduces objects as instruments to perform the actions in the main sentence:

1. dengan
STRUCTURE 1: The instrument is introduced by the preposition dengan

X      actions      Y            dengan   Objects (instruments)
Dia    memukul      anjing itu   dengan   sebatang kayu.
He     hit          the dog      with     a stick
Dia    menghiasi    rumahnya     dengan   peralatan moden.
She    ornamented   her house    with     modern equipment.
In the above examples, the word menggunakan 'use' can be inserted between the preposition and the instrument. Together, the group dengan menggunakan can be translated as 'using' or 'with the use of'.
X      actions     Y                  dengan + menggunakan      Objects (instruments)
Dia    membuka     sampul surat itu   dengan menggunakan        pembuka surat.
He     opened      the envelope       using / with the use of   a letter opener
Dia    memukul     anjing itu         dengan menggunakan        sebatang kayu.
He     hit         the dog            using / with the use of   a stick
Dia    menghiasi   rumahnya           dengan menggunakan        peralatan moden.
She    decorated   her house          using / with the use of   modern equipment.
These examples show that it is possible to delete 'menggunakan' from the group 'dengan menggunakan'
without losing their instrumental meaning.
STRUCTURE 2: The instrument is one of the arguments of the verb menggunakan 'use' and the action is explicitly expressed after the preposition untuk 'for _ing/to'.

X      menggunakan   Objects (instruments)   untuk   actions    Y
Dia    menggunakan   baji                    untuk   membelah   kayu itu.
He     used          (a) wedge               to      split      the wood
All the structures in 1 can be paraphrased into structure 2 and vice versa:

X + actions + Y + dengan + menggunakan + objects (instruments)
⇔ X + menggunakan + objects (instruments) + untuk + actions + Y
e.g.

X      menggunakan   Objects (instruments)   untuk   actions     Y
Dia    menggunakan   pembuka surat           untuk   membuka     sampul surat itu
He     used          a letter opener         to      open        the envelope
Dia    menggunakan   sebatang kayu           untuk   memukul     anjing itu
He     used          a stick                 to      hit         the dog
Dia    menggunakan   peralatan moden         untuk   menghiasi   rumahnya
She    used          modern equipment        to      decorate    her house
STRUCTURE 3: Verbalisation of a Noun_Instrument: meN- + Noun_Instrument

In Malay most noun instruments can be verbalised by prefixing them with the prefix meN-.

Noun_Instrument        Verb                e.g.
tenggala 'a plough'    menenggala          Ali menenggala tanah sawahnya.
                       'to plough'         'Ali ploughed his padi field.'
gunting 'scissors'     menggunting         Dia menggunting rambutnya.
                       'to cut'            'He has his hair cut.'
komputer 'computer'    mengkomputerkan     Dia mengkomputerkan sistem pentadbiran.
                       'to computerise'    'He computerised the administrative system.'

Menenggala 'to plough using a tenggala'; menggunting 'to cut using a gunting'; mengkomputerkan 'to computerise'.
In these cases, the derived verbs adopt the default usage of the instruments. That probably accounts for the facts that (i) not all instruments can be verbalised in this way, and (ii) instruments which can be verbalised in this way are instruments with a specific use. Thus:

1. *memisau is never derived from pisau 'knife' (pisau has many uses).
2. Menggunting will always mean 'cutting with a pair of scissors'. Membunuh seseorang dengan menggunakan gunting (killed someone with a pair of scissors) can never be alternated with menggunting seseorang.
Other examples:
• Dia membelah kayu dengan kapak. (He split the piece of wood with an axe)
  = Dia mengapak kayu.
• Mereka menusuk kadbod itu dengan gunting. (They pierced the cardboard with scissors)
  * Mereka menggunting kadbod.
• Dia menggunting berita itu dari surat khabar semalam. (He clipped the news item from yesterday's papers)
  = Dia memotong berita itu dengan gunting …
• Orang-orang itu memecahkan lantai dengan menggunakan tukul besi. (The men were breaking up the floor with hammers)
  * menukul lantai
  √ menukul paku (to hammer nails)
• Pada masa dahulu tanaman dituai dengan menggunakan sabit. (In those days the crops were cut by hand with a sickle.)
  √ menyabit tanaman
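The following Python sketch illustrates meN- verbalisation with a simplified version of the usual nasal assimilation rules; real Malay morphophonemics has more cases and exceptions, as the starred examples above show (and loanwords such as komputer keep their initial consonant and may take -kan, as in mengkomputerkan).

    # Minimal sketch of meN- prefixation with simplified nasal assimilation.
    def verbalise(noun):
        first = noun[0]
        if first in "aeiouhgk":      # meng-; an initial k drops
            stem = noun[1:] if first == "k" else noun
            return "meng" + stem
        if first in "bpf":           # mem-; an initial p drops
            stem = noun[1:] if first == "p" else noun
            return "mem" + stem
        if first in "cdjt":          # men-; an initial t drops
            stem = noun[1:] if first == "t" else noun
            return "men" + stem
        if first == "s":             # meny-; the initial s drops
            return "meny" + noun[1:]
        return "me" + noun           # l, m, n, r, w, y, ...

    print(verbalise("gunting"))    # menggunting
    print(verbalise("tenggala"))   # menenggala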
Other than introducing instruments, dengan also introduces other notions.
a. Accompaniment:

STRUCTURE 4: Verb + dengan + NP entity

Verb (intransitive)     dengan          NP Entity
berjalan 'to walk'      dengan 'with'   Ali 'Ali'
bercakap 'talk'         dengan 'to'     dia 'him'
bersaing 'to compete'   dengan 'with'   seseorang 'someone'

Verb (transitive)    Object                               dengan          NP Entity
membuat 'to do'      sesuatu 'something'                  dengan 'with'   Ali 'Ali'
menyanyi 'to sing'   lagu-lagu asli 'traditional songs'   dengan 'with'   kawan-kawannya 'his friends'
membina 'to start'   sebuah keluarga 'a family'           dengan 'with'   seseorang 'someone'
The use of dengan in both these structures may be alternated with bersama-sama 'together with'. In all the cases examined, in these structures the prepositional phrase (PP) accompanies the subject of the sentence and the action verb is capable of having multiple subjects at the same time.

If a verb has a default of having a single subject, the dengan PP will accompany the object of the verb:

Verb (transitive)     Object                   dengan                     NP Entity
menggoreng 'to fry'   ikan 'fish'              dengan 'with'              kunyit 'turmeric'
ternampak 'saw'       seseorang 'someone'      dengan 'with'              kawan-kawannya 'his friends'
membeli 'to buy'      sebuah rumah 'a house'   dengan '(together) with'   tanahnya sekali 'the land'
b. Quality:

STRUCTURE 5: Verb (intransitive) + dengan + NP Quality

Verb (intransitive)      dengan   Quality (adj/adv)
berjalan 'to progress'   dengan   lancar 'smoothly'
bercakap 'talk'          dengan   kuat 'loudly'
bersaing 'to compete'    dengan   adil 'fairly'
bekerja 'to work'        dengan   keras 'hard'

Verb (transitive) + dengan + NP Quality

Verb (transitive)    Object                 dengan   Quality (adj/adv)
menyepak 'to kick'   bola 'the ball'        dengan   cantik 'beautifully'
mengikut 'follow'    peraturan 'the rule'   dengan   berhati-hati 'faithfully'
menutup 'to close'   pintu 'the door'       dengan   kuat 'forcefully'

In these cases the PP modifies the transitive as well as the intransitive verb, forming an adverbial phrase describing quality (stative description). If these qualities are substituted by verb phrases (VPs), the group will refer to manner.
c. Manner:

STRUCTURE 6: Verb (intransitive) + dengan + VP

Verb (intransitive)    dengan         VP
berjalan 'to walk'     dengan 'by'    mengangkat kaki tinggi-tinggi 'lifting the feet high'
menyanyi 'to sing'     dengan 'in'    menggunakan suara tinggi 'a loud voice'
melawan 'to compete'   dengan 'by'    menunjukkan kekuatannya 'exhibiting his strength'

Verb (transitive) + dengan + VP

Verb (transitive)        Object                   dengan        VP
mempelajari 'learning'   sesuatu 'something'      dengan 'by'   membaca buku 'reading (books)'
mengikut 'follow'        peraturan 'the rule'     dengan 'by'   membeli barang-barang tempatan 'buying local products'
menutup 'to cover'       makanan 'the food'       dengan 'by'   meletakkan daun pisang di atasnya 'putting banana leaves over it'
membeli 'to buy'         sebuah rumah 'a house'   dengan 'on'   berhutang 'loan'

Stative verbs can be modified by an NP introduced by dengan.
d. Modifier to stative verbs:

STRUCTURE 7: Stative Verb + dengan + NP

Stative Verb (adj?)          dengan          NP
taat 'faithful'              dengan 'to'     perintah agama 'religious teachings'
tahan 'stand'                dengan          kritikan 'the criticism'
meluat 'pissed off'          dengan 'by'     masalah dalaman 'the internal problems'
senang 'comfortable'         dengan 'with'   dasar itu 'the policy'

e. Complement to certain verbs:

STRUCTURE 8: Verb + dengan + NP

Verb                        dengan          NP
berhubung 'connected'       dengan 'to'     sesuatu 'something'
berseronok 'enjoying'       dengan          keadaan itu 'the event'
tak hadir 'absent'          dengan 'with'   kebenaran 'permission'
ditambah 'adding'           dengan          Vitamin A
diperkuatkan 'reinforced'   dengan 'with'   kalsium 'calcium'
f. To link comparative NPs:

STRUCTURE 9: Comp Prep + NP + dengan + NP

Comp Prep             NP                 dengan         NP
di antara 'between'   A                  dengan 'and'   B
bagaikan 'as'         langit 'the sky'   dengan 'and'   bumi 'the earth'
Conclusion

The use of the same form of the preposition may point in one direction: it is a manifestation of the same idea in the language. It suggests that the prepositional phrase occurs at the same time as the verb it qualifies.