INTRODUCTION
1.1 GENERAL
Machine Translation is the automatic translation of text from one natural language to
another using a computer. Initial attempts at Machine Translation in the 1950s did not
meet with success. Today, internet users need fast automatic translation between
languages. Several approaches, such as linguistic-based and interlingua-based systems,
have been used to develop machine translation systems, but statistical methods currently
dominate the field. The Statistical Machine Translation (SMT) approach draws on
automata theory, artificial intelligence, data structures and statistics. An SMT system
treats translation as a machine learning problem: a learning algorithm is applied to a
large parallel corpus. A parallel corpus consists of sentences in one language along
with their translations in another. The learning algorithm creates a model from the
parallel sentences and, using this model, unseen sentences are translated. If parallel
corpora are available for a language pair, it is easy to build a bilingual SMT system.
The accuracy of the system depends heavily on the quality, quantity and domain of the
parallel corpus. Parallel corpora are the fundamental resource for an SMT system, and
the available corpora are constantly growing. They can be obtained from bilingual
government textbooks, newspapers, websites and novels.
SMT models give good accuracy for some language pairs, particularly for similar
languages in specific domains or for languages with large bilingual corpora. If the
sentences of a language pair are not structurally similar, the translation patterns are
difficult to learn. Huge amounts of parallel data are required to learn such patterns,
so statistical methods are difficult to apply to “less resourced” languages. To enhance
the translation performance for dissimilar language pairs and less-resourced languages,
external preprocessing is required. This preprocessing is performed using linguistic
tools.
In an SMT system, statistical methods are used to map source language phrases into
target language phrases. The parameters of the statistical models are estimated from
bilingual and monolingual corpora. An SMT system has two models: the translation model
and the language model. The translation model takes parallel sentences and finds
translation hypotheses between phrases. The language model is based on the statistical
properties of n-grams and is estimated from monolingual corpora.
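As a rough illustration of how a language model can be estimated from the n-gram
statistics of a monolingual corpus, the following Python sketch builds a bigram model by
maximum likelihood. The function names, the toy corpus and the sentence boundary markers
are assumptions made for this example. It is not the language model used in this thesis,
and a practical system would add smoothing for unseen n-grams.

```python
from collections import Counter
from typing import Iterable, Tuple


def train_bigram(sentences: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count context words and bigrams over whitespace-tokenized sentences."""
    contexts: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in sentences:
        # <s> and </s> are hypothetical boundary markers for this sketch.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])              # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))   # adjacent word pairs
    return contexts, bigrams


def bigram_prob(contexts: Counter, bigrams: Counter, prev: str, word: str) -> float:
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]


if __name__ == "__main__":
    toy_corpus = ["the cat sat", "the dog sat", "the cat ran"]
    contexts, bigrams = train_bigram(toy_corpus)
    print(bigram_prob(contexts, bigrams, "the", "cat"))
```

In this toy corpus "the" is followed by "cat" in two of its three occurrences, so the
estimate for that bigram is 2/3.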
Several translation models are available for SMT systems. Important ones include the
phrase-based model, the syntax-based model and the factored model. Phrase-Based
Statistical Machine Translation (PBSMT) is limited to the mapping of small text
chunks. The factored translation model is an extension of the phrase-based model that
integrates linguistic information at the word level. This thesis proposes a
pre-processing method that uses linguistic tools in the development of an English to
Tamil machine translation system. In this translation system, external linguistic tools
are used to add linguistic information to the parallel corpora. The pre-processing and
post-processing methodology proposed in this thesis is applicable to other language
pairs as well.
Machine translation is one of the oldest and most active areas in natural
language processing. The word ‘translation’ refers to the transformation of text or
speech from one language into another. Machine translation can be defined as the
application of computers to the task of translating texts from one natural language to
another. It is a focused field of research that draws on the linguistic concepts of
syntax, semantics, pragmatics and discourse.
Today a number of systems are available for producing translations, though they
are not perfect. Whether translation is carried out manually or automated through
machines, the context of the source language text must be conveyed exactly in the
target language. Translation is not just word-level replacement. A translator, whether
machine or human, must interpret and analyse all the elements in the text. The
translator should also be familiar with the issues that arise during the translation
process and must know how to handle them. This requires in-depth knowledge of grammar,
sentence structure, meanings, etc., and also an understanding of each language’s
culture in order to handle idioms and phrases that originate from different cultures.
Cross-cultural understanding is an important issue that affects the accuracy of the
translation.
Designing an automatic machine translation system is a great challenge. It is difficult
to translate sentences while taking into consideration all the required information.
Humans need several revisions to produce a perfect translation, and no two human
translators generate identical translations of the same text in the same language pair.
Hence it is an even greater challenge to design a fully automated machine translation
system that produces high-quality translations.
Machine Translation is used for translating texts for assimilation purposes, which
aids bilingual or cross-lingual communication, and also for searching, accessing and
understanding foreign language information from databases and web pages [3]. In the
field of information retrieval, a lot of research is going on in Cross-Language
Information Retrieval (CLIR), i.e. information retrieval systems capable of searching
databases in many different languages [4].
good automatic translation system, students can improve their translation and writing
skills. Such a system can break the language barriers faced by students and language
learners. Traditionally, rule-based approaches have been used to develop machine
translation systems. The rule-based approach feeds rules into the machine using
appropriate representations, but feeding all linguistic knowledge into a machine is
very hard. In this context, the statistical approach to Machine Translation has some
attractive qualities that have made it the preferred approach in machine translation
research over the past two decades. Statistical translation models learn translation
patterns directly from data and generalize them to translate new text. The SMT approach
is largely language-independent, i.e. the models can be applied to any language pair.
Systems based on statistical methods have several practical advantages over traditional
rule-based systems. In SMT, implementation and development times are much shorter. An
SMT system can be improved by coupling in new models for reordering and decoding. It
needs only parallel corpora to learn from in order to generate a translation system. In
contrast, a rule-based system needs transfer rules that only linguistic experts can
write. These rules depend entirely on the language pair involved, and defining general
“transfer rules” is not an easy task, especially for languages with different
structures [5].
An SMT system can be developed rapidly if an appropriate corpus is available. A Rule
Based Machine Translation (RBMT) system incurs substantial development and
customization costs before it reaches the desired quality threshold. Packaged RBMT
systems have already been developed, and it is extremely difficult to reprogram their
models and equivalences. Above all, RBMT involves a much longer process and more human
resources. An RBMT system is retrained by adding new rules and vocabulary, among other
things [5].
nowadays thanks to the wider availability of more powerful computers. RBMT requires
a longer deployment and compilation time by experts so that, in principle, building
costs are also higher. SMT generates statistical patterns automatically and learns
exceptions to rules well. The transfer rules of RBMT systems can be seen as special
cases of statistical patterns; nevertheless, they generalize too much and cannot handle
exceptions. Finally, SMT systems can be upgraded with syntactic and even semantic
information, like RBMT systems. An SMT engine can generate improved translations if it
is retrained or adapted. In contrast, an RBMT system generates very similar
translations after retraining [5].
SMT systems, in general, have trouble handling morphology on the source or the target
side, especially for morphologically rich languages. Errors in morphology can have
severe consequences for the meaning of the sentence: they can change the grammatical
function of words, or alter the interpretation of the sentence through a wrong verb
tense. Factored translation models try to solve this issue by explicitly handling
morphology on the generation side.
1.5 MOTIVATION OF THE THESIS
Machine translation (MT) is the application of computers to the task of translating texts
from one natural language to another. Even though machine translation was envisioned
as a computer application in the 1950s, it is still considered an open problem [3].
In such a situation, there is a big market for translation between English and the
various Indian languages. Currently, this translation is done manually, and the use of
automation is largely restricted to word processing. Two specific examples of
high-volume manual translation are the translation of news from English into local
languages and the translation of annual reports of government departments and public
sector units among English, Hindi and the local language. Many resources in English,
such as news, weather reports and books, are manually translated into Indian languages.
Of these, news and weather reports from around the world are the ones most often
translated by human translators. Human translation is slower and costlier than machine
translation, so there is a large market for machine translation from English into
Indian languages.
Tamil, a Dravidian language, is spoken by around 72 million people and has official
status in the state of Tamil Nadu and the Indian union territory of Puducherry.
Tamil is also an official language of Sri Lanka and Singapore, and is spoken
by significant minorities in Malaysia and Mauritius as well as by emigrant communities
around the world. It is one of the 22 scheduled languages of India and was declared a
classical language by the Government of India in 2004 [9].
• Develop a pre-processing module (Reordering, Compounding and
Factorization) for English language sentences that transforms their structure
to be more similar to that of Tamil.
The pre-processing module for the source language includes three stages:
reordering, factorization and compounding. In the reordering stage, the source
language sentence will be syntactically reordered according to Tamil syntax.
After reordering, the English words will be factored into lemma and other
morphological features. This will be followed by the compounding process, in which
the function words are removed from the reordered sentence and attached
as morphological factors to the corresponding content words.
A Tamil POS tagger will be developed using a Support Vector Machine (SVM)
based machine learning tool. A POS-annotated corpus will be created for training the
automatic tagger.
• Develop a Tamil Morphological Generator system to generate Tamil surface
word forms.
Parallel corpora are used to train the statistical translation models. The parallel
corpora are created and converted into factored parallel corpora by preprocessing.
English sentences are factored using the Stanford Parser, and Tamil sentences are
factored using the Tamil POS tagger and morphological analyzer. A monolingual corpus is
collected from various newspapers and factored using the Tamil linguistic tools. This
monolingual corpus is used to build the language model. Finally, in post-processing,
the Tamil morphological generator is used to generate surface words from the output
factors.
Figure 1.1 Morphology based Factored SMT for English to Tamil language
A Machine Translation system for a language pair with disparate morphological structure
needs appropriate pre-processing or modeling before translation. The preprocessing can
be performed on the raw source language sentence to make it more appropriate for
translating into the target language sentence. The pre-processing module for English
language sentences consists of reordering, factorization and compounding.
Reordering rules are handcrafted based on the syntactic word order differences between
English and Tamil. In total, 180 reordering rules were created based on the sentence
structures of the two languages. Reordering significantly improves the performance of
the Machine Translation system. A lexicalized distortion reordering model is
implemented in the Moses toolkit [180], but this automatic reordering works well only
over short ranges. Therefore an external tool or component is needed to handle
long-distance reordering. Reordering is also a way of indirectly integrating syntactic
information into the source language. 80% of English sentences are reordered correctly
by the rules developed. An example of English reordering is given in Figure 1.2.
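To make the idea of syntactic reordering concrete, the following Python sketch applies
two toy rules to a hand-written parse tree: the children of a verb phrase and of a
prepositional phrase are reversed, which moves the verb to the end and turns a
preposition into a postposition. The tree encoding, the rule set and the example
sentence are assumptions made for illustration only; they are not taken from the 180
handcrafted rules of this thesis.

```python
from typing import List, Tuple, Union

# A tree is either a word (str) or a tuple (label, child, child, ...).
Tree = Union[str, Tuple]

# Hypothetical rule set: phrase labels whose children are reversed.
REVERSE_CHILDREN = {"VP", "PP"}


def reorder(tree: Tree) -> Tree:
    """Recursively reorder a parse tree towards verb-final, postpositional order."""
    if isinstance(tree, str):
        return tree
    label, *children = tree
    children = [reorder(child) for child in children]
    if label in REVERSE_CHILDREN:
        children.reverse()
    return (label, *children)


def leaves(tree: Tree) -> List[str]:
    """Collect the words of a tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    words: List[str] = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


if __name__ == "__main__":
    # Hand-built parse of "I bought vegetables to my home" (illustrative only).
    parse = ("S",
             ("NP", "I"),
             ("VP",
              ("V", "bought"),
              ("NP", "vegetables"),
              ("PP", ("TO", "to"), ("NP", "my", "home"))))
    print(" ".join(leaves(reorder(parse))))
    # prints: I my home to vegetables bought
```

Because the rules operate on whole phrases rather than on adjacent words, a constituent
can be moved across an arbitrary number of words, which is the kind of long-distance
movement a distortion model with a limited window handles poorly.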
Factored models can be used for morphologically rich languages in order to reduce
the amount of bilingual data required. Factorization refers to splitting a word into
linguistic factors and integrating them as a vector. The Stanford Parser is used to
parse the English sentences. From the parse tree, linguistic information such as lemma,
part-of-speech tags, syntactic information and dependency information is retrieved.
This linguistic information is integrated as factors with the original word.
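A minimal Python sketch of the factored representation is given below. It assumes that
the lemma and part-of-speech tag of each word are already available (for example from a
parser) and simply joins them to the surface word with the "|" separator used for
factored input in Moses. The function names and the choice of three factors are
assumptions for this example, not the exact factor set used in this thesis.

```python
from typing import Iterable, Sequence

FACTOR_SEPARATOR = "|"


def factor_token(factors: Sequence[str]) -> str:
    """Join the factors of one word, e.g. (surface, lemma, POS), into one token."""
    for factor in factors:
        # A factor must not contain the separator or whitespace,
        # otherwise the token could not be split back unambiguously.
        if FACTOR_SEPARATOR in factor or any(ch.isspace() for ch in factor) or not factor:
            raise ValueError(f"invalid factor: {factor!r}")
    return FACTOR_SEPARATOR.join(factors)


def factor_sentence(annotated_words: Iterable[Sequence[str]]) -> str:
    """Turn a sequence of per-word factor tuples into one factored sentence line."""
    return " ".join(factor_token(factors) for factors in annotated_words)


if __name__ == "__main__":
    # Hypothetical annotations: (surface word, lemma, part-of-speech tag).
    annotated = [("I", "i", "PRP"), ("bought", "buy", "VBD"), ("vegetables", "vegetable", "NNS")]
    print(factor_sentence(annotated))
    # prints: I|i|PRP bought|buy|VBD vegetables|vegetable|NNS
```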
morphological structure of Tamil language sentence. In the compounding phase, the
function words are identified in the English factored corpora using dependency
information. Once identified, they are removed from the factored sentence and attached
as a morphological factor to the corresponding content word. The compounding process
reduces the length of the English sentence. Like function words, auxiliary verbs and
modal verbs are also removed and attached as a morphological factor to the source
language word. The morphological representation of the English sentence then becomes
similar to that of the Tamil sentence. This compounding step indirectly integrates
dependency information into the source language factors. Table 1.1 and Table 1.2 show
the factored and compounded sentences respectively.
Table 1.1 (factored sentence)
I          | i         | PN | prn
my         | my        | PN | PRP$
home       | home      | N  | NN
to         | to        | TO | TO
vegetables | vegetable | N  | NNS
bought     | buy       | V  | VBD .

Table 1.2 (compounded sentence)
I          | i         | PN | prn_i
my         | my        | PN | PRP$
home       | home      | N  | NN_to
vegetables | vegetable | N  | NNS
bought     | buy       | V  | VBD_1S .
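The function-word part of the compounding step can be sketched in Python as follows.
The sketch makes a simplifying assumption that is not from this thesis: instead of
using dependency information, it attaches each function word to the content word
immediately before it in the reordered sentence. The `Token` class, the `FUNCTION_POS`
set and the `compound` function are hypothetical names, and the subject agreement
markers seen in Table 1.2 (such as `prn_i` and `VBD_1S`) are not reproduced.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Token:
    word: str
    lemma: str
    pos: str  # coarse part-of-speech factor
    tag: str  # fine-grained tag factor that receives the compound information


# Assumption: coarse POS labels treated as function words in this sketch.
FUNCTION_POS = {"TO"}


def compound(tokens: List[Token]) -> List[Token]:
    """Drop function words and append their lemma to the preceding token's tag."""
    result: List[Token] = []
    for token in tokens:
        if token.pos in FUNCTION_POS and result:
            head = result[-1]
            # Attach the function word as a suffix of the head's tag, e.g. NN -> NN_to.
            result[-1] = replace(head, tag=f"{head.tag}_{token.lemma}")
        else:
            result.append(token)
    return result


if __name__ == "__main__":
    factored = [
        Token("I", "i", "PN", "prn"),
        Token("my", "my", "PN", "PRP$"),
        Token("home", "home", "N", "NN"),
        Token("to", "to", "TO", "TO"),
        Token("vegetables", "vegetable", "N", "NNS"),
        Token("bought", "buy", "V", "VBD"),
    ]
    for token in compound(factored):
        print(f"{token.word} | {token.lemma} | {token.pos} | {token.tag}")
```

On the factored sentence of Table 1.1 this removes "to" and changes the tag of "home"
from NN to NN_to, so the sentence shrinks from six tokens to five.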
morphological analyzer. The morphological analyzer splits a word into its lemma and
morphological information. The parallel corpora as well as the monolingual corpora are
preprocessed in this stage.
POS tagging means labeling grammatical classes, i.e. assigning a part-of-speech tag
to every word of a given sentence. Tamil sentences are POS tagged using the
Tamil POS Tagger tool. This tagger was developed using SVMTool [12], a Support Vector
Machine (SVM) based machine learning tool, which makes the task simple and
efficient. In this method, a POS-tagged corpus is created and used to train a model;
SVMTool builds models from tagged sentences, and untagged sentences are then tagged
using those models. 42k sentences (approx. 5 lakh words) were tagged for this
Part-of-Speech tagger with the help of an eminent Tamil linguist. The experiments were
conducted with this tagged corpus. An overall accuracy of 94.6% was obtained on the
test set, which contains 6K sentences (approx. 35 thousand words).
After POS tagging, the sentences in the corpora are morphologically analyzed to
find the lemma and morphological information. A morphological analyzer is a
software tool used to segment a word into meaningful units. Morphological analysis
of Tamil is a complex process because of the language's “morphologically rich” nature.
Generally, rule-based approaches are used to develop morphological analyzers, but for a
morphologically rich language like Tamil, the creation of rules is a challenging task.
Here a novel machine learning based approach is proposed and implemented for a Tamil
verb and noun morphological analyzer. Additionally, this approach is tested on
languages such as Malayalam, Telugu and Kannada.
assigning grammatical classes to each morpheme. The SVM based tool was used for
training the data. This tool segments each word into its lemma and morphological
information.
“Minimized-POS” and “Compound-Tag” factors of English word are aligned to the
“Morphological information” factor of the Tamil word. Importantly, new Tamil surface
words are not generated in the SMT decoder. Only factors are generated by the SMT
system, and the surface word is generated in the post-processing stage: the Tamil
morphological generator is used to generate a Tamil surface word from the output
factors. The system is evaluated with different sentence patterns such as simple,
continuous and modal auxiliaries; for these types, 85% of the sentences are translated
correctly. For other sentence types, the performance is 60%. The prototype machine
translation system that was developed properly handles noun-verb agreement, which is an
essential requirement for translating into morphologically rich languages like Tamil.
BLEU and NIST evaluation scores clearly show that the factored model with integrated
linguistic knowledge gives better results for the English to Tamil Statistical Machine
Translation system.
lemma and word-class as input and gives the lemma’s paradigm number and the word’s
stem as output. This paradigm number is referred to as the column index. The paradigm
number provides information about all the possible inflected forms of a lemma in a
particular word class. The second module takes morpho-lexical information as input and
gives its index number as output: from the complete list of morpho-lexical information,
the index number of the corresponding input factor is identified, and this is referred
to as the row index. In the third module, a two-dimensional suffix table is used to
look up the suffix using the row index and column index. Finally, the identified
suffix is attached to the stem to create the word form. For pronouns, a pattern
matching approach is followed to generate the word form.
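The three-module lookup described above can be sketched in Python as follows. To avoid
inventing Tamil data, the tables hold a few English verb forms as stand-ins; the
paradigm assignments, the morpho-lexical labels and the suffix table are all
illustrative assumptions, and the pronoun handling by pattern matching is omitted.

```python
from typing import Dict, List, Tuple

# Module 1 data: (lemma, word class) -> (paradigm number, stem).
# The paradigm number serves as the column index of the suffix table.
PARADIGMS: Dict[Tuple[str, str], Tuple[int, str]] = {
    ("walk", "V"): (0, "walk"),
    ("bake", "V"): (1, "bak"),
}

# Module 2 data: ordered list of morpho-lexical labels.
# The position of a label in this list is the row index of the suffix table.
MORPH_INFO: List[str] = ["PAST", "PROG", "3SG"]

# Module 3 data: SUFFIX_TABLE[row][column] is the suffix for that label and paradigm.
SUFFIX_TABLE: List[List[str]] = [
    ["ed", "ed"],    # PAST
    ["ing", "ing"],  # PROG
    ["s", "es"],     # 3SG
]


def generate(lemma: str, word_class: str, morph_info: str) -> str:
    """Generate a surface form as stem + suffix via row and column lookup."""
    column, stem = PARADIGMS[(lemma, word_class)]  # module 1
    row = MORPH_INFO.index(morph_info)             # module 2
    suffix = SUFFIX_TABLE[row][column]             # module 3
    return stem + suffix


if __name__ == "__main__":
    print(generate("walk", "V", "PROG"))  # prints: walking
    print(generate("bake", "V", "3SG"))   # prints: bakes
```

With this layout, supporting a new lemma only requires assigning it to an existing
paradigm and giving its stem; the suffix table itself is shared by all lemmas of that
paradigm.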
This thesis shows how preprocessing and post-processing can be used to improve
statistical machine translation from English to Tamil. The main focus of this
research is translation from English into Tamil, together with the development
of linguistic tools for Tamil. The contributions are:
• Introduced a novel pre-processing method for English sentences which is
based on reordering and compounding. Reordering rearranges the English
sentence structure according to that of the Tamil sentence. Compounding removes
the function words and auxiliaries and merges them into the morphological factor
of the content word. This pre-processing reorganizes the English sentence
structure according to the structure of the Tamil sentence.
• Created a Tamil POS Tagger and a tagged corpus of 5 lakh words, which
is used in pre-processing Tamil language sentences.
• Introduced a novel method for developing a Tamil morphological analyser
based on a machine learning approach. The corpora developed for this
approach contain 4 lakh morphologically segmented Tamil verbs and 2 lakh
Tamil nouns.
• Introduced a novel algorithm for developing a Tamil morphological generator
using paradigms and suffixes. Using this generator, it is possible to
generate 10 thousand distinct word forms of a single Tamil verb.
• Successfully integrated these pre-processing and post-processing modules and
developed an English to Tamil factored SMT system.
1.9 ORGANIZATION OF THE THESIS
This thesis is divided into ten chapters. Figure 1.4 shows the Organization of the thesis.
[Figure 1.4 Organization of the thesis. Labels recoverable from the figure: Chapter 1
Introduction; Chapter 3 Background; Chapter 4 Preprocessing English language;
Preprocessing Tamil language (Chapter 5 POS tagger for Tamil, Chapter 6 Morph analyzer
for Tamil); Chapter 8 Morphological generator for Tamil; Chapter 9 Experiments and
results; Chapter 10 Conclusion.]
This thesis is organized as follows. A general introduction is presented in Chapter 1.
Chapter 2 presents the literature survey on linguistic tools and available Machine
Translation systems for Indian languages. Chapter 3 describes the theoretical background
and language processing for Tamil. Chapter 4 covers the stages of preprocessing English
language sentences: reordering, factorization and compounding. Chapters 5 and 6 present
the preprocessing of Tamil sentences using linguistic tools. Chapter 5 explains the
development of the Tamil POS tagger, and Chapter 6 describes the Morphological Analyzer
for Tamil, which is developed using the new machine learning based approach; the method
and data resources are described in detail. Chapter 7 presents the Factored SMT system
for English to Tamil and explains how the factored corpora are trained and decoded
using the SMT toolkit. Post-processing for Tamil is discussed in Chapter 8, where the
morphological generator is used as the post-processing tool. This chapter also
describes in detail the new algorithm developed for the Tamil morphological generator.
Chapter 9 explains the experiments and results of the English to Tamil Statistical
Machine Translation system. It also describes the training and testing details of the
SMT toolkit. The output of the developed system is evaluated using the BLEU and NIST
metrics. Finally, Chapter 10 concludes the thesis and outlines future directions for
this research.