Introducing corpus-based rules and algorithms in a rule-based machine translation system
Date
28/11/2013
Author
Dugast, Loic
Abstract
Machine translation poses the challenge of automatically translating a text from one
natural language into another. Statistical methods, originating from the field of information
theory, have proven to be a major breakthrough in the field of machine
translation. Prior to this paradigm, many systems had been developed following a
rule-based approach. Such systems rely on a linguistic description of the
languages involved and of how translation occurs in the mind of the (human) translator.
Statistical models, on the contrary, use empirical means and may work with very
few linguistic hypotheses about language and translation as performed by humans. This
had implications for rule-based translation systems, in terms of software architecture
and the nature of the rules, which were manually input and lacked any statistical features.
In view of such diverging paradigms, it is natural to try to combine both
in a hybrid system. In the present work, we start by examining the state of the art of
both rule-based and statistical systems. We restrict the rule-based approach to transfer-based
systems. We compare rule-based and statistical paradigms in terms of global
translation quality and give a qualitative analysis of their respective specific errors. We
also introduce initial black-box hybrid models that confirm there is an expected gain
in combining the two approaches.
Motivated by the qualitative analysis, we focus our study and experiments on lexical
phrasal rules. We propose a setup for extracting such resources from corpora.
Going one step further in the integration of rule-based and statistical approaches, we
then examine how to combine the extracted rules with decoding modules that will allow
for corpus-based handling of ambiguity. This then leads to the final outcome of
this work: a rule-based system for which we can learn non-deterministic rules from
corpora, and whose decoder can be optimised on a tuning set in the same domain.