Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                


Moses
statistical
machine translation
system

Hybrid Translation

Contents

XML Markup

Sometimes we have external knowledge that we want to bring to the decoder. For instance, we might have a better translation system for translating numbers of dates. We would like to plug in these translations to the decoder without changing the model.

The -xml-input flag is used to activate this feature. It can have one of four values:

  • exclusive Only the XML-specified translation is used for the input phrase. Any phrases from the phrase table that overlap with that span are ignored.
  • inclusive The XML-specified translation competes with all the phrase table choices for that span.
  • constraint The XML-specified translation competes with phrase table choices that contain the specified translation.
  • ignore The XML-specified translation is ignored completely.
  • pass-through (default) For backwards compatibility, the XML data is fed straight through to the decoder. This will produce erroneous results if the decoder is fed data that contains XML markup.

The decoder has an XML markup scheme that allows the specification of translations for parts of the sentence. In its simplest form, we can tell the decoder what to use to translate certain words or phrases in the sentence:

 % echo 'das ist <np translation="a cute place">ein kleines haus</np>' \
   | moses -xml-input exclusive -f moses.ini
 this is a cute place

 % echo 'das ist ein kleines <n translation="dwelling">haus</n>' \
   | moses -xml-input exclusive -f moses.ini
 this is a little dwelling

The words have to be surrounded by tags, such as <np...> and </np>. The name of the tags can be chosen freely. The target output is specified in the opening tag as a parameter value for a parameter that is called english for historical reasons (the canonical target language).

We can also provide a probability along with these translation choice. The parameter must be named prob and should contain a single float value. If not present, an XML translation option is given a probability of 1.

 % echo 'das ist ein kleines <n translation="dwelling" prob="0.8">haus</n>' \
   | moses -xml-input exclusive -f moses.ini \
 this is a little dwelling

This probability isn't very useful without letting the decoder have other phrase table entries "compete" with the XML entry, so we switch to inclusive mode. This allows the decoder to use either translations from the model or the specified xml translation:

 % echo 'das ist ein kleines <n translation="dwelling" prob="0.8">haus</n>' \
   | moses -xml-input inclusive -f moses.ini
 this is a small house

The switch -xml-input inclusive gives the decoder a choice between using the specified translations or its own. This choice, again, is ultimately made by the language model, which takes the sentence context into account.

This doesn't change the output from the non-XML sentence because that prob value is first logged, then split evenly among the number of scores present in the phrase table. Additionally, in the toy model used here, we are dealing with a very dumb language model and phrase table. Setting the probability value to something astronomical forces our option to be chosen.

 % echo 'das ist ein kleines <n translation="dwelling" prob="0.8">haus</n>' \
   | moses -xml-input inclusive -f moses.ini
 this is a little dwelling

Multiple translation can be specified if separated by two bars (||):

 % echo 'das ist ein kleines <n translation="dwelling||house" prob="0.8||0.2">haus</n>' \
   | moses -xml-input inclusive -f moses.ini

The XML-input implementation is NOT currently compatible with factored models or confusion networks.

Options:

  • -xml-input ('pass-through' (default), 'inclusive', 'constraint', 'exclusive', 'ignore')

Specifying Reordering Constraints

For various reasons, it may be useful to specify reordering constraints to the decoder, for instance because of punctuation. Consider the sentence:

 I said " This is a good idea . " , and pursued the plan .

The quoted material should be translated as a block, meaning that once we start translating some of the quoted words, we need to finish all of them. We call such a block a zone and allow the specification of such constraints using XML markup.

 I said <zone> " This is a good idea . " </zone> , and pursued the plan .

Another type of constraints are walls which are hard reordering constraints: First all words before a wall have to be translated, before words afterwards are translated. For instance:

 This is the first part . <wall /> This is the second part .

Walls may be specified within zones, where they act as local walls, i.e. they are only valid within the zone.

 I said <zone> " <wall /> This is a good idea . <wall /> " </zone> , and pursued the plan .

If you add such markup to the input, you need to use the option -xml-input with either exclusive or inclusive (there is no difference between these options in this context).

Specifying reordering constraints around punctuation is often a good idea.
The switch -monotone-at-punctuation introduces walls around the punctuation tokens ,.!?:;".

Options:

  • walls and zones have to specified in the input using the tags <zone>, </zone>, and <wall>.
  • -xml-input -- needs to be exclusive or inclusive
  • -monotone-at-punctuation (-mp) -- adds walls around punctuation ,.!?:;".

Fuzzy Match Rule Table for Hierachical Models

Another method of extracting rules from parallel data is described in (Koehn, Senellart, 2010-1 AMTA) and (Koehn, Senellart, 2010-2 AMTA).

To use this extraction method in the decoder, add this to the moses.ini file:

    [feature]
    PhraseDictionaryFuzzyMatch source=<source/path> target=<target/path> alignment=<alignment/path>

It has not yet been integrated into the EMS.

Note: The translation rules generated by this algorith is intended to be used in the chart decoder. It can't be used in the phrase-based decoder.

Placeholder

Placeholders are symbols that replaces a word or phrase. For example, numbers ('42.85') can be replaced with a symbol '@num@'. Other words and phrases that can potentially be replaced with placeholders include dates and time, and named-entities. When passing multiple placeholders to the extract command, separate them with a comma (,)

This is good in training since the sparse numbers are replaced with more numerous placeholders symbols, producing more reliable statistics for the MT models.

The same reason also applies during decoding - the raw number is often an unknown symbol in the phrase-table and language models. Unknown symbols are translated as single words, disabling the advantage of phrasal translation. The reordering of unknown symbols can also be unreliable as we don't have statistics for it.

However, 2 issues arises using placeholder:

    1. Translate the original word or phrase. In the example, '42.85' should be translated. If the language pair is en-fr, then it may be translated as '42,85'.
    2. How do we insert this translation into the output if the word has be replaced with the placeholder.

Moses has support for placeholders in training and decoding.

Training

When preparing your data, process with data with the script

   scripts/generic/ph_numbers.perl -c

The script was designed to run after tokenization, that is, instead of tokenizing like this:

   cat [RAW-DATA] | ./scripts/tokenizer/tokenizer.perl -a -l en > TOK-DATA

do this

   cat [RAW-DATA] | ./scripts/tokenizer/tokenizer.perl -a -l en | scripts/generic/ph_numbers.perl -c > TOK-DATA

Do this for both source and target language, for parallel and monolingual data.

The script will replace numbers with the symbol @num@.

NB. - this script is currently very simple and language independent. It can be improved to create better translations.

During extraction, add the following to the extract command (phrase-based only for now):

   ./extract --Placeholders @num@ ....

This will discard any extracted translation rule which are non-consistent with the placeholders. That is, all placeholders must be aligned to 1-to-1 with a placeholder in the other language.

Decoding

The input sentence must also be processed with the placeholder script to replace numbers with placeholder symbol. However, don't add the -c argument so that the original number will be retained in the output as an XML entry. For example,

    generic $echo  "you owe me $ 100 ." | ./ph_numbers.perl 

will output

   you owe me $ <ne translation="@num@" entity="100">@num@</ne> .

Add this to the decoder command when executing the decoder (phrase-based only for now):

   ./moses  -placeholder-factor 1 -xml-input exclusive

The factor must NOT be one which is being used by the source side of the translation model. For vanilla models, only factor 0 is used.

The argument -xml-input can be any permitted value, except 'pass-through'.

The output from the decoder will contain the number, not the placeholder. The is the case in the best output, and the n-best list.

EMS

The above changes can be added to the EMS config file.

For my (Hieu) experiment, these are the changes I made:

   1. In the [GENERAL section, change
         input-tokenizer = "$misc-script-dir/normalize-punctuation.perl $input-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension"
      to
         input-tokenizer = "$misc-script-dir/normalize-punctuation.perl $input-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension | $moses-script-dir/generic/ph_numbers.perl -c"

      and change
         output-tokenizer = "$misc-script-dir/normalize-punctuation.perl $output-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension | $moses-script-dir/generic/ph_numbers.perl -c"
      to
        output-tokenizer = "$misc-script-dir/normalize-punctuation.perl $output-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension | $moses-script-dir/generic/ph_numbers.perl -c" 

   2. In the [TRAINING] section, add
        extract-settings = "--Placeholders @num@"

   3. In the [TUNING] section, change
        decoder-settings = "-threads 8"
      to
        decoder-settings = "-threads 8 -placeholder-factor 1 -xml-input exclusive"
      And in the [EVALUATION] section, change
        decoder-settings = "-mbr -mp -search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000 -threads 8"
      to
         decoder-settings = "-mbr -mp -search-algorithm 1 -cube-pruning-pop-limit 5000 -s 5000 -threads 8 -placeholder-factor 1 -xml-input exclusive"

   4. In the [EVALUATION] section, add
        input-tokenizer = "$misc-script-dir/normalize-punctuation.perl $input-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $input-extension | $moses-script-dir/generic/ph_numbers.perl"
        output-tokenizer = "$misc-script-dir/normalize-punctuation.perl $output-extension | $moses-script-dir/tokenizer/tokenizer.perl -a -l $output-extension

"

Results

This was tested on some experiments, trained with Europarl data. It didn't have a positive effect on BLEU score, even reducing it slightly.

However, it may still be helpful to users who translate text with lots of numbers or dates etc. Also, the recognizer script could be improved.

   en-es:
         baseline: 24.59.
         with placeholder: 24.68
   es-en:
         baseline: 23.00
         with placeholder: 22.84
   en-cs:
         baseline: 11.05
         with placeholder: 10.62
    cs-en:
         baseline: 15.80
         with placeholder: 15.62
Page last modified on December 23, 2017, at 10:26 PM