DOI: 10.5555/1620853.1620881

A simple sentence-level extraction algorithm for comparable data

Published: 31 May 2009

Abstract

The paper presents a novel sentence pair extraction algorithm for comparable data, in which a large set of candidate sentence pairs is scored directly at the sentence level. The extraction relies on a very efficient implementation of a simple symmetric scoring function; a computation speed-up by a factor of 30 is reported. On Spanish-English data, the algorithm finds the highest-scoring sentence pairs from close to 1 trillion candidate pairs without search errors. Significant improvements in BLEU are reported when the extracted sentence pairs are included in the training of a phrase-based statistical machine translation (SMT) system.
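
To make the idea of sentence-level scoring concrete, the following is a minimal sketch, not the paper's implementation. The abstract does not define the scoring function, so the sketch assumes a lexical, Model-1 style score that is applied in both translation directions and summed. It also scans all candidate pairs exhaustively instead of reproducing the efficient implementation the abstract mentions. All names (`directional_score`, `symmetric_score`, `top_pairs`) and the toy lexicon are hypothetical.

```python
import heapq
import math
from itertools import product


def directional_score(src, tgt, lex, floor=1e-6):
    """Length-normalised log-score of tgt given src (assumed Model-1 style).

    Sentences are non-empty lists of tokens; lex maps (src_word, tgt_word)
    to a translation probability.
    """
    total = 0.0
    for t in tgt:
        # Mean translation probability of t over all words of src.
        p = sum(lex.get((s, t), 0.0) for s in src) / len(src)
        total += math.log(max(p, floor))  # floor avoids log(0)
    return total / len(tgt)


def symmetric_score(src, tgt, lex_st, lex_ts):
    """Sum of both translation directions, so neither language is favoured."""
    return directional_score(src, tgt, lex_st) + directional_score(tgt, src, lex_ts)


def top_pairs(src_sents, tgt_sents, lex_st, lex_ts, k):
    """Score every candidate pair and keep the k best as (score, i, j)."""
    scored = (
        (symmetric_score(s, t, lex_st, lex_ts), i, j)
        for (i, s), (j, t) in product(enumerate(src_sents), enumerate(tgt_sents))
    )
    return heapq.nlargest(k, scored)


# Toy data (hypothetical): two Spanish and two English sentences.
spanish = [["la", "casa"], ["el", "perro"]]
english = [["the", "dog"], ["the", "house"]]
lex_st = {("la", "the"): 0.9, ("casa", "house"): 0.9,
          ("el", "the"): 0.9, ("perro", "dog"): 0.9}
lex_ts = {(t, s): p for (s, t), p in lex_st.items()}  # reverse direction

best = top_pairs(spanish, english, lex_st, lex_ts, k=2)
```

On this toy input the two retained pairs are the mutual translations. The exhaustive scan is quadratic in the number of sentences, which is why an efficient implementation matters at the scale of candidate pairs reported in the abstract.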

Published In

NAACL-Short '09: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers
May 2009
317 pages

Publisher

Association for Computational Linguistics

United States

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 21 of 29 submissions, 72%

Article Metrics

  • Downloads (last 12 months): 22
  • Downloads (last 6 weeks): 5

Download counts reflect downloads up to 06 Oct 2024.

Cited By

  • (2012) Twitter translation using translation-based cross-lingual retrieval. Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 410-421. DOI: 10.5555/2393015.2393074. Online publication date: 7-Jun-2012
  • (2011) Extracting parallel phrases from comparable data. Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pages 61-68. DOI: 10.5555/2024236.2024248. Online publication date: 24-Jun-2011
  • (2011) Two ways to use a noisy parallel news corpus for improving statistical machine translation. Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pages 44-51. DOI: 10.5555/2024236.2024246. Online publication date: 24-Jun-2011
  • (2010) Extracting parallel sentences from comparable corpora using document level alignment. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 403-411. DOI: 10.5555/1857999.1858062. Online publication date: 2-Jun-2010
  • (2009) A beam-search extraction algorithm for comparable data. Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 225-228. DOI: 10.5555/1667583.1667653. Online publication date: 4-Aug-2009
