2023
Findings of the WMT 2023 Shared Task on Parallel Data Curation
Steve Sloto | Brian Thompson | Huda Khayrallah | Tobias Domhan | Thamme Gowda | Philipp Koehn
Proceedings of the Eighth Conference on Machine Translation
Building upon prior WMT shared tasks in document alignment and sentence filtering, we posed the open-ended shared task of finding the best subset of possible training data from a collection of Estonian-Lithuanian web data. Participants could focus on any portion of the end-to-end data curation pipeline, including alignment and filtering. We evaluated results based on downstream machine translation quality. We release processed Common Crawl data, along with various intermediate states from a strong baseline system, which we believe will enable future research on this topic.
2019
Dual Monolingual Cross-Entropy Delta Filtering of Noisy Parallel Data
Amittai Axelrod | Anish Kumar | Steve Sloto
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
We introduce a purely monolingual approach to filtering for parallel data from a noisy corpus in a low-resource scenario. Our work is inspired by Junczys-Dowmunt (2018), but we relax the requirements to allow for cases where no parallel data is available. Our primary contribution is a dual monolingual cross-entropy delta criterion, modified from Cynical data selection (Axelrod, 2017), which is competitive (within 1.8 BLEU) with the best bilingual filtering method when used to train SMT systems. Our approach is featherweight and runs end-to-end on a standard laptop in three hours.
2018
Leveraging Data Resources for Cross-Linguistic Information Retrieval Using Statistical Machine Translation
Steve Sloto | Ann Clifton | Greg Hanneman | Patrick Porter | Donna Gates | Almut Hildebrand | Anish Kumar
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)