Findings of the WMT 2023 Shared Task on Parallel Data Curation
Steve Sloto
Brian Thompson
Huda Khayrallah
Tobias Domhan
Thamme Gowda
Philipp Koehn
Proceedings of the Eighth Conference on Machine Translation
Building upon prior WMT shared tasks in document alignment and sentence filtering, we posed the open-ended shared task of finding the best subset of possible training data from a collection of Estonian-Lithuanian web data. Participants could focus on any portion of the end-to-end data curation pipeline, including alignment and filtering. We evaluated results based on downstream machine translation quality. We release processed Common Crawl data, along with various intermediate states from a strong baseline system, which we believe will enable future research on this topic.
Dual Monolingual Cross-Entropy Delta Filtering of Noisy Parallel Data
Amittai Axelrod
Anish Kumar
Steve Sloto
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
We introduce a purely monolingual approach to filtering for parallel data from a noisy corpus in a low-resource scenario. Our work is inspired by Junczys-Dowmunt (2018), but we relax the requirements to allow for cases where no parallel data is available. Our primary contribution is a dual monolingual cross-entropy delta criterion, modified from Cynical data selection (Axelrod, 2017), that is competitive (within 1.8 BLEU) with the best bilingual filtering method when used to train SMT systems. Our approach is featherweight and runs end-to-end on a standard laptop in three hours.
Leveraging Data Resources for Cross-Linguistic Information Retrieval Using Statistical Machine Translation
Steve Sloto
Ann Clifton
Greg Hanneman
Patrick Porter
Donna Gates
Almut Hildebrand
Anish Kumar
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 2: User Track)