CALICO: Conversational Agent Localization via Synthetic Data Generation

Rosenbaum, Andy; Kharazmi, Pegah; Banijamali, Ershad; Zeng, Lu; DiPersio, Christopher; Wei, Pan; Oz, Gokmen; Chung, Clement; Owczarzak, Karolina; Triefenbach, Fabian; Hamza, Wael

arXiv:2412.05388 (cs)

[Submitted on 6 Dec 2024]

Abstract: We present CALICO, a method to fine-tune Large Language Models (LLMs) to localize conversational agent training data from one language to another. For slots (named entities), CALICO supports three operations: verbatim copy, literal translation, and localization, i.e., generating slot values more appropriate to the target language, such as names of cities and airports located in countries where the language is spoken. Furthermore, we design an iterative filtering mechanism to discard noisy generated samples, which we show boosts the performance of the downstream conversational agent. To demonstrate the effectiveness of CALICO, we build and release a new human-localized (HL) version of the MultiATIS++ travel information test set in 8 languages. We show that our new HL version is more challenging than the original human-translated (HT) version of the test set. We also show that CALICO outperforms the state-of-the-art LINGUIST (which relies on literal slot translation out of context) both in the HT case, where CALICO generates more accurate slot translations, and in the HL case, where CALICO generates localized slots that are closer to the HL test set.
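The three slot operations named in the abstract can be illustrated with a minimal sketch. This is not CALICO's implementation: the paper's method fine-tunes an LLM to generate the target-language text, whereas the sketch below swaps in fixed lookup tables purely to show how the three operations differ. The names `SlotOp`, `apply_slot_op`, `TOY_TRANSLATIONS`, and `TOY_LOCAL_VALUES`, as well as the slot types and example values, are hypothetical and do not come from the paper.

```python
from enum import Enum


class SlotOp(Enum):
    """The three slot operations listed in the abstract."""
    COPY = "copy"            # keep the source value verbatim
    TRANSLATE = "translate"  # literal translation of the source value
    LOCALIZE = "localize"    # substitute a value typical for the target locale


# Hypothetical toy tables standing in for model output (assumption, not from the paper).
TOY_TRANSLATIONS = {("morning", "de"): "Morgen"}
TOY_LOCAL_VALUES = {("city", "de"): "Berlin"}


def apply_slot_op(op: SlotOp, slot_type: str, value: str, target_lang: str) -> str:
    """Return the target-language slot value for one operation.

    Falls back to the unchanged source value when a toy table has no entry.
    """
    if op is SlotOp.COPY:
        return value
    if op is SlotOp.TRANSLATE:
        return TOY_TRANSLATIONS.get((value, target_lang), value)
    if op is SlotOp.LOCALIZE:
        return TOY_LOCAL_VALUES.get((slot_type, target_lang), value)
    raise ValueError(f"unknown slot operation: {op!r}")


if __name__ == "__main__":
    print(apply_slot_op(SlotOp.COPY, "airline_code", "AA", "de"))
    print(apply_slot_op(SlotOp.TRANSLATE, "period_of_day", "morning", "de"))
    print(apply_slot_op(SlotOp.LOCALIZE, "city", "Boston", "de"))
```

In this toy setting, an airline code is copied unchanged, a time-of-day expression is translated literally, and a city name is replaced by one located where the target language is spoken.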

Comments:	Accepted to The 37th International Conference on Neural Information Processing Systems (NeurIPS 2023), SyntheticData4ML Workshop, December 10-16, 2023, New Orleans, United States
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2412.05388 [cs.CL]
	(or arXiv:2412.05388v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2412.05388

Submission history

From: Andy Rosenbaum
[v1] Fri, 6 Dec 2024 19:29:16 UTC (273 KB)
