Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention

Ma, Xutai; Gong, Hongyu; Liu, Danni; Lee, Ann; Tang, Yun; Chen, Peng-Jen; Hsu, Wei-Ning; Koehn, Phillip; Pino, Juan

arXiv:2110.08250 (cs)

[Submitted on 15 Oct 2021 (v1), last revised 12 Jan 2022 (this version, v2)]

Abstract: We present a direct simultaneous speech-to-speech translation (Simul-S2ST) model in which translation is generated without relying on intermediate text representations. Our approach leverages recent progress on direct speech-to-speech translation with discrete units: instead of continuous spectrogram features, the model predicts a sequence of discrete representations learned in an unsupervised manner, which is passed directly to a vocoder for on-the-fly speech synthesis. We also introduce variational monotonic multihead attention (V-MMA) to address the challenge of inefficient policy learning in simultaneous speech translation. The simultaneous policy then operates on source speech features and target discrete units. We carry out empirical studies comparing the cascaded and direct approaches on the Fisher Spanish-English and MuST-C English-Spanish datasets. The direct simultaneous model is shown to outperform the cascaded model by achieving a better tradeoff between translation quality and latency.
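
The read/write behaviour the abstract describes (a policy that looks at the source speech consumed so far and the discrete units emitted so far, then either reads more speech or emits a unit that goes straight to a vocoder) can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the paper's V-MMA: the function `simultaneous_decode`, its parameters, the fixed `threshold`, and the stand-in policy, predictor, and vocoder below are all hypothetical, and the learned monotonic attention is replaced by a hand-written rule.

```python
from typing import Callable, List, Sequence

EOS = -1  # hypothetical end-of-sequence unit id used by this sketch


def simultaneous_decode(
    source_chunks: Sequence[str],
    write_prob: Callable[[int, int], float],
    predict_unit: Callable[[Sequence[str], List[int]], int],
    vocoder: Callable[[int], str],
    max_units: int,
    threshold: float = 0.5,
) -> List[str]:
    """Toy read/write loop: alternate between reading source and emitting units."""
    read = 0                 # number of source chunks consumed so far
    units: List[int] = []    # discrete target units emitted so far
    audio: List[str] = []    # synthesized pieces, produced on the fly
    while len(units) < max_units:
        source_done = read == len(source_chunks)
        # READ while source remains and the policy does not yet want to write.
        # At least one chunk is always read before the first write.
        if not source_done and (read == 0 or write_prob(read, len(units)) < threshold):
            read += 1
            continue
        # WRITE: predict the next unit from the source prefix and unit history.
        unit = predict_unit(source_chunks[:read], units)
        if unit == EOS:
            break
        units.append(unit)
        audio.append(vocoder(unit))  # each unit is synthesized immediately
    return audio


# Stand-ins for illustration only (none of these come from the paper).
def toy_write_prob(read: int, written: int) -> float:
    # Write whenever more chunks have been read than units written.
    return 1.0 if read > written else 0.0


def toy_predict_unit(prefix: Sequence[str], history: List[int]) -> int:
    # Emit one unit per source chunk seen, then stop.
    return len(history) if len(history) < len(prefix) else EOS


def toy_vocoder(unit: int) -> str:
    return f"wav<{unit}>"


demo_audio = simultaneous_decode(
    ["chunk0", "chunk1", "chunk2"],
    toy_write_prob,
    toy_predict_unit,
    toy_vocoder,
    max_units=10,
)
```

With these stand-ins the loop interleaves one read with one write, so output starts after the first chunk rather than after the whole utterance. In the sketch, latency is governed entirely by how early `write_prob` crosses `threshold`, which is the part a learned policy would replace.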

Subjects:	Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2110.08250 [cs.CL]
	(or arXiv:2110.08250v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2110.08250

Submission history

From: Xutai Ma
[v1] Fri, 15 Oct 2021 17:59:15 UTC (271 KB)
[v2] Wed, 12 Jan 2022 22:30:25 UTC (346 KB)
