arXiv Analytics

arXiv:2010.14920 [cs.CL]

Bridging the Modality Gap for Speech-to-Text Translation

Yuchen Liu, Junnan Zhu, Jiajun Zhang, Chengqing Zong

Published 2020-10-28 (Version 1)

End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner. Most existing methods employ an encoder-decoder structure with a single encoder that must learn acoustic representation and semantic information simultaneously; this ignores the modality differences between speech and text and overloads the encoder, making the model difficult to train. To address these issues, we propose a Speech-to-Text Adaptation for Speech Translation (STAST) model, which improves end-to-end performance by bridging the modality gap between speech and text. Specifically, we decouple the speech translation encoder into three parts and introduce a shrink mechanism to match the length of the speech representation to that of the corresponding text transcription. To obtain a better semantic representation, we fully integrate a text-based translation model into STAST so that the two tasks are trained in the same latent space. Furthermore, we introduce a cross-modal adaptation method to close the distance between speech and text representations. Experimental results on English-French and English-German speech translation corpora show that our model significantly outperforms strong baselines and achieves new state-of-the-art performance.
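The abstract's two key mechanisms, shrinking the speech representation to roughly transcription length and pulling speech and text representations together, can be sketched concretely. Below is a minimal PyTorch-style sketch. It assumes a CTC classifier over the acoustic encoder output for the shrink step and a simple mean-pooled L2 distance for the adaptation term; the function names (`shrink_by_ctc`, `adaptation_loss`) and both specific choices are illustrative assumptions, not necessarily the paper's exact method.

```python
# Minimal sketch of the shrink mechanism and cross-modal adaptation
# described in the abstract. All names and design choices here are
# illustrative assumptions; the paper's exact architecture may differ.
import torch
import torch.nn.functional as F


def shrink_by_ctc(acoustic_states, ctc_logits, blank_id=0):
    """Shrink frame-level acoustic states toward transcription length.

    acoustic_states: (T, d) frame-level encoder outputs.
    ctc_logits:      (T, V) per-frame CTC classification logits.

    Frames predicted as blank are dropped, and consecutive frames with
    the same non-blank prediction are merged by averaging, so the output
    length approaches the number of tokens in the text transcription.
    """
    preds = ctc_logits.argmax(dim=-1)  # (T,) per-frame token predictions
    segments, current, prev = [], [], None
    for t, p in enumerate(preds.tolist()):
        if p == blank_id:
            prev = None  # blank separates repeated tokens
            continue
        if p != prev and current:
            segments.append(torch.stack(current).mean(dim=0))
            current = []
        current.append(acoustic_states[t])
        prev = p
    if current:
        segments.append(torch.stack(current).mean(dim=0))
    # Fall back to a single frame if the CTC head predicted only blanks.
    return torch.stack(segments) if segments else acoustic_states[:1]


def adaptation_loss(speech_repr, text_repr):
    """Cross-modal adaptation: pull sentence-level speech and text
    representations together (mean-pooled L2 distance is one simple choice)."""
    return F.mse_loss(speech_repr.mean(dim=0), text_repr.mean(dim=0))
```

Shrinking equalizes the sequence lengths of the two modalities, which is what makes it plausible for the speech branch and the text branch to share the same semantic encoder and be trained in one latent space.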
