arXiv Analytics

arXiv:2309.15826 [cs.CL]

Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing

Brian Yan, Xuankai Chang, Antonios Anastasopoulos, Yuya Fujita, Shinji Watanabe

Published 2023-09-27 (Version 1)

Recent works in end-to-end speech-to-text translation (ST) have proposed multi-tasking methods with soft parameter sharing, which leverage machine translation (MT) data via secondary encoders that map text inputs to an eventual cross-modal representation. In this work, we instead propose an ST/MT multi-tasking framework with hard parameter sharing, in which all model parameters are shared cross-modally. Our method reduces the speech-text modality gap via a pre-processing stage that converts speech and text inputs into two discrete token sequences of similar length -- this allows models to process both modalities indiscriminately using a single joint vocabulary. With experiments on MuST-C, we demonstrate that our multi-tasking framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU without any external MT data. Further, we show that this framework can incorporate external MT data, yielding +0.8 BLEU, and also improves transfer learning from pre-trained textual models, yielding +1.8 BLEU.
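
To illustrate the hard-parameter-sharing idea described in the abstract, the short Python sketch below (not the authors' implementation) shows how discretized speech units and text subwords can be mapped into one joint vocabulary, so that a single set of model parameters can consume either modality. The unit inventory, subword list, and encode helper are hypothetical placeholders, not the paper's actual tokenization or vocabulary.

    # Minimal sketch (not the paper's code) of cross-modal hard parameter sharing:
    # speech and text are pre-processed into discrete token sequences drawn from
    # one joint vocabulary, so a single shared model can train on both.

    # Hypothetical discrete speech units (e.g. cluster indices from a
    # self-supervised speech model) and hypothetical text subwords.
    SPEECH_UNITS = [f"<unit_{i}>" for i in range(8)]                      # placeholder unit inventory
    TEXT_SUBWORDS = ["_we", "_are", "_stu", "dents", "_nous", "_sommes"]  # placeholder subwords
    JOINT_VOCAB = {tok: idx for idx, tok in enumerate(SPEECH_UNITS + TEXT_SUBWORDS)}

    def encode(tokens):
        """Map any discrete token sequence (speech units or subwords) to joint-vocabulary IDs."""
        return [JOINT_VOCAB[t] for t in tokens]

    # An ST example: discretized source speech on the input side.
    st_source = ["<unit_3>", "<unit_3>", "<unit_5>", "<unit_1>"]
    # An MT example: source-language subwords on the input side.
    mt_source = ["_nous", "_sommes"]

    # Both map into the same ID space, so an encoder-decoder with fully shared
    # (hard-shared) parameters can be trained on interleaved ST and MT batches.
    print(encode(st_source))   # [3, 3, 5, 1]
    print(encode(mt_source))   # [12, 13]

Because both modalities arrive as ID sequences over the same vocabulary and of broadly similar length, no modality-specific encoder is needed; this is the contrast with the soft-parameter-sharing approaches mentioned at the start of the abstract.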

Related articles:
arXiv:1702.03856 [cs.CL] (Published 2017-02-13)
Towards speech-to-text translation without speech recognition
arXiv:2407.03169 [cs.CL] (Published 2024-07-03)
Investigating Decoder-only Large Language Models for Speech-to-text Translation
arXiv:1912.07240 [cs.CL] (Published 2019-12-16)
Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding
Yuchen Liu et al.