Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data


Abstract

While large language models (LLMs) have been explored in the speech domain for both generation and recognition tasks, their applications are predominantly confined to monolingual scenarios, with limited exploration of multilingual and code-switched (CS) contexts. Moreover, generation and recognition are usually handled by separate models, such as VALL-E for synthesis and Qwen-Audio for recognition. In this paper, we propose a MultiLingual MultiTask (MLMT) model that integrates multilingual speech generation and recognition within a single LLM. Furthermore, we develop an effective data construction approach that splits and concatenates words from different languages to equip LLMs with CS synthesis ability without relying on CS data. Experimental results demonstrate that our model outperforms other baselines at a comparable data scale. Our data construction approach not only equips LLMs with CS speech synthesis capability, with comparable speaker consistency and similarity to any given speaker, but also improves their performance on multilingual speech generation and recognition tasks.

Methodology

Model Architecture

  • Fig. 1. The architecture of the MLMT system. Figure (a) shows the pipeline for the ASR task; figure (b) shows the pipeline for the TTS task. Note that the adapter and the LLM share all parameters across the two tasks.
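Concretely, the shared-parameter design can be sketched as follows. This is a minimal illustration assuming speech is discretized into mHuBERT units added to the LLM vocabulary; the prompt template, task tags, and unit-token names are assumptions for illustration, not the paper's exact format.

```python
# Sketch of the single-LLM multi-task setup: ASR and TTS differ only in the
# direction of the prompt; the adapter and LLM weights are shared.
from transformers import AutoModelForCausalLM, AutoTokenizer

BACKBONE = "meta-llama/Meta-Llama-3-8B"  # LLaMA 3 backbone, per the baselines below

tokenizer = AutoTokenizer.from_pretrained(BACKBONE)
model = AutoModelForCausalLM.from_pretrained(BACKBONE)

# Discrete speech units (e.g., from a 1000-cluster mHuBERT quantizer) are
# added to the vocabulary as new tokens.
tokenizer.add_tokens([f"<unit_{i}>" for i in range(1000)])
model.resize_token_embeddings(len(tokenizer))

def build_sequence(task: str, text: str = "", units=()) -> str:
    """Build a task-tagged training sequence; only the prompt direction
    distinguishes ASR from TTS."""
    unit_str = "".join(f"<unit_{u}>" for u in units)
    if task == "asr":   # speech units in, text out
        return f"[ASR] {unit_str} -> {text}"
    if task == "tts":   # text in, speech units out
        return f"[TTS] {text} -> {unit_str}"
    raise ValueError(f"unknown task: {task}")
```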
Data Construction Strategy

  • Fig. 2. The procedure for constructing CS data.
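The splicing procedure in Fig. 2 can be illustrated with a short sketch: monolingual utterances are split at word boundaries, and segments from the two languages are concatenated into a CS audio-text pair. This is a minimal illustration assuming word-level timestamps from a forced aligner (e.g., MFA); the data layout and the 1-3 word insertion span are assumptions, not the paper's exact recipe.

```python
import random
import numpy as np

def build_cs_utterance(zh_utt, en_utt, sr=16000):
    """zh_utt / en_utt: {'wav': np.ndarray, 'words': [(word, start_s, end_s)]},
    with word timestamps assumed to come from a forced aligner (e.g., MFA).
    The Mandarin utterance is assumed to have at least two words."""
    def seg(utt, i, j):
        # Audio covering words i..j-1 of one utterance.
        s = int(utt['words'][i][1] * sr)
        e = int(utt['words'][j - 1][2] * sr)
        return utt['wav'][s:e]

    # Choose a switch point inside the Mandarin utterance and a short run
    # of English words (1-3 here, an illustrative choice) to splice in.
    k = random.randrange(1, len(zh_utt['words']))
    i = random.randrange(len(en_utt['words']))
    j = random.randint(i + 1, min(i + 3, len(en_utt['words'])))

    wav = np.concatenate([seg(zh_utt, 0, k),
                          seg(en_utt, i, j),
                          seg(zh_utt, k, len(zh_utt['words']))])
    text = ("".join(w for w, _, _ in zh_utt['words'][:k])
            + " " + " ".join(w for w, _, _ in en_utt['words'][i:j]) + " "
            + "".join(w for w, _, _ in zh_utt['words'][k:]))
    return wav, text
```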

System Comparison

  • GT: Ground-truth speech
  • SpeechGPT: Open-sourced SpeechGPT (stage 2)
  • Initial HuBERT: A model trained on the initial mHuBERT tokens with a LLaMA 3 backbone
  • VALL-E X: Open-sourced VALL-E X implementation trained on 704 hours of English data and 598 hours of Mandarin data
  • MLMT_mixed500: Our proposed model trained on 500 hours of LibriSpeech-960, 500 hours of AISHELL-2, and 10 hours of constructed CS data
  • MLMT_no500: Our proposed model trained on 500 hours of LibriSpeech-960 and 500 hours of AISHELL-2
  • MLMT_mixed: Our proposed model trained on 960 hours of LibriSpeech-960, 1000 hours of AISHELL-2, and 10 hours of constructed CS data
  • MLMT_no: Our proposed model trained on 960 hours of LibriSpeech-960 and 1000 hours of AISHELL-2
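For context, the discrete mHuBERT tokens referenced above are typically extracted as in the sketch below. The checkpoint name, feature layer, cluster count, and quantizer file are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torchaudio
import joblib  # a pretrained k-means quantizer is assumed to be saved with joblib
from transformers import HubertModel

# Checkpoint and feature layer are assumptions, not the paper's config.
encoder = HubertModel.from_pretrained("utter-project/mHuBERT-147").eval()
kmeans = joblib.load("mhubert_km1000.pkl")  # hypothetical 1000-cluster quantizer

def speech_to_units(path: str) -> list[int]:
    """Discretize a waveform into mHuBERT cluster indices."""
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, 16000)
    with torch.no_grad():
        # Hidden states from an intermediate layer are quantized to units.
        feats = encoder(wav, output_hidden_states=True).hidden_states[9][0]
    return kmeans.predict(feats.numpy()).tolist()
```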
Results on English TTS

Audio samples for GT, SpeechGPT, Initial HuBERT, VALL-E X, MLMT_no500, MLMT_mixed500, MLMT_no, and MLMT_mixed accompany each transcript:
1. Stuff it into you his belly counselled him
2. He tried to think how it could be
3. Beware of making that mistake
4. He had the faith in him that moves mountains
5. He could wait no longer
6. She sent me the pages in question before she died
7. Probably not till the second post
8. Cried the ladies whose departure had been fixed
9. It was the beauty of it
10. If spoken to, she would not speak again

Results on Mandarin TTS

Audio samples for GT, VALL-E X, MLMT_no500, MLMT_mixed500, MLMT_no, and MLMT_mixed accompany each transcript (English translations in parentheses):
1. 把她推进去 (Push her in.)
2. 你那几句话 (Those few words of yours.)
3. 简单但非常有效 (Simple but very effective.)
4. 我帮你问问从那怎么到我家 (I'll ask for you how to get from there to my place.)
5. 你想要的不是钱 (What you want isn't money.)
6. 实际上这些画不是我的 (Actually, these paintings aren't mine.)
7. 在东德克萨斯一带活动 (Operating around East Texas.)
8. 好像多年一样 (As if it had been many years.)
9. 你说什么我都会信的 (I'll believe whatever you say.)
10. 也不了解自己在说什么 (Nor understands what they themselves are saying.)

Results on CS TTS

The transcripts are randomly sampled from the ASRU-2019 code-switching challenge test set. Speech is synthesized in the voice of the speaker in the reference utterance, which is randomly selected from TIMIT (English) or ASCEND (Mandarin).
The reference speech and audio samples for VALL-E X, MLMT_no500, MLMT_mixed500, MLMT_no, and MLMT_mixed accompany each transcript (English translations in parentheses):
1. Boy crisis出现真不是没道理的 (The boy crisis really didn't appear without reason.)
2. 工资还剩六百元的话就买riddle joker (If I have six hundred yuan of salary left, I'll buy Riddle Joker.)
3. 一个漂亮的spear成功双肩压地三秒 (A beautiful spear, successfully pinning both shoulders to the mat for three seconds.)
4. 去每个公寓找它的reception (Go to each apartment building and find its reception.)
5. awkward moment在商场遇到以前的学生 (Awkward moment: running into a former student at the mall.)
6. 你不会连airmail都私吞吧 (You wouldn't pocket even the airmail, would you?)
7. 修改的patch都亲自写好了 (Even the revised patch was written personally.)
8. Lily送我的tulip种子总算开了一朵花 (The tulip seeds Lily gave me finally produced a flower.)
9. 目击者都很shaken了话筒还戳人家脸前 (The witness was already quite shaken, yet the microphone was still shoved in front of their face.)
10. 除非你是superman穿在外面 (Unless you're Superman, wearing it on the outside.)
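The speaker consistency claimed for CS synthesis is commonly quantified as the cosine similarity between speaker embeddings of the reference and the generated speech. Below is a minimal sketch using a public ECAPA-TDNN encoder from SpeechBrain; the encoder choice is an assumption, not necessarily the paper's evaluation setup (16 kHz input assumed).

```python
import torchaudio
from torch.nn.functional import cosine_similarity
from speechbrain.pretrained import EncoderClassifier

# Public ECAPA-TDNN speaker encoder; an assumption, not the paper's exact setup.
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def speaker_similarity(ref_path: str, gen_path: str) -> float:
    """Cosine similarity between speaker embeddings of the reference
    utterance and the generated CS utterance (higher = more similar)."""
    ref, _ = torchaudio.load(ref_path)
    gen, _ = torchaudio.load(gen_path)
    ref_emb = encoder.encode_batch(ref).squeeze()
    gen_emb = encoder.encode_batch(gen).squeeze()
    return cosine_similarity(ref_emb, gen_emb, dim=0).item()
```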