Enhancing Code-switched Text-to-Speech Synthesis Capability in Large Language Models with only Monolingual Corpora


Abstract

While Large Language Models (LLMs) have shown potential in speech generation and recognition, their applications are mainly confined to monolingual scenarios, with limited exploration of code-switched (CS) contexts. In this paper, we propose a Code-Switched Large Language Model (CS-LLM) to enhance the code-switched text-to-speech synthesis (CS TTS) capability of LLMs using only monolingual corpora. Specifically, we begin by enhancing the multilingual speech processing ability of LLMs through multilingual speech recognition and synthesis tasks. Then, we develop an effective CS data construction strategy that splits and concatenates words from different monolingual speech corpora to equip LLMs with improved CS TTS ability. Experiments show that our approach outperforms baselines in CS TTS in terms of naturalness, speaker consistency, and similarity, even with limited data. Additionally, the constructed CS data further improves multilingual speech synthesis and recognition.

Methodology

Model Architecture

  • Fig. 1. The pipeline of CS-LLM for the multilingual ASR task (Figure (a)) and the multilingual TTS task (Figure (b)). Note that the parameters of all adapters and the LLM backbone are shared across the different tasks.
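A minimal sketch of the shared-parameter design in Fig. 1, assuming a prompt-based task interface (the `CSLLMPipeline` class and the `<asr>`/`<tts>` tags are illustrative assumptions, not the released implementation): one adapter/backbone parameter set serves both ASR and TTS, so switching tasks changes only the prompt layout, never the weights.

```python
# Hedged sketch of Fig. 1's shared-parameter routing. All names and tag
# formats here are assumptions for illustration; the real model runs a
# LLaMA 3 backbone with modality adapters.
class CSLLMPipeline:
    def __init__(self):
        # A single parameter set is reused for every task (stand-in for
        # the shared adapters + LLM backbone noted in the caption).
        self.params = {"adapters": "shared", "backbone": "shared"}

    def build_prompt(self, task: str, payload: str) -> str:
        """Only the task tag and input modality differ between tasks."""
        if task == "asr":   # discrete speech tokens in -> text out
            return f"<asr> <speech>{payload}</speech>"
        if task == "tts":   # text in -> discrete speech tokens out
            return f"<tts> <text>{payload}</text>"
        raise ValueError(f"unknown task: {task}")


pipe = CSLLMPipeline()
asr_prompt = pipe.build_prompt("asr", "s1 s2 s3")
tts_prompt = pipe.build_prompt("tts", "hello world")
```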
  • Data Construction Strategy

  • Fig. 2. The illustration of CS data construction strategy.
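The strategy in Fig. 2 cuts word-level segments out of two monolingual utterances and concatenates them into a code-switched sample. A minimal sketch of that idea, assuming word timestamps from a forced aligner (the `Word`, `cut`, and `build_cs_sample` names are illustrative, not the paper's code):

```python
# Hedged sketch of the CS data construction strategy: word segments
# from an English and a Mandarin utterance are spliced per a language
# pattern, and their transcripts are concatenated in the same order.
from dataclasses import dataclass
from typing import List, Tuple

SR = 16_000  # assumed sample rate of the monolingual corpora


@dataclass
class Word:
    text: str
    start: float  # seconds, e.g. from a forced aligner
    end: float


def cut(wav: List[float], w: Word) -> List[float]:
    """Slice the waveform span covered by one aligned word."""
    return wav[int(w.start * SR):int(w.end * SR)]


def build_cs_sample(en_wav, en_words, zh_wav, zh_words,
                    pattern) -> Tuple[List[float], str]:
    """Concatenate word segments following `pattern`, e.g. ["zh", "en"]."""
    en_it, zh_it = iter(en_words), iter(zh_words)
    audio, text = [], []
    for lang in pattern:
        w = next(en_it) if lang == "en" else next(zh_it)
        src = en_wav if lang == "en" else zh_wav
        audio.extend(cut(src, w))
        text.append(w.text)
    return audio, " ".join(text)


# Toy demo: one Mandarin word followed by one English word.
demo_audio, demo_text = build_cs_sample(
    en_wav=[0.1] * SR, en_words=[Word("hello", 0.0, 0.5)],
    zh_wav=[0.2] * SR, zh_words=[Word("你好", 0.0, 0.25)],
    pattern=["zh", "en"],
)
```

In practice the spliced audio would be re-tokenized into discrete speech tokens for TTS training; this sketch only illustrates the split-and-concatenate step.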

  • System Comparison

  • GT: Ground truth speech
  • SpeechGPT: Open-sourced SpeechGPT (stage-2)
  • Initial HuBERT: A model trained with initial mHuBERT tokens using a LLaMA 3 backbone
  • VALL-E X: Open-sourced VALL-E X implementation trained with 704 hours of English data and 598 hours of Mandarin data
  • CS-LLM_mixed500: Our proposed model trained with 500 hours of LibriSpeech-960, 500 hours of AISHELL-2, and 10 hours of constructed CS data
  • CS-LLM_no500: Our proposed model trained with 500 hours of LibriSpeech-960 and 500 hours of AISHELL-2
  • CS-LLM_mixed: Our proposed model trained with 960 hours of LibriSpeech-960, 1000 hours of AISHELL-2, and 10 hours of constructed CS data
  • CS-LLM_no: Our proposed model trained with 960 hours of LibriSpeech-960 and 1000 hours of AISHELL-2
  • Results on English TTS

    No. Transcript GT SpeechGPT Initial HuBERT VALL-E X CS-LLM_no500 CS-LLM_mixed500 CS-LLM_no CS-LLM_mixed
    1 Stuff it into you his belly counselled him
    2 He tried to think how it could be
    3 Beware of making that mistake
    4 He had the faith in him that moves mountains
    5 He could wait no longer
    6 She sent me the pages in question before she died
    7 Probably not till the second post
    8 Cried the ladies whose departure had been fixed
    9 It was the beauty of it
    10 If spoken to, she would not speak again

    Results on Mandarin TTS

    No. Transcript GT VALL-E X CS-LLM_no500 CS-LLM_mixed500 CS-LLM_no CS-LLM_mixed
    1 把她推进去
    2 你那几句话
    3 简单但非常有效
    4 我帮你问问从那怎么到我家
    5 你想要的不是钱
    6 实际上这些画不是我的
    7 在东德克萨斯一带活动
    8 好像多年一样
    9 你说什么我都会信的
    10 也不了解自己在说什么

    Results on CS TTS

    The texts are randomly sampled from the ASRU-2019 Challenge test set. The speech is synthesized using the speaker of the reference speech, which is randomly selected from TIMIT (English) or ASCEND (Mandarin).
    No. Transcript Reference speech VALL-E X CS-LLM_no500 CS-LLM_mixed500 CS-LLM_no CS-LLM_mixed
    1 awkward moment在商场遇到以前的学生
    2 你不会连airmail都私吞吧
    3 修改的patch都亲自写好了
    4 Lily送我的tulip种子总算开了一朵花
    5 目击者都很shaken了话筒还戳人家脸前
    6 除非你是superman穿在外面
    7 Boy crisis出现真不是没道理的
    8 工资还剩六百元的话就买riddle joker
    9 一个漂亮的spear成功双肩压地三秒
    10 去每个公寓找它的reception