Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models


Abstract

Self-supervised learning (SSL) models have shown strong performance in various downstream tasks such as speech recognition and synthesis. However, they are typically developed for a limited set of languages and may encounter new languages when applied in real-world scenarios. Developing an SSL model from scratch for a new language is costly because of its large number of parameters. Thus, it is vital to adapt an existing SSL model to a new language efficiently without impairing its original abilities. We propose novel adaptation methods and preservation strategies to achieve this goal. Applied to mHuBERT, we investigate their effectiveness on the multilingual speech re-synthesis task. Experiments show that the adaptation methods enable mHuBERT to be applied to a new language (Mandarin), reducing the WER on the speech re-synthesis task by up to 61.72% relative. Our preservation strategies also ensure that speech re-synthesis performance on both existing and new languages remains intact.

Results on Mandarin adaptation

System comparison

  • Ground Truth: Ground truth speech
  • Mandarin HuBERT: Open-sourced HuBERT trained on 10,000+ hours of WenetSpeech
  • Unadapted mHuBERT + Untrained K-means: Un-adapted mHuBERT model with the released K-means model
  • Unadapted mHuBERT + Trained K-means: Un-adapted mHuBERT model with a K-means model trained on extracted Mandarin representations
  • Adapted mHuBERT (one-iteration) + Trained K-means (proposed): Adapted mHuBERT model using the one-iteration adaptation strategy with a K-means model trained on extracted Mandarin representations
  • Adapted mHuBERT (two-iteration) + Trained K-means (proposed): Adapted mHuBERT model using the two-iteration adaptation strategy with a K-means model trained on extracted Mandarin representations (the feature-extraction and clustering pipeline is sketched after the table below)
  • No. Transcript Ground Truth Mandarin HuBERT Unadapted mHuBERT + Untrained K-means Unadapted mHuBERT + Trained K-means Adapted mHuBERT (one-iteration) + Trained K-means Adapted mHuBERT (two-iteration) + Trained K-means
    1 武术始终被看作我国的国粹
    2 新政的推出是一项长远的制度安排
    3 其实无论是通过和什么厂商合作
    4 使自己得到更多的实惠
    5 而且对手也大多是中国人
    6 不排除内幕交易的可能
    7 来一首死了都要爱
    8 可以和车主直接进行对话

Results on language preservation and adaptation

System comparison

In the system notation, "before" and "after" denote before and after the preservation strategies are applied, respectively. For the systems before preservation, the K-means model used for English speech re-synthesis is trained on representations extracted from the un-adapted model; the re-clustering step used after preservation is sketched below the table.
  • GT: Ground truth speech
  • zh-2iter-before: Adapted mHuBERT using only Mandarin speech with two-iteration adaptation strategy (before preservation)
  • zh-1iter-before: Adapted mHuBERT using only Mandarin speech with two-iteration adaptation strategy (before preservation)
  • zh-2iter-after: Adapted mHuBERT using only Mandarin speech with two-iteration adaptation strategy and reclustering preservation strategy (after preservation)
  • zh-1iter-after: Adapted mHuBERT using only Mandarin speech with one-iteration adaptation strategy and reclustering preservation strategy (after preservation)
  • enzh-2iter-after: Adapted mHuBERT using data combination preservation strategy with two-iteration adaptation strategy (after preservation)
  • No. Transcript GT zh-2iter-before zh-1iter-before zh-2iter-after zh-1iter-after enzh-2iter-after
    1 Ask her to bring these things with her from the store.
    2 He had no enemies.
    3 She's a spy!
    4 It's a real challenge.
    5 Art is extra.
    6 I've spent a lot of time on buses.
    7 Painful, but only because it's true.
    8 He seems to be pleased with the picture.
    9 武术始终被看作我国的国粹
    10 新政的推出是一项长远的制度安排
    11 其实无论是通过和什么厂商合作
    12 使自己得到更多的实惠
    13 而且对手也大多是中国人
    14 不排除内幕交易的可能
    15 来一首死了都要爱
    16 可以和车主直接进行对话
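The re-clustering preservation used by the *-after systems amounts to refitting each existing language's K-means on features produced by the adapted backbone, so the unit inventories stay consistent with the shifted representation space (data combination, by contrast, simply mixes English and Mandarin speech during adaptation and needs no extra step). A minimal sketch, with adapted_extract as a hypothetical stand-in for the adapted mHuBERT forward pass:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans


    def adapted_extract(waveform: np.ndarray) -> np.ndarray:
        """Hypothetical stand-in for a forward pass through the adapted mHuBERT."""
        rng = np.random.default_rng(1)
        n_frames = max(1, len(waveform) // 320)
        return rng.standard_normal((n_frames, 768)).astype(np.float32)


    def recluster(corpora: dict, n_units: int = 1000) -> dict:
        """Refit one K-means per language on adapted-model features (re-clustering)."""
        quantizers = {}
        for lang, waves in corpora.items():
            feats = np.concatenate([adapted_extract(w) for w in waves], axis=0)
            # Cap the vocabulary for tiny toy corpora; real runs keep n_units as-is.
            km = MiniBatchKMeans(n_clusters=min(n_units, len(feats)))
            quantizers[lang] = km.fit(feats)
        return quantizers


    if __name__ == "__main__":
        toy = {"en": [np.zeros(16_000, np.float32)],
               "zh": [np.zeros(16_000, np.float32)]}
        print({lang: km.n_clusters for lang, km in recluster(toy, n_units=32).items()})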

Ablation study on cluster numbers

System comparison on Mandarin adaptation

In this experiment, we focus only on the performance of new-language adaptation; a sketch of the cluster-number sweep follows the table.
  • Ground Truth: Ground truth speech
  • zh-2iter-1000: Adapted mHuBERT model using the two-iteration adaptation strategy, with a K-means model trained on extracted Mandarin representations and a cluster number of 1000
  • zh-2iter-3000: Adapted mHuBERT model using the two-iteration adaptation strategy, with a K-means model trained on extracted Mandarin representations and a cluster number of 3000
  • No. Transcript Ground Truth zh-2iter-1000 zh-2iter-3000
    1 武术始终被看作我国的国粹
    2 新政的推出是一项长远的制度安排
    3 其实无论是通过和什么厂商合作
    4 使自己得到更多的实惠
    5 而且对手也大多是中国人
    6 不排除内幕交易的可能
    7 来一首死了都要爱
    8 可以和车主直接进行对话
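The ablation above changes only the size of the K-means unit vocabulary (1000 vs. 3000 clusters) while keeping the adapted backbone fixed. A minimal sketch of that sweep, using a small random matrix purely as a stand-in for cached Mandarin frame features (real mHuBERT frames are 768-dimensional):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # Stand-in for cached Mandarin frame features; a small toy matrix is used
    # here only so the sweep finishes quickly.
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((20_000, 64)).astype(np.float32)

    for n_units in (1000, 3000):
        km = MiniBatchKMeans(n_clusters=n_units, init="random",
                             batch_size=4096).fit(feats)
        # Mean squared distance to the assigned centroid: a rough proxy for how
        # finely the representation space is covered as the vocabulary grows.
        distortion = -km.score(feats) / len(feats)
        print(f"K={n_units}: mean quantization distortion {distortion:.2f}")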

System comparison on language preservation and adaptation

In this experiment, we consider the balance between the new language and the existing languages.
  • GT: Ground truth speech
  • zh-2iter-1000: Adapted mHuBERT model using the two-iteration adaptation strategy and the re-clustering preservation strategy, with a K-means model trained on extracted Mandarin representations and a cluster number of 1000
  • zh-2iter-3000: Adapted mHuBERT model using the two-iteration adaptation strategy and the re-clustering preservation strategy, with a K-means model trained on extracted Mandarin representations and a cluster number of 3000
  • No. Transcript GT zh-2iter-1000 zh-2iter-3000
    1 Ask her to bring these things with her from the store.
    2 He had no enemies.
    3 She's a spy!
    4 It's a real challenge.
    5 Art is extra.
    6 I've spent a lot of time on buses.
    7 Painful, but only because it's true.
    8 He seems to be pleased with the picture.
    9 武术始终被看作我国的国粹
    10 新政的推出是一项长远的制度安排
    11 其实无论是通过和什么厂商合作
    12 使自己得到更多的实惠
    13 而且对手也大多是中国人
    14 不排除内幕交易的可能
    15 来一首死了都要爱
    16 可以和车主直接进行对话