Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models


Abstract

Self-supervised learning (SSL) models have shown strong performance in various downstream tasks such as speech recognition and synthesis. However, they are typically developed for a limited set of languages and may encounter new languages when applied in real-world scenarios. Developing an SSL model from scratch for a new language is costly because of its large number of parameters. Thus, it is vital to adapt an existing SSL model to a new language efficiently without impairing its original abilities. We propose novel adaptation methods and preservation strategies to achieve this goal. Applied to mHuBERT, we investigate their effectiveness on the multilingual speech re-synthesis task. Experiments show that the adaptation methods enable mHuBERT to be applied to a new language (Mandarin), reducing the WER on the speech re-synthesis task by up to 61.72% relative. Our preservation strategies also ensure that speech re-synthesis performance on both existing and new languages remains intact.

Results on Mandarin adaptation

System comparison

  • Ground Truth: Ground truth speech
  • Mandarin HuBERT: Open-sourced HuBERT trained on 10,000+ hours of WenetSpeech
  • Unadapted mHuBERT + Untrained K-means: Un-adapted mHuBERT model with the released K-means model
  • Unadapted mHuBERT + Trained K-means: Un-adapted mHuBERT model with a K-means model trained on extracted Mandarin representations
  • Adapted mHuBERT (one-iteration) + Trained K-means (proposed): Adapted mHuBERT model using the one-iteration adaptation strategy with a K-means model trained on extracted Mandarin representations
  • Adapted mHuBERT (two-iteration) + Trained K-means (proposed): Adapted mHuBERT model using the two-iteration adaptation strategy with a K-means model trained on extracted Mandarin representations (the feature-extraction and clustering pipeline is sketched after the table below)
  • No. Transcript Ground Truth Mandarin HuBERT Unadapted mHuBERT + Untrained K-means Unadapted mHuBERT + Trained K-means Adapted mHuBERT (one-iteration) + Trained K-means Adapted mHuBERT (two-iteration) + Trained K-means
    1 武术始终被看作我国的国粹
    2 新政的推出是一项长远的制度安排
    3 其实无论是通过和什么厂商合作
    4 使自己得到更多的实惠
    5 而且对手也大多是中国人
    6 不排除内幕交易的可能
    7 来一首死了都要爱
    8 可以和车主直接进行对话

Results on language preservation and adaptation

System comparison

In the system notation, "before" and "after" denote before and after the preservation strategies are applied, respectively. For the systems before preservation, the K-means model used for English speech re-synthesis is trained on representations extracted from the un-adapted model; the re-clustering step used after preservation is sketched below the table.
  • GT: Ground truth speech
  • zh-2iter-before: Adapted mHuBERT using only Mandarin speech with two-iteration adaptation strategy (before preservation)
  • zh-1iter-before: Adapted mHuBERT using only Mandarin speech with two-iteration adaptation strategy (before preservation)
  • zh-2iter-after: Adapted mHuBERT using only Mandarin speech with two-iteration adaptation strategy and reclustering preservation strategy (after preservation)
  • zh-1iter-after: Adapted mHuBERT using only Mandarin speech with one-iteration adaptation strategy and reclustering preservation strategy (after preservation)
  • enzh-2iter-after: Adapted mHuBERT using data combination preservation strategy with two-iteration adaptation strategy (after preservation)
  • No. Transcript GT zh-2iter-before zh-1iter-before zh-2iter-after zh-1iter-after enzh-2iter-after
    1 Ask her to bring these things with her from the store.
    2 He had no enemies.
    3 She's a spy!
    4 It's a real challenge.
    5 Art is extra.
    6 I've spent a lot of time on buses.
    7 Painful, but only because it's true.
    8 He seems to be pleased with the picture.
    9 武术始终被看作我国的国粹
    10 新政的推出是一项长远的制度安排
    11 其实无论是通过和什么厂商合作
    12 使自己得到更多的实惠
    13 而且对手也大多是中国人
    14 不排除内幕交易的可能
    15 来一首死了都要爱
    16 可以和车主直接进行对话
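The re-clustering preservation used by the *-after systems amounts to refitting each existing language's K-means on features produced by the adapted backbone, so the unit inventories stay consistent with the shifted representation space (data combination, by contrast, simply mixes English and Mandarin speech during adaptation and needs no extra step). A minimal sketch, with adapted_extract as a hypothetical stand-in for the adapted mHuBERT forward pass:

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans


    def adapted_extract(waveform: np.ndarray) -> np.ndarray:
        """Hypothetical stand-in for a forward pass through the adapted mHuBERT."""
        rng = np.random.default_rng(1)
        n_frames = max(1, len(waveform) // 320)
        return rng.standard_normal((n_frames, 768)).astype(np.float32)


    def recluster(corpora: dict, n_units: int = 1000) -> dict:
        """Refit one K-means per language on adapted-model features (re-clustering)."""
        quantizers = {}
        for lang, waves in corpora.items():
            feats = np.concatenate([adapted_extract(w) for w in waves], axis=0)
            # Cap the vocabulary for tiny toy corpora; real runs keep n_units as-is.
            km = MiniBatchKMeans(n_clusters=min(n_units, len(feats)))
            quantizers[lang] = km.fit(feats)
        return quantizers


    if __name__ == "__main__":
        toy = {"en": [np.zeros(16_000, np.float32)],
               "zh": [np.zeros(16_000, np.float32)]}
        print({lang: km.n_clusters for lang, km in recluster(toy, n_units=32).items()})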

Ablation study on cluster numbers

System comparison on Mandarin adaptation

In this experiment, we focus only on the performance of new-language adaptation; a sketch of the cluster-number sweep follows the table.
  • Ground Truth: Ground truth speech
  • zh-2iter-1000: Adapted mHuBERT model using the two-iteration adaptation strategy, with a K-means model trained on extracted Mandarin representations and a cluster number of 1000
  • zh-2iter-3000: Adapted mHuBERT model using the two-iteration adaptation strategy, with a K-means model trained on extracted Mandarin representations and a cluster number of 3000
  • No. Transcript Ground Truth zh-2iter-1000 zh-2iter-3000
    1 武术始终被看作我国的国粹
    2 新政的推出是一项长远的制度安排
    3 其实无论是通过和什么厂商合作
    4 使自己得到更多的实惠
    5 而且对手也大多是中国人
    6 不排除内幕交易的可能
    7 来一首死了都要爱
    8 可以和车主直接进行对话
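The ablation above changes only the size of the K-means unit vocabulary (1000 vs. 3000 clusters) while keeping the adapted backbone fixed. A minimal sketch of that sweep, using a small random matrix purely as a stand-in for cached Mandarin frame features (real mHuBERT frames are 768-dimensional):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # Stand-in for cached Mandarin frame features; a small toy matrix is used
    # here only so the sweep finishes quickly.
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((20_000, 64)).astype(np.float32)

    for n_units in (1000, 3000):
        km = MiniBatchKMeans(n_clusters=n_units, init="random",
                             batch_size=4096).fit(feats)
        # Mean squared distance to the assigned centroid: a rough proxy for how
        # finely the representation space is covered as the vocabulary grows.
        distortion = -km.score(feats) / len(feats)
        print(f"K={n_units}: mean quantization distortion {distortion:.2f}")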

System comparison on language preservation and adaptation

In this experiment, we consider the balance between the new language and the existing languages.
  • GT: Ground truth speech
  • zh-2iter-1000: Adapted mHuBERT model using the two-iteration adaptation strategy and the re-clustering preservation strategy, with a K-means model trained on extracted Mandarin representations and a cluster number of 1000
  • zh-2iter-3000: Adapted mHuBERT model using the two-iteration adaptation strategy and the re-clustering preservation strategy, with a K-means model trained on extracted Mandarin representations and a cluster number of 3000
  • No. Transcript GT zh-2iter-1000 zh-2iter-3000
    1 Ask her to bring these things with her from the store.
    2 He had no enemies.
    3 She's a spy!
    4 It's a real challenge.
    5 Art is extra.
    6 I've spent a lot of time on buses.
    7 Painful, but only because it's true.
    8 He seems to be pleased with the picture.
    9 武术始终被看作我国的国粹
    10 新政的推出是一项长远的制度安排
    11 其实无论是通过和什么厂商合作
    12 使自己得到更多的实惠
    13 而且对手也大多是中国人
    14 不排除内幕交易的可能
    15 来一首死了都要爱
    16 可以和车主直接进行对话