[Deep.In. Article] AdaSpeech: Adaptive Text to Speech for Custom Voice - DeepBrainAI

Updated on February 1, 2023 | Technology
Published January 25, 2022
Deep Learning Team: Colin
Abstract

You may have changed the voice of the spoken guidance on an AI speaker or a navigation system; I set mine to the voice of my favorite actress, Yoo In-na. As speech synthesis has worked its way into many parts of everyday life, such as personal assistants, news broadcasts, and voice directions, synthesizing speech in a variety of voices has become important. On top of that, there is a growing demand to use not only other people's voices but one's own voice as an AI voice, which the speech synthesis field calls custom voice.

Today we will look at AdaSpeech, a text-to-speech (TTS) model proposed for custom voice synthesis. A custom voice is usually created by adapting a pre-trained source TTS model to the target user's voice. For convenience, the amount of user speech collected for this adaptation is typically small, which makes it very difficult to synthesize speech that sounds both natural and similar to the original voice. Training a neural network for a custom voice runs into two main problems.

First, a particular user's voice often has acoustic conditions that differ from the speech data the source TTS model was trained on. Speakers vary in prosody, style, emotion, accent, and recording environment, and these mismatches hurt the generalization of the source model and therefore the quality of adaptation.

Second, when adapting the source TTS model to a new voice, there is a trade-off between the number of fine-tuned parameters and voice quality. The more adaptation parameters you use, the better the quality you can achieve, but memory usage grows and deploying the model becomes more expensive.

Existing studies have approached the problem by fine-tuning the whole model or part of it (especially the decoder), fine-tuning only the speaker embedding used to distinguish speakers in multi-speaker synthesis, or training a separate speaker encoder module, usually under the assumption that the source speech and the adaptation data come from the same domain. In practice these approaches either require too many parameters or fail to deliver satisfactory quality.

AdaSpeech is a TTS model that can efficiently generate new users' (speakers') voices with high quality while addressing the problems above. Its pipeline is divided into three stages, pre-training, fine-tuning, and inference, and two techniques are used to tackle the difficulties just described. Let's take a look together! 🙂

 

Summary for Busy People
  • Acoustic condition modeling improves the model's generalization by extracting acoustic features at several scopes from the speech data and adding them to the phoneme encoding vectors.
  • Conditional layer normalization makes the process of adapting the source model to a new speaker's data efficient.
  • High-quality custom voices can now be created with fewer parameters and less new speech data than with previous baseline models.

 

Model Structure

AdaSpeech's backbone is FastSpeech 2. It consists largely of a phoneme encoder, a variance adaptor, and a mel-spectrogram decoder, plus two new components devised by the authors (the pink areas in Figure 1).
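To make the structure concrete, here is a minimal PyTorch-style sketch of that pipeline. The module names and shapes are illustrative stand-ins, not the authors' implementation; in particular, the real variance adaptor predicts duration, pitch, and energy and expands the phoneme-level hidden states to frame level, which is omitted here.

```python
# Illustrative sketch only: phoneme encoder -> acoustic conditions added ->
# variance adaptor -> mel decoder. All submodules are simplified stand-ins.
import torch
import torch.nn as nn

class AdaSpeechSketch(nn.Module):
    def __init__(self, n_phonemes, n_speakers, d_model=256, n_mels=80):
        super().__init__()
        self.phoneme_encoder = nn.Embedding(n_phonemes, d_model)    # stand-in for the Transformer encoder
        self.speaker_embedding = nn.Embedding(n_speakers, d_model)  # speaker-level acoustic condition
        self.utterance_encoder = nn.GRU(n_mels, d_model, batch_first=True)  # utterance-level condition
        self.variance_adaptor = nn.Linear(d_model, d_model)         # stand-in for duration/pitch/energy predictors
        self.mel_decoder = nn.Linear(d_model, n_mels)               # stand-in for the Transformer mel decoder

    def forward(self, phonemes, speaker_id, ref_mel):
        h = self.phoneme_encoder(phonemes)                      # (B, T_phonemes, d)
        spk = self.speaker_embedding(speaker_id).unsqueeze(1)   # (B, 1, d)
        _, utt = self.utterance_encoder(ref_mel)                # (1, B, d) from the reference mel
        h = h + spk + utt.transpose(0, 1)                       # add acoustic conditions to phoneme encodings
        h = self.variance_adaptor(h)
        return self.mel_decoder(h)                              # predicted mel spectrogram

model = AdaSpeechSketch(n_phonemes=70, n_speakers=100)
mel = model(torch.randint(0, 70, (2, 15)), torch.tensor([3, 7]), torch.randn(2, 120, 80))
print(mel.shape)  # torch.Size([2, 15, 80])
```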

 

Acoustic Condition Modeling

The source speech used to train the model can never cover all the acoustic conditions of a new user's voice, so the generalization ability of the model matters. In TTS the input text carries little of this acoustic information, so the model tends to memorize the acoustic conditions of its training data, which hurts generalization when generating custom voices. The simplest remedy is to provide the acoustic conditions to the model as additional input. AdaSpeech does this at three scopes, from the broadest down to the most local, and the authors call it acoustic condition modeling: speaker level, utterance level, and phoneme level. Each level carries the following information.

  • Speaker level: captures the overall characteristics of a speaker and covers the broadest range of acoustic conditions (e.g., the speaker embedding).
  • Utterance level: captures characteristics that appear across a whole utterance. A mel spectrogram of a reference speech is fed to an encoder that outputs a single feature vector. During training, the target speech serves as the reference; at inference, one of the target speaker's recordings is chosen at random as the reference.
  • Phoneme level: the narrowest scope, capturing per-phoneme characteristics (e.g., the intensity and pitch of a particular phoneme, prosody, and transient ambient noise). The input here is a phoneme-level mel spectrogram in which the mel frames belonging to the same phoneme are replaced by their average within that span, as sketched right after this list. At inference, an acoustic predictor with the same structure takes the hidden vectors from the phoneme encoder as input and predicts the phoneme-level vectors.
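To illustrate the phoneme-level input described above, here is a minimal sketch of the frame-averaging step. The helper function and its signature are assumptions for illustration, not the authors' code; it only needs the frame-level mel spectrogram and per-phoneme durations.

```python
# Build a phoneme-level mel spectrogram: every frame belonging to the same
# phoneme is replaced by the average over that phoneme's frames, so the
# resulting sequence length matches the phoneme sequence.
import torch

def phoneme_level_mel(mel, durations):
    """mel: (T_frames, n_mels); durations: frame counts per phoneme."""
    chunks, start = [], 0
    for d in durations:
        chunks.append(mel[start:start + d].mean(dim=0))  # average frames within one phoneme
        start += d
    return torch.stack(chunks)  # (n_phonemes, n_mels)

mel = torch.randn(12, 80)                  # 12 frames, 80 mel bins
durs = [3, 4, 5]                           # three phonemes covering the 12 frames
print(phoneme_level_mel(mel, durs).shape)  # torch.Size([3, 80])
```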

 

Conditional Layer Normalization

 

AdaSpeech's mel decoder is built from the self-attention and feed-forward blocks of the Transformer, which contain many parameters, so fine-tuning all of them for every new voice would be inefficient. The authors therefore apply conditional layer normalization to the self-attention and feed-forward network in each decoder layer and, during fine-tuning, update only the scale and bias used there so that they suit the new speaker. The scale and bias are called conditional because, as in the figure above, they are produced by linear layers that take the speaker embedding as input.
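Below is a minimal sketch of conditional layer normalization under these assumptions: two small linear layers map the speaker embedding to a scale and a bias, which replace the usual learnable affine parameters of layer normalization. Module and variable names are illustrative, not the authors' implementation.

```python
# Conditional layer normalization: normalize over the feature dimension, then
# apply a speaker-dependent scale and bias computed from the speaker embedding.
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    def __init__(self, d_model, d_speaker):
        super().__init__()
        self.to_scale = nn.Linear(d_speaker, d_model)  # gamma(speaker)
        self.to_bias = nn.Linear(d_speaker, d_model)   # beta(speaker)

    def forward(self, x, speaker_emb):
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True) + 1e-5
        x_norm = (x - mean) / std                        # plain normalization, no fixed affine params
        scale = self.to_scale(speaker_emb).unsqueeze(1)  # (B, 1, d_model)
        bias = self.to_bias(speaker_emb).unsqueeze(1)
        return scale * x_norm + bias

x = torch.randn(2, 50, 256)   # (batch, decoder positions, hidden)
spk = torch.randn(2, 64)      # speaker embedding
print(ConditionalLayerNorm(256, 64)(x, spk).shape)  # torch.Size([2, 50, 256])
```

During fine-tuning, only `to_scale`, `to_bias`, and the speaker embedding would be updated, which is what keeps the number of adaptation parameters small.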

 

Training and Inference Process

The procedure for training AdaSpeech and synthesizing speech for a new speaker is summarized in the algorithm above. First, the source model is pre-trained on as much text-speech data as possible. Then, during fine-tuning, only the parameters of conditional layer normalization and the speaker embedding are updated on the new speaker's speech data. At inference, the parameters computed from the speaker information and the parameters that were not fine-tuned are used together to generate the mel spectrogram.
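A hedged sketch of that fine-tuning step, assuming a PyTorch model that exposes the illustrative attribute names used above (`speaker_embedding`, `to_scale`, `to_bias`): every other source-model parameter is frozen, so only the conditional layer normalization projections and the speaker embedding are updated on the new speaker's data.

```python
# Freeze all source-model parameters except the speaker embedding and the
# conditional layer norm projections, then hand only those to the optimizer.
# Attribute names are the illustrative ones from the sketches above.
import torch

def prepare_for_finetuning(model: torch.nn.Module):
    for name, param in model.named_parameters():
        param.requires_grad = (
            "speaker_embedding" in name or "to_scale" in name or "to_bias" in name
        )
    return [p for p in model.parameters() if p.requires_grad]

# Usage (hypothetical): optimizer = torch.optim.Adam(prepare_for_finetuning(model), lr=2e-4)
```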

 

Experiment Results
Custom Voice Quality Evaluation

 

MelGAN was used as the vocoder. The naturalness of the synthesized custom voice was evaluated with MOS, and similarity with a metric called SMOS. AdaSpeech synthesizes higher-quality voices with fewer or a similar number of adaptation parameters than the baselines. And since the source TTS model was pre-trained on the LibriTTS dataset, it unsurprisingly receives the highest scores when adapted to new LibriTTS speakers.

 

Ablation Study

Using CMOS (comparison MOS), which measures relative quality, the authors ran an ablation study on the techniques claimed as contributions of the paper. Since every variant of AdaSpeech with a component removed scores lower in CMOS than the full AdaSpeech in Table 2, we can conclude that all of the techniques contribute to the quality improvement.

 

Acoustic Condition Modeling Analysis

Figure 4(a) visualizes the learned utterance-level acoustic vectors of the speakers with t-SNE. Different sentences spoken by the same speaker fall into the same cluster, which suggests that the model has learned the speaker's characteristic way of producing an utterance. A few exceptions appear, but these utterances are usually short or emotional speech, which makes them hard to distinguish from other speakers' utterances.

Conditional Layer Normalization Analysis

Comparing the CMOS scores shows that voice quality is best when conditional layer normalization is used. In other words, when performing layer normalization it is better to modulate the scale and bias with the speaker's characteristics, and updating only those parameters is enough to give the model good adaptability.

Amount of Adaptive Data Analysis

Finally, the authors tested how much of a new user's speech is needed for the model to be practical. As Figure 4(b) shows, the quality of the synthesized voice improves rapidly up to 10 adaptation samples and shows no significant improvement beyond that, so fine-tuning AdaSpeech with only about 10 samples per speaker is sufficient.

 

Conclusion and Opinion

AdaSpeech is a TTS model that can adapt to new users while keeping the advantages of FastSpeech, which sped up synthesis with parallel generation. Acoustic condition modeling improves generalization by capturing the characteristics of the voice, and refining it further might yield AI voices that match a user's characteristics even more closely. A model that delivers custom voice TTS from only 10 samples has enormous practical value, but it is a pity that fine-tuning still requires the user's recordings together with the matching text. Even among people willing to record their voice for an AI voice synthesis service, many will not bother to type out the transcripts as well. So in the next post, we will introduce a modified version of AdaSpeech that enables custom voice synthesis without paired text-speech data.

 

Reference

(1) [FastSpeech 2 Paper] FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

(2) [AdaSpeech Paper] AdaSpeech: Adaptive Text to Speech for Custom Voice

(3) [AdaSpeech Demo] https://speechresearch.github.io/adaspeech/
