Deep Learning Team : Colin
You may have changed the guide voice while using an AI speaker or a navigation system. I set my speaker's voice to that of my favorite actor, Yoo In-na. As speech synthesis technology has become part of many areas of life, such as personal assistants, news broadcasts, and voice directions, it has become important to synthesize speech in a variety of voices. There is also a growing demand to use not only other people's voices but one's own voice as an AI voice, which is called custom voice synthesis in the field of speech synthesis research.
Today, we will look at a text-to-speech (TTS) model called AdaSpeech, which was proposed for custom voice synthesis. Custom voices are usually generated by adapting a pre-trained source TTS model to the user's voice. For the user's convenience, the amount of speech data available for adaptation is usually small, which makes it very difficult to generate voices that sound natural and similar to the original. There are two main problems in training a neural network for a custom voice.
First, a new user's voice often has acoustic conditions different from the speech data the source TTS model was trained on. For example, speakers differ in prosody, style, emotion, stress, and recording environment, and the resulting differences in speech data can hurt the generalization of the source model, leading to poor adaptation quality.
Second, when adapting the source TTS model to a new voice, there is a trade-off between the number of fine-tuned parameters and voice quality. In other words, the more adaptation parameters you use, the better the quality you can produce, but also the higher the memory usage and the cost of deploying the model.
Existing studies have approached this by fine-tuning the entire model or part of it (especially the decoder), fine-tuning only the speaker embedding used to distinguish speakers in multi-speaker speech synthesis, training a separate speaker encoder module, or assuming that the source speech and the adaptation data come from the same domain. However, these approaches are hard to use in practice because they either require too many parameters or do not produce satisfactory quality.
AdaSpeech is a TTS model that can efficiently generate new users' (or speakers') voices with high quality while solving the problems above. The pipeline is divided into three stages, pre-training, fine-tuning, and inference, and two techniques are used to overcome the difficulties described above. From now on, we will look at them together! 🙂
AdaSpeech's backbone is FastSpeech 2. It consists largely of a phoneme encoder, a variance adaptor, and a mel-spectrogram decoder, and it includes two new elements (the pink areas in Figure 1) devised by the authors.
In general, it is important to increase the generalization performance of the model because the source speech used in training cannot cover all the acoustic conditions of a new user's voice. Since these acoustic conditions are difficult to infer from the text given to a TTS model, the model tends to memorize the acoustic conditions of its training data, which hurts generalization when generating custom voices. The simplest way to solve this problem is to provide the acoustic conditions as an input to the model. This is called acoustic condition modeling, and it is organized into speaker level, utterance level, and phoneme level, covering acoustic information from coarse, speaker-wide characteristics down to fine-grained, local detail. The speaker level captures the overall characteristics of a speaker (the coarsest granularity), the utterance level captures the acoustic condition of each utterance, extracted from a reference mel spectrogram, and the phoneme level captures fine-grained conditions at each phoneme, such as pitch, prosody, and temporal environment noise.
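To make this concrete, here is a minimal PyTorch sketch of what an utterance-level acoustic encoder could look like: a small convolution stack over a reference mel spectrogram followed by mean pooling into a single vector. The layer sizes, module names, and pooling choice are illustrative assumptions rather than the official implementation; a phoneme-level encoder would be analogous but pool the frames belonging to each phoneme instead of the whole utterance.

```python
# Sketch of an utterance-level acoustic condition encoder (assumed architecture,
# not the official AdaSpeech code): conv layers over a reference mel spectrogram,
# then mean pooling into one acoustic-condition vector per utterance.
import torch
import torch.nn as nn


class UtteranceLevelEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, hidden_dim: int = 256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels) reference spectrogram of the utterance
        h = self.convs(mel.transpose(1, 2))  # (batch, hidden_dim, time)
        return h.mean(dim=-1)                # (batch, hidden_dim) condition vector
```

In this sketch, the resulting vector would be broadcast over time and added to the phoneme encoder output before it passes through the variance adaptor and decoder.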
AdaSpeech's mel decoder consists of self-attention and feed-forward networks based on the Transformer architecture, and because it contains so many parameters, fine-tuning all of them for a new voice would be inefficient. So the authors apply conditional layer normalization to the self-attention and feed-forward network of each layer and reduce the number of parameters updated during fine-tuning by updating only the scale and bias used there for each user. The scale and bias are called "conditional" because, as shown in the figure above, they are computed from the speaker embedding through linear layers.
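Below is a minimal PyTorch sketch of conditional layer normalization as described above. The class and attribute names (ConditionalLayerNorm, to_scale, to_bias) are placeholders rather than the paper's code; the point is simply that the scale and bias are predicted from the speaker embedding by two small linear layers instead of being fixed learned parameters.

```python
# Sketch of conditional layer normalization: gamma and beta are not free
# parameters but are predicted from the speaker embedding via linear layers.
import torch
import torch.nn as nn


class ConditionalLayerNorm(nn.Module):
    def __init__(self, hidden_dim: int, speaker_dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Linear projections that turn the speaker embedding into a
        # per-channel scale and bias for this layer.
        self.to_scale = nn.Linear(speaker_dim, hidden_dim)
        self.to_bias = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, x: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim), speaker_emb: (batch, speaker_dim)
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        normed = (x - mean) / torch.sqrt(var + self.eps)
        scale = self.to_scale(speaker_emb).unsqueeze(1)  # (batch, 1, hidden_dim)
        bias = self.to_bias(speaker_emb).unsqueeze(1)
        return scale * normed + bias
```

Because only these two small projections (plus the speaker embedding) need to change per speaker, the number of speaker-specific parameters stays tiny compared to fine-tuning the whole decoder.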
The process of training AdaSpeech and inferring voices for new speakers can be summarized with the algorithm above. First, the source model is pre-trained with as much text-speech data as possible; then the parameters used for conditional layer normalization and the speaker embedding are updated on the new speaker's speech data through fine-tuning. At inference, the parameters computed from the speaker information and the parameters that were not fine-tuned are used together to generate the mel spectrogram.
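As a rough illustration of the fine-tuning stage, the sketch below freezes every parameter of a pre-trained source model except the conditional layer normalization projections and the speaker embedding, assuming the naming convention from the sketch above. The attribute names are hypothetical, not AdaSpeech's actual API.

```python
# Sketch of selecting the adaptation parameters for fine-tuning: everything is
# frozen except the conditional layer norm projections and the speaker embedding.
# The substrings matched here ("to_scale", "to_bias", "speaker_embedding") are
# assumed names, following the sketch above.
import torch


def select_adaptation_parameters(model: torch.nn.Module):
    adapt_params = []
    for name, param in model.named_parameters():
        is_cln = "to_scale" in name or "to_bias" in name  # conditional LN projections
        is_spk = "speaker_embedding" in name              # new speaker's embedding
        param.requires_grad = is_cln or is_spk
        if param.requires_grad:
            adapt_params.append(param)
    return adapt_params


# Usage sketch: fine-tune only the adaptation parameters on the new speaker's data.
# optimizer = torch.optim.Adam(select_adaptation_parameters(model), lr=2e-4)
```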
MelGAN was used as the vocoder. The naturalness of the synthesized custom voice was evaluated with MOS, and the similarity with a metric called SMOS. It can be seen that AdaSpeech synthesizes high-quality voices with a similar or smaller number of adaptation parameters than the baselines. And since the source TTS model was pre-trained on a dataset called LibriTTS, it naturally receives the highest scores when adapted to a new speaker from LibriTTS.
Using CMOS (comparison MOS), which measures relative quality, the authors conducted an ablation study on the techniques claimed as contributions in this paper. Since the CMOS scores of AdaSpeech with certain parts removed are lower than those of the full AdaSpeech in Table 2, we can conclude that all of the techniques contribute to quality improvement.
Figure 4(a) shows the learned speakers' utterance-level acoustic vectors visualized with t-SNE. Different sentences spoken by the same speaker fall into the same cluster, from which we can judge that the model has learned the unique characteristics of each speaker's speech. There are some exceptions, but these are usually short or emotional utterances, which are difficult to distinguish from other speakers' utterances.
Comparing CMOS scores, the voice quality is best when conditional layer normalization is used. Therefore, when performing layer normalization, it is better to modify the scale and bias to reflect the speaker's characteristics, and we can conclude that updating only these parameters has a positive effect on the model's adaptability.
Finally, the authors conducted an experiment to see how much of a new user's speech data is needed, to determine whether the model is practical. As can be seen in Figure 4(b), the quality of the synthesized voice improves rapidly up to 10 samples but shows no significant improvement beyond that, so it is fine to fine-tune AdaSpeech with only about 10 samples per speaker.
AdaSpeech is a TTS model that can adapt to new users while keeping the advantages of FastSpeech, which improved synthesis speed through parallel generation. Acoustic condition modeling improves the generalization of the model by capturing the characteristics of the voice, and if it were subdivided further, an AI that speaks even more like the user might be possible. In addition, I think a model that can provide custom voice TTS with only 10 samples has endless practical value, but even so, it is a practical drawback that the user's voice and the corresponding text must be provided together as fine-tuning data. In fact, among those who use an AI voice synthesis service, even users willing to record their voices will often find it bothersome to type the matching text as well. So, in the next post, we will introduce a modified version of AdaSpeech that allows custom voice synthesis without text-speech paired data.
(1) [FastSpeech 2 Paper] FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
(2) [AdaSpeech Paper] AdaSpeech: Adaptive Text to Speech for Custom Voice
(3) [AdaSpeech Demo] https://speechresearch.github.io/adaspeech/