Training a model

Introduction

To train a TTS model for the target speech variety, in this case Grunnegs, certain steps have to be taken. The first is to gain access to hardware with enough computing power to process the amount of data involved in a reasonable amount of time.

The University of Groningen has a high-performance computing (HPC) cluster called Peregrine, which is available to employees who need it for their research. An account has to be requested; the form to request one can be found at:

https://www.rug.nl/society-business/centre-for-information-technology/research/services/hpc/facilities/request-peregrine-account.

Figure 1. Screenshot of the online form to request a Peregrine account

Students are also allowed to request accounts, provided that they are part of a research project. More information on the Peregrine computing cluster can be found in its wiki:

https://wiki.hpc.rug.nl/peregrine/start.

Once you have an account, to log in, type: 

ssh username@peregrine.hpc.rug.nl

into the command line, replacing “username” with either your s-number or p-number. Once you are logged in, you should see your username to the left of the command prompt (see Figure 2).

Figure 2.  After successfully logging in you will see a similar screen.

Installing dependencies

One of the requirements for running Tacotron2 to train our TTS model is Python3. The Peregrine cluster already has Python3 installed, so this step is already taken care of. Should you run this on a cluster other than Peregrine, check that the correct version of Python is available.

The Python scripts in this repository require TensorFlow 1.4. If you are working on your own HPC cluster, you can install the required version of TensorFlow yourself; it is advisable to install it with GPU support. Otherwise, you can ask your cluster's administrator whether the required version is installed.
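If you want to verify this yourself, a quick check from the command line (assuming Python 3 and TensorFlow are already available on your path) is:

python3 --version
python3 -c "import tensorflow as tf; print(tf.__version__)"

If the second command fails or prints an unexpected version, contact your cluster's administrator before continuing.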

To install the requirements needed to run the python scripts, type the following into the command-line:

pip install -r requirements.txt

Figure 3. Installing required dependencies.
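On a shared cluster you will usually not have administrator rights, so the packages may have to be installed into your own environment instead. A minimal sketch, assuming pip and the venv module are available (the environment path ~/tacotron-env is just a placeholder):

# install into your personal site-packages
pip install --user -r requirements.txt

# or, alternatively, use a virtual environment
python3 -m venv ~/tacotron-env
source ~/tacotron-env/bin/activate
pip install -r requirements.txt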

The list of requirements includes the following dependencies:

falcon==1.2.0
inflect==0.2.5
audioread==2.1.5
librosa==0.5.1
matplotlib==2.0.2
numpy==1.14.0
scipy==1.0.0
tqdm==4.11.2
Unidecode==0.4.20
pyaudio==0.2.11
sounddevice==0.3.10
lws
keras

These are currently available on the Peregrine cluster: in order to run Tacotron on Peregrine and train our models, we asked the cluster's administrator to verify that they were installed. Even so, it is always advisable to email your cluster's administrator to check whether all dependencies in the requirements list are installed on the server.
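As a quick self-check before emailing, you can try importing the packages directly. This is only a sketch and assumes the import names match the package names in the list:

python3 -c "import falcon, inflect, audioread, librosa, matplotlib, numpy, scipy, tqdm, unidecode, pyaudio, sounddevice, lws, keras; print('all dependencies found')"

If any package is missing, Python will report exactly which import failed.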

Cloning the repository

Once you are logged in, the first step is to clone the repository from GitHub to the desired folder. To do this, type the following into the command line:

git clone https://github.com/Rayhane-mamah/Tacotron-2.git

This will download all the Python scripts and replicate the folder structure of the repository, ready to be used.
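To confirm that the clone succeeded, you can list the contents of the new folder. The exact file list may differ between versions of the repository, but you should at least see the scripts used in the rest of this guide:

cd Tacotron-2
ls
# expect, among others: preprocess.py  train.py  synthesize.py  requirements.txt  hparams.py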

Bear in mind that there are multiple repositories on GitHub offering different implementations of the Tacotron algorithm. The one we chose best suited the HPC cluster, as all required dependencies were compatible.

Download a dataset

To train our model we will use a technique called bootstrapping. Bootstrapping takes resources available for a well-resourced language, in this case English, and uses them to develop resources for an under-resourced language, in this case Grunnegs. For this reason, we will have two datasets: one in English and one in Grunnegs. For experimental purposes, we also compiled a third corpus of spoken Dutch.

The English dataset that we will be using is called LJ Speech. It can be downloaded from https://keithito.com/LJ-Speech-Dataset/. The dataset consists of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books, with a transcription available for each clip. In total it amounts to approximately 24 hours of spoken data.

The Grunnegs dataset that we used has been produced by this team. It comprises two thousand sentences recorded by a single speaker reading texts retrieved from the internet, mainly short fiction with some non-fiction excerpts. It is about 1.5 hours long.

The Dutch dataset, also produced by this team, comprises about two thousand sentences recorded by a single speaker reading texts retrieved from the internet, in this case Wikipedia articles. It is about 2.5 hours long.

It is possible to use other datasets. To do this, your data must match the format expected by the Python scripts. You can either format your dataset accordingly or, if your dataset is already formatted in a particular way, revise the code and make the required adjustments, should this be less work than reformatting your corpus. A sketch of the expected layout is given below.
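As a rough illustration of the expected format, this is the layout used by LJ Speech, which the scripts assume by default; a Grunnegs or Dutch corpus should mimic it. The folder and file names below are examples only:

LJSpeech-1.1/
  metadata.csv        # one line per clip: file id|raw transcription|normalized transcription
  wavs/
    LJ001-0001.wav
    LJ001-0002.wav
    ...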

Preprocessing

Before starting the training, the loaded text and audio files have to be preprocessed. This is done with the following command:

python preprocess.py

Preprocessing prepares the corpus so that it is ready for the training step.
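Once preprocessing finishes, you can check that the features were written out. In the version of the repository we used, the output goes to a training_data folder; the folder and file names below are an assumption and may differ in other versions:

ls training_data
# expected contents (approximate): audio/  linear/  mels/  train.txt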

Training

Because we are bootstrapping, the training process consists of at least two steps. First, we train a model on the English data; then we use this model as a checkpoint and continue training on the Grunnegs data to obtain a Grunnegs model.

Training from scratch

To start the training process, now that the corpus is in the corresponding folder, type into the command line:

python train.py --model='Tacotron' --tacotron_train_steps=500000

If you want to train both models (Tacotron and Wavenet): 

python train.py --model='Tacotron-2'

This implementation of Tacotron generates checkpoints every 5000 steps and stores them in the logs-Tacotron folder.
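You can keep an eye on the checkpoints while training runs. A sketch, assuming the default log folder; the checkpoint file names shown are an assumption and may differ between repository versions:

ls -R logs-Tacotron | head -n 20
# look for checkpoint files such as tacotron_model.ckpt-5000.*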

After training the English model

After the English model has been completed, the next step is to train the Groningen version. For this, we use the previously discussed bootstrapping method, following the steps below; a combined command-line sketch is given after Step 4.

Step 1 

Remove the English files present in the LJSpeech folder and load the Groningen version of the dataset into that same folder.

Step 2

Move the Groningen zip file into the folder and extract it.

Step 3

Preprocess the files (python preprocess.py) 

Step 4 

Continue training the model using 

python train.py --model='Tacotron' --tacotron_train_steps=750000
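For reference, Steps 1 to 4 can be carried out from the command line roughly as follows. This is only a sketch: the dataset folder (LJSpeech-1.1) and the archive name (grunnegs.zip) are placeholders and should be replaced by the names used on your system.

rm LJSpeech-1.1/metadata.csv LJSpeech-1.1/wavs/*.wav        # Step 1: remove the English data
mv grunnegs.zip LJSpeech-1.1/                               # Step 2: move the Groningen archive in...
cd LJSpeech-1.1 && unzip grunnegs.zip && cd ..              # ...and extract it
python preprocess.py                                        # Step 3: preprocess the new files
python train.py --model='Tacotron' --tacotron_train_steps=750000   # Step 4: continue training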

Training using a pre-trained model

To train a model starting from a previous checkpoint, type into the command line:

python train.py --model='Tacotron' --tacotron_train_steps=700000

Advice

The Peregrine cluster sets limits on the amount of time jobs are allowed to run. Therefore, when submitting a job to the cluster, it is important to estimate how long it will need, so that it is not cancelled before it yields results, i.e. usable checkpoints. If you are working on your own HPC cluster, or you have no such restrictions, you can of course ignore this remark.
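Peregrine uses the SLURM scheduler, so training jobs are normally submitted with a job script that states how long they may run. The script below is only a sketch: the partition name, requested time, memory, and module names are assumptions that should be checked against the Peregrine wiki and the cluster's module list.

#!/bin/bash
#SBATCH --job-name=tacotron-train
#SBATCH --time=23:59:00          # keep within the cluster's wall-time limit
#SBATCH --partition=gpu          # partition name is an assumption
#SBATCH --gres=gpu:1
#SBATCH --mem=32G

# module names are placeholders; run 'module avail' for the exact names
module load Python TensorFlow

python train.py --model='Tacotron' --tacotron_train_steps=500000

Submit the script with sbatch (e.g. sbatch train_tacotron.sh) and plan the number of training steps so that a checkpoint is written before the time limit is reached.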

Synthesizing text

After having trained your model, you are ready to produce synthesized utterances. To do this, you first need to write your text. Be sure that it follows the spelling conventions of the texts used to train the model, because the model can only accurately synthesize text written in those same conventions.

To input the text you intend to synthesize, open a terminal window and log in to Peregrine. After you have successfully logged in, go to the model's home folder (where the model is saved, easily recognized as the folder containing the file synthesize.py). Here you will find a file called ‘sentences.txt’. To edit this file you have to use a text editor that runs on Linux. Several are available, but the most basic tool is ‘nano’. So to edit the file, you just have to type into the command line:

nano sentences.txt 

Figure 4. When editing sentences.txt for the first time the sample text will appear

This will open the .txt file and enable you to edit it. Once you have entered the desired text, save it by pressing ‘ctrl+o’ and exit with ‘ctrl+x’. You can also press ‘ctrl+x’ directly; if there are unsaved changes, nano will ask whether you want to save them first. Saving from time to time with ‘ctrl+o’ is nevertheless advisable.
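If you prefer not to use nano, the file can also be (over)written directly from the command line. The sentences below are placeholders, not actual Grunnegs text, and the script is assumed to read one sentence per line, as in the sample file:

cat > sentences.txt << 'EOF'
First sentence to be synthesized, spelled as in the training data.
Second sentence to be synthesized.
EOF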

Another step that has to be taken before the text is synthesized is to load all required dependencies. In the model folder there is a file that lists all dependencies that need to be loaded before you synthesize the text; without them, the program will inevitably produce an error message at some point. This list can be accessed by opening the file, for example with nano, as described above.
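On Peregrine these dependencies are typically made available through environment modules. A sketch of the relevant commands follows; the actual module names must be taken from the list file mentioned above, since the names used here ('Python', 'TensorFlow') are only placeholders:

module avail TensorFlow     # see which versions the cluster offers
module load Python TensorFlow
module list                 # verify what is currently loaded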

Once this is done you are ready to type: 

python synthesize.py 'tacotron2' 'sentences.txt'

into the command line. This will first load the model and, after a while, begin to synthesize the text. A progress bar shows, as a percentage, how far along the process is. The process renders graphic representations of the Mel spectrograms and .wav files for the sentences included in the sentences.txt file, using both a linear model and a Mel spectrogram model. Barring a few exceptions, the Mel-based .wav files are easier to understand and sound better.
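To listen to the results, copy the generated audio to your own machine. The output folder name (tacotron_output) and the remote path are assumptions and may differ in your version of the repository; adjust them to where you cloned it, and run the scp command from your local computer, not from Peregrine.

ls tacotron_output          # on Peregrine: check where the .wav files and plots ended up
scp -r username@peregrine.hpc.rug.nl:~/Tacotron-2/tacotron_output .   # on your own machine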