tts-arabic-pytorch

[Samples 1] [Samples 2] [ONNX models] [Flutter app]

TTS models (Tacotron2, FastPitch), trained on Nawar Halabi's Arabic Speech Corpus, including the HiFi-GAN vocoder for direct TTS inference.

Papers:

Tacotron2 | Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions (arXiv)

FastPitch | FastPitch: Parallel Text-to-speech with Pitch Prediction (arXiv)

HiFi-GAN | HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (arXiv)

Audio Samples

You can listen to some audio samples here.

Multispeaker model (in progress)

Multispeaker weights are available for the FastPitch model. Currently, another male voice and two female voices have been added. Audio samples can be found here. Download weights here. An ONNX version of this model is also available.

The multispeaker dataset was created by synthesizing data with Coqui's XTTS-v2 model and a mix of voices from the Tunisian_MSA dataset.

Quick Setup

The models were trained with the MSE loss as described in the papers. I also trained the models using an additional adversarial loss (adv). The difference is small, but I think the adv version often sounds a bit clearer. You can compare them yourself.

Running python download_files.py downloads all pretrained weights. Alternatively:

Download the pretrained weights for the Tacotron2 model (mse | adv).

Download the pretrained weights for the FastPitch model (mse | adv).

Download the HiFi-GAN vocoder weights (link). Either put them into pretrained/hifigan-asc-v1 or edit the following lines in configs/basic.yaml.

# vocoder
vocoder_state_path: pretrained/hifigan-asc-v1/hifigan-asc.pth
vocoder_config_path: pretrained/hifigan-asc-v1/config.json
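Before running inference, it can help to confirm that both vocoder entries point at real files. The following is a minimal sketch, not part of the repo: `load_yaml` and `missing_vocoder_files` are hypothetical helpers, and the code assumes the two keys sit at the top level of configs/basic.yaml, as in the excerpt above.

```python
from pathlib import Path

# Keys taken from the config excerpt above.
VOCODER_KEYS = ("vocoder_state_path", "vocoder_config_path")


def load_yaml(path):
    """Read a YAML file into a dict (requires the pyyaml package)."""
    import yaml  # imported lazily so the check below also works without pyyaml

    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)


def missing_vocoder_files(config, root="."):
    """Return 'key: value' strings for vocoder paths that are unset or not files under `root`."""
    missing = []
    for key in VOCODER_KEYS:
        value = config.get(key)
        if value is None or not (Path(root) / value).is_file():
            missing.append(f"{key}: {value}")
    return missing


# Usage (run from the repo root):
# config = load_yaml("configs/basic.yaml")
# for entry in missing_vocoder_files(config):
#     print("missing", entry)
```

An empty result means both files were found at the configured locations.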

This repo includes the diacritization models Shakkala and Shakkelha.

The weights can be downloaded here. A separate repo and package are also available.

-> Alternatively, download all models and put the content of the zip file into the pretrained folder.

Required packages:

torch torchaudio pyyaml

~ for training: librosa matplotlib tensorboard

~ for the demo app: fastapi "uvicorn[standard]"

Using the models

The Tacotron2 and FastPitch classes from models.tacotron2 and models.fastpitch are wrappers that simplify text-to-mel inference. The Tacotron2Wave and FastPitch2Wave models include the HiFi-GAN vocoder for direct text-to-speech inference.

Inference options

text = "اَلسَّلامُ عَلَيكُم يَا صَدِيقِي."

wave = model.tts(
    text_input = text, # input text
    speed = 1, # speaking speed
    denoise = 0.005, # HiFi-GAN denoiser strength
    speaker_id = 0, # speaker id
    batch_size = 2, # batch size for batched inference
    vowelizer = None, # vowelizer model
    pitch_mul = 1, # pitch multiplier (for FastPitch)
    pitch_add = 0, # pitch offset (for FastPitch)
    return_mel = False # return mel spectrogram?
)

Inferring the Mel spectrogram

from models.tacotron2 import Tacotron2
model = Tacotron2('pretrained/tacotron2_ar_adv.pth')
model = model.cuda()
mel_spec = model.ttmel("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي.")

from models.fastpitch import FastPitch
model = FastPitch('pretrained/fastpitch_ar_adv.pth')
model = model.cuda()
mel_spec = model.ttmel("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي.")

End-to-end Text-to-Speech

from models.tacotron2 import Tacotron2Wave
model = Tacotron2Wave('pretrained/tacotron2_ar_adv.pth')
model = model.cuda()
wave = model.tts("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي.")

wave_list = model.tts(["صِفر" ,"واحِد" ,"إِثنان", "ثَلاثَة" ,"أَربَعَة" ,"خَمسَة", "سِتَّة" ,"سَبعَة" ,"ثَمانِيَة", "تِسعَة" ,"عَشَرَة"])

from models.fastpitch import FastPitch2Wave
model = FastPitch2Wave('pretrained/fastpitch_ar_adv.pth')
model = model.cuda()
wave = model.tts("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي.")

wave_list = model.tts(["صِفر" ,"واحِد" ,"إِثنان", "ثَلاثَة" ,"أَربَعَة" ,"خَمسَة", "سِتَّة" ,"سَبعَة" ,"ثَمانِيَة", "تِسعَة" ,"عَشَرَة"])
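To listen to a result, the waveform has to be written to an audio file. The sketch below uses only the Python standard library and is an illustration rather than the repo's own saving routine: `save_wav` is a hypothetical helper, and it assumes the waveform can be converted to a flat sequence of floats in [-1, 1]. The sample rate is deliberately left as a required argument; it should be taken from the vocoder config.

```python
import struct
import wave


def save_wav(samples, path, sample_rate):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))  # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))  # little-endian int16
    with wave.open(path, "wb") as f:
        f.setnchannels(1)  # mono
        f.setsampwidth(2)  # 2 bytes per sample = 16 bit
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))


# Usage (assumptions: `wave_out` is a 1-D tensor returned by model.tts, and
# 22050 is only a placeholder for the rate given in the vocoder config):
# save_wav(wave_out.cpu().tolist(), "hello.wav", sample_rate=22050)
```

If torchaudio is installed anyway, `torchaudio.save` is the more direct route; the standard-library version just avoids any assumption about tensor shapes.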

By default, Arabic letters are converted using the Buckwalter transliteration, which can also be used directly.

wave = model.tts(">als~alAmu Ealaykum yA Sadiyqiy.")
wave_list = model.tts(["Sifr", "wAHid", "<i^nAn", "^alA^ap", ">arbaEap", "xamsap", "sit~ap", "sabEap", "^amAniyap", "tisEap", "Ea$arap"])
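To make the correspondence between the two notations concrete, here is a small character-level sketch. It is not the repo's converter: `BUCKWALTER_SKETCH` and `to_buckwalter_sketch` are hypothetical names, the table covers only the characters that occur in the number examples above, and the repo's own conversion may apply further rules that this table does not reproduce. Note that "^" for theh follows the examples above, whereas standard Buckwalter uses "v" for that letter.

```python
# Partial Arabic -> Buckwalter table (illustrative, inferred from the examples above).
BUCKWALTER_SKETCH = {
    "\u0627": "A",  # alef
    "\u0623": ">",  # alef with hamza above
    "\u0625": "<",  # alef with hamza below
    "\u0628": "b",  # beh
    "\u0629": "p",  # teh marbuta
    "\u062A": "t",  # teh
    "\u062B": "^",  # theh (standard Buckwalter: "v")
    "\u062D": "H",  # hah
    "\u062E": "x",  # khah
    "\u062F": "d",  # dal
    "\u0631": "r",  # reh
    "\u0633": "s",  # seen
    "\u0634": "$",  # sheen
    "\u0635": "S",  # sad
    "\u0639": "E",  # ain
    "\u0641": "f",  # feh
    "\u0644": "l",  # lam
    "\u0645": "m",  # meem
    "\u0646": "n",  # noon
    "\u0648": "w",  # waw
    "\u064A": "y",  # yeh
    "\u064E": "a",  # fatha
    "\u064F": "u",  # damma
    "\u0650": "i",  # kasra
    "\u0651": "~",  # shadda
}


def to_buckwalter_sketch(text):
    """Map each known character to its Buckwalter symbol; pass others through unchanged."""
    return "".join(BUCKWALTER_SKETCH.get(ch, ch) for ch in text)


# "wAHid" spelled with explicit code points: waw, alef, hah, kasra, dal
print(to_buckwalter_sketch("\u0648\u0627\u062D\u0650\u062F"))  # prints wAHid
```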

Unvocalized text

text_unvoc = "اللغة العربية هي أكثر اللغات السامية تحدثا، وإحدى أكثر اللغات انتشارا في العالم"
wave_shakkala = model.tts(text_unvoc, vowelizer='shakkala')
wave_shakkelha = model.tts(text_unvoc, vowelizer='shakkelha')

Inference from text file

python inference.py
# default parameters:
python inference.py --list data/infer_text.txt --out_dir samples/results --model fastpitch --checkpoint pretrained/fastpitch_ar_adv.pth --batch_size 2 --denoise 0
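When the script is called from another Python program, the command line can be assembled from the documented flags. The sketch below is a convenience wrapper, not part of the repo: `build_inference_cmd` is a hypothetical helper whose defaults simply mirror the default parameters listed above.

```python
import sys


def build_inference_cmd(list_path="data/infer_text.txt",
                        out_dir="samples/results",
                        model="fastpitch",
                        checkpoint="pretrained/fastpitch_ar_adv.pth",
                        batch_size=2,
                        denoise=0):
    """Return the argument list for running inference.py with the given options."""
    return [
        sys.executable, "inference.py",
        "--list", list_path,
        "--out_dir", out_dir,
        "--model", model,
        "--checkpoint", checkpoint,
        "--batch_size", str(batch_size),
        "--denoise", str(denoise),
    ]


# Usage (run from the repo root). That "tacotron2" is an accepted value for
# --model is an assumption; only "fastpitch" appears in the defaults above.
# import subprocess
# cmd = build_inference_cmd(model="tacotron2",
#                           checkpoint="pretrained/tacotron2_ar_adv.pth")
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.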

Testing the model

To test the model, run:

python test.py
# default parameters:
python test.py --model fastpitch --checkpoint pretrained/fastpitch_ar_adv.pth --out_dir samples/test

Processing details

This repo uses Nawar Halabi's Arabic-Phonetiser but simplifies the result such that different contexts are ignored (see text/symbols.py). Further, a doubled consonant is represented as consonant + doubling-token.
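The doubling representation can be illustrated with a short sketch. It is only an approximation of the idea: `mark_doubled` and the token name `_dbl` are hypothetical (the real symbols are defined in text/symbols.py), and the code assumes a phoneme sequence in which a doubled consonant shows up as two identical adjacent symbols.

```python
DOUBLING_TOKEN = "_dbl"  # hypothetical name; see text/symbols.py for the real symbol


def mark_doubled(phonemes):
    """Replace the second of two identical adjacent symbols with a doubling token."""
    out = []
    for p in phonemes:
        if out and out[-1] == p:
            out.append(DOUBLING_TOKEN)  # consonant + doubling token
        else:
            out.append(p)
    return out


print(mark_doubled(["s", "i", "t", "t", "a", "p"]))
# prints ['s', 'i', 't', '_dbl', 'a', 'p']
```

One motivation for such a scheme is a smaller symbol set: the model needs a single extra token instead of a separate doubled variant of every consonant.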

The Tacotron2 model can sometimes struggle to pronounce the last phoneme of a sentence when it ends in an unvocalized consonant. The pronunciation is more reliable if one appends a word-separator token at the end and cuts it off using the alignment weights (details in models.networks). This option is implemented as a default postprocessing step that can be disabled by setting postprocess_mel=False.
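The cut-off step can be pictured as follows. This is a simplified sketch and not the code in models.networks: `trim_at_separator` is a hypothetical helper, the alignments are assumed to be a list of per-frame attention weight rows (decoder frames by input tokens), the appended separator is assumed to be the last input token, and the hard argmax rule is a simplification chosen for the illustration.

```python
def trim_at_separator(mel_frames, alignments):
    """Drop all decoder frames from the first one that attends mostly to the last input token."""
    if not alignments:
        return mel_frames
    last_token = len(alignments[0]) - 1  # index of the appended separator
    for t, weights in enumerate(alignments):
        # input position with the largest attention weight for frame t
        focus = max(range(len(weights)), key=weights.__getitem__)
        if focus == last_token:
            return mel_frames[:t]
    return mel_frames  # separator never dominated: keep everything


frames = ["f0", "f1", "f2", "f3"]
attn = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],  # attention moves to the separator here
    [0.0, 0.1, 0.9],
]
print(trim_at_separator(frames, attn))  # prints ['f0', 'f1']
```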

Training the model

Before training, the audio files must be resampled. The model was trained after preprocessing the files using scripts/preprocess_audio.py.
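A quick way to see which files still need resampling is to read the sample rate from each WAV header. The sketch below uses only the standard library and does not replace scripts/preprocess_audio.py: `files_needing_resample` is a hypothetical helper, it only handles PCM WAV files, and the target rate is left as an argument because it should come from the training config.

```python
import wave
from pathlib import Path


def files_needing_resample(folder, target_rate):
    """Return the names of .wav files in `folder` whose sample rate differs from `target_rate`."""
    names = []
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as f:
            if f.getframerate() != target_rate:
                names.append(path.name)
    return names


# Usage (the folder name and the rate are placeholders, not values from the repo):
# print(files_needing_resample("path/to/wavs", target_rate=22050))
```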

To train the model with the options specified in the config file, run:

python train.py
# default parameters:
python train.py --config configs/nawar.yaml

Web app

The web app uses the FastAPI library. To run the app, you need the following packages:

fastapi: for the backend API | uvicorn: for serving the app

Install with: pip install fastapi "uvicorn[standard]"

Run with: python app.py

Acknowledgements

I referred to NVIDIA's Tacotron2 implementation for details on model training.

The FastPitch files stem from NVIDIA's DeepLearningExamples.

nipponjo/tts-arabic-pytorch
