Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2407.04291 (eess)

[Submitted on 5 Jul 2024]

Title: We Need Variations in Speech Synthesis: Sub-center Modelling for Speaker Embeddings

Authors: Ismail Rasim Ulgen, Carlos Busso, John H. L. Hansen, Berrak Sisman

Abstract: In speech synthesis, modeling the rich emotions and prosodic variations present in the human voice is crucial for synthesizing natural speech. Although speaker embeddings are widely used as conditioning inputs in personalized speech synthesis, they are designed to lose variation in order to optimize speaker recognition accuracy. They are therefore suboptimal for speech synthesis, which must model the rich variations in the output speech distribution. In this work, we propose a novel speaker embedding network that uses multiple class centers in speaker classification training, rather than the single class center used by traditional embeddings. The proposed approach introduces variation into the speaker embedding while retaining speaker recognition performance, since the model does not have to map all of a speaker's utterances to a single class center. We apply the proposed embedding to a voice conversion task and show that it provides better naturalness and prosody in synthesized speech.
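
The idea of multiple class centers per speaker can be illustrated with a small sketch. The abstract does not specify how sub-centers are pooled or which loss is used, so everything below is an assumption for illustration: cosine similarity to each sub-center, max pooling over a speaker's sub-centers (as in sub-center ArcFace-style classifiers), and a scaled softmax cross-entropy. The names `subcenter_logits`, `cross_entropy`, and `scale` are hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def subcenter_logits(embedding, centers):
    """centers[s][k] is the k-th sub-center of speaker s (assumed layout).

    Returns one logit per speaker (the best cosine similarity over that
    speaker's sub-centers) and the index of the winning sub-center.
    """
    logits, winners = [], []
    for speaker_centers in centers:
        sims = [cosine(embedding, c) for c in speaker_centers]
        best = max(range(len(sims)), key=sims.__getitem__)
        logits.append(sims[best])
        winners.append(best)
    return logits, winners

def cross_entropy(logits, label, scale=30.0):
    """Scaled softmax cross-entropy for one example (scale is an assumption)."""
    z = [scale * x for x in logits]
    m = max(z)
    log_norm = m + math.log(sum(math.exp(x - m) for x in z))
    return log_norm - z[label]

# Toy setup: 2 speakers, 2 sub-centers each, 2-D embeddings.
centers = [
    [[1.0, 0.0], [0.0, 1.0]],    # speaker 0
    [[-1.0, 0.0], [0.0, -1.0]],  # speaker 1
]
utt_a = [0.9, 0.1]  # speaker 0, near sub-center 0
utt_b = [0.1, 0.9]  # speaker 0, near sub-center 1

logits_a, winners_a = subcenter_logits(utt_a, centers)
logits_b, winners_b = subcenter_logits(utt_b, centers)
loss_a = cross_entropy(logits_a, label=0)
```

In this toy case both utterances are classified as speaker 0 while being attracted to different sub-centers, so the two embeddings are not forced toward one shared point. That is the kind of within-speaker variation the abstract argues is useful for synthesis.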

Comments:	Submitted to IEEE Signal Processing Letters
Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Cite as:	arXiv:2407.04291 [eess.AS]
	(or arXiv:2407.04291v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2407.04291

Submission history

From: Ismail Rasim Ulgen
[v1] Fri, 5 Jul 2024 06:54:24 UTC (1,245 KB)
