Text2video: Text-driven talking-head video synthesis with personalized phoneme-pose dictionary

S Zhang, J Yuan, M Liao… - ICASSP 2022 - 2022 IEEE International Conference on Acoustics…, 2022 - ieeexplore.ieee.org
With the advance of deep learning technology, automatic video generation from audio or text has become an emerging and promising research topic. In this paper, we present a novel approach to synthesizing video from text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video from interpolated phoneme poses. Compared to audio-driven video generation algorithms, our approach has several advantages: 1) it needs only about one minute of training data, significantly less than audio-driven approaches; 2) it is more flexible and is not vulnerable to speaker variation; 3) it reduces preprocessing and training time from several days for audio-based methods to about 4 hours, roughly 10 times faster. We perform extensive experiments comparing the proposed method with state-of-the-art talking-face generation methods on a benchmark dataset and on datasets of our own. The results demonstrate the effectiveness and superiority of our approach.
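The first two stages the abstract describes (looking up key poses in a phoneme-pose dictionary, then interpolating between them to produce per-frame poses for the GAN) might be sketched as follows. This is a minimal illustration, not the paper's implementation: the phoneme labels, pose representation (a small array of 2-D landmark coordinates), and frame counts are all assumed for the example, and the GAN rendering stage is omitted.

```python
import numpy as np

# Hypothetical phoneme-pose dictionary: each phoneme maps to a key pose,
# represented here as a tiny (num_landmarks x 2) array of 2-D coordinates.
# A real dictionary would be built from the ~1 minute of training video.
phoneme_pose_dict = {
    "AA": np.array([[0.0, 0.0], [1.0, 0.0]]),
    "IY": np.array([[0.0, 1.0], [1.0, 1.0]]),
}

def interpolate_poses(start_pose, end_pose, n_frames):
    """Linearly interpolate between two key poses, yielding n_frames poses."""
    ts = np.linspace(0.0, 1.0, n_frames)
    return [(1.0 - t) * start_pose + t * end_pose for t in ts]

def poses_for_phoneme_sequence(phonemes, frames_per_transition=5):
    """Look up each phoneme's key pose and interpolate between consecutive ones."""
    poses = []
    for a, b in zip(phonemes, phonemes[1:]):
        poses.extend(
            interpolate_poses(
                phoneme_pose_dict[a], phoneme_pose_dict[b], frames_per_transition
            )
        )
    return poses

frames = poses_for_phoneme_sequence(["AA", "IY"])
print(len(frames))  # 5 interpolated poses for the single AA -> IY transition
```

In the full system, each interpolated pose frame would then be fed to the trained GAN, which renders it into a photorealistic video frame of the target speaker.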