Abstract
Speech-driven gesture synthesis is a field of growing interest in virtual human creation. A central challenge is the inherently intricate one-to-many mapping between speech and gestures. Previous studies have explored generative models and achieved significant progress; nevertheless, most synthesized gestures still appear noticeably less natural than human motion. This paper presents DiffMotion, a novel speech-driven gesture synthesis architecture based on diffusion models. The model comprises an autoregressive temporal encoder and a denoising diffusion probabilistic module. The encoder extracts temporal context from the speech input and historical gestures. The diffusion module learns a parameterized Markov chain that gradually converts a simple distribution into a complex one, generating gestures conditioned on the accompanying speech. Objective and subjective evaluations against baselines confirm that our approach produces natural and diverse gesticulation and demonstrate the benefits of diffusion-based models for speech-driven gesture synthesis. Project page: https://github.com/zf223669/DiffMotion.
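The parameterized Markov chain mentioned in the abstract refers to the standard denoising diffusion formulation: a forward process gradually corrupts data toward an isotropic Gaussian, and a learned reverse process denoises step by step. The sketch below illustrates only the closed-form forward (noising) process on a pose vector; the schedule, step count, and 45-dimensional pose are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

# Forward (noising) process of a denoising diffusion probabilistic model:
# q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I).
# Hypothetical linear schedule; DiffMotion's real hyperparameters may differ.
T = 1000                               # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)     # noise variance per step
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)        # abar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form, returning (x_t, eps)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(45)           # e.g. one pose frame of 45 joint angles
x_early, _ = q_sample(x0, 10, rng)     # still close to the original pose
x_late, _ = q_sample(x0, T - 1, rng)   # nearly pure Gaussian noise
```

Training then regresses the injected noise `eps` from `x_t` (conditioned here on the speech encoding), and sampling runs the learned chain in reverse from Gaussian noise to a pose sequence.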
Acknowledgements
This work was supported by the Key Program and Development Projects of Zhejiang Province of China (No. 2021C03137), the Public Welfare Technology Application Research Project of Zhejiang Province, China (No. LGF22F020008), and the Key Lab of Film and TV Media Technology of Zhejiang Province (No. 2020E10015).
Electronic supplementary material
Supplementary material 2 (mp4 65601 KB)
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhang, F., Ji, N., Gao, F., Li, Y. (2023). DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model. In: Dang-Nguyen, DT., et al. MultiMedia Modeling. MMM 2023. Lecture Notes in Computer Science, vol 13833. Springer, Cham. https://doi.org/10.1007/978-3-031-27077-2_18
DOI: https://doi.org/10.1007/978-3-031-27077-2_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-27076-5
Online ISBN: 978-3-031-27077-2
eBook Packages: Computer Science (R0)