
DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model

  • Conference paper
  • Part of the proceedings: MultiMedia Modeling (MMM 2023)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13833)

Abstract

Speech-driven gesture synthesis is a field of growing interest in virtual human creation. A critical challenge, however, is the inherently intricate one-to-many mapping between speech and gestures. Previous studies have explored generative models and achieved significant progress, yet most synthesized gestures still appear far less natural than human motion. This paper presents DiffMotion, a novel speech-driven gesture synthesis architecture based on diffusion models. The model comprises an autoregressive temporal encoder and a denoising diffusion probabilistic module. The encoder extracts the temporal context of the speech input and historical gestures. The diffusion module learns a parameterized Markov chain that gradually converts a simple distribution into a complex one and generates gestures conditioned on the accompanying speech. Objective and subjective evaluations confirm that, compared with baselines, our approach produces natural and diverse gesticulation, and they demonstrate the benefits of diffusion-based models for speech-driven gesture synthesis. Project page: https://github.com/zf223669/DiffMotion.
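To make the two-part design concrete, below is a minimal, illustrative sketch in PyTorch: an LSTM encoder summarizes speech features and previously generated gestures into a context vector, and a small denoising network drives standard DDPM ancestral sampling (in the style of Ho et al., 2020) for the next gesture frame. All module names, dimensions, and the noise schedule here are assumptions for illustration, not the authors' implementation.

```python
# Sketch of an autoregressive temporal encoder + denoising diffusion module
# for speech-conditioned gesture frames. Shapes and hyperparameters are
# illustrative assumptions, not taken from the DiffMotion codebase.
import torch
import torch.nn as nn


class TemporalEncoder(nn.Module):
    """LSTM that encodes speech features and past gestures into a context vector."""

    def __init__(self, speech_dim: int, gesture_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(speech_dim + gesture_dim, hidden_dim, batch_first=True)

    def forward(self, speech: torch.Tensor, prev_gestures: torch.Tensor) -> torch.Tensor:
        # speech: (B, T, speech_dim); prev_gestures: (B, T, gesture_dim)
        out, _ = self.rnn(torch.cat([speech, prev_gestures], dim=-1))
        return out[:, -1]  # context for the next frame: (B, hidden_dim)


class DenoiseNet(nn.Module):
    """Predicts the noise added to a gesture frame, given the step and context."""

    def __init__(self, gesture_dim: int, hidden_dim: int = 256, num_steps: int = 100):
        super().__init__()
        self.step_emb = nn.Embedding(num_steps, hidden_dim)
        self.net = nn.Sequential(
            nn.Linear(gesture_dim + 2 * hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, gesture_dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, self.step_emb(t), context], dim=-1))


@torch.no_grad()
def sample_frame(denoiser: DenoiseNet, context: torch.Tensor,
                 gesture_dim: int, num_steps: int = 100) -> torch.Tensor:
    """Standard DDPM ancestral sampling for one gesture frame."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(context.size(0), gesture_dim)  # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        t_batch = torch.full((context.size(0),), t, dtype=torch.long)
        eps = denoiser(x, t_batch, context)
        # Posterior mean of the reverse Markov chain step.
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x
```

At synthesis time such a model would run autoregressively: encode the speech and the gestures generated so far, sample the next frame from noise, append it to the history, and repeat for each frame of the sequence.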



Acknowledgements

This work was supported by the Key Research and Development Program of Zhejiang Province, China (No. 2021C03137), the Public Welfare Technology Application Research Project of Zhejiang Province, China (No. LGF22F020008), and the Key Lab of Film and TV Media Technology of Zhejiang Province (No. 2020E10015).

Author information


Correspondence to Naye Ji.


Electronic supplementary material

Below are the links to the electronic supplementary material.

Supplementary material 1 (PDF 5612 KB)

Supplementary material 2 (MP4 65601 KB)


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Zhang, F., Ji, N., Gao, F., Li, Y. (2023). DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model. In: Dang-Nguyen, DT., et al. MultiMedia Modeling. MMM 2023. Lecture Notes in Computer Science, vol 13833. Springer, Cham. https://doi.org/10.1007/978-3-031-27077-2_18


  • DOI: https://doi.org/10.1007/978-3-031-27077-2_18


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-27076-5

  • Online ISBN: 978-3-031-27077-2

  • eBook Packages: Computer Science (R0)
