Showing 1–8 of 8 results for author: Ekstedt, E

  1. arXiv:2403.06487  [pdf, other]

    cs.CL cs.SD eess.AS

    Multilingual Turn-taking Prediction Using Voice Activity Projection

    Authors: Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze

    Abstract: This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, on multilingual data, encompassing English, Mandarin, and Japanese. The VAP model continuously predicts the upcoming voice activities of participants in dyadic dialogue, leveraging a cross-attention Transformer to capture the dynamic interplay between participants. The re…

    Submitted 14 March, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: This paper has been accepted for presentation at The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) and represents the author's version of the work

  2. arXiv:2401.04868  [pdf, other]

    cs.CL cs.HC cs.SD eess.AS

    Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection

    Authors: Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze

    Abstract: A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input contex…

    Submitted 9 January, 2024; originally announced January 2024.

    Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024) and represents the author's version of the work

  3. arXiv:2305.17971  [pdf, other]

    eess.AS cs.SD

    Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis

    Authors: Erik Ekstedt, Siyang Wang, Éva Székely, Joakim Gustafson, Gabriel Skantze

    Abstract: Turn-taking is a fundamental aspect of human communication where speakers convey their intention to either hold, or yield, their turn through prosodic cues. Using the recently proposed Voice Activity Projection model, we propose an automatic evaluation approach to measure these aspects for conversational speech synthesis. We investigate the ability of three commercial, and two open-source, Text-To…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: Accepted at INTERSPEECH 2023, 5 pages, 2 figures, 4 tables

  4. arXiv:2305.02101  [pdf, other]

    cs.CL

    What makes a good pause? Investigating the turn-holding effects of fillers

    Authors: Bing'er Jiang, Erik Ekstedt, Gabriel Skantze

    Abstract: Filled pauses (or fillers), such as "uh" and "um", are frequent in spontaneous speech and can serve as a turn-holding cue for the listener, indicating that the current speaker is not done yet. In this paper, we use the recently proposed Voice Activity Projection (VAP) model, which is a deep learning model trained to predict the dynamics of conversation, to analyse the effects of filled pauses on t…

    Submitted 3 May, 2023; originally announced May 2023.

    Comments: Accepted to ICPhS 2023; 5 pages, 4 figures

  5. arXiv:2305.02036  [pdf, other]

    cs.CL cs.LG

    Response-conditioned Turn-taking Prediction

    Authors: Bing'er Jiang, Erik Ekstedt, Gabriel Skantze

    Abstract: Previous approaches to turn-taking and response generation in conversational systems have treated it as a two-stage process: First, the end of a turn is detected (based on conversation history), then the system generates an appropriate response. Humans, however, do not take the turn just because it is likely, but also consider whether what they want to say fits the position. In this paper, we pres…

    Submitted 3 May, 2023; originally announced May 2023.

    Comments: Accepted by Findings of ACL 2023; 6 pages, 4 figures

  6. arXiv:2209.05161  [pdf, other]

    eess.AS

    How Much Does Prosody Help Turn-taking? Investigations using Voice Activity Projection Models

    Authors: Erik Ekstedt, Gabriel Skantze

    Abstract: Turn-taking is a fundamental aspect of human communication and can be described as the ability to take turns, project upcoming turn shifts, and supply backchannels at appropriate locations throughout a conversation. In this work, we investigate the role of prosody in turn-taking using the recently proposed Voice Activity Projection model, which incrementally models the upcoming speech activity of…

    Submitted 12 September, 2022; originally announced September 2022.

    Comments: SIGDIAL 2022 Best Paper Award Winner

  7. arXiv:2205.09812  [pdf, other]

    eess.AS cs.SD

    Voice Activity Projection: Self-supervised Learning of Turn-taking Events

    Authors: Erik Ekstedt, Gabriel Skantze

    Abstract: The modeling of turn-taking in dialog can be viewed as the modeling of the dynamics of voice activity of the interlocutors. We extend prior work and define the predictive task of Voice Activity Projection, a general, self-supervised objective, as a way to train turn-taking models without the need of labeled data. We highlight a theoretical weakness with prior approaches, arguing for the need of mo…

    Submitted 19 May, 2022; originally announced May 2022.

    Comments: Submitted to INTERSPEECH 2022, 5 pages, 4 figures

  8. TurnGPT: a Transformer-based Language Model for Predicting Turn-taking in Spoken Dialog

    Authors: Erik Ekstedt, Gabriel Skantze

    Abstract: Syntactic and pragmatic completeness is known to be important for turn-taking prediction, but so far machine learning models of turn-taking have used such linguistic information in a limited way. In this paper, we introduce TurnGPT, a transformer-based language model for predicting turn-shifts in spoken dialog. The model has been trained and evaluated on a variety of written and spoken dialog data…

    Submitted 21 October, 2020; originally announced October 2020.

    Comments: Accepted to Findings of ACL: EMNLP 2020

    ACM Class: I.2.7