Showing 1–50 of 74 results for author: Kawahara, T

  1. arXiv:2410.15929  [pdf, other]

    cs.CL cs.HC cs.SD eess.AS

    Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection

    Authors: Koji Inoue, Divesh Lala, Gabriel Skantze, Tatsuya Kawahara

    Abstract: In human conversations, short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue. These backchannels signal attentiveness and understanding without interrupting the speaker, making their accurate prediction essential for creating more natural conversational agents. This paper proposes a novel method for real-time, continuous backchannel…

    Submitted 21 October, 2024; originally announced October 2024.

  2. Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer

    Authors: Tomoki Honda, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Recently, Conformer has achieved state-of-the-art performance in many speech recognition tasks. However, the Transformer-based models show significant deterioration for long-form speech, such as lectures, because the self-attention mechanism becomes unreliable with the computation of the square order of the input length. To solve the problem, we incorporate a kind of state-space model, Hungry Hung…

    Submitted 5 October, 2024; originally announced October 2024.

    Comments: Submitted to InterSpeech2024, Sample code is available at https://github.com/mirrormouse/Hybrid-H3-Conformer

  3. arXiv:2410.03147  [pdf, other]

    cs.CL cs.HC cs.RO

    Analysis and Detection of Differences in Spoken User Behaviors between Autonomous and Wizard-of-Oz Systems

    Authors: Mikey Elmers, Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara

    Abstract: This study examined users' behavioral differences in a large corpus of Japanese human-robot interactions, comparing interactions between a tele-operated robot and an autonomous dialogue system. We analyzed user spoken behaviors in both attentive listening and job interview dialogue scenarios. Results revealed significant differences in metrics such as speech length, speaking rate, fillers, backcha…

    Submitted 4 October, 2024; originally announced October 2024.

    Comments: Accepted and will be presented at the 27th conference of the Oriental COCOSDA (O-COCOSDA 2024)

  4. arXiv:2410.01365  [pdf]

    eess.IV cs.CV

    Anti-biofouling Lensless Camera System with Deep Learning based Image Reconstruction

    Authors: Naoki Ide, Tomohiro Kawahara, Hiroshi Ueno, Daiki Yanagidaira, Susumu Takatsuka

    Abstract: In recent years, there has been an increasing demand for underwater cameras that monitor the condition of offshore structures and check the number of individuals in aquaculture environments with long-period observation. One of the significant issues with this observation is that biofouling sticks to the aperture and lens densely and prevents cameras from capturing clear images. This study examine…

    Submitted 2 October, 2024; originally announced October 2024.

    Comments: 9 pages, 8 figures, Ocean Optics 2024

  5. Robotic Backchanneling in Online Conversation Facilitation: A Cross-Generational Study

    Authors: Sota Kobuki, Katie Seaborn, Seiki Tokunaga, Kosuke Fukumori, Shun Hidaka, Kazuhiro Tamura, Koji Inoue, Tatsuya Kawahara, Mihoko Otake-Mastuura

    Abstract: Japan faces many challenges related to its aging society, including increasing rates of cognitive decline in the population and a shortage of caregivers. Efforts have begun to explore solutions using artificial intelligence (AI), especially socially embodied intelligent agents and robots that can communicate with people. Yet, there has been little research on the compatibility of these agents with…

    Submitted 25 September, 2024; originally announced September 2024.

    Comments: Published at Proceedings of the 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2023)

  6. arXiv:2409.12524  [pdf, other]

    cs.CL cs.AI

    Should RAG Chatbots Forget Unimportant Conversations? Exploring Importance and Forgetting with Psychological Insights

    Authors: Ryuichi Sumida, Koji Inoue, Tatsuya Kawahara

    Abstract: While Retrieval-Augmented Generation (RAG) has shown promise in enhancing long-term conversations, the increasing memory load as conversations progress degrades retrieval accuracy. Drawing on psychological insights, we propose LUFY, a simple yet effective method that focuses on emotionally arousing memories and retains less than 10% of the conversation. In the user experiment, participants interac…

    Submitted 19 September, 2024; originally announced September 2024.

  7. arXiv:2409.08039  [pdf, other]

    cs.SD eess.AS

    Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations

    Authors: Wangjin Zhou, Fengrun Zhang, Yiming Liu, Wenhao Guan, Yi Zhao, Tatsuya Kawahara

    Abstract: This study presents an innovative Zero-Shot any-to-any Singing Voice Conversion (SVC) method, leveraging a novel clustering-based phoneme representation to effectively separate content, timbre, and singing style. This approach enables precise voice characteristic manipulation. We discovered that datasets with fewer recordings per artist are more susceptible to timbre leakage. Extensive testing on…

    Submitted 14 October, 2024; v1 submitted 12 September, 2024; originally announced September 2024.

  8. arXiv:2409.00815  [pdf, other]

    cs.SD cs.AI eess.AS

    Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition

    Authors: Hao Shi, Yuan Gao, Zhaoheng Ni, Tatsuya Kawahara

    Abstract: Serialized output training (SOT) attracts increasing attention due to its convenience and flexibility for multi-speaker automatic speech recognition (ASR). However, it is not easy to train with attention loss only. In this paper, we propose the overlapped encoding separation (EncSep) to fully utilize the benefits of the connectionist temporal classification (CTC) and attention hybrid loss. This ad…

    Submitted 10 September, 2024; v1 submitted 1 September, 2024; originally announced September 2024.

  9. arXiv:2408.16180  [pdf, other]

    eess.AS cs.CL cs.SD

    Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction

    Authors: Yuka Ko, Sheng Li, Chao-Han Huck Yang, Tatsuya Kawahara

    Abstract: With the strong representational power of large language models (LLMs), generative error correction (GER) for automatic speech recognition (ASR) aims to provide semantic and phonetic refinements to address ASR errors. This work explores how LLM-based GER can enhance and expand the capabilities of Japanese language processing, presenting the first GER benchmark for Japanese ASR with 0.9-2.6k text u…

    Submitted 11 October, 2024; v1 submitted 28 August, 2024; originally announced August 2024.

  10. arXiv:2408.02271  [pdf, other]

    cs.CL

    StyEmp: Stylizing Empathetic Response Generation via Multi-Grained Prefix Encoder and Personality Reinforcement

    Authors: Yahui Fu, Chenhui Chu, Tatsuya Kawahara

    Abstract: Recent approaches for empathetic response generation mainly focus on emotional resonance and user understanding, without considering the system's personality. Consistent personality is evident in real human expression and is important for creating trustworthy systems. To address this problem, we propose StyEmp, which aims to stylize the empathetic response generation with a consistent personality.…

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: Accepted by the 25th Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2024)

  11. arXiv:2407.05295  [pdf, ps, other]

    cond-mat.mes-hall

    Simulation of temperature-dependent quantum gates in silicon quantum dots with frequency shifts

    Authors: Yudai Sato, Takayuki Kawahara

    Abstract: To achieve quantum computing using semiconductor spin qubits, the spin qubits must be precisely controlled. However, unexpected noise limits this precision and prevents the implementation of error correction codes. Specifically, frequency shifts have been found to suppress one-qubit gate fidelity. Although the exact source of the frequency shifts remains unclear, several experiments have indicated…

    Submitted 7 July, 2024; originally announced July 2024.

    Comments: 10 pages, 9 figures

  12. arXiv:2403.06487  [pdf, other]

    cs.CL cs.SD eess.AS

    Multilingual Turn-taking Prediction Using Voice Activity Projection

    Authors: Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze

    Abstract: This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, on multilingual data, encompassing English, Mandarin, and Japanese. The VAP model continuously predicts the upcoming voice activities of participants in dyadic dialogue, leveraging a cross-attention Transformer to capture the dynamic interplay between participants. The re…

    Submitted 14 March, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: This paper has been accepted for presentation at The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) and represents the author's version of the work

  13. arXiv:2402.18275  [pdf, other]

    cs.SD cs.CL eess.AS

    Exploration of Adapter for Noise Robust Automatic Speech Recognition

    Authors: Hao Shi, Tatsuya Kawahara

    Abstract: Adapting an automatic speech recognition (ASR) system to unseen noise environments is crucial. Integrating adapters into neural networks has emerged as a potent technique for transfer learning. This study thoroughly investigates adapter-based ASR adaptation in noisy environments. We conducted experiments using the CHiME-4 dataset. The results show that inserting the adapter in the shallow layer y…

    Submitted 4 June, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

  14. arXiv:2402.14863  [pdf, other]

    cs.CL

    Evaluation of a semi-autonomous attentive listening system with takeover prompting

    Authors: Haruki Kawai, Divesh Lala, Koji Inoue, Keiko Ochi, Tatsuya Kawahara

    Abstract: The handling of communication breakdowns and loss of engagement is an important aspect of spoken dialogue systems, particularly for chatting systems such as attentive listening, where the user is mostly speaking. We presume that a human is best equipped to handle this task and rescue the flow of conversation. To this end, we propose a semi-autonomous system, where a remote operator can take contro…

    Submitted 20 February, 2024; originally announced February 2024.

  15. arXiv:2402.12770  [pdf, other]

    cs.CL

    Acknowledgment of Emotional States: Generating Validating Responses for Empathetic Dialogue

    Authors: Zi Haur Pang, Yahui Fu, Divesh Lala, Keiko Ochi, Koji Inoue, Tatsuya Kawahara

    Abstract: In the realm of human-AI dialogue, the facilitation of empathetic responses is important. Validation is one of the key communication techniques in psychology, which entails recognizing, understanding, and acknowledging others' emotional states, thoughts, and actions. This study introduces the first framework designed to engender empathetic dialogue with validating responses. Our approach incorpora…

    Submitted 20 February, 2024; originally announced February 2024.

    Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024)

  16. arXiv:2401.13249  [pdf, other]

    eess.AS cs.MM

    MOS-FAD: Improving Fake Audio Detection Via Automatic Mean Opinion Score Prediction

    Authors: Wangjin Zhou, Zhengdong Yang, Chenhui Chu, Sheng Li, Raj Dabre, Yi Zhao, Tatsuya Kawahara

    Abstract: Automatic Mean Opinion Score (MOS) prediction is employed to evaluate the quality of synthetic speech. This study extends the application of predicted MOS to the task of Fake Audio Detection (FAD), as we expect that MOS can be used to assess how close synthesized speech is to the natural human voice. We propose MOS-FAD, where MOS can be leveraged at two key points in FAD: training data selection a…

    Submitted 24 January, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted in ICASSP2024

  17. arXiv:2401.05871  [pdf, other]

    cs.CL

    Enhancing Personality Recognition in Dialogue by Data Augmentation and Heterogeneous Conversational Graph Networks

    Authors: Yahui Fu, Haiyue Song, Tianyu Zhao, Tatsuya Kawahara

    Abstract: Personality recognition is useful for enhancing robots' ability to tailor user-adaptive responses, thus fostering rich human-robot interactions. One of the challenges in this task is a limited number of speakers in existing dialogue corpora, which hampers the development of robust, speaker-independent personality recognition models. Additionally, accurately modeling both the interdependencies amon…

    Submitted 8 March, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

    Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024)

  18. arXiv:2401.04868  [pdf, other]

    cs.CL cs.HC cs.SD eess.AS

    Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection

    Authors: Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze

    Abstract: A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input contex…

    Submitted 9 January, 2024; originally announced January 2024.

    Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024) and represents the author's version of the work

  19. arXiv:2401.04867  [pdf, other]

    cs.CL cs.AI cs.HC

    An Analysis of User Behaviors for Objectively Evaluating Spoken Dialogue Systems

    Authors: Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara, Gabriel Skantze

    Abstract: Establishing evaluation schemes for spoken dialogue systems is important, but it can also be challenging. While subjective evaluations are commonly used in user experiments, objective evaluations are necessary for research comparison and reproducibility. To address this issue, we propose a framework for indirectly but objectively evaluating systems based on users' behaviors. In this paper, to this…

    Submitted 23 January, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

    Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024) and represents the author's version of the work

  20. arXiv:2309.09223  [pdf, other]

    cs.SD eess.AS

    Zero- and Few-shot Sound Event Localization and Detection

    Authors: Kazuki Shimada, Kengo Uchida, Yuichiro Koyama, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji, Tatsuya Kawahara

    Abstract: Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and temporal activation of preset classes trained before inference. To customize target classes after training, we tackle zero- and few…

    Submitted 17 January, 2024; v1 submitted 17 September, 2023; originally announced September 2023.

    Comments: 5 pages, 4 figures, accepted for publication in IEEE ICASSP 2024

  21. arXiv:2308.11020  [pdf, other]

    cs.CL cs.HC cs.RO

    Towards Objective Evaluation of Socially-Situated Conversational Robots: Assessing Human-Likeness through Multimodal User Behaviors

    Authors: Koji Inoue, Divesh Lala, Keiko Ochi, Tatsuya Kawahara, Gabriel Skantze

    Abstract: This paper tackles the challenging task of evaluating socially situated conversational robots and presents a novel objective evaluation approach that relies on multimodal user behaviors. In this study, our main focus is on assessing the human-likeness of the robot as the primary evaluation metric. While previous research often relied on subjective evaluations from users, our approach aims to evalu…

    Submitted 25 September, 2023; v1 submitted 21 August, 2023; originally announced August 2023.

    Comments: Accepted by 25th ACM International Conference on Multimodal Interaction (ICMI '23), Late-Breaking Results

  22. arXiv:2308.00085  [pdf, other]

    cs.CL cs.AI

    Reasoning before Responding: Integrating Commonsense-based Causality Explanation for Empathetic Response Generation

    Authors: Yahui Fu, Koji Inoue, Chenhui Chu, Tatsuya Kawahara

    Abstract: Recent approaches to empathetic response generation try to incorporate commonsense knowledge or reasoning about the causes of emotions to better understand the user's experiences and feelings. However, these approaches mainly focus on understanding the causalities of context from the user's perspective, ignoring the system's perspective. In this paper, we propose a commonsense-based causality expl…

    Submitted 5 September, 2023; v1 submitted 27 July, 2023; originally announced August 2023.

    Comments: Accepted by the 24th Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2023)

  23. arXiv:2305.10734  [pdf, other]

    cs.SD cs.CL eess.AS

    Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

    Authors: Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji

    Abstract: Diffusion-based generative speech enhancement (SE) has recently received attention, but reverse diffusion remains time-consuming. One solution is to initialize the reverse diffusion process with enhanced features estimated by a predictive SE system. However, the pipeline structure currently does not consider a combined use of generative and predictive decoders. The predictive decoder allows us…

    Submitted 28 February, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

  24. arXiv:2303.14593  [pdf, other]

    cs.SD eess.AS

    Time-domain Speech Enhancement Assisted by Multi-resolution Frequency Encoder and Decoder

    Authors: Hao Shi, Masato Mimura, Longbiao Wang, Jianwu Dang, Tatsuya Kawahara

    Abstract: Time-domain speech enhancement (SE) has recently been intensively investigated. Among recent works, DEMUCS introduces multi-resolution STFT loss to enhance performance. However, some resolutions used for STFT contain non-stationary signals, and it is challenging to learn multi-resolution frequency losses simultaneously with only one output. For better use of multi-resolution frequency information,…

    Submitted 25 March, 2023; originally announced March 2023.

  25. arXiv:2303.00146  [pdf, other]

    cs.HC cs.RO cs.SD eess.AS

    I Know Your Feelings Before You Do: Predicting Future Affective Reactions in Human-Computer Dialogue

    Authors: Yuanchao Li, Koji Inoue, Leimin Tian, Changzeng Fu, Carlos Ishi, Hiroshi Ishiguro, Tatsuya Kawahara, Catherine Lai

    Abstract: Current Spoken Dialogue Systems (SDSs) often serve as passive listeners that respond only after receiving user speech. To achieve human-like dialogue, we propose a novel future prediction architecture that allows an SDS to anticipate future affective reactions based on its current behaviors before the user speaks. In this work, we investigate two scenarios: speech and laughter. In speech, we propo…

    Submitted 17 March, 2023; v1 submitted 28 February, 2023; originally announced March 2023.

    Comments: Accepted to CHI2023 Late-Breaking Work

  26. arXiv:2211.08526  [pdf, other]

    cs.RO

    Alzheimer's Dementia Detection through Spontaneous Dialogue with Proactive Robotic Listeners

    Authors: Yuanchao Li, Catherine Lai, Divesh Lala, Koji Inoue, Tatsuya Kawahara

    Abstract: As the aging of society continues to accelerate, Alzheimer's Disease (AD) has received more and more attention from not only medical but also other fields, such as computer science, over the past decade. Since speech is considered one of the effective ways to diagnose cognitive decline, AD detection from speech has emerged as a hot topic. Nevertheless, such approaches fail to tackle several key is…

    Submitted 15 November, 2022; originally announced November 2022.

    Comments: Accepted for HRI2022 Late-Breaking Report

  27. arXiv:2209.04062  [pdf, other]

    cs.CL cs.SD eess.AS

    Non-autoregressive Error Correction for CTC-based ASR with Phone-conditioned Masked LM

    Authors: Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Connectionist temporal classification (CTC)-based models are attractive in automatic speech recognition (ASR) because of their non-autoregressive nature. To take advantage of text-only data, language model (LM) integration approaches such as rescoring and shallow fusion have been widely used for CTC. However, they lose CTC's non-autoregressive nature because of the need for beam search, which slo…

    Submitted 8 September, 2022; originally announced September 2022.

    Comments: Accepted in Interspeech2022

  28. arXiv:2209.02030  [pdf, other]

    cs.CL cs.SD eess.AS

    Distilling the Knowledge of BERT for CTC-based ASR

    Authors: Hayato Futami, Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Connectionist temporal classification (CTC)-based models are attractive because of their fast inference in automatic speech recognition (ASR). Language model (LM) integration approaches such as shallow fusion and rescoring can improve the recognition accuracy of CTC-based ASR by taking advantage of the knowledge in text corpora. However, they significantly slow down the inference of CTC. In this…

    Submitted 5 September, 2022; originally announced September 2022.

  29. arXiv:2207.03169  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-end Speech-to-Punctuated-Text Recognition

    Authors: Jumon Nozaki, Tatsuya Kawahara, Kenkichi Ishizuka, Taiichi Hashimoto

    Abstract: Conventional automatic speech recognition systems do not produce punctuation marks which are important for the readability of the speech recognition results. They are also needed for subsequent natural language processing tasks such as machine translation. There have been a lot of works on punctuation prediction models that insert punctuation marks into speech recognition results as post-processin…

    Submitted 7 July, 2022; originally announced July 2022.

    Comments: Accepted to INTERSPEECH2022

  30. arXiv:2110.01857  [pdf, other]

    cs.CL eess.AS

    ASR Rescoring and Confidence Estimation with ELECTRA

    Authors: Hayato Futami, Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: In automatic speech recognition (ASR) rescoring, the hypothesis with the fewest errors should be selected from the n-best list using a language model (LM). However, LMs are usually trained to maximize the likelihood of correct word sequences, not to detect ASR errors. We propose an ASR rescoring method for directly detecting errors with ELECTRA, which is originally a pre-training method for NLP ta…

    Submitted 5 October, 2021; originally announced October 2021.

    Comments: Accepted in ASRU2021

  31. arXiv:2109.04411  [pdf, other]

    eess.AS cs.CL cs.SD

    Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring

    Authors: Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

    Abstract: This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models. End-to-end speech translation models have several advantages over traditional cascade systems such as inference latency reduction. However, conventional AR decoding methods are not fast enough because each token is generated incrementally. NAR models, however, can accelera…

    Submitted 9 September, 2021; originally announced September 2021.

  32. arXiv:2107.07509  [pdf, other]

    eess.AS cs.CL cs.SD

    VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording

    Authors: Hirofumi Inaguma, Tatsuya Kawahara

    Abstract: In this work, we propose novel decoding algorithms to enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD), based on monotonic chunkwise attention (MoChA) with an auxiliary connectionist temporal classification (CTC) objective. We propose a block-synchronous beam search decoding to take advantage of efficient batched output-…

    Submitted 15 July, 2021; originally announced July 2021.

    Comments: Accepted at Interspeech 2021

  33. arXiv:2107.00635  [pdf, other]

    eess.AS cs.CL cs.SD

    StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR

    Authors: Hirofumi Inaguma, Tatsuya Kawahara

    Abstract: While attention-based encoder-decoder (AED) models have been successfully extended to the online variants for streaming automatic speech recognition (ASR), such as monotonic chunkwise attention (MoChA), the models still have a large label emission latency because of the unconstrained end-to-end training objective. Previous works tackled this problem by leveraging alignment information to control t…

    Submitted 15 July, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

    Comments: Accepted at Interspeech 2021

  34. arXiv:2106.02325  [pdf, other]

    cs.CL cs.HC

    ERICA: An Empathetic Android Companion for Covid-19 Quarantine

    Authors: Etsuko Ishii, Genta Indra Winata, Samuel Cahyawijaya, Divesh Lala, Tatsuya Kawahara, Pascale Fung

    Abstract: Over the past year, research in various domains, including Natural Language Processing (NLP), has been accelerated to fight against the COVID-19 pandemic, yet such research has just started on dialogue systems. In this paper, we introduce an end-to-end dialogue system which aims to ease the isolation of people under self-quarantine. We conduct a control simulation experiment to assess the effects…

    Submitted 4 June, 2021; originally announced June 2021.

    Comments: Accepted in SIGDIAL 2021

  35. arXiv:2105.00403  [pdf, other]

    cs.CL cs.AI cs.RO

    Intelligent Conversational Android ERICA Applied to Attentive Listening and Job Interview

    Authors: Tatsuya Kawahara, Koji Inoue, Divesh Lala

    Abstract: Following the success of spoken dialogue systems (SDS) in smartphone assistants and smart speakers, a number of communicative robots are developed and commercialized. Compared with the conventional SDSs designed as a human-machine interface, interaction with robots is expected to be in a closer manner to talking to a human because of the anthropomorphism and physical presence. The goal or task of…

    Submitted 2 May, 2021; originally announced May 2021.

    Comments: 7 pages, 5 figures, 1 table

  36. arXiv:2104.06457  [pdf, other]

    cs.CL cs.SD eess.AS

    Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation

    Authors: Hirofumi Inaguma, Tatsuya Kawahara, Shinji Watanabe

    Abstract: A conventional approach to improving the performance of end-to-end speech translation (E2E-ST) models is to leverage the source transcription via pre-training and joint training with automatic speech recognition (ASR) and neural machine translation (NMT) tasks. However, since the input modalities are different, it is difficult to leverage source language text successfully. In this work, we focus o…

    Submitted 13 April, 2021; originally announced April 2021.

    Comments: Accepted at NAACL-HLT 2021 (short paper)

  37. arXiv:2103.00422  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition

    Authors: Hirofumi Inaguma, Tatsuya Kawahara

    Abstract: This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the…

    Submitted 22 August, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

  38. arXiv:2010.13047  [pdf, other]

    cs.CL cs.SD eess.AS

    Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder

    Authors: Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

    Abstract: Fast inference speed is an important goal towards real-world deployment of speech translation (ST) systems. End-to-end (E2E) models based on the encoder-decoder architecture are more suitable for this goal than traditional cascaded systems, but their effectiveness regarding decoding speed has not been explored so far. Inspired by recent progress in non-autoregressive (NAR) methods in text-based tr…

    Submitted 18 February, 2021; v1 submitted 25 October, 2020; originally announced October 2020.

    Comments: Accepted at IEEE ICASSP 2021

  39. arXiv:2009.07117  [pdf, other]

    cs.CL

    Multi-Referenced Training for Dialogue Response Generation

    Authors: Tianyu Zhao, Tatsuya Kawahara

    Abstract: In open-domain dialogue response generation, a dialogue context can be continued with diverse responses, and the dialogue models should capture such one-to-many relations. In this work, we first analyze the training objective of dialogue models from the view of Kullback-Leibler divergence (KLD) and show that the gap between the real world probability distribution and the single-referenced data's p…

    Submitted 18 October, 2020; v1 submitted 15 September, 2020; originally announced September 2020.

  40. arXiv:2008.12048  [pdf, ps, other]

    eess.AS

    End-to-end Music-mixed Speech Recognition

    Authors: Jeongwoo Woo, Masato Mimura, Kazuyoshi Yoshii, Tatsuya Kawahara

    Abstract: Automatic speech recognition (ASR) in multimedia content is one of the promising applications, but speech data in this kind of content are frequently mixed with background music, which is harmful for the performance of ASR. In this study, we propose a method for improving ASR with background music based on time-domain source separation. We utilize Conv-TasNet as a separation network, which has ach…

    Submitted 27 August, 2020; originally announced August 2020.

    Comments: Submitted to APSIPA 2020

  41. arXiv:2008.03822  [pdf, other]

    cs.CL eess.AS

    Distilling the Knowledge of BERT for Sequence-to-Sequence ASR

    Authors: Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Attention-based sequence-to-sequence (seq2seq) models have achieved promising results in automatic speech recognition (ASR). However, as these models decode in a left-to-right way, they do not have access to context on the right. We leverage both left and right context by applying BERT as an external language model to seq2seq ASR through knowledge distillation. In our proposed method, BERT generat…

    Submitted 9 August, 2020; originally announced August 2020.

    Comments: Accepted in INTERSPEECH2020

  42. arXiv:2005.09394  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Enhancing Monotonic Multihead Attention for Streaming ASR

    Authors: Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara

    Abstract: We investigate a monotonic multihead attention (MMA) by extending hard monotonic attention to Transformer-based automatic speech recognition (ASR) for online streaming applications. For streaming inference, all monotonic attention (MA) heads should learn proper alignments because the next token is not generated until all heads detect the corresponding token boundaries. However, we found not all MA…

    Submitted 30 September, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: Accepted to Interspeech 2020

  43. arXiv:2005.09256  [pdf, other]

    eess.AS cs.CL

    Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition

    Authors: Kohei Matsuura, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: It is important to transcribe and archive speech data of endangered languages for preserving heritages of verbal culture and automatic speech recognition (ASR) is a powerful tool to facilitate this process. However, since endangered languages do not generally have large corpora with many speakers, the performance of ASR models trained on them are considerably poor in general. Nevertheless, we are…

    Submitted 31 July, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: Accepted for Interspeech 2020

  44. arXiv:2005.04712  [pdf, other]

    cs.CL cs.LG

    CTC-synchronous Training for Monotonic Attention Model

    Authors: Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara

    Abstract: Monotonic chunkwise attention (MoChA) has been studied for the online streaming automatic speech recognition (ASR) based on a sequence-to-sequence framework. In contrast to connectionist temporal classification (CTC), backward probabilities cannot be leveraged in the alignment marginalization process during training due to left-to-right dependency in the decoder. This results in the error propagat…

    Submitted 6 August, 2020; v1 submitted 10 May, 2020; originally announced May 2020.

    Comments: Accepted to Interspeech 2020

  45. arXiv:2004.11419  [pdf, other]

    cs.SD cs.CL eess.AS

    End-to-end speech-to-dialog-act recognition

    Authors: Viet-Trung Dang, Tianyu Zhao, Sei Ueno, Hirofumi Inaguma, Tatsuya Kawahara

    Abstract: Spoken language understanding, which extracts intents and/or semantic concepts in utterances, is conventionally formulated as a post-processing of automatic speech recognition. It is usually trained with oracle transcripts, but needs to deal with errors by ASR. Moreover, there are acoustic features which are related with intents but not represented with the transcripts. In this paper, we present a…

    Submitted 28 July, 2020; v1 submitted 23 April, 2020; originally announced April 2020.

  46. arXiv:2004.04908  [pdf, ps, other]

    cs.CL

    Designing Precise and Robust Dialogue Response Evaluators

    Authors: Tianyu Zhao, Divesh Lala, Tatsuya Kawahara

    Abstract: Automatic dialogue response evaluator has been proposed as an alternative to automated metrics and human evaluation. However, existing automatic evaluators achieve only moderate correlation with human judgement and they are not robust. In this work, we propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained (masked) language models. Experimental…

    Submitted 24 April, 2020; v1 submitted 10 April, 2020; originally announced April 2020.

    Comments: Accepted at ACL 2020

  47. arXiv:2002.06675  [pdf, other]

    cs.CL cs.SD eess.AS

    Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language

    Authors: Kohei Matsuura, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Ainu is an unwritten language that has been spoken by Ainu people who are one of the ethnic groups in Japan. It is recognized as critically endangered by UNESCO and archiving and documentation of its language heritage is of paramount importance. Although a considerable amount of voice recordings of Ainu folklore has been produced and accumulated to save their culture, only a quite limited parts of…

    Submitted 16 May, 2020; v1 submitted 16 February, 2020; originally announced February 2020.

    Comments: Accepted in LREC 2020

  48. Polaron Masses in CH3NH3PbX3 Perovskites Determined by Landau Level Spectroscopy in Low Magnetic Fields

    Authors: Yasuhiro Yamada, Hirofumi Mino, Takuya Kawahara, Kenichi Oto, Hidekatsu Suzuura, Yoshihiko Kanemitsu

    Abstract: We investigate the electron-phonon coupling in CH3NH3PbX3 lead halide perovskites through the observation of Landau levels and high-order excitons at weak magnetic fields, where the cyclotron energy is significantly smaller than the longitudinal optical phonon energy. The reduced masses of the carriers and the exciton binding energies obtained from these data are clearly influenced by polaron form…

    Submitted 12 May, 2021; v1 submitted 22 January, 2020; originally announced January 2020.

    Journal ref: Phys. Rev. Lett. 126, 237401 (2021)

  49. arXiv:1910.00254  [pdf, ps, other]

    cs.CL eess.AS

    Multilingual End-to-End Speech Translation

    Authors: Hirofumi Inaguma, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

    Abstract: In this paper, we propose a simple yet effective framework for multilingual end-to-end speech translation (ST), in which speech utterances in source languages are directly translated to the desired target languages with a universal sequence-to-sequence architecture. While multilingual models have shown to be useful for automatic speech recognition (ASR) and machine translation (MT), this is the fi…

    Submitted 31 October, 2019; v1 submitted 1 October, 2019; originally announced October 2019.

    Comments: Accepted to ASRU 2019

  50. arXiv:1909.09993  [pdf, other]

    cs.CL

    Improving OOV Detection and Resolution with External Language Models in Acoustic-to-Word ASR

    Authors: Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Acoustic-to-word (A2W) end-to-end automatic speech recognition (ASR) systems have attracted attention because of an extremely simplified architecture and fast decoding. To alleviate data sparseness issues due to infrequent words, the combination with an acoustic-to-character (A2C) model is investigated. Moreover, the A2C model can be used to recover out-of-vocabulary (OOV) words that are not cover…

    Submitted 22 September, 2019; originally announced September 2019.

    Comments: SLT2018