-
BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models
Authors:
Marvin Lavechin,
Yaya Sy,
Hadrien Titeux,
María Andrea Cruz Blandón,
Okko Räsänen,
Hervé Bredin,
Emmanuel Dupoux,
Alejandrina Cristia
Abstract:
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels. In order to fully realize the potential of these approaches and further our understanding of how infants learn language, simulations must closely emulate real-life situations by training on developmentally plausible corpora and benchmarking against appropriate test sets. To this end, we propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels, both of which are compatible with the vocabulary typical of children's language experiences. This paper introduces the benchmark and summarizes a range of experiments showing its usefulness. In addition, we highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.
Submitted 8 June, 2023; v1 submitted 2 June, 2023;
originally announced June 2023.
-
ProsAudit, a prosodic benchmark for self-supervised speech models
Authors:
Maureen de Seyssel,
Marvin Lavechin,
Hadrien Titeux,
Arthur Thomas,
Gwendal Virlet,
Andrea Santos Revilla,
Guillaume Wisniewski,
Bogdan Ludusan,
Emmanuel Dupoux
Abstract:
We present ProsAudit, a benchmark in English to assess structural prosodic knowledge in self-supervised learning (SSL) speech models. It consists of two subtasks, their corresponding metrics, and an evaluation dataset. In the protosyntax task, the model must correctly identify strong versus weak prosodic boundaries. In the lexical task, the model needs to correctly distinguish between pauses inserted between words and within words. We also provide human evaluation scores on this benchmark. We evaluated a series of SSL models and found that they were all able to perform above chance on both tasks, even when evaluated on an unseen language. However, non-native models performed significantly worse than native ones on the lexical task, highlighting the importance of lexical knowledge in this task. We also found a clear effect of size with models trained on more data performing better in the two subtasks.
Submitted 1 June, 2023; v1 submitted 23 February, 2023;
originally announced February 2023.
-
Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation
Authors:
Marvin Lavechin,
Marianne Métais,
Hadrien Titeux,
Alodie Boissonnet,
Jade Copet,
Morgane Rivière,
Elika Bergelson,
Alejandrina Cristia,
Emmanuel Dupoux,
Hervé Bredin
Abstract:
Most automatic speech processing systems register degraded performance when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a neural network jointly trained to extract speech/non-speech segments, speech-to-noise ratios, and C50 room acoustics from single-channel recordings. Brouhaha is trained using a data-driven approach in which noisy and reverberant audio segments are synthesized. We first evaluate its performance and demonstrate that the proposed multi-task regime is beneficial. We then present two scenarios illustrating how Brouhaha can be used on naturally noisy and reverberant data: 1) to investigate the errors made by a speaker diarization model (pyannote.audio); and 2) to assess the reliability of an automatic speech recognition model (Whisper from OpenAI). Both our pipeline and a pretrained model are open source and shared with the speech community.
Submitted 25 May, 2023; v1 submitted 24 October, 2022;
originally announced October 2022.
-
Comparison of Speaker Role Recognition and Speaker Enrollment Protocol for conversational Clinical Interviews
Authors:
Rachid Riad,
Hadrien Titeux,
Laurie Lemoine,
Justine Montillot,
Agnes Sliwinski,
Jennifer Hamet Bagnou,
Xuan Nga Cao,
Anne-Catherine Bachoud-Lévi,
Emmanuel Dupoux
Abstract:
Conversations between a clinician and a patient, in natural conditions, are valuable sources of information for medical follow-up. The automatic analysis of these dialogues could help extract new language markers and speed up the clinicians' reports. Yet, it is not clear which speech processing pipeline performs best at detecting and identifying the speaker turns, especially for individuals with speech and language disorders. Here, we propose a split of the data that allows conducting a comparative evaluation of speaker role recognition and speaker enrollment methods to solve this task. We trained end-to-end neural network architectures to adapt to each task and evaluated each approach under the same metric. Experimental results are reported on naturalistic clinical conversations between Neuropsychologist and Interviewees, at different stages of Huntington's disease. We found that our Speaker Role Recognition model gave the best performance. In addition, our study underlined the importance of retraining models with in-domain data. Finally, we observed that results do not depend on the demographics of the Interviewee, highlighting the clinical relevance of our methods.
Submitted 5 November, 2020; v1 submitted 30 October, 2020;
originally announced October 2020.
-
Vocal markers from sustained phonation in Huntington's Disease
Authors:
Rachid Riad,
Hadrien Titeux,
Laurie Lemoine,
Justine Montillot,
Jennifer Hamet Bagnou,
Xuan Nga Cao,
Emmanuel Dupoux,
Anne-Catherine Bachoud-Lévi
Abstract:
Disease-modifying treatments are currently assessed in neurodegenerative diseases. Huntington's Disease represents a unique opportunity to design automatic sub-clinical markers, even in premanifest gene carriers. We investigated phonatory impairments as potential clinical markers and propose them for both diagnosis and gene carriers follow-up. We used two sets of features: Phonatory features and Modulation Power Spectrum Features. We found that phonation is not sufficient for the identification of sub-clinical disorders of premanifest gene carriers. According to our regression results, Phonatory features are suitable for the predictions of clinical performance in Huntington's Disease.
Submitted 31 July, 2020; v1 submitted 9 June, 2020;
originally announced June 2020.
-
Seshat: A tool for managing and verifying annotation campaigns of audio data
Authors:
Hadrien Titeux,
Rachid Riad,
Xuan-Nga Cao,
Nicolas Hamilakis,
Kris Madden,
Alejandrina Cristia,
Anne-Catherine Bachoud-Lévi,
Emmanuel Dupoux
Abstract:
We introduce Seshat, a new, simple, open-source software tool to efficiently manage annotations of speech corpora. The Seshat software allows users to easily customise and manage annotations of large audio corpora while ensuring compliance with the formatting and naming conventions of the annotated output files. In addition, it includes procedures for checking the content of annotations following specific rules that can be implemented in personalised parsers. Finally, we propose a double-annotation mode, for which Seshat automatically computes an associated inter-annotator agreement with the $γ$ measure, taking into account the categorisation and segmentation discrepancies.
Submitted 17 February, 2021; v1 submitted 3 March, 2020;
originally announced March 2020.
-
Speaker detection in the wild: Lessons learned from JSALT 2019
Authors:
Paola Garcia,
Jesus Villalba,
Herve Bredin,
Jun Du,
Diego Castan,
Alejandrina Cristia,
Latane Bullock,
Ling Guo,
Koji Okabe,
Phani Sankar Nidadavolu,
Saurabh Kataria,
Sizhu Chen,
Leo Galmant,
Marvin Lavechin,
Lei Sun,
Marie-Philippe Gill,
Bar Ben-Yair,
Sajjad Abdoli,
Xin Wang,
Wassim Bouaziz,
Hadrien Titeux,
Emmanuel Dupoux,
Kong Aik Lee,
Najim Dehak
Abstract:
This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions that go from meetings to wild speech. We describe the research threads we explored and a set of modules that was successful for these scenarios. The ultimate goal was to explore speaker detection, but our first finding was that an effective diarization improves detection, and not having a diarization stage impoverishes the performance. All the different configurations of our research agree on this fact and follow a main backbone that includes diarization as a previous stage. With this backbone, we analyzed the following problems: voice activity detection, how to deal with noisy signals, domain mismatch, how to improve the clustering, and the overall impact of previous stages in the final speaker detection. In this paper, we show partial results for speaker diarization to have a better understanding of the problem and we present the final results for speaker detection.
Submitted 2 December, 2019;
originally announced December 2019.
-
pyannote.audio: neural building blocks for speaker diarization
Authors:
Hervé Bredin,
Ruiqing Yin,
Juan Manuel Coria,
Gregory Gelly,
Pavel Korshunov,
Marvin Lavechin,
Diego Fustes,
Hadrien Titeux,
Wassim Bouaziz,
Marie-Philippe Gill
Abstract:
We introduce pyannote.audio, an open-source toolkit written in Python for speaker diarization. Based on the PyTorch machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. pyannote.audio also comes with pre-trained models covering a wide range of domains for voice activity detection, speaker change detection, overlapped speech detection, and speaker embedding, reaching state-of-the-art performance for most of them.
Submitted 4 November, 2019;
originally announced November 2019.