Showing 1–50 of 74 results for author: Sahidullah, M

  1. arXiv:2410.00023  [pdf, other]

    eess.SP cs.LG cs.SD eess.AS

    Self-Tuning Spectral Clustering for Speaker Diarization

    Authors: Nikhil Raghav, Avisek Gupta, Md Sahidullah, Swagatam Das

    Abstract: Spectral clustering has proven effective in grouping speech representations for speaker diarization tasks, although post-processing the affinity matrix remains difficult due to the need for careful tuning before constructing the Laplacian. In this study, we present a novel pruning algorithm to create a sparse affinity matrix called \emph{spectral clustering on p-neighborhood retained affinity matr…

    Submitted 16 September, 2024; originally announced October 2024.

    Comments: Submitted to ICASSP 2025

  2. arXiv:2409.15356  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    TCG CREST System Description for the Second DISPLACE Challenge

    Authors: Nikhil Raghav, Subhajit Saha, Md Sahidullah, Swagatam Das

    Abstract: In this report, we describe the speaker diarization (SD) and language diarization (LD) systems developed by our team for the Second DISPLACE Challenge, 2024. Our contributions were dedicated to Track 1 for SD and Track 2 for LD in multilingual and multi-speaker scenarios. We investigated different speech enhancement techniques, voice activity detection (VAD) techniques, unsupervised domain categor…

    Submitted 16 September, 2024; originally announced September 2024.

  3. arXiv:2409.07884  [pdf, other]

    cs.LG eess.AS

    Graph Neural Networks for Parkinsons Disease Detection

    Authors: Shakeel A. Sheikh, Yacouba Kaloga, Md Sahidullah, Ina Kodrasi

    Abstract: Despite the promising performance of state-of-the-art approaches for Parkinson's Disease (PD) detection, these approaches often analyze individual speech segments in isolation, which can lead to suboptimal results. Dysarthric cues that characterize speech impairments from PD patients are expected to be related across segments from different speakers. Isolated segment analysis fails to exploit these…

    Submitted 16 September, 2024; v1 submitted 12 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  4. arXiv:2408.08739  [pdf, other]

    eess.AS cs.AI cs.SD

    ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

    Authors: Xin Wang, Hector Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi

    Abstract: ASVspoof 5 is the fifth edition in a series of challenges that promote the study of speech spoofing and deepfake attacks, and the design of detection solutions. Compared to previous challenges, the ASVspoof 5 database is built from crowdsourced data collected from a vastly greater number of speakers in diverse acoustic conditions. Attacks, also crowdsourced, are generated and tested using surrogat…

    Submitted 16 August, 2024; originally announced August 2024.

    Comments: 8 pages, ASVspoof 5 Workshop (Interspeech2024 Satellite)

  5. arXiv:2406.17246  [pdf, other]

    cs.SD cs.AI eess.AS

    Beyond Silence: Bias Analysis through Loss and Asymmetric Approach in Audio Anti-Spoofing

    Authors: Hye-jin Shim, Md Sahidullah, Jee-weon Jung, Shinji Watanabe, Tomi Kinnunen

    Abstract: Current trends in audio anti-spoofing detection research strive to improve models' ability to generalize across unseen attacks by learning to identify a variety of spoofing artifacts. This emphasis has primarily focused on the spoof class. Recently, several studies have noted that the distribution of silence differs between the two classes, which can serve as a shortcut. In this paper, we extend c…

    Submitted 26 August, 2024; v1 submitted 24 June, 2024; originally announced June 2024.

    Comments: 5 pages, 1 figure, 5 tables, ISCA Interspeech 2024 SynData4GenAI Workshop

  6. arXiv:2406.09999  [pdf, other]

    eess.AS

    ROAR: Reinforcing Original to Augmented Data Ratio Dynamics for Wav2Vec2.0 Based ASR

    Authors: Vishwanath Pratap Singh, Federico Malato, Ville Hautamaki, Md. Sahidullah, Tomi Kinnunen

    Abstract: While automatic speech recognition (ASR) greatly benefits from data augmentation, the augmentation recipes themselves tend to be heuristic. In this paper, we address one of the heuristic approaches associated with balancing the right amount of augmented data in ASR training by introducing a reinforcement learning (RL) based dynamic adjustment of original-to-augmented data ratio (OAR). Unlike the fix…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted: Interspeech 2024

    Journal ref: Interspeech 2024

  7. arXiv:2403.14290  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    Exploring Green AI for Audio Deepfake Detection

    Authors: Subhajit Saha, Md Sahidullah, Swagatam Das

    Abstract: The state-of-the-art audio deepfake detectors leveraging deep neural networks exhibit impressive recognition performance. Nonetheless, this advantage is accompanied by a significant carbon footprint. This is mainly due to the use of high-performance computing with accelerators and high training time. Studies show that an average deep NLP model produces around 626k lbs of CO₂ which is…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: This manuscript is under review in a conference

  8. arXiv:2403.14286  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization

    Authors: Nikhil Raghav, Md Sahidullah

    Abstract: Clustering speaker embeddings is crucial in speaker diarization but hasn't received as much focus as other components. Moreover, the robustness of speaker diarization across various datasets hasn't been explored when the development and evaluation data are from different domains. To bridge this gap, this study thoroughly examines spectral clustering for both same-domain and cross-domain speaker di…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: Manuscript Under Review

  9. arXiv:2402.15214  [pdf, other]

    eess.AS cs.SD

    ChildAugment: Data Augmentation Methods for Zero-Resource Children's Speaker Verification

    Authors: Vishwanath Pratap Singh, Md Sahidullah, Tomi Kinnunen

    Abstract: The accuracy of modern automatic speaker verification (ASV) systems, when trained exclusively on adult data, drops substantially when applied to children's speech. The scarcity of children's speech corpora hinders fine-tuning ASV systems for children's speech. Hence, there is a timely need to explore more effective ways of reusing adults' speech data. One promising approach is to align vocal-tract…

    Submitted 23 February, 2024; originally announced February 2024.

    Comments: The following article has been accepted by The Journal of the Acoustical Society of America (JASA). After it is published, it will be found at https://pubs.aip.org/asa/jasa

  10. arXiv:2401.11156  [pdf, other]

    cs.CR cs.AI cs.SD eess.AS

    Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

    Authors: Xuechen Liu, Md Sahidullah, Kong Aik Lee, Tomi Kinnunen

    Abstract: It is now well known that automatic speaker verification (ASV) systems can be spoofed using various types of adversaries. The usual approach to protect ASV systems against such attacks is to develop a separate spoofing countermeasure (CM) module to classify speech input as either a bonafide or a spoofed utterance. Nevertheless, such a design requires additional computation and utilization effo…

    Submitted 27 January, 2024; v1 submitted 20 January, 2024; originally announced January 2024.

    Comments: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (doi updated)

  11. arXiv:2306.07501  [pdf, other]

    eess.AS cs.SD

    Speaker Verification Across Ages: Investigating Deep Speaker Embedding Sensitivity to Age Mismatch in Enrollment and Test Speech

    Authors: Vishwanath Pratap Singh, Md Sahidullah, Tomi Kinnunen

    Abstract: In this paper, we study the impact of ageing on modern deep speaker embedding based automatic speaker verification (ASV) systems. We have selected two different datasets to examine ageing on the state-of-the-art ECAPA-TDNN system. The first dataset, used for addressing short-term ageing (up to 10 years' time difference between enrollment and test) under uncontrolled conditions, is VoxCeleb. The…

    Submitted 12 June, 2023; originally announced June 2023.

    Journal ref: Interspeech 2023

  12. arXiv:2306.00689  [pdf, other]

    cs.SD cs.LG eess.AS

    Stuttering Detection Using Speaker Representations and Self-supervised Contextual Embeddings

    Authors: Shakeel A. Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni

    Abstract: The adoption of advanced deep learning architectures in stuttering detection (SD) tasks is challenging due to the limited size of the available datasets. To this end, this work introduces the application of speech embeddings extracted from pre-trained deep learning models trained on large audio datasets for different tasks. In particular, we explore audio representations obtained using emphasized…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted in International Journal of Speech Technology, Springer 2023; substantial overlap with arXiv:2204.01564

  13. arXiv:2306.00044  [pdf, ps, other]

    cs.LG cs.CR cs.SD eess.AS

    How to Construct Perfect and Worse-than-Coin-Flip Spoofing Countermeasures: A Word of Warning on Shortcut Learning

    Authors: Hye-jin Shim, Rosa González Hautamäki, Md Sahidullah, Tomi Kinnunen

    Abstract: Shortcut learning, or the 'Clever Hans effect', refers to situations where a learning agent (e.g., deep neural networks) learns spurious correlations present in data, resulting in biased models. We focus on finding shortcuts in deep learning based spoofing countermeasures (CMs) that predict whether a given utterance is spoofed or not. While prior work has addressed specific data artifacts, such as sile…

    Submitted 31 May, 2023; originally announced June 2023.

    Comments: Interspeech 2023

  14. arXiv:2305.19051  [pdf, other]

    eess.AS cs.AI cs.SD

    Towards single integrated spoofing-aware speaker verification embeddings

    Authors: Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang, Xuechen Liu, Md Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung

    Abstract: This study aims to develop single integrated spoofing-aware speaker verification (SASV) embeddings that satisfy two aspects. First, rejecting non-target speakers' input as well as target speakers' spoofed inputs should be addressed. Second, competitive performance should be demonstrated compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outpe…

    Submitted 1 June, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted by INTERSPEECH 2023. Code and models are available at https://github.com/sasv-challenge/ASVSpoof5-SASVBaseline

  15. arXiv:2303.01126  [pdf, other]

    cs.SD cs.CR eess.AS

    Speaker-Aware Anti-Spoofing

    Authors: Xuechen Liu, Md Sahidullah, Kong Aik Lee, Tomi Kinnunen

    Abstract: We address speaker-aware anti-spoofing, where prior knowledge of the target speaker is incorporated into a voice spoofing countermeasure (CM). In contrast to the frequently used speaker-independent solutions, we train the CM in a speaker-conditioned way. As a proof of concept, we consider a speaker-aware extension to the state-of-the-art AASIST (audio anti-spoofing using integrated spectro-temporal…

    Submitted 8 June, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

  16. arXiv:2303.01125  [pdf, other]

    cs.SD cs.LG eess.AS

    Distilling Multi-Level X-vector Knowledge for Small-footprint Speaker Verification

    Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: Even though deep speaker models have demonstrated impressive accuracy in speaker verification tasks, this often comes at the expense of increased model size and computation time, presenting challenges for deployment in resource-constrained environments. Our research focuses on addressing this limitation through the development of small footprint deep speaker embedding extraction using knowledge di…

    Submitted 19 December, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: Submitted to Data & Knowledge Engineering in Dec. 2023. Copyright may be transferred without notice

  17. arXiv:2302.11343  [pdf, other]

    cs.SD cs.LG eess.AS

    Advancing Stuttering Detection via Data Augmentation, Class-Balanced Loss and Multi-Contextual Deep Learning

    Authors: Shakeel A. Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni

    Abstract: Stuttering is a neuro-developmental speech impairment characterized by uncontrolled utterances (interjections) and core behaviors (blocks, repetitions, and prolongations), and is caused by the failure of speech sensorimotors. Due to its complex nature, stuttering detection (SD) is a difficult task. If detected at an early stage, it could facilitate speech therapists to observe and rectify the spee…

    Submitted 21 February, 2023; originally announced February 2023.

    Comments: Accepted in IEEE Journal of Biomedical Health Informatics 2023

  18. arXiv:2302.05110  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Cross-Corpora Spoken Language Identification with Domain Diversification and Generalization

    Authors: Spandan Dey, Md Sahidullah, Goutam Saha

    Abstract: This work addresses the cross-corpora generalization issue for the low-resourced spoken language identification (LID) problem. We have conducted the experiments in the context of Indian LID and identified strikingly poor cross-corpora generalization due to corpora-dependent non-lingual biases. Our contribution to this work is twofold. First, we propose domain diversification, which diversifies the…

    Submitted 10 February, 2023; originally announced February 2023.

    Comments: Accepted for publication in Elsevier Computer Speech & Language

  19. Modulation spectral features for speech emotion recognition using deep neural networks

    Authors: Premjeet Singh, Md Sahidullah, Goutam Saha

    Abstract: This work explores the use of constant-Q transform based modulation spectral features (CQT-MSF) for speech emotion recognition (SER). The human perception and analysis of sound comprise two important cognitive parts: early auditory analysis and cortex-based processing. The early auditory analysis considers spectrogram-based representation whereas cortex-based analysis includes extraction of tem…

    Submitted 14 January, 2023; originally announced January 2023.

    Comments: Accepted for publication in Elsevier's Speech Communication Journal

    Journal ref: Volume 146, January 2023, Pages 53-69

  20. arXiv:2212.03812  [pdf, other]

    cs.CL cs.SD eess.AS

    An Overview of Indian Spoken Language Recognition from Machine Learning Perspective

    Authors: Spandan Dey, Md Sahidullah, Goutam Saha

    Abstract: Automatic spoken language identification (LID) is a very important research field in the era of multilingual voice-command-based human-computer interaction (HCI). A front-end LID module helps to improve the performance of many speech-based applications in the multilingual scenario. India is a populous country with diverse cultures and languages. The majority of the Indian population needs to use t…

    Submitted 30 November, 2022; originally announced December 2022.

    Comments: Accepted for publication in ACM Transactions on Asian and Low-Resource Language Information Processing

    Journal ref: ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 6 November 2022, Article No 128

  21. Analysis of constant-Q filterbank based representations for speech emotion recognition

    Authors: Premjeet Singh, Shefali Waldekar, Md Sahidullah, Goutam Saha

    Abstract: This work analyzes the constant-Q filterbank-based time-frequency representations for speech emotion recognition (SER). Constant-Q filterbank provides non-linear spectro-temporal representation with higher frequency resolution at low frequencies. Our investigation reveals how the increased low-frequency resolution benefits SER. The time-domain comparative analysis between short-term mel-frequency…

    Submitted 29 November, 2022; originally announced November 2022.

    Comments: Accepted for publication in Elsevier's Digital Signal Processing Journal

    Journal ref: Volume 130, October 2022, 103712

  22. arXiv:2211.01091  [pdf, ps, other]

    eess.AS cs.AI cs.SD

    I4U System Description for NIST SRE'20 CTS Challenge

    Authors: Kong Aik Lee, Tomi Kinnunen, Daniele Colibro, Claudio Vair, Andreas Nautsch, Hanwu Sun, Liang He, Tianyu Liang, Qiongqiong Wang, Mickael Rouvier, Pierre-Michel Bousquet, Rohan Kumar Das, Ignacio Viñals Bailo, Meng Liu, Héctor Delgado, Xuechen Liu, Md Sahidullah, Sandro Cumani, Boning Zhang, Koji Okabe, Hitoshi Yamamoto, Ruijie Tao, Haizhou Li, Alfonso Ortega Giménez, Longbiao Wang, et al. (1 additional author not shown)

    Abstract: This manuscript describes the I4U submission to the 2020 NIST Speaker Recognition Evaluation (SRE'20) Conversational Telephone Speech (CTS) Challenge. The I4U submission resulted from active collaboration among researchers across eight research teams - I²R (Singapore), UEF (Finland), VALPT (Italy, Spain), NEC (Japan), THUEE (China), LIA (France), NUS (Singapore), INRIA (France) and TJU (C…

    Submitted 2 November, 2022; originally announced November 2022.

    Comments: SRE 2021, NIST Speaker Recognition Evaluation Workshop, CTS Speaker Recognition Challenge, 14-12 December 2021

  23. arXiv:2210.02437  [pdf, other]

    cs.SD cs.CR cs.MM eess.AS

    ASVspoof 2021: Towards Spoofed and Deepfake Speech Detection in the Wild

    Authors: Xuechen Liu, Xin Wang, Md Sahidullah, Jose Patino, Héctor Delgado, Tomi Kinnunen, Massimiliano Todisco, Junichi Yamagishi, Nicholas Evans, Andreas Nautsch, Kong Aik Lee

    Abstract: Benchmarking initiatives support the meaningful comparison of competing solutions to prominent problems in speech and language processing. Successive benchmarking evaluations typically reflect a progressive evolution from ideal lab conditions towards those encountered in the wild. ASVspoof, the spoofing and deepfake detection initiative and challenge series, has followed the same trend. This ar…

    Submitted 22 June, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

    Comments: IEEE/ACM Transactions on Audio, Speech, and Language Processing

  24. Robust Acoustic Domain Identification with its Application to Speaker Diarization

    Authors: A Kishore Kumar, Shefali Waldekar, Md Sahidullah, Goutam Saha

    Abstract: With the rise in multimedia content over the years, more variety is observed in the recording environments of audio. An audio processing system might benefit when it has a module to identify the acoustic domain at its front-end. In this paper, we demonstrate the idea of acoustic domain identification (ADI) for speaker diarization. For this, we first present a detailed study of the va…

    Submitted 8 August, 2022; v1 submitted 5 August, 2022; originally announced August 2022.

    Comments: Accepted for publication in International Journal of Speech Technology (Springer Nature)

  25. arXiv:2207.10817  [pdf, other]

    cs.SD cs.LG eess.AS

    End-to-End and Self-Supervised Learning for ComParE 2022 Stuttering Sub-Challenge

    Authors: Shakeel Ahmad Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni

    Abstract: In this paper, we present end-to-end and speech embedding based systems trained in a self-supervised fashion to participate in the ACM Multimedia 2022 ComParE Challenge, specifically the stuttering sub-challenge. In particular, we exploit the embeddings from the pre-trained Wav2Vec2.0 model for stuttering detection (SD) on the KSoF dataset. After embedding extraction, we benchmark with several met…

    Submitted 20 July, 2022; originally announced July 2022.

    Comments: Accepted in ACM MM 2022 Conference: Grand Challenges. © {Owner/Author | ACM} {2022}. This is the author's version of the work. It is posted here for your personal use. Not for redistribution

  26. arXiv:2205.00288  [pdf, other]

    eess.AS cs.SD

    Baselines and Protocols for Household Speaker Recognition

    Authors: Alexey Sholokhov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: Speaker recognition on household devices, such as smart speakers, features several challenges: (i) robustness across a vast number of heterogeneous domains (households), (ii) short utterances, (iii) possibly absent speaker labels of the enrollment data (passive enrollment), and (iv) presence of unknown persons (guests). While many commercial products exist, there is less published research and no…

    Submitted 5 May, 2022; v1 submitted 30 April, 2022; originally announced May 2022.

    Comments: Accepted to Odyssey 2022

  27. arXiv:2204.09976  [pdf, other]

    cs.SD eess.AS

    Baseline Systems for the First Spoofing-Aware Speaker Verification Challenge: Score and Embedding Fusion

    Authors: Hye-jin Shim, Hemlata Tak, Xuechen Liu, Hee-Soo Heo, Jee-weon Jung, Joon Son Chung, Soo-Whan Chung, Ha-Jin Yu, Bong-Jin Lee, Massimiliano Todisco, Héctor Delgado, Kong Aik Lee, Md Sahidullah, Tomi Kinnunen, Nicholas Evans

    Abstract: Deep learning has brought impressive progress in the study of both automatic speaker verification (ASV) and spoofing countermeasures (CM). Although solutions are mutually dependent, they have typically evolved as standalone sub-systems whereby CM solutions are usually designed for a fixed ASV system. The work reported in this paper aims to gauge the improvements in reliability that can be gained f…

    Submitted 21 April, 2022; originally announced April 2022.

    Comments: 8 pages, accepted by Odyssey 2022

  28. arXiv:2204.01735  [pdf, other]

    eess.AS cs.LG cs.SD

    Robust Stuttering Detection via Multi-task and Adversarial Learning

    Authors: Shakeel Ahmad Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni

    Abstract: By automatic detection and identification of stuttering, speech pathologists can track the progression of disfluencies of persons who stutter (PWS). In this paper, we investigate the impact of multi-task (MTL) and adversarial learning (ADV) to learn robust stutter features. This is the first-ever preliminary study where MTL and ADV have been employed in stuttering identification (SI). We evaluate…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: Under Review in European Signal Processing Conference 2022

  29. arXiv:2204.01564  [pdf, other]

    cs.SD cs.LG eess.AS

    Introducing ECAPA-TDNN and Wav2Vec2.0 Embeddings to Stuttering Detection

    Authors: Shakeel Ahmad Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni

    Abstract: The adoption of advanced deep learning (DL) architecture in stuttering detection (SD) tasks is challenging due to the limited size of the available datasets. To this end, this work introduces the application of speech embeddings extracted with pre-trained deep models trained on massive audio datasets for different tasks. In particular, we explore audio representations obtained using emphasized cha…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: Submitted to Interspeech 2022

  30. arXiv:2203.10992  [pdf, other]

    cs.SD cs.AI eess.AS

    Spoofing-Aware Speaker Verification with Unsupervised Domain Adaptation

    Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: In this paper, we initiate the concern of enhancing the spoofing robustness of the automatic speaker verification (ASV) system, without the primary presence of a separate countermeasure module. We start from the standard ASV framework of the ASVspoof 2019 baseline and approach the problem from the back-end classifier based on probabilistic linear discriminant analysis. We employ three unsupervised…

    Submitted 26 April, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: Accepted by Speaker Odyssey 2022

  31. arXiv:2202.05236  [pdf, other]

    cs.SD cs.AI eess.AS

    Learnable Nonlinear Compression for Robust Speaker Verification

    Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: In this study, we focus on nonlinear compression methods in spectral features for speaker verification based on deep neural networks. We consider different kinds of channel-dependent (CD) nonlinear compression methods optimized in a data-driven manner. Our methods are based on power nonlinearities and dynamic range compression (DRC). We also propose multi-regime (MR) design on the nonlinearities, a…

    Submitted 10 February, 2022; originally announced February 2022.

    Comments: Accepted by ICASSP2022

  32. arXiv:2110.10983  [pdf, other]

    cs.SD cs.AI eess.AS

    Optimizing Multi-Taper Features for Deep Speaker Verification

    Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: Multi-taper estimators provide low-variance power spectrum estimates that can be used in place of the windowed discrete Fourier transform (DFT) to extract speech features such as mel-frequency cepstral coefficients (MFCCs). Even if past work has reported promising automatic speaker verification (ASV) results with Gaussian mixture model-based classifiers, the performance of multi-taper MFCCs with d…

    Submitted 21 October, 2021; originally announced October 2021.

    Comments: To appear in IEEE Signal Processing Letters

  33. arXiv:2109.12058  [pdf, other]

    cs.SD cs.AI eess.AS

    Optimized Power Normalized Cepstral Coefficients towards Robust Deep Speaker Verification

    Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: After their introduction to robust speech recognition, power normalized cepstral coefficient (PNCC) features were successfully adopted to other tasks, including speaker verification. However, as a feature extractor with long-term operations on the power spectrogram, its temporal processing and amplitude scaling steps dedicated to environmental compensation may be redundant. Further, they might sup…

    Submitted 24 September, 2021; originally announced September 2021.

    Comments: Accepted for publication at ASRU 2021

  34. arXiv:2109.12056  [pdf, other]

    cs.SD cs.AI eess.AS

    Parameterized Channel Normalization for Far-field Deep Speaker Verification

    Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: We address far-field speaker verification with a deep neural network (DNN) based speaker embedding extractor, where mismatch between enrollment and test data often comes from convolutive effects (e.g. room reverberation) and noise. To mitigate these effects, we focus on two parametric normalization methods: per-channel energy normalization (PCEN) and parameterized cepstral mean normalization (PCMN).…

    Submitted 24 September, 2021; originally announced September 2021.

    Comments: Accepted for publication at ASRU 2021

  35. arXiv:2109.00537  [pdf, other]

    eess.AS cs.CR cs.LG cs.SD

    ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection

    Authors: Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, Héctor Delgado

    Abstract: ASVspoof 2021 is the fourth edition in the series of bi-annual challenges which aim to promote the study of spoofing and the design of countermeasures to protect automatic speaker verification systems from manipulation. In addition to a continued focus upon logical and physical access tasks in which there are a number of advances compared to previous editions, ASVspoof 2021 introduces a new task in…

    Submitted 1 September, 2021; originally announced September 2021.

    Comments: Accepted to the ASVspoof 2021 Workshop

  36. arXiv:2109.00535  [pdf, other]

    eess.AS cs.CR cs.LG cs.SD

    ASVspoof 2021: Automatic Speaker Verification Spoofing and Countermeasures Challenge Evaluation Plan

    Authors: Héctor Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Jose Patino, Md Sahidullah, Massimiliano Todisco, Xin Wang, Junichi Yamagishi

    Abstract: The automatic speaker verification spoofing and countermeasures (ASVspoof) challenge series is a community-led initiative which aims to promote the consideration of spoofing and the development of countermeasures. ASVspoof 2021 is the 4th in a series of bi-annual, competitive challenges where the goal is to develop countermeasures capable of discriminating between bona fide and spoofed or deepfake…

    Submitted 1 September, 2021; originally announced September 2021.

    Comments: http://www.asvspoof.org

  37. arXiv:2109.00281  [pdf, other]

    cs.CR cs.SD eess.AS

    Benchmarking and challenges in security and privacy for voice biometrics

    Authors: Jean-Francois Bonastre, Hector Delgado, Nicholas Evans, Tomi Kinnunen, Kong Aik Lee, Xuechen Liu, Andreas Nautsch, Paul-Gauthier Noe, Jose Patino, Md Sahidullah, Brij Mohan Lal Srivastava, Massimiliano Todisco, Natalia Tomashenko, Emmanuel Vincent, Xin Wang, Junichi Yamagishi

    Abstract: For many decades, research in speech technologies has focused upon improving reliability. With this now meeting user expectations for a range of diverse applications, speech technology is today omni-present. As a result, a focus on security and privacy has now come to the fore. Here, the research effort is in its relative infancy and progress calls for greater, multidisciplinary collaboration with s…

    Submitted 1 September, 2021; originally announced September 2021.

    Comments: Submitted to the symposium of the ISCA Security & Privacy in Speech Communications (SPSC) special interest group

  38. arXiv:2107.04057  [pdf, other]

    cs.SD cs.LG eess.AS

    Machine Learning for Stuttering Identification: Review, Challenges and Future Directions

    Authors: Shakeel Ahmad Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni

    Abstract: Stuttering is a speech disorder during which the flow of speech is interrupted by involuntary pauses and repetition of sounds. Stuttering identification is an interesting interdisciplinary domain research problem which involves pathology, psychology, acoustics, and signal processing that makes it hard and complicated to detect. Recent developments in machine and deep learning have dramatically rev…

    Submitted 16 November, 2022; v1 submitted 8 July, 2021; originally announced July 2021.

    Comments: Accepted in Journal of Neurocomputing 2022 https://doi.org/10.1016/j.neucom.2022.10.015

  39. arXiv:2106.06362  [pdf, other]

    cs.SD cs.LG eess.AS stat.AP

    Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing

    Authors: Tomi Kinnunen, Andreas Nautsch, Md Sahidullah, Nicholas Evans, Xin Wang, Massimiliano Todisco, Héctor Delgado, Junichi Yamagishi, Kong Aik Lee

    Abstract: Whether it be for results summarization, or the analysis of classifier fusion, some means to compare different classifiers can often provide illuminating insight into their behaviour, (dis)similarity or complementarity. We propose a simple method to derive 2D representation from detection scores produced by an arbitrary set of binary classifiers in response to a common dataset. Based upon rank cor…

    Submitted 11 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021. Example code available at https://github.com/asvspoof-challenge/classifier-adjacency

  40. arXiv:2105.11728  [pdf]

    cs.LG eess.SP

    Utterance partitioning for speaker recognition: an experimental review and analysis with new findings under GMM-SVM framework

    Authors: Nirmalya Sen, Md Sahidullah, Hemant Patil, Shyamal Kumar das Mandal, Sreenivasa Krothapalli Rao, Tapan Kumar Basu

    Abstract: The performance of speaker recognition system is highly dependent on the amount of speech used in enrollment and test. This work presents a detailed experimental review and analysis of the GMM-SVM based speaker recognition system in presence of duration variability. This article also reports a comparison of the performance of GMM-SVM classifier with its precursor technique Gaussian mixture model-u…

    Submitted 25 May, 2021; originally announced May 2021.

    Comments: International Journal of Speech Technology, Springer Verlag, In press

  41. arXiv:2105.05599  [pdf, other]

    eess.AS cs.LG cs.SD

    StutterNet: Stuttering Detection Using Time Delay Neural Network

    Authors: Shakeel A. Sheikh, Md Sahidullah, Fabrice Hirsch, Slim Ouni

    Abstract: This paper introduces StutterNet, a novel deep learning based stuttering detection capable of detecting and identifying various types of disfluencies. Most of the existing work in this domain uses automatic speech recognition (ASR) combined with language models for stuttering detection. Compared to the existing work, which depends on the ASR module, our method relies solely on the acoustic signal.…

    Submitted 8 June, 2021; v1 submitted 12 May, 2021; originally announced May 2021.

    Comments: Accepted in EUSIPCO 2021: European Signal Processing Conference

  42. arXiv:2105.04806  [pdf, ps, other]

    eess.AS cs.LG cs.SD eess.SP

    Deep scattering network for speech emotion recognition

    Authors: Premjeet Singh, Goutam Saha, Md Sahidullah

    Abstract: This paper introduces scattering transform for speech emotion recognition (SER). Scattering transform generates feature representations which remain stable to deformations and shifting in time and frequency without much loss of information. In speech, the emotion cues are spread across time and localised in frequency. The time and frequency invariance characteristic of scattering coefficients prov…

    Submitted 11 May, 2021; originally announced May 2021.

    Comments: 5 pages, 4 figures, Accepted for publication in 2021 European Signal Processing Conference (EUSIPCO 2021)

  43. arXiv:2105.04639  [pdf, ps, other]

    eess.AS cs.SD

    Cross-Corpora Language Recognition: A Preliminary Investigation with Indian Languages

    Authors: Spandan Dey, Goutam Saha, Md Sahidullah

    Abstract: In this paper, we conduct one of the very first studies for cross-corpora performance evaluation in the spoken language identification (LID) problem. Cross-corpora evaluation was not explored much in LID research, especially for the Indian languages. We have selected three Indian spoken language corpora: IIITH-ILSC, LDC South Asian, and IITKGP-MLILSC. For each of the corpus, LID systems are traine…

    Submitted 12 May, 2021; v1 submitted 10 May, 2021; originally announced May 2021.

    Comments: Accepted in EUSIPCO 2021: European Signal Processing Conference

  44. arXiv:2103.14602  [pdf, other]

    eess.AS cs.CV cs.LG cs.SD

    Data Quality as Predictor of Voice Anti-Spoofing Generalization

    Authors: Bhusan Chettri, Rosa González Hautamäki, Md Sahidullah, Tomi Kinnunen

    Abstract: Voice anti-spoofing aims at classifying a given utterance either as a bonafide human sample, or a spoofing attack (e.g. synthetic or replayed sample). Many anti-spoofing methods have been proposed but most of them fail to generalize across domains (corpora) -- and we do not know \emph{why}. We outline a novel interpretative framework for gauging the impact of data quality upon anti-spoofing perfor…

    Submitted 21 June, 2021; v1 submitted 26 March, 2021; originally announced March 2021.

    Comments: INTERSPEECH 2021

  45. arXiv:2102.10322  [pdf, other]

    cs.SD cs.LG eess.AS

    Learnable MFCCs for Speaker Verification

    Authors: Xuechen Liu, Md Sahidullah, Tomi Kinnunen

    Abstract: We propose a learnable mel-frequency cepstral coefficient (MFCC) frontend architecture for deep neural network (DNN) based automatic speaker verification. Our architecture retains the simplicity and interpretability of MFCC-based features while allowing the model to be adapted to data flexibly. In practice, we formulate data-driven versions of the four linear transforms of a standard MFCC extracto…

    Submitted 20 February, 2021; originally announced February 2021.

    Comments: Accepted to ISCAS 2021

  46. arXiv:2102.09939  [pdf, ps, other]

    eess.AS cs.SD eess.SP

    ABSP System for The Third DIHARD Challenge

    Authors: A Kishore Kumar, Shefali Waldekar, Goutam Saha, Md Sahidullah

    Abstract: This report describes the speaker diarization system developed by the ABSP Laboratory team for the third DIHARD speech diarization challenge. Our primary contribution is to develop acoustic domain identification (ADI) system for speaker diarization. We investigate speaker embeddings based ADI system. We apply a domain-dependent threshold for agglomerative hierarchical clustering. Besides, we optim…

    Submitted 10 February, 2021; originally announced February 2021.

  47. arXiv:2102.05889  [pdf, other]

    eess.AS cs.CR cs.SD

    ASVspoof 2019: spoofing countermeasures for the detection of synthesized, converted and replayed speech

    Authors: Andreas Nautsch, Xin Wang, Nicholas Evans, Tomi Kinnunen, Ville Vestman, Massimiliano Todisco, Héctor Delgado, Md Sahidullah, Junichi Yamagishi, Kong Aik Lee

    Abstract: The ASVspoof initiative was conceived to spearhead research in anti-spoofing for automatic speaker verification (ASV). This paper describes the third in a series of bi-annual challenges: ASVspoof 2019. With the challenge database and protocols being described elsewhere, the focus of this paper is on results and the top performing single and ensemble system submissions from 62 teams, all of which o…

    Submitted 11 February, 2021; originally announced February 2021.

    Journal ref: IEEE Transactions on Biometrics, Behavior, and Identity Science 2021

  48. arXiv:2102.04029  [pdf, ps, other]

    eess.AS cs.LG cs.SD eess.SP

    Non-linear frequency warping using constant-Q transformation for speech emotion recognition

    Authors: Premjeet Singh, Goutam Saha, Md Sahidullah

    Abstract: In this work, we explore the constant-Q transform (CQT) for speech emotion recognition (SER). The CQT-based time-frequency analysis provides variable spectro-temporal resolution with higher frequency resolution at lower frequencies. Since lower-frequency regions of speech signal contain more emotion-related information than higher-frequency regions, the increased low-frequency resolution of CQT ma…

    Submitted 8 February, 2021; originally announced February 2021.

    Comments: Accepted for publication in 2021 IEEE International Conference on Computer Communication and Informatics (IEEE ICCCI 2021)

  49. arXiv:2102.02074  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Data Generation Using Pass-phrase-dependent Deep Auto-encoders for Text-Dependent Speaker Verification

    Authors: Achintya Kumar Sarkar, Md Sahidullah, Zheng-Hua Tan

    Abstract: In this paper, we propose a novel method that trains pass-phrase specific deep neural network (PP-DNN) based auto-encoders for creating augmented data for text-dependent speaker verification (TD-SV). Each PP-DNN auto-encoder is trained using the utterances of a particular pass-phrase available in the target enrollment set with two methods: (i) transfer learning and (ii) training from scratch. Next…

    Submitted 3 February, 2021; originally announced February 2021.

  50. arXiv:2101.09884  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Domain-Dependent Speaker Diarization for the Third DIHARD Challenge

    Authors: A Kishore Kumar, Shefali Waldekar, Goutam Saha, Md Sahidullah

    Abstract: This report presents the system developed by the ABSP Laboratory team for the third DIHARD speech diarization challenge. Our main contribution in this work is to develop a simple and efficient solution for acoustic domain dependent speech diarization. We explore speaker embeddings for \emph{acoustic domain identification} (ADI) task. Our study reveals that i-vector based method achieves considerab…

    Submitted 24 January, 2021; originally announced January 2021.

    Comments: This work was presented in The Third DIHARD Speech Diarization Challenge Workshop