subscribe to arXiv mailings

On-Device LLMs for SMEs: Challenges and Opportunities

Authors: Jeremy Stephen Gabriel Yee, Pai Chet Ng, Zhengkui Wang, Ian McLoughlin, Aik Beng Ng, Simon See

Abstract: This paper presents a systematic review of the infrastructure requirements for deploying Large Language Models (LLMs) on-device within the context of small and medium-sized enterprises (SMEs), focusing on both hardware and software perspectives. From the hardware viewpoint, we discuss the utilization of processing units like GPUs and TPUs, efficient memory and storage solutions, and strategies for… ▽ More This paper presents a systematic review of the infrastructure requirements for deploying Large Language Models (LLMs) on-device within the context of small and medium-sized enterprises (SMEs), focusing on both hardware and software perspectives. From the hardware viewpoint, we discuss the utilization of processing units like GPUs and TPUs, efficient memory and storage solutions, and strategies for effective deployment, addressing the challenges of limited computational resources typical in SME settings. From the software perspective, we explore framework compatibility, operating system optimization, and the use of specialized libraries tailored for resource-constrained environments. The review is structured to first identify the unique challenges faced by SMEs in deploying LLMs on-device, followed by an exploration of the opportunities that both hardware innovations and software adaptations offer to overcome these obstacles. Such a structured review provides practical insights, contributing significantly to the community by enhancing the technological resilience of SMEs in integrating LLMs. △ Less

Submitted 22 October, 2024; v1 submitted 21 October, 2024; originally announced October 2024.

Comments: 9 pages, 1 figure. The work is supported by the SIT-NVIDIA Joint AI Centre

MSC Class: 68T07 ACM Class: I.2

arXiv:2409.17656 [pdf, other]

Prototype based Masked Audio Model for Self-Supervised Learning of Sound Event Detection

Authors: Pengfei Cai, Yan Song, Nan Jiang, Qing Gu, Ian McLoughlin

Abstract: A significant challenge in sound event detection (SED) is the effective utilization of unlabeled data, given the limited availability of labeled data due to high annotation costs. Semi-supervised algorithms rely on labeled data to learn from unlabeled data, and the performance is constrained by the quality and size of the former. In this paper, we introduce the Prototype based Masked Audio Model~(… ▽ More A significant challenge in sound event detection (SED) is the effective utilization of unlabeled data, given the limited availability of labeled data due to high annotation costs. Semi-supervised algorithms rely on labeled data to learn from unlabeled data, and the performance is constrained by the quality and size of the former. In this paper, we introduce the Prototype based Masked Audio Model~(PMAM) algorithm for self-supervised representation learning in SED, to better exploit unlabeled data. Specifically, semantically rich frame-level pseudo labels are constructed from a Gaussian mixture model (GMM) based prototypical distribution modeling. These pseudo labels supervise the learning of a Transformer-based masked audio model, in which binary cross-entropy loss is employed instead of the widely used InfoNCE loss, to provide independent loss contributions from different prototypes, which is important in real scenarios in which multiple labels may apply to unsupervised data frames. A final stage of fine-tuning with just a small amount of labeled data yields a very high performing SED model. On like-for-like tests using the DESED task, our method achieves a PSDS1 score of 62.5\%, surpassing current state-of-the-art models and demonstrating the superiority of the proposed technique. △ Less

Submitted 26 September, 2024; originally announced September 2024.

Comments: Submitted to ICASSP2025; The code for this paper will be available at https://github.com/cai525/Transformer4SED after the paper is accepted

arXiv:2408.08673 [pdf, other]

MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection

Authors: Pengfei Cai, Yan Song, Kang Li, Haoyu Song, Ian McLoughlin

Abstract: Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction based pre-training,… ▽ More Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction based pre-training, termed MAT-SED. Specifically, a Transformer with relative positional encoding is first designed as the context network, pre-trained by the masked-reconstruction task on all available target data in a self-supervised way. Both the encoder and the context network are jointly fine-tuned in a semi-supervised manner. Furthermore, a global-local feature fusion strategy is proposed to enhance the localization capability. Evaluation of MAT-SED on DCASE2023 task4 surpasses state-of-the-art performance, achieving 0.587/0.896 PSDS1/PSDS2 respectively. △ Less

Submitted 19 August, 2024; v1 submitted 16 August, 2024; originally announced August 2024.

Comments: Received by interspeech 2024

arXiv:2408.05928 [pdf, other]

Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation

Authors: Xiaoxiao Miao, Yuxiang Zhang, Xin Wang, Natalia Tomashenko, Donny Cheng Lock Soh, Ian Mcloughlin

Abstract: A general disentanglement-based speaker anonymization system typically separates speech into content, speaker, and prosody features using individual encoders. This paper explores how to adapt such a system when a new speech attribute, for example, emotion, needs to be preserved to a greater extent. While existing systems are good at anonymizing speaker embeddings, they are not designed to preserve… ▽ More A general disentanglement-based speaker anonymization system typically separates speech into content, speaker, and prosody features using individual encoders. This paper explores how to adapt such a system when a new speech attribute, for example, emotion, needs to be preserved to a greater extent. While existing systems are good at anonymizing speaker embeddings, they are not designed to preserve emotion. Two strategies for this are examined. First, we show that integrating emotion embeddings from a pre-trained emotion encoder can help preserve emotional cues, even though this approach slightly compromises privacy protection. Alternatively, we propose an emotion compensation strategy as a post-processing step applied to anonymized speaker embeddings. This conceals the original speaker's identity and reintroduces the emotional traits lost during speaker embedding anonymization. Specifically, we model the emotion attribute using support vector machines to learn separate boundaries for each emotion. During inference, the original speaker embedding is processed in two ways: one, by an emotion indicator to predict emotion and select the emotion-matched SVM accurately; and two, by a speaker anonymizer to conceal speaker characteristics. The anonymized speaker embedding is then modified along the corresponding SVM boundary towards an enhanced emotional direction to save the emotional cues. The proposed strategies are also expected to be useful for adapting a general disentanglement-based speaker anonymization system to preserve other target paralinguistic attributes, with potential for a range of downstream tasks. △ Less

Submitted 12 August, 2024; originally announced August 2024.

arXiv:2305.12111 [pdf, other]

Joint Generative-Contrastive Representation Learning for Anomalous Sound Detection

Authors: Xiao-Min Zeng, Yan Song, Zhu Zhuo, Yu Zhou, Yu-Hong Li, Hui Xue, Li-Rong Dai, Ian McLoughlin

Abstract: In this paper, we propose a joint generative and contrastive representation learning method (GeCo) for anomalous sound detection (ASD). GeCo exploits a Predictive AutoEncoder (PAE) equipped with self-attention as a generative model to perform frame-level prediction. The output of the PAE together with original normal samples, are used for supervised contrastive representative learning in a multi-t… ▽ More In this paper, we propose a joint generative and contrastive representation learning method (GeCo) for anomalous sound detection (ASD). GeCo exploits a Predictive AutoEncoder (PAE) equipped with self-attention as a generative model to perform frame-level prediction. The output of the PAE together with original normal samples, are used for supervised contrastive representative learning in a multi-task framework. Besides cross-entropy loss between classes, contrastive loss is used to separate PAE output and original samples within each class. GeCo aims to better capture context information among frames, thanks to the self-attention mechanism for PAE model. Furthermore, GeCo combines generative and contrastive learning from which we aim to yield more effective and informative representations, compared to existing methods. Extensive experiments have been conducted on the DCASE2020 Task2 development dataset, showing that GeCo outperforms state-of-the-art generative and discriminative methods. △ Less

Submitted 20 May, 2023; originally announced May 2023.

Comments: Accepted by ICASSP2023

arXiv:2303.03689 [pdf, other]

AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer

Authors: Kang Li, Yan Song, Li-Rong Dai, Ian McLoughlin, Xin Fang, Lin Liu

Abstract: In this paper, we propose an effective sound event detection (SED) method based on the audio spectrogram transformer (AST) model, pretrained on the large-scale AudioSet for audio tagging (AT) task, termed AST-SED. Pretrained AST models have recently shown promise on DCASE2022 challenge task4 where they help mitigate a lack of sufficient real annotated data. However, mainly due to differences betwe… ▽ More In this paper, we propose an effective sound event detection (SED) method based on the audio spectrogram transformer (AST) model, pretrained on the large-scale AudioSet for audio tagging (AT) task, termed AST-SED. Pretrained AST models have recently shown promise on DCASE2022 challenge task4 where they help mitigate a lack of sufficient real annotated data. However, mainly due to differences between the AT and SED tasks, it is suboptimal to directly utilize outputs from a pretrained AST model. Hence the proposed AST-SED adopts an encoder-decoder architecture to enable effective and efficient fine-tuning without needing to redesign or retrain the AST model. Specifically, the Frequency-wise Transformer Encoder (FTE) consists of transformers with self attention along the frequency axis to address multiple overlapped audio events issue in a single clip. The Local Gated Recurrent Units Decoder (LGD) consists of nearest-neighbor interpolation (NNI) and Bidirectional Gated Recurrent Units (Bi-GRU) to compensate for temporal resolution loss in the pretrained AST model output. Experimental results on DCASE2022 task4 development set have demonstrated the superiority of the proposed AST-SED with FTE-LGD architecture. Specifically, the Event-Based F1-score (EB-F1) of 59.60% and Polyphonic Sound detection Score scenario1 (PSDS1) score of 0.5140 significantly outperform CRNN and other pretrained AST-based systems. △ Less

Submitted 7 March, 2023; originally announced March 2023.

Comments: accepted to ICASSP 2023

arXiv:2302.13028 [pdf, other]

A Light-weight Deep Learning Model for Remote Sensing Image Classification

Authors: Lam Pham, Cam Le, Dat Ngo, Anh Nguyen, Jasmin Lampert, Alexander Schindler, Ian McLoughlin

Abstract: In this paper, we present a high-performance and light-weight deep learning model for Remote Sensing Image Classification (RSIC), the task of identifying the aerial scene of a remote sensing image. To this end, we first valuate various benchmark convolutional neural network (CNN) architectures: MobileNet V1/V2, ResNet 50/151V2, InceptionV3/InceptionResNetV2, EfficientNet B0/B7, DenseNet 121/201, C… ▽ More In this paper, we present a high-performance and light-weight deep learning model for Remote Sensing Image Classification (RSIC), the task of identifying the aerial scene of a remote sensing image. To this end, we first valuate various benchmark convolutional neural network (CNN) architectures: MobileNet V1/V2, ResNet 50/151V2, InceptionV3/InceptionResNetV2, EfficientNet B0/B7, DenseNet 121/201, ConNeXt Tiny/Large. Then, the best performing models are selected to train a compact model in a teacher-student arrangement. The knowledge distillation from the teacher aims to achieve high performance with significantly reduced complexity. By conducting extensive experiments on the NWPU-RESISC45 benchmark, our proposed teacher-student models outperforms the state-of-the-art systems, and has potential to be applied on a wide rage of edge devices. △ Less

Submitted 25 February, 2023; originally announced February 2023.

arXiv:2206.08317 [pdf, other]

Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition

Authors: Zhifu Gao, Shiliang Zhang, Ian McLoughlin, Zhijie Yan

Abstract: Transformers have recently dominated the ASR field. Although able to yield good performance, they involve an autoregressive (AR) decoder to generate tokens one by one, which is computationally inefficient. To speed up inference, non-autoregressive (NAR) methods, e.g. single-step NAR, were designed, to enable parallel generation. However, due to an independence assumption within the output tokens,… ▽ More Transformers have recently dominated the ASR field. Although able to yield good performance, they involve an autoregressive (AR) decoder to generate tokens one by one, which is computationally inefficient. To speed up inference, non-autoregressive (NAR) methods, e.g. single-step NAR, were designed, to enable parallel generation. However, due to an independence assumption within the output tokens, performance of single-step NAR is inferior to that of AR models, especially with a large-scale corpus. There are two challenges to improving single-step NAR: Firstly to accurately predict the number of output tokens and extract hidden variables; secondly, to enhance modeling of interdependence between output tokens. To tackle both challenges, we propose a fast and accurate parallel transformer, termed Paraformer. This utilizes a continuous integrate-and-fire based predictor to predict the number of tokens and generate hidden variables. A glancing language model (GLM) sampler then generates semantic embeddings to enhance the NAR decoder's ability to model context interdependence. Finally, we design a strategy to generate negative samples for minimum word error rate training to further improve performance. Experiments using the public AISHELL-1, AISHELL-2 benchmark, and an industrial-level 20,000 hour task demonstrate that the proposed Paraformer can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup. △ Less

Submitted 30 March, 2023; v1 submitted 16 June, 2022; originally announced June 2022.

Comments: 5 pages, 3 figures, accepted by INTERSPEECH 2022

arXiv:2201.03054 [pdf, ps, other]

An Ensemble of Deep Learning Frameworks Applied For Predicting Respiratory Anomalies

Authors: Lam Pham, Dat Ngo, Truong Hoang, Alexander Schindler, Ian McLoughlin

Abstract: In this paper, we evaluate various deep learning frameworks for detecting respiratory anomalies from input audio recordings. To this end, we firstly transform audio respiratory cycles collected from patients into spectrograms where both temporal and spectral features are presented, referred to as the front-end feature extraction. We then feed the spectrograms into back-end deep learning networks f… ▽ More In this paper, we evaluate various deep learning frameworks for detecting respiratory anomalies from input audio recordings. To this end, we firstly transform audio respiratory cycles collected from patients into spectrograms where both temporal and spectral features are presented, referred to as the front-end feature extraction. We then feed the spectrograms into back-end deep learning networks for classifying these respiratory cycles into certain categories. Finally, results from high-performed deep learning frameworks are fused to obtain the best score. Our experiments on ICBHI benchmark dataset achieve the highest ICBHI score of 57.3 from a late fusion of inception based and transfer learning based deep learning frameworks, which outperforms the state-of-the-art systems. △ Less

Submitted 9 January, 2022; originally announced January 2022.

arXiv:2104.05784 [pdf, other]

Extremely Low Footprint End-to-End ASR System for Smart Device

Authors: Zhifu Gao, Yiwu Yao, Shiliang Zhang, Jun Yang, Ming Lei, Ian McLoughlin

Abstract: Recently, end-to-end (E2E) speech recognition has become popular, since it can integrate the acoustic, pronunciation and language models into a single neural network, which outperforms conventional models. Among E2E approaches, attention-based models, e.g. Transformer, have emerged as being superior. Such models have opened the door to deployment of ASR on smart devices, however they still suffer… ▽ More Recently, end-to-end (E2E) speech recognition has become popular, since it can integrate the acoustic, pronunciation and language models into a single neural network, which outperforms conventional models. Among E2E approaches, attention-based models, e.g. Transformer, have emerged as being superior. Such models have opened the door to deployment of ASR on smart devices, however they still suffer from requiring a large number of model parameters. We propose an extremely low footprint E2E ASR system for smart devices, to achieve the goal of satisfying resource constraints without sacrificing recognition accuracy. We design cross-layer weight sharing to improve parameter efficiency and further exploit model compression methods including sparsification and quantization, to reduce memory storage and boost decoding efficiency. We evaluate our approaches on the public AISHELL-1 and AISHELL-2 benchmarks. On the AISHELL-2 task, the proposed method achieves more than 10x compression (model size reduces from 248 to 24MB), at the cost of only minor performance loss (CER reduces from 6.49% to 6.92%). △ Less

Submitted 6 July, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

Comments: 5 pages, 2 figures, accepted by INTERSPEECH 2021

arXiv:2103.02420 [pdf, ps, other]

Multi-view Audio and Music Classification

Authors: Huy Phan, Huy Le Nguyen, Oliver Y. Chén, Lam Pham, Philipp Koch, Ian McLoughlin, Alfred Mertins

Abstract: We propose in this work a multi-view learning approach for audio and music classification. Considering four typical low-level representations (i.e. different views) commonly used for audio and music recognition tasks, the proposed multi-view network consists of four subnetworks, each handling one input types. The learned embedding in the subnetworks are then concatenated to form the multi-view emb… ▽ More We propose in this work a multi-view learning approach for audio and music classification. Considering four typical low-level representations (i.e. different views) commonly used for audio and music recognition tasks, the proposed multi-view network consists of four subnetworks, each handling one input types. The learned embedding in the subnetworks are then concatenated to form the multi-view embedding for classification similar to a simple concatenation network. However, apart from the joint classification branch, the network also maintains four classification branches on the single-view embedding of the subnetworks. A novel method is then proposed to keep track of the learning behavior on the classification branches and adapt their weights to proportionally blend their gradients for network training. The weights are adapted in such a way that learning on a branch that is generalizing well will be encouraged whereas learning on a branch that is overfitting will be slowed down. Experiments on three different audio and music classification tasks show that the proposed multi-view network not only outperforms the single-view baselines but also is superior to the multi-view baselines based on concatenation and late fusion. △ Less

Submitted 3 March, 2021; originally announced March 2021.

Comments: Accepted to ICASSP 2021

arXiv:2012.13699 [pdf, ps, other]

Inception-Based Network and Multi-Spectrogram Ensemble Applied For Predicting Respiratory Anomalies and Lung Diseases

Authors: Lam Pham, Huy Phan, Ross King, Alfred Mertins, Ian McLoughlin

Abstract: This paper presents an inception-based deep neural network for detecting lung diseases using respiratory sound input. Recordings of respiratory sound collected from patients are firstly transformed into spectrograms where both spectral and temporal information are well presented, referred to as front-end feature extraction. These spectrograms are then fed into the proposed network, referred to as… ▽ More This paper presents an inception-based deep neural network for detecting lung diseases using respiratory sound input. Recordings of respiratory sound collected from patients are firstly transformed into spectrograms where both spectral and temporal information are well presented, referred to as front-end feature extraction. These spectrograms are then fed into the proposed network, referred to as back-end classification, for detecting whether patients suffer from lung-relevant diseases. Our experiments, conducted over the ICBHI benchmark meta-dataset of respiratory sound, achieve competitive ICBHI scores of 0.53/0.45 and 0.87/0.85 regarding respiratory anomaly and disease detection, respectively. △ Less

Submitted 26 December, 2020; originally announced December 2020.

arXiv:2010.14099 [pdf, other]

Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model

Authors: Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin

Abstract: Recently, online end-to-end ASR has gained increasing attention. However, the performance of online systems still lags far behind that of offline systems, with a large gap in quality of recognition. For specific scenarios, we can trade-off between performance and latency, and can train multiple systems with different delays to match the performance and latency requirements of various application s… ▽ More Recently, online end-to-end ASR has gained increasing attention. However, the performance of online systems still lags far behind that of offline systems, with a large gap in quality of recognition. For specific scenarios, we can trade-off between performance and latency, and can train multiple systems with different delays to match the performance and latency requirements of various application scenarios. In this work, in contrast to trading-off between performance and latency, we envisage a single system that can match the needs of different scenarios. We propose a novel architecture, termed Universal ASR that can unify streaming and non-streaming ASR models into one system. The embedded streaming ASR model can configure different delays according to requirements to obtain real-time recognition results, while the non-streaming model is able to refresh the final recognition result for better performance. We have evaluated our approach on the public AISHELL-2 benchmark and an industrial-level 20,000-hour Mandarin speech recognition task. The experimental results show that the Universal ASR provides an efficient mechanism to integrate streaming and non-streaming models that can recognize speech quickly and accurately. On the AISHELL-2 task, Universal ASR comfortably outperforms other state-of-the-art systems. △ Less

Submitted 27 October, 2020; originally announced October 2020.

Comments: 5 pages, 2 figures, submitted to ICASSP 2021

arXiv:2010.10584 [pdf, other]

doi 10.1109/ACCESS.2021.3058579

Incandescent Bulb and LED Brake Lights:Novel Analysis of Reaction Times

Authors: Ramaswamy Palaniappan, Surej Mouli, Evangelina Fringi, Howard Bowman, Ian McLoughlin

Abstract: Rear-end collision accounts for around 8% of all vehicle crashes in the UK, with the failure to notice or react to a brake light signal being a major contributory cause. Meanwhile traditional incandescent brake light bulbs on vehicles are increasingly being replaced by a profusion of designs featuring LEDs. In this paper, we investigate the efficacy of brake light design using a novel approach to… ▽ More Rear-end collision accounts for around 8% of all vehicle crashes in the UK, with the failure to notice or react to a brake light signal being a major contributory cause. Meanwhile traditional incandescent brake light bulbs on vehicles are increasingly being replaced by a profusion of designs featuring LEDs. In this paper, we investigate the efficacy of brake light design using a novel approach to recording subject reaction times in a simulation setting using physical brake light assemblies. The reaction times of 22 subjects were measured for ten pairs of LED and incandescent bulb brake lights. Three events were investigated for each subject, namely the latency of brake light activation to accelerator release (BrakeAcc), the latency of accelerator release to brake pedal depression (AccPdl), and the cumulative time from light activation to brake pedal depression (BrakePdl). To our knowledge, this is the first study in which reaction times have been split into BrakeAcc and AccPdl. Results indicate that the two brake lights containing incandescent bulbs led to significantly slower reaction times compared to the tested eight LED lights. BrakeAcc results also show that experienced subjects were quicker to respond to the activation of brake lights by releasing the accelerator pedal. Interestingly, the analysis also revealed that the type of brake light influenced the AccPdl time, although experienced subjects did not always act quicker than inexperienced subjects. Overall, the study found that different designs of brake light can significantly influence driver response times. △ Less

Submitted 20 October, 2020; originally announced October 2020.

Comments: 10 pages, 18 figures

Journal ref: For a revised version and its published version refer to IEEE Access journal, 2021

arXiv:2010.09132 [pdf, ps, other]

Self-Attention Generative Adversarial Network for Speech Enhancement

Authors: Huy Phan, Huy Le Nguyen, Oliver Y. Chén, Philipp Koch, Ngoc Q. K. Duong, Ian McLoughlin, Alfred Mertins

Abstract: Existing generative adversarial networks (GANs) for speech enhancement solely rely on the convolution operation, which may obscure temporal dependencies across the sequence input. To remedy this issue, we propose a self-attention layer adapted from non-local attention, coupled with the convolutional and deconvolutional layers of a speech enhancement GAN (SEGAN) using raw signal input. Further, we… ▽ More Existing generative adversarial networks (GANs) for speech enhancement solely rely on the convolution operation, which may obscure temporal dependencies across the sequence input. To remedy this issue, we propose a self-attention layer adapted from non-local attention, coupled with the convolutional and deconvolutional layers of a speech enhancement GAN (SEGAN) using raw signal input. Further, we empirically study the effect of placing the self-attention layer at the (de)convolutional layers with varying layer indices as well as at all of them when memory allows. Our experiments show that introducing self-attention to SEGAN leads to consistent improvement across the objective evaluation metrics of enhancement performance. Furthermore, applying at different (de)convolutional layers does not significantly alter performance, suggesting that it can be conveniently applied at the highest-level (de)convolutional layer with the smallest memory overhead. △ Less

Submitted 6 February, 2021; v1 submitted 18 October, 2020; originally announced October 2020.

Comments: 46th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2021). Source code is available at http://github.com/pquochuy/sasegan

arXiv:2009.05527 [pdf, ps, other]

On Multitask Loss Function for Audio Event Detection and Localization

Authors: Huy Phan, Lam Pham, Philipp Koch, Ngoc Q. K. Duong, Ian McLoughlin, Alfred Mertins

Abstract: Audio event localization and detection (SELD) have been commonly tackled using multitask models. Such a model usually consists of a multi-label event classification branch with sigmoid cross-entropy loss for event activity detection and a regression branch with mean squared error loss for direction-of-arrival estimation. In this work, we propose a multitask regression model, in which both (multi-l… ▽ More Audio event localization and detection (SELD) have been commonly tackled using multitask models. Such a model usually consists of a multi-label event classification branch with sigmoid cross-entropy loss for event activity detection and a regression branch with mean squared error loss for direction-of-arrival estimation. In this work, we propose a multitask regression model, in which both (multi-label) event detection and localization are formulated as regression problems and use the mean squared error loss homogeneously for model training. We show that the common combination of heterogeneous loss functions causes the network to underfit the data whereas the homogeneous mean squared error loss leads to better convergence and performance. Experiments on the development and validation sets of the DCASE 2020 SELD task demonstrate that the proposed system also outperforms the DCASE 2020 SELD baseline across all the detection and localization metrics, reducing the overall SELD error (the combined metric) by approximately 10% absolute. △ Less

Submitted 11 September, 2020; originally announced September 2020.

Comments: Accepted for publication in DCASE 2020 Workshop

arXiv:2006.01713 [pdf, other]

SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition

Authors: Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin

Abstract: End-to-end speech recognition has become popular in recent years, since it can integrate the acoustic, pronunciation and language models into a single neural network. Among end-to-end approaches, attention-based methods have emerged as being superior. For example, Transformer, which adopts an encoder-decoder architecture. The key improvement introduced by Transformer is the utilization of self-att… ▽ More End-to-end speech recognition has become popular in recent years, since it can integrate the acoustic, pronunciation and language models into a single neural network. Among end-to-end approaches, attention-based methods have emerged as being superior. For example, Transformer, which adopts an encoder-decoder architecture. The key improvement introduced by Transformer is the utilization of self-attention instead of recurrent mechanisms, enabling both encoder and decoder to capture long-range dependencies with lower computational complexity.In this work, we propose boosting the self-attention ability with a DFSMN memory block, forming the proposed memory equipped self-attention (SAN-M) mechanism. Theoretical and empirical comparisons have been made to demonstrate the relevancy and complementarity between self-attention and the DFSMN memory block. Furthermore, the proposed SAN-M provides an efficient mechanism to integrate these two modules. We have evaluated our approach on the public AISHELL-1 benchmark and an industrial-level 20,000-hour Mandarin speech recognition task. On both tasks, SAN-M systems achieved much better performance than the self-attention based Transformer baseline system. Specially, it can achieve a CER of 6.46% on the AISHELL-1 task even without using any external LM, comfortably outperforming other state-of-the-art systems. △ Less

Submitted 20 May, 2020; originally announced June 2020.

Comments: submitted to INTERSPEECH2020

arXiv:2004.04072 [pdf, ps, other]

CNN-MoE based framework for classification of respiratory anomalies and lung disease detection

Authors: Lam Pham, Huy Phan, Ramaswamy Palaniappan, Alfred Mertins, Ian McLoughlin

Abstract: This paper presents and explores a robust deep learning framework for auscultation analysis. This aims to classify anomalies in respiratory cycles and detect disease, from respiratory sound recordings. The framework begins with front-end feature extraction that transforms input sound into a spectrogram representation. Then, a back-end deep learning network is used to classify the spectrogram featu… ▽ More This paper presents and explores a robust deep learning framework for auscultation analysis. This aims to classify anomalies in respiratory cycles and detect disease, from respiratory sound recordings. The framework begins with front-end feature extraction that transforms input sound into a spectrogram representation. Then, a back-end deep learning network is used to classify the spectrogram features into categories of respiratory anomaly cycles or diseases. Experiments, conducted over the ICBHI benchmark dataset of respiratory sounds, confirm three main contributions towards respiratory-sound analysis. Firstly, we carry out an extensive exploration of the effect of spectrogram type, spectral-time resolution, overlapped/non-overlapped windows, and data augmentation on final prediction accuracy. This leads us to propose a novel deep learning system, built on the proposed framework, which outperforms current state-of-the-art methods. Finally, we apply a Teacher-Student scheme to achieve a trade-off between model performance and model complexity which additionally helps to increase the potential of the proposed framework for building real-time applications. △ Less

Submitted 2 June, 2020; v1 submitted 4 April, 2020; originally announced April 2020.

arXiv:2002.04857 [pdf, ps, other]

Deep Feature Embedding and Hierarchical Classification for Audio Scene Classification

Authors: Lam Pham, Ian McLoughlin, Huy Phan, Ramaswamy Palaniappan, Alfred Mertins

Abstract: In this work, we propose an approach that features deep feature embedding learning and hierarchical classification with triplet loss function for Acoustic Scene Classification (ASC). In the one hand, a deep convolutional neural network is firstly trained to learn a feature embedding from scene audio signals. Via the trained convolutional neural network, the learned embedding embeds an input into t… ▽ More In this work, we propose an approach that features deep feature embedding learning and hierarchical classification with triplet loss function for Acoustic Scene Classification (ASC). In the one hand, a deep convolutional neural network is firstly trained to learn a feature embedding from scene audio signals. Via the trained convolutional neural network, the learned embedding embeds an input into the embedding feature space and transforms it into a high-level feature vector for representation. In the other hand, in order to exploit the structure of the scene categories, the original scene classification problem is structured into a hierarchy where similar categories are grouped into meta-categories. Then, hierarchical classification is accomplished using deep neural network classifiers associated with triplet loss function. Our experiments show that the proposed system achieves good performance on both the DCASE 2018 Task 1A and 1B datasets, resulting in accuracy gains of 15.6% and 16.6% absolute over the DCASE 2018 baseline on Task 1A and 1B, respectively. △ Less

Submitted 12 February, 2020; originally announced February 2020.

arXiv:2002.04502 [pdf, ps, other]

Robust Acoustic Scene Classification using a Multi-Spectrogram Encoder-Decoder Framework

Authors: Lam Pham, Huy Phan, Truc Nguyen, Ramaswamy Palaniappan, Alfred Mertins, Ian McLoughlin

Abstract: This article proposes an encoder-decoder network model for Acoustic Scene Classification (ASC), the task of identifying the scene of an audio recording from its acoustic signature. We make use of multiple low-level spectrogram features at the front-end, transformed into higher level features through a well-trained CNN-DNN front-end encoder. The high level features and their combination (via a trai… ▽ More This article proposes an encoder-decoder network model for Acoustic Scene Classification (ASC), the task of identifying the scene of an audio recording from its acoustic signature. We make use of multiple low-level spectrogram features at the front-end, transformed into higher level features through a well-trained CNN-DNN front-end encoder. The high level features and their combination (via a trained feature combiner) are then fed into different decoder models comprising random forest regression, DNNs and a mixture of experts, for back-end classification. We report extensive experiments to evaluate the accuracy of this framework for various ASC datasets, including LITIS Rouen and IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 Task 1, 2017 Task 1, 2018 Tasks 1A & 1B and 2019 Tasks 1A & 1B. The experimental results highlight two main contributions; the first is an effective method for high-level feature extraction from multi-spectrogram input via the novel C-DNN architecture encoder network, and the second is the proposed decoder which enables the framework to achieve competitive results on various datasets. The fact that a single framework is highly competitive for several different challenges is an indicator of its robustness for performing general ASC tasks. △ Less

Submitted 11 February, 2020; originally announced February 2020.

arXiv:2002.03894 [pdf, other]

Robust Deep Learning Framework For Predicting Respiratory Anomalies and Diseases

Authors: Lam Pham, Ian McLoughlin, Huy Phan, Minh Tran, Truc Nguyen, Ramaswamy Palaniappan

Abstract: This paper presents a robust deep learning framework developed to detect respiratory diseases from recordings of respiratory sounds. The complete detection process firstly involves front end feature extraction where recordings are transformed into spectrograms that convey both spectral and temporal information. Then a back-end deep learning model classifies the features into classes of respiratory… ▽ More This paper presents a robust deep learning framework developed to detect respiratory diseases from recordings of respiratory sounds. The complete detection process firstly involves front end feature extraction where recordings are transformed into spectrograms that convey both spectral and temporal information. Then a back-end deep learning model classifies the features into classes of respiratory disease or anomaly. Experiments, conducted over the ICBHI benchmark dataset of respiratory sounds, evaluate the ability of the framework to classify sounds. Two main contributions are made in this paper. Firstly, we provide an extensive analysis of how factors such as respiratory cycle length, time resolution, and network architecture, affect final prediction accuracy. Secondly, a novel deep learning based framework is proposed for detection of respiratory diseases and shown to perform extremely well compared to state of the art methods. △ Less

Submitted 21 January, 2020; originally announced February 2020.

arXiv:2001.05532 [pdf, other]

doi 10.1109/LSP.2020.3025020

Improving GANs for Speech Enhancement

Authors: Huy Phan, Ian V. McLoughlin, Lam Pham, Oliver Y. Chén, Philipp Koch, Maarten De Vos, Alfred Mertins

Abstract: Generative adversarial networks (GAN) have recently been shown to be efficient for speech enhancement. However, most, if not all, existing speech enhancement GANs (SEGAN) make use of a single generator to perform one-stage enhancement mapping. In this work, we propose to use multiple generators that are chained to perform multi-stage enhancement mapping, which gradually refines the noisy input sig… ▽ More Generative adversarial networks (GAN) have recently been shown to be efficient for speech enhancement. However, most, if not all, existing speech enhancement GANs (SEGAN) make use of a single generator to perform one-stage enhancement mapping. In this work, we propose to use multiple generators that are chained to perform multi-stage enhancement mapping, which gradually refines the noisy input signals in a stage-wise fashion. Furthermore, we study two scenarios: (1) the generators share their parameters and (2) the generators' parameters are independent. The former constrains the generators to learn a common mapping that is iteratively applied at all enhancement stages and results in a small model footprint. On the contrary, the latter allows the generators to flexibly learn different enhancement mappings at different stages of the network at the cost of an increased model size. We demonstrate that the proposed multi-stage enhancement approach outperforms the one-stage SEGAN baseline, where the independent generators lead to more favorable results than the tied generators. The source code is available at http://github.com/pquochuy/idsegan. △ Less

Submitted 12 September, 2020; v1 submitted 15 January, 2020; originally announced January 2020.

Comments: This letter has been accepted for publication in IEEE Signal Processing Letters

arXiv:1912.09003 [pdf, ps, other]

LSTM-TDNN with convolutional front-end for Dialect Identification in the 2019 Multi-Genre Broadcast Challenge

Authors: Xiaoxiao Miao, Ian McLoughlin

Abstract: This paper presents a novel Dialect Identification (DID) system developed for the Fifth Edition of the Multi-Genre Broadcast challenge, the task of Fine-grained Arabic Dialect Identification (MGB-5 ADI Challenge). The system improves upon traditional DNN x-vector performance by employing a Convolutional and Long Short Term Memory-Recurrent (CLSTM) architecture to combine the benefits of a convolut… ▽ More This paper presents a novel Dialect Identification (DID) system developed for the Fifth Edition of the Multi-Genre Broadcast challenge, the task of Fine-grained Arabic Dialect Identification (MGB-5 ADI Challenge). The system improves upon traditional DNN x-vector performance by employing a Convolutional and Long Short Term Memory-Recurrent (CLSTM) architecture to combine the benefits of a convolutional neural network front-end for feature extraction and a back-end recurrent neural to capture longer temporal dependencies. Furthermore we investigate intensive augmentation of one low resource dialect in the highly unbalanced training set using time-scale modification (TSM). This converts an utterance to several time-stretched or time-compressed versions, subsequently used to train the CLSTM system without using any other corpus. In this paper, we also investigate speech augmentation using MUSAN and the RIR datasets to increase the quantity and diversity of the existing training data in the normal way. Results show firstly that the CLSTM architecture outperforms a traditional DNN x-vector implementation. Secondly, adopting TSM-based speed perturbation yields a small performance improvement for the unbalanced data, finally that traditional data augmentation techniques yield further benefit, in line with evidence from related speaker and language recognition tasks. Our system achieved 2nd place ranking out of 15 entries in the MGB-5 ADI challenge, presented at ASRU 2019. △ Less

Submitted 18 December, 2019; originally announced December 2019.

arXiv:1907.13177 [pdf, ps, other]

doi 10.1109/TBME.2020.3020381

Towards More Accurate Automatic Sleep Staging via Deep Transfer Learning

Authors: Huy Phan, Oliver Y. Chén, Philipp Koch, Zongqing Lu, Ian McLoughlin, Alfred Mertins, Maarten De Vos

Abstract: Background: Despite recent significant progress in the development of automatic sleep staging methods, building a good model still remains a big challenge for sleep studies with a small cohort due to the data-variability and data-inefficiency issues. This work presents a deep transfer learning approach to overcome these issues and enable transferring knowledge from a large dataset to a small cohor… ▽ More Background: Despite recent significant progress in the development of automatic sleep staging methods, building a good model still remains a big challenge for sleep studies with a small cohort due to the data-variability and data-inefficiency issues. This work presents a deep transfer learning approach to overcome these issues and enable transferring knowledge from a large dataset to a small cohort for automatic sleep staging. Methods: We start from a generic end-to-end deep learning framework for sequence-to-sequence sleep staging and derive two networks as the means for transfer learning. The networks are first trained in the source domain (i.e. the large database). The pretrained networks are then finetuned in the target domain (i.e. the small cohort) to complete knowledge transfer. We employ the Montreal Archive of Sleep Studies (MASS) database consisting of 200 subjects as the source domain and study deep transfer learning on three different target domains: the Sleep Cassette subset and the Sleep Telemetry subset of the Sleep-EDF Expanded database, and the Surrey-cEEGrid database. The target domains are purposely adopted to cover different degrees of data mismatch to the source domains. Results: Our experimental results show significant performance improvement on automatic sleep staging on the target domains achieved with the proposed deep transfer learning approach. Conclusions: These results suggest the efficacy of the proposed approach in addressing the above-mentioned data-variability and data-inefficiency issues. Significance: As a consequence, it would enable one to improve the quality of automatic sleep staging models when the amount of data is relatively small. The source code and the pretrained models are available at http://github.com/pquochuy/sleep_transfer_learning. △ Less

Submitted 27 August, 2020; v1 submitted 30 July, 2019; originally announced July 2019.

Comments: This article has been published in IEEE Transactions on Biomedical Engineering

arXiv:1904.03543 [pdf, ps, other]

Spatio-Temporal Attention Pooling for Audio Scene Classification

Authors: Huy Phan, Oliver Y. Chén, Lam Pham, Philipp Koch, Maarten De Vos, Ian McLoughlin, Alfred Mertins

Abstract: Acoustic scenes are rich and redundant in their content. In this work, we present a spatio-temporal attention pooling layer coupled with a convolutional recurrent neural network to learn from patterns that are discriminative while suppressing those that are irrelevant for acoustic scene classification. The convolutional layers in this network learn invariant features from time-frequency input. The… ▽ More Acoustic scenes are rich and redundant in their content. In this work, we present a spatio-temporal attention pooling layer coupled with a convolutional recurrent neural network to learn from patterns that are discriminative while suppressing those that are irrelevant for acoustic scene classification. The convolutional layers in this network learn invariant features from time-frequency input. The bidirectional recurrent layers are then able to encode the temporal dynamics of the resulting convolutional features. Afterwards, a two-dimensional attention mask is formed via the outer product of the spatial and temporal attention vectors learned from two designated attention layers to weigh and pool the recurrent output into a final feature vector for classification. The network is trained with between-class examples generated from between-class data augmentation. Experiments demonstrate that the proposed method not only outperforms a strong convolutional neural network baseline but also sets new state-of-the-art performance on the LITIS Rouen dataset. △ Less

Submitted 28 June, 2019; v1 submitted 6 April, 2019; originally announced April 2019.

Comments: To appear at the 20th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019)

arXiv:1811.01095 [pdf, ps, other]

Beyond Equal-Length Snippets: How Long is Sufficient to Recognize an Audio Scene?

Authors: Huy Phan, Oliver Y. Chén, Philipp Koch, Lam Pham, Ian McLoughlin, Alfred Mertins, Maarten De Vos

Abstract: Due to the variability in characteristics of audio scenes, some scenes can naturally be recognized earlier than others. In this work, rather than using equal-length snippets for all scene categories, as is common in the literature, we study to which temporal extent an audio scene can be reliably recognized given state-of-the-art models. Moreover, as model fusion with deep network ensemble is preva… ▽ More Due to the variability in characteristics of audio scenes, some scenes can naturally be recognized earlier than others. In this work, rather than using equal-length snippets for all scene categories, as is common in the literature, we study to which temporal extent an audio scene can be reliably recognized given state-of-the-art models. Moreover, as model fusion with deep network ensemble is prevalent in audio scene classification, we further study whether, and if so, when model fusion is necessary for this task. To achieve these goals, we employ two single-network systems relying on a convolutional neural network and a recurrent neural network for classification as well as early fusion and late fusion of these networks. Experimental results on the LITIS-Rouen dataset show that some scenes can be reliably recognized with a few seconds while other scenes require significantly longer durations. In addition, model fusion is shown to be the most beneficial when the signal length is short. △ Less

Submitted 8 May, 2019; v1 submitted 2 November, 2018; originally announced November 2018.

Comments: Accepted to 2019 AES Conference on Audio Forensics

arXiv:1811.01092 [pdf, ps, other]

Unifying Isolated and Overlapping Audio Event Detection with Multi-Label Multi-Task Convolutional Recurrent Neural Networks

Authors: Huy Phan, Oliver Y. Chén, Philipp Koch, Lam Pham, Ian McLoughlin, Alfred Mertins, Maarten De Vos

Abstract: We propose a multi-label multi-task framework based on a convolutional recurrent neural network to unify detection of isolated and overlapping audio events. The framework leverages the power of convolutional recurrent neural network architectures; convolutional layers learn effective features over which higher recurrent layers perform sequential modelling. Furthermore, the output layer is designed… ▽ More We propose a multi-label multi-task framework based on a convolutional recurrent neural network to unify detection of isolated and overlapping audio events. The framework leverages the power of convolutional recurrent neural network architectures; convolutional layers learn effective features over which higher recurrent layers perform sequential modelling. Furthermore, the output layer is designed to handle arbitrary degrees of event overlap. At each time step in the recurrent output sequence, an output triple is dedicated to each event category of interest to jointly model event occurrence and temporal boundaries. That is, the network jointly determines whether an event of this category occurs, and when it occurs, by estimating onset and offset positions at each recurrent time step. We then introduce three sequential losses for network training: multi-label classification loss, distance estimation loss, and confidence loss. We demonstrate good generalization on two datasets: ITC-Irst for isolated audio event detection, and TUT-SED-Synthetic-2016 for overlapping audio event detection. △ Less

Submitted 18 February, 2019; v1 submitted 2 November, 2018; originally announced November 2018.

Comments: Accepted for the 44th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2019)

arXiv:1712.08034 [pdf, other]

doi 10.1109/ICASSP.2019.8682625

On the Use of a Spectral Glottal Model for the Source-filter Separation of Speech

Authors: Olivier Perrotin, Ian Vince McLoughlin

Abstract: The estimation of glottal flow from a speech waveform is a key method for speech analysis and parameterization. Significant research effort has been made to dissociate the first vocal tract resonance from the glottal formant (the low-frequency resonance describing the open-phase of the vocal fold vibration). However few methods cope with estimation of high-frequency spectral tilt to describe the r… ▽ More The estimation of glottal flow from a speech waveform is a key method for speech analysis and parameterization. Significant research effort has been made to dissociate the first vocal tract resonance from the glottal formant (the low-frequency resonance describing the open-phase of the vocal fold vibration). However few methods cope with estimation of high-frequency spectral tilt to describe the return-phase of the vocal fold vibration, which is crucial to the perception of vocal effort. This paper proposes an improved version of the well-known Iterative Adaptive Inverse Filtering (IAIF) called GFM-IAIF. GFM-IAIF includes a full spectral model of the glottis that incorporates both glottal formant and spectral tilt features. Comparisons with the standard IAIF method show that while GFM-IAIF maintains good performance on vocal tract removal, it significantly improves the perceptive timbral variations associated to vocal effort. △ Less

Submitted 21 December, 2017; originally announced December 2017.

Comments: 8 pages, 4 figures

arXiv:1712.02116 [pdf, ps, other]

doi 10.1109/ICASSP.2018.8461859

Enabling Early Audio Event Detection with Neural Networks

Authors: Huy Phan, Philipp Koch, Ian McLoughlin, Alfred Mertins

Abstract: This paper presents a methodology for early detection of audio events from audio streams. Early detection is the ability to infer an ongoing event during its initial stage. The proposed system consists of a novel inference step coupled with dual parallel tailored-loss deep neural networks (DNNs). The DNNs share a similar architecture except for their loss functions, i.e. weighted loss and multitas… ▽ More This paper presents a methodology for early detection of audio events from audio streams. Early detection is the ability to infer an ongoing event during its initial stage. The proposed system consists of a novel inference step coupled with dual parallel tailored-loss deep neural networks (DNNs). The DNNs share a similar architecture except for their loss functions, i.e. weighted loss and multitask loss, which are designed to efficiently cope with issues common to audio event detection. The inference step is newly introduced to make use of the network outputs for recognizing ongoing events. The monotonicity of the detection function is required for reliable early detection, and will also be proved. Experiments on the ITC-Irst database show that the proposed system achieves state-of-the-art detection performance. Furthermore, even partial events are sufficient to achieve good performance similar to that obtained when an entire event is observed, enabling early event detection. △ Less

Submitted 6 April, 2019; v1 submitted 6 December, 2017; originally announced December 2017.

Comments: Published version available at https://ieeexplore.ieee.org/document/8461859

Journal ref: Published in Proceedings of 43rd IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 141-145, 2018

arXiv:1612.09089 [pdf, other]

doi 10.23919/EUSIPCO.2017.8081709

What Makes Audio Event Detection Harder than Classification?

Authors: Huy Phan, Philipp Koch, Fabrice Katzberg, Marco Maass, Radoslaw Mazur, Ian McLoughlin, Alfred Mertins

Abstract: There is a common observation that audio event classification is easier to deal with than detection. So far, this observation has been accepted as a fact and we lack of a careful analysis. In this paper, we reason the rationale behind this fact and, more importantly, leverage them to benefit the audio event detection task. We present an improved detection pipeline in which a verification step is a… ▽ More There is a common observation that audio event classification is easier to deal with than detection. So far, this observation has been accepted as a fact and we lack of a careful analysis. In this paper, we reason the rationale behind this fact and, more importantly, leverage them to benefit the audio event detection task. We present an improved detection pipeline in which a verification step is appended to augment a detection system. This step employs a high-quality event classifier to postprocess the benign event hypotheses outputted by the detection system and reject false alarms. To demonstrate the effectiveness of the proposed pipeline, we implement and pair up different event detectors based on the most common detection schemes and various event classifiers, ranging from the standard bag-of-words model to the state-of-the-art bank-of-regressors one. Experimental results on the ITC-Irst dataset show significant improvements to detection performance. More importantly, these improvements are consistent for all detector-classifier combinations. △ Less

Submitted 17 May, 2018; v1 submitted 29 December, 2016; originally announced December 2016.

Comments: Published version available at https://ieeexplore.ieee.org/document/8081709/

Journal ref: Published in Proceedings of the 25th European Signal Processing Conference (EUSIPCO), pp. 2739-2743, 2017

arXiv:1604.08716 [pdf, other]

doi 10.1109/ICASSP.2016.7471667

Learning Compact Structural Representations for Audio Events Using Regressor Banks

Authors: Huy Phan, Marco Maass, Lars Hertel, Radoslaw Mazur, Ian McLoughlin, Alfred Mertins

Abstract: We introduce a new learned descriptor for audio signals which is efficient for event representation. The entries of the descriptor are produced by evaluating a set of regressors on the input signal. The regressors are class-specific and trained using the random regression forests framework. Given an input signal, each regressor estimates the onset and offset positions of the target event. The esti… ▽ More We introduce a new learned descriptor for audio signals which is efficient for event representation. The entries of the descriptor are produced by evaluating a set of regressors on the input signal. The regressors are class-specific and trained using the random regression forests framework. Given an input signal, each regressor estimates the onset and offset positions of the target event. The estimation confidence scores output by a regressor are then used to quantify how the target event aligns with the temporal structure of the corresponding category. Our proposed descriptor has two advantages. First, it is compact, i.e. the dimensionality of the descriptor is equal to the number of event classes. Second, we show that even simple linear classification models, trained on our descriptor, yield better accuracies on audio event classification task than not only the nonlinear baselines but also the state-of-the-art results. △ Less

Submitted 29 April, 2016; originally announced April 2016.

Comments: To appear in Proceedings of IEEE ICASSP 2016

Showing 1–31 of 31 results for author: McLoughlin, I