subscribe to arXiv mailings

Freeze and Learn: Continual Learning with Selective Freezing for Speech Deepfake Detection

Authors: Davide Salvi, Viola Negroni, Luca Bondi, Paolo Bestagini, Stefano Tubaro

Abstract: In speech deepfake detection, one of the critical aspects is developing detectors able to generalize on unseen data and distinguish fake signals across different datasets. Common approaches to this challenge involve incorporating diverse data into the training process or fine-tuning models on unseen datasets. However, these solutions can be computationally demanding and may lead to the loss of kno… ▽ More In speech deepfake detection, one of the critical aspects is developing detectors able to generalize on unseen data and distinguish fake signals across different datasets. Common approaches to this challenge involve incorporating diverse data into the training process or fine-tuning models on unseen datasets. However, these solutions can be computationally demanding and may lead to the loss of knowledge acquired from previously learned data. Continual learning techniques offer a potential solution to this problem, allowing the models to learn from unseen data without losing what they have already learned. Still, the optimal way to apply these algorithms for speech deepfake detection remains unclear, and we do not know which is the best way to apply these algorithms to the developed models. In this paper we address this aspect and investigate whether, when retraining a speech deepfake detector, it is more effective to apply continual learning across the entire model or to update only some of its layers while freezing others. Our findings, validated across multiple models, indicate that the most effective approach among the analyzed ones is to update only the weights of the initial layers, which are responsible for processing the input features of the detector. △ Less

Submitted 26 September, 2024; originally announced September 2024.

Comments: Submitted to ICASSP 2025

arXiv:2406.18447 [pdf]

How to Achieve High Spatial Resolution in Organic Optobioelectronic Devices?

Authors: Luca Fabbri, Ludovico Migliaccio, Aleksandra Širvinskytė, Giacomo Rizzi, Luca Bondi, Cristiano Tamarozzi, Stefan A. L. Weber, Beatrice Fraboni, Eric Daniel Glowacki, Tobias Cramer

Abstract: Light activated local stimulation and sensing of biological cells offers enormous potential for minimally invasive bioelectronic interfaces. Organic semiconductors are a promising material class to achieve this kind of transduction due to their optoelectronic properties and biocompatibility. Here we investigate which material properties are necessary to keep the optical excitation localized. This… ▽ More Light activated local stimulation and sensing of biological cells offers enormous potential for minimally invasive bioelectronic interfaces. Organic semiconductors are a promising material class to achieve this kind of transduction due to their optoelectronic properties and biocompatibility. Here we investigate which material properties are necessary to keep the optical excitation localized. This is critical to single cell transduction with high spatial resolution. As a model system we use organic photocapacitors for cell stimulation made of the small molecule semiconductors H2Pc and PTCDI. We investigate the spatial broadening of the localized optical excitation with photovoltage microscopy measurements. Our experimental data combined with modelling show that resolution losses due to the broadening of the excitation are directly related to the effective diffusion length of charge carriers generated at the heterojunction. With additional transient photovoltage measurements we find that the H2Pc/PTCDI heterojunction offers a small diffusion length of lambda = 1.5 +/- 0.1 um due to the small mobility of charge carriers along the heterojunction. Instead covering the heterojunction with a layer of PEDOT:PSS improves the photocapacitor performance but increases the carrier diffusion length to lambda = 7.0 +/- 0.3 um due to longer lifetime and higher carrier mobility. Furthermore, we introduce electrochemical photocurrent microscopy experiments to demonstrate micrometric resolution with the pn-junction under realistic aqueous operation conditions. This work offers valuable insights into the physical mechanisms governing the excitation and transduction profile and provide design principles for future organic semiconductor junctions, aiming to achieve high efficiency and high spatial resolution. △ Less

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2401.09308 [pdf, other]

Can Synthetic Data Boost the Training of Deep Acoustic Vehicle Counting Networks?

Authors: Stefano Damiano, Luca Bondi, Shabnam Ghaffarzadegan, Andre Guntoro, Toon van Waterschoot

Abstract: In the design of traffic monitoring solutions for optimizing the urban mobility infrastructure, acoustic vehicle counting models have received attention due to their cost effectiveness and energy efficiency. Although deep learning has proven effective for visual traffic monitoring, its use has not been thoroughly investigated in the audio domain, likely due to real-world data scarcity. In this wor… ▽ More In the design of traffic monitoring solutions for optimizing the urban mobility infrastructure, acoustic vehicle counting models have received attention due to their cost effectiveness and energy efficiency. Although deep learning has proven effective for visual traffic monitoring, its use has not been thoroughly investigated in the audio domain, likely due to real-world data scarcity. In this work, we propose a novel approach to acoustic vehicle counting by developing: i) a traffic noise simulation framework to synthesize realistic vehicle pass-by events; ii) a strategy to mix synthetic and real data to train a deep-learning model for traffic counting. The proposed system is capable of simultaneously counting cars and commercial vehicles driving on a two-lane road, and identifying their direction of travel under moderate traffic density conditions. With only 24 hours of labeled real-world traffic noise, we are able to improve counting accuracy on real-world data from $63\%$ to $88\%$ for cars and from $86\%$ to $94\%$ for commercial vehicles. △ Less

Submitted 17 January, 2024; originally announced January 2024.

Comments: Accepted paper: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024)

arXiv:2401.04935 [pdf, other]

Learning Audio Concepts from Counterfactual Natural Language

Authors: Ali Vosoughi, Luca Bondi, Ho-Hsiang Wu, Chenliang Xu

Abstract: Conventional audio classification relied on predefined classes, lacking the ability to learn from free-form text. Recent methods unlock learning joint audio-text embeddings from raw audio-text pairs describing audio in natural language. Despite recent advancements, there is little exploration of systematic methods to train models for recognizing sound events and sources in alternative scenarios, s… ▽ More Conventional audio classification relied on predefined classes, lacking the ability to learn from free-form text. Recent methods unlock learning joint audio-text embeddings from raw audio-text pairs describing audio in natural language. Despite recent advancements, there is little exploration of systematic methods to train models for recognizing sound events and sources in alternative scenarios, such as distinguishing fireworks from gunshots at outdoor events in similar situations. This study introduces causal reasoning and counterfactual analysis in the audio domain. We use counterfactual instances and include them in our model across different aspects. Our model considers acoustic characteristics and sound source information from human-annotated reference texts. To validate the effectiveness of our model, we conducted pre-training utilizing multiple audio captioning datasets. We then evaluate with several common downstream tasks, demonstrating the merits of the proposed method as one of the first works leveraging counterfactual information in audio domain. Specifically, the top-1 accuracy in open-ended language-based audio retrieval task increased by more than 43%. △ Less

Submitted 10 January, 2024; originally announced January 2024.

Comments: Accepted at ICASSP 2024

arXiv:2309.13343 [pdf, other]

Two vs. Four-Channel Sound Event Localization and Detection

Authors: Julia Wilkins, Magdalena Fuentes, Luca Bondi, Shabnam Ghaffarzadegan, Ali Abavisani, Juan Pablo Bello

Abstract: Sound event localization and detection (SELD) systems estimate both the direction-of-arrival (DOA) and class of sound sources over time. In the DCASE 2022 SELD Challenge (Task 3), models are designed to operate in a 4-channel setting. While beneficial to further the development of SELD systems using a multichannel recording setup such as first-order Ambisonics (FOA), most consumer electronics devi… ▽ More Sound event localization and detection (SELD) systems estimate both the direction-of-arrival (DOA) and class of sound sources over time. In the DCASE 2022 SELD Challenge (Task 3), models are designed to operate in a 4-channel setting. While beneficial to further the development of SELD systems using a multichannel recording setup such as first-order Ambisonics (FOA), most consumer electronics devices rarely are able to record using more than two channels. For this reason, in this work we investigate the performance of the DCASE 2022 SELD baseline model using three audio input representations: FOA, binaural, and stereo. We perform a novel comparative analysis illustrating the effect of these audio input representations on SELD performance. Crucially, we show that binaural and stereo (i.e. 2-channel) audio-based SELD models are still able to localize and detect sound sources laterally quite well, despite overall performance degrading as less audio information is provided. Further, we segment our analysis by scenes containing varying degrees of sound source polyphony to better understand the effect of audio input representation on localization and detection performance as scene conditions become increasingly complex. △ Less

Submitted 23 September, 2023; originally announced September 2023.

arXiv:2212.11345 [pdf, other]

Knowledge-driven Scene Priors for Semantic Audio-Visual Embodied Navigation

Authors: Gyan Tatiya, Jonathan Francis, Luca Bondi, Ingrid Navarro, Eric Nyberg, Jivko Sinapov, Jean Oh

Abstract: Generalisation to unseen contexts remains a challenge for embodied navigation agents. In the context of semantic audio-visual navigation (SAVi) tasks, the notion of generalisation should include both generalising to unseen indoor visual scenes as well as generalising to unheard sounding objects. However, previous SAVi task definitions do not include evaluation conditions on truly novel sounding ob… ▽ More Generalisation to unseen contexts remains a challenge for embodied navigation agents. In the context of semantic audio-visual navigation (SAVi) tasks, the notion of generalisation should include both generalising to unseen indoor visual scenes as well as generalising to unheard sounding objects. However, previous SAVi task definitions do not include evaluation conditions on truly novel sounding objects, resorting instead to evaluating agents on unheard sound clips of known objects; meanwhile, previous SAVi methods do not include explicit mechanisms for incorporating domain knowledge about object and region semantics. These weaknesses limit the development and assessment of models' abilities to generalise their learned experience. In this work, we introduce the use of knowledge-driven scene priors in the semantic audio-visual embodied navigation task: we combine semantic information from our novel knowledge graph that encodes object-region relations, spatial knowledge from dual Graph Encoder Networks, and background knowledge from a series of pre-training tasks -- all within a reinforcement learning framework for audio-visual navigation. We also define a new audio-visual navigation sub-task, where agents are evaluated on novel sounding objects, as opposed to unheard clips of known objects. We show improvements over strong baselines in generalisation to unseen regions and novel sounding objects, within the Habitat-Matterport3D simulation environment, under the SoundSpaces task. △ Less

Submitted 21 December, 2022; originally announced December 2022.

Comments: 19 pages, 8 figures, 9 tables

arXiv:2204.01905 [pdf, other]

Learning to Adapt to Domain Shifts with Few-shot Samples in Anomalous Sound Detection

Authors: Bingqing Chen, Luca Bondi, Samarjit Das

Abstract: Anomaly detection has many important applications, such as monitoring industrial equipment. Despite recent advances in anomaly detection with deep-learning methods, it is unclear how existing solutions would perform under out-of-distribution scenarios, e.g., due to shifts in machine load or environmental noise. Grounded in the application of machine health monitoring, we propose a framework that a… ▽ More Anomaly detection has many important applications, such as monitoring industrial equipment. Despite recent advances in anomaly detection with deep-learning methods, it is unclear how existing solutions would perform under out-of-distribution scenarios, e.g., due to shifts in machine load or environmental noise. Grounded in the application of machine health monitoring, we propose a framework that adapts to new conditions with few-shot samples. Building upon prior work, we adopt a classification-based approach for anomaly detection and show its equivalence to mixture density estimation of the normal samples. We incorporate an episodic training procedure to match the few-shot setting during inference. We define multiple auxiliary classification tasks based on meta-information and leverage gradient-based meta-learning to improve generalization to different shifts. We evaluate our proposed method on a recently-released dataset of audio measurements from different machine types. It improved upon two baselines by around 10% and is on par with best-performing model reported on the dataset. △ Less

Submitted 4 April, 2022; originally announced April 2022.

arXiv:2011.07792 [pdf, other]

Training Strategies and Data Augmentations in CNN-based DeepFake Video Detection

Authors: Luca Bondi, Edoardo Daniele Cannas, Paolo Bestagini, Stefano Tubaro

Abstract: The fast and continuous growth in number and quality of deepfake videos calls for the development of reliable detection systems capable of automatically warning users on social media and on the Internet about the potential untruthfulness of such contents. While algorithms, software, and smartphone apps are getting better every day in generating manipulated videos and swapping faces, the accuracy o… ▽ More The fast and continuous growth in number and quality of deepfake videos calls for the development of reliable detection systems capable of automatically warning users on social media and on the Internet about the potential untruthfulness of such contents. While algorithms, software, and smartphone apps are getting better every day in generating manipulated videos and swapping faces, the accuracy of automated systems for face forgery detection in videos is still quite limited and generally biased toward the dataset used to design and train a specific detection system. In this paper we analyze how different training strategies and data augmentation techniques affect CNN-based deepfake detectors when training and testing on the same dataset or across different datasets. △ Less

Submitted 16 November, 2020; originally announced November 2020.

arXiv:2004.07676 [pdf, other]

Video Face Manipulation Detection Through Ensemble of CNNs

Authors: Nicolò Bonettini, Edoardo Daniele Cannas, Sara Mandelli, Luca Bondi, Paolo Bestagini, Stefano Tubaro

Abstract: In the last few years, several techniques for facial manipulation in videos have been successfully developed and made available to the masses (i.e., FaceSwap, deepfake, etc.). These methods enable anyone to easily edit faces in video sequences with incredibly realistic results and a very little effort. Despite the usefulness of these tools in many fields, if used maliciously, they can have a signi… ▽ More In the last few years, several techniques for facial manipulation in videos have been successfully developed and made available to the masses (i.e., FaceSwap, deepfake, etc.). These methods enable anyone to easily edit faces in video sequences with incredibly realistic results and a very little effort. Despite the usefulness of these tools in many fields, if used maliciously, they can have a significantly bad impact on society (e.g., fake news spreading, cyber bullying through fake revenge porn). The ability of objectively detecting whether a face has been manipulated in a video sequence is then a task of utmost importance. In this paper, we tackle the problem of face manipulation detection in video sequences targeting modern facial manipulation techniques. In particular, we study the ensembling of different trained Convolutional Neural Network (CNN) models. In the proposed solution, different models are obtained starting from a base network (i.e., EfficientNetB4) making use of two different concepts: (i) attention layers; (ii) siamese training. We show that combining these networks leads to promising face manipulation detection results on two publicly available datasets with more than 119000 videos. △ Less

Submitted 16 April, 2020; originally announced April 2020.

arXiv:1904.08497 [pdf, other]

doi 10.1109/ACCESS.2019.2921436

An In-Depth Study on Open-Set Camera Model Identification

Authors: Pedro Ribeiro Mendes Júnior, Luca Bondi, Paolo Bestagini, Stefano Tubaro, Anderson Rocha

Abstract: Camera model identification refers to the problem of linking a picture to the camera model used to shoot it. As this might be an enabling factor in different forensic applications to single out possible suspects (e.g., detecting the author of child abuse or terrorist propaganda material), many accurate camera model attribution methods have been developed in the literature. One of their main drawba… ▽ More Camera model identification refers to the problem of linking a picture to the camera model used to shoot it. As this might be an enabling factor in different forensic applications to single out possible suspects (e.g., detecting the author of child abuse or terrorist propaganda material), many accurate camera model attribution methods have been developed in the literature. One of their main drawbacks, however, is the typical closed-set assumption of the problem. This means that an investigated photograph is always assigned to one camera model within a set of known ones present during investigation, i.e., training time, and the fact that the picture can come from a completely unrelated camera model during actual testing is usually ignored. Under realistic conditions, it is not possible to assume that every picture under analysis belongs to one of the available camera models. To deal with this issue, in this paper, we present the first in-depth study on the possibility of solving the camera model identification problem in open-set scenarios. Given a photograph, we aim at detecting whether it comes from one of the known camera models of interest or from an unknown one. We compare different feature extraction algorithms and classifiers specially targeting open-set recognition. We also evaluate possible open-set training protocols that can be applied along with any open-set classifier, observing that a simple of those alternatives obtains best results. Thorough testing on independent datasets shows that it is possible to leverage a recently proposed convolutional neural network as feature extractor paired with a properly trained open-set classifier aiming at solving the open-set camera model attribution problem even to small-scale image patches, improving over state-of-the-art available solutions. △ Less

Submitted 13 November, 2019; v1 submitted 11 April, 2019; originally announced April 2019.

Comments: Published through IEEE Access journal

arXiv:1805.02131 [pdf, other]

doi 10.1109/CVPRW.2017.230

A Counter-Forensic Method for CNN-Based Camera Model Identification

Authors: David Güera, Yu Wang, Luca Bondi, Paolo Bestagini, Stefano Tubaro, Edward J. Delp

Abstract: An increasing number of digital images are being shared and accessed through websites, media, and social applications. Many of these images have been modified and are not authentic. Recent advances in the use of deep convolutional neural networks (CNNs) have facilitated the task of analyzing the veracity and authenticity of largely distributed image datasets. We examine in this paper the problem o… ▽ More An increasing number of digital images are being shared and accessed through websites, media, and social applications. Many of these images have been modified and are not authentic. Recent advances in the use of deep convolutional neural networks (CNNs) have facilitated the task of analyzing the veracity and authenticity of largely distributed image datasets. We examine in this paper the problem of identifying the camera model or type that was used to take an image and that can be spoofed. Due to the linear nature of CNNs and the high-dimensionality of images, neural networks are vulnerable to attacks with adversarial examples. These examples are imperceptibly different from correctly classified images but are misclassified with high confidence by CNNs. In this paper, we describe a counter-forensic method capable of subtly altering images to change their estimated camera model when they are analyzed by any CNN-based camera model detector. Our method can use both the Fast Gradient Sign Method (FGSM) or the Jacobian-based Saliency Map Attack (JSMA) to craft these adversarial images and does not require direct access to the CNN. Our results show that even advanced deep learning architectures trained to analyze images and obtain camera model information are still vulnerable to our proposed method. △ Less

Submitted 5 May, 2018; originally announced May 2018.

Comments: Presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Workshop on Media Forensics

arXiv:1708.00930 [pdf, other]

doi 10.1016/j.jvcir.2017.09.003

Aligned and Non-Aligned Double JPEG Detection Using Convolutional Neural Networks

Authors: Mauro Barni, Luca Bondi, Nicolò Bonettini, Paolo Bestagini, Andrea Costanzo, Marco Maggini, Benedetta Tondi, Stefano Tubaro

Abstract: Due to the wide diffusion of JPEG coding standard, the image forensic community has devoted significant attention to the development of double JPEG (DJPEG) compression detectors through the years. The ability of detecting whether an image has been compressed twice provides paramount information toward image authenticity assessment. Given the trend recently gained by convolutional neural networks (… ▽ More Due to the wide diffusion of JPEG coding standard, the image forensic community has devoted significant attention to the development of double JPEG (DJPEG) compression detectors through the years. The ability of detecting whether an image has been compressed twice provides paramount information toward image authenticity assessment. Given the trend recently gained by convolutional neural networks (CNN) in many computer vision tasks, in this paper we propose to use CNNs for aligned and non-aligned double JPEG compression detection. In particular, we explore the capability of CNNs to capture DJPEG artifacts directly from images. Results show that the proposed CNN-based detectors achieve good performance even with small size images (i.e., 64x64), outperforming state-of-the-art solutions, especially in the non-aligned case. Besides, good results are also achieved in the commonly-recognized challenging case in which the first quality factor is larger than the second one. △ Less

Submitted 2 August, 2017; originally announced August 2017.

Comments: Submitted to Journal of Visual Communication and Image Representation (first submission: March 20, 2017; second submission: August 2, 2017)

arXiv:1609.02500 [pdf, other]

doi 10.1109/ICCE-Berlin.2016.7684706

Reduced Memory Region Based Deep Convolutional Neural Network Detection

Authors: Denis Tome', Luca Bondi, Emanuele Plebani, Luca Baroffio, Danilo Pau, Stefano Tubaro

Abstract: Accurate pedestrian detection has a primary role in automotive safety: for example, by issuing warnings to the driver or acting actively on car's brakes, it helps decreasing the probability of injuries and human fatalities. In order to achieve very high accuracy, recent pedestrian detectors have been based on Convolutional Neural Networks (CNN). Unfortunately, such approaches require vast amounts… ▽ More Accurate pedestrian detection has a primary role in automotive safety: for example, by issuing warnings to the driver or acting actively on car's brakes, it helps decreasing the probability of injuries and human fatalities. In order to achieve very high accuracy, recent pedestrian detectors have been based on Convolutional Neural Networks (CNN). Unfortunately, such approaches require vast amounts of computational power and memory, preventing efficient implementations on embedded systems. This work proposes a CNN-based detector, adapting a general-purpose convolutional network to the task at hand. By thoroughly analyzing and optimizing each step of the detection pipeline, we develop an architecture that outperforms methods based on traditional image features and achieves an accuracy close to the state-of-the-art while having low computational complexity. Furthermore, the model is compressed in order to fit the tight constrains of low power devices with a limited amount of embedded memory available. This paper makes two main contributions: (1) it proves that a region based deep neural network can be finely tuned to achieve adequate accuracy for pedestrian detection (2) it achieves a very low memory usage without reducing detection accuracy on the Caltech Pedestrian dataset. △ Less

Submitted 8 September, 2016; originally announced September 2016.

Comments: IEEE 2016 ICCE-Berlin

Journal ref: 2016 IEEE 6th International Conference on Consumer Electronics - Berlin (ICCE-Berlin)

arXiv:1603.01068 [pdf, other]

doi 10.1109/LSP.2016.2641006

First Steps Toward Camera Model Identification with Convolutional Neural Networks

Authors: Luca Bondi, Luca Baroffio, David Güera, Paolo Bestagini, Edward J. Delp, Stefano Tubaro

Abstract: Detecting the camera model used to shoot a picture enables to solve a wide series of forensic problems, from copyright infringement to ownership attribution. For this reason, the forensic community has developed a set of camera model identification algorithms that exploit characteristic traces left on acquired images by the processing pipelines specific of each camera model. In this paper, we inve… ▽ More Detecting the camera model used to shoot a picture enables to solve a wide series of forensic problems, from copyright infringement to ownership attribution. For this reason, the forensic community has developed a set of camera model identification algorithms that exploit characteristic traces left on acquired images by the processing pipelines specific of each camera model. In this paper, we investigate a novel approach to solve camera model identification problem. Specifically, we propose a data-driven algorithm based on convolutional neural networks, which learns features characterizing each camera model directly from the acquired pictures. Results on a well-known dataset of 18 camera models show that: (i) the proposed method outperforms up-to-date state-of-the-art algorithms on classification of 64x64 color image patches; (ii) features learned by the proposed network generalize to camera models never used for training. △ Less

Submitted 26 September, 2017; v1 submitted 3 March, 2016; originally announced March 2016.

arXiv:1510.03608 [pdf, other]

doi 10.1016/j.image.2016.05.007

Deep convolutional neural networks for pedestrian detection

Authors: Denis Tomè, Federico Monti, Luca Baroffio, Luca Bondi, Marco Tagliasacchi, Stefano Tubaro

Abstract: Pedestrian detection is a popular research topic due to its paramount importance for a number of applications, especially in the fields of automotive, surveillance and robotics. Despite the significant improvements, pedestrian detection is still an open challenge that calls for more and more accurate algorithms. In the last few years, deep learning and in particular convolutional neural networks e… ▽ More Pedestrian detection is a popular research topic due to its paramount importance for a number of applications, especially in the fields of automotive, surveillance and robotics. Despite the significant improvements, pedestrian detection is still an open challenge that calls for more and more accurate algorithms. In the last few years, deep learning and in particular convolutional neural networks emerged as the state of the art in terms of accuracy for a number of computer vision tasks such as image classification, object detection and segmentation, often outperforming the previous gold standards by a large margin. In this paper, we propose a pedestrian detection system based on deep learning, adapting a general-purpose convolutional network to the task at hand. By thoroughly analyzing and optimizing each step of the detection pipeline we propose an architecture that outperforms traditional methods, achieving a task accuracy close to that of state-of-the-art approaches, while requiring a low computational time. Finally, we tested the system on an NVIDIA Jetson TK1, a 192-core platform that is envisioned to be a forerunner computational brain of future self-driving cars. △ Less

Submitted 7 March, 2016; v1 submitted 13 October, 2015; originally announced October 2015.

Comments: submitted to Elsevier Signal Processing: Image Communication special Issue on Deep Learning

Showing 1–15 of 15 results for author: Bondi, L