-
Randomized Asymmetric Chain of LoRA: The First Meaningful Theoretical Framework for Low-Rank Adaptation
Authors:
Grigory Malinovsky,
Umberto Michieli,
Hasan Abed Al Kader Hammoud,
Taha Ceritli,
Hayder Elesedy,
Mete Ozay,
Peter Richtárik
Abstract:
Fine-tuning has become a popular approach to adapting large foundational models to specific tasks. As the size of models and datasets grows, parameter-efficient fine-tuning techniques are increasingly important. One of the most widely used methods is Low-Rank Adaptation (LoRA), with adaptation update expressed as the product of two low-rank matrices. While LoRA was shown to possess strong performance in fine-tuning, it often under-performs when compared to full-parameter fine-tuning (FPFT). Although many variants of LoRA have been extensively studied empirically, their theoretical optimization analysis is heavily under-explored. The starting point of our work is a demonstration that LoRA and its two extensions, Asymmetric LoRA and Chain of LoRA, indeed encounter convergence issues. To address these issues, we propose Randomized Asymmetric Chain of LoRA (RAC-LoRA) -- a general optimization framework that rigorously analyzes the convergence rates of LoRA-based methods. Our approach inherits the empirical benefits of LoRA-style heuristics, but introduces several small but important algorithmic modifications which turn it into a provably convergent method. Our framework serves as a bridge between FPFT and low-rank adaptation. We provide provable guarantees of convergence to the same solution as FPFT, along with the rate of convergence. Additionally, we present a convergence analysis for smooth, non-convex loss functions, covering gradient descent, stochastic gradient descent, and federated learning settings. Our theoretical findings are supported by experimental results.
Submitted 10 October, 2024;
originally announced October 2024.
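The abstract above describes the LoRA update as a product of two low-rank matrices. A minimal NumPy sketch of that idea follows; it illustrates plain LoRA only, not the RAC-LoRA algorithm, and the names `lora_forward`, `W0`, `A`, `B`, and the `alpha / r` scaling are illustrative assumptions rather than details taken from the listing.

```python
import numpy as np

def lora_forward(x, W0, A, B, alpha=1.0):
    """Apply a frozen weight W0 plus a low-rank update B @ A to inputs x.

    Shapes (all hypothetical): x (batch, d_in), W0 (d_out, d_in),
    A (r, d_in), B (d_out, r), with r much smaller than d_in and d_out.
    """
    r = A.shape[0]
    delta_W = (alpha / r) * (B @ A)  # update has rank at most r
    return x @ (W0 + delta_W).T

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 16, 2
W0 = rng.normal(size=(d_out, d_in))   # frozen pre-trained weight
A = rng.normal(size=(r, d_in))        # trainable factor
B = np.zeros((d_out, r))              # zero init: the update starts at zero
x = rng.normal(size=(4, d_in))

# With B = 0 the adapted layer reproduces the frozen layer exactly.
y_init = lora_forward(x, W0, A, B)

# A random B stands in for a trained factor; the update stays low rank.
B_trained = rng.normal(size=(d_out, r))
update_rank = np.linalg.matrix_rank(B_trained @ A)
```

Only `A` and `B` would be trained in this setup, which is `r * (d_in + d_out)` parameters instead of `d_in * d_out` for the full matrix.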
-
Swiss DINO: Efficient and Versatile Vision Framework for On-device Personal Object Search
Authors:
Kirill Paramonov,
Jia-Xing Zhong,
Umberto Michieli,
Jijoong Moon,
Mete Ozay
Abstract:
In this paper, we address a recent trend in robotic home appliances to include vision systems on personal devices, capable of personalizing the appliances on the fly. In particular, we formulate and address an important technical task of personal object search, which involves localization and identification of personal items of interest on images captured by robotic appliances, with each item referenced only by a few annotated images. The task is crucial for robotic home appliances and mobile systems, which need to process personal visual scenes or to operate with particular personal objects (e.g., for grasping or navigation). In practice, personal object search presents two main technical challenges. First, a robot vision system needs to be able to distinguish between many fine-grained classes, in the presence of occlusions and clutter. Second, the strict resource requirements for the on-device system restrict the usage of most state-of-the-art methods for few-shot learning and often prevent on-device adaptation. In this work, we propose Swiss DINO: a simple yet effective framework for one-shot personal object search based on the recent DINOv2 transformer model, which was shown to have strong zero-shot generalization properties. Swiss DINO handles challenging on-device personalized scene understanding requirements and does not require any adaptation training. We show significant improvement (up to 55%) in segmentation and recognition accuracy compared to the common lightweight solutions, and significant footprint reduction of backbone inference time (up to 100x) and GPU consumption (up to 10x) compared to the heavy transformer-based solutions.
Submitted 10 July, 2024;
originally announced July 2024.
-
Enhanced Model Robustness to Input Corruptions by Per-corruption Adaptation of Normalization Statistics
Authors:
Elena Camuffo,
Umberto Michieli,
Simone Milani,
Jijoong Moon,
Mete Ozay
Abstract:
Developing a reliable vision system is a fundamental challenge for robotic technologies (e.g., indoor service robots and outdoor autonomous robots) which can ensure reliable navigation even in challenging environments such as adverse weather conditions (e.g., fog, rain), poor lighting conditions (e.g., over/under exposure), or sensor degradation (e.g., blurring, noise), and can guarantee high performance in safety-critical functions. Current solutions proposed to improve model robustness usually rely on generic data augmentation techniques or employ costly test-time adaptation methods. In addition, most approaches focus on addressing a single vision task (typically, image recognition) utilising synthetic data. In this paper, we introduce Per-corruption Adaptation of Normalization statistics (PAN) to enhance the model robustness of vision systems. Our approach entails three key components: (i) a corruption type identification module, (ii) dynamic adjustment of normalization layer statistics based on identified corruption type, and (iii) real-time update of these statistics according to input data. PAN can integrate seamlessly with any convolutional model for enhanced accuracy in several robot vision tasks. In our experiments, PAN obtains robust performance improvement on challenging real-world corrupted image datasets (e.g., OpenLoris, ExDark, ACDC), where most of the current solutions tend to fail. Moreover, PAN outperforms the baseline models by 20-30% on synthetic benchmarks in object recognition tasks.
Submitted 8 July, 2024;
originally announced July 2024.
-
LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models
Authors:
Hayder Elesedy,
Pedro M. Esperança,
Silviu Vlad Oprea,
Mete Ozay
Abstract:
Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs). Existing model-based guardrails have not been designed for resource-constrained computational portable devices, such as mobile phones, more and more of which are running LLM-based applications locally. We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models. LoRA-Guard extracts language features from the LLMs and adapts them for the content moderation task using low-rank adapters, while a dual-path design prevents any performance degradation on the generative task. We show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling on-device content moderation.
Submitted 3 July, 2024;
originally announced July 2024.
-
Cross-Architecture Auxiliary Feature Space Translation for Efficient Few-Shot Personalized Object Detection
Authors:
Francesco Barbato,
Umberto Michieli,
Jijoong Moon,
Pietro Zanuttigh,
Mete Ozay
Abstract:
Recent years have seen object detection robotic systems deployed in several personal devices (e.g., home robots and appliances). This has highlighted a challenge in their design, i.e., they cannot efficiently update their knowledge to distinguish between general classes and user-specific instances (e.g., a dog vs. user's dog). We refer to this challenging task as Instance-level Personalized Object Detection (IPOD). The personalization task requires many samples for model tuning and optimization in a centralized server, raising privacy concerns. An alternative is provided by approaches based on recent large-scale Foundation Models, but their compute costs preclude on-device applications.
In our work we tackle both problems at the same time, designing a Few-Shot IPOD strategy called AuXFT. We introduce a conditional coarse-to-fine few-shot learner to refine the coarse predictions made by an efficient object detector, showing that using an off-the-shelf model leads to poor personalization due to neural collapse. Therefore, we introduce a Translator block that generates an auxiliary feature space where features generated by a self-supervised model (e.g., DINOv2) are distilled without impacting the performance of the detector. We validate AuXFT on three publicly available datasets and one in-house benchmark designed for the IPOD task, achieving remarkable gains in all considered scenarios with excellent time-complexity trade-off: AuXFT reaches a performance of 80% its upper bound at just 32% of the inference time, 13% of VRAM and 19% of the model size.
Submitted 1 July, 2024;
originally announced July 2024.
-
Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
Authors:
Hasan Abed Al Kader Hammoud,
Umberto Michieli,
Fabio Pizzati,
Philip Torr,
Adel Bibi,
Bernard Ghanem,
Mete Ozay
Abstract:
Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model, retaining the expertise of the original ones. However, current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popular model merging techniques, demonstrating that existing methods do not only transfer domain expertise but also propagate misalignment. We propose a simple two-step approach to address this problem: (i) generating synthetic safety and domain-specific data, and (ii) incorporating these generated data into the optimization process of existing data-aware model merging techniques. This allows us to treat alignment as a skill that can be maximized in the resulting merged LLM. Our experiments illustrate the effectiveness of integrating alignment-related data during merging, resulting in models that excel in both domain expertise and alignment.
Submitted 20 June, 2024;
originally announced June 2024.
-
DP-DyLoRA: Fine-Tuning Transformer-Based Models On-Device under Differentially Private Federated Learning using Dynamic Low-Rank Adaptation
Authors:
Jie Xu,
Karthikeyan Saravanan,
Rogier van Dalen,
Haaris Mehmood,
David Tuckey,
Mete Ozay
Abstract:
Federated learning (FL) allows clients to collaboratively train a global model without sharing their local data with a server. However, clients' contributions to the server can still leak sensitive information. Differential privacy (DP) addresses such leakage by providing formal privacy guarantees, with mechanisms that add randomness to the clients' contributions. The randomness makes it infeasible to train large transformer-based models, common in modern federated learning systems. In this work, we empirically evaluate the practicality of fine-tuning large scale on-device transformer-based models with differential privacy in a federated learning system. We conduct comprehensive experiments on various system properties for tasks spanning a multitude of domains: speech recognition, computer vision (CV) and natural language understanding (NLU). Our results show that full fine-tuning under differentially private federated learning (DP-FL) generally leads to huge performance degradation which can be alleviated by reducing the dimensionality of contributions through parameter-efficient fine-tuning (PEFT). Our benchmarks of existing DP-PEFT methods show that DP-Low-Rank Adaptation (DP-LoRA) consistently outperforms other methods. An even more promising approach, DyLoRA, which makes the low rank variable, when naively combined with FL would straightforwardly break differential privacy. We therefore propose an adaptation method that can be combined with differential privacy and call it DP-DyLoRA. Finally, we are able to reduce the accuracy degradation and word error rate (WER) increase due to DP to less than 2% and 7% respectively with 1 million clients and a stringent privacy budget of $ε=2$.
Submitted 22 July, 2024; v1 submitted 10 May, 2024;
originally announced May 2024.
-
Object-conditioned Bag of Instances for Few-Shot Personalized Instance Recognition
Authors:
Umberto Michieli,
Jijoong Moon,
Daehyun Kim,
Mete Ozay
Abstract:
Nowadays, users demand increased personalization of vision systems to localize and identify personal instances of objects (e.g., my dog rather than dog) from a few-shot dataset only. Despite outstanding results of deep networks on classical label-abundant benchmarks (e.g., those of the latest YOLOv8 model for standard object detection), they struggle to maintain within-class variability to represent different instances rather than object categories only. We construct an Object-conditioned Bag of Instances (OBoI) based on multi-order statistics of extracted features, where generic object detection models are extended to search and identify personal instances from the OBoI's metric space, without the need for backpropagation. By relying on multi-order statistics, OBoI achieves consistently superior accuracy in distinguishing different instances. In the results, we achieve 77.1% personal object recognition accuracy in the case of 18 personal instances, showing about 12% relative gain over the state of the art.
Submitted 1 April, 2024;
originally announced April 2024.
-
FFT-based Selection and Optimization of Statistics for Robust Recognition of Severely Corrupted Images
Authors:
Elena Camuffo,
Umberto Michieli,
Jijoong Moon,
Daehyun Kim,
Mete Ozay
Abstract:
Improving model robustness in case of corrupted images is among the key challenges to enable robust vision systems on smart devices, such as robotic agents. Particularly, robust test-time performance is imperative for most of the applications. This paper presents a novel approach to improve robustness of any classification model, especially on severely corrupted images. Our method (FROST) employs high-frequency features to detect input image corruption type, and select layer-wise feature normalization statistics. FROST provides the state-of-the-art results for different models and datasets, outperforming competitors on ImageNet-C by up to 37.1% relative gain, improving baseline of 40.9% mCE on severe corruptions.
Submitted 21 March, 2024;
originally announced March 2024.
-
Deep Neural Network Models Trained With A Fixed Random Classifier Transfer Better Across Domains
Authors:
Hafiz Tiomoko Ali,
Umberto Michieli,
Ji Joong Moon,
Daehyun Kim,
Mete Ozay
Abstract:
The recently discovered Neural collapse (NC) phenomenon states that the last-layer weights of Deep Neural Networks (DNN), converge to the so-called Equiangular Tight Frame (ETF) simplex, at the terminal phase of their training. This ETF geometry is equivalent to vanishing within-class variability of the last layer activations. Inspired by NC properties, we explore in this paper the transferability of DNN models trained with their last layer weight fixed according to ETF. This enforces class separation by eliminating class covariance information, effectively providing implicit regularization. We show that DNN models trained with such a fixed classifier significantly improve transfer performance, particularly on out-of-domain datasets. On a broad range of fine-grained image classification datasets, our approach outperforms i) baseline methods that do not perform any covariance regularization (up to 22%), as well as ii) methods that explicitly whiten covariance of activations throughout training (up to 19%). Our findings suggest that DNNs trained with fixed ETF classifiers offer a powerful mechanism for improving transfer learning across domains.
Submitted 28 February, 2024;
originally announced February 2024.
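The abstract above refers to fixing the last-layer weights to a simplex Equiangular Tight Frame (ETF). One standard way to build such a frame is sketched below; the function name `simplex_etf`, the random orthonormal basis, and the requirement `feat_dim >= num_classes` are assumptions of this sketch, and the listing does not say which construction the paper itself uses.

```python
import numpy as np

def simplex_etf(num_classes, feat_dim, seed=0):
    """Return a (num_classes, feat_dim) weight matrix whose rows form a simplex ETF.

    Rows have unit norm and every pair of distinct rows has cosine
    similarity -1 / (num_classes - 1).
    """
    K = num_classes
    if feat_dim < K:
        raise ValueError("this construction assumes feat_dim >= num_classes")
    rng = np.random.default_rng(seed)
    # U has orthonormal columns (feat_dim x K), obtained by reduced QR.
    U, _ = np.linalg.qr(rng.normal(size=(feat_dim, K)))
    # Centering removes the mean direction; the scale restores unit norms.
    centering = np.eye(K) - np.ones((K, K)) / K
    M = np.sqrt(K / (K - 1)) * U @ centering
    return M.T

W = simplex_etf(num_classes=5, feat_dim=16)
gram = W @ W.T  # pairwise inner products between class vectors
```

In a fixed-classifier setup, `W` would be frozen and only the feature extractor trained, so the class directions stay maximally and equally separated throughout training.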
-
HOP to the Next Tasks and Domains for Continual Learning in NLP
Authors:
Umberto Michieli,
Mete Ozay
Abstract:
Continual Learning (CL) aims to learn a sequence of problems (i.e., tasks and domains) by transferring knowledge acquired on previous problems, whilst avoiding forgetting of past ones. Different from previous approaches which focused on CL for one NLP task or domain in a specific use-case, in this paper, we address a more general CL setting to learn from a sequence of problems in a unique framework. Our method, HOP, makes it possible to hop across tasks and domains by addressing the CL problem along three directions: (i) we employ a set of adapters to generalize a large pre-trained model to unseen problems, (ii) we compute high-order moments over the distribution of embedded representations to distinguish independent and correlated statistics across different tasks and domains, (iii) we process this enriched information with auxiliary heads specialized for each end problem. An extensive experimental campaign on 4 NLP applications, 5 benchmarks and 2 CL setups demonstrates the effectiveness of HOP.
Submitted 28 February, 2024;
originally announced February 2024.
-
A Modular System for Enhanced Robustness of Multimedia Understanding Networks via Deep Parametric Estimation
Authors:
Francesco Barbato,
Umberto Michieli,
Mehmet Kerim Yucel,
Pietro Zanuttigh,
Mete Ozay
Abstract:
In multimedia understanding tasks, corrupted samples pose a critical challenge, because when fed to machine learning models they lead to performance degradation. In the past, three groups of approaches have been proposed to handle noisy data: i) enhancer and denoiser modules to improve the quality of the noisy data, ii) data augmentation approaches, and iii) domain adaptation strategies. All the aforementioned approaches come with drawbacks that limit their applicability; the first has high computational costs and requires pairs of clean-corrupted data for training, while the others only allow deployment of the same task/network they were trained on (i.e., when upstream and downstream task/network are the same). In this paper, we propose SyMPIE to solve these shortcomings. To this end, we design a small, modular, and efficient (just 2 GFLOPs to process a Full HD image) system to enhance input data for robust downstream multimedia understanding with minimal computational cost. Our SyMPIE is pre-trained on an upstream task/network that should not match the downstream ones and does not need paired clean-corrupted samples. Our key insight is that most input corruptions found in real-world tasks can be modeled through global operations on color channels of images or spatial filters with small kernels. We validate our approach on multiple datasets and tasks, such as image classification (on ImageNetC, ImageNetC-Bar, VizWiz, and a newly proposed mixed corruption benchmark named ImageNetC-mixed) and semantic segmentation (on Cityscapes, ACDC, and DarkZurich) with consistent improvements of about 5% relative accuracy gain across the board. The code of our approach and the new ImageNetC-mixed benchmark will be made available upon publication.
Submitted 29 February, 2024; v1 submitted 28 February, 2024;
originally announced February 2024.
-
On-Device Speaker Anonymization of Acoustic Embeddings for ASR based on Flexible Location Gradient Reversal Layer
Authors:
Md Asif Jalal,
Pablo Peso Parada,
Jisi Zhang,
Karthikeyan Saravanan,
Mete Ozay,
Myoungji Han,
Jung In Lee,
Seokyeong Jung
Abstract:
Smart devices serviced by large-scale AI models necessitate user data transfer to the cloud for inference. For speech applications, this means transferring private user information, e.g., speaker identity. Our paper proposes a privacy-enhancing framework that targets speaker identity anonymization while preserving speech recognition accuracy for our downstream task, Automatic Speech Recognition (ASR). The proposed framework attaches flexible gradient reversal based speaker adversarial layers to target layers within an ASR model, where speaker adversarial training anonymizes acoustic embeddings generated by the targeted layers to remove speaker identity. We propose on-device deployment by execution of initial layers of the ASR model, and transmitting anonymized embeddings to the cloud, where the rest of the model is executed while preserving privacy. Experimental results show that our method efficiently reduces speaker recognition relative accuracy by 33%, and improves ASR performance by achieving 6.2% relative Word Error Rate (WER) reduction.
Submitted 25 July, 2023;
originally announced July 2023.
-
Online Continual Learning in Keyword Spotting for Low-Resource Devices via Pooling High-Order Temporal Statistics
Authors:
Umberto Michieli,
Pablo Peso Parada,
Mete Ozay
Abstract:
Keyword Spotting (KWS) models on embedded devices should adapt fast to new user-defined words without forgetting previous ones. Embedded devices have limited storage and computational resources, thus, they cannot save samples or update large models. We consider the setup of embedded online continual learning (EOCL), where KWS models with frozen backbone are trained to incrementally recognize new words from a non-repeated stream of samples, seen one at a time. To this end, we propose Temporal Aware Pooling (TAP) which constructs an enriched feature space computing high-order moments of speech features extracted by a pre-trained backbone. Our method, TAP-SLDA, updates a Gaussian model for each class on the enriched feature space to effectively use audio representations. In experimental analyses, TAP-SLDA outperforms competitors on several setups, backbones, and baselines, bringing a relative average gain of 11.3% on the GSC dataset.
Submitted 24 July, 2023;
originally announced July 2023.
-
A Model for Every User and Budget: Label-Free and Personalized Mixed-Precision Quantization
Authors:
Edward Fish,
Umberto Michieli,
Mete Ozay
Abstract:
Recent advancement in Automatic Speech Recognition (ASR) has produced large AI models, which become impractical for deployment in mobile devices. Model quantization is effective to produce compressed general-purpose models, however such models may only be deployed to a restricted sub-domain of interest. We show that ASR models can be personalized during quantization while relying on just a small set of unlabelled samples from the target domain. To this end, we propose myQASR, a mixed-precision quantization method that generates tailored quantization schemes for diverse users under any memory requirement with no fine-tuning. myQASR automatically evaluates the quantization sensitivity of network layers by analysing the full-precision activation values. We are then able to generate a personalised mixed-precision quantization scheme for any pre-determined memory budget. Results for large-scale ASR models show how myQASR improves performance for specific genders, languages, and speakers.
Submitted 11 February, 2024; v1 submitted 24 July, 2023;
originally announced July 2023.
-
Online Continual Learning for Robust Indoor Object Recognition
Authors:
Umberto Michieli,
Mete Ozay
Abstract:
Vision systems mounted on home robots need to interact with unseen classes in changing environments. Robots have limited computational resources, labelled data and storage capability. These requirements pose some unique challenges: models should adapt without forgetting past knowledge in a data- and parameter-efficient way. We characterize the problem as few-shot (FS) online continual learning (OCL), where robotic agents learn from a non-repeated stream of few-shot data updating only a few model parameters. Additionally, such models experience variable conditions at test time, where objects may appear in different poses (e.g., horizontal or vertical) and environments (e.g., day or night). To improve robustness of CL agents, we propose RobOCLe, which: 1) constructs an enriched feature space computing high order statistical moments from the embedded features of samples; and 2) computes similarity between high order statistics of the samples on the enriched feature space, and predicts their class labels. We evaluate robustness of CL models to train/test augmentations in various cases. We show that different moments allow RobOCLe to capture different properties of deformations, providing higher robustness with no decrease of inference speed.
Submitted 19 July, 2023;
originally announced July 2023.
-
Progressive Self-Distillation for Ground-to-Aerial Perception Knowledge Transfer
Authors:
Junjie Hu,
Chenyou Fan,
Mete Ozay,
Hua Feng,
Yuan Gao,
Tin Lun Lam
Abstract:
We study a practical yet unexplored problem: how a drone can perceive an environment from different flight heights. Unlike autonomous driving, where the perception is always conducted from a ground viewpoint, a flying drone may flexibly change its flight height due to specific tasks, requiring the capability for viewpoint invariant perception. Tackling such a problem with supervised learning will incur tremendous costs for data annotation of different flying heights. On the other hand, current semi-supervised learning methods are not effective under viewpoint differences. In this paper, we introduce the ground-to-aerial perception knowledge transfer and propose a progressive semi-supervised learning framework that enables drone perception using only labeled data of ground viewpoint and unlabeled data of flying viewpoints. Our framework has four core components: i) a dense viewpoint sampling strategy that splits the range of vertical flight height into a set of small pieces with evenly-distributed intervals, ii) nearest neighbor pseudo-labeling that infers labels of the nearest neighbor viewpoint with a model learned on the preceding viewpoint, iii) MixView that generates augmented images among different viewpoints to alleviate viewpoint differences, and iv) a progressive distillation strategy to gradually learn until reaching the maximum flying height. We collect a synthesized and a real-world dataset, and we perform extensive experimental analyses to show that our method yields 22.2% and 16.9% accuracy improvement on the synthesized and the real-world dataset, respectively. Code and datasets are available on https://github.com/FreeformRobotics/Progressive-Self-Distillation-for-Ground-to-Aerial-Perception-Knowledge-Transfer.
Submitted 16 April, 2023; v1 submitted 29 August, 2022;
originally announced August 2022.
-
Dense Depth Distillation with Out-of-Distribution Simulated Images
Authors:
Junjie Hu,
Chenyou Fan,
Mete Ozay,
Hualie Jiang,
Tin Lun Lam
Abstract:
We study data-free knowledge distillation (KD) for monocular depth estimation (MDE), which learns a lightweight model for real-world depth perception tasks by compressing it from a trained teacher model while lacking training data in the target domain. Owing to the essential difference between image classification and dense regression, previous methods of data-free KD are not applicable to MDE. To strengthen its applicability in real-world tasks, in this paper, we propose to apply KD with out-of-distribution simulated images. The major challenges to be resolved are i) lacking prior information about scene configurations of real-world training data and ii) domain shift between simulated and real-world images. To cope with these difficulties, we propose a tailored framework for depth distillation. The framework generates new training samples for embracing a multitude of possible object arrangements in the target domain and utilizes a transformation network to efficiently adapt them to the feature statistics preserved in the teacher model. Through extensive experiments on various depth estimation models and two different datasets, we show that our method outperforms the baseline KD by a good margin and even achieves slightly better performance with as few as 1/6 of training images, demonstrating a clear superiority.
Submitted 7 December, 2023; v1 submitted 26 August, 2022;
originally announced August 2022.
-
pMCT: Patched Multi-Condition Training for Robust Speech Recognition
Authors:
Pablo Peso Parada,
Agnieszka Dobrowolska,
Karthikeyan Saravanan,
Mete Ozay
Abstract:
We propose a novel Patched Multi-Condition Training (pMCT) method for robust Automatic Speech Recognition (ASR). pMCT employs Multi-condition Audio Modification and Patching (MAMP) via mixing patches of the same utterance extracted from clean and distorted speech. Training using patch-modified signals improves robustness of models in noisy reverberant scenarios. Our proposed pMCT is evaluated on the LibriSpeech dataset showing improvement over using vanilla Multi-Condition Training (MCT). For analyses on robust ASR, we employed pMCT on the VOiCES dataset which is a noisy reverberant dataset created using utterances from LibriSpeech. In the analyses, pMCT achieves 23.1% relative WER reduction compared to the MCT.
Submitted 11 July, 2022;
originally announced July 2022.
-
FedNST: Federated Noisy Student Training for Automatic Speech Recognition
Authors:
Haaris Mehmood,
Agnieszka Dobrowolska,
Karthikeyan Saravanan,
Mete Ozay
Abstract:
Federated Learning (FL) enables training state-of-the-art Automatic Speech Recognition (ASR) models on user devices (clients) in distributed systems, hence preventing transmission of raw user data to a central server. A key challenge facing practical adoption of FL for ASR is obtaining ground-truth labels on the clients. Existing approaches rely on clients to manually transcribe their speech, which is impractical for obtaining large training corpora. A promising alternative is using semi-/self-supervised learning approaches to leverage unlabelled user data. To this end, we propose FedNST, a novel method for training distributed ASR models using private and unlabelled user data. We explore various facets of FedNST, such as training models with different proportions of labelled and unlabelled data, and evaluate the proposed approach on 1173 simulated clients. Evaluating FedNST on LibriSpeech, where 960 hours of speech data is split equally into server (labelled) and client (unlabelled) data, showed a 22.5% relative word error rate reduction (WERR) over a supervised baseline trained only on server data.
Submitted 12 July, 2022; v1 submitted 6 June, 2022;
originally announced June 2022.
-
Deep Depth Completion from Extremely Sparse Data: A Survey
Authors:
Junjie Hu,
Chenyu Bao,
Mete Ozay,
Chenyou Fan,
Qing Gao,
Honghai Liu,
Tin Lun Lam
Abstract:
Depth completion aims at predicting dense pixel-wise depth from an extremely sparse map captured from a depth sensor, e.g., LiDARs. It plays an essential role in various applications such as autonomous driving, 3D reconstruction, augmented reality, and robot navigation. Recent successes on the task have been demonstrated and dominated by deep learning based solutions. In this article, for the first time, we provide a comprehensive literature review that helps readers better grasp the research trends and clearly understand the current advances. We investigate the related studies from the design aspects of network architectures, loss functions, benchmark datasets, and learning strategies with a proposal of a novel taxonomy that categorizes existing methods. Besides, we present a quantitative comparison of model performance on three widely used benchmarks, including indoor and outdoor datasets. Finally, we discuss the challenges of prior works and provide readers with some insights for future research directions.
Submitted 29 August, 2022; v1 submitted 11 May, 2022;
originally announced May 2022.
-
A Mixed Quantization Network for Computationally Efficient Mobile Inverse Tone Mapping
Authors:
Juan Borrego-Carazo,
Mete Ozay,
Frederik Laboyrie,
Paul Wisbey
Abstract:
Recovering a high dynamic range (HDR) image from a single low dynamic range (LDR) image, namely inverse tone mapping (ITM), is challenging due to the lack of information in over- and under-exposed regions. Current methods focus exclusively on training high-performing but computationally inefficient ITM models, which in turn hinder deployment of the ITM models in resource-constrained environments with limited computing power such as edge and mobile device applications.
To this end, we propose combining efficient operations of deep neural networks with a novel mixed quantization scheme to construct a well-performing but computationally efficient mixed quantization network (MQN) which can perform single image ITM on mobile platforms. In the ablation studies, we explore the effect of using different attention mechanisms, quantization schemes, and loss functions on the performance of MQN in ITM tasks. In the comparative analyses, ITM models trained using MQN perform on par with the state-of-the-art methods on benchmark datasets. MQN models provide up to 10 times improvement on latency and 25 times improvement on memory consumption.
Submitted 12 March, 2022;
originally announced March 2022.
-
Adversarial Attacks and Defense Methods for Power Quality Recognition
Authors:
Jiwei Tian,
Buhong Wang,
Jing Li,
Zhen Wang,
Mete Ozay
Abstract:
Vulnerability of various machine learning methods to adversarial examples has been recently explored in the literature. Power systems which use these vulnerable methods face a serious threat from adversarial examples. To this end, we first propose a signal-specific method and a universal signal-agnostic method to attack power systems using generated adversarial examples. Black-box attacks based on transferable characteristics and the above two methods are also proposed and evaluated. We then adopt adversarial training to defend systems against adversarial attacks. Experimental analyses demonstrate that our signal-specific attack method provides less perturbation compared to the FGSM (Fast Gradient Sign Method), and our signal-agnostic attack method can generate perturbations fooling most natural signals with high probability. Moreover, the attack method based on the universal signal-agnostic algorithm has a higher transfer rate of black-box attacks than the attack method based on the signal-specific algorithm. In addition, the results show that the proposed adversarial training improves robustness of power systems to adversarial examples.
Submitted 11 February, 2022;
originally announced February 2022.
-
Task Guided Compositional Representation Learning for ZDA
Authors:
Shuang Liu,
Mete Ozay
Abstract:
Zero-shot domain adaptation (ZDA) methods aim to transfer knowledge about a task learned in a source domain to a target domain, while data from the target domain are not available. In this work, we address learning feature representations which are invariant to and shared among different domains considering task characteristics for ZDA. To this end, we propose a method for task-guided ZDA (TG-ZDA) which employs multi-branch deep neural networks to learn feature representations exploiting their domain invariance and shareability properties. The proposed TG-ZDA models can be trained end-to-end without requiring synthetic tasks and data generated from estimated representations of target domains. The proposed TG-ZDA has been examined using benchmark ZDA tasks on image classification datasets. Experimental results show that our proposed TG-ZDA outperforms state-of-the-art ZDA methods for different domains and tasks.
Submitted 13 September, 2021;
originally announced September 2021.
-
Prototype Guided Federated Learning of Visual Feature Representations
Authors:
Umberto Michieli,
Mete Ozay
Abstract:
Federated Learning (FL) is a framework which enables distributed model training using a large corpus of decentralized training data. Existing methods aggregate models disregarding their internal representations, which are crucial for training models in vision tasks. System and statistical heterogeneity (e.g., highly imbalanced and non-i.i.d. data) further harm model training. To this end, we introduce a method, called FedProto, which computes client deviations using margins of prototypical representations learned on distributed data, and applies them to drive federated optimization via an attention mechanism. In addition, we propose three methods to analyse statistical properties of feature representations learned in FL, in order to elucidate the relationship between accuracy, margins and feature discrepancy of FL models. In experimental analyses, FedProto demonstrates state-of-the-art accuracy and convergence rate across image classification and semantic segmentation benchmarks by enabling maximum margin training of FL models. Moreover, FedProto reduces uncertainty of predictions of FL models compared to the baseline. To our knowledge, this is the first work evaluating FL models in dense prediction tasks, such as semantic segmentation.
Submitted 19 May, 2021;
originally announced May 2021.
-
A New Neural Network Architecture Invariant to the Action of Symmetry Subgroups
Authors:
Piotr Kicki,
Mete Ozay,
Piotr Skrzypczyński
Abstract:
We propose a computationally efficient $G$-invariant neural network that approximates functions invariant to the action of a given permutation subgroup $G \leq S_n$ of the symmetric group on input data. The key element of the proposed network architecture is a new $G$-invariant transformation module, which produces a $G$-invariant latent representation of the input data. Theoretical considerations are supported by numerical experiments, which demonstrate the effectiveness and strong generalization properties of the proposed method in comparison to other $G$-invariant neural networks.
Submitted 11 December, 2020;
originally announced December 2020.
-
Learning from Experience for Rapid Generation of Local Car Maneuvers
Authors:
Piotr Kicki,
Tomasz Gawron,
Krzysztof Ćwian,
Mete Ozay,
Piotr Skrzypczyński
Abstract:
Being able to rapidly respond to the changing scenes and traffic situations by generating feasible local paths is of pivotal importance for car autonomy. We propose to train a deep neural network (DNN) to plan feasible and nearly-optimal paths for kinematically constrained vehicles in small constant time. Our DNN model is trained using a novel weakly supervised approach and a gradient-based policy search. On real and simulated scenes and a large set of local planning problems, we demonstrate that our approach outperforms the existing planners with respect to the number of successfully completed tasks. While the path generation time is about 40 ms, the generated paths are smooth and comparable to those obtained from conventional path planners.
Submitted 7 December, 2020;
originally announced December 2020.
-
A Computationally Efficient Neural Network Invariant to the Action of Symmetry Subgroups
Authors:
Piotr Kicki,
Mete Ozay,
Piotr Skrzypczyński
Abstract:
We introduce a method to design a computationally efficient $G$-invariant neural network that approximates functions invariant to the action of a given permutation subgroup $G \leq S_n$ of the symmetric group on input data. The key element of the proposed network architecture is a new $G$-invariant transformation module, which produces a $G$-invariant latent representation of the input data. This latent representation is then processed with a multi-layer perceptron in the network. We prove the universality of the proposed architecture, discuss its properties and highlight its computational and memory efficiency. Theoretical considerations are supported by numerical experiments involving different network configurations, which demonstrate the effectiveness and strong generalization properties of the proposed method in comparison to other $G$-invariant neural networks.
Submitted 18 February, 2020;
originally announced February 2020.
-
Fine-grained Optimization of Deep Neural Networks
Authors:
Mete Ozay
Abstract:
In recent studies, several asymptotic upper bounds on generalization errors on deep neural networks (DNNs) are theoretically derived. These bounds are functions of several norms of weights of the DNNs, such as the Frobenius and spectral norms, and they are computed for weights grouped according to either input and output channels of the DNNs. In this work, we conjecture that if we can impose multiple constraints on weights of DNNs to upper bound the norms of the weights, and train the DNNs with these weights, then we can attain empirical generalization errors closer to the derived theoretical bounds, and improve accuracy of the DNNs.
To this end, we pose two problems. First, we aim to obtain weights whose different norms are all upper bounded by a constant number, e.g. 1.0. To achieve these bounds, we propose a two-stage renormalization procedure: (i) normalization of weights according to different norms used in the bounds, and (ii) reparameterization of the normalized weights to set a constant and finite upper bound of their norms. In the second problem, we consider training DNNs with these renormalized weights. To this end, we first propose a strategy to construct joint spaces (manifolds) of weights according to different constraints in DNNs. Next, we propose a fine-grained SGD algorithm (FG-SGD) for optimization on the weight manifolds to train DNNs with assurance of convergence to minima. Experimental results show that image classification accuracy of baseline DNNs can be boosted using FG-SGD on collections of manifolds identified by multiple constraints.
Submitted 22 May, 2019;
originally announced May 2019.
-
Improving Head Pose Estimation with a Combined Loss and Bounding Box Margin Adjustment
Authors:
Mingzhen Shao,
Zhun Sun,
Mete Ozay,
Takayuki Okatani
Abstract:
We address the problem of estimating the pose of a person's head from an RGB image. The employment of CNNs for the problem has contributed to significant improvement in accuracy in recent works. However, we show that the following two methods, despite their simplicity, can attain further improvement: (i) proper adjustment of the margin of the bounding box of a detected face, and (ii) choice of loss functions. We show that the integration of these two methods achieves the new state of the art on standard benchmark datasets for in-the-wild head pose estimation.
Submitted 14 May, 2019;
originally announced May 2019.
-
Revisiting Single Image Depth Estimation: Toward Higher Resolution Maps with Accurate Object Boundaries
Authors:
Junjie Hu,
Mete Ozay,
Yan Zhang,
Takayuki Okatani
Abstract:
This paper considers the problem of single image depth estimation. The employment of convolutional neural networks (CNNs) has recently brought about significant advancements in the research of this problem. However, most existing methods suffer from loss of spatial resolution in the estimated depth maps; a typical symptom is distorted and blurry reconstruction of object boundaries. In this paper, toward more accurate estimation with a focus on depth maps with higher spatial resolution, we propose two improvements to existing approaches. One concerns the strategy of fusing features extracted at different scales, for which we propose an improved network architecture consisting of four modules: an encoder, decoder, multi-scale feature fusion module, and refinement module. The other concerns the loss functions used in training to measure inference errors. We show that three loss terms, which measure errors in depth, gradients and surface normals, respectively, contribute to improvement of accuracy in a complementary fashion. Experimental results show that these two improvements attain higher accuracy than the current state of the art, owing to finer-resolution reconstruction, for example, of small objects and object boundaries.
Submitted 22 September, 2018; v1 submitted 23 March, 2018;
originally announced March 2018.
-
Exploiting the Potential of Standard Convolutional Autoencoders for Image Restoration by Evolutionary Search
Authors:
Masanori Suganuma,
Mete Ozay,
Takayuki Okatani
Abstract:
Researchers have applied deep neural networks to image restoration tasks, in which they proposed various network architectures, loss functions, and training methods. In particular, adversarial training, which is employed in recent studies, seems to be a key ingredient to success. In this paper, we show that simple convolutional autoencoders (CAEs) built upon only standard network components, i.e., convolutional layers and skip connections, can outperform the state-of-the-art methods which employ adversarial training and sophisticated loss functions. The secret is to employ an evolutionary algorithm to automatically search for good architectures. Training optimized CAEs by minimizing the $\ell_2$ loss between reconstructed images and their ground truths using the ADAM optimizer is all we need. Our experimental results show that this approach achieves 27.8 dB peak signal to noise ratio (PSNR) on the CelebA dataset and 40.4 dB on the SVHN dataset, compared to 22.8 dB and 33.0 dB provided by the former state-of-the-art methods, respectively.
Submitted 1 March, 2018;
originally announced March 2018.
-
Deep Structured Energy-Based Image Inpainting
Authors:
Fazil Altinel,
Mete Ozay,
Takayuki Okatani
Abstract:
In this paper, we propose a structured image inpainting method employing an energy based model. In order to learn structural relationship between patterns observed in images and missing regions of the images, we employ an energy-based structured prediction method. The structural relationship is learned by minimizing an energy function which is defined by a simple convolutional neural network. The experimental results on various benchmark datasets show that our proposed method significantly outperforms the state-of-the-art methods which use Generative Adversarial Networks (GANs). We obtained 497.35 mean squared error (MSE) on the Olivetti face dataset compared to 833.0 MSE provided by the state-of-the-art method. Moreover, we obtained 28.4 dB peak signal to noise ratio (PSNR) on the SVHN dataset and 23.53 dB on the CelebA dataset, compared to 22.3 dB and 21.3 dB, provided by the state-of-the-art methods, respectively. The code is publicly available.
Submitted 30 August, 2018; v1 submitted 24 January, 2018;
originally announced January 2018.
-
A vision based system for underwater docking
Authors:
Shuang Liu,
Mete Ozay,
Takayuki Okatani,
Hongli Xu,
Kai Sun,
Yang Lin
Abstract:
Autonomous underwater vehicles (AUVs) have been deployed for underwater exploration. However, their potential is confined by their limited on-board battery energy and data storage capacity. This problem has been addressed using docking systems that provide underwater recharging and data transfer for AUVs. In this work, we propose a vision based framework for underwater docking following these systems. The proposed framework comprises two modules: (i) a detection module which provides location information on underwater docking stations in 2D images captured by an on-board camera, and (ii) a pose estimation module which recovers the relative 3D position and orientation between docking stations and AUVs from the 2D images. For robust and credible detection of docking stations, we propose a convolutional neural network called Docking Neural Network (DoNN). For accurate pose estimation, a perspective-n-point algorithm is integrated into our framework. In order to examine our framework in underwater docking tasks, we collected a dataset of 2D images, named Underwater Docking Images Dataset (UDID), in an experimental water pool. To the best of our knowledge, UDID is the first publicly available underwater docking dataset. In the experiments, we first evaluate performance of the proposed detection module on UDID and its deformed variations. Next, we assess the accuracy of the pose estimation module by ground experiments, since it is not feasible to obtain true relative position and orientation between docking stations and AUVs under water. Then, we examine the pose estimation module by underwater experiments in our experimental water pool. Experimental results show that the proposed framework can be used to detect docking stations and estimate their relative pose efficiently and successfully, compared to the state-of-the-art baseline systems.
Submitted 12 December, 2017;
originally announced December 2017.
-
HyperNetworks with statistical filtering for defending adversarial examples
Authors:
Zhun Sun,
Mete Ozay,
Takayuki Okatani
Abstract:
Deep learning algorithms have been known to be vulnerable to adversarial perturbations in various tasks such as image classification. This problem was addressed by employing several defense methods for detection and rejection of particular types of attacks. However, training and manipulating networks according to particular defense schemes increases computational complexity of the learning algorithms. In this work, we propose a simple yet effective method to improve robustness of convolutional neural networks (CNNs) to adversarial attacks by using data dependent adaptive convolution kernels. To this end, we propose a new type of HyperNetwork in order to employ statistical properties of input data and features for computation of statistical adaptive maps. Then, we filter convolution weights of CNNs with the learned statistical maps to compute dynamic kernels. Thereby, weights and kernels are collectively optimized for learning of image classification models robust to adversarial attacks without employment of additional target detection and rejection algorithms. We empirically demonstrate that the proposed method enables CNNs to spontaneously defend against different types of attacks, e.g. attacks generated by Gaussian noise, fast gradient sign methods (Goodfellow et al., 2014) and a black-box attack (Narodytska & Kasiviswanathan, 2016).
Submitted 6 November, 2017;
originally announced November 2017.
-
Linear Discriminant Generative Adversarial Networks
Authors:
Zhun Sun,
Mete Ozay,
Takayuki Okatani
Abstract:
We develop a novel method for training of GANs for unsupervised and class conditional generation of images, called Linear Discriminant GAN (LD-GAN). The discriminator of an LD-GAN is trained to maximize the linear separability between distributions of hidden representations of generated and targeted samples, while the generator is updated based on the decision hyper-planes computed by performing LDA over the hidden representations. LD-GAN provides a concrete metric of separation capacity for the discriminator, and we experimentally show that it is possible to stabilize the training of LD-GAN simply by calibrating the update frequencies between generators and discriminators in the unsupervised case, without employment of normalization methods and constraints on weights. In the class conditional generation tasks, the proposed method shows improved training stability together with better generalization performance compared to WGAN that employs an auxiliary classifier.
Submitted 25 July, 2017;
originally announced July 2017.
-
Improving Robustness of Feature Representations to Image Deformations using Powered Convolution in CNNs
Authors:
Zhun Sun,
Mete Ozay,
Takayuki Okatani
Abstract:
In this work, we address the problem of improving the robustness of feature representations learned using convolutional neural networks (CNNs) to image deformation. We argue that higher moment statistics of feature distributions can be shifted by image deformations, and that this shift degrades performance and cannot be reduced by ordinary normalization methods, as observed in experimental analyses. In order to attenuate this effect, we apply additional non-linearity in CNNs by combining power functions with learnable parameters into the convolution operation. In the experiments, we observe that CNNs which employ the proposed method obtain a remarkable boost in both generalization performance and robustness under various types of deformations using large-scale benchmark datasets. For instance, a model equipped with the proposed method obtains a 3.3% performance boost in mAP on the PASCAL VOC object detection task using deformed images, compared to the reference model, while both models provide the same performance using original images. To the best of our knowledge, this is the first work that studies the robustness of deep features learned using CNNs to a wide range of deformations for object recognition and detection.
Submitted 25 July, 2017;
originally announced July 2017.
-
Information Potential Auto-Encoders
Authors:
Yan Zhang,
Mete Ozay,
Zhun Sun,
Takayuki Okatani
Abstract:
In this paper, we suggest a framework to make use of mutual information as a regularization criterion to train Auto-Encoders (AEs). In the proposed framework, AEs are regularized by minimization of the mutual information between input and encoding variables of AEs during the training phase. In order to estimate the entropy of the encoding variables and the mutual information, we propose a non-parametric method. We also give an information theoretic view of Variational AEs (VAEs), which suggests that VAEs can be considered as parametric methods that estimate entropy. Experimental results show that the proposed non-parametric models have more degree of freedom in terms of representation learning of features drawn from complex distributions such as Mixture of Gaussians, compared to methods which estimate entropy using parametric approaches, such as Variational AEs.
Submitted 6 August, 2017; v1 submitted 14 June, 2017;
originally announced June 2017.
-
Truncating Wide Networks using Binary Tree Architectures
Authors:
Yan Zhang,
Mete Ozay,
Shuohao Li,
Takayuki Okatani
Abstract:
A recent study shows that a wide deep network can obtain accuracy comparable to a deeper but narrower network. Compared to narrower and deeper networks, wide networks employ relatively fewer layers and have various important benefits, such as shorter running time on parallel computing devices and reduced sensitivity to vanishing gradient problems. However, the parameter size of a wide network can be very large due to the large width of each layer in the network. In order to keep the benefits of wide networks while improving their trade-off between parameter size and accuracy, we propose a binary tree architecture that truncates the architecture of wide networks by reducing the width of the networks. More precisely, in the proposed architecture, the width is continuously reduced from lower layers to higher layers in order to increase the expressive capacity of the network with a smaller increase in parameter size. Also, to ease the vanishing gradient problem, features obtained at different layers are concatenated to form the output of our architecture. By employing the proposed architecture on a baseline wide network, we can construct and train a new network with the same depth but considerably fewer parameters. In our experimental analyses, we observe that the proposed architecture enables us to obtain a better trade-off between parameter size and accuracy compared to baseline networks using various benchmark image classification datasets. The results show that our model can decrease the classification error of the baseline from 20.43% to 19.22% on CIFAR-100 using only 28% of the parameters of the baseline. Code is available at https://github.com/ZhangVision/bitnet.
Submitted 3 April, 2017;
originally announced April 2017.
-
Optimization on Product Submanifolds of Convolution Kernels
Authors:
Mete Ozay,
Takayuki Okatani
Abstract:
Recent advances in optimization methods used for training convolutional neural networks (CNNs) with kernels, which are normalized according to particular constraints, have shown remarkable success. This work introduces an approach for training CNNs using ensembles of joint spaces of kernels constructed using different constraints. For this purpose, we address a problem of optimization on ensembles of products of submanifolds (PEMs) of convolution kernels. To this end, we first propose three strategies to construct ensembles of PEMs in CNNs. Next, we expound their geometric properties (metric and curvature properties) in CNNs. We make use of our theoretical results by developing a geometry-aware SGD algorithm (G-SGD) for optimization on ensembles of PEMs to train CNNs. Moreover, we analyze convergence properties of G-SGD considering geometric properties of PEMs. In the experimental analyses, we employ G-SGD to train CNNs on Cifar-10, Cifar-100 and Imagenet datasets. The results show that geometric adaptive step size computation methods of G-SGD can improve training loss and convergence properties of CNNs. Moreover, we observe that classification performance of baseline CNNs can be boosted using G-SGD on ensembles of PEMs identified by multiple constraints.
Submitted 27 November, 2017; v1 submitted 22 January, 2017;
originally announced January 2017.
-
Optimization on Submanifolds of Convolution Kernels in CNNs
Authors:
Mete Ozay,
Takayuki Okatani
Abstract:
Kernel normalization methods have been employed to improve robustness of optimization methods to reparametrization of convolution kernels, covariate shift, and to accelerate training of Convolutional Neural Networks (CNNs). However, our understanding of theoretical properties of these methods has lagged behind their success in applications. We develop a geometric framework to elucidate underlying mechanisms of a diverse range of kernel normalization methods. Our framework enables us to expound and identify geometry of space of normalized kernels. We analyze and delineate how state-of-the-art kernel normalization methods affect the geometry of search spaces of the stochastic gradient descent (SGD) algorithms in CNNs. Following our theoretical results, we propose a SGD algorithm with assurance of almost sure convergence of the methods to a solution at single minimum of classification loss of CNNs. Experimental results show that the proposed method achieves state-of-the-art performance for major image classification benchmarks with CNNs.
Submitted 22 October, 2016;
originally announced October 2016.
-
Encoding the Local Connectivity Patterns of fMRI for Cognitive State Classification
Authors:
Itir Onal Ertugrul,
Mete Ozay,
Fatos T. Yarman Vural
Abstract:
In this work, we propose a novel framework to encode the local connectivity patterns of the brain, using Fisher Vectors (FV), Vector of Locally Aggregated Descriptors (VLAD) and Bag-of-Words (BoW) methods. We first obtain local descriptors, called Mesh Arc Descriptors (MADs), from fMRI data by forming local meshes around anatomical regions and estimating their relationship within a neighborhood. Then, we extract a dictionary of relationships, called the brain connectivity dictionary, by fitting a generative Gaussian mixture model (GMM) to a set of MADs and selecting the codewords at the mean of each component of the mixture. Codewords represent the connectivity patterns among anatomical regions. We also encode MADs by VLAD and BoW methods using k-Means clustering.
We classify the cognitive states of the Human Connectome Project (HCP) task fMRI dataset, where we train support vector machines (SVMs) on the encoded MADs. Results demonstrate that FV encoding of MADs can be successfully employed for classification of cognitive tasks, and that it outperforms the VLAD and BoW representations. Moreover, we identify the significant Gaussians in the mixture models by computing the energy of their corresponding FV parts, and analyze their effect on classification accuracy. Finally, we suggest a new method to visualize the codewords of the brain connectivity dictionary.
Submitted 17 October, 2016;
originally announced October 2016.
-
Hierarchical Multi-resolution Mesh Networks for Brain Decoding
Authors:
Itir Onal Ertugrul,
Mete Ozay,
Fatos Tunay Yarman Vural
Abstract:
We propose a new framework, called Hierarchical Multi-resolution Mesh Networks (HMMNs), which establishes a set of brain networks at multiple time resolutions of the fMRI signal to represent the underlying cognitive process. The suggested framework first decomposes the fMRI signal into various frequency subbands using wavelet transforms. Then, a brain network, called a mesh network, is formed at each subband by ensembling a set of local meshes. The locality around each anatomic region is defined with respect to a neighborhood system based on functional connectivity. The arc weights of a mesh are estimated by ridge regression formed among the average region time series. In the final step, the adjacency matrices of the mesh networks obtained at different subbands are ensembled for brain decoding under a hierarchical learning architecture called fuzzy stacked generalization (FSG). Our results on the Human Connectome Project task-fMRI dataset show that the suggested HMMN model can successfully discriminate tasks by extracting complementary information obtained from the mesh arc weights of multiple subbands. We study the topological properties of the mesh networks at different resolutions using the network measures of node degree, node strength, betweenness centrality and global efficiency, and investigate the connectivity of anatomic regions during a cognitive task. We observe significant variations among the network topologies obtained for different subbands. We also analyze the diversity properties of the classifier ensemble trained by the mesh networks in multiple subbands, and observe that the classifiers in the ensemble collaborate with each other to fuse the complementary information freed at each subband. We conclude that the fMRI data recorded during a cognitive task embed diverse information across the anatomic regions at each resolution.
Submitted 11 January, 2017; v1 submitted 12 July, 2016;
originally announced July 2016.
-
Modeling the Sequence of Brain Volumes by Local Mesh Models for Brain Decoding
Authors:
Itir Onal,
Mete Ozay,
Eda Mizrak,
Ilke Oztekin,
Fatos T. Yarman Vural
Abstract:
We represent the sequence of fMRI (Functional Magnetic Resonance Imaging) brain volumes recorded during a cognitive stimulus by a graph which consists of a set of local meshes. The corresponding cognitive process, encoded in the brain, is then represented by these meshes each of which is estimated assuming a linear relationship among the voxel time series in a predefined locality. First, we define the concept of locality in two neighborhood systems, namely, the spatial and functional neighborhoods. Then, we construct spatially and functionally local meshes around each voxel, called seed voxel, by connecting it either to its spatial or functional p-nearest neighbors. The mesh formed around a voxel is a directed sub-graph with a star topology, where the direction of the edges is taken towards the seed voxel at the center of the mesh. We represent the time series recorded at each seed voxel in terms of linear combination of the time series of its p-nearest neighbors in the mesh. The relationships between a seed voxel and its neighbors are represented by the edge weights of each mesh, and are estimated by solving a linear regression equation. The estimated mesh edge weights lead to a better representation of information in the brain for encoding and decoding of the cognitive tasks. We test our model on a visual object recognition and emotional memory retrieval experiments using Support Vector Machines that are trained using the mesh edge weights as features. In the experimental analysis, we observe that the edge weights of the spatial and functional meshes perform better than the state-of-the-art brain decoding models.
Submitted 3 March, 2016;
originally announced March 2016.
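The mesh estimation step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mesh_edge_weights`, the use of Pearson correlation to choose functional neighbors, and the ordinary least-squares solver are all assumptions made for illustration. The sketch picks the p voxels most correlated with a seed voxel and regresses the seed's time series on theirs, so the fitted coefficients play the role of the mesh edge weights.

```python
import numpy as np


def mesh_edge_weights(series, seed, p):
    """Estimate edge weights of the local mesh around one seed voxel.

    series: array of shape (n_voxels, n_timepoints), one time series per voxel.
    seed:   index of the seed voxel.
    p:      number of neighbors connected to the seed.
    """
    # Functional neighborhood (assumption): rank voxels by Pearson correlation
    # with the seed and keep the p most correlated ones.
    corr = np.corrcoef(series)[seed].copy()
    corr[seed] = -np.inf  # exclude the seed itself
    neighbors = np.argsort(corr)[::-1][:p]

    # Model the seed series as a linear combination of its neighbors' series
    # and solve the resulting linear regression by least squares.
    X = series[neighbors].T  # shape (n_timepoints, p)
    y = series[seed]
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return neighbors, weights


# Synthetic check: the seed is an exact linear combination of two neighbors.
rng = np.random.default_rng(0)
neighbor_series = rng.standard_normal((2, 50))
seed_series = 2.0 * neighbor_series[0] - 0.5 * neighbor_series[1]
series = np.vstack([seed_series, neighbor_series])
neighbors, weights = mesh_edge_weights(series, seed=0, p=2)
```

On this synthetic input the recovered weights match the coefficients used to build the seed series. A spatial neighborhood could be substituted by replacing the correlation ranking with a distance ranking over voxel coordinates.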
-
Design of Kernels in Convolutional Neural Networks for Image Classification
Authors:
Zhun Sun,
Mete Ozay,
Takayuki Okatani
Abstract:
Despite the effectiveness of Convolutional Neural Networks (CNNs) for image classification, our understanding of the relationship between shape of convolution kernels and learned representations is limited. In this work, we explore and employ the relationship between shape of kernels which define Receptive Fields (RFs) in CNNs for learning of feature representations and image classification. For this purpose, we first propose a feature visualization method for visualization of pixel-wise classification score maps of learned features. Motivated by our experimental results, and observations reported in the literature for modeling of visual systems, we propose a novel design of shape of kernels for learning of representations in CNNs. In the experimental results, we achieved a state-of-the-art classification performance compared to a base CNN model [28] by reducing the number of parameters and computational time of the model using the ILSVRC-2012 dataset [24]. The proposed models also outperform the state-of-the-art models employed on the CIFAR-10/100 datasets [12] for image classification. Additionally, we analyzed the robustness of the proposed method to occlusion for classification of partially occluded images compared with the state-of-the-art methods. Our results indicate the effectiveness of the proposed approach. The code is available in github.com/minogame/caffe-qhconv.
Submitted 28 November, 2016; v1 submitted 30 November, 2015;
originally announced November 2015.
-
Integrating Deep Features for Material Recognition
Authors:
Yan Zhang,
Mete Ozay,
Xing Liu,
Takayuki Okatani
Abstract:
We propose a method for integration of features extracted using deep representations of Convolutional Neural Networks (CNNs) each of which is learned using a different image dataset of objects and materials for material recognition. Given a set of representations of multiple pre-trained CNNs, we first compute activations of features using the representations on the images to select a set of samples which are best represented by the features. Then, we measure the uncertainty of the features by computing the entropy of class distributions for each sample set. Finally, we compute the contribution of each feature to representation of classes for feature selection and integration. We examine the proposed method on three benchmark datasets for material recognition. Experimental results show that the proposed method achieves state-of-the-art performance by integrating deep features. Additionally, we introduce a new material dataset called EFMD by extending Flickr Material Database (FMD). By the employment of the EFMD with transfer learning for updating the learned CNN models, we achieve 84.0%+/-1.8% accuracy on the FMD dataset which is close to human performance that is 84.9%.
Submitted 21 April, 2016; v1 submitted 20 November, 2015;
originally announced November 2015.
-
Machine Learning Methods for Attack Detection in the Smart Grid
Authors:
Mete Ozay,
Inaki Esnaola,
Fatos T. Yarman Vural,
Sanjeev R. Kulkarni,
H. Vincent Poor
Abstract:
Attack detection problems in the smart grid are posed as statistical learning problems for different attack scenarios in which the measurements are observed in batch or online settings. In this approach, machine learning algorithms are used to classify measurements as being either secure or attacked. An attack detection framework is provided to exploit any available prior knowledge about the system and surmount constraints arising from the sparse structure of the problem in the proposed approach. Well-known batch and online learning algorithms (supervised and semi-supervised) are employed with decision and feature level fusion to model the attack detection problem. The relationships between statistical and geometric properties of attack vectors employed in the attack scenarios and learning algorithms are analyzed to detect unobservable attacks using statistical learning methods. The proposed algorithms are examined on various IEEE test systems. Experimental analyses show that machine learning algorithms can detect attacks with performances higher than the attack detection algorithms which employ state vector estimation methods in the proposed attack detection framework.
Submitted 22 March, 2015;
originally announced March 2015.
-
A Hierarchical Approach for Joint Multi-view Object Pose Estimation and Categorization
Authors:
Mete Ozay,
Krzysztof Walas,
Ales Leonardis
Abstract:
We propose a joint object pose estimation and categorization approach which extracts information about object poses and categories from the object parts and compositions constructed at different layers of a hierarchical object representation algorithm, namely Learned Hierarchy of Parts (LHOP). In the proposed approach, we first employ the LHOP to learn hierarchical part libraries which represent entity parts and compositions across different object categories and views. Then, we extract statistical and geometric features from the part realizations of the objects in the images in order to represent the information about object pose and category at each different layer of the hierarchy. Unlike the traditional approaches which consider specific layers of the hierarchies in order to extract information to perform specific tasks, we combine the information extracted at different layers to solve a joint object pose estimation and categorization problem using distributed optimization algorithms. We examine the proposed generative-discriminative learning approach and the algorithms on two benchmark 2-D multi-view image datasets. The proposed approach and the algorithms outperform state-of-the-art classification, regression and feature extraction algorithms. In addition, the experimental results shed light on the relationship between object categorization, pose estimation and the part realizations observed at different layers of the hierarchy.
Submitted 4 March, 2015;
originally announced March 2015.
-
Fusion of Image Segmentation Algorithms using Consensus Clustering
Authors:
Mete Ozay,
Fatos T. Yarman Vural,
Sanjeev R. Kulkarni,
H. Vincent Poor
Abstract:
A new segmentation fusion method is proposed that ensembles the output of several segmentation algorithms applied on a remotely sensed image. The candidate segmentation sets are processed to achieve a consensus segmentation using a stochastic optimization algorithm based on the Filtered Stochastic BOEM (Best One Element Move) method. For this purpose, Filtered Stochastic BOEM is reformulated as a segmentation fusion problem by designing a new distance learning approach. The proposed algorithm also embeds the computation of the optimum number of clusters into the segmentation fusion problem.
Submitted 18 February, 2015;
originally announced February 2015.
-
Semi-supervised Segmentation Fusion of Multi-spectral and Aerial Images
Authors:
Mete Ozay
Abstract:
A Semi-supervised Segmentation Fusion algorithm is proposed using consensus and distributed learning. The aim of Unsupervised Segmentation Fusion (USF) is to achieve a consensus among different segmentation outputs obtained from different segmentation algorithms by computing an approximate solution to the NP problem with less computational complexity. Semi-supervision is incorporated in USF using a new algorithm called Semi-supervised Segmentation Fusion (SSSF). In SSSF, side information about the co-occurrence of pixels in the same or different segments is formulated as the constraints of a convex optimization problem. The results of the experiments conducted on artificial and real-world benchmark multi-spectral and aerial images show that the proposed algorithms perform better than the individual state-of-the-art segmentation algorithms.
Submitted 25 February, 2015; v1 submitted 17 February, 2015;
originally announced February 2015.