-
Fractal Calibration for long-tailed object detection
Authors:
Konstantinos Panagiotis Alexandridis,
Ismail Elezi,
Jiankang Deng,
Anh Nguyen,
Shan Luo
Abstract:
Real-world datasets follow an imbalanced distribution, which poses significant challenges in rare-category object detection. Recent studies tackle this problem by developing re-weighting and re-sampling methods that utilise the class frequencies of the dataset. However, these techniques focus solely on the frequency statistics and ignore the distribution of the classes in image space, missing important information. In contrast to them, we propose FRActal CALibration (FRACAL): a novel post-calibration method for long-tailed object detection. FRACAL devises a logit adjustment method that utilises the fractal dimension to estimate how uniformly classes are distributed in image space. During inference, it uses the fractal dimension to inversely downweight the probabilities of uniformly spaced class predictions, achieving balance along two axes: between frequent and rare categories, and between uniformly spaced and sparsely spaced classes. FRACAL is a post-processing method that does not require any training, and it can be combined with many off-the-shelf models such as one-stage sigmoid detectors and two-stage instance segmentation models. FRACAL boosts rare-class performance by up to 8.6% and surpasses all previous methods on the LVIS dataset, while showing good generalisation to other datasets such as COCO, V3Det and OpenImages. The code will be released.
Submitted 15 October, 2024;
originally announced October 2024.
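FRACAL's key quantity is the fractal dimension of where a class appears in image space. The abstract does not name the estimator; a common choice is the box-counting dimension, sketched below under the assumption that each prediction of a class is summarised by its normalised box centre in $[0,1)^2$. The function name and the grid scales are illustrative assumptions, not the authors' implementation.

```python
import math


def box_counting_dimension(points, scales=(2, 4, 8, 16, 32)):
    """Estimate the box-counting dimension of 2-D points in [0, 1)^2.

    For each grid resolution s, count the occupied s x s cells N(s); the
    dimension is the least-squares slope of log N(s) against log s.
    (Illustrative sketch; the scales are an arbitrary choice.)
    """
    log_s, log_n = [], []
    for s in scales:
        # Map each point to its grid cell, clamping values at the upper edge.
        occupied = {
            (min(int(x * s), s - 1), min(int(y * s), s - 1)) for x, y in points
        }
        log_s.append(math.log(s))
        log_n.append(math.log(len(occupied)))
    mean_s = sum(log_s) / len(log_s)
    mean_n = sum(log_n) / len(log_n)
    cov = sum((a - mean_s) * (b - mean_n) for a, b in zip(log_s, log_n))
    var = sum((a - mean_s) ** 2 for a in log_s)
    return cov / var
```

Centres spread evenly over the image give a value near 2, centres on a line give about 1, and a single repeated location gives 0. In FRACAL's framing, classes closer to the uniform end are the ones whose probabilities are downweighted; the exact adjustment formula is not reproduced here.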
-
ECGN: A Cluster-Aware Approach to Graph Neural Networks for Imbalanced Classification
Authors:
Bishal Thapaliya,
Anh Nguyen,
Yao Lu,
Tian Xie,
Igor Grudetskyi,
Fudong Lin,
Antonios Valkanas,
Jingyu Liu,
Deepayan Chakraborty,
Bilel Fehri
Abstract:
Classifying nodes in a graph is a common problem. The ideal classifier must adapt to any imbalances in the class distribution. It must also use information in the clustering structure of real-world graphs. Existing Graph Neural Networks (GNNs) have not addressed both problems together. We propose the Enhanced Cluster-aware Graph Network (ECGN), a novel method that addresses these issues by integrating cluster-specific training with synthetic node generation. Unlike traditional GNNs that apply the same node update process for all nodes, ECGN learns different aggregations for different clusters. We also use the clusters to generate new minority-class nodes in a way that helps clarify the inter-class decision boundary. By combining cluster-aware embeddings with a global integration step, ECGN enhances the quality of the resulting node embeddings. Our method works with any underlying GNN and any cluster generation technique. Experimental results show that ECGN consistently outperforms its closest competitors by up to 11% on some widely studied benchmark datasets.
Submitted 15 October, 2024;
originally announced October 2024.
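The abstract says ECGN uses the clusters to generate new minority-class nodes but does not give the generator. As a rough illustration only, the sketch below applies SMOTE-style linear interpolation restricted to pairs of minority nodes from the same cluster, so synthetic points stay inside their cluster. The function `oversample_within_clusters` and its parameters are hypothetical; this is not ECGN's actual procedure, and edge creation for the new nodes is not shown.

```python
import random


def oversample_within_clusters(features, labels, clusters, minority, n_new, seed=0):
    """Create synthetic minority-class feature vectors (hypothetical sketch).

    Each synthetic vector interpolates between two minority nodes drawn from
    the same cluster: x_new = x_a + lam * (x_b - x_a), with lam ~ U[0, 1).
    """
    rng = random.Random(seed)
    by_cluster = {}
    for i, (y, c) in enumerate(zip(labels, clusters)):
        if y == minority:
            by_cluster.setdefault(c, []).append(i)
    # Interpolation needs at least two minority nodes in a cluster.
    pools = [idx for idx in by_cluster.values() if len(idx) >= 2]
    if not pools:
        return []
    synthetic = []
    for _ in range(n_new):
        pool = rng.choice(pools)
        a, b = rng.sample(pool, 2)
        lam = rng.random()
        synthetic.append(
            [fa + lam * (fb - fa) for fa, fb in zip(features[a], features[b])]
        )
    return synthetic
```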
-
Observation of disorder-free localization and efficient disorder averaging on a quantum processor
Authors:
Gaurav Gyawali,
Tyler Cochran,
Yuri Lensky,
Eliott Rosenberg,
Amir H. Karamlou,
Kostyantyn Kechedzhi,
Julia Berndtsson,
Tom Westerhout,
Abraham Asfaw,
Dmitry Abanin,
Rajeev Acharya,
Laleh Aghababaie Beni,
Trond I. Andersen,
Markus Ansmann,
Frank Arute,
Kunal Arya,
Nikita Astrakhantsev,
Juan Atalaya,
Ryan Babbush,
Brian Ballard,
Joseph C. Bardin,
Andreas Bengtsson,
Alexander Bilmes,
Gina Bortoli,
Alexandre Bourassa, et al. (195 additional authors not shown)
Abstract:
One of the most challenging problems in the computational study of localization in quantum many-body systems is to capture the effects of rare events, which requires sampling over exponentially many disorder realizations. We implement an efficient procedure on a quantum processor that leverages quantum parallelism to sample over all disorder realizations. We observe localization without disorder in quantum many-body dynamics in one and two dimensions: perturbations do not diffuse even though both the generator of evolution and the initial states are fully translationally invariant. The disorder strength as well as its density can be readily tuned using the initial state. Furthermore, we demonstrate the versatility of our platform by measuring Rényi entropies. Our method could also be extended to higher moments of the physical observables and disorder learning.
Submitted 9 October, 2024;
originally announced October 2024.
-
Manual Verbalizer Enrichment for Few-Shot Text Classification
Authors:
Quang Anh Nguyen,
Nadi Tomeh,
Mustapha Lebbah,
Thierry Charnois,
Hanene Azzag,
Santiago Cordoba Muñoz
Abstract:
With the continuous development of pre-trained language models, prompt-based training has become a well-adopted paradigm that drastically improves the exploitation of models for many natural language processing tasks. Prompting also performs well compared to traditional fine-tuning when adapted to zero-shot or few-shot scenarios where the amount of annotated data is limited. In this framework, verbalizers play an essential role, mapping masked-word distributions to output predictions. In this work, we propose MAVE, an approach for verbalizer construction that enriches class labels using neighborhood relations in the word embedding space for the text classification task. In addition, we develop a benchmarking procedure to evaluate typical verbalizer baselines for document classification in few-shot learning contexts. Our model achieves state-of-the-art results while using significantly fewer resources. We show that our approach is particularly effective in cases with extremely limited supervision data.
Submitted 8 October, 2024;
originally announced October 2024.
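The abstract describes enriching each class label with its neighbours in a word embedding space and using the resulting word sets as a verbalizer. A minimal sketch, assuming cosine nearest neighbours for the enrichment and an average of masked-token probabilities for class scoring; both choices, the function names and the toy vectors are hypothetical, not the paper's exact method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def enrich_label_words(label_word, embeddings, k=2):
    """Return the label word plus its k nearest neighbours by cosine similarity."""
    anchor = embeddings[label_word]
    candidates = [
        (cosine(anchor, vec), w) for w, vec in embeddings.items() if w != label_word
    ]
    candidates.sort(reverse=True)  # highest similarity first
    return [label_word] + [w for _, w in candidates[:k]]


def class_scores(mask_probs, verbalizer):
    """Average the masked-token probability over each class's label words."""
    return {
        cls: sum(mask_probs.get(w, 0.0) for w in words) / len(words)
        for cls, words in verbalizer.items()
    }


# Toy 2-D "embeddings" (hypothetical values for illustration only).
toy = {
    "sports": [1.0, 0.0], "football": [0.9, 0.1], "tennis": [0.8, 0.2],
    "economy": [0.0, 1.0], "market": [0.1, 0.9],
}
```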
-
Hybrid Gripper with Passive Pneumatic Soft Joints for Grasping Deformable Thin Objects
Authors:
Ngoc-Duy Tran,
Hoang-Hiep Ly,
Xuan-Thuan Nguyen,
Thi-Thoa Mac,
Anh Nguyen,
Tung D. Ta
Abstract:
Grasping a variety of objects remains a key challenge in the development of versatile robotic systems. The human hand is remarkably dexterous, capable of grasping and manipulating objects with diverse shapes, mechanical properties, and textures. Inspired by how humans use two fingers to pick up thin and large objects such as fabric or sheets of paper, we aim to develop a gripper optimized for grasping such deformable objects. Based on observations of how the soft and flexible fingertip joints of the hand approach and grasp thin materials, we propose a hybrid gripper design that incorporates both soft and rigid components. The gripper uses a soft pneumatic ring wrapped around a rigid revolute joint to create a flexible two-fingered gripper. Experiments were conducted to characterize and evaluate the gripper's performance in handling sheets of paper and other objects. Compared to rigid grippers, the proposed design improves grasping efficiency and reduces the gripping distance by up to a factor of eight.
Submitted 10 October, 2024; v1 submitted 8 October, 2024;
originally announced October 2024.
-
VPI-Mlogs: A web-based machine learning solution for applications in petrophysics
Authors:
Anh Tuan Nguyen
Abstract:
Machine learning is an important part of the data science field. In petrophysics, machine learning algorithms and applications have been widely applied. In this context, the Vietnam Petroleum Institute (VPI) has researched and deployed several effective prediction models, including missing-log prediction and fracture-zone and fracture-density forecasting. As one of our solutions, VPI-MLogs is a web-based deployment platform that integrates data preprocessing, exploratory data analysis, visualisation and model execution. Built with Python, the most popular programming language for data analysis, it gives users a powerful tool for working with petrophysical logs. The solution helps to narrow the gap between common knowledge and petrophysical insights. This article focuses on the web-based application, which integrates many solutions for handling petrophysical data.
Submitted 6 October, 2024;
originally announced October 2024.
-
Nonparametric tests for interaction in two-way ANOVA with balanced replications
Authors:
Bao Khue Tran,
Amy S. Wagaman,
Andrew Nguyen,
David Jacobson,
Bradley Hartlaub
Abstract:
Nonparametric procedures are more powerful for detecting interaction in two-way ANOVA when the data are non-normal. In this paper, we compute null critical values for the aligned rank-based tests (APCSSA/APCSSM) when each factor has between 2 and 6 levels. Using Monte Carlo simulations, we compare the performance of these new procedures with the ANOVA F-test for interaction, the adjusted rank transform test (ART), Conover's rank transform procedure (RT), and a rank-based ANOVA test (raov). The new procedures APCSSA/APCSSM are comparable with existing competitors in all settings. Although no single test dominates in detecting interaction effects for non-normal data, the nonparametric procedure APCSSM is the most highly recommended for settings with Cauchy errors.
Submitted 6 October, 2024;
originally announced October 2024.
-
Interpret Your Decision: Logical Reasoning Regularization for Generalization in Visual Classification
Authors:
Zhaorui Tan,
Xi Yang,
Qiufeng Wang,
Anh Nguyen,
Kaizhu Huang
Abstract:
Vision models excel in image classification but struggle to generalize to unseen data, such as classifying images from unseen domains or discovering novel categories. In this paper, we explore the relationship between logical reasoning and deep learning generalization in visual classification. We derive a logical regularization, termed L-Reg, that bridges a logical analysis framework to image classification. Our work reveals that L-Reg reduces the complexity of the model in terms of the feature distribution and classifier weights. Specifically, we unveil the interpretability brought by L-Reg, as it enables the model to extract salient features, such as faces when classifying persons, for classification. Theoretical analysis and experiments demonstrate that L-Reg enhances generalization across various scenarios, including multi-domain generalization and generalized category discovery. In complex real-world scenarios where images span unknown classes and unseen domains, L-Reg consistently improves generalization, highlighting its practical efficacy.
Submitted 16 October, 2024; v1 submitted 6 October, 2024;
originally announced October 2024.
-
The Ni isotopic composition of Ryugu reveals a common accretion region for carbonaceous chondrites
Authors:
Fridolin Spitzer,
Thorsten Kleine,
Christoph Burkhardt,
Timo Hopp,
Tetsuya Yokoyama,
Yoshinari Abe,
Jérôme Aléon,
Conel M. O'D. Alexander,
Sachiko Amari,
Yuri Amelin,
Ken-ichi Bajo,
Martin Bizzarro,
Audrey Bouvier,
Richard W. Carlson,
Marc Chaussidon,
Byeon-Gak Choi,
Nicolas Dauphas,
Andrew M. Davis,
Tommaso Di Rocco,
Wataru Fujiya,
Ryota Fukai,
Ikshu Gautam,
Makiko K. Haba,
Yuki Hibiya,
Hiroshi Hidaka, et al. (66 additional authors not shown)
Abstract:
The isotopic compositions of samples returned from Cb-type asteroid Ryugu and Ivuna-type (CI) chondrites are distinct from other carbonaceous chondrites, which has led to the suggestion that Ryugu and CI chondrites formed in a different region of the accretion disk, possibly around the orbits of Uranus and Neptune. We show that, as for Fe, Ryugu and CI chondrites also have indistinguishable Ni isotope anomalies, which differ from those of other carbonaceous chondrites. We propose that this unique Fe and Ni isotopic composition reflects different accretion efficiencies of small FeNi metal grains among the carbonaceous chondrite parent bodies. The CI chondrites incorporated these grains more efficiently, possibly because they formed at the end of the disk's lifetime, when planetesimal formation was also triggered by photoevaporation of the disk. Isotopic variations among carbonaceous chondrites may thus reflect fractionation of distinct dust components from a common reservoir, implying that CI chondrites and Ryugu may have formed in the same region of the accretion disk as other carbonaceous chondrites.
Submitted 5 October, 2024;
originally announced October 2024.
-
Safe Reference Tracking and Collision Avoidance for Taxiing Aircraft Using an MPC-CBF Framework
Authors:
Brooks A. Butler,
Zarif Cabrera,
Andy Nguyen,
Philip E. Paré
Abstract:
In this paper, we develop a framework for the automatic taxiing of aircraft between hangar and take-off given a graph-based model of an airport. We implement a high-level path-planning algorithm that models taxiway intersections as nodes in an undirected graph, algorithmically constructs a directed graph according to the physical limitations of the aircraft, and finds the shortest valid taxi path through the directed graph using Dijkstra's algorithm. We then use this shortest path to construct a reference trajectory for the aircraft to follow that considers the turning capabilities of a given aircraft. Using high-order control barrier functions (HOCBFs), we construct safety conditions for multi-obstacle avoidance and safe reference tracking for simple 2D unicycle dynamics with acceleration control inputs. We then use these safety conditions to design an MPC-CBF framework that tracks the reference trajectory while adhering to the safety constraints. We compare the performance of our MPC-CBF controller with a PID-CBF control method via simulations.
Submitted 4 October, 2024;
originally announced October 2024.
-
LOO-PIT: A sensitive posterior test
Authors:
Alan B. H. Nguyen,
Marco Bonici,
Glen McGee,
Will J. Percival
Abstract:
With the advent of the next generation of astrophysics experiments, the volume of data available to researchers will be greater than ever. As these projects will significantly drive down statistical uncertainties in measurements, it is crucial to develop novel tools to assess the ability of our models to fit these data within the specified errors. We introduce to astronomy the Leave One Out-Probability Integral Transform (LOO-PIT) technique. This first estimates the LOO posterior predictive distributions based on the model and likelihood distribution specified, then evaluates the quality of the match between the model and data by applying the PIT to each estimated distribution and data point, outputting a LOO-PIT distribution. Deviations between this output distribution and that expected can be characterised visually and with a standard Kolmogorov–Smirnov distribution test. We compare LOO-PIT and the more common $\chi^2$ test using both a simplified model and a more realistic astrophysics problem, where we consider fitting Baryon Acoustic Oscillations in galaxy survey data with contamination from emission line interlopers. LOO-PIT and $\chi^2$ tend to find different signals from the contaminants, and using these tests in conjunction increases the statistical power compared to using either test alone. We also show that LOO-PIT outperforms $\chi^2$ in certain realistic test cases.
Submitted 4 October, 2024;
originally announced October 2024.
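The LOO-PIT pipeline described above has three steps: a leave-one-out predictive for each point, the PIT of each held-out point under that predictive, and a Kolmogorov–Smirnov comparison of the PIT values against Uniform(0, 1). A minimal sketch, assuming a toy model in which the LOO predictive is available in closed form: a constant mean with known noise `sigma` and a flat prior, so the predictive for $y_i$ is Gaussian with mean equal to the mean of the other points and variance $\sigma^2(1 + 1/(n-1))$. Realistic models need an approximation of the LOO predictive instead. The function names are illustrative.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def gaussian_mean_loo(y, sigma):
    """Exact LOO predictive (means, sds) for a constant-mean model with known sigma
    and a flat prior on the mean (toy assumption)."""
    n = len(y)
    total = sum(y)
    means = [(total - yi) / (n - 1) for yi in y]
    sd = sigma * math.sqrt(1.0 + 1.0 / (n - 1))
    return means, [sd] * n


def loo_pit_values(y, loo_means, loo_sds):
    """PIT of each held-out point under its Gaussian LOO predictive."""
    return [normal_cdf(yi, m, s) for yi, m, s in zip(y, loo_means, loo_sds)]


def ks_statistic_uniform(u):
    """One-sample Kolmogorov-Smirnov statistic D against Uniform(0, 1)."""
    u = sorted(u)
    n = len(u)
    d_plus = max((i + 1) / n - ui for i, ui in enumerate(u))
    d_minus = max(ui - i / n for i, ui in enumerate(u))
    return max(d_plus, d_minus)
```

A well-specified model should give PIT values close to uniform and hence a small D; D can then be compared with standard KS critical values for the sample size.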
-
Multilevel Picard approximations overcome the curse of dimensionality when approximating semilinear heat equations with gradient-dependent nonlinearities in $L^p$-sense
Authors:
Tuan Anh Nguyen
Abstract:
We prove that multilevel Picard approximations are capable of approximating solutions of semilinear heat equations in $L^{p}$-sense, ${p}\in [2,\infty)$, in the case of gradient-dependent, Lipschitz-continuous nonlinearities, in the sense that the computational effort of the multilevel Picard approximations grows at most polynomially in both the dimension $d$ and the reciprocal $1/ε$ of the prescribed accuracy $ε$.
Submitted 12 October, 2024; v1 submitted 30 September, 2024;
originally announced October 2024.
-
A Weakly Supervised Data Labeling Framework for Machine Lexical Normalization in Vietnamese Social Media
Authors:
Dung Ha Nguyen,
Anh Thi Hoang Nguyen,
Kiet Van Nguyen
Abstract:
This study introduces an automatic labeling framework to address the challenges of lexical normalization in social media texts for low-resource languages like Vietnamese. Social media data is rich and diverse, but the evolving and varied language used in these contexts makes manual labeling labor-intensive and expensive. To tackle these issues, we propose a framework that integrates semi-supervised learning with weak supervision techniques. This approach enhances the quality of the training dataset and expands its size while minimizing manual labeling effort. Our framework automatically labels raw data, converting non-standard vocabulary into standardized forms, thereby improving the accuracy and consistency of the training data. Experimental results demonstrate the effectiveness of our weak supervision framework in normalizing Vietnamese text, especially when utilizing Pre-trained Language Models. The proposed framework achieves an F1-score of 82.72% and maintains vocabulary integrity with an accuracy of up to 99.22%. Additionally, it effectively handles undiacritized text under various conditions. This framework significantly enhances natural language normalization quality and improves the accuracy of various NLP tasks, leading to an average accuracy increase of 1–3%.
Submitted 30 September, 2024;
originally announced September 2024.
-
Multilevel Picard approximations and deep neural networks with ReLU, leaky ReLU, and softplus activation overcome the curse of dimensionality when approximating semilinear parabolic partial differential equations in $L^p$-sense
Authors:
Ariel Neufeld,
Tuan Anh Nguyen
Abstract:
We prove that multilevel Picard approximations and deep neural networks with ReLU, leaky ReLU, and softplus activation are capable of approximating solutions of semilinear Kolmogorov PDEs in $L^\mathfrak{p}$-sense, $\mathfrak{p}\in [2,\infty)$, in the case of gradient-independent, Lipschitz-continuous nonlinearities, while the computational effort of the multilevel Picard approximations and the required number of parameters in the neural networks grow at most polynomially in both the dimension $d\in \mathbb{N}$ and the reciprocal of the prescribed accuracy $ε$.
Submitted 30 September, 2024;
originally announced September 2024.
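The abstract states a complexity result rather than an algorithm. For orientation, the sketch below implements a standard multilevel Picard (MLP) recursion in the simplest setting covered by this line of work: $\partial_t u + \tfrac12 \Delta u + f(u) = 0$ on $[0,T)\times\mathbb{R}^d$ with $u(T,\cdot)=g$, i.e. Brownian dynamics and a gradient-independent nonlinearity $f$. Each level $l$ uses $M^{n-l}$ samples of a uniform time $R \in [t,T)$ and telescopes $f(U_l) - f(U_{l-1})$ with independent inner approximations. The function `mlp` and the test functions are illustrative assumptions; this is not the authors' code, and the neural-network part of the result is not shown.

```python
import math
import random


def mlp(n, M, t, x, T, f, g, rng):
    """MLP approximation U_{n,M}(t, x) for du/dt + (1/2) Laplacian(u) + f(u) = 0,
    u(T, .) = g, with U_{0,M} = 0 (illustrative sketch)."""
    if n == 0:
        return 0.0
    # Monte Carlo estimate of E[g(x + W_{T-t})] with M^n samples.
    num = M ** n
    sd = math.sqrt(T - t)
    terminal = sum(
        g([xi + sd * rng.gauss(0.0, 1.0) for xi in x]) for _ in range(num)
    )
    result = terminal / num
    # Telescoping Picard corrections, fewer samples on finer levels.
    for level in range(n):
        num = M ** (n - level)
        acc = 0.0
        for _ in range(num):
            r = t + (T - t) * rng.random()  # uniform time in [t, T)
            s = math.sqrt(r - t)
            y = [xi + s * rng.gauss(0.0, 1.0) for xi in x]
            acc += f(mlp(level, M, r, y, T, f, g, rng))
            if level > 0:
                # Independent realisation of the coarser level at the same (r, y).
                acc -= f(mlp(level - 1, M, r, y, T, f, g, rng))
        result += (T - t) * acc / num
    return result
```

With a constant nonlinearity $f \equiv c$ and constant terminal value $g \equiv a$, every correction above level 0 cancels and the recursion returns $a + (T-t)c$ exactly, a convenient sanity check. The nested sample counts are why the analysis tracks computational effort as a function of $d$ and $1/ε$.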
-
NeuroMax: Enhancing Neural Topic Modeling via Maximizing Mutual Information and Group Topic Regularization
Authors:
Duy-Tung Pham,
Thien Trang Nguyen Vu,
Tung Nguyen,
Linh Ngo Van,
Duc Anh Nguyen,
Thien Huu Nguyen
Abstract:
Recent advances in neural topic models have concentrated on two primary directions: the integration of the inference network (encoder) with a pre-trained language model (PLM) and the modeling of the relationship between words and topics in the generative model (decoder). However, the use of large PLMs significantly increases inference costs, making them less practical for situations requiring low inference times. Furthermore, it is crucial to simultaneously model the relationships between topics and words as well as the interrelationships among topics themselves. In this work, we propose a novel framework called NeuroMax (Neural Topic Model with Maximizing Mutual Information with Pretrained Language Model and Group Topic Regularization) to address these challenges. NeuroMax maximizes the mutual information between the topic representation obtained from the encoder in neural topic models and the representation derived from the PLM. Additionally, NeuroMax employs optimal transport to learn the relationships between topics by analyzing how information is transported among them. Experimental results indicate that NeuroMax reduces inference time, generates more coherent topics and topic groups, and produces more representative document embeddings, thereby enhancing performance on downstream tasks.
Submitted 29 September, 2024;
originally announced September 2024.
-
Range-aware Positional Encoding via High-order Pretraining: Theory and Practice
Authors:
Viet Anh Nguyen,
Nhat Khang Ngo,
Truong Son Hy
Abstract:
Unsupervised pre-training on vast amounts of graph data is critical in real-world applications wherein labeled data is limited, such as molecule property prediction or materials science. Existing approaches pre-train models for specific graph domains, neglecting the inherent connections within networks. This limits their ability to transfer knowledge to various supervised tasks. In this work, we propose a novel pre-training strategy on graphs that focuses on modeling their multi-resolution structural information, allowing us to capture global information of the whole graph while preserving local structures around its nodes. We extend the work of Wavelet Positional Encoding (WavePE) from (Ngo et al., 2023) by pretraining a High-Order Permutation-Equivariant Autoencoder (HOPE-WavePE) to reconstruct node connectivities from their multi-resolution wavelet signals. Unlike existing positional encodings, our method is designed to be sensitive to the input graph size in downstream tasks, which allows it to efficiently capture global structure on graphs. Since our approach relies solely on the graph structure, it is also domain-agnostic and adaptable to datasets from various domains, thereby paving the way for developing general graph structure encoders and graph foundation models. We theoretically demonstrate that there exists a parametrization of such an architecture that can predict the output adjacency up to arbitrarily low error. We also evaluate HOPE-WavePE on graph-level prediction tasks from different areas and show its superiority compared to other methods.
Submitted 27 September, 2024;
originally announced September 2024.
-
Robotic-CLIP: Fine-tuning CLIP on Action Data for Robotic Applications
Authors:
Nghia Nguyen,
Minh Nhat Vu,
Tung D. Ta,
Baoru Huang,
Thieu Vo,
Ngan Le,
Anh Nguyen
Abstract:
Vision language models have played a key role in extracting meaningful features for various robotic applications. Among these, Contrastive Language-Image Pretraining (CLIP) is widely used in robotic tasks that require both vision and natural language understanding. However, CLIP was trained solely on static images paired with text prompts and has not yet been fully adapted for robotic tasks involving dynamic actions. In this paper, we introduce Robotic-CLIP to enhance robotic perception capabilities. We first gather and label large-scale action data, and then build our Robotic-CLIP by fine-tuning CLIP on 309,433 videos (~7.4 million frames) of action data using contrastive learning. By leveraging action data, Robotic-CLIP inherits CLIP's strong image performance while gaining the ability to understand actions in robotic contexts. Extensive experiments show that our Robotic-CLIP outperforms other CLIP-based models across various language-driven robotic tasks. Additionally, we demonstrate the practical effectiveness of Robotic-CLIP in real-world grasping applications.
Submitted 26 September, 2024;
originally announced September 2024.
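Robotic-CLIP is fine-tuned "using contrastive learning"; the abstract does not give the objective or how action frames are paired with text. For reference, the sketch below is the symmetric InfoNCE loss used to train CLIP: embeddings are L2-normalised, scaled cosine similarities form a logit matrix whose diagonal holds the matching image–text pairs, and cross-entropy is averaged over the image-to-text and text-to-image directions. The fixed `logit_scale` is an assumption (CLIP learns this temperature), and the paper may use a different variant.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def _cross_entropy_diag(logits):
    """Mean over rows i of -log softmax(logits[i])[i] (targets on the diagonal)."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_loss(image_emb, text_emb, logit_scale=100.0):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    imgs = [_normalize(v) for v in image_emb]
    txts = [_normalize(v) for v in text_emb]
    logits = [
        [logit_scale * sum(a * b for a, b in zip(u, w)) for w in txts] for u in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))
```

Correctly paired batches drive the diagonal logits up relative to the off-diagonal ones, so mismatched pairings yield a higher loss.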
-
Decentralized Federated Learning with Gradient Tracking over Time-Varying Directed Networks
Authors:
Duong Thuy Anh Nguyen,
Su Wang,
Duong Tung Nguyen,
Angelia Nedich,
H. Vincent Poor
Abstract:
We investigate the problem of agent-to-agent interaction in decentralized (federated) learning over time-varying directed graphs, and, in doing so, propose a consensus-based algorithm called DSGTm-TV. The proposed algorithm incorporates gradient tracking and heavy-ball momentum to distributively optimize a global objective function, while preserving local data privacy. Under DSGTm-TV, agents will update local model parameters and gradient estimates using information exchange with neighboring agents enabled through row- and column-stochastic mixing matrices, which we show guarantee both consensus and optimality. Our analysis establishes that DSGTm-TV exhibits linear convergence to the exact global optimum when exact gradient information is available, and converges in expectation to a neighborhood of the global optimum when employing stochastic gradients. Moreover, in contrast to existing methods, DSGTm-TV preserves convergence for networks with uncoordinated stepsizes and momentum parameters, for which we provide explicit bounds. These results enable agents to operate in a fully decentralized manner, independently optimizing their local hyper-parameters. We demonstrate the efficacy of our approach via comparisons with state-of-the-art baselines on real-world image classification and natural language processing tasks.
Submitted 25 September, 2024;
originally announced September 2024.
-
Visualizing Dynamics of Charges and Strings in (2+1)D Lattice Gauge Theories
Authors:
Tyler A. Cochran,
Bernhard Jobst,
Eliott Rosenberg,
Yuri D. Lensky,
Gaurav Gyawali,
Norhan Eassa,
Melissa Will,
Dmitry Abanin,
Rajeev Acharya,
Laleh Aghababaie Beni,
Trond I. Andersen,
Markus Ansmann,
Frank Arute,
Kunal Arya,
Abraham Asfaw,
Juan Atalaya,
Ryan Babbush,
Brian Ballard,
Joseph C. Bardin,
Andreas Bengtsson,
Alexander Bilmes,
Alexandre Bourassa,
Jenna Bovaird,
Michael Broughton,
David A. Browne
, et al. (167 additional authors not shown)
Abstract:
Lattice gauge theories (LGTs) can be employed to understand a wide range of phenomena, from elementary particle scattering in high-energy physics to effective descriptions of many-body interactions in materials. Studying dynamical properties of emergent phases can be challenging as it requires solving many-body problems that are generally beyond perturbative limits. We investigate the dynamics of local excitations in a $\mathbb{Z}_2$ LGT using a two-dimensional lattice of superconducting qubits. We first construct a simple variational circuit which prepares low-energy states that have a large overlap with the ground state; then we create particles with local gates and simulate their quantum dynamics via a discretized time evolution. As the effective magnetic field is increased, our measurements show signatures of transitioning from deconfined to confined dynamics. For confined excitations, the magnetic field induces a tension in the string connecting them. Our method allows us to experimentally image string dynamics in a (2+1)D LGT from which we uncover two distinct regimes inside the confining phase: for weak confinement the string fluctuates strongly in the transverse direction, while for strong confinement transverse fluctuations are effectively frozen. In addition, we demonstrate a resonance condition at which dynamical string breaking is facilitated. Our LGT implementation on a quantum processor presents a novel set of techniques for investigating emergent particle and string dynamics.
Submitted 25 September, 2024;
originally announced September 2024.
-
Generalization vs. Specialization under Concept Shift
Authors:
Alex Nguyen,
David J. Schwab,
Vudtiwat Ngampruetikorn
Abstract:
Machine learning models are often brittle under distribution shift, i.e., when data distributions at test time differ from those during training. Understanding this failure mode is central to identifying and mitigating safety risks of the mass adoption of machine learning. Here we analyze ridge regression under concept shift -- a form of distribution shift in which the input-label relationship changes at test time. We derive an exact expression for prediction risk in the high-dimensional limit. Our results reveal nontrivial effects of concept shift on generalization performance, depending on the properties of robust and nonrobust features of the input. We show that test performance can exhibit a nonmonotonic data dependence, even when double descent is absent. Finally, our experiments on MNIST and FashionMNIST suggest that this intriguing behavior is also present in classification problems.
Submitted 23 September, 2024;
originally announced September 2024.
-
GraspMamba: A Mamba-based Language-driven Grasp Detection Framework with Hierarchical Feature Learning
Authors:
Huy Hoang Nguyen,
An Vuong,
Anh Nguyen,
Ian Reid,
Minh Nhat Vu
Abstract:
Grasp detection is a fundamental robotic task critical to the success of many industrial applications. However, current language-driven models for this task often struggle with cluttered images, lengthy textual descriptions, or slow inference speed. We introduce GraspMamba, a new language-driven grasp detection method that employs hierarchical feature fusion with Mamba vision to tackle these challenges. By leveraging rich visual features of the Mamba-based backbone alongside textual information, our approach effectively enhances the fusion of multimodal features. GraspMamba represents the first Mamba-based grasp detection model to extract vision and language features at multiple scales, delivering robust performance and rapid inference time. Extensive experiments show that GraspMamba outperforms recent methods by a clear margin. We validate our approach through real-world robotic experiments, highlighting its fast inference speed.
Submitted 22 September, 2024;
originally announced September 2024.
-
Monomial Matrix Group Equivariant Neural Functional Networks
Authors:
Hoang V. Tran,
Thieu N. Vo,
Tho H. Tran,
An T. Nguyen,
Tan Minh Nguyen
Abstract:
Neural functional networks (NFNs) have recently gained significant attention due to their diverse applications, ranging from predicting network generalization and network editing to classifying implicit neural representations. Previous NFN designs often depend on permutation symmetries in neural networks' weights, which traditionally arise from the unordered arrangement of neurons in hidden layers. However, these designs do not take into account the weight scaling symmetries of $\operatorname{ReLU}$ networks, and the weight sign flipping symmetries of $\operatorname{sin}$ or $\operatorname{tanh}$ networks. In this paper, we extend the study of the group action on the network weights from the group of permutation matrices to the group of monomial matrices by incorporating scaling/sign-flipping symmetries. In particular, we encode these scaling/sign-flipping symmetries by designing our corresponding equivariant and invariant layers. We name our new family of NFNs the Monomial Matrix Group Equivariant Neural Functional Networks (Monomial-NFN). Because of the expansion of the symmetries, Monomial-NFN has far fewer independent trainable parameters than the baseline NFNs in the literature, thus enhancing the model's efficiency. Moreover, for fully connected and convolutional neural networks, we theoretically prove that all groups that leave these networks invariant while acting on their weight spaces are subgroups of the monomial matrix group. We provide empirical evidence to demonstrate the advantages of our model over existing baselines, achieving competitive performance and efficiency.
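The scaling and sign-flipping symmetries mentioned above can be checked numerically on a one-hidden-layer network. The sketch permutes hidden neurons and rescales each one (incoming weights and bias multiplied by $s_j$, outgoing weights divided by $s_j$), which is the action of a monomial matrix on the weights; the network function is unchanged because $\operatorname{ReLU}(s z) = s\,\operatorname{ReLU}(z)$ for $s > 0$ and $\tanh(-z) = -\tanh(z)$. The weights and helper names are illustrative assumptions, not part of Monomial-NFN.

```python
import math

ACTIVATIONS = {"relu": lambda z: max(0.0, z), "tanh": math.tanh}


def mlp(x, W1, b1, W2, act):
    """One-hidden-layer network: W2 @ act(W1 @ x + b1), using plain lists."""
    hidden = [act(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def apply_monomial(W1, b1, W2, perm, scales):
    """Act on the weights with a monomial matrix (permutation times diagonal).

    New hidden neuron k is old neuron j = perm[k]; its incoming weights and bias
    are multiplied by scales[j] and its outgoing weights are divided by scales[j].
    """
    W1_new = [[scales[j] * w for w in W1[j]] for j in perm]
    b1_new = [scales[j] * b1[j] for j in perm]
    W2_new = [[row[j] / scales[j] for j in perm] for row in W2]
    return W1_new, b1_new, W2_new
```

Positive scales other than 1 are not a symmetry of a $\tanh$ network, and negative scales are not a symmetry of a ReLU network, which is why the admissible subgroup of monomial matrices depends on the activation.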
Submitted 18 September, 2024;
originally announced September 2024.
-
Charting new regions of Cobalt's chemical space with maximally large magnetic anisotropy: A computational high-throughput study
Authors:
Lorenzo A. Mariano,
Vu Ha Anh Nguyen,
Valerio Briganti,
Alessandro Lunghi
Abstract:
Magnetic anisotropy slows down magnetic relaxation and plays a prominent role in the design of permanent magnets. Coordination compounds of Co(II) in particular exhibit large magnetic anisotropy in the presence of low-coordination environments and have been used as single-molecule magnet prototypes. However, only a limited sampling of Cobalt's vast chemical space has been performed, potentially obscuring alternative chemical routes toward large magnetic anisotropy. Here we perform a computational high-throughput exploration of Co(II)'s chemical space in search of new single-molecule magnets. We automatically assemble a diverse set of about 15000 novel complexes of Co(II) and fully characterize them with multi-reference ab initio methods. More than 100 compounds exhibit magnetic anisotropy comparable to or larger than leading known compounds. The analysis of these results shows that compounds with record-breaking magnetic anisotropy can also be achieved with coordination four or higher, going beyond the established paradigm of two-coordinated linear complexes.
Submitted 6 September, 2024;
originally announced September 2024.
-
Origin of yield stress and mechanical plasticity in biological tissues
Authors:
Anh Q. Nguyen,
Junxiang Huang,
Dapeng Bi
Abstract:
During development and under normal physiological conditions, biological tissues are continuously subjected to substantial mechanical stresses. In response to large deformations, cells in a tissue must undergo multicellular rearrangements in order to maintain integrity and robustness. However, how these events are connected in time and space remains unknown. Here, using computational and theoretical modeling, we studied the mechanical plasticity of epithelial monolayers under large deformations. Our results demonstrate that the jamming-unjamming (solid-fluid) transition in tissues can vary significantly depending on the degree of deformation, implying that tissues are highly unconventional materials. Using analytical modeling, we elucidate the origins of this behavior. We also demonstrate how a tissue accommodates large deformations through a collective series of rearrangements, which behave similarly to avalanches in non-living materials. We find that these tissue avalanches are governed by stress redistribution and the spatial distribution of vulnerable spots. Finally, we propose a simple and experimentally accessible framework to predict avalanches and infer tissue mechanical stress based on static images.
Submitted 6 September, 2024;
originally announced September 2024.
-
Provable Hyperparameter Tuning for Structured Pfaffian Settings
Authors:
Maria-Florina Balcan,
Anh Tuan Nguyen,
Dravyansh Sharma
Abstract:
Data-driven algorithm design automatically adapts algorithms to specific application domains, achieving better performance. In the context of parameterized algorithms, this approach involves tuning the algorithm parameters using problem instances drawn from the problem distribution of the target application domain. While empirical evidence supports the effectiveness of data-driven algorithm design, providing theoretical guarantees for several parameterized families remains challenging. This is due to the intricate behaviors of their corresponding utility functions, which typically exhibit piece-wise and discontinuous structures. In this work, we present refined frameworks for providing learning guarantees for parameterized data-driven algorithm design problems in both distributional and online learning settings. For the distributional learning setting, we introduce the Pfaffian GJ framework, an extension of the classical GJ framework, capable of providing learning guarantees for function classes for which the computation involves Pfaffian functions. Unlike the GJ framework, which is limited to function classes with computation characterized by rational functions, our proposed framework can deal with function classes involving Pfaffian functions, which are much more general and widely applicable. We then show that for many parameterized algorithms of interest, their utility function possesses a refined piece-wise structure, which automatically translates to learning guarantees using our proposed framework. For the online learning setting, we provide a new tool for verifying the dispersion property of a sequence of loss functions. This sufficient condition allows no-regret learning for sequences of piece-wise structured loss functions where the piece-wise structure involves Pfaffian transition boundaries.
Submitted 6 September, 2024;
originally announced September 2024.
-
Two-neutrino double electron capture of $^{124}$Xe in the first LUX-ZEPLIN exposure
Authors:
J. Aalbers,
D. S. Akerib,
A. K. Al Musalhi,
F. Alder,
C. S. Amarasinghe,
A. Ames,
T. J. Anderson,
N. Angelides,
H. M. Araújo,
J. E. Armstrong,
M. Arthurs,
A. Baker,
S. Balashov,
J. Bang,
J. W. Bargemann,
E. E. Barillier,
K. Beattie,
A. Bhatti,
A. Biekert,
T. P. Biesiadzinski,
H. J. Birch,
E. Bishop,
G. M. Blockinger,
B. Boxer,
C. A. J. Brew
, et al. (180 additional authors not shown)
Abstract:
The broad physics reach of the LUX-ZEPLIN (LZ) experiment covers rare phenomena beyond the direct detection of dark matter. We report precise measurements of the extremely rare decay of $^{124}$Xe through the process of two-neutrino double electron capture (2$\nu$2EC), utilizing a $1.39\,\mathrm{kg} \times \mathrm{yr}$ isotopic exposure from the first LZ science run. A half-life of $T_{1/2}^{2\nu2\mathrm{EC}} = (1.09 \pm 0.14_{\text{stat}} \pm 0.05_{\text{sys}}) \times 10^{22}\,\mathrm{yr}$ is observed with a statistical significance of $8.3\,\sigma$, in agreement with the literature. First empirical measurements of the KK capture fraction relative to other K-shell modes were conducted, and demonstrate consistency with recent signal models at the $1.4\,\sigma$ level.
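As a back-of-the-envelope check on the reported half-life, the number of $^{124}$Xe decays expected in the quoted isotopic exposure follows from $N \approx (\ln 2 / T_{1/2}) \cdot N_{\text{atoms}} \cdot t$, which is valid because the exposure time is negligible compared with $T_{1/2}$. The sketch below uses the SI Avogadro constant and an approximate $^{124}$Xe molar mass of 123.906 g/mol; it ignores detection efficiency, selection cuts, and the split between capture modes, so it is not the collaboration's signal estimate.

```python
import math

AVOGADRO = 6.02214076e23     # atoms per mole (exact SI value)
MOLAR_MASS_XE124 = 123.906   # g/mol, approximate value for Xe-124


def expected_decays(half_life_yr, exposure_kg_yr, molar_mass=MOLAR_MASS_XE124):
    """Expected decays N = (ln 2 / T_half) * atoms_per_kg * exposure.

    Uses the small-time approximation (exposure time << half-life) and applies
    no detection efficiency, cuts, or capture-mode branching.
    """
    decay_constant = math.log(2) / half_life_yr    # per year
    atoms_per_kg = AVOGADRO * 1000.0 / molar_mass   # nuclei in 1 kg of Xe-124
    return decay_constant * atoms_per_kg * exposure_kg_yr
```

For $T_{1/2} = 1.09 \times 10^{22}$ yr and an exposure of 1.39 kg·yr, this gives roughly 430 decays before any efficiency is applied, or about 310 per kg·yr of $^{124}$Xe.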
Submitted 30 August, 2024;
originally announced August 2024.
-
Large-Scale Demand Prediction in Urban Rail using Multi-Graph Inductive Representation Learning
Authors:
Dang Viet Anh Nguyen,
J. Victor Flensburg,
Fabrizio Cerreto,
Bianca Pascariu,
Paola Pellegrini,
Carlos Lima Azevedo,
Filipe Rodrigues
Abstract:
With the expansion of cities over time, URT (Urban Rail Transit) networks have also grown significantly. Demand prediction plays an important role in supporting planning, scheduling, fleet management, and other operational decisions. In this study, we propose an Origin-Destination (OD) demand prediction model called Multi-Graph Inductive Representation Learning (mGraphSAGE) for large-scale URT networks under operational uncertainties. Our main contributions are twofold: we enhance prediction results while ensuring scalability for large networks by relying simultaneously on multiple graphs, where each OD pair is a node and each graph encodes a distinct OD relationship, such as temporal or spatial correlation; and we show the importance of including operational uncertainties such as train delays and cancellations as inputs in demand prediction for daily operations. The model is validated on three different scales of the URT network in Copenhagen, Denmark. Experimental results show that by leveraging information from neighboring ODs and learning node representations via sampling and aggregation, mGraphSAGE is particularly suitable for OD demand prediction in large-scale URT networks, outperforming reference machine learning methods. Furthermore, during periods with train cancellations and delays, the performance gap between mGraphSAGE and other methods widens compared to normal operating conditions, demonstrating its ability to leverage system reliability information for predicting OD demand under uncertainty.
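The sample-and-aggregate idea behind GraphSAGE-style models can be sketched as one layer that, for each OD-pair node, averages a sampled subset of its neighbors in each relationship graph and combines the results with the node's own features. How the graphs are combined here is an illustrative assumption (a sum of per-graph mean aggregations, which is equivalent to concatenating the inputs and applying a single linear map); the node names, weights, and `sample_size` are hypothetical, and mGraphSAGE's actual architecture may differ.

```python
import random


def relu(z):
    return max(0.0, z)


def mean_neighbors(h, nbrs, dim):
    """Mean of the neighbors' feature vectors (zeros if there are none)."""
    if not nbrs:
        return [0.0] * dim
    return [sum(h[u][k] for u in nbrs) / len(nbrs) for k in range(dim)]


def multi_graph_sage_layer(h, graphs, W_self, W_graph, sample_size, rng):
    """One mean-aggregator layer over several graphs.

    h: {node: feature list}; graphs: list of {node: [neighbor, ...]}, e.g. one
    temporal and one spatial OD graph; W_self and each W_graph[g] are
    (out_dim x in_dim) lists of lists.
    """
    dim = len(next(iter(h.values())))
    out = {}
    for v, hv in h.items():
        z = [sum(w * x for w, x in zip(row, hv)) for row in W_self]
        for adj, W in zip(graphs, W_graph):
            nbrs = adj.get(v, [])
            if len(nbrs) > sample_size:  # neighbor sampling bounds the per-node cost
                nbrs = rng.sample(nbrs, sample_size)
            agg = mean_neighbors(h, nbrs, dim)
            z = [zi + sum(w * a for w, a in zip(row, agg)) for zi, row in zip(z, W)]
        out[v] = [relu(zi) for zi in z]
    return out
```

Capping the number of sampled neighbors keeps the per-node cost bounded as the network grows, which is the property the abstract relies on for scalability.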
Submitted 28 August, 2024;
originally announced August 2024.
-
Quantum error correction below the surface code threshold
Authors:
Rajeev Acharya,
Laleh Aghababaie-Beni,
Igor Aleiner,
Trond I. Andersen,
Markus Ansmann,
Frank Arute,
Kunal Arya,
Abraham Asfaw,
Nikita Astrakhantsev,
Juan Atalaya,
Ryan Babbush,
Dave Bacon,
Brian Ballard,
Joseph C. Bardin,
Johannes Bausch,
Andreas Bengtsson,
Alexander Bilmes,
Sam Blackwell,
Sergio Boixo,
Gina Bortoli,
Alexandre Bourassa,
Jenna Bovaird,
Leon Brill,
Michael Broughton,
David A. Browne
, et al. (224 additional authors not shown)
Abstract:
Quantum error correction provides a path to reach practical quantum computing by combining multiple physical qubits into a logical qubit, where the logical error rate is suppressed exponentially as more qubits are added. However, this exponential suppression only occurs if the physical error rate is below a critical threshold. In this work, we present two surface code memories operating below this threshold: a distance-7 code and a distance-5 code integrated with a real-time decoder. The logical error rate of our larger quantum memory is suppressed by a factor of $\Lambda$ = 2.14 $\pm$ 0.02 when increasing the code distance by two, culminating in a 101-qubit distance-7 code with 0.143% $\pm$ 0.003% error per cycle of error correction. This logical memory is also beyond break-even, exceeding its best physical qubit's lifetime by a factor of 2.4 $\pm$ 0.3. We maintain below-threshold performance when decoding in real time, achieving an average decoder latency of 63 $\mu$s at distance-5 up to a million cycles, with a cycle time of 1.1 $\mu$s. To probe the limits of our error-correction performance, we run repetition codes up to distance-29 and find that logical performance is limited by rare correlated error events occurring approximately once every hour, or 3 $\times$ 10$^9$ cycles. Our results present device performance that, if scaled, could realize the operational requirements of large-scale fault-tolerant quantum algorithms.
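The suppression factor $\Lambda$ relates logical error rates at successive odd distances, $\varepsilon_d = \varepsilon_{d_0} / \Lambda^{(d - d_0)/2}$. The sketch below applies that relation to the quoted distance-7 rate and also converts the repetition-code event spacing from cycles to wall-clock time. Holding $\Lambda$ constant across distances is a simplifying assumption, and the abstract itself notes that rare correlated errors ultimately limit performance, so any extrapolated value is illustrative only.

```python
# Values quoted in the abstract.
EPS_D7 = 0.00143       # 0.143% logical error per cycle at distance 7
LAMBDA = 2.14          # error suppression per increase of code distance by two
CYCLE_TIME_S = 1.1e-6  # 1.1 microsecond error-correction cycle


def projected_error_per_cycle(d, eps_ref=EPS_D7, d_ref=7, lam=LAMBDA):
    """Logical error per cycle at odd distance d, assuming Lambda stays constant."""
    return eps_ref / lam ** ((d - d_ref) / 2)


def cycles_to_hours(cycles, cycle_time_s=CYCLE_TIME_S):
    """Convert a number of error-correction cycles to wall-clock hours."""
    return cycles * cycle_time_s / 3600.0
```

`projected_error_per_cycle(5)` gives about 0.306% per cycle, and `cycles_to_hours(3e9)` gives about 0.92 h, consistent with the abstract's "approximately once every hour" at a 1.1 $\mu$s cycle time.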
Submitted 24 August, 2024;
originally announced August 2024.
-
CathAction: A Benchmark for Endovascular Intervention Understanding
Authors:
Baoru Huang,
Tuan Vo,
Chayun Kongtongvattana,
Giulio Dagnino,
Dennis Kundrat,
Wenqiang Chi,
Mohamed Abdelaziz,
Trevor Kwok,
Tudor Jianu,
Tuong Do,
Hieu Le,
Minh Nguyen,
Hoan Nguyen,
Erman Tjiputra,
Quang Tran,
Jianyang Xie,
Yanda Meng,
Binod Bhattarai,
Zhaorui Tan,
Hongbin Liu,
Hong Seng Gan,
Wei Wang,
Xi Yang,
Qiufeng Wang,
Jionglong Su
, et al. (13 additional authors not shown)
Abstract:
Real-time visual feedback from catheterization analysis is crucial for enhancing surgical safety and efficiency during endovascular interventions. However, existing datasets are often limited to specific tasks and small scales, and lack the comprehensive annotations necessary for broader endovascular intervention understanding. To tackle these limitations, we introduce CathAction, a large-scale dataset for catheterization understanding. Our CathAction dataset encompasses approximately 500,000 annotated frames for catheterization action understanding and collision detection, and 25,000 ground truth masks for catheter and guidewire segmentation. For each task, we benchmark recent related works in the field. We further discuss the challenges of endovascular intervention understanding compared to traditional computer vision tasks and point out open research questions. We hope that CathAction will facilitate the development of endovascular intervention understanding methods that can be applied to real-world applications. The dataset is available at https://airvlab.github.io/cathaction/.
Submitted 30 August, 2024; v1 submitted 23 August, 2024;
originally announced August 2024.
-
Small Sample Behavior of Wasserstein Projections, Connections to Empirical Likelihood, and Other Applications
Authors:
Sirui Lin,
Jose Blanchet,
Peter Glynn,
Viet Anh Nguyen
Abstract:
The empirical Wasserstein projection (WP) distance quantifies the Wasserstein distance from the empirical distribution to a set of probability measures satisfying given expectation constraints. The WP is a powerful tool because it mitigates the curse of dimensionality inherent in the Wasserstein distance, making it valuable for various tasks, including constructing statistics for hypothesis testing, optimally selecting the ambiguity size in Wasserstein distributionally robust optimization, and studying algorithmic fairness. While the weak convergence analysis of the WP as the sample size $n$ grows is well understood, higher-order (i.e., sharp) asymptotics of WP remain unknown. In this paper, we study the second-order asymptotic expansion and the Edgeworth expansion of WP, both expressed as power series of $n^{-1/2}$. These expansions are essential for improving the accuracy of confidence levels and for a power expansion analysis of WP-based tests of a moment-equation null hypothesis against local alternatives. As a by-product, we obtain insightful criteria for comparing the power of the Empirical Likelihood and Hotelling's $T^2$ tests against the WP-based test. This insight provides the first comprehensive guideline for selecting the most powerful local test among WP-based, empirical-likelihood-based, and Hotelling's $T^2$ tests for a null. Furthermore, we introduce Bartlett-type corrections to improve the approximation to WP distance quantiles and, thus, improve the coverage in WP applications.
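In the simplest case the WP distance has a closed form, which helps build intuition. Assuming squared Euclidean cost and a single mean constraint $\mathbb{E}_P[X] = \mu$, Jensen's inequality gives $\mathbb{E}\lVert X - Y \rVert^2 \ge \lVert \bar{x}_n - \mu \rVert^2$ for any coupling of the empirical measure with an admissible $P$, and translating every sample by $\mu - \bar{x}_n$ attains the bound, so the squared WP distance equals $\lVert \bar{x}_n - \mu \rVert^2$. The sketch below computes it together with a scaled statistic $n \cdot \mathrm{WP}^2$ of the kind used in WP-based tests; the function names are hypothetical, and the paper's general moment constraints and expansions are not reproduced.

```python
def wp_squared_mean_constraint(samples, mu):
    """Squared WP distance from the empirical measure of `samples` to
    {P : E_P[X] = mu} under squared Euclidean cost, i.e. ||xbar - mu||^2.

    Lower bound: E||X - Y||^2 >= ||E[X - Y]||^2 = ||xbar - mu||^2 (Jensen).
    Attained by the coupling Y = X + (mu - xbar), which shifts every sample.
    """
    n = len(samples)
    dim = len(samples[0])
    xbar = [sum(s[k] for s in samples) / n for k in range(dim)]
    return sum((xbar[k] - mu[k]) ** 2 for k in range(dim))


def wp_test_statistic(samples, mu):
    """Scaled statistic n * WP^2; it stays bounded in probability under the null."""
    return len(samples) * wp_squared_mean_constraint(samples, mu)
```

For the samples 1, 2, 3 and $\mu = 0$, the squared distance is 4.0 and $n \cdot \mathrm{WP}^2 = 12.0$. Under a true null, $\bar{x}_n - \mu = O_p(n^{-1/2})$, so this scaled statistic stays bounded in probability as $n$ grows; that is the regime whose higher-order corrections the paper studies.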
Submitted 21 August, 2024;
originally announced August 2024.
-
Open-Ended 3D Point Cloud Instance Segmentation
Authors:
Phuc D. A. Nguyen,
Minh Luu,
Anh Tran,
Cuong Pham,
Khoi Nguyen
Abstract:
Open-Vocab 3D Instance Segmentation methods (OV-3DIS) have recently demonstrated their ability to generalize to unseen objects. However, these methods still depend on predefined class names during testing, restricting the autonomy of agents. To mitigate this constraint, we propose a novel problem termed Open-Ended 3D Instance Segmentation (OE-3DIS), which eliminates the necessity for predefined class names during testing. Moreover, we contribute a comprehensive set of strong baselines, derived from OV-3DIS approaches and leveraging 2D Multimodal Large Language Models. To assess the performance of our OE-3DIS system, we introduce a novel Open-Ended score, evaluating both the semantic and geometric quality of predicted masks and their associated class names, alongside the standard AP score. Our approach demonstrates significant performance improvements over the baselines on the ScanNet200 and ScanNet++ datasets. Remarkably, our method surpasses the performance of Open3DIS, the current state-of-the-art method in OV-3DIS, even in the absence of ground-truth object class names.
Submitted 21 August, 2024;
originally announced August 2024.
-
Public Health in Disaster: Emotional Health and Life Incidents Extraction during Hurricane Harvey
Authors:
Thomas Hoang,
Quynh Anh Nguyen,
Long Nguyen
Abstract:
Countless disasters have resulted from climate change, causing severe damage to infrastructure and the economy. These disasters have significant societal impacts, necessitating mental health services for the millions affected. To prepare for and respond effectively to such events, it is important to understand people's emotions and the life incidents they experience before and after a disaster strikes. In this case study, we collected a dataset of approximately 400,000 public tweets related to Hurricane Harvey. Using a BERT-based model, we predicted the emotions associated with each tweet. To efficiently identify the life incidents discussed in these tweets, we utilized the Latent Dirichlet Allocation (LDA) technique for topic modeling, which allowed us to bypass manual content analysis and extract meaningful patterns from the data. However, rather than stopping at topic identification like previous methods \cite{math11244910}, we further refined our analysis by integrating Graph Neural Networks (GNN) and Large Language Models (LLM). The GNN was employed to generate embeddings and construct a similarity graph of the tweets, which was then used to optimize clustering. Subsequently, we used an LLM to automatically generate descriptive names for each event cluster, offering critical insights for disaster preparedness and response strategies.
Submitted 20 August, 2024;
originally announced August 2024.
-
Enhancing Material Screening at Boulby Underground Laboratory with XIA UltraLo-1800 Alpha Particle Detectors
Authors:
Sid El Moctar Ahmed Maouloud,
XinRan Liu,
Anh Nguyen,
James Edward Young Dobson,
Chamkaur Ghag,
Léna Le Floch,
Emma Meehan,
Alexander St. John Murphy,
Sean Paling,
Ruben Saakyan,
Paul Robert Scovell,
Christopher Toth
Abstract:
The Boulby UnderGround Screening (BUGS) facility, located at the Boulby Underground Laboratory, has significantly advanced its material screening capabilities by installing two XIA UltraLo-1800 alpha particle detectors. This study presents a comprehensive evaluation of one of these detectors, operated 1,100 meters underground at the Boulby Underground Laboratory, which provides significant shielding from cosmic radiation and maintains a low ambient radon activity of 2.30 $\pm$ 0.03 Bq/m$^3$. Our evaluation focuses on energy reconstruction accuracy, background radiation rates, and operational stability. The XIA UltraLo-1800 detector demonstrates remarkable stability in energy reconstruction, with less than 0.1 MeV variation over four years. Moreover, the implementation of a graphite-filled PTFE liner in the sample tray resulted in a significant reduction in background radiation levels compared to measurements with the original stainless steel tray, achieving an average activity of 0.15 $\pm$ 0.01 $\alpha$/cm$^2$/khr. Copper sample assays, performed before and after radon exposure, demonstrated the detector's ability to accurately identify and quantify $^{210}$Po contamination. By implementing the robust cleanliness procedures and protocols described in this article, we observed a reduction in $^{210}$Po activity from 0.504 $\pm$ 0.022 mBq to 0.336 $\pm$ 0.013 mBq, highlighting the crucial role of refined cleaning methods in minimizing background for sensitive experiments. Additionally, elevated background activity levels observed after measurements of high-activity samples illustrate the need for careful management of assay conditions and environment to maintain low background levels. These results highlight the potential of the XIA UltraLo-1800 in enhancing the precision of material assays essential for reducing background interference in rare event experiments.
Submitted 27 September, 2024; v1 submitted 13 August, 2024;
originally announced August 2024.
-
Sampling Foundational Transformer: A Theoretical Perspective
Authors:
Viet Anh Nguyen,
Minh Lenhat,
Khoa Nguyen,
Duong Duc Hieu,
Dao Huu Hung,
Truong Son Hy
Abstract:
The versatility of the self-attention mechanism has earned transformers great success in almost all data modalities, despite limitations such as quadratic complexity and difficulty of training. To apply transformers across different data modalities, practitioners have to make clever, modality-specific constructions. In this paper, we propose Sampling Foundational Transformer (SFT), which can work on multiple data modalities (e.g., point cloud, graph, and sequence) and constraints (e.g., rotational invariance). The existence of such a model is important as contemporary foundational modeling requires operability on multiple data sources. For efficiency on a large number of tokens, our model relies on our context-aware sampling-without-replacement mechanism for both linear asymptotic computational complexity and real inference time gain. To further improve efficiency, we rely on our newly discovered pseudoconvex formulation of the transformer layer to increase the model's convergence rate. As a model working on multiple data modalities, SFT has achieved competitive results on many benchmarks, while being faster in inference than other, highly specialized models.
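To illustrate why sampling tokens without replacement yields linear cost, the sketch below draws $k$ distinct tokens from importance scores and lets every query attend only to them, so the number of query-key scores is $n \cdot k$ rather than $n^2$. It swaps in the standard Gumbel-top-k trick, which matches sequential sampling without replacement with probabilities given by the softmax of the scores but is not differentiable; it is not the context-aware sampler proposed for SFT and SAMSA, and the `importance` scores, which those models derive from data, are given here as hypothetical inputs.

```python
import math
import random


def gumbel_top_k(scores, k, rng):
    """Draw k distinct token indices without replacement, following
    softmax(scores), via the Gumbel-top-k trick."""
    perturbed = []
    for s in scores:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        perturbed.append(s - math.log(-math.log(u)))  # score + Gumbel(0, 1) noise
    order = sorted(range(len(scores)), key=lambda i: perturbed[i], reverse=True)
    return order[:k]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sampled_attention(queries, keys, values, importance, k, rng):
    """Every query attends only to the same k sampled tokens: O(n * k) score
    computations instead of O(n^2) for full self-attention."""
    idx = gumbel_top_k(importance, k, rng)
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        logits = [dot(q, keys[i]) * scale for i in idx]
        m = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(z - m) for z in logits]
        total = sum(weights)
        out.append([sum(w * values[i][c] for w, i in zip(weights, idx)) / total
                    for c in range(len(values[0]))])
    return out
```

With `k` equal to the number of tokens, every token is selected and the output coincides with full softmax attention, which is a convenient sanity check; smaller `k` trades accuracy for cost.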
Submitted 17 August, 2024; v1 submitted 11 August, 2024;
originally announced August 2024.
-
SAMSA: Efficient Transformer for Many Data Modalities
Authors:
Minh Lenhat,
Viet Anh Nguyen,
Khoa Nguyen,
Duong Duc Hieu,
Dao Huu Hung,
Truong Son Hy
Abstract:
The versatility of the self-attention mechanism has earned transformers great success in almost all data modalities, though they remain limited by quadratic complexity and difficulty of training. Efficient transformers, on the other hand, often rely on clever modality-specific constructions to overcome the quadratic complexity of transformers. This greatly hinders their application to different data modalities, which is one of the pillars of contemporary foundational modeling. In this paper, we lay the groundwork for efficient foundational modeling by proposing SAMSA (SAMpling-Self-Attention), a context-aware, linear-complexity self-attention mechanism that works well on multiple data modalities. Our mechanism is based on a differentiable sampling-without-replacement method that we discovered. This enables the self-attention module to attend to the most important token set, where importance is defined by the data. Moreover, as differentiability is not needed at inference, the sparse formulation of our method adds little time overhead, further lowering computational costs. In short, SAMSA achieves competitive or even SOTA results on many benchmarks while being faster at inference than other, highly specialized models. Compared with full self-attention, real inference time decreases significantly, while performance ranges from negligible degradation to outperformance. We release our source code in the repository: https://github.com/HySonLab/SAMSA
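The abstract does not spell out SAMSA's sampler. A well-known way to draw a token subset without replacement is the Gumbel-top-k trick; the sketch below uses it (an assumption, not the authors' method) to show why attending to k sampled tokens costs O(n·k) rather than O(n²). It shows only hard, non-differentiable selection, the kind of sparse path the abstract says suffices at inference; the differentiable relaxation needed for training is omitted, queries/keys/values are the raw tokens for brevity, and `w_score` is a hypothetical importance projection.

```python
import numpy as np

def gumbel_top_k(log_scores, k, rng):
    """Sample k distinct indices with probabilities proportional to exp(log_scores)."""
    # Gumbel-top-k trick: add i.i.d. Gumbel(0, 1) noise to the log-scores and keep
    # the k largest; this matches sequential sampling without replacement.
    u = rng.uniform(np.finfo(float).tiny, 1.0, size=log_scores.shape)
    perturbed = log_scores - np.log(-np.log(u))
    return np.argsort(perturbed)[::-1][:k]

def sampled_attention(x, w_score, k, rng):
    """Every token attends only to k sampled tokens: O(n*k) instead of O(n^2)."""
    d = x.shape[1]
    log_scores = x @ w_score                 # hypothetical learned importance scores
    idx = gumbel_top_k(log_scores, k, rng)
    keys = x[idx]                            # (k, d) selected token set
    logits = x @ keys.T / np.sqrt(d)         # (n, k) scaled dot-product scores
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ keys                    # (n, d)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
out = sampled_attention(x, rng.standard_normal(8), k=4, rng=rng)
```

The cost saving comes from the (n, k) score matrix: with k fixed, memory and compute grow linearly in the number of tokens n.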
Submitted 18 August, 2024; v1 submitted 9 August, 2024;
originally announced August 2024.
-
XMainframe: A Large Language Model for Mainframe Modernization
Authors:
Anh T. V. Dau,
Hieu Trung Dao,
Anh Tuan Nguyen,
Hieu Trung Tran,
Phong X. Nguyen,
Nghi D. Q. Bui
Abstract:
Mainframe operating systems, despite their inception in the 1940s, continue to support critical sectors like finance and government. However, these systems are often viewed as outdated, requiring extensive maintenance and modernization. Addressing this challenge necessitates innovative tools that can understand and interact with legacy codebases. To this end, we introduce XMainframe, a state-of-the-art large language model (LLM) specifically designed with knowledge of mainframe legacy systems and COBOL codebases. Our solution involves the creation of an extensive data collection pipeline to produce high-quality training datasets, enhancing XMainframe's performance in this specialized domain. Additionally, we present MainframeBench, a comprehensive benchmark for assessing mainframe knowledge, including multiple-choice questions, question answering, and COBOL code summarization. Our empirical evaluations demonstrate that XMainframe consistently outperforms existing state-of-the-art LLMs across these tasks. Specifically, XMainframe achieves 30% higher accuracy than DeepSeek-Coder on multiple-choice questions, doubles the BLEU score of Mixtral-Instruct 8x7B on question answering, and scores six times higher than GPT-3.5 on COBOL summarization. Our work highlights the potential of XMainframe to drive significant advancements in managing and modernizing legacy systems, thereby enhancing productivity and saving time for software developers.
Submitted 26 August, 2024; v1 submitted 5 August, 2024;
originally announced August 2024.
-
e-Health CSIRO at RRG24: Entropy-Augmented Self-Critical Sequence Training for Radiology Report Generation
Authors:
Aaron Nicolson,
Jinghui Liu,
Jason Dowling,
Anthony Nguyen,
Bevan Koopman
Abstract:
The Shared Task on Large-Scale Radiology Report Generation (RRG24) aims to expedite the development of assistive systems for interpreting and reporting on chest X-ray (CXR) images. This task challenges participants to develop models that generate the findings and impression sections of radiology reports from the CXRs of a patient's study, using five different datasets. This paper outlines the e-Health CSIRO team's approach, which achieved multiple first-place finishes in RRG24. The core novelty of our approach lies in adding entropy regularisation to self-critical sequence training to maintain a higher entropy in the token distribution. This prevents overfitting to common phrases and ensures broader exploration of the vocabulary during training, which is essential for handling the diversity of the radiology reports in the RRG24 datasets. Our model is available on Hugging Face: https://huggingface.co/aehrc/cxrmate-rrg24.
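A minimal NumPy sketch of a self-critical loss with an entropy bonus, assuming the standard self-critical sequence training formulation (the reward of the greedy-decoded report serves as the baseline) and a hypothetical weight `entropy_weight`. Rewards, such as a report-quality metric, are passed in precomputed; this illustrates the general idea, not the team's training code.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def scst_entropy_loss(logits, sampled_ids, reward_sample, reward_greedy, entropy_weight=0.01):
    """Self-critical policy-gradient loss minus a token-entropy bonus (to be minimised).

    logits: (T, V) decoder scores along the sampled report; sampled_ids: (T,) token ids.
    """
    logp = log_softmax(logits)
    seq_logp = logp[np.arange(len(sampled_ids)), sampled_ids].sum()
    advantage = reward_sample - reward_greedy            # greedy decoding is the baseline
    policy_loss = -advantage * seq_logp
    entropy = -(np.exp(logp) * logp).sum(axis=-1).mean()  # mean per-token entropy
    # Subtracting the entropy term rewards flatter token distributions
    return policy_loss - entropy_weight * entropy
```

When the sampled report scores no better than the greedy one, the advantage is zero and only the entropy bonus remains, which still pushes the model away from collapsing onto a few common phrases.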
Submitted 6 August, 2024;
originally announced August 2024.
-
Simulating intermediate black hole mass measurements for a sample of galaxies with nuclear star clusters using ELT/HARMONI high spatial resolution integral-field stellar kinematics
Authors:
Dieu D. Nguyen,
Michele Cappellari,
Hai N. Ngo,
Tinh Q. T. Le,
Khue N. H. Ho,
An K. Nguyen,
Huy G. Tong,
Phong T. On,
Tuan N. Le,
Miguel Pereira-Santaella
Abstract:
The fraction of low-mass galaxies hosting an intermediate-mass black hole (IMBH, with masses $M_{\rm BH} \approx 10^2-10^5$ M$_\odot$) is sensitive to how black hole seeds formed in the early Universe but remains observationally unconstrained. In this paper, we assemble a sample of dwarf galaxies within 10 Mpc hosting bright nuclear star clusters (NSCs) that could host IMBHs. For a subset of them, we use their observed surface brightness from Hubble Space Telescope (HST) images, an assumed synthetic spectrum of their stellar population, a Jeans Anisotropic Model (JAM) of the stellar dynamics, and the HSIM simulator software to create mock observations with the High Angular Resolution Monolithic Optical and Near-infrared Integral field spectrograph (HARMONI) for the Extremely Large Telescope (ELT). We analyze the simulated data cubes as we would real data, using JAM to infer the IMBH mass and its error in a Bayesian framework. Our simulations show that the ELT/HARMONI instrument can clearly detect IMBHs in NSCs down to a mass of about 0.5% of the NSC mass.
Submitted 31 July, 2024;
originally announced August 2024.
-
Language-driven Grasp Detection with Mask-guided Attention
Authors:
Tuan Van Vo,
Minh Nhat Vu,
Baoru Huang,
An Vuong,
Ngan Le,
Thieu Vo,
Anh Nguyen
Abstract:
Grasp detection is an essential task in robotics with various industrial applications. However, traditional methods often struggle with occlusions and do not utilize language for grasping. Incorporating natural language into grasp detection remains challenging and largely unexplored. To address this gap, we propose a new method for language-driven grasp detection with mask-guided attention that combines the transformer attention mechanism with semantic segmentation features. Our approach integrates visual data, segmentation mask features, and natural language instructions, significantly improving grasp detection accuracy. Our work introduces a new framework for language-driven grasp detection, paving the way for language-driven robotic applications. Extensive experiments show that our method outperforms other recent baselines by a clear margin, with a 10.0% improvement in success score. We further validate our method in real-world robotic experiments, confirming the effectiveness of our approach.
Submitted 29 July, 2024;
originally announced July 2024.
-
SHANGUS: Deep Reinforcement Learning Meets Heuristic Optimization for Speedy Frontier-Based Exploration of Autonomous Vehicles in Unknown Spaces
Authors:
Seunghyeop Nam,
Tuan Anh Nguyen,
Eunmi Choi,
Dugki Min
Abstract:
This paper introduces SHANGUS, an advanced framework combining Deep Reinforcement Learning (DRL) with heuristic optimization to improve frontier-based exploration efficiency in unknown environments, particularly for intelligent vehicles in autonomous air services, search and rescue operations, and space exploration robotics. SHANGUS harnesses DRL's adaptability and heuristic prioritization, markedly enhancing exploration efficiency, reducing completion time, and minimizing travel distance. The strategy involves a frontier selection node to identify unexplored areas and a DRL navigation node using the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm for robust path planning and dynamic obstacle avoidance. Extensive experiments in ROS2 and Gazebo simulation environments show SHANGUS surpasses representative traditional methods like the Nearest Frontier (NF), Novel Frontier-Based Exploration Algorithm (CFE), and Goal-Driven Autonomous Exploration (GDAE) algorithms, especially in complex scenarios, excelling in completion time, travel distance, and exploration rate. This scalable solution is suitable for real-time autonomous navigation in fields such as industrial automation, autonomous driving, household robotics, and space exploration. Future research will integrate additional sensory inputs and refine heuristic functions to further boost SHANGUS's efficiency and robustness.
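Frontier-based exploration rests on a simple definition: a frontier cell is a known-free cell adjacent to unknown space. The sketch below detects frontiers on an occupancy grid and picks one with the Nearest Frontier (NF) rule, i.e. the baseline named above, not SHANGUS's DRL/TD3 pipeline. It assumes the ROS `nav_msgs/OccupancyGrid` value convention (-1 unknown, 0 free, up to 100 occupied), treats only exact 0 as free, and uses straight-line rather than path distance.

```python
import numpy as np

UNKNOWN, FREE = -1, 0  # ROS OccupancyGrid convention; 1..100 = occupancy probability

def find_frontiers(grid):
    """Return (row, col) of free cells touching at least one unknown cell (8-connected)."""
    frontiers = []
    rows, cols = grid.shape
    for r in range(rows):
        for c in range(cols):
            if grid[r, c] != FREE:
                continue
            window = grid[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            if np.any(window == UNKNOWN):
                frontiers.append((r, c))
    return frontiers

def nearest_frontier(grid, robot):
    """Nearest Frontier (NF) heuristic: closest frontier by Euclidean distance."""
    frontiers = find_frontiers(grid)
    if not frontiers:
        return None  # no frontier left: exploration is complete
    return min(frontiers, key=lambda f: (f[0] - robot[0]) ** 2 + (f[1] - robot[1]) ** 2)

grid = np.array([[-1, -1, -1],
                 [ 0,  0,  0],
                 [ 0,  0, 100]])
print(nearest_frontier(grid, robot=(2, 0)))  # prints (1, 0)
```

Greedy NF selection is cheap but can make the robot zig-zag between distant frontiers, which is the kind of inefficiency that learned or heuristic frontier prioritisation aims to reduce.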
Submitted 26 July, 2024;
originally announced July 2024.
-
Scalable Group Choreography via Variational Phase Manifold Learning
Authors:
Nhat Le,
Khoa Do,
Xuan Bui,
Tuong Do,
Erman Tjiputra,
Quang D. Tran,
Anh Nguyen
Abstract:
Generating group dance motion from music is a challenging task with several industrial applications. Although several methods have been proposed to tackle this problem, most of them prioritize the fidelity of dance movements and are constrained by the predetermined dancer counts in their datasets. This limitation impedes adaptability to real-world applications. Our study addresses the scalability problem in group choreography while preserving naturalness and synchronization. In particular, we propose a phase-based variational generative model for group dance generation based on learning a generative manifold. Our method achieves high-fidelity group dance motion and enables generation with an unlimited number of dancers while consuming only a minimal and constant amount of memory. Extensive experiments on two public datasets show that our proposed method outperforms recent state-of-the-art approaches by a large margin and scales to a large number of dancers beyond the training data.
Submitted 31 July, 2024; v1 submitted 26 July, 2024;
originally announced July 2024.
-
Lightweight Language-driven Grasp Detection using Conditional Consistency Model
Authors:
Nghia Nguyen,
Minh Nhat Vu,
Baoru Huang,
An Vuong,
Ngan Le,
Thieu Vo,
Anh Nguyen
Abstract:
Language-driven grasp detection is a fundamental yet challenging task in robotics with various industrial applications. In this work, we present a new approach for language-driven grasp detection that leverages the concept of lightweight diffusion models to achieve fast inference. By integrating diffusion processes with grasping prompts in natural language, our method can effectively encode visual and textual information, enabling more accurate and versatile grasp positioning that aligns well with the text query. To overcome the long inference time of diffusion models, we use the image and text features as the condition in the consistency model to reduce the number of denoising timesteps during inference. Extensive experimental results show that our method outperforms other recent grasp detection methods and lightweight diffusion models by a clear margin. We further validate our method in real-world robotic experiments to demonstrate its fast inference capability.
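The speed-up described above comes from the fact that a consistency model maps a noisy input directly to a clean estimate, so only a handful of network evaluations are needed. A minimal sketch of the generic multistep consistency sampling loop from the consistency-models literature, assuming a trained consistency function `f(x, t, cond)` where `cond` stands in for the image and text features; the schedule, `t_max` and `eps` values are illustrative, and `toy_f` is a placeholder, not a grasp model.

```python
import numpy as np

def multistep_consistency_sample(f, cond, shape, t_max=80.0, schedule=(20.0, 5.0),
                                 eps=0.002, seed=0):
    """Generic multistep consistency sampling: 1 + len(schedule) network calls.

    f(x, t, cond) is assumed to map a noisy sample at noise level t to a clean estimate.
    schedule must be decreasing, with every entry greater than eps.
    """
    rng = np.random.default_rng(seed)
    # One-step generation from pure noise at the largest noise level
    x = f(rng.standard_normal(shape) * t_max, t_max, cond)
    for t in schedule:
        # Re-noise the current estimate to level t, then jump back in one call
        x_t = x + np.sqrt(t ** 2 - eps ** 2) * rng.standard_normal(shape)
        x = f(x_t, t, cond)
    return x

calls = []
def toy_f(x, t, cond):
    calls.append(t)
    return np.tanh(x / (1.0 + t)) + cond  # placeholder, not a trained model

sample = multistep_consistency_sample(toy_f, cond=np.zeros(5), shape=(5,))
print(len(calls), sample.shape)  # prints 3 (5,)
```

Here only three evaluations of `f` are made, compared with the tens to hundreds of denoising steps a standard diffusion sampler typically uses; the schedule length trades speed against sample quality.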
Submitted 25 July, 2024;
originally announced July 2024.
-
Automated Code-centric Software Vulnerability Assessment: How Far Are We? An Empirical Study in C/C++
Authors:
Anh The Nguyen,
Triet Huynh Minh Le,
M. Ali Babar
Abstract:
Background: The C and C++ languages hold significant importance in Software Engineering research because of their widespread use in practice. Numerous studies have utilized Machine Learning (ML) and Deep Learning (DL) techniques to detect software vulnerabilities (SVs) in the source code written in these languages. However, the application of these techniques in function-level SV assessment has been largely unexplored. SV assessment is increasingly crucial as it provides detailed information on the exploitability, impacts, and severity of security defects, thereby aiding in their prioritization and remediation. Aims: We conduct the first empirical study to investigate and compare the performance of ML and DL models, many of which have been used for SV detection, for function-level SV assessment in C/C++. Method: Using 9,993 vulnerable C/C++ functions, we evaluated the performance of six multi-class ML models and five multi-class DL models for SV assessment at the function level based on the Common Vulnerability Scoring System (CVSS). We further explored multi-task learning, which can leverage common vulnerable code to predict all SV assessment outputs simultaneously in a single model, and compared the effectiveness and efficiency of this model type with those of the original multi-class models. Results: We show that ML models match or even outperform the multi-class DL models for function-level SV assessment, with significantly less training time. Employing multi-task learning allows the DL models to perform significantly better, with an average increase of 8-22% in Matthews Correlation Coefficient (MCC). Conclusions: We distill the practices of using data-driven techniques for function-level SV assessment in C/C++, including the use of multi-task DL to balance efficiency and effectiveness. This can establish a strong foundation for future work in this area.
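The MCC reported above extends to multi-class labels (for example, CVSS severity levels) through the standard K-class generalisation computed from the confusion matrix. A minimal sketch; returning 0 when the denominator vanishes (all predictions or all labels in one class) is a common convention, assumed here.

```python
import numpy as np

def multiclass_mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for K classes, from the confusion matrix."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    index = {label: i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    t_k = cm.sum(axis=1)   # true occurrences per class
    p_k = cm.sum(axis=0)   # predicted occurrences per class
    c = np.trace(cm)       # correctly classified samples
    s = cm.sum()           # total samples
    numerator = c * s - np.dot(t_k, p_k)
    denominator = np.sqrt(float(s ** 2 - np.dot(p_k, p_k)) * float(s ** 2 - np.dot(t_k, t_k)))
    return 0.0 if denominator == 0 else numerator / denominator
```

Unlike accuracy, MCC stays near 0 for a classifier that always predicts the majority severity class, which is why it suits the skewed label distributions typical of CVSS metrics.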
Submitted 3 August, 2024; v1 submitted 24 July, 2024;
originally announced July 2024.
-
Fusion and Cross-Modal Transfer for Zero-Shot Human Action Recognition
Authors:
Abhi Kamboj,
Anh Duy Nguyen,
Minh Do
Abstract:
Despite living in a multi-sensory world, most AI models are limited to textual and visual interpretations of human motion and behavior. Inertial measurement units (IMUs) provide a salient signal to understand human motion; however, they are challenging to use due to their uninterpretability and the scarcity of their data. We investigate a method to transfer knowledge between visual and inertial modalities using the structure of an informative joint representation space designed for human action recognition (HAR). We apply the resulting Fusion and Cross-modal Transfer (FACT) method to a novel setup, where the model does not have access to labeled IMU data during training and is able to perform HAR with only IMU data during testing. Extensive experiments on a wide range of RGB-IMU datasets demonstrate that FACT significantly outperforms existing methods in zero-shot cross-modal transfer.
Submitted 23 July, 2024;
originally announced July 2024.
-
Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance
Authors:
Toan Nguyen,
Minh Nhat Vu,
Baoru Huang,
An Vuong,
Quan Vuong,
Ngan Le,
Thieu Vo,
Anh Nguyen
Abstract:
6-DoF grasp detection has been a fundamental and challenging problem in robotic vision. While previous works have focused on ensuring grasp stability, they often do not consider human intention conveyed through natural language, hindering effective collaboration between robots and users in complex 3D environments. In this paper, we present a new approach for language-driven 6-DoF grasp detection in cluttered point clouds. We first introduce Grasp-Anything-6D, a large-scale dataset for the language-driven 6-DoF grasp detection task with 1M point cloud scenes and more than 200M language-associated 3D grasp poses. We further introduce a novel diffusion model that incorporates a new negative prompt guidance learning strategy. The proposed negative prompt strategy directs the detection process toward the desired object while steering away from unwanted ones given the language input. Our method enables an end-to-end framework where humans can command the robot to grasp desired objects in a cluttered scene using natural language. Extensive experimental results show the effectiveness of our method in both benchmarking experiments and real-world scenarios, surpassing other baselines. In addition, we demonstrate the practicality of our approach in real-world robotic applications. Our project is available at https://airvlab.github.io/grasp-anything.
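The abstract does not give the exact guidance rule. For orientation, the widely used classifier-free-guidance form of negative prompting replaces the unconditional noise prediction with one conditioned on the negative prompt, so each denoising step moves toward the desired object and away from the unwanted one. The sketch assumes that form and a hypothetical guidance weight `w`; the paper's learned strategy may differ.

```python
import numpy as np

def negative_prompt_guidance(eps_pos, eps_neg, w=3.0):
    """Guided noise estimate: start at the negative-prompt prediction and move
    w times along the direction (positive - negative)."""
    return eps_neg + w * (eps_pos - eps_neg)

eps_pos = np.array([1.0, 0.0])  # prediction conditioned on the desired object
eps_neg = np.array([0.0, 1.0])  # prediction conditioned on the unwanted object
guided = negative_prompt_guidance(eps_pos, eps_neg, w=2.0)
```

With `w = 1` the rule reduces to the positive-prompt prediction; larger `w` extrapolates further away from the negative prompt at the cost of sample diversity.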
Submitted 25 July, 2024; v1 submitted 18 July, 2024;
originally announced July 2024.
-
LiteGPT: Large Vision-Language Model for Joint Chest X-ray Localization and Classification Task
Authors:
Khai Le-Duc,
Ryan Zhang,
Ngoc Son Nguyen,
Tan-Hanh Pham,
Anh Dao,
Ba Hung Ngo,
Anh Totti Nguyen,
Truong-Son Hy
Abstract:
Vision-language models have been extensively explored across a wide range of tasks, achieving satisfactory performance; however, their application in medical imaging remains underexplored. In this work, we propose a unified framework, LiteGPT, for medical imaging. We leverage multiple pre-trained visual encoders to enrich information and enhance the performance of vision-language models. To the best of our knowledge, this is the first study to utilize vision-language models for the novel task of joint localization and classification in medical images. Moreover, we are the first to provide baselines for disease localization in chest X-rays. Finally, we set new state-of-the-art performance in the image classification task on the well-benchmarked VinDr-CXR dataset. All code and models are publicly available online: https://github.com/leduckhai/LiteGPT
Submitted 15 July, 2024;
originally announced July 2024.
-
Detecting Omissions in Geographic Maps through Computer Vision
Authors:
Phuc D. A. Nguyen,
Anh Do,
Minh Hoai
Abstract:
This paper explores the application of computer vision technologies to the analysis of maps, an area with substantial historical, cultural, and political significance. Our focus is on developing and evaluating a method for automatically identifying maps that depict specific regions and feature landmarks with designated names, a task that involves complex challenges due to the diverse styles and methods used in map creation. We address three main subtasks: differentiating maps from non-maps, verifying the accuracy of the region depicted, and confirming the presence or absence of particular landmark names. Our approach uses a Convolutional Neural Network with transfer learning for these subtasks, together with advanced text recognition for the landmark names. We also introduce the VinMap dataset, containing annotated map images of Vietnam, to train and test our method. Experiments on this dataset demonstrate that our technique achieves an F1-score of 85.51% for identifying maps that omit specific territorial landmarks. This result suggests practical utility and indicates areas for future improvement.
Submitted 15 July, 2024;
originally announced July 2024.
-
GPC: Generative and General Pathology Image Classifier
Authors:
Anh Tien Nguyen,
Jin Tae Kwak
Abstract:
Deep learning has been increasingly incorporated into various computational pathology applications to improve their efficiency, accuracy, and robustness. Although successful, most previous approaches for image classification have crucial drawbacks. There exist numerous tasks in pathology, but one needs to build a model per task, i.e., a task-specific model, thereby increasing the number of models, training resources, and cost. Moreover, transferring an arbitrary task-specific model to another task is still a challenging problem. Herein, we propose a task-agnostic generative and general pathology image classifier, called GPC, that aims to learn from diverse kinds of pathology images and perform numerous classification tasks in a unified model. GPC, equipped with a convolutional neural network and a Transformer-based language model, maps pathology images into a high-dimensional feature space and generates pertinent class labels as texts via the image-to-text classification mechanism. We evaluate GPC on six datasets for four different pathology image classification tasks. Experimental results show that GPC holds considerable potential for developing an effective and efficient universal model for pathology image analysis.
Submitted 12 July, 2024;
originally announced July 2024.
-
CAMP: Continuous and Adaptive Learning Model in Pathology
Authors:
Anh Tien Nguyen,
Keunho Byeon,
Kyungeun Kim,
Boram Song,
Seoung Wan Chae,
Jin Tae Kwak
Abstract:
There exist numerous diagnostic tasks in pathology. Conventional computational pathology formulates and tackles them as independent and individual image classification problems, resulting in computational inefficiency and high costs. To address these challenges, we propose a generic, unified, and universal framework, called a continuous and adaptive learning model in pathology (CAMP), for pathology image classification. CAMP is a generative, efficient, and adaptive classification model that can continuously adapt to any classification task by leveraging pathology-specific prior knowledge and learning task-specific knowledge with minimal computational cost and without forgetting the knowledge from existing tasks. We evaluated CAMP on 22 datasets, including 1,171,526 patches and 11,811 pathology slides, across 17 classification tasks. CAMP achieves state-of-the-art classification performance on a wide range of datasets and tasks at both the patch and slide levels, and reduces computation time by up to 94% and storage memory by up to 85% compared with conventional classification models. Our results demonstrate that CAMP can offer a fundamental transformation in pathology image classification, paving the way for fully digitized and computerized pathology practice.
Submitted 12 July, 2024;
originally announced July 2024.
-
Adaptive Parametric Activation
Authors:
Konstantinos Panagiotis Alexandridis,
Jiankang Deng,
Anh Nguyen,
Shan Luo
Abstract:
The activation function plays a crucial role in model optimisation, yet the optimal choice remains unclear. For example, the Sigmoid activation is the de facto activation in balanced classification tasks; however, in imbalanced classification it proves inappropriate due to its bias towards frequent classes. In this work, we delve deeper into this phenomenon by performing a comprehensive statistical analysis of the classification and intermediate layers of both balanced and imbalanced networks, and we empirically show that aligning the activation function with the data distribution enhances performance in both balanced and imbalanced tasks. To this end, we propose the Adaptive Parametric Activation (APA) function, a novel and versatile activation function that unifies most common activation functions under a single formula. APA can be applied in both intermediate layers and attention layers, significantly outperforming the state-of-the-art on several imbalanced benchmarks such as ImageNet-LT, iNaturalist2018, Places-LT, CIFAR100-LT and LVIS, and balanced benchmarks such as ImageNet1K, COCO and V3Det. The code is available at https://github.com/kostas1515/AGLU.
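To make "unifies common activation functions under a single formula" concrete, here is one parametric family with that property: a generalised-logistic curve with slope `kappa` and shape `lam` that equals the Sigmoid at `kappa = lam = 1` and approaches the Gumbel CDF exp(-exp(-z)) as `lam` tends to 0; multiplying by z gives gated variants such as SiLU. This is an illustrative assumption, not necessarily the exact APA parameterisation; the linked repository is the authoritative source.

```python
import numpy as np

def parametric_activation(z, kappa=1.0, lam=1.0):
    """(lam * exp(-kappa * z) + 1) ** (-1 / lam), computed in log-space."""
    z = np.asarray(z, dtype=float)
    # log1p keeps precision when lam * exp(-kappa * z) is small (e.g. lam -> 0)
    return np.exp(-np.log1p(lam * np.exp(-kappa * z)) / lam)

def gated_unit(z, kappa=1.0, lam=1.0):
    """z * activation: equals SiLU, z * sigmoid(z), at kappa = lam = 1."""
    return np.asarray(z, dtype=float) * parametric_activation(z, kappa, lam)

z = np.linspace(-5.0, 5.0, 11)
sigmoid_like = parametric_activation(z)           # lam = 1: logistic Sigmoid
gumbel_like = parametric_activation(z, lam=1e-6)  # lam -> 0: Gumbel CDF
```

The Gumbel limit is asymmetric around zero, unlike the Sigmoid, which is one way a single shape parameter can adapt the output curve to skewed, long-tailed class statistics.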
Submitted 9 October, 2024; v1 submitted 11 July, 2024;
originally announced July 2024.