subscribe to arXiv mailings

Reflection-Bench: probing AI intelligence with reflection

Authors: Lingyu Li, Yixu Wang, Haiquan Zhao, Shuqi Kong, Yan Teng, Chunbo Li, Yingchun Wang

Abstract: The ability to adapt beliefs or behaviors in response to unexpected outcomes, reflection, is fundamental to intelligent systems' interaction with the world. From a cognitive science perspective, this serves as a core principle of intelligence applicable to both human and AI systems. To address the debate on the intelligence of large language models (LLMs), we propose Reflection-Bench, a comprehens… ▽ More The ability to adapt beliefs or behaviors in response to unexpected outcomes, reflection, is fundamental to intelligent systems' interaction with the world. From a cognitive science perspective, this serves as a core principle of intelligence applicable to both human and AI systems. To address the debate on the intelligence of large language models (LLMs), we propose Reflection-Bench, a comprehensive benchmark comprising 7 tasks spanning core cognitive functions crucial for reflection, including perception, memory, belief updating, decision-making, prediction, counterfactual thinking, and meta-reflection. We evaluate the performances of 13 prominent LLMs such as OpenAI o1, GPT-4, Claude 3.5 Sonnet, etc. The results indicate that current LLMs still lack satisfactory reflection ability. We discuss the underlying causes of these results and suggest potential avenues for future research. In conclusion, Reflection-Bench offers both evaluation tools and inspiration for developing AI capable of reliably interacting with the environment. Our data and code are available at https://github.com/YabYum/ReflectionBench. △ Less

Submitted 21 October, 2024; originally announced October 2024.

Comments: 11 pages, 7 figures, 2 tables

arXiv:2410.16119 [pdf, other]

SeaDAG: Semi-autoregressive Diffusion for Conditional Directed Acyclic Graph Generation

Authors: Xinyi Zhou, Xing Li, Yingzhao Lian, Yiwen Wang, Lei Chen, Mingxuan Yuan, Jianye Hao, Guangyong Chen, Pheng Ann Heng

Abstract: We introduce SeaDAG, a semi-autoregressive diffusion model for conditional generation of Directed Acyclic Graphs (DAGs). Considering their inherent layer-wise structure, we simulate layer-wise autoregressive generation by designing different denoising speed for different layers. Unlike conventional autoregressive generation that lacks a global graph structure view, our method maintains a complete… ▽ More We introduce SeaDAG, a semi-autoregressive diffusion model for conditional generation of Directed Acyclic Graphs (DAGs). Considering their inherent layer-wise structure, we simulate layer-wise autoregressive generation by designing different denoising speed for different layers. Unlike conventional autoregressive generation that lacks a global graph structure view, our method maintains a complete graph structure at each diffusion step, enabling operations such as property control that require the full graph structure. Leveraging this capability, we evaluate the DAG properties during training by employing a graph property decoder. We explicitly train the model to learn graph conditioning with a condition loss, which enhances the diffusion model's capacity to generate graphs that are both realistic and aligned with specified properties. We evaluate our method on two representative conditional DAG generation tasks: (1) circuit generation from truth tables, where precise DAG structures are crucial for realizing circuit functionality, and (2) molecule generation based on quantum properties. Our approach demonstrates promising results, generating high-quality and realistic DAGs that closely align with given conditions. △ Less

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.16086 [pdf, other]

Enhanced $S$-factor for the $^{14}$N$(p,γ)^{15}$O reaction and its impact on the solar composition problem

Authors: X. Chen, J. Su, Y. P. Shen, L. Y. Zhang, J. J. He, S. Z. Chen, S. Wang, Z. L. Shen, S. Lin, L. Y. Song, H. Zhang, L. H. Wang, X. Z. Jiang, L. Wang, Y. T. Huang, Z. W. Qin, F. C. Liu, Y. D. Sheng, Y. J. Chen, Y. L. Lu, X. Y. Li, J. Y. Dong, Y. C. Jiang, Y. Q. Zhang, Y. Zhang , et al. (23 additional authors not shown)

Abstract: The solar composition problem has puzzled astrophysicists for more than 20 years. Recent measurements of carbon-nitrogen-oxygen (CNO) neutrinos by the Borexino experiment show a $\sim2σ$ tension with the "low-metallicity" determinations. $^{14}$N$(p,γ)^{15}$O, the slowest reaction in the CNO cycle, plays a crucial role in the standard solar model (SSM) calculations of CNO neutrino fluxes. Here we… ▽ More The solar composition problem has puzzled astrophysicists for more than 20 years. Recent measurements of carbon-nitrogen-oxygen (CNO) neutrinos by the Borexino experiment show a $\sim2σ$ tension with the "low-metallicity" determinations. $^{14}$N$(p,γ)^{15}$O, the slowest reaction in the CNO cycle, plays a crucial role in the standard solar model (SSM) calculations of CNO neutrino fluxes. Here we report a direct measurement of the $^{14}$N$(p,γ)^{15}$O reaction, in which $S$-factors for all transitions were simultaneously determined in the energy range of $E_p=110-260$ keV for the first time. Our results resolve previous discrepancies in the ground-state transition, yielding a zero-energy $S$-factor $S_{114}(0) = 1.92\pm0.08$ keV b which is 14% higher than the $1.68\pm0.14$ keV b recommended in Solar Fusion III (SF-III). With our $S_{114}$ values, the SSM B23-GS98, and the latest global analysis of solar neutrino measurements, the C and N photospheric abundance determined by the Borexino experiment is updated to $N_{\mathrm{CN}}=({4.45}^{+0.69}_{-0.61})\times10^{-4}$. This new $N_{\mathrm{CN}}$ value agrees well with latest "high-metallicity" composition, however, is also consistent with the "low-metallicity" determination within $\sim 1σ$ C.L., indicating that the solar metallicity problem remains an open question. In addition, the significant reduction in the uncertainty of $S_{114}$ paves the way for the precise determination of the CN abundance in future large-volume solar neutrino measurements. △ Less

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.16059 [pdf, other]

Multi-Level Speaker Representation for Target Speaker Extraction

Authors: Ke Zhang, Junjie Li, Shuai Wang, Yangjie Wei, Yi Wang, Yannan Wang, Haizhou Li

Abstract: Target speaker extraction (TSE) relies on a reference cue of the target to extract the target speech from a speech mixture. While a speaker embedding is commonly used as the reference cue, such embedding pre-trained with a large number of speakers may suffer from confusion of speaker identity. In this work, we propose a multi-level speaker representation approach, from raw features to neural embed… ▽ More Target speaker extraction (TSE) relies on a reference cue of the target to extract the target speech from a speech mixture. While a speaker embedding is commonly used as the reference cue, such embedding pre-trained with a large number of speakers may suffer from confusion of speaker identity. In this work, we propose a multi-level speaker representation approach, from raw features to neural embeddings, to serve as the speaker reference cue. We generate a spectral-level representation from the enrollment magnitude spectrogram as a raw, low-level feature, which significantly improves the model's generalization capability. Additionally, we propose a contextual embedding feature based on cross-attention mechanisms that integrate frame-level embeddings from a pre-trained speaker encoder. By incorporating speaker features across multiple levels, we significantly enhance the performance of the TSE model. Our approach achieves a 2.74 dB improvement and a 4.94% increase in extraction accuracy on Libri2mix test set over the baseline. △ Less

Submitted 21 October, 2024; originally announced October 2024.

Comments: 5 pages. Submitted to ICASSP 2025. Implementation will be released at https://github.com/wenet-e2e/wesep

arXiv:2410.16057 [pdf, other]

Label Filling via Mixed Supervision for Medical Image Segmentation from Noisy Annotations

Authors: Ming Li, Wei Shen, Qingli Li, Yan Wang

Abstract: The success of medical image segmentation usually requires a large number of high-quality labels. But since the labeling process is usually affected by the raters' varying skill levels and characteristics, the estimated masks provided by different raters usually suffer from high inter-rater variability. In this paper, we propose a simple yet effective Label Filling framework, termed as LF-Net, pre… ▽ More The success of medical image segmentation usually requires a large number of high-quality labels. But since the labeling process is usually affected by the raters' varying skill levels and characteristics, the estimated masks provided by different raters usually suffer from high inter-rater variability. In this paper, we propose a simple yet effective Label Filling framework, termed as LF-Net, predicting the groundtruth segmentation label given only noisy annotations during training. The fundamental idea of label filling is to supervise the segmentation model by a subset of pixels with trustworthy labels, meanwhile filling labels of other pixels by mixed supervision. More concretely, we propose a qualified majority voting strategy, i.e., a threshold voting scheme is designed to model agreement among raters and the majority-voted labels of the selected subset of pixels are regarded as supervision. To fill labels of other pixels, two types of mixed auxiliary supervision are proposed: a soft label learned from intrinsic structures of noisy annotations, and raters' characteristics labels which propagate individual rater's characteristics information. LF-Net has two main advantages. 1) Training with trustworthy pixels incorporates training with confident supervision, guiding the direction of groundtruth label learning. 2) Two types of mixed supervision prevent over-fitting issues when the network is supervised by a subset of pixels, and guarantee high fidelity with the true label. Results on five datasets of diverse imaging modalities show that our LF-Net boosts segmentation accuracy in all datasets compared with state-of-the-art methods, with even a 7% improvement in DSC for MS lesion segmentation. △ Less

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.16016 [pdf, other]

Proactive security defense: cyber threat intelligence modeling for connected autonomous vehicles

Authors: Yinghui Wang, Yilong Ren, Zhiyong Cui, Haiyang Yu

Abstract: Cybersecurity has become a crucial concern in the field of connected autonomous vehicles. Cyber threat intelligence (CTI), as the collection of cyber threat information, offers an ideal way for responding to emerging cyber threats and realizing proactive security defense. However, instant analysis and modeling of vehicle cybersecurity data is a fundamental challenge since its complex and professio… ▽ More Cybersecurity has become a crucial concern in the field of connected autonomous vehicles. Cyber threat intelligence (CTI), as the collection of cyber threat information, offers an ideal way for responding to emerging cyber threats and realizing proactive security defense. However, instant analysis and modeling of vehicle cybersecurity data is a fundamental challenge since its complex and professional context. In this paper, we suggest an automotive CTI modeling framework, Actim, to extract and analyse the interrelated relationships among cyber threat elements. Specifically, we first design a vehicle security-safety conceptual ontology model to depict various threat entity classes and their relations. Then, we manually annotate the first automobile CTI corpus by using real cybersecurity data, which comprises 908 threat intelligence texts, including 8195 entities and 4852 relationships. To effectively extract cyber threat entities and their relations, we propose an automotive CTI mining model based on cross-sentence context. Experiment results show that the proposed BERT-DocHiatt-BiLSTM-LSTM model exceeds the performance of existing methods. Finally, we define entity-relation matching rules and create a CTI knowledge graph that structurally fuses various elements of cyber threats. The Actim framework enables mining the intrinsic connections among threat entities, providing valuable insight on the evolving cyber threat landscape. △ Less

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.15907 [pdf, other]

Seismic Phase Picking

Authors: Yuchen Wang, Ruihuan Wang

Abstract: Seismic phase picking, which aims to determine the arrival time of P- and S-waves according to seismic waveforms, is fundamental to earthquake monitoring. Generally, manual phase picking is trustworthy, but with the increasing number of worldwide stations and seismic monitors, it becomes more challenging for human to complete the task comprehensively. In this work, we explore multiple ways to do a… ▽ More Seismic phase picking, which aims to determine the arrival time of P- and S-waves according to seismic waveforms, is fundamental to earthquake monitoring. Generally, manual phase picking is trustworthy, but with the increasing number of worldwide stations and seismic monitors, it becomes more challenging for human to complete the task comprehensively. In this work, we explore multiple ways to do automatic phase picking, including traditional and learning-based methods. △ Less

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.15827 [pdf, other]

Explainability of Highly Associated Fuzzy Churn Patterns in Binary Classification

Authors: D. Y. C. Wang, Lars Arne Jordanger, Jerry Chun-Wei Lin

Abstract: Customer churn, particularly in the telecommunications sector, influences both costs and profits. As the explainability of models becomes increasingly important, this study emphasizes not only the explainability of customer churn through machine learning models, but also the importance of identifying multivariate patterns and setting soft bounds for intuitive interpretation. The main objective is… ▽ More Customer churn, particularly in the telecommunications sector, influences both costs and profits. As the explainability of models becomes increasingly important, this study emphasizes not only the explainability of customer churn through machine learning models, but also the importance of identifying multivariate patterns and setting soft bounds for intuitive interpretation. The main objective is to use a machine learning model and fuzzy-set theory with top-\textit{k} HUIM to identify highly associated patterns of customer churn with intuitive identification, referred to as Highly Associated Fuzzy Churn Patterns (HAFCP). Moreover, this method aids in uncovering association rules among multiple features across low, medium, and high distributions. Such discoveries are instrumental in enhancing the explainability of findings. Experiments show that when the top-5 HAFCPs are included in five datasets, a mixture of performance results is observed, with some showing notable improvements. It becomes clear that high importance features enhance explanatory power through their distribution and patterns associated with other features. As a result, the study introduces an innovative approach that improves the explainability and effectiveness of customer churn prediction models. △ Less

Submitted 21 October, 2024; originally announced October 2024.

Comments: 18 pages single columns, 4 figures, This paper is an extended version of a work originally presented at the 6th International Workshop on Utility-Driven Mining and Learning (held in conjunction with the 28th Pacific-Asia Conference on Knowledge Discovery and Data Mining - PAKDD 2024) on May 7, 2024

arXiv:2410.15790 [pdf, other]

Event-based contextuality theory

Authors: Songyi Liu, Yongjun Wang, Baoshan Wang

Abstract: Fully revealing the mathmatical structure of quantum contextuality is a significant task, while some known contextuality theories are only applicable for rank-1 projectors. That is because they adopt the observable-based definitions. This paper analyses the challenges faced by some known contextuality theories, and establishes an event-based contextuality theory with partial Boolean algebra to ove… ▽ More Fully revealing the mathmatical structure of quantum contextuality is a significant task, while some known contextuality theories are only applicable for rank-1 projectors. That is because they adopt the observable-based definitions. This paper analyses the challenges faced by some known contextuality theories, and establishes an event-based contextuality theory with partial Boolean algebra to overcome them. The theory can handle the scenarios composed of general projectors and observables, and provides a unified mathematical structure to investigate the hierarchy of quantum contextuality. It also introduces a tool to extend some known results from rank-1 cases to general cases. For example, we get a Kochen-Specker set with 12 projectors from the Cabello-Estebaranz-Garcia set with 18 vectors. △ Less

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.15701 [pdf, other]

Students Rather Than Experts: A New AI For Education Pipeline To Model More Human-Like And Personalised Early Adolescences

Authors: Yiping Ma, Shiyu Hu, Xuchen Li, Yipei Wang, Shiqing Liu, Kang Hao Cheong

Abstract: The capabilities of large language models (LLMs) have been applied in expert systems across various domains, providing new opportunities for AI in Education. Educational interactions involve a cyclical exchange between teachers and students. Current research predominantly focuses on using LLMs to simulate teachers, leveraging their expertise to enhance student learning outcomes. However, the simul… ▽ More The capabilities of large language models (LLMs) have been applied in expert systems across various domains, providing new opportunities for AI in Education. Educational interactions involve a cyclical exchange between teachers and students. Current research predominantly focuses on using LLMs to simulate teachers, leveraging their expertise to enhance student learning outcomes. However, the simulation of students, which could improve teachers' instructional skills, has received insufficient attention due to the challenges of modeling and evaluating virtual students. This research asks: Can LLMs be utilized to develop virtual student agents that mimic human-like behavior and individual variability? Unlike expert systems focusing on knowledge delivery, virtual students must replicate learning difficulties, emotional responses, and linguistic uncertainties. These traits present significant challenges in both modeling and evaluation. To address these issues, this study focuses on language learning as a context for modeling virtual student agents. We propose a novel AI4Education framework, called SOE (Scene-Object-Evaluation), to systematically construct LVSA (LLM-based Virtual Student Agents). By curating a dataset of personalized teacher-student interactions with various personality traits, question types, and learning stages, and fine-tuning LLMs using LoRA, we conduct multi-dimensional evaluation experiments. Specifically, we: (1) develop a theoretical framework for generating LVSA; (2) integrate human subjective evaluation metrics into GPT-4 assessments, demonstrating a strong correlation between human evaluators and GPT-4 in judging LVSA authenticity; and (3) validate that LLMs can generate human-like, personalized virtual student agents in educational contexts, laying a foundation for future applications in pre-service teacher training and multi-agent simulation environments. △ Less

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.15686 [pdf, other]

NetSafe: Exploring the Topological Safety of Multi-agent Networks

Authors: Miao Yu, Shilong Wang, Guibin Zhang, Junyuan Mao, Chenlong Yin, Qijiong Liu, Qingsong Wen, Kun Wang, Yang Wang

Abstract: Large language models (LLMs) have empowered nodes within multi-agent networks with intelligence, showing growing applications in both academia and industry. However, how to prevent these networks from generating malicious information remains unexplored with previous research on single LLM's safety be challenging to transfer. In this paper, we focus on the safety of multi-agent networks from a topo… ▽ More Large language models (LLMs) have empowered nodes within multi-agent networks with intelligence, showing growing applications in both academia and industry. However, how to prevent these networks from generating malicious information remains unexplored with previous research on single LLM's safety be challenging to transfer. In this paper, we focus on the safety of multi-agent networks from a topological perspective, investigating which topological properties contribute to safer networks. To this end, we propose a general framework, NetSafe along with an iterative RelCom interaction to unify existing diverse LLM-based agent frameworks, laying the foundation for generalized topological safety research. We identify several critical phenomena when multi-agent networks are exposed to attacks involving misinformation, bias, and harmful information, termed as Agent Hallucination and Aggregation Safety. Furthermore, we find that highly connected networks are more susceptible to the spread of adversarial attacks, with task performance in a Star Graph Topology decreasing by 29.7%. Besides, our proposed static metrics aligned more closely with real-world dynamic evaluations than traditional graph-theoretic metrics, indicating that networks with greater average distances from attackers exhibit enhanced safety. In conclusion, our work introduces a new topological perspective on the safety of LLM-based multi-agent networks and discovers several unreported phenomena, paving the way for future research to explore the safety of such networks. △ Less

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.15621 [pdf, other]

DRIM-ANN: An Approximate Nearest Neighbor Search Engine based on Commercial DRAM-PIMs

Authors: Mingkai Chen, Tianhua Han, Cheng Liu, Shengwen Liang, Kuai Yu, Lei Dai, Ziming Yuan, Ying Wang, Lei Zhang, Huawei Li, Xiaowei Li

Abstract: Approximate Nearest Neighbor Search (ANNS), which enables efficient semantic similarity search in large datasets, has become a fundamental component of critical applications such as information retrieval and retrieval-augmented generation (RAG). However, ANNS is a well-known I/O-intensive algorithm with a low compute-to-I/O ratio, often requiring massive storage due to the large volume of high-dim… ▽ More Approximate Nearest Neighbor Search (ANNS), which enables efficient semantic similarity search in large datasets, has become a fundamental component of critical applications such as information retrieval and retrieval-augmented generation (RAG). However, ANNS is a well-known I/O-intensive algorithm with a low compute-to-I/O ratio, often requiring massive storage due to the large volume of high-dimensional data. This leads to I/O bottlenecks on CPUs and memory limitations on GPUs. DRAM-based Processing-in-Memory (DRAM-PIM) architecture, which offers high bandwidth, large-capacity memory, and the ability to perform efficient computation in or near the data, presents a promising solution for ANNS. In this work, we investigate the use of commercial DRAM-PIM for ANNS for the first time and propose DRIM-ANN, an optimized ANNS engine based on DRAM-PIMs from UPMEM. Notably, given that the target DRAM-PIM exhibits an even lower compute-to-I/O ratio than basic ANNS, we leverage lookup tables (LUTs) to replace more multiplications with I/O operations. We then systematically tune ANNS to search optimized configurations with lower computational load, aligning the compute-to-I/O ratio of ANNS with that of DRAM-PIMs while maintaining accuracy constraints. Building on this tuned ANNS algorithm, we further explore implementation optimizations to fully utilize the two thousand parallel processing units with private local memory in DRAM-PIMs. To address the load imbalance caused by ANNS requests distributed across different clusters of large datasets, we propose a load-balancing strategy that combines static data layout optimization with dynamic runtime request scheduling. Experimental results on representative datasets show that DRIM-ANN achieves an average performance speedup of 2.92x compared to a 32-thread CPU counterpart. △ Less

Submitted 20 October, 2024; originally announced October 2024.

arXiv:2410.15575 [pdf, other]

Neural Search Space in Gboard Decoder

Authors: Yanxiang Zhang, Yuanbo Zhang, Haicheng Sun, Yun Wang, Billy Dou, Gary Sivek, Shumin Zhai

Abstract: Gboard Decoder produces suggestions by looking for paths that best match input touch points on the context aware search space, which is backed by the language Finite State Transducers (FST). The language FST is currently an N-gram language model (LM). However, N-gram LMs, limited in context length, are known to have sparsity problem under device model size constraint. In this paper, we propose \te… ▽ More Gboard Decoder produces suggestions by looking for paths that best match input touch points on the context aware search space, which is backed by the language Finite State Transducers (FST). The language FST is currently an N-gram language model (LM). However, N-gram LMs, limited in context length, are known to have sparsity problem under device model size constraint. In this paper, we propose \textbf{Neural Search Space} which substitutes the N-gram LM with a Neural Network LM (NN-LM) and dynamically constructs the search space during decoding. Specifically, we integrate the long range context awareness of NN-LM into the search space by converting its outputs given context, into the language FST at runtime. This involves language FST structure redesign, pruning strategy tuning, and data structure optimizations. Online experiments demonstrate improved quality results, reducing Words Modified Ratio by [0.26\%, 1.19\%] on various locales with acceptable latency increases. This work opens new avenues for further improving keyboard decoding quality by enhancing neural LM more directly. △ Less

Submitted 20 October, 2024; originally announced October 2024.

Comments: 10 pages, 7 figures, 3 tables

arXiv:2410.15567 [pdf, other]

Pruning Foundation Models for High Accuracy without Retraining

Authors: Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin

Abstract: Despite the superior performance, it is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations. While pruning is a promising technique to reduce model size and accelerate the inference, the traditional pruning techniques can hardly be applied for LLMs as they need to finetune the model on the full dataset with multiple epochs consum… ▽ More Despite the superior performance, it is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations. While pruning is a promising technique to reduce model size and accelerate the inference, the traditional pruning techniques can hardly be applied for LLMs as they need to finetune the model on the full dataset with multiple epochs consuming massive data and hardware resources. To deal with this problem, post-training pruning methods are proposed to prune LLMs in one-shot without retraining. However, their accuracy after pruning may suffer from certain performance degradation due to the lack of retraining with massive data. To address this issue, in this paper, we first formulate the post-training problem for layer-wise LLM compression to simultaneously prune multiple weights in LLMs. Next, we provide an optimal solution for this problem and design our post-training pruning algorithm for both unstructured and semi-structured sparsity. Our extensive experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines across various LLM families including transformer-based LLMs and Mamba-based LLMs. Code link: https://github.com/piuzha/APT △ Less

Submitted 20 October, 2024; originally announced October 2024.

Comments: Accepted by EMNLP 2024 findings

arXiv:2410.15375 [pdf, ps, other]

Preparing Spin Squeezed States via Adaptive Genetic Algorithm

Authors: Yiming Zhao, Libo Chen, Yong Wang, Hongyang Ma, Xiaolong Zhao

Abstract: We introduce a novel strategy employing an adaptive genetic algorithm (GA) for iterative optimization of control sequences to generate quantum nonclassical states. Its efficacy is demonstrated by preparing spin-squeezed states in an open collective spin model governed by a linear control field. Inspired by Darwinian evolution, the algorithm iteratively refines control sequences using crossover, mu… ▽ More We introduce a novel strategy employing an adaptive genetic algorithm (GA) for iterative optimization of control sequences to generate quantum nonclassical states. Its efficacy is demonstrated by preparing spin-squeezed states in an open collective spin model governed by a linear control field. Inspired by Darwinian evolution, the algorithm iteratively refines control sequences using crossover, mutation, and elimination strategies, starting from a coherent spin state within a dissipative and dephasing environment. An adaptive parameter adjustment mechanism further enhances optimization. Our approach, compared to constant control schemes, yields a variety of control sequences capable of maintaining squeezing for the collective spin model. Furthermore, the proposed strategy exhibits increased effectiveness in diverse systems, while reservoir thermal excitations are shown to negatively impact control outcomes. We discuss feasible experimental implementations and potential extensions to alternative quantum systems, and the adaptability of the GA module. This research establishes the foundation for utilizing GA-like strategies in controlling quantum systems and achieving desired nonclassical states. △ Less

Submitted 20 October, 2024; originally announced October 2024.

arXiv:2410.15371 [pdf, other]

FrameBridge: Improving Image-to-Video Generation with Bridge Models

Authors: Yuji Wang, Zehua Chen, Xiaoyu Chen, Jun Zhu, Jianfei Chen

Abstract: Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis. Recently, diffusion-based I2V models have achieved remarkable progress given their novel design on network architecture, cascaded framework, and motion representation. However, restricted by their noise-to-data generation process, diffusion-based methods inevitably suffer the difficulty to… ▽ More Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis. Recently, diffusion-based I2V models have achieved remarkable progress given their novel design on network architecture, cascaded framework, and motion representation. However, restricted by their noise-to-data generation process, diffusion-based methods inevitably suffer the difficulty to generate video samples with both appearance consistency and temporal coherence from an uninformative Gaussian noise, which may limit their synthesis quality. In this work, we present FrameBridge, taking the given static image as the prior of video target and establishing a tractable bridge model between them. By formulating I2V synthesis as a frames-to-frames generation task and modelling it with a data-to-data process, we fully exploit the information in input image and facilitate the generative model to learn the image animation process. In two popular settings of training I2V models, namely fine-tuning a pre-trained text-to-video (T2V) model or training from scratch, we further propose two techniques, SNR-Aligned Fine-tuning (SAF) and neural prior, which improve the fine-tuning efficiency of diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models respectively. Experiments conducted on WebVid-2M and UCF-101 demonstrate that: (1) our FrameBridge achieves superior I2V quality in comparison with the diffusion counterpart (zero-shot FVD 83 vs. 176 on MSR-VTT and non-zero-shot FVD 122 vs. 171 on UCF-101); (2) our proposed SAF and neural prior effectively enhance the ability of bridge-based I2V models in the scenarios of fine-tuning and training from scratch. Demo samples can be visited at: https://framebridge-demo.github.io/. △ Less

Submitted 20 October, 2024; originally announced October 2024.

arXiv:2410.15319 [pdf, other]

Causality for Large Language Models

Authors: Anpeng Wu, Kun Kuang, Minqin Zhu, Yingrong Wang, Yujia Zheng, Kairong Han, Baohong Li, Guangyi Chen, Fei Wu, Kun Zhang

Abstract: Recent breakthroughs in artificial intelligence have driven a paradigm shift, where large language models (LLMs) with billions or trillions of parameters are trained on vast datasets, achieving unprecedented success across a series of language tasks. However, despite these successes, LLMs still rely on probabilistic modeling, which often captures spurious correlations rooted in linguistic patterns… ▽ More Recent breakthroughs in artificial intelligence have driven a paradigm shift, where large language models (LLMs) with billions or trillions of parameters are trained on vast datasets, achieving unprecedented success across a series of language tasks. However, despite these successes, LLMs still rely on probabilistic modeling, which often captures spurious correlations rooted in linguistic patterns and social stereotypes, rather than the true causal relationships between entities and events. This limitation renders LLMs vulnerable to issues such as demographic biases, social stereotypes, and LLM hallucinations. These challenges highlight the urgent need to integrate causality into LLMs, moving beyond correlation-driven paradigms to build more reliable and ethically aligned AI systems. While many existing surveys and studies focus on utilizing prompt engineering to activate LLMs for causal knowledge or developing benchmarks to assess their causal reasoning abilities, most of these efforts rely on human intervention to activate pre-trained models. How to embed causality into the training process of LLMs and build more general and intelligent models remains unexplored. Recent research highlights that LLMs function as causal parrots, capable of reciting causal knowledge without truly understanding or applying it. These prompt-based methods are still limited to human interventional improvements. This survey aims to address this gap by exploring how causality can enhance LLMs at every stage of their lifecycle-from token embedding learning and foundation model training to fine-tuning, alignment, inference, and evaluation-paving the way for more interpretable, reliable, and causally-informed models. Additionally, we further outline six promising future directions to advance LLM development, enhance their causal reasoning capabilities, and address the current limitations these models face. △ Less

Submitted 20 October, 2024; originally announced October 2024.

arXiv:2410.15285 [pdf, other]

Contextual Augmented Multi-Model Programming (CAMP): A Hybrid Local-Cloud Copilot Framework

Authors: Yuchen Wang, Shangxin Guo, Chee Wei Tan

Abstract: The advancements in cloud-based Large Languages Models (LLMs) have revolutionized AI-assisted programming. However, their integration into certain local development environments like ones within the Apple software ecosystem (e.g., iOS apps, macOS) remains challenging due to computational demands and sandboxed constraints. This paper presents CAMP, a multi-model AI-assisted programming framework th… ▽ More The advancements in cloud-based Large Languages Models (LLMs) have revolutionized AI-assisted programming. However, their integration into certain local development environments like ones within the Apple software ecosystem (e.g., iOS apps, macOS) remains challenging due to computational demands and sandboxed constraints. This paper presents CAMP, a multi-model AI-assisted programming framework that consists of a local model that employs Retrieval-Augmented Generation (RAG) to retrieve contextual information from the codebase to facilitate context-aware prompt construction thus optimizing the performance of the cloud model, empowering LLMs' capabilities in local Integrated Development Environments (IDEs). The methodology is actualized in Copilot for Xcode, an AI-assisted programming tool crafted for Xcode that employs the RAG module to address software constraints and enables diverse generative programming tasks, including automatic code completion, documentation, error detection, and intelligent user-agent interaction. The results from objective experiments on generated code quality and subjective experiments on user adoption collectively demonstrate the pilot success of the proposed system and mark its significant contributions to the realm of AI-assisted programming. △ Less

Submitted 20 October, 2024; originally announced October 2024.

Comments: 12 pages, 3 figures, 4 tables

arXiv:2410.15270 [pdf, other]

Can LVLMs Describe Videos like Humans? A Five-in-One Video Annotations Benchmark for Better Human-Machine Comparison

Authors: Shiyu Hu, Xuchen Li, Xuzhao Li, Jing Zhang, Yipei Wang, Xin Zhao, Kang Hao Cheong

Abstract: Large vision-language models (LVLMs) have made significant strides in addressing complex video tasks, sparking researchers' interest in their human-like multimodal understanding capabilities. Video description serves as a fundamental task for evaluating video comprehension, necessitating a deep understanding of spatial and temporal dynamics, which presents challenges for both humans and machines.… ▽ More Large vision-language models (LVLMs) have made significant strides in addressing complex video tasks, sparking researchers' interest in their human-like multimodal understanding capabilities. Video description serves as a fundamental task for evaluating video comprehension, necessitating a deep understanding of spatial and temporal dynamics, which presents challenges for both humans and machines. Thus, investigating whether LVLMs can describe videos as comprehensively as humans (through reasonable human-machine comparisons using video captioning as a proxy task) will enhance our understanding and application of these models. However, current benchmarks for video comprehension have notable limitations, including short video durations, brief annotations, and reliance on a single annotator's perspective. These factors hinder a comprehensive assessment of LVLMs' ability to understand complex, lengthy videos and prevent the establishment of a robust human baseline that accurately reflects human video comprehension capabilities. To address these issues, we propose a novel benchmark, FIOVA (Five In One Video Annotations), designed to evaluate the differences between LVLMs and human understanding more comprehensively. FIOVA includes 3,002 long video sequences (averaging 33.6 seconds) that cover diverse scenarios with complex spatiotemporal relationships. Each video is annotated by five distinct annotators, capturing a wide range of perspectives and resulting in captions that are 4-15 times longer than existing benchmarks, thereby establishing a robust baseline that represents human understanding comprehensively for the first time in video description tasks. Using the FIOVA benchmark, we conducted an in-depth evaluation of six state-of-the-art LVLMs, comparing their performance with humans. More detailed information can be found at https://huuuuusy.github.io/fiova/. △ Less

Submitted 19 October, 2024; originally announced October 2024.

arXiv:2410.15250 [pdf, other]

Multimodal Policies with Physics-informed Representations

Authors: Haodong Feng, Peiyan Hu, Yue Wang, Dixia Fan

Abstract: In the control problems of the PDE systems, observation is important to make the decision. However, the observation is generally sparse and missing in practice due to the limitation and fault of sensors. The above challenges cause observations with uncertain quantities and modalities. Therefore, how to leverage the uncertain observations as the states in control problems of the PDE systems has bec… ▽ More In the control problems of the PDE systems, observation is important to make the decision. However, the observation is generally sparse and missing in practice due to the limitation and fault of sensors. The above challenges cause observations with uncertain quantities and modalities. Therefore, how to leverage the uncertain observations as the states in control problems of the PDE systems has become a scientific problem. The dynamics of PDE systems rely on the initial conditions, boundary conditions, and PDE formula. Given the above three elements, PINNs can be used to solve the PDE systems. In this work, we discover that the neural network can also be used to identify and represent the PDE systems using PDE loss and sparse data loss. Inspired by the above discovery, we propose a Physics-Informed Representation (PIR) algorithm for multimodal policies in PDE systems' control. It leverages PDE loss to fit the neural network and data loss calculated on the observations with random quantities and modalities to propagate the information of initial conditions and boundary conditions into the inputs. The inputs can be the learnable parameters or the output of the encoders. Then, under the environments of the PDE systems, such inputs are the representation of the current state. In our experiments, the PIR illustrates the superior consistency with the features of the ground truth compared with baselines, even when there are missing modalities. Furthermore, PIR has been successfully applied in the downstream control tasks where the robot leverages the learned state by PIR faster and more accurately, passing through the complex vortex street from a random starting location to reach a random target. △ Less

Submitted 19 October, 2024; originally announced October 2024.

arXiv:2410.15240 [pdf, other]

Fastrack: Fast IO for Secure ML using GPU TEEs

Authors: Yongqin Wang, Rachit Rajat, Jonghyun Lee, Tingting Tang, Murali Annavaram

Abstract: As cloud-based ML expands, ensuring data security during training and inference is critical. GPU-based Trusted Execution Environments (TEEs) offer secure, high-performance solutions, with CPU TEEs managing data movement and GPU TEEs handling authentication and computation. However, CPU-to-GPU communication overheads significantly hinder performance, as data must be encrypted, authenticated, decryp… ▽ More As cloud-based ML expands, ensuring data security during training and inference is critical. GPU-based Trusted Execution Environments (TEEs) offer secure, high-performance solutions, with CPU TEEs managing data movement and GPU TEEs handling authentication and computation. However, CPU-to-GPU communication overheads significantly hinder performance, as data must be encrypted, authenticated, decrypted, and verified, increasing costs by 12.69 to 33.53 times. This results in GPU TEE inference becoming 54.12% to 903.9% slower and training 10% to 455% slower than non-TEE systems, undermining GPU TEE advantages in latency-sensitive applications. This paper analyzes Nvidia H100 TEE protocols and identifies three key overheads: 1) redundant CPU re-encryption, 2) limited authentication parallelism, and 3) unnecessary operation serialization. We propose Fastrack, optimizing with 1) direct GPU TEE communication, 2) parallelized authentication, and 3) overlapping decryption with PCI-e transmission. These optimizations cut communication costs and reduce inference/training runtime by up to 84.6%, with minimal overhead compared to non-TEE systems. △ Less

Submitted 19 October, 2024; originally announced October 2024.

arXiv:2410.15226 [pdf, other]

On the Diversity of Synthetic Data and its Impact on Training Large Language Models

Authors: Hao Chen, Abdul Waheed, Xiang Li, Yidong Wang, Jindong Wang, Bhiksha Raj, Marah I. Abdin

Abstract: The rise of Large Language Models (LLMs) has accentuated the need for diverse, high-quality pre-training data. Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility. While previous literature has focused predominantly on the quality and quantity of real data, our work enables the measurement of diversity in synthetic data and explores its impact on LLM… ▽ More The rise of Large Language Models (LLMs) has accentuated the need for diverse, high-quality pre-training data. Synthetic data emerges as a viable solution to the challenges of data scarcity and inaccessibility. While previous literature has focused predominantly on the quality and quantity of real data, our work enables the measurement of diversity in synthetic data and explores its impact on LLM performance. We study the downstream effects of synthetic data diversity during both the pre-training and fine-tuning stages by introducing a new diversity metric, \textit{LLM cluster-agent}, designed to evaluate the diversity of synthetic datasets. Through a series of controlled experiments with models of 350M and 1.4B parameters, we demonstrate that the proposed cluster-based LLM scoring of diversity correlates positively with both pre-training and supervised fine-tuning performance. Our findings also reveal that synthetic data diversity in pre-training affects supervised fine-tuning more significantly than pre-training itself, even for smaller models. We hope this study advances our understanding of the optimal use of synthetic data in LLM training and opens new avenues for efficient data generation processes. △ Less

Submitted 19 October, 2024; originally announced October 2024.

arXiv:2410.15175 [pdf]

Implicit neural representation for free-breathing MR fingerprinting (INR-MRF): co-registered 3D whole-liver water T1, water T2, proton density fat fraction, and R2* mapping

Authors: Chao Li, Jiahao Li, Jinwei Zhang, Eddy Solomon, Alexey V. Dimov, Pascal Spincemaille, Thanh D. Nguyen, Martin R. Prince, Yi Wang

Abstract: Purpose: To develop an MRI technique for free-breathing 3D whole-liver quantification of water T1, water T2, proton density fat fraction (PDFF), R2*. Methods: An Eight-echo spoiled gradient echo pulse sequence with spiral readout was developed by interleaving inversion recovery and T2 magnetization preparation. We propose a neural network based on a 4D and a 3D implicit neural representation (INR)… ▽ More Purpose: To develop an MRI technique for free-breathing 3D whole-liver quantification of water T1, water T2, proton density fat fraction (PDFF), R2*. Methods: An Eight-echo spoiled gradient echo pulse sequence with spiral readout was developed by interleaving inversion recovery and T2 magnetization preparation. We propose a neural network based on a 4D and a 3D implicit neural representation (INR) which simultaneously learns the motion deformation fields and the static reference frame MRI subspace images respectively. Water and fat singular images were separated during network training, with no need of performing retrospective water-fat separation. T1, T2, R2* and proton density fat fraction (PDFF) produced by the proposed method were validated in vivo on 10 healthy subjects, using quantitative maps generated from conventional scans as reference. Results: Our results showed minimal bias and narrow 95% limits of agreement on T1, T2, R2* and PDFF values in the liver compared to conventional breath-holding scans. Conclusions: INR-MRF enabled co-registered 3D whole liver T1, T2, R2* and PDFF mapping in a single free-breathing scan. △ Less

Submitted 19 October, 2024; originally announced October 2024.

arXiv:2410.15164 [pdf, other]

SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation

Authors: Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Kaiwen Zhou, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao

Abstract: Smartphone agents are increasingly important for helping users control devices efficiently, with (Multimodal) Large Language Model (MLLM)-based approaches emerging as key contenders. Fairly comparing these agents is essential but challenging, requiring a varied task scope, the integration of agents with different implementations, and a generalisable evaluation pipeline to assess their strengths an… ▽ More Smartphone agents are increasingly important for helping users control devices efficiently, with (Multimodal) Large Language Model (MLLM)-based approaches emerging as key contenders. Fairly comparing these agents is essential but challenging, requiring a varied task scope, the integration of agents with different implementations, and a generalisable evaluation pipeline to assess their strengths and weaknesses. In this paper, we present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents in an interactive environment that simulates real-world conditions. SPA-Bench offers three key contributions: (1) A diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines; (2) A plug-and-play framework enabling real-time agent interaction with Android devices, integrating over ten agents with the flexibility to add more; (3) A novel evaluation pipeline that automatically assesses agent performance across multiple dimensions, encompassing seven metrics related to task completion and resource consumption. Our extensive experiments across tasks and agents reveal challenges like interpreting mobile user interfaces, action grounding, memory retention, and execution costs. We propose future research directions to ease these difficulties, moving closer to real-world smartphone agent applications. △ Less

Submitted 19 October, 2024; originally announced October 2024.

arXiv:2410.15128 [pdf, other]

Generalized Flow Matching for Transition Dynamics Modeling

Authors: Haibo Wang, Yuxuan Qiu, Yanze Wang, Rob Brekelmans, Yuanqi Du

Abstract: Simulating transition dynamics between metastable states is a fundamental challenge in dynamical systems and stochastic processes with wide real-world applications in understanding protein folding, chemical reactions and neural activities. However, the computational challenge often lies on sampling exponentially many paths in which only a small fraction ends in the target metastable state due to e… ▽ More Simulating transition dynamics between metastable states is a fundamental challenge in dynamical systems and stochastic processes with wide real-world applications in understanding protein folding, chemical reactions and neural activities. However, the computational challenge often lies on sampling exponentially many paths in which only a small fraction ends in the target metastable state due to existence of high energy barriers. To amortize the cost, we propose a data-driven approach to warm-up the simulation by learning nonlinear interpolations from local dynamics. Specifically, we infer a potential energy function from local dynamics data. To find plausible paths between two metastable states, we formulate a generalized flow matching framework that learns a vector field to sample propable paths between the two marginal densities under the learned energy function. Furthermore, we iteratively refine the model by assigning importance weights to the sampled paths and buffering more likely paths for training. We validate the effectiveness of the proposed method to sample probable paths on both synthetic and real-world molecular systems. △ Less

Submitted 19 October, 2024; originally announced October 2024.

arXiv:2410.15105 [pdf, other]

Standardizing Generative Face Video Compression using Supplemental Enhancement Information

Authors: Bolin Chen, Yan Ye, Jie Chen, Ru-Ling Liao, Shanzhi Yin, Shiqi Wang, Kaifa Yang, Yue Li, Yiling Xu, Ye-Kui Wang, Shiv Gehlot, Guan-Ming Su, Peng Yin, Sean McCarthy, Gary J. Sullivan

Abstract: This paper proposes a Generative Face Video Compression (GFVC) approach using Supplemental Enhancement Information (SEI), where a series of compact spatial and temporal representations of a face video signal (i.e., 2D/3D key-points, facial semantics and compact features) can be coded using SEI message and inserted into the coded video bitstream. At the time of writing, the proposed GFVC approach i… ▽ More This paper proposes a Generative Face Video Compression (GFVC) approach using Supplemental Enhancement Information (SEI), where a series of compact spatial and temporal representations of a face video signal (i.e., 2D/3D key-points, facial semantics and compact features) can be coded using SEI message and inserted into the coded video bitstream. At the time of writing, the proposed GFVC approach is an official "technology under consideration" (TuC) for standardization by the Joint Video Experts Team (JVET) of ISO/IEC JVT 1/SC 29 and ITU-T SG16. To the best of the authors' knowledge, the JVET work on the proposed SEI-based GFVC approach is the first standardization activity for generative video compression. The proposed SEI approach has not only advanced the reconstruction quality of early-day Model-Based Coding (MBC) via the state-of-the-art generative technique, but also established a new SEI definition for future GFVC applications and deployment. Experimental results illustrate that the proposed SEI-based GFVC approach can achieve remarkable rate-distortion performance compared with the latest Versatile Video Coding (VVC) standard, whilst also potentially enabling a wide variety of functionalities including user-specified animation/filtering and metaverse-related applications. △ Less

Submitted 19 October, 2024; originally announced October 2024.

arXiv:2410.15060 [pdf, other]

BYOCL: Build Your Own Consistent Latent with Hierarchical Representative Latent Clustering

Authors: Jiayue Dai, Yunya Wang, Yihan Fang, Yuetong Chen, Butian Xiong

Abstract: To address the semantic inconsistency issue with SAM or other single-image segmentation models handling image sequences, we introduce BYOCL. This novel model outperforms SAM in extensive experiments, showcasing its Hierarchical prototype capabilities across CLIP and other representations. BYOCL significantly reduces time and space consumption by dividing inputs into smaller batches, achieving expo… ▽ More To address the semantic inconsistency issue with SAM or other single-image segmentation models handling image sequences, we introduce BYOCL. This novel model outperforms SAM in extensive experiments, showcasing its Hierarchical prototype capabilities across CLIP and other representations. BYOCL significantly reduces time and space consumption by dividing inputs into smaller batches, achieving exponential time reduction compared to previous methods. Our approach leverages the SAM image encoder for feature extraction, followed by Intra-Batch and Inter-Batch clustering algorithms. Extensive experiments demonstrate that BYOCL far exceeds the previous state-of-the-art single image segmentation model. Our work is the first to apply consistent segmentation using foundation models without requiring training, utilizing plug-and-play modules for any latent space, making our method highly efficientModels are available at \href{https://github.com/cyt1202/BYOCL.git △ Less

Submitted 19 October, 2024; originally announced October 2024.

Comments: 5 pages, 5 figures

arXiv:2410.15036 [pdf, other]

EViT-Unet: U-Net Like Efficient Vision Transformer for Medical Image Segmentation on Mobile and Edge Devices

Authors: Xin Li, Wenhui Zhu, Xuanzhao Dong, Oana M. Dumitrascu, Yalin Wang

Abstract: With the rapid development of deep learning, CNN-based U-shaped networks have succeeded in medical image segmentation and are widely applied for various tasks. However, their limitations in capturing global features hinder their performance in complex segmentation tasks. The rise of Vision Transformer (ViT) has effectively compensated for this deficiency of CNNs and promoted the application of ViT… ▽ More With the rapid development of deep learning, CNN-based U-shaped networks have succeeded in medical image segmentation and are widely applied for various tasks. However, their limitations in capturing global features hinder their performance in complex segmentation tasks. The rise of Vision Transformer (ViT) has effectively compensated for this deficiency of CNNs and promoted the application of ViT-based U-networks in medical image segmentation. However, the high computational demands of ViT make it unsuitable for many medical devices and mobile platforms with limited resources, restricting its deployment on resource-constrained and edge devices. To address this, we propose EViT-UNet, an efficient ViT-based segmentation network that reduces computational complexity while maintaining accuracy, making it ideal for resource-constrained medical devices. EViT-UNet is built on a U-shaped architecture, comprising an encoder, decoder, bottleneck layer, and skip connections, combining convolutional operations with self-attention mechanisms to optimize efficiency. Experimental results demonstrate that EViT-UNet achieves high accuracy in medical image segmentation while significantly reducing computational complexity. △ Less

Submitted 19 October, 2024; originally announced October 2024.

Comments: 5 pages, 3 figures

arXiv:2410.14996 [pdf, other]

EDRF: Enhanced Driving Risk Field Based on Multimodal Trajectory Prediction and Its Applications

Authors: Junkai Jiang, Zeyu Han, Yuning Wang, Mengchi Cai, Qingwen Meng, Qing Xu, Jianqiang Wang

Abstract: Driving risk assessment is crucial for both autonomous vehicles and human-driven vehicles. The driving risk can be quantified as the product of the probability that an event (such as collision) will occur and the consequence of that event. However, the probability of events occurring is often difficult to predict due to the uncertainty of drivers' or vehicles' behavior. Traditional methods general… ▽ More Driving risk assessment is crucial for both autonomous vehicles and human-driven vehicles. The driving risk can be quantified as the product of the probability that an event (such as collision) will occur and the consequence of that event. However, the probability of events occurring is often difficult to predict due to the uncertainty of drivers' or vehicles' behavior. Traditional methods generally employ kinematic-based approaches to predict the future trajectories of entities, which often yield unrealistic prediction results. In this paper, the Enhanced Driving Risk Field (EDRF) model is proposed, integrating deep learning-based multimodal trajectory prediction results with Gaussian distribution models to quantitatively capture the uncertainty of traffic entities' behavior. The applications of the EDRF are also proposed. It is applied across various tasks (traffic risk monitoring, ego-vehicle risk analysis, and motion and trajectory planning) through the defined concept Interaction Risk (IR). Adequate example scenarios are provided for each application to illustrate the effectiveness of the model. △ Less

Submitted 19 October, 2024; originally announced October 2024.

arXiv:2410.14943 [pdf, other]

doi 10.5281/zenodo.13844758

Workflows Community Summit 2024: Future Trends and Challenges in Scientific Workflows

Authors: Rafael Ferreira da Silva, Deborah Bard, Kyle Chard, Shaun de Witt, Ian T. Foster, Tom Gibbs, Carole Goble, William Godoy, Johan Gustafsson, Utz-Uwe Haus, Stephen Hudson, Shantenu Jha, Laila Los, Drew Paine, Frédéric Suter, Logan Ward, Sean Wilkinson, Marcos Amaris, Yadu Babuji, Jonathan Bader, Riccardo Balin, Daniel Balouek, Sarah Beecroft, Khalid Belhajjame, Rajat Bhattarai , et al. (86 additional authors not shown)

Abstract: The Workflows Community Summit gathered 111 participants from 18 countries to discuss emerging trends and challenges in scientific workflows, focusing on six key areas: time-sensitive workflows, AI-HPC convergence, multi-facility workflows, heterogeneous HPC environments, user experience, and FAIR computational workflows. The integration of AI and exascale computing has revolutionized scientific w… ▽ More The Workflows Community Summit gathered 111 participants from 18 countries to discuss emerging trends and challenges in scientific workflows, focusing on six key areas: time-sensitive workflows, AI-HPC convergence, multi-facility workflows, heterogeneous HPC environments, user experience, and FAIR computational workflows. The integration of AI and exascale computing has revolutionized scientific workflows, enabling higher-fidelity models and complex, time-sensitive processes, while introducing challenges in managing heterogeneous environments and multi-facility data dependencies. The rise of large language models is driving computational demands to zettaflop scales, necessitating modular, adaptable systems and cloud-service models to optimize resource utilization and ensure reproducibility. Multi-facility workflows present challenges in data movement, curation, and overcoming institutional silos, while diverse hardware architectures require integrating workflow considerations into early system design and developing standardized resource management tools. The summit emphasized improving user experience in workflow systems and ensuring FAIR workflows to enhance collaboration and accelerate scientific discovery. Key recommendations include developing standardized metrics for time-sensitive workflows, creating frameworks for cloud-HPC integration, implementing distributed-by-design workflow modeling, establishing multi-facility authentication protocols, and accelerating AI integration in HPC workflow management. The summit also called for comprehensive workflow benchmarks, workflow-specific UX principles, and a FAIR workflow maturity model, highlighting the need for continued collaboration in addressing the complex challenges posed by the convergence of AI, HPC, and multi-facility research environments. △ Less

Submitted 18 October, 2024; originally announced October 2024.

Report number: ORNL/TM-2024/3573

arXiv:2410.14881 [pdf, other]

Class-RAG: Content Moderation with Retrieval Augmented Generation

Authors: Jianfa Chen, Emily Shen, Trupti Bavalatti, Xiaowen Lin, Yongkai Wang, Shuming Hu, Harihar Subramanyam, Ksheeraj Sai Vepuri, Ming Jiang, Ji Qi, Li Chen, Nan Jiang, Ankit Jain

Abstract: Robust content moderation classifiers are essential for the safety of Generative AI systems. Content moderation, or safety classification, is notoriously ambiguous: differences between safe and unsafe inputs are often extremely subtle, making it difficult for classifiers (and indeed, even humans) to properly distinguish violating vs. benign samples without further context or explanation. Furthermo… ▽ More Robust content moderation classifiers are essential for the safety of Generative AI systems. Content moderation, or safety classification, is notoriously ambiguous: differences between safe and unsafe inputs are often extremely subtle, making it difficult for classifiers (and indeed, even humans) to properly distinguish violating vs. benign samples without further context or explanation. Furthermore, as these technologies are deployed across various applications and audiences, scaling risk discovery and mitigation through continuous model fine-tuning becomes increasingly challenging and costly. To address these challenges, we propose a Classification approach employing Retrieval-Augmented Generation (Class-RAG). Class-RAG extends the capability of its base LLM through access to a retrieval library which can be dynamically updated to enable semantic hotfixing for immediate, flexible risk mitigation. Compared to traditional fine-tuned models, Class-RAG demonstrates flexibility and transparency in decision-making. As evidenced by empirical studies, Class-RAG outperforms on classification and is more robust against adversarial attack. Besides, our findings suggest that Class-RAG performance scales with retrieval library size, indicating that increasing the library size is a viable and low-cost approach to improve content moderation. △ Less

Submitted 18 October, 2024; originally announced October 2024.

Comments: 11 pages, submit to ACL

arXiv:2410.14725 [pdf, other]

Rethinking Token Reduction for State Space Models

Authors: Zheng Zhan, Yushu Wu, Zhenglun Kong, Changdi Yang, Yifan Gong, Xuan Shen, Xue Lin, Pu Zhao, Yanzhi Wang

Abstract: Recent advancements in State Space Models (SSMs) have attracted significant interest, particularly in models optimized for parallel training and handling long-range dependencies. Architectures like Mamba have scaled to billions of parameters with selective SSM. To facilitate broader applications using Mamba, exploring its efficiency is crucial. While token reduction techniques offer a straightforw… ▽ More Recent advancements in State Space Models (SSMs) have attracted significant interest, particularly in models optimized for parallel training and handling long-range dependencies. Architectures like Mamba have scaled to billions of parameters with selective SSM. To facilitate broader applications using Mamba, exploring its efficiency is crucial. While token reduction techniques offer a straightforward post-training strategy, we find that applying existing methods directly to SSMs leads to substantial performance drops. Through insightful analysis, we identify the reasons for this failure and the limitations of current techniques. In response, we propose a tailored, unified post-training token reduction method for SSMs. Our approach integrates token importance and similarity, thus taking advantage of both pruning and merging, to devise a fine-grained intra-layer token reduction strategy. Extensive experiments show that our method improves the average accuracy by 5.7% to 13.1% on six benchmarks with Mamba-2 compared to existing methods, while significantly reducing computational demands and memory requirements. △ Less

Submitted 15 October, 2024; originally announced October 2024.

Comments: EMNLP 2024

arXiv:2410.14722 [pdf, other]

Design Studies Of A Pulsed Quasimonoenergetic 2-keV Neutron Source For Calibration Of Low Threshold Dark Matter Detectors

Authors: L. Chaplinsky, S. Fiorucci, C. W. Fink, M. Garcia-Sciveres, W. Guo, S. A. Hertel, J. K. Wuko, X. Li, J. Lin, R. Mahapatra, W. Matava, D. N. McKinsey, D. Z. Osterman, P. K. Patel, B. Penning, H. D. Pinckney, M. Platt, Y. Qi, M. Reed, G. R. C Rischbieter, R. K. Romani, P. Sorensen, V. Velan, G. Wang, Y. Wang , et al. (2 additional authors not shown)

Abstract: We describe design studies for a pulsed quasi-monoenergetic 2-keV neutron source for calibration of sub-keV nuclear recoils. Such a calibration is required for detectors sensitive to sub-GeV dark matter and also the coherent elastic scattering of reactor neutrinos. In our design, neutrons from a commercial deuterium-tritium generator are moderated to the keV scale and then filtered to the monoener… ▽ More We describe design studies for a pulsed quasi-monoenergetic 2-keV neutron source for calibration of sub-keV nuclear recoils. Such a calibration is required for detectors sensitive to sub-GeV dark matter and also the coherent elastic scattering of reactor neutrinos. In our design, neutrons from a commercial deuterium-tritium generator are moderated to the keV scale and then filtered to the monoenergetic spectrum using a feature in the neutron cross section of scandium. In this approach, unmoderated high-energy neutrons form a challenging background, along with gammas from neutron capture in the moderator materials. We describe the optimization of the moderator+filter and shielding geometry, and find a geometry that in simulation achieves both the target neutron flux at 2 keV and subdominant rates of background interactions. Lastly, we describe a future path to lower-energy (few eV scale) calibrations using time-of-flight and sub-keV neutrons. △ Less

Submitted 14 October, 2024; originally announced October 2024.

Comments: 10 pages, 7 figures

arXiv:2410.14682 [pdf, other]

ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models

Authors: Lingfeng Zhang, Yuening Wang, Hongjian Gu, Atia Hamidizadeh, Zhanguang Zhang, Yuecheng Liu, Yutong Wang, David Gamaliel Arcos Bravo, Junyi Dong, Shunbo Zhou, Tongtong Cao, Yuzheng Zhuang, Yingxue Zhang, Jianye Hao

Abstract: Recent advancements in Large Language Models (LLMs) have spurred numerous attempts to apply these technologies to embodied tasks, particularly focusing on high-level task planning and task decomposition. To further explore this area, we introduce a new embodied task planning benchmark, ET-Plan-Bench, which specifically targets embodied task planning using LLMs. It features a controllable and diver… ▽ More Recent advancements in Large Language Models (LLMs) have spurred numerous attempts to apply these technologies to embodied tasks, particularly focusing on high-level task planning and task decomposition. To further explore this area, we introduce a new embodied task planning benchmark, ET-Plan-Bench, which specifically targets embodied task planning using LLMs. It features a controllable and diverse set of embodied tasks varying in different levels of difficulties and complexities, and is designed to evaluate two critical dimensions of LLMs' application in embodied task understanding: spatial (relation constraint, occlusion for target objects) and temporal & causal understanding of the sequence of actions in the environment. By using multi-source simulators as the backend simulator, it can provide immediate environment feedback to LLMs, which enables LLMs to interact dynamically with the environment and re-plan as necessary. We evaluated the state-of-the-art open source and closed source foundation models, including GPT-4, LLAMA and Mistral on our proposed benchmark. While they perform adequately well on simple navigation tasks, their performance can significantly deteriorate when faced with tasks that require a deeper understanding of spatial, temporal, and causal relationships. Thus, our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework that presents a significant challenge to the latest foundation models. We hope it can spark and drive further research in embodied task planning using foundation models. △ Less

Submitted 2 October, 2024; originally announced October 2024.

arXiv:2410.14633 [pdf, other]

Swiss Army Knife: Synergizing Biases in Knowledge from Vision Foundation Models for Multi-Task Learning

Authors: Yuxiang Lu, Shengcao Cao, Yu-Xiong Wang

Abstract: Vision Foundation Models (VFMs) have demonstrated outstanding performance on numerous downstream tasks. However, due to their inherent representation biases originating from different training paradigms, VFMs exhibit advantages and disadvantages across distinct vision tasks. Although amalgamating the strengths of multiple VFMs for downstream tasks is an intuitive strategy, effectively exploiting t… ▽ More Vision Foundation Models (VFMs) have demonstrated outstanding performance on numerous downstream tasks. However, due to their inherent representation biases originating from different training paradigms, VFMs exhibit advantages and disadvantages across distinct vision tasks. Although amalgamating the strengths of multiple VFMs for downstream tasks is an intuitive strategy, effectively exploiting these biases remains a significant challenge. In this paper, we propose a novel and versatile "Swiss Army Knife" (SAK) solution, which adaptively distills knowledge from a committee of VFMs to enhance multi-task learning. Unlike existing methods that use a single backbone for knowledge transfer, our approach preserves the unique representation bias of each teacher by collaborating the lightweight Teacher-Specific Adapter Path modules with the Teacher-Agnostic Stem. Through dynamic selection and combination of representations with Mixture-of-Representations Routers, our SAK is capable of synergizing the complementary strengths of multiple VFMs. Extensive experiments show that our SAK remarkably outperforms prior state of the arts in multi-task learning by 10% on the NYUD-v2 benchmark, while also providing a flexible and robust framework that can readily accommodate more advanced model designs. △ Less

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.14600 [pdf, other]

A dataset for cyber threat intelligence modeling of connected autonomous vehicles

Authors: Yinghui Wang, Yilong Ren, Hongmao Qin, Zhiyong Cui, Yanan Zhao, Haiyang Yu

Abstract: Cyber attacks have become a vital threat to connected autonomous vehicles in intelligent transportation systems. Cyber threat intelligence, as the collection of cyber threat information, provides an ideal approach for responding to emerging vehicle cyber threats and enabling proactive security defense. Obtaining valuable information from enormous cybersecurity data using knowledge extraction techn… ▽ More Cyber attacks have become a vital threat to connected autonomous vehicles in intelligent transportation systems. Cyber threat intelligence, as the collection of cyber threat information, provides an ideal approach for responding to emerging vehicle cyber threats and enabling proactive security defense. Obtaining valuable information from enormous cybersecurity data using knowledge extraction technologies to achieve cyber threat intelligence modeling is an effective means to ensure automotive cybersecurity. Unfortunately, there is no existing cybersecurity dataset available for cyber threat intelligence modeling research in the automotive field. This paper reports the creation of a cyber threat intelligence corpus focusing on vehicle cybersecurity knowledge mining. This dataset, annotated using a joint labeling strategy, comprises 908 real automotive cybersecurity reports, containing 3678 sentences, 8195 security entities and 4852 semantic relations. We further conduct a comprehensive analysis of cyber threat intelligence mining algorithms based on this corpus. The proposed dataset will serve as a valuable resource for evaluating the performance of existing algorithms and advancing research in cyber threat intelligence modeling within the automotive field. △ Less

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.14577 [pdf]

Reimagining partial thickness keratoplasty: An eye mountable robot for autonomous big bubble needle insertion

Authors: Y. Wang, J. D. Opfermann, J. Yu, H. Yi, J. Kaluna, R. Biswas, R. Zuo, W. Gensheimer, A. Krieger, J. U. Kang

Abstract: Autonomous surgical robots have demonstrated significant potential to standardize surgical outcomes, driving innovations that enhance safety and consistency regardless of individual surgeon experience. Deep anterior lamellar keratoplasty (DALK), a partial thickness corneal transplant surgery aimed at replacing the anterior part of cornea above Descemet membrane (DM), would greatly benefit from an… ▽ More Autonomous surgical robots have demonstrated significant potential to standardize surgical outcomes, driving innovations that enhance safety and consistency regardless of individual surgeon experience. Deep anterior lamellar keratoplasty (DALK), a partial thickness corneal transplant surgery aimed at replacing the anterior part of cornea above Descemet membrane (DM), would greatly benefit from an autonomous surgical approach as it highly relies on surgeon skill with high perforation rates. In this study, we proposed a novel autonomous surgical robotic system (AUTO-DALK) based on a customized neural network capable of precise needle control and consistent big bubble demarcation on cadaver and live rabbit models. We demonstrate the feasibility of an AI-based image-guided vertical drilling approach for big bubble generation, in contrast to the conventional horizontal needle approach. Our system integrates an optical coherence tomography (OCT) fiber optic distal sensor into the eye-mountable micro robotic system, which automatically segments OCT M-mode depth signals to identify corneal layers using a custom deep learning algorithm. It enables the robot to autonomously guide the needle to targeted tissue layers via a depth-controlled feedback loop. We compared autonomous needle insertion performance and resulting pneumo-dissection using AUTO-DALK against 1) freehand insertion, 2) OCT sensor guided manual insertion, and 3) teleoperated robotic insertion, reporting significant improvements in insertion depth, pneumo-dissection depth, task completion time, and big bubble formation. Ex vivo and in vivo results indicate that the AI-driven, AUTO-DALK system, is a promising solution to standardize pneumo-dissection outcomes for partial thickness keratoplasty. △ Less

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.14379 [pdf, other]

AnomalyNCD: Towards Novel Anomaly Class Discovery in Industrial Scenarios

Authors: Ziming Huang, Xurui Li, Haotian Liu, Feng Xue, Yuzhe Wang, Yu Zhou

Abstract: In the industrial scenario, anomaly detection could locate but cannot classify anomalies. To complete their capability, we study to automatically discover and recognize visual classes of industrial anomalies. In terms of multi-class anomaly classification, previous methods cluster anomalies represented by frozen pre-trained models but often fail due to poor discrimination. Novel class discovery (N… ▽ More In the industrial scenario, anomaly detection could locate but cannot classify anomalies. To complete their capability, we study to automatically discover and recognize visual classes of industrial anomalies. In terms of multi-class anomaly classification, previous methods cluster anomalies represented by frozen pre-trained models but often fail due to poor discrimination. Novel class discovery (NCD) has the potential to tackle this. However, it struggles with non-prominent and semantically weak anomalies that challenge network learning focus. To address these, we introduce AnomalyNCD, a multi-class anomaly classification framework compatible with existing anomaly detection methods. This framework learns anomaly-specific features and classifies anomalies in a self-supervised manner. Initially, a technique called Main Element Binarization (MEBin) is first designed, which segments primary anomaly regions into masks to alleviate the impact of incorrect detections on learning. Subsequently, we employ mask-guided contrastive representation learning to improve feature discrimination, which focuses network attention on isolated anomalous regions and reduces the confusion of erroneous inputs through re-corrected pseudo labels. Finally, to enable flexible classification at both region and image levels during inference, we develop a region merging strategy that determines the overall image category based on the classified anomaly regions. Our method outperforms the state-of-the-art works on the MVTec AD and MTD datasets. Compared with the current methods, AnomalyNCD combined with zero-shot anomaly detection method achieves a 10.8% $F_1$ gain, 8.8% NMI gain, and 9.5% ARI gain on MVTec AD, 12.8% $F_1$ gain, 5.7% NMI gain, and 10.8% ARI gain on MTD. The source code is available at https://github.com/HUST-SLOW/AnomalyNCD. △ Less

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.14332 [pdf, other]

Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension

Authors: Yin Xie, Kaicheng Yang, Ninghua Yang, Weimo Deng, Xiangzi Dai, Tiancheng Gu, Yumeng Wang, Xiang An, Yongle Zhao, Ziyong Feng, Jiankang Deng

Abstract: Recent advances in Large Language Models (LLMs) have catalyzed the development of Large Multimodal Models (LMMs). However, existing research primarily focuses on tuning language and image instructions, ignoring the critical pretraining phase where models learn to process textual and visual modalities jointly. In this paper, we propose a new pretraining paradigm for LMMs to enhance the visual compr… ▽ More Recent advances in Large Language Models (LLMs) have catalyzed the development of Large Multimodal Models (LMMs). However, existing research primarily focuses on tuning language and image instructions, ignoring the critical pretraining phase where models learn to process textual and visual modalities jointly. In this paper, we propose a new pretraining paradigm for LMMs to enhance the visual comprehension capabilities of LLMs by introducing a novel cross-modal comprehension stage. Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with the most relevant prompt tokens. Then, we conceptualize visual tokens as analogous to a "foreign language" for the LLMs and propose a mixed attention mechanism with bidirectional visual attention and unidirectional textual attention to comprehensively enhance the understanding of visual tokens. Meanwhile, we integrate a detailed caption generation task, leveraging rich descriptions to further facilitate LLMs in understanding visual semantic information. After pretraining on 1.5 million publicly accessible data, we present a new foundation model called Croc. Experimental results demonstrate that Croc achieves new state-of-the-art performance on massive vision-language benchmarks. To support reproducibility and facilitate further research, we release the training code and pre-trained model weights at https://github.com/deepglint/Croc. △ Less

Submitted 18 October, 2024; originally announced October 2024.

Comments: 18 pages, 11 figures

arXiv:2410.14331 [pdf, other]

ChartifyText: Automated Chart Generation from Data-Involved Texts via LLM

Authors: Songheng Zhang, Lei Wang, Toby Jia-Jun Li, Qiaomu Shen, Yixin Cao, Yong Wang

Abstract: Text documents with numerical values involved are widely used in various applications such as scientific research, economy, public health and journalism. However, it is difficult for readers to quickly interpret such data-involved texts and gain deep insights. To fill this research gap, this work aims to automatically generate charts to accurately convey the underlying data and ideas to readers, w… ▽ More Text documents with numerical values involved are widely used in various applications such as scientific research, economy, public health and journalism. However, it is difficult for readers to quickly interpret such data-involved texts and gain deep insights. To fill this research gap, this work aims to automatically generate charts to accurately convey the underlying data and ideas to readers, which is essentially a challenging task. The challenges originate from text ambiguities, intrinsic sparsity and uncertainty of data in text documents, and subjective sentiment differences. Specifically, we propose ChartifyText, a novel fully-automated approach that leverages Large Language Models (LLMs) to convert complex data-involved texts to expressive charts. It consists of two major modules: tabular data inference and expressive chart generation. The tabular data inference module employs systematic prompt engineering to guide the LLM (e.g., GPT-4) to infer table data, where data ranges, uncertainties, missing data values and corresponding subjective sentiments are explicitly considered. The expressive chart generation module augments standard charts with intuitive visual encodings and concise texts to accurately convey the underlying data and insights. We extensively evaluate the effectiveness of ChartifyText on real-world data-involved text documents through case studies, in-depth interviews with three visualization experts, and a carefully-designed user study with 15 participants. The results demonstrate the usefulness and effectiveness of ChartifyText in helping readers efficiently and effectively make sense of data-involved texts. △ Less

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.14255 [pdf, other]

Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM Generated Ideas

Authors: Xiang Hu, Hongyu Fu, Jinge Wang, Yifeng Wang, Zhikun Li, Renjun Xu, Yu Lu, Yaochu Jin, Lili Pan, Zhenzhong Lan

Abstract: Scientific innovation is pivotal for humanity, and harnessing large language models (LLMs) to generate research ideas could transform discovery. However, existing LLMs often produce simplistic and repetitive suggestions due to their limited ability in acquiring external knowledge for innovation. To address this problem, we introduce an enhanced planning and search methodology designed to boost the… ▽ More Scientific innovation is pivotal for humanity, and harnessing large language models (LLMs) to generate research ideas could transform discovery. However, existing LLMs often produce simplistic and repetitive suggestions due to their limited ability in acquiring external knowledge for innovation. To address this problem, we introduce an enhanced planning and search methodology designed to boost the creative potential of LLM-based systems. Our approach involves an iterative process to purposely plan the retrieval of external knowledge, progressively enriching the idea generation with broader and deeper insights. Validation through automated and human assessments indicates that our framework substantially elevates the quality of generated ideas, particularly in novelty and diversity. The number of unique novel ideas produced by our framework is 3.4 times higher than without it. Moreover, our method outperforms the current state-of-the-art, generating at least 2.5 times more top-rated ideas based on 170 seed papers in a Swiss Tournament evaluation. △ Less

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.14251 [pdf, other]

Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation

Authors: Shuo Tang, Xianghe Pang, Zexi Liu, Bohan Tang, Rui Ye, Xiaowen Dong, Yanfeng Wang, Siheng Chen

Abstract: Post-training is essential for enabling large language models (LLMs) to follow human instructions. Inspired by the recent success of using LLMs to simulate human society, we leverage multi-agent simulation to automatically generate diverse text-based scenarios, capturing a wide range of real-world human needs. We propose MATRIX, a multi-agent simulator that creates realistic and scalable scenarios… ▽ More Post-training is essential for enabling large language models (LLMs) to follow human instructions. Inspired by the recent success of using LLMs to simulate human society, we leverage multi-agent simulation to automatically generate diverse text-based scenarios, capturing a wide range of real-world human needs. We propose MATRIX, a multi-agent simulator that creates realistic and scalable scenarios. Leveraging these outputs, we introduce a novel scenario-driven instruction generator MATRIX-Gen for controllable and highly realistic data synthesis. Extensive experiments demonstrate that our framework effectively generates both general and domain-specific data. Notably, on AlpacaEval 2 and Arena-Hard benchmarks, Llama-3-8B-Base, post-trained on datasets synthesized by MATRIX-Gen with just 20K instruction-response pairs, outperforms Meta's Llama-3-8B-Instruct model, which was trained on over 10M pairs; see our project at https://github.com/ShuoTang123/MATRIX-Gen. △ Less

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.14228 [pdf, other]

Towards High-Speed Passive Visible Light Communication with Event Cameras and Digital Micro-Mirrors

Authors: Yanxiang Wang, Yiran Shen, Kenuo Xu, Guangrong Zhao, Mahbub Hassan, Chenren Xu, Wen Hu

Abstract: Passive visible light communication (VLC) modulates light propagation or reflection to transmit data without directly modulating the light source. Thus, passive VLC provides an alternative to conventional VLC, enabling communication where the light source cannot be directly controlled. There have been ongoing efforts to explore new methods and devices for modulating light propagation or reflection… ▽ More Passive visible light communication (VLC) modulates light propagation or reflection to transmit data without directly modulating the light source. Thus, passive VLC provides an alternative to conventional VLC, enabling communication where the light source cannot be directly controlled. There have been ongoing efforts to explore new methods and devices for modulating light propagation or reflection. The state-of-the-art has broken the 100 kbps data rate barrier for passive VLC by using a digital micro-mirror device (DMD) as the light modulating platform, or transmitter, and a photo-diode as the receiver. We significantly extend this work by proposing a massive spatial data channel framework for DMDs, where individual channels can be decoded in parallel using an event camera at the receiver. For the event camera, we introduce event processing algorithms to detect numerous channels and decode bits from individual channels with high reliability. Our prototype, built with off-the-shelf event cameras and DMDs, can decode up to $\sim$2,000 parallel channels, achieving a data transmission rate of 1.6 Mbps, markedly surpassing current benchmarks by 16x. △ Less

Submitted 21 October, 2024; v1 submitted 18 October, 2024; originally announced October 2024.

Comments: 14 pages, 21 figures, nonacm

arXiv:2410.14163 [pdf, other]

Aggregation of Bilinear Bipartite Equality Constraints and its Application to Structural Model Updating Problem

Authors: Santanu S Dey, Dahye Han, Yang Wang

Abstract: In this paper, we study the strength of convex relaxations obtained by convexification of aggregation of constraints for a set $S$ described by two bilinear bipartite equalities. Aggregation is the process of rescaling the original constraints by scalar weights and adding the scaled constraints together. It is natural to study the aggregation technique as it yields a single bilinear bipartite equa… ▽ More In this paper, we study the strength of convex relaxations obtained by convexification of aggregation of constraints for a set $S$ described by two bilinear bipartite equalities. Aggregation is the process of rescaling the original constraints by scalar weights and adding the scaled constraints together. It is natural to study the aggregation technique as it yields a single bilinear bipartite equality whose convex hull is already understood from previous literature. On the theoretical side, we present sufficient conditions when $\text{conv}(S)$ can be described by the intersection of convex hulls of a finite number of aggregations, examples when $\text{conv}(S)$ can only be obtained as the intersection of the convex hull of an infinite number of aggregations, and examples when $\text{conv}(S)$ cannot be achieved exactly from the process of aggregation. Computationally, we explore different methods to derive aggregation weights in order to obtain tight convex relaxations. We show that even if an exact convex hull may not be achieved using aggregations, including the convex hull of an aggregation often significantly tightens the outer approximation of $\text{conv}(S)$. Finally, we apply the aggregation method to obtain convex relaxation for the structural model updating problem and show that this yields better bounds within a branch-and-bound tree as compared to not using aggregations. △ Less

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.14112 [pdf, ps, other]

Subdivision method in the Laplacian matching polynomial

Authors: Jiang-Chao Wan, Yi Wang, Zhi-Yuan Wang

Abstract: As a bridge connecting the matching polynomial and the Laplacian matching polynomial of graphs, the subdivision method is expected to be useful for investigating the Laplacian matching polynomial. In this paper, we study applications of the method from three aspects. We prove that the zero sequence of the Laplacian matching polynomial of a graph majorizes its degree sequence, establishing a dual r… ▽ More As a bridge connecting the matching polynomial and the Laplacian matching polynomial of graphs, the subdivision method is expected to be useful for investigating the Laplacian matching polynomial. In this paper, we study applications of the method from three aspects. We prove that the zero sequence of the Laplacian matching polynomial of a graph majorizes its degree sequence, establishing a dual relation between the Laplacian matching polynomial and the characteristic polynomial of the signless Laplacian matrix of graphs. In addition, from different viewpoints, we give a new combinatorial interpretations for the coefficients of the Laplacian matching polynomial. △ Less

Submitted 17 October, 2024; originally announced October 2024.

arXiv:2410.13974 [pdf, other]

Trojan Prompt Attacks on Graph Neural Networks

Authors: Minhua Lin, Zhiwei Zhang, Enyan Dai, Zongyu Wu, Yilong Wang, Xiang Zhang, Suhang Wang

Abstract: Graph Prompt Learning (GPL) has been introduced as a promising approach that uses prompts to adapt pre-trained GNN models to specific downstream tasks without requiring fine-tuning of the entire model. Despite the advantages of GPL, little attention has been given to its vulnerability to backdoor attacks, where an adversary can manipulate the model's behavior by embedding hidden triggers. Existing… ▽ More Graph Prompt Learning (GPL) has been introduced as a promising approach that uses prompts to adapt pre-trained GNN models to specific downstream tasks without requiring fine-tuning of the entire model. Despite the advantages of GPL, little attention has been given to its vulnerability to backdoor attacks, where an adversary can manipulate the model's behavior by embedding hidden triggers. Existing graph backdoor attacks rely on modifying model parameters during training, but this approach is impractical in GPL as GNN encoder parameters are frozen after pre-training. Moreover, downstream users may fine-tune their own task models on clean datasets, further complicating the attack. In this paper, we propose TGPA, a backdoor attack framework designed specifically for GPL. TGPA injects backdoors into graph prompts without modifying pre-trained GNN encoders and ensures high attack success rates and clean accuracy. To address the challenge of model fine-tuning by users, we introduce a finetuning-resistant poisoning approach that maintains the effectiveness of the backdoor even after downstream model adjustments. Extensive experiments on multiple datasets under various settings demonstrate the effectiveness of TGPA in compromising GPL models with fixed GNN encoders. △ Less

Submitted 17 October, 2024; originally announced October 2024.

arXiv:2410.13917 [pdf, other]

GBCT: An Efficient and Adaptive Granular-Ball Clustering Algorithm for Complex Data

Authors: Shuyin Xia, Bolun Shi, Yifan Wang, Jiang Xie, Guoyin Wang, Xinbo Gao

Abstract: Traditional clustering algorithms often focus on the most fine-grained information and achieve clustering by calculating the distance between each pair of data points or implementing other calculations based on points. This way is not inconsistent with the cognitive mechanism of "global precedence" in human brain, resulting in those methods' bad performance in efficiency, generalization ability an… ▽ More Traditional clustering algorithms often focus on the most fine-grained information and achieve clustering by calculating the distance between each pair of data points or implementing other calculations based on points. This way is not inconsistent with the cognitive mechanism of "global precedence" in human brain, resulting in those methods' bad performance in efficiency, generalization ability and robustness. To address this problem, we propose a new clustering algorithm called granular-ball clustering (GBCT) via granular-ball computing. Firstly, GBCT generates a smaller number of granular-balls to represent the original data, and forms clusters according to the relationship between granular-balls, instead of the traditional point relationship. At the same time, its coarse-grained characteristics are not susceptible to noise, and the algorithm is efficient and robust; besides, as granular-balls can fit various complex data, GBCT performs much better in non-spherical data sets than other traditional clustering methods. The completely new coarse granularity representation method of GBCT and cluster formation mode can also used to improve other traditional methods. △ Less

Submitted 17 October, 2024; originally announced October 2024.

arXiv:2410.13853 [pdf, other]

AutoAL: Automated Active Learning with Differentiable Query Strategy Search

Authors: Yifeng Wang, Xueying Zhan, Siyu Huang

Abstract: As deep learning continues to evolve, the need for data efficiency becomes increasingly important. Considering labeling large datasets is both time-consuming and expensive, active learning (AL) provides a promising solution to this challenge by iteratively selecting the most informative subsets of examples to train deep neural networks, thereby reducing the labeling cost. However, the effectivenes… ▽ More As deep learning continues to evolve, the need for data efficiency becomes increasingly important. Considering labeling large datasets is both time-consuming and expensive, active learning (AL) provides a promising solution to this challenge by iteratively selecting the most informative subsets of examples to train deep neural networks, thereby reducing the labeling cost. However, the effectiveness of different AL algorithms can vary significantly across data scenarios, and determining which AL algorithm best fits a given task remains a challenging problem. This work presents the first differentiable AL strategy search method, named AutoAL, which is designed on top of existing AL sampling strategies. AutoAL consists of two neural nets, named SearchNet and FitNet, which are optimized concurrently under a differentiable bi-level optimization framework. For any given task, SearchNet and FitNet are iteratively co-optimized using the labeled data, learning how well a set of candidate AL algorithms perform on that task. With the optimal AL strategies identified, SearchNet selects a small subset from the unlabeled pool for querying their annotations, enabling efficient training of the task model. Experimental results demonstrate that AutoAL consistently achieves superior accuracy compared to all candidate AL algorithms and other selective AL approaches, showcasing its potential for adapting and integrating multiple existing AL methods across diverse tasks and domains. Code will be available at: https://github.com/haizailache999/AutoAL. △ Less

Submitted 17 October, 2024; originally announced October 2024.

arXiv:2410.13757 [pdf, other]

MobA: A Two-Level Agent System for Efficient Mobile Task Automation

Authors: Zichen Zhu, Hao Tang, Yansi Li, Kunyao Lan, Yixuan Jiang, Hao Zhou, Yixiao Wang, Situo Zhang, Liangtai Sun, Lu Chen, Kai Yu

Abstract: Current mobile assistants are limited by dependence on system APIs or struggle with complex user instructions and diverse interfaces due to restricted comprehension and decision-making abilities. To address these challenges, we propose MobA, a novel Mobile phone Agent powered by multimodal large language models that enhances comprehension and planning capabilities through a sophisticated two-level… ▽ More Current mobile assistants are limited by dependence on system APIs or struggle with complex user instructions and diverse interfaces due to restricted comprehension and decision-making abilities. To address these challenges, we propose MobA, a novel Mobile phone Agent powered by multimodal large language models that enhances comprehension and planning capabilities through a sophisticated two-level agent architecture. The high-level Global Agent (GA) is responsible for understanding user commands, tracking history memories, and planning tasks. The low-level Local Agent (LA) predicts detailed actions in the form of function calls, guided by sub-tasks and memory from the GA. Integrating a Reflection Module allows for efficient task completion and enables the system to handle previously unseen complex tasks. MobA demonstrates significant improvements in task execution efficiency and completion rate in real-life evaluations, underscoring the potential of MLLM-empowered mobile assistants. △ Less

Submitted 17 October, 2024; originally announced October 2024.

Comments: 27 pages, 6 figures, and 5 tables. We will release our source code in a few days

arXiv:2410.13748 [pdf, other]

Test of lepton flavour universality with $B_s^0 \rightarrow φ\ell^+\ell^-$ decays

Authors: LHCb collaboration, R. Aaij, A. S. W. Abdelmotteleb, C. Abellan Beteta, F. Abudinén, T. Ackernley, A. A. Adefisoye, B. Adeva, M. Adinolfi, P. Adlarson, C. Agapopoulou, C. A. Aidala, Z. Ajaltouni, S. Akar, K. Akiba, P. Albicocco, J. Albrecht, F. Alessio, M. Alexander, Z. Aliouche, P. Alvarez Cartelle, R. Amalric, S. Amato, J. L. Amey, Y. Amhis , et al. (1124 additional authors not shown)

Abstract: Lepton flavour universality in rare $b\rightarrow s$ transitions is tested for the first time using $B_s^0$ meson decays. The measurements are performed using $pp$ collision data collected by the LHCb experiment between 2011 and 2018, corresponding to a total integrated luminosity of 9$\,{\rm fb}^{-1}$. Branching fraction ratios between the $B_s^0 \rightarrow φe^+e^-$ and… ▽ More Lepton flavour universality in rare $b\rightarrow s$ transitions is tested for the first time using $B_s^0$ meson decays. The measurements are performed using $pp$ collision data collected by the LHCb experiment between 2011 and 2018, corresponding to a total integrated luminosity of 9$\,{\rm fb}^{-1}$. Branching fraction ratios between the $B_s^0 \rightarrow φe^+e^-$ and $B_s^0 \rightarrow φμ^+μ^-$ decays are measured in three regions of dilepton mass squared, $q^2$, with $0.1 < q^2 < 1.1$, $1.1 < q^2 < 6.0$, and $15 < q^2 < 19\,{\rm GeV}^2/c^4$. The results agree with the Standard Model expectation of lepton flavour universality. △ Less

Submitted 17 October, 2024; originally announced October 2024.

Comments: All figures and tables, along with machine-readable versions and any supplementary material and additional information, are available at https://lbfence.cern.ch/alcm/public/analysis/full-details/3513/ (LHCb public pages)

Report number: LHCb-PAPER-2024-032, CERN-EP-2024-255

Showing 1–50 of 21,815 results for author: Wang, Y