320 results for au:Yuan_S in:cs

Robust Loop Closure by Textual Cues in Challenging Environments
Tongxing Jin, Thien-Minh Nguyen, Xinhang Xu, Yizhuo Yang, Shenghai Yuan, Jianping Li, Lihua Xie
Oct 22 2024 cs.RO cs.SY eess.SY arXiv:2410.15869v1

@misc{2410.15869, author = {Tongxing Jin and Thien-Minh Nguyen and Xinhang Xu and Yizhuo Yang and Shenghai Yuan and Jianping Li and Lihua Xie}, title = {{R}obust {L}oop {C}losure by {T}extual {C}ues in {C}hallenging {E}nvironments}, year = {2024}, eprint = {2410.15869}, note = {arXiv:2410.15869v1} }
Loop closure is an important task in robot navigation. However, existing methods mostly rely on some implicit or heuristic features of the environment, which can still fail to work in common environments such as corridors, tunnels, and warehouses. Indeed, navigating in such featureless, degenerative, and repetitive (FDR) environments would also pose a significant challenge even for humans, but explicit text cues in the surroundings often provide the best assistance. This inspires us to propose a multi-modal loop closure method based on explicit human-readable textual cues in FDR environments. Specifically, our approach first extracts scene text entities based on Optical Character Recognition (OCR), then creates a local map of text cues based on accurate LiDAR odometry and finally identifies loop closure events by a graph-theoretic scheme. Experimental results demonstrate that this approach has superior performance over existing methods that rely solely on visual and LiDAR sensors. To benefit the community, we release the source code and datasets at https://github.com/TongxingJin/TXTLCD.
Graph Optimality-Aware Stochastic LiDAR Bundle Adjustment with Progressive Spatial Smoothing
Jianping Li, Thien-Minh Nguyen, Muqing Cao, Shenghai Yuan, Tzu-Yi Hung, Lihua Xie
Oct 21 2024 cs.RO arXiv:2410.14565v1

@misc{2410.14565, author = {Jianping Li and Thien-Minh Nguyen and Muqing Cao and Shenghai Yuan and Tzu-Yi Hung and Lihua Xie}, title = {{G}raph {O}ptimality-{A}ware {S}tochastic {L}i{DAR} {B}undle {A}djustment with {P}rogressive {S}patial {S}moothing}, year = {2024}, eprint = {2410.14565}, note = {arXiv:2410.14565v1} }
Large-scale LiDAR Bundle Adjustment (LBA) for refining sensor orientation and point cloud accuracy simultaneously is a fundamental task in photogrammetry and robotics, particularly as low-cost 3D sensors are increasingly used for 3D mapping in complex scenes. Unlike pose-graph-based methods that rely solely on pairwise relationships between LiDAR frames, LBA leverages raw LiDAR correspondences to achieve more precise results, especially when initial pose estimates are unreliable for low-cost sensors. However, existing LBA methods face challenges such as simplistic planar correspondences, extensive observations, and dense normal matrices in the least-squares problem, which limit robustness, efficiency, and scalability. To address these issues, we propose a Graph Optimality-aware Stochastic Optimization scheme with Progressive Spatial Smoothing, namely PSS-GOSO, to achieve robust, efficient, and scalable LBA. The Progressive Spatial Smoothing (PSS) module extracts robust LiDAR feature association exploiting the prior structure information obtained by the polynomial smooth kernel. The Graph Optimality-aware Stochastic Optimization (GOSO) module first sparsifies the graph according to optimality for an efficient optimization. GOSO then utilizes stochastic clustering and graph marginalization to solve the large-scale state estimation problem for a scalable LBA. We validate PSS-GOSO across diverse scenes captured by various platforms, demonstrating its superior performance compared to existing methods.
Revealing the Barriers of Language Agents in Planning
Jian Xie, Kexun Zhang, Jiangjie Chen, Siyu Yuan, Kai Zhang, Yikai Zhang, Lei Li, Yanghua Xiao
Oct 17 2024 cs.AI cs.CL arXiv:2410.12409v1

@misc{2410.12409, author = {Jian Xie and Kexun Zhang and Jiangjie Chen and Siyu Yuan and Kai Zhang and Yikai Zhang and Lei Li and Yanghua Xiao}, title = {{R}evealing the {B}arriers of {L}anguage {A}gents in {P}lanning}, year = {2024}, eprint = {2410.12409}, note = {arXiv:2410.12409v1} }
Autonomous planning has been an ongoing pursuit since the inception of artificial intelligence. Based on curated problem solvers, early planning agents could deliver precise solutions for specific tasks but lacked generalization. The emergence of large language models (LLMs) and their powerful reasoning capabilities has reignited interest in autonomous planning by automatically generating reasonable solutions for given tasks. However, prior research and our experiments show that current language agents still lack human-level planning abilities. Even the state-of-the-art reasoning model, OpenAI o1, achieves only 15.6% on one of the complex real-world planning benchmarks. This highlights a critical question: What hinders language agents from achieving human-level planning? Although existing studies have highlighted weak performance in agent planning, the deeper underlying issues and the mechanisms and limitations of the strategies proposed to address them remain insufficiently understood. In this work, we apply the feature attribution study and identify two key factors that hinder agent planning: the limited role of constraints and the diminishing influence of questions. We also find that although current strategies help mitigate these challenges, they do not fully resolve them, indicating that agents still have a long way to go before reaching human-level intelligence.
Rician Denoising Diffusion Probabilistic Models For Sodium Breast MRI Enhancement
Shuaiyu Yuan, Tristan Whitmarsh, Dimitri A Kessler, Otso Arponen, Mary A McLean, Gabrielle Baxter, Frank Riemer, Aneurin J Kennerley, William J Brackenbury, Fiona J Gilbert, Joshua D Kaggie
Oct 16 2024 eess.IV cs.CV arXiv:2410.11511v1

@misc{2410.11511, author = {Shuaiyu Yuan and Tristan Whitmarsh and Dimitri A Kessler and Otso Arponen and Mary A McLean and Gabrielle Baxter and Frank Riemer and Aneurin J Kennerley and William J Brackenbury and Fiona J Gilbert and Joshua D Kaggie}, title = {{R}ician {D}enoising {D}iffusion {P}robabilistic {M}odels {F}or {S}odium {B}reast {MRI} {E}nhancement}, year = {2024}, eprint = {2410.11511}, note = {arXiv:2410.11511v1} }
Sodium MRI is an imaging technique used to visualize and quantify sodium concentrations in vivo, playing a role in many biological processes and potentially aiding in breast cancer characterization. Sodium MRI, however, suffers from inherently low signal-to-noise ratios (SNR) and spatial resolution, compared with conventional proton MRI. A deep-learning method, the Denoising Diffusion Probabilistic Models (DDPM), has demonstrated success across a wide range of denoising tasks, yet struggles with sodium MRI's unique noise profile, as DDPM primarily targets Gaussian noise. DDPM can distort features when applied to sodium MRI. This paper advances the DDPM by introducing the Rician Denoising Diffusion Probabilistic Models (RDDPM) for sodium MRI denoising. RDDPM converts Rician noise to Gaussian noise at each timestep during the denoising process. The model's performance is evaluated using three non-reference image quality assessment metrics, where RDDPM consistently outperforms DDPM and other CNN-based denoising methods.
Towards Better Multi-head Attention via Channel-wise Sample Permutation
Shen Yuan, Hongteng Xu
Oct 16 2024 cs.LG cs.CL cs.CV arXiv:2410.10914v1

@misc{2410.10914, author = {Shen Yuan and Hongteng Xu}, title = {{T}owards {B}etter {M}ulti-head {A}ttention via {C}hannel-wise {S}ample {P}ermutation}, year = {2024}, eprint = {2410.10914}, note = {arXiv:2410.10914v1} }
Transformer plays a central role in many fundamental deep learning models, e.g., the ViT in computer vision and the BERT and GPT in natural language processing, whose effectiveness is mainly attributed to its multi-head attention (MHA) mechanism. In this study, we propose a simple and novel channel-wise sample permutation (CSP) operator, achieving a new structured MHA with fewer parameters and lower complexity. Given an input matrix, CSP circularly shifts the samples of different channels with various steps and then sorts grouped samples of each channel. This operator is equivalent to implicitly implementing cross-channel attention maps as permutation matrices, which achieves linear complexity and suppresses the risk of rank collapse when representing data. We replace the MHA of some representative models with CSP and test the CSP-based models in several discriminative tasks, including image classification and long sequence analysis. Experiments show that the CSP-based models achieve comparable or better performance with fewer parameters and lower computational costs than the classic Transformer and its state-of-the-art variants. The code is available at https://github.com/DaShenZi721/CSP.
Hybrid Spatial Representations for Species Distribution Modeling
Shiran Yuan, Hao Zhao
Oct 16 2024 cs.LG cs.CV arXiv:2410.10937v1

@misc{2410.10937, author = {Shiran Yuan and Hao Zhao}, title = {{H}ybrid {S}patial {R}epresentations for {S}pecies {D}istribution {M}odeling}, year = {2024}, eprint = {2410.10937}, note = {arXiv:2410.10937v1} }
We address an important problem in ecology called Species Distribution Modeling (SDM), whose goal is to predict whether a species exists at a certain position on Earth. In particular, we tackle a challenging version of this task, where we learn from presence-only data in a community-sourced dataset, model a large number of species simultaneously, and do not use any additional environmental information. Previous work has used neural implicit representations to construct models that achieve promising results. However, implicit representations often generate predictions of limited spatial precision. We attribute this limitation to their inherently global formulation and inability to effectively capture local feature variations. This issue is especially pronounced with presence-only data and a large number of species. To address this, we propose a hybrid embedding scheme that combines both implicit and explicit embeddings. Specifically, the explicit embedding is implemented with a multiresolution hashgrid, enabling our models to better capture local information. Experiments demonstrate that our results exceed other works by a large margin on various standard benchmarks, and that the hybrid representation is better than both purely implicit and explicit ones. Qualitative visualizations and comprehensive ablation studies reveal that our hybrid representation successfully addresses the two main challenges. Our code is open-sourced at https://github.com/Shiran-Yuan/HSR-SDM.
WAPITI: A Watermark for Finetuned Open-Source LLMs
Lingjie Chen, Ruizhong Qiu, Siyu Yuan, Zhining Liu, Tianxin Wei, Hyunsik Yoo, Zhichen Zeng, Deqing Yang, Hanghang Tong
Oct 10 2024 cs.CR arXiv:2410.06467v1

@misc{2410.06467, author = {Lingjie Chen and Ruizhong Qiu and Siyu Yuan and Zhining Liu and Tianxin Wei and Hyunsik Yoo and Zhichen Zeng and Deqing Yang and Hanghang Tong}, title = {{WAPITI}: {A} {W}atermark for {F}inetuned {O}pen-{S}ource {LLM}s}, year = {2024}, eprint = {2410.06467}, note = {arXiv:2410.06467v1} }
Watermarking of large language model (LLM) generation embeds an imperceptible statistical pattern within texts, making it algorithmically detectable. Watermarking is a promising method for addressing potential harm and biases from LLMs, as it enables traceability, accountability, and detection of manipulated content, helping to mitigate unintended consequences. However, for open-source models, watermarking faces two major challenges: (i) incompatibility with fine-tuned models, and (ii) vulnerability to fine-tuning attacks. In this work, we propose WAPITI, a new method that transfers watermarking from base models to fine-tuned models through parameter integration. To the best of our knowledge, we propose the first watermark for fine-tuned open-source LLMs that preserves their fine-tuned capabilities. Furthermore, our approach offers an effective defense against fine-tuning attacks. We test our method on various model architectures and watermarking strategies. Results demonstrate that our method can successfully inject watermarks and is highly compatible with fine-tuned models. Additionally, we offer an in-depth analysis of how parameter editing influences the watermark strength and overall capabilities of the resulting models.
Training Over a Distribution of Hyperparameters for Enhanced Performance and Adaptability on Imbalanced Classification
Kelsey Lieberman, Swarna Kamlam Ravindran, Shuai Yuan, Carlo Tomasi
Oct 07 2024 cs.LG arXiv:2410.03588v1

@misc{2410.03588, author = {Kelsey Lieberman and Swarna Kamlam Ravindran and Shuai Yuan and Carlo Tomasi}, title = {{T}raining {O}ver a {D}istribution of {H}yperparameters for {E}nhanced {P}erformance and {A}daptability on {I}mbalanced {C}lassification}, year = {2024}, eprint = {2410.03588}, note = {arXiv:2410.03588v1} }
Although binary classification is a well-studied problem, training reliable classifiers under severe class imbalance remains a challenge. Recent techniques mitigate the ill effects of imbalance on training by modifying the loss functions or optimization methods. We observe that different hyperparameter values on these loss functions perform better at different recall values. We propose to exploit this fact by training one model over a distribution of hyperparameter values--instead of a single value--via Loss Conditional Training (LCT). Experiments show that training over a distribution of hyperparameters not only approximates the performance of several models but actually improves the overall performance of models on both CIFAR and real medical imaging applications, such as melanoma and diabetic retinopathy detection. Furthermore, training models with LCT is more efficient because some hyperparameter tuning can be conducted after training to meet individual needs without needing to retrain from scratch.
SGBA: Semantic Gaussian Mixture Model-Based LiDAR Bundle Adjustment
Xingyu Ji, Shenghai Yuan, Jianping Li, Pengyu Yin, Haozhi Cao, Lihua Xie
Oct 03 2024 cs.CV cs.RO arXiv:2410.01618v1

@misc{2410.01618, author = {Xingyu Ji and Shenghai Yuan and Jianping Li and Pengyu Yin and Haozhi Cao and Lihua Xie}, title = {{SGBA}: {S}emantic {G}aussian {M}ixture {M}odel-{B}ased {L}i{DAR} {B}undle {A}djustment}, year = {2024}, eprint = {2410.01618}, note = {arXiv:2410.01618v1} }
LiDAR bundle adjustment (BA) is an effective approach to reduce the drifts in pose estimation from the front-end. Existing works on LiDAR BA usually rely on predefined geometric features for landmark representation. This reliance restricts generalizability, as the system will inevitably deteriorate in environments where these specific features are absent. To address this issue, we propose SGBA, a LiDAR BA scheme that models the environment as a semantic Gaussian mixture model (GMM) without predefined feature types. This approach encodes both geometric and semantic information, offering a comprehensive and general representation adaptable to various environments. Additionally, to limit computational complexity while ensuring generalizability, we propose an adaptive semantic selection framework that selects the most informative semantic clusters for optimization by evaluating the condition number of the cost function. Lastly, we introduce a probabilistic feature association scheme that considers the entire probability density of assignments, which can manage uncertainties in measurement and initial pose estimation. We have conducted various experiments and the results demonstrate that SGBA can achieve accurate and robust pose refinement even in challenging scenarios with low-quality initial pose estimation and limited geometric features. We plan to open-source the work for the benefit of the community at https://github.com/Ji1Xinyu/SGBA.
GERA: Geometric Embedding for Efficient Point Registration Analysis
Geng Li, Haozhi Cao, Mingyang Liu, Shenghai Yuan, Jianfei Yang
Oct 02 2024 cs.CV cs.AI arXiv:2410.00589v1

@misc{2410.00589, author = {Geng Li and Haozhi Cao and Mingyang Liu and Shenghai Yuan and Jianfei Yang}, title = {{GERA}: {G}eometric {E}mbedding for {E}fficient {P}oint {R}egistration {A}nalysis}, year = {2024}, eprint = {2410.00589}, note = {arXiv:2410.00589v1} }
Point cloud registration aims to provide estimated transformations to align point clouds, which plays a crucial role in pose estimation of various navigation systems, such as surgical guidance systems and autonomous vehicles. Despite the impressive performance of recent models on benchmark datasets, many rely on complex modules like KPConv and Transformers, which impose significant computational and memory demands. These requirements hinder their practical application, particularly in resource-constrained environments such as mobile robotics. In this paper, we propose a novel point cloud registration network that leverages a pure MLP architecture, constructing geometric information offline. This approach eliminates the computational and memory burdens associated with traditional complex feature extractors and significantly reduces inference time and resource consumption. Our method is the first to replace 3D coordinate inputs with offline-constructed geometric encoding, improving generalization and stability, as demonstrated by Maximum Mean Discrepancy (MMD) comparisons. This efficient and accurate geometric representation marks a significant advancement in point cloud analysis, particularly for applications requiring speed and reliability.
OnePath: Efficient and Privacy-Preserving Decision Tree Inference in the Cloud
Shuai Yuan, Hongwei Li, Xinyuan Qian, Wenbo Jiang, Guowen Xu
Oct 01 2024 cs.CR arXiv:2409.19334v1

@misc{2409.19334, author = {Shuai Yuan and Hongwei Li and Xinyuan Qian and Wenbo Jiang and Guowen Xu}, title = {{O}ne{P}ath: {E}fficient and {P}rivacy-{P}reserving {D}ecision {T}ree {I}nference in the {C}loud}, year = {2024}, eprint = {2409.19334}, note = {arXiv:2409.19334v1} }
The expansive storage capacity and robust computational power of cloud servers have led to the widespread outsourcing of machine learning inference services to the cloud. While this practice offers significant operational benefits, it also poses substantial privacy risks, including the exposure of proprietary models and sensitive user data. In this paper, we introduce OnePath, a framework designed for secure and efficient decision tree inference in cloud environments. Unlike existing schemes that require traversing all internal nodes of a decision tree, our protocol securely identifies and processes only the nodes on the prediction path, maintaining data privacy under ciphertext throughout the inference process. This selective traversal enhances both security and efficiency. To further optimize privacy and performance, OnePath employs lightweight cryptographic techniques, such as functional encryption, during the online phase of secure inference. Notably, our protocol allows both providers and clients to perform secure inference without the need to remain online continuously, a critical advantage for real-world applications. We substantiate the security of our framework with formal proofs, demonstrating that OnePath robustly protects the privacy of decision tree classifiers and user data. Experimental results highlight the efficiency of our approach, with our scheme processing query data in mere microseconds on the tested dataset. Through OnePath, we provide a practical solution that balances the needs for security and efficiency in cloud-based decision tree inference, making it a promising option for a variety of applications.
Learning Multimodal Latent Generative Models with Energy-Based Prior
Shiyu Yuan, Jiali Cui, Hanao Li, Tian Han
Oct 01 2024 cs.LG cs.CV arXiv:2409.19862v1

@misc{2409.19862, author = {Shiyu Yuan and Jiali Cui and Hanao Li and Tian Han}, title = {{L}earning {M}ultimodal {L}atent {G}enerative {M}odels with {E}nergy-{B}ased {P}rior}, year = {2024}, eprint = {2409.19862}, note = {arXiv:2409.19862v1} }
Multimodal generative models have recently gained significant attention for their ability to learn representations across various modalities, enhancing joint and cross-generation coherence. However, most existing works use standard Gaussian or Laplacian distributions as priors, which may struggle to capture the diverse information inherent in multiple data types due to their unimodal and less informative nature. Energy-based models (EBMs), known for their expressiveness and flexibility across various tasks, have yet to be thoroughly explored in the context of multimodal generative models. In this paper, we propose a novel framework that integrates the multimodal latent generative model with the EBM. Both models can be trained jointly through a variational scheme. This approach results in a more expressive and informative prior that better captures information across multiple modalities. Our experiments validate the proposed model, demonstrating its superior generation coherence.
Riemannian conjugate Sobolev gradients and their application to compute ground states of BECs
Yueshan Ai, Patrick Henning, Mahima Yadav, Sitong Yuan
Sep 27 2024 math.NA cs.NA arXiv:2409.17302v1

@misc{2409.17302, author = {Yueshan Ai and Patrick Henning and Mahima Yadav and Sitong Yuan}, title = {{R}iemannian conjugate {S}obolev gradients and their application to compute ground states of {BEC}s}, year = {2024}, eprint = {2409.17302}, note = {arXiv:2409.17302v1} }
This work considers the numerical computation of ground states of rotating Bose-Einstein condensates (BECs) which can exhibit a multiscale lattice of quantized vortices. This problem involves the minimization of an energy functional on a Riemannian manifold. For this we apply the framework of nonlinear conjugate gradient methods in combination with the paradigm of Sobolev gradients to investigate different metrics. Here we build on previous work that proposed to enhance the convergence of regular Riemannian gradient methods by an adaptively changing metric that is based on the current energy. In this work, we extend this approach to the branch of Riemannian conjugate gradient (CG) methods and investigate the arising schemes numerically. Special attention is given to the selection of the momentum parameter in the search direction and how this affects the performance of the resulting schemes. As known from similar applications, we find that the choice of the momentum parameter plays a critical role, with certain parameters reducing the number of iterations required to achieve a specified tolerance by a significant factor. Besides the influence of the momentum parameters, we also investigate how the methods with adaptive metric compare to the corresponding realizations with a standard $H^1_0$-metric. As one of our main findings, the results of the numerical experiments show that the Riemannian CG method with the proposed adaptive metric along with a Polak-Ribière or Hestenes-Stiefel-type momentum parameter shows the best performance and highest robustness compared to the other CG methods that were part of our numerical study.
MultiTalk: Introspective and Extrospective Dialogue for Human-Environment-LLM Alignment
Venkata Naren Devarakonda, Ali Umut Kaypak, Shuaihang Yuan, Prashanth Krishnamurthy, Yi Fang, Farshad Khorrami
Sep 26 2024 cs.RO arXiv:2409.16455v1

@misc{2409.16455, author = {Venkata Naren Devarakonda and Ali Umut Kaypak and Shuaihang Yuan and Prashanth Krishnamurthy and Yi Fang and Farshad Khorrami}, title = {{M}ulti{T}alk: {I}ntrospective and {E}xtrospective {D}ialogue for {H}uman-{E}nvironment-{LLM} {A}lignment}, year = {2024}, eprint = {2409.16455}, note = {arXiv:2409.16455v1} }
LLMs have shown promising results in task planning due to their strong natural language understanding and reasoning capabilities. However, issues such as hallucinations, ambiguities in human instructions, environmental constraints, and limitations in the executing agent's capabilities often lead to flawed or incomplete plans. This paper proposes MultiTalk, an LLM-based task planning methodology that addresses these issues through a framework of introspective and extrospective dialogue loops. This approach helps ground generated plans in the context of the environment and the agent's capabilities, while also resolving uncertainties and ambiguities in the given task. These loops are enabled by specialized systems designed to extract and predict task-specific states, and flag mismatches or misalignments among the human user, the LLM agent, and the environment. Effective feedback pathways between these systems and the LLM planner foster meaningful dialogue. The efficacy of this methodology is demonstrated through its application to robotic manipulation tasks. Experiments and ablations highlight the robustness and reliability of our method, and comparisons with baselines further illustrate the superiority of MultiTalk in task planning for embodied agents.
AIR-Embodied: An Efficient Active 3DGS-based Interaction and Reconstruction Framework with Embodied Large Language Model
Zhenghao Qi, Shenghai Yuan, Fen Liu, Haozhi Cao, Tianchen Deng, Jianfei Yang, Lihua Xie
Sep 25 2024 cs.RO arXiv:2409.16019v1

@misc{2409.16019, author = {Zhenghao Qi and Shenghai Yuan and Fen Liu and Haozhi Cao and Tianchen Deng and Jianfei Yang and Lihua Xie}, title = {{AIR}-{E}mbodied: {A}n {E}fficient {A}ctive 3{DGS}-based {I}nteraction and {R}econstruction {F}ramework with {E}mbodied {L}arge {L}anguage {M}odel}, year = {2024}, eprint = {2409.16019}, note = {arXiv:2409.16019v1} }
Recent advancements in 3D reconstruction and neural rendering have enhanced the creation of high-quality digital assets, yet existing methods struggle to generalize across varying object shapes, textures, and occlusions. While Next Best View (NBV) planning and learning-based approaches offer solutions, they are often limited by predefined criteria and fail to manage occlusions with human-like common sense. To address these problems, we present AIR-Embodied, a novel framework that integrates embodied AI agents with large-scale pretrained multi-modal language models to improve active 3DGS reconstruction. AIR-Embodied utilizes a three-stage process: understanding the current reconstruction state via multi-modal prompts, planning tasks with viewpoint selection and interactive actions, and employing closed-loop reasoning to ensure accurate execution. The agent dynamically refines its actions based on discrepancies between the planned and actual outcomes. Experimental evaluations across virtual and real-world environments demonstrate that AIR-Embodied significantly enhances reconstruction efficiency and quality, providing a robust solution to challenges in active 3D reconstruction.
Distance-based Multiple Non-cooperative Ground Target Encirclement for Complex Environments
Fen Liu, Shenghai Yuan, Kun Cao, Wei Meng, Lihua Xie
Sep 25 2024 cs.RO arXiv:2409.15840v1

@misc{2409.15840, author = {Fen Liu and Shenghai Yuan and Kun Cao and Wei Meng and Lihua Xie}, title = {{D}istance-based {M}ultiple {N}on-cooperative {G}round {T}arget {E}ncirclement for {C}omplex {E}nvironments}, year = {2024}, eprint = {2409.15840}, note = {arXiv:2409.15840v1} }
This paper proposes a comprehensive strategy for complex multi-target-multi-drone encirclement in an obstacle-rich and GPS-denied environment, motivated by practical scenarios such as pursuing vehicles or humans in urban canyons. The drones have omnidirectional range sensors that can robustly detect ground targets and obtain noisy relative distances. After each drone task is assigned, a novel distance-based target state estimator (DTSE) is proposed by estimating the measurement output noise variance and utilizing the Kalman filter. By integrating anti-synchronization techniques and pseudo-force functions, an acceleration controller enables two tasking drones to cooperatively encircle a target from opposing positions while navigating obstacles. The algorithm's effectiveness for the discrete-time double-integrator system is established theoretically, particularly regarding observability. Moreover, the versatility of the algorithm is showcased in aerial-to-ground scenarios, supported by compelling simulation results. Experimental validation demonstrates the effectiveness of the proposed approach.
Past Meets Present: Creating Historical Analogy with Large Language Models
Nianqi Li, Siyu Yuan, Jiangjie Chen, Jiaqing Liang, Feng Wei, Zujie Liang, Deqing Yang, Yanghua Xiao
Sep 24 2024 cs.CL cs.AI arXiv:2409.14820v1

@misc{2409.14820, author = {Nianqi Li and Siyu Yuan and Jiangjie Chen and Jiaqing Liang and Feng Wei and Zujie Liang and Deqing Yang and Yanghua Xiao}, title = {{P}ast {M}eets {P}resent: {C}reating {H}istorical {A}nalogy with {L}arge {L}anguage {M}odels}, year = {2024}, eprint = {2409.14820}, note = {arXiv:2409.14820v1} }
Historical analogies, which compare known past events with contemporary but unfamiliar events, are important abilities that help people make decisions and understand the world. However, research in applied history suggests that people have difficulty finding appropriate analogies. And previous studies in the AI community have also overlooked historical analogies. To fill this gap, in this paper, we focus on the historical analogy acquisition task, which aims to acquire analogous historical events for a given event. We explore retrieval and generation methods for acquiring historical analogies based on different large language models (LLMs). Furthermore, we propose a self-reflection method to mitigate hallucinations and stereotypes when LLMs generate historical analogies. Through human evaluations and our specially designed automatic multi-dimensional assessment, we find that LLMs generally have a good potential for historical analogies. And the performance of the models can be further improved by using our self-reflection method.
ITPatch: An Invisible and Triggered Physical Adversarial Patch against Traffic Sign Recognition
Shuai Yuan, Hongwei Li, Xingshuo Han, Guowen Xu, Wenbo Jiang, Tao Ni, Qingchuan Zhao, Yuguang Fang
Sep 20 2024 cs.CV cs.AI arXiv:2409.12394v1

@misc{2409.12394, author = {Shuai Yuan and Hongwei Li and Xingshuo Han and Guowen Xu and Wenbo Jiang and Tao Ni and Qingchuan Zhao and Yuguang Fang}, title = {{ITP}atch: {A}n {I}nvisible and {T}riggered {P}hysical {A}dversarial {P}atch against {T}raffic {S}ign {R}ecognition}, year = {2024}, eprint = {2409.12394}, note = {arXiv:2409.12394v1} }
Physical adversarial patches have emerged as a key adversarial attack to cause misclassification of traffic sign recognition (TSR) systems in the real world. However, existing adversarial patches have poor stealthiness and attack all vehicles indiscriminately once deployed. In this paper, we introduce an invisible and triggered physical adversarial patch (ITPatch) with a novel attack vector, i.e., fluorescent ink, to advance the state-of-the-art. It applies carefully designed fluorescent perturbations to a target sign; an attacker can later trigger a fluorescent effect using invisible ultraviolet light, causing the TSR system to misclassify the sign and potentially resulting in traffic accidents. We conducted a comprehensive evaluation to investigate the effectiveness of ITPatch, which shows a success rate of 98.31% in low-light conditions. Furthermore, our attack successfully bypasses five popular defenses and achieves a success rate of 96.72%.
ULOC: Learning to Localize in Complex Large-Scale Environments with Ultra-Wideband Ranges
Thien-Minh Nguyen, Yizhuo Yang, Tien-Dat Nguyen, Shenghai Yuan, Lihua Xie
Sep 18 2024 cs.RO cs.LG arXiv:2409.11122v1

@misc{2409.11122, author = {Thien-Minh Nguyen and Yizhuo Yang and Tien-Dat Nguyen and Shenghai Yuan and Lihua Xie}, title = {{ULOC}: {L}earning to {L}ocalize in {C}omplex {L}arge-{S}cale {E}nvironments with {U}ltra-{W}ideband {R}anges}, year = {2024}, eprint = {2409.11122}, note = {arXiv:2409.11122v1} }
While UWB-based methods can achieve high localization accuracy in small-scale areas, their accuracy and reliability are significantly challenged in large-scale environments. In this paper, we propose a learning-based framework named ULOC for Ultra-Wideband (UWB) based localization in such complex large-scale environments. First, anchors are deployed in the environment without knowledge of their actual position. Then, UWB observations are collected when the vehicle travels in the environment. At the same time, map-consistent pose estimates are developed from registering (onboard self-localization) data with the prior map to provide the training labels. We then propose a network based on MAMBA that learns the ranging patterns of UWBs over a complex large-scale environment. The experiment demonstrates that our solution can ensure high localization accuracy on a large scale compared to the state-of-the-art. We release our source code to benefit the community at https://github.com/brytsknguyen/uloc.
Adaptive Multi-Modal Control of Digital Human Hand Synthesis Using a Region-Aware Cycle Loss
Qifan Fu, Xiaohang Yang, Muhammad Asad, Changjae Oh, Shanxin Yuan, Gregory Slabaugh
Sep 17 2024 cs.CV arXiv:2409.09149v1

@misc{2409.09149, author = {Qifan Fu and Xiaohang Yang and Muhammad Asad and Changjae Oh and Shanxin Yuan and Gregory Slabaugh}, title = {{A}daptive {M}ulti-{M}odal {C}ontrol of {D}igital {H}uman {H}and {S}ynthesis {U}sing a {R}egion-{A}ware {C}ycle {L}oss}, year = {2024}, eprint = {2409.09149}, note = {arXiv:2409.09149v1} }
Diffusion models have shown their remarkable ability to synthesize images, including the generation of humans in specific poses. However, current models face challenges in adequately expressing conditional control for detailed hand pose generation, leading to significant distortion in the hand regions. To tackle this problem, we first curate the How2Sign dataset to provide richer and more accurate hand pose annotations. In addition, we introduce adaptive, multi-modal fusion to integrate characters' physical features expressed in different modalities such as skeleton, depth, and surface normal. Furthermore, we propose a novel Region-Aware Cycle Loss (RACL) that enables the diffusion model training to focus on improving the hand region, resulting in improved quality of generated hand gestures. More specifically, the proposed RACL computes a weighted keypoint distance between the full-body pose keypoints from the generated image and the ground truth, to generate higher-quality hand poses while balancing overall pose accuracy. Moreover, we use two hand region metrics, named hand-PSNR and hand-Distance for hand pose generation evaluations. Our experimental evaluations demonstrate the effectiveness of our proposed approach in improving the quality of digital human pose generation using diffusion models, especially the quality of the hand region. The source code is available at https://github.com/fuqifan/Region-Aware-Cycle-Loss.
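The weighted keypoint distance mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' RACL implementation: the function name, the use of 2D Euclidean distance, and the idea of giving hand keypoints larger weights than body keypoints are assumptions made only for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def weighted_keypoint_distance(
    pred: Sequence[Point],
    truth: Sequence[Point],
    weights: Sequence[float],
) -> float:
    """Weighted mean Euclidean distance between matching keypoints.

    Hypothetical helper: larger weights (e.g. on hand keypoints) make
    errors in those regions count more toward the final value.
    """
    if not pred or not (len(pred) == len(truth) == len(weights)):
        raise ValueError("inputs must be non-empty and of equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        w * math.dist(p, t) for p, t, w in zip(pred, truth, weights)
    )
    return weighted_sum / total_weight
```

With one body keypoint matched exactly and one hand keypoint off by 5 pixels, weights of 1 and 3 give a value of 3.75, so the hand error dominates the result.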
HelmetPoser: A Helmet-Mounted IMU Dataset for Data-Driven Estimation of Human Head Motion in Diverse Conditions
Jianping Li, Qiutong Leng, Jinxing Liu, Xinhang Xu, Tongxin Jin, Muqing Cao, Thien-Minh Nguyen, Shenghai Yuan, Kun Cao, Lihua Xie
Sep 10 2024 cs.RO arXiv:2409.05006v1

@misc{2409.05006, author = {Jianping Li and Qiutong Leng and Jinxing Liu and Xinhang Xu and Tongxin Jin and Muqing Cao and Thien-Minh Nguyen and Shenghai Yuan and Kun Cao and Lihua Xie}, title = {{H}elmet{P}oser: {A} {H}elmet-{M}ounted {IMU} {D}ataset for {D}ata-{D}riven {E}stimation of {H}uman {H}ead {M}otion in {D}iverse {C}onditions}, year = {2024}, eprint = {2409.05006}, note = {arXiv:2409.05006v1} }
Helmet-mounted wearable positioning systems are crucial for enhancing safety and facilitating coordination in industrial, construction, and emergency rescue environments. These systems, including LiDAR-Inertial Odometry (LIO) and Visual-Inertial Odometry (VIO), often face challenges in localization due to adverse environmental conditions such as dust, smoke, and limited visual features. To address these limitations, we propose a novel head-mounted Inertial Measurement Unit (IMU) dataset with ground truth, aimed at advancing data-driven IMU pose estimation. Our dataset captures human head motion patterns using a helmet-mounted system, with data from ten participants performing various activities. We explore the application of neural networks, specifically Long Short-Term Memory (LSTM) and Transformer networks, to correct IMU biases and improve localization accuracy. Additionally, we evaluate the performance of these methods across different IMU data window dimensions, motion patterns, and sensor types. We release a publicly available dataset, demonstrate the feasibility of advanced neural network approaches for helmet-based localization, and provide evaluation metrics to establish a baseline for future studies in this field. Data and code can be found at https://lqiutong.github.io/HelmetPoser.github.io/.
OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving
Julong Wei, Shanshuai Yuan, Pengfei Li, Qingda Hu, Zhongxue Gan, Wenchao Ding
Sep 06 2024 cs.CV cs.RO arXiv:2409.03272v1

@misc{2409.03272, author = {Julong Wei and Shanshuai Yuan and Pengfei Li and Qingda Hu and Zhongxue Gan and Wenchao Ding}, title = {{O}cc{LL}a{MA}: {A}n {O}ccupancy-{L}anguage-{A}ction {G}enerative {W}orld {M}odel for {A}utonomous {D}riving}, year = {2024}, eprint = {2409.03272}, note = {arXiv:2409.03272v1} }
The rise of multi-modal large language models (MLLMs) has spurred their applications in autonomous driving. Recent MLLM-based methods perform action by learning a direct mapping from perception to action, neglecting the dynamics of the world and the relations between action and world dynamics. In contrast, human beings possess a world model that enables them to simulate future states based on a 3D internal visual representation and plan actions accordingly. To this end, we propose OccLLaMA, an occupancy-language-action generative world model, which uses semantic occupancy as a general visual representation and unifies vision-language-action (VLA) modalities through an autoregressive model. Specifically, we introduce a novel VQVAE-like scene tokenizer to efficiently discretize and reconstruct semantic occupancy scenes, considering their sparsity and class imbalance. Then, we build a unified multi-modal vocabulary for vision, language and action. Furthermore, we enhance an LLM, specifically LLaMA, to perform next token/scene prediction on the unified vocabulary to complete multiple tasks in autonomous driving. Extensive experiments demonstrate that OccLLaMA achieves competitive performance across multiple tasks, including 4D occupancy forecasting, motion planning, and visual question answering, showcasing its potential as a foundation model in autonomous driving.
OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model
Liuhan Chen, Zongjian Li, Bin Lin, Bin Zhu, Qian Wang, Shenghai Yuan, Xing Zhou, Xinhua Cheng, Li Yuan
Sep 04 2024 cs.CV eess.IV arXiv:2409.01199v2

@misc{2409.01199, author = {Liuhan Chen and Zongjian Li and Bin Lin and Bin Zhu and Qian Wang and Shenghai Yuan and Xing Zhou and Xinhua Cheng and Li Yuan}, title = {{OD}-{VAE}: {A}n {O}mni-dimensional {V}ideo {C}ompressor for {I}mproving {L}atent {V}ideo {D}iffusion {M}odel}, year = {2024}, eprint = {2409.01199}, note = {arXiv:2409.01199v2} }
Variational Autoencoder (VAE), compressing videos into latent representations, is a crucial preceding component of Latent Video Diffusion Models (LVDMs). With the same reconstruction quality, the more sufficient the VAE's compression for videos is, the more efficient the LVDMs are. However, most LVDMs utilize 2D image VAE, whose compression for videos is only in the spatial dimension and often ignored in the temporal dimension. How to conduct temporal compression for videos in a VAE to obtain more concise latent representations while promising accurate reconstruction is seldom explored. To fill this gap, we propose an omni-dimension compression VAE, named OD-VAE, which can temporally and spatially compress videos. Although OD-VAE's more sufficient compression brings a great challenge to video reconstruction, it can still achieve high reconstructed accuracy by our fine design. To obtain a better trade-off between video reconstruction quality and compression speed, four variants of OD-VAE are introduced and analyzed. In addition, a novel tail initialization is designed to train OD-VAE more efficiently, and a novel inference strategy is proposed to enable OD-VAE to handle videos of arbitrary length with limited GPU memory. Comprehensive experiments on video reconstruction and LVDM-based video generation demonstrate the effectiveness and efficiency of our proposed methods.
Android Malware Detection Based on RGB Images and Multi-feature Fusion
Zhiqiang Wang, Qiulong Yu, Sicheng Yuan
Aug 30 2024 cs.CR cs.LG arXiv:2408.16555v1

@misc{2408.16555, author = {Zhiqiang Wang and Qiulong Yu and Sicheng Yuan}, title = {{A}ndroid {M}alware {D}etection {B}ased on {RGB} {I}mages and {M}ulti-feature {F}usion}, year = {2024}, eprint = {2408.16555}, note = {arXiv:2408.16555v1} }
With the widespread adoption of smartphones, Android malware has become a significant challenge in the field of mobile device security. Current Android malware detection methods often rely on feature engineering to construct dynamic or static features, which are then used for learning. However, static feature-based methods struggle to counter code obfuscation, packing, and signing techniques, while dynamic feature-based methods involve time-consuming feature extraction. Image-based methods for Android malware detection offer better resilience against malware variants and polymorphic malware. This paper proposes an end-to-end Android malware detection technique based on RGB images and multi-feature fusion. The approach involves extracting Dalvik Executable (DEX) files, AndroidManifest.xml files, and API calls from APK files, converting them into grayscale images, and enhancing their texture features using Canny edge detection, histogram equalization, and adaptive thresholding techniques. These grayscale images are then combined into an RGB image containing multi-feature fusion information, which is analyzed using mainstream image classification models for Android malware detection. Extensive experiments demonstrate that the proposed method effectively captures Android malware characteristics, achieving an accuracy of up to 97.25%, outperforming existing detection methods that rely solely on DEX files as classification features. Additionally, ablation experiments confirm the effectiveness of using the three key files for feature representation in the proposed approach.
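The fusion step described in this abstract can be pictured with a minimal sketch. The abstract does not say how each source is turned into a grayscale image, so the byte-to-pixel layout below (raw bytes written row by row at a fixed width, with the last row zero-padded) is an assumption, and the texture-enhancement steps (Canny edge detection, histogram equalization, adaptive thresholding) are omitted. The function names are hypothetical.

```python
from typing import List, Tuple

Gray = List[List[int]]                      # rows of 0-255 intensities
Rgb = List[List[Tuple[int, int, int]]]      # rows of (R, G, B) pixels


def bytes_to_gray(data: bytes, width: int) -> Gray:
    """Lay out raw bytes row by row as a grayscale image (assumed scheme)."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Zero-pad so the final row is complete.
    padded = data + bytes((-len(data)) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def fuse_rgb(red: Gray, green: Gray, blue: Gray) -> Rgb:
    """Stack three equally sized grayscale images as the R, G, B channels."""
    shapes = {(len(img), tuple(len(row) for row in img)) for img in (red, green, blue)}
    if len(shapes) != 1:
        raise ValueError("all three images must have the same shape")
    return [
        list(zip(r_row, g_row, b_row))
        for r_row, g_row, b_row in zip(red, green, blue)
    ]
```

In this sketch one channel would come from the DEX bytes, one from AndroidManifest.xml, and one from the API-call data; a real pipeline would also need to resize the three images to a common shape before fusing them.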
A Minibatch-SGD-Based Learning Meta-Policy for Inventory Systems with Myopic Optimal Policy
Jiameng Lyu, Jinxing Xie, Shilin Yuan, Yuan Zhou
Aug 30 2024 math.OC cs.LG arXiv:2408.16181v1

@misc{2408.16181, author = {Jiameng Lyu and Jinxing Xie and Shilin Yuan and Yuan Zhou}, title = {{A} {M}inibatch-{SGD}-{B}ased {L}earning {M}eta-{P}olicy for {I}nventory {S}ystems with {M}yopic {O}ptimal {P}olicy}, year = {2024}, eprint = {2408.16181}, note = {arXiv:2408.16181v1} }
Stochastic gradient descent (SGD) has proven effective in solving many inventory control problems with demand learning. However, it often faces the pitfall of an infeasible target inventory level that is lower than the current inventory level. Several recent works (e.g., Huh and Rusmevichientong (2009), Shi et al. (2016)) successfully resolve this issue in various inventory systems. However, their techniques are rather sophisticated and difficult to apply to more complicated scenarios such as multi-product and multi-constraint inventory systems. In this paper, we address the infeasible-target-inventory-level issue from a new technical perspective -- we propose a novel minibatch-SGD-based meta-policy. Our meta-policy is flexible enough to be applied to a general inventory systems framework covering a wide range of inventory management problems with myopic clairvoyant optimal policy. By devising the optimal minibatch scheme, our meta-policy achieves a regret bound of $\mathcal{O}(\sqrt{T})$ for the general convex case and $\mathcal{O}(\log T)$ for the strongly convex case. To demonstrate the power and flexibility of our meta-policy, we apply it to three important inventory control problems: multi-product and multi-constraint systems, multi-echelon serial systems, and one-warehouse and multi-store systems by carefully designing application-specific subroutines. We also conduct extensive numerical experiments to demonstrate that our meta-policy enjoys competitive regret performance, high computational efficiency, and low variances among a wide range of applications.
A Deep-Learning-Based Label-free No-Reference Image Quality Assessment Metric: Application in Sodium MRI Denoising
Shuaiyu Yuan, Tristan Whitmarsh, Dimitri A Kessler, Otso Arponen, Mary A McLean, Gabrielle Baxter, Frank Riemer, Aneurin J Kennerley, William J Brackenbury, Fiona J Gilbert, Joshua D Kaggie
Aug 30 2024 eess.IV cs.CV arXiv:2408.16481v2

@misc{2408.16481, author = {Shuaiyu Yuan and Tristan Whitmarsh and Dimitri A Kessler and Otso Arponen and Mary A McLean and Gabrielle Baxter and Frank Riemer and Aneurin J Kennerley and William J Brackenbury and Fiona J Gilbert and Joshua D Kaggie}, title = {{A} {D}eep-{L}earning-{B}ased {L}abel-free {N}o-{R}eference {I}mage {Q}uality {A}ssessment {M}etric: {A}pplication in {S}odium {MRI} {D}enoising}, year = {2024}, eprint = {2408.16481}, note = {arXiv:2408.16481v2} }
New multinuclear MRI techniques, such as sodium MRI, generally suffer from low image quality due to an inherently low signal. Postprocessing methods, such as image denoising, have been developed for image enhancement. However, the assessment of these enhanced images is challenging, especially when there is a lack of high-resolution and high-signal images as reference, such as in sodium MRI. No-reference Image Quality Assessment (NR-IQA) metrics are approaches to solve this problem. Existing learning-based NR-IQA metrics rely on labels derived from subjective human opinions or metrics like Signal-to-Noise Ratio (SNR), which are either time-consuming or lack accurate ground truths, resulting in unreliable assessment. We note that deep learning (DL) models have a unique characteristic in that they are specialized to a characteristic training set, meaning that deviations of the input testing data from the training data will reduce prediction accuracy. Therefore, we propose a novel DL-based NR-IQA metric, the Model Specialization Metric (MSM), which does not depend on ground-truth images or labels. MSM measures the difference between the input image and the model's prediction for evaluating the quality of the input image. Experiments conducted on both simulated distorted proton T1-weighted MR images and denoised sodium MR images demonstrate that MSM exhibits a superior evaluation performance on various simulated noises and distortions. MSM also has a substantial agreement with the expert evaluations, achieving an averaged Cohen's Kappa coefficient of 0.6528, outperforming the existing NR-IQA metrics.
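For readers unfamiliar with the agreement statistic quoted in this abstract, the following is a minimal sketch of the standard unweighted two-rater Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$. Whether the paper uses this exact variant, and how it averages across comparisons, is not stated in the abstract; the function name and the handling of the degenerate case are assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)
```

For example, ratings `[1, 1, 0, 0]` and `[1, 0, 0, 0]` agree on three of four items (observed agreement 0.75) against a chance agreement of 0.5, giving a kappa of 0.5.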
DLCRec: A Novel Approach for Managing Diversity in LLM-Based Recommender Systems
Jiaju Chen, Chongming Gao, Shuai Yuan, Shuchang Liu, Qingpeng Cai, Peng Jiang
Aug 23 2024 cs.IR arXiv:2408.12470v1

@misc{2408.12470, author = {Jiaju Chen and Chongming Gao and Shuai Yuan and Shuchang Liu and Qingpeng Cai and Peng Jiang}, title = {{DLCR}ec: {A} {N}ovel {A}pproach for {M}anaging {D}iversity in {LLM}-{B}ased {R}ecommender {S}ystems}, year = {2024}, eprint = {2408.12470}, note = {arXiv:2408.12470v1} }
The integration of Large Language Models (LLMs) into recommender systems has led to substantial performance improvements. However, this often comes at the cost of diminished recommendation diversity, which can negatively impact user satisfaction. To address this issue, controllable recommendation has emerged as a promising approach, allowing users to specify their preferences and receive recommendations that meet their diverse needs. Despite its potential, existing controllable recommender systems frequently rely on simplistic mechanisms, such as a single prompt, to regulate diversity, an approach that falls short of capturing the full complexity of user preferences. In response to these limitations, we propose DLCRec, a novel framework designed to enable fine-grained control over diversity in LLM-based recommendations. Unlike traditional methods, DLCRec adopts a fine-grained task decomposition strategy, breaking down the recommendation process into three sequential sub-tasks: genre prediction, genre filling, and item prediction. These sub-tasks are trained independently and inferred sequentially according to user-defined control numbers, ensuring more precise control over diversity. Furthermore, the scarcity and uneven distribution of diversity-related user behavior data pose significant challenges for fine-tuning. To overcome these obstacles, we introduce two data augmentation techniques that enhance the model's robustness to noisy and out-of-distribution data. These techniques expose the model to a broader range of patterns, improving its adaptability in generating recommendations with varying levels of diversity. Our extensive empirical evaluation demonstrates that DLCRec not only provides precise control over diversity but also outperforms state-of-the-art baselines across multiple recommendation scenarios.
Learning Multimodal Latent Space with EBM Prior and MCMC Inference
Shiyu Yuan, Carlo Lipizzi, Tian Han
Aug 21 2024 cs.LG cs.CV arXiv:2408.10467v1

@misc{2408.10467, author = {Shiyu Yuan and Carlo Lipizzi and Tian Han}, title = {{L}earning {M}ultimodal {L}atent {S}pace with {EBM} {P}rior and {MCMC} {I}nference}, year = {2024}, eprint = {2408.10467}, note = {arXiv:2408.10467v1} }
Multimodal generative models are crucial for various applications. We propose an approach that combines an expressive energy-based model (EBM) prior with Markov Chain Monte Carlo (MCMC) inference in the latent space for multimodal generation. The EBM prior acts as an informative guide, while MCMC inference, specifically through short-run Langevin dynamics, brings the posterior distribution closer to its true form. This method not only provides an expressive prior to better capture the complexity of multimodality but also improves the learning of shared latent variables for more coherent generation across modalities. Our proposed method is supported by empirical experiments, underscoring the effectiveness of our EBM prior with MCMC inference in enhancing cross-modal and joint generative tasks in multimodal contexts.
Beyond Full Label: Single-Point Prompt for Infrared Small Target Label Generation
Shuai Yuan, Hanlin Qin, Renke Kou, Xiang Yan, Zechuan Li, Chenxu Peng, Abd-Krim Seghouane
Aug 16 2024 cs.CV arXiv:2408.08191v4

@misc{2408.08191, author = {Shuai Yuan and Hanlin Qin and Renke Kou and Xiang Yan and Zechuan Li and Chenxu Peng and Abd-Krim Seghouane}, title = {{B}eyond {F}ull {L}abel: {S}ingle-{P}oint {P}rompt for {I}nfrared {S}mall {T}arget {L}abel {G}eneration}, year = {2024}, eprint = {2408.08191}, note = {arXiv:2408.08191v4} }
In this work, we make the first attempt to construct a learning-based single-point annotation paradigm for infrared small target label generation (IRSTLG). Our intuition is that label generation requires just one more point prompt than target detection: IRSTLG can be regarded as an infrared small target detection (IRSTD) task with the target location hint. Based on this insight, we introduce an energy double guided single-point prompt (EDGSP) framework, which adeptly transforms the target detection network into a refined label generation method. Specifically, the proposed EDGSP includes: 1) target energy initialization (TEI) to create a foundational outline for sufficient shape evolution of pseudo label, 2) double prompt embedding (DPE) for rapid localization of interested regions and reinforcement of individual differences to avoid label adhesion, and 3) bounding box-based matching (BBM) to eliminate false alarms. Experimental results show that pseudo labels generated by three baselines equipped with EDGSP achieve 100% object-level probability of detection (Pd) and 0% false-alarm rate (Fa) on SIRST, NUDT-SIRST, and IRSTD-1k datasets, with a pixel-level intersection over union (IoU) improvement of 13.28% over state-of-the-art (SOTA) label generation methods. In the practical application of downstream IRSTD, EDGSP realizes, for the first time, a single-point generated pseudo mask beyond the full label. Even with coarse single-point annotations, it still achieves 99.5% performance of full labeling.
Hierarchical Structured Neural Network for Retrieval
Kaushik Rangadurai, Siyang Yuan, Minhui Huang, Yiqun Liu, Golnaz Ghasemiesfeh, Yunchen Pu, Xinfeng Xie, Xingfeng He, Fangzhou Xu, Andrew Cui, Vidhoon Viswanathan, Yan Dong, Liang Xiong, Lin Yang, Liang Wang, Jiyan Yang, Chonglin Sun
Aug 14 2024 cs.IR cs.AI arXiv:2408.06653v1

@misc{2408.06653, author = {Kaushik Rangadurai and Siyang Yuan and Minhui Huang and Yiqun Liu and Golnaz Ghasemiesfeh and Yunchen Pu and Xinfeng Xie and Xingfeng He and Fangzhou Xu and Andrew Cui and Vidhoon Viswanathan and Yan Dong and Liang Xiong and Lin Yang and Liang Wang and Jiyan Yang and Chonglin Sun}, title = {{H}ierarchical {S}tructured {N}eural {N}etwork for {R}etrieval}, year = {2024}, eprint = {2408.06653}, note = {arXiv:2408.06653v1} }
Embedding Based Retrieval (EBR) is a crucial component of the retrieval stage in (Ads) Recommendation Systems that utilizes Two Tower or Siamese Networks to learn embeddings for both users and items (ads). It then employs an Approximate Nearest Neighbor Search (ANN) to efficiently retrieve the most relevant ads for a specific user. Despite their recent rise in popularity in the industry, such systems have a couple of limitations. Firstly, the Two Tower model architecture uses a single dot product interaction, which, despite its efficiency, fails to capture the data distribution in practice. Secondly, the centroid representation and cluster assignment, which are components of ANN, occur after the training process has been completed. As a result, they do not take into account the optimization criteria used for the retrieval model. In this paper, we present Hierarchical Structured Neural Network (HSNN), a deployed jointly optimized hierarchical clustering and neural network model that can take advantage of sophisticated interactions and model architectures that are more common in the ranking stages while maintaining a sub-linear inference cost. We achieve a 6.5% improvement in offline evaluation and also demonstrate 1.22% online gains through A/B experiments. HSNN has been successfully deployed into the Ads Recommendation system and is currently handling a major portion of the traffic. The paper shares our experience in developing this system, dealing with challenges like freshness, volatility, cold start recommendations, and cluster collapse, and lessons from deploying the model in a large-scale retrieval production system.
AirSLAM: An Efficient and Illumination-Robust Point-Line Visual SLAM System
Kuan Xu, Yuefan Hao, Shenghai Yuan, Chen Wang, Lihua Xie
Aug 08 2024 cs.RO arXiv:2408.03520v2

@misc{2408.03520, author = {Kuan Xu and Yuefan Hao and Shenghai Yuan and Chen Wang and Lihua Xie}, title = {{A}ir{SLAM}: {A}n {E}fficient and {I}llumination-{R}obust {P}oint-{L}ine {V}isual {SLAM} {S}ystem}, year = {2024}, eprint = {2408.03520}, note = {arXiv:2408.03520v2} }
In this paper, we present an efficient visual SLAM system designed to tackle both short-term and long-term illumination challenges. Our system adopts a hybrid approach that combines deep learning techniques for feature detection and matching with traditional backend optimization methods. Specifically, we propose a unified convolutional neural network (CNN) that simultaneously extracts keypoints and structural lines. These features are then associated, matched, triangulated, and optimized in a coupled manner. Additionally, we introduce a lightweight relocalization pipeline that reuses the built map, where keypoints, lines, and a structure graph are used to match the query frame with the map. To enhance the applicability of the proposed system to real-world robots, we deploy and accelerate the feature detection and matching networks using C++ and NVIDIA TensorRT. Extensive experiments conducted on various datasets demonstrate that our system outperforms other state-of-the-art visual SLAM systems in illumination-challenging environments. Efficiency evaluations show that our system can run at a rate of 73Hz on a PC and 40Hz on an embedded platform.
Deep Uncertainty-Based Explore for Index Construction and Retrieval in Recommendation System
Xin Jiang, Kaiqiang Wang, Yinlong Wang, Fengchang Lv, Taiyang Peng, Shuai Yang, Xianteng Wu, Pengye Zhang, Shuo Yuan, Yifan Zeng
Aug 05 2024 cs.IR cs.LG stat.ML arXiv:2408.00799v2

@misc{2408.00799, author = {Xin Jiang and Kaiqiang Wang and Yinlong Wang and Fengchang Lv and Taiyang Peng and Shuai Yang and Xianteng Wu and Pengye Zhang and Shuo Yuan and Yifan Zeng}, title = {{D}eep {U}ncertainty-{B}ased {E}xplore for {I}ndex {C}onstruction and {R}etrieval in {R}ecommendation {S}ystem}, year = {2024}, eprint = {2408.00799}, note = {arXiv:2408.00799v2} }
In recommendation systems, the relevance and novelty of the final results are selected through a cascade system of Matching -> Ranking -> Strategy. The matching model serves as the starting point of the pipeline and determines the upper bound of the subsequent stages. Balancing the relevance and novelty of matching results is a crucial step in the design and optimization of recommendation systems, contributing significantly to improving recommendation quality. However, the typical matching algorithms have not simultaneously addressed the relevance and novelty perfectly. One main reason is that deep matching algorithms exhibit significant uncertainty when estimating items in the long tail (e.g., due to insufficient training samples). The uncertainty not only affects the training of the models but also influences the confidence in the index construction and beam search retrieval process of these models. This paper proposes the UICR (Uncertainty-based explore for Index Construction and Retrieval) algorithm, which introduces the concept of uncertainty modeling in the matching stage and achieves multi-task modeling of model uncertainty and index uncertainty. The final matching results are obtained by combining the relevance score and uncertainty score inferred by the model. Experimental results demonstrate that UICR improves novelty without sacrificing relevance in real-world industrial production environments and on multiple open-source datasets. Remarkably, online A/B test results of display advertising in Shopee demonstrate the effectiveness of the proposed algorithm.
Automated Quantification of Hyperreflective Foci in SD-OCT With Diabetic Retinopathy
Idowu Paul Okuwobi, Zexuan Ji, Wen Fan, Songtao Yuan, Loza Bekalo, Qiang Chen
Aug 01 2024 cs.AI cs.CV arXiv:2407.21272v1

@misc{2407.21272, author = {Idowu Paul Okuwobi and Zexuan Ji and Wen Fan and Songtao Yuan and Loza Bekalo and Qiang Chen}, title = {{A}utomated {Q}uantification of {H}yperreflective {F}oci in {SD}-{OCT} {W}ith {D}iabetic {R}etinopathy}, year = {2024}, eprint = {2407.21272}, howpublished = {IEEE Journal of Biomedical and Health Informatics, Volume: 24, Issue: 4, pp. 1125 - 1136, 2020}, doi = {10.1109/JBHI.2019.2929842}, note = {arXiv:2407.21272v1} }
The presence of hyperreflective foci (HFs) is related to retinal disease progression, and the quantity has proven to be a prognostic factor of visual and anatomical outcome in various retinal diseases. However, the lack of efficient quantitative tools for evaluating HFs has prevented ophthalmologists from assessing the volume of HFs. For this reason, we propose an automated quantification algorithm to segment and quantify HFs in spectral domain optical coherence tomography (SD-OCT). The proposed algorithm consists of two parallel processes, namely region of interest (ROI) generation and HFs estimation. To generate the ROI, we use morphological reconstruction to obtain the reconstructed image and histogram constructed for data distributions and clustering. In parallel, we estimate the HFs by extracting the extremal regions from the connected regions obtained from a component tree. Finally, both the ROI and the HFs estimation process are merged to obtain the segmented HFs. The proposed algorithm was tested on 40 3D SD-OCT volumes from 40 patients diagnosed with non-proliferative diabetic retinopathy (NPDR), proliferative diabetic retinopathy (PDR), and diabetic macular edema (DME). The average dice similarity coefficient (DSC) and correlation coefficient (r) are 69.70%, 0.99 for NPDR, 70.31%, 0.99 for PDR, and 71.30%, 0.99 for DME, respectively. The proposed algorithm can provide ophthalmologists with good HFs quantitative information, such as volume, size, and location of the HFs.
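The Dice similarity coefficient reported in this abstract is the standard overlap measure $\mathrm{DSC} = 2|A \cap B| / (|A| + |B|)$ between a predicted and a reference segmentation. The sketch below computes it for flattened binary masks; the function name and the convention of returning 1.0 when both masks are empty are assumptions, not details taken from the paper.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice similarity coefficient for two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Both masks are empty; treating this as perfect overlap is a
        # convention chosen for this sketch.
        return 1.0
    return 2.0 * intersection / total
```

A 3D SD-OCT volume would be flattened to one sequence of voxel labels before calling the function; a result of 0.697 corresponds to the 69.70% figure style used in the abstract.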
Any Target Can be Offense: Adversarial Example Generation via Generalized Latent Infection
Youheng Sun, Shengming Yuan, Xuanhan Wang, Lianli Gao, Jingkuan Song
Jul 18 2024 cs.CV cs.AI arXiv:2407.12292v1

@misc{2407.12292, author = {Youheng Sun and Shengming Yuan and Xuanhan Wang and Lianli Gao and Jingkuan Song}, title = {{A}ny {T}arget {C}an be {O}ffense: {A}dversarial {E}xample {G}eneration via {G}eneralized {L}atent {I}nfection}, year = {2024}, eprint = {2407.12292}, note = {arXiv:2407.12292v1} }
Targeted adversarial attack, which aims to mislead a model to recognize any image as a target object by imperceptible perturbations, has become a mainstream tool for vulnerability assessment of deep neural networks (DNNs). Since existing targeted attackers only learn to attack known target classes, they cannot generalize well to unknown classes. To tackle this issue, we propose $\bf{G}$eneralized $\bf{A}$dversarial attac$\bf{KER}$ ($\bf{GAKer}$), which is able to construct adversarial examples to any target class. The core idea behind GAKer is to craft a latently infected representation during adversarial example generation. To this end, the extracted latent representations of the target object are first injected into intermediate features of an input image in an adversarial generator. Then, the generator is optimized to ensure visual consistency with the input image while being close to the target object in the feature space. Since the GAKer is class-agnostic yet model-agnostic, it can be regarded as a general tool that not only reveals the vulnerability of more DNNs but also identifies deficiencies of DNNs in a wider range of classes. Extensive experiments have demonstrated the effectiveness of our proposed method in generating adversarial examples for both known and unknown classes. Notably, compared with other generative methods, our method achieves an approximately $14.13\%$ higher attack success rate for unknown classes and an approximately $4.23\%$ higher success rate for known classes. Our code is available in https://github.com/VL-Group/GAKer.
TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot
Kaiqi Zhang, Shuai Yuan, Honghan Zhao
Jul 17 2024 cs.CL cs.AI arXiv:2407.10999v1

@misc{2407.10999, author = {Kaiqi Zhang and Shuai Yuan and Honghan Zhao}, title = {{TALEC}: {T}each {Y}our {LLM} to {E}valuate in {S}pecific {D}omain with {I}n-house {C}riteria by {C}riteria {D}ivision and {Z}ero-shot {P}lus {F}ew-shot}, year = {2024}, eprint = {2407.10999}, note = {arXiv:2407.10999v1} }
PDF
With the rapid development of large language models (LLMs), the evaluation of LLMs becomes increasingly important. Measuring text generation tasks such as summarization and article creation is very difficult. Especially in specific application domains (e.g., to-business or to-customer service), in-house evaluation criteria have to meet not only general standards (correctness, helpfulness, creativity, etc.) but also the specific needs of customers and business security requirements at the same time, making the evaluation more difficult. So far, the evaluation of LLMs in business scenarios has mainly relied on manual evaluation, which is expensive and time-consuming. In this paper, we propose a model-based evaluation method, TALEC, which allows users to flexibly set their own evaluation criteria and uses in-context learning (ICL) to teach the judge model these in-house criteria. In addition, we try combining zero-shot and few-shot to make the judge model focus on more information. We also propose a prompt paradigm and an engineering approach to adjust and iterate the shots, helping the judge model to better understand the complex criteria. We then compare fine-tuning with ICL, finding that fine-tuning can be replaced by ICL. TALEC demonstrates a strong capability to accurately reflect human preferences and achieves a correlation of over 80% with human judgments, outperforming even the inter-human correlation in some tasks. The code is released at https://github.com/zlkqz/auto_eval
MINDECHO: Role-Playing Language Agents for Key Opinion Leaders
Rui Xu, Dakuan Lu, Xiaoyu Tan, Xintao Wang, Siyu Yuan, Jiangjie Chen, Wei Chu, Yinghui Xu
Jul 09 2024 cs.AI arXiv:2407.05305v2

@misc{2407.05305, author = {Rui Xu and Dakuan Lu and Xiaoyu Tan and Xintao Wang and Siyu Yuan and Jiangjie Chen and Wei Chu and Yinghui Xu}, title = {{MINDECHO}: {R}ole-{P}laying {L}anguage {A}gents for {K}ey {O}pinion {L}eaders}, year = {2024}, eprint = {2407.05305}, note = {arXiv:2407.05305v2} }
PDF
Large language models (LLMs) have demonstrated impressive performance in various applications, among which role-playing language agents (RPLAs) have engaged a broad user base. Now, there is a growing demand for RPLAs that represent Key Opinion Leaders (KOLs), i.e., Internet celebrities who shape the trends and opinions in their domains. However, research in this line remains underexplored. In this paper, we hence introduce MINDECHO, a comprehensive framework for the development and evaluation of KOL RPLAs. MINDECHO collects KOL data from Internet video transcripts in various professional fields, and synthesizes their conversations leveraging GPT-4. Then, the conversations and the transcripts are used for individualized model training and inference-time retrieval, respectively. Our evaluation covers both general dimensions (i.e., knowledge and tones) and fan-centric dimensions for KOLs. Extensive experiments validate the effectiveness of MINDECHO in developing and evaluating KOL RPLAs.
Closing the Gaps: Optimality of Sample Average Approximation for Data-Driven Newsvendor Problems
Jiameng Lyu, Shilin Yuan, Bingkun Zhou, Yuan Zhou
Jul 09 2024 cs.LG math.OC arXiv:2407.04900v1

@misc{2407.04900, author = {Jiameng Lyu and Shilin Yuan and Bingkun Zhou and Yuan Zhou}, title = {{C}losing the {G}aps: {O}ptimality of {S}ample {A}verage {A}pproximation for {D}ata-{D}riven {N}ewsvendor {P}roblems}, year = {2024}, eprint = {2407.04900}, note = {arXiv:2407.04900v1} }
PDF
We study the regret performance of Sample Average Approximation (SAA) for data-driven newsvendor problems with general convex inventory costs. In the literature, the optimality of SAA has not been fully established under both $\alpha$-global strong convexity and $(\alpha,\beta)$-local strong convexity ($\alpha$-strongly convex within the $\beta$-neighborhood of the optimal quantity) conditions. This paper closes the gaps between regret upper and lower bounds for both conditions. Under the $(\alpha,\beta)$-local strong convexity condition, we prove the optimal regret bound of $\Theta(\log T/\alpha + 1/(\alpha\beta))$ for SAA. This upper bound result demonstrates that the regret performance of SAA is only influenced by $\alpha$ and not by $\beta$ in the long run, enhancing our understanding about how local properties affect the long-term regret performance of decision-making strategies. Under the $\alpha$-global strong convexity condition, we demonstrate that the worst-case regret of any data-driven method is lower bounded by $\Omega(\log T/\alpha)$, which is the first lower bound result that matches the existing upper bound with respect to both parameter $\alpha$ and time horizon $T$. Along the way, we propose to analyze the SAA regret via a new gradient approximation technique, as well as a new class of smooth inverted-hat-shaped hard problem instances that might be of independent interest for the lower bounds of broader data-driven problems.
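For readers unfamiliar with SAA, the following is a minimal sketch of the idea and not the paper's analysis or code. It assumes the textbook piecewise-linear newsvendor cost, whereas the abstract covers general convex inventory costs. The names `newsvendor_cost`, `saa_cost`, `saa_order_quantity`, `underage`, and `overage` are hypothetical. SAA replaces the unknown demand distribution with the observed samples and picks the order quantity that minimizes the average cost over them.

```python
def newsvendor_cost(order, demand, underage, overage):
    """Piecewise-linear cost (an assumption; the paper allows general convex costs)."""
    shortfall = max(demand - order, 0)   # unmet demand
    leftover = max(order - demand, 0)    # unsold stock
    return underage * shortfall + overage * leftover


def saa_cost(order, samples, underage, overage):
    """Average cost of `order` over the observed demand samples."""
    total = sum(newsvendor_cost(order, d, underage, overage) for d in samples)
    return total / len(samples)


def saa_order_quantity(samples, underage, overage):
    """Return an order quantity minimizing the sample-average cost.

    For this piecewise-linear cost the empirical objective is convex with
    kinks only at the samples, so it suffices to search the sample values.
    """
    candidates = sorted(samples)
    return min(candidates, key=lambda q: saa_cost(q, samples, underage, overage))


# Hypothetical demand history and cost parameters, for illustration only.
demand_history = [10, 20, 30, 40, 50]
best_order = saa_order_quantity(demand_history, underage=3, overage=1)
```

With these illustrative numbers the chosen quantity coincides with the empirical quantile of the samples at level `underage / (underage + overage)`, which is the usual closed form for the linear-cost case.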
I2EKF-LO: A Dual-Iteration Extended Kalman Filter Based LiDAR Odometry
Wenlu Yu, Jie Xu, Chengwei Zhao, Lijun Zhao, Thien-Minh Nguyen, Shenghai Yuan, Mingming Bai, Lihua Xie
Jul 03 2024 cs.RO arXiv:2407.02190v1

@misc{2407.02190, author = {Wenlu Yu and Jie Xu and Chengwei Zhao and Lijun Zhao and Thien-Minh Nguyen and Shenghai Yuan and Mingming Bai and Lihua Xie}, title = {{I}2{EKF}-{LO}: {A} {D}ual-{I}teration {E}xtended {K}alman {F}ilter {B}ased {L}i{DAR} {O}dometry}, year = {2024}, eprint = {2407.02190}, note = {arXiv:2407.02190v1} }
PDF
LiDAR odometry is a pivotal technology in the fields of autonomous driving and autonomous mobile robotics. However, most current works focus on nonlinear optimization methods, and many challenges remain in using the traditional Iterative Extended Kalman Filter (IEKF) framework to tackle the problem: IEKF only iterates over the observation equation, relying on a rough estimate of the initial state, which is insufficient to fully eliminate motion distortion in the input point cloud; the system process noise is difficult to determine during state estimation of complex motions; and motion models vary across different sensor carriers. To address these issues, we propose the Dual-Iteration Extended Kalman Filter (I2EKF) and the LiDAR odometry based on I2EKF (I2EKF-LO). This approach not only iterates over the observation equation but also leverages state updates to iteratively mitigate motion distortion in LiDAR point clouds. Moreover, it dynamically adjusts process noise based on the confidence level of prior predictions during state estimation and establishes motion models for different sensor carriers to achieve accurate and efficient state estimation. Comprehensive experiments demonstrate that I2EKF-LO achieves outstanding levels of accuracy and computational efficiency in the realm of LiDAR odometry. Additionally, to foster community development, our code is open-sourced at https://github.com/YWL0720/I2EKF-LO.
Collaborative Graph Exploration with Reduced Pose-SLAM Uncertainty via Submodular Optimization
Ruofei Bai, Shenghai Yuan, Hongliang Guo, Pengyu Yin, Wei-Yun Yau, Lihua Xie
Jul 02 2024 cs.RO arXiv:2407.01013v1

@misc{2407.01013, author = {Ruofei Bai and Shenghai Yuan and Hongliang Guo and Pengyu Yin and Wei-Yun Yau and Lihua Xie}, title = {{C}ollaborative {G}raph {E}xploration with {R}educed {P}ose-{SLAM} {U}ncertainty via {S}ubmodular {O}ptimization}, year = {2024}, eprint = {2407.01013}, note = {arXiv:2407.01013v1} }
PDF
This paper considers the collaborative graph exploration problem in GPS-denied environments, where a group of robots are required to cover a graph environment while maintaining reliable pose estimations in collaborative simultaneous localization and mapping (SLAM). Considering both objectives presents challenges for multi-robot pathfinding, as it involves the expensive covariance inference for SLAM uncertainty evaluation, especially considering various combinations of robots' paths. To reduce the computational complexity, we propose an efficient two-stage strategy where exploration paths are first generated for quick coverage, and then enhanced by adding informative and distance-efficient loop-closing actions, called loop edges, along the paths for reliable pose estimation. We formulate the latter problem as a non-monotone submodular maximization problem by relating SLAM uncertainty with pose graph topology, which (1) facilitates more efficient evaluation of SLAM uncertainty than covariance inference, and (2) allows the application of approximation algorithms in submodular optimization to provide optimality guarantees. We further introduce the ordering heuristics to improve objective values while preserving the optimality bound. Simulation experiments over randomly generated graph environments verify the efficiency of our methods in finding paths for quick coverage and enhanced pose graph reliability, and benchmark the performance of the approximation algorithms and the greedy-based algorithm in the loop edge selection problem. Our implementations will be open-source at https://github.com/bairuofei/CGE.
Exploring 6G Potential for Industrial Digital Twinning and Swarm Intelligence in Obstacle-Rich Environments
Siyu Yuan, Khurshid Alam, Bin Han, Dennis Krummacker, Hans D. Schotten
Jul 01 2024 cs.RO cs.MA arXiv:2406.19930v2

@misc{2406.19930, author = {Siyu Yuan and Khurshid Alam and Bin Han and Dennis Krummacker and Hans D.~Schotten}, title = {{E}xploring 6{G} {P}otential for {I}ndustrial {D}igital {T}winning and {S}warm {I}ntelligence in {O}bstacle-{R}ich {E}nvironments}, year = {2024}, eprint = {2406.19930}, note = {arXiv:2406.19930v2} }
PDF
With the advent of 6G technology, the demand for efficient and intelligent systems in industrial applications has surged, driving the need for advanced solutions in target localization. Utilizing swarm robots to locate unknown targets involves navigating increasingly complex environments. Digital Twinning (DT) offers a robust solution by creating a virtual replica of the physical world, which enhances the swarm's navigation capabilities. Our framework leverages DT and integrates Swarm Intelligence to store physical map information in the cloud, enabling robots to efficiently locate unknown targets. The simulation results demonstrate that the DT framework, augmented by Swarm Intelligence, significantly improves target location efficiency in obstacle-rich environments compared to traditional methods. This research underscores the potential of combining DT and Swarm Intelligence to advance the field of robotic navigation and target localization in complex industrial settings.
ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, Li Yuan
Jun 27 2024 cs.CV cs.CL arXiv:2406.18522v2

@misc{2406.18522, author = {Shenghai Yuan and Jinfa Huang and Yongqi Xu and Yaoyang Liu and Shaofeng Zhang and Yujun Shi and Ruijie Zhu and Xinhua Cheng and Jiebo Luo and Li Yuan}, title = {{C}hrono{M}agic-{B}ench: {A} {B}enchmark for {M}etamorphic {E}valuation of {T}ext-to-{T}ime-lapse {V}ideo {G}eneration}, year = {2024}, eprint = {2406.18522}, note = {arXiv:2406.18522v2} }
PDF
We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of the T2V models (e.g., Sora and Lumiere) in time-lapse video generation. In contrast to existing benchmarks that focus on visual quality and textual relevance of generated videos, ChronoMagic-Bench focuses on the model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities, in a free-form text query. For these purposes, ChronoMagic-Bench introduces 1,649 prompts and real-world videos as references, categorized into four major types of time-lapse videos: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization comprehensively evaluates the model's capacity to handle diverse and complex transformations. To accurately align human preference with the benchmark, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the videos' metamorphic attributes and temporal coherence. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses the temporal coherence, ensuring the generated videos maintain logical progression and continuity. Based on ChronoMagic-Bench, we conduct comprehensive manual evaluations of ten representative T2V models, revealing their strengths and weaknesses across different categories of prompts, and providing a thorough evaluation framework that addresses current gaps in video generation research. Moreover, we create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions ensuring high physical pertinence and large metamorphic amplitude. Homepage: https://pku-yuangroup.github.io/ChronoMagic-Bench/.
EndoUIC: Promptable Diffusion Transformer for Unified Illumination Correction in Capsule Endoscopy
Long Bai, Tong Chen, Qiaozhi Tan, Wan Jun Nah, Yanheng Li, Zhicheng He, Sishen Yuan, Zhen Chen, Jinlin Wu, Mobarakol Islam, Zhen Li, Hongbin Liu, Hongliang Ren
Jun 21 2024 eess.IV cs.AI cs.CV arXiv:2406.13705v2

@misc{2406.13705, author = {Long Bai and Tong Chen and Qiaozhi Tan and Wan Jun Nah and Yanheng Li and Zhicheng He and Sishen Yuan and Zhen Chen and Jinlin Wu and Mobarakol Islam and Zhen Li and Hongbin Liu and Hongliang Ren}, title = {{E}ndo{UIC}: {P}romptable {D}iffusion {T}ransformer for {U}nified {I}llumination {C}orrection in {C}apsule {E}ndoscopy}, year = {2024}, eprint = {2406.13705}, note = {arXiv:2406.13705v2} }
PDF
Wireless Capsule Endoscopy (WCE) is highly valued for its non-invasive and painless approach, though its effectiveness is compromised by uneven illumination from hardware constraints and complex internal dynamics, leading to overexposed or underexposed images. While researchers have discussed the challenges of low-light enhancement in WCE, the issue of correcting for different exposure levels remains underexplored. To tackle this, we introduce EndoUIC, a WCE unified illumination correction solution using an end-to-end promptable diffusion transformer (DiT) model. In our work, the illumination prompt module guides the model to adapt to different exposure levels and perform targeted image enhancement, in which the Adaptive Prompt Integration (API) and Global Prompt Scanner (GPS) modules further boost the concurrent representation learning between the prompt parameters and features. In addition, the U-shaped restoration DiT model captures the long-range dependencies and contextual information for unified illumination restoration. Moreover, we present a novel Capsule-endoscopy Exposure Correction (CEC) dataset, including ground-truth and corrupted image pairs annotated by expert photographers. Extensive experiments against a variety of state-of-the-art (SOTA) methods on four datasets showcase the effectiveness of our proposed method and components in WCE illumination restoration, and the additional downstream experiments further demonstrate its utility for clinical diagnosis and surgical assistance.
EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms
Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Dongsheng Li, Deqing Yang
Jun 21 2024 cs.AI arXiv:2406.14228v2

@misc{2406.14228, author = {Siyu Yuan and Kaitao Song and Jiangjie Chen and Xu Tan and Dongsheng Li and Deqing Yang}, title = {{E}vo{A}gent: {T}owards {A}utomatic {M}ulti-{A}gent {G}eneration via {E}volutionary {A}lgorithms}, year = {2024}, eprint = {2406.14228}, note = {arXiv:2406.14228v2} }
PDF
The rise of powerful large language models (LLMs) has spurred a new trend in building LLM-based autonomous agents for solving complex tasks, especially multi-agent systems. Despite the remarkable progress, we notice that existing works are heavily dependent on human-designed frameworks, which greatly limits the functional scope and scalability of agent systems. How to automatically extend the specialized agent to multi-agent systems to improve task-solving capability still remains a significant challenge. In this paper, we introduce EvoAgent, a generic method to automatically extend expert agents to multi-agent systems via the evolutionary algorithm, thereby improving the effectiveness of LLM-based agents in solving tasks. Specifically, we consider the existing agent frameworks as the initial individual and then apply a series of evolutionary operators (e.g., mutation, crossover, selection, etc.) to generate multiple agents with diverse agent settings. EvoAgent can be generalized to any LLM-based agent framework, and can automatically extend the existing agent framework to multi-agent systems without any extra human designs. Experimental results across various tasks have shown that EvoAgent can automatically generate multiple expert agents and significantly enhance the task-solving capabilities of LLM-based agents.
Boosting Scientific Concepts Understanding: Can Analogy from Teacher Models Empower Student Models?
Siyu Yuan, Cheng Jiayang, Lin Qiu, Deqing Yang
Jun 18 2024 cs.CL cs.AI arXiv:2406.11375v2

@misc{2406.11375, author = {Siyu Yuan and Cheng Jiayang and Lin Qiu and Deqing Yang}, title = {{B}oosting {S}cientific {C}oncepts {U}nderstanding: {C}an {A}nalogy from {T}eacher {M}odels {E}mpower {S}tudent {M}odels?}, year = {2024}, eprint = {2406.11375}, note = {arXiv:2406.11375v2} }
PDF
Analogical reasoning plays a critical role in human cognition, enabling us to understand new concepts by associating them with familiar ones. Previous research in the AI community has mainly focused on identifying and generating analogies and then examining their quality under human evaluation, which overlooks the practical application of these analogies in real-world settings. Inspired by the human education process, in this paper, we propose to investigate how analogies created by teacher language models (LMs) can assist student LMs in understanding scientific concepts, thereby aligning more closely with practical scenarios. Our results suggest that free-form analogies can indeed aid LMs in understanding concepts. Additionally, analogies generated by student LMs can improve their own performance on scientific question answering, demonstrating their capability to use analogies for self-learning new knowledge. Resources are available at https://github.com/siyuyuan/SCUA.
Light Up the Shadows: Enhance Long-Tailed Entity Grounding with Concept-Guided Vision-Language Models
Yikai Zhang, Qianyu He, Xintao Wang, Siyu Yuan, Jiaqing Liang, Yanghua Xiao
Jun 18 2024 cs.CV cs.CL arXiv:2406.10902v1

@misc{2406.10902, author = {Yikai Zhang and Qianyu He and Xintao Wang and Siyu Yuan and Jiaqing Liang and Yanghua Xiao}, title = {{L}ight {U}p the {S}hadows: {E}nhance {L}ong-{T}ailed {E}ntity {G}rounding with {C}oncept-{G}uided {V}ision-{L}anguage {M}odels}, year = {2024}, eprint = {2406.10902}, note = {arXiv:2406.10902v1} }
PDF
Multi-Modal Knowledge Graphs (MMKGs) have proven valuable for various downstream tasks. However, scaling them up is challenging because building large-scale MMKGs often introduces mismatched images (i.e., noise). Most entities in KGs belong to the long tail, meaning there are few images of them available online. This scarcity makes it difficult to determine whether a found image matches the entity. To address this, we draw on the Triangle of Reference Theory and suggest enhancing vision-language models with concept guidance. Specifically, we introduce COG, a two-stage framework with COncept-Guided vision-language models. The framework comprises a Concept Integration module, which effectively identifies image-text pairs of long-tailed entities, and an Evidence Fusion module, which offers explainability and enables human verification. To demonstrate the effectiveness of COG, we create a dataset of 25k image-text pairs of long-tailed entities. Our comprehensive experiments show that COG not only improves the accuracy of recognizing long-tailed image-text pairs compared to baselines but also offers flexibility and explainability.
Fine-Grained Urban Flow Inference with Multi-scale Representation Learning
Shilu Yuan, Dongfeng Li, Wei Liu, Xinxin Zhang, Meng Chen, Junjie Zhang, Yongshun Gong
Jun 17 2024 cs.CV cs.AI arXiv:2406.09710v1

@misc{2406.09710, author = {Shilu Yuan and Dongfeng Li and Wei Liu and Xinxin Zhang and Meng Chen and Junjie Zhang and Yongshun Gong}, title = {{F}ine-{G}rained {U}rban {F}low {I}nference with {M}ulti-scale {R}epresentation {L}earning}, year = {2024}, eprint = {2406.09710}, note = {arXiv:2406.09710v1} }
PDF
Fine-grained urban flow inference (FUFI) is a crucial transportation service aimed at improving traffic efficiency and safety. FUFI can infer fine-grained urban traffic flows based solely on observed coarse-grained data. However, most existing methods focus on the influence of single-scale static geographic information on FUFI, neglecting the interactions and dynamic information between different-scale regions within the city. Different-scale geographical features can capture redundant information from the same spatial areas. In order to effectively learn multi-scale information across time and space, we propose an effective fine-grained urban flow inference model called UrbanMSR, which uses self-supervised contrastive learning to obtain dynamic multi-scale representations of neighborhood-level and city-level geographic information, and fuses multi-scale representations to improve fine-grained accuracy. We validate the performance through extensive experiments on three real-world datasets. The results compared with state-of-the-art methods demonstrate the superiority of the proposed model.
A$^{2}$-MAE: A spatial-temporal-spectral unified remote sensing pre-training method based on anchor-aware masked autoencoder
Lixian Zhang, Yi Zhao, Runmin Dong, Jinxiao Zhang, Shuai Yuan, Shilei Cao, Mengxuan Chen, Juepeng Zheng, Weijia Li, Wei Liu, Wayne Zhang, Litong Feng, Haohuan Fu
Jun 13 2024 cs.CV arXiv:2406.08079v3

@misc{2406.08079, author = {Lixian Zhang and Yi Zhao and Runmin Dong and Jinxiao Zhang and Shuai Yuan and Shilei Cao and Mengxuan Chen and Juepeng Zheng and Weijia Li and Wei Liu and Wayne Zhang and Litong Feng and Haohuan Fu}, title = {{A}$^{2}$-{MAE}: {A} spatial-temporal-spectral unified remote sensing pre-training method based on anchor-aware masked autoencoder}, year = {2024}, eprint = {2406.08079}, note = {arXiv:2406.08079v3} }
PDF
Vast amounts of remote sensing (RS) data provide Earth observations across multiple dimensions, encompassing critical spatial, temporal, and spectral information which is essential for addressing global-scale challenges such as land use monitoring, disaster prevention, and environmental change mitigation. Despite various pre-training methods tailored to the characteristics of RS data, a key limitation persists: the inability to effectively integrate spatial, temporal, and spectral information within a single unified model. To unlock the potential of RS data, we construct a Spatial-Temporal-Spectral Structured Dataset (STSSD) characterized by the incorporation of multiple RS sources, diverse coverage, unified locations within image sets, and heterogeneity within images. Building upon this structured dataset, we propose an Anchor-Aware Masked AutoEncoder method (A$^{2}$-MAE), leveraging intrinsic complementary information from the different kinds of images and geo-information to reconstruct the masked patches during the pre-training phase. A$^{2}$-MAE integrates an anchor-aware masking strategy and a geographic encoding module to comprehensively exploit the properties of RS images. Specifically, the proposed anchor-aware masking strategy dynamically adapts the masking process based on the meta-information of a pre-selected anchor image, thereby facilitating the training on images captured by diverse types of RS sources within one model. Furthermore, we propose a geographic encoding method to leverage accurate spatial patterns, enhancing the model generalization capabilities for downstream applications that are generally location-related. Extensive experiments demonstrate our method achieves comprehensive improvements across various downstream tasks compared with existing RS pre-training methods, including image classification, semantic segmentation, and change detection tasks.
SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals
Ruihan Yang, Jiangjie Chen, Yikai Zhang, Siyu Yuan, Aili Chen, Kyle Richardson, Yanghua Xiao, Deqing Yang
Jun 10 2024 cs.CL cs.AI arXiv:2406.04784v1

@misc{2406.04784, author = {Ruihan Yang and Jiangjie Chen and Yikai Zhang and Siyu Yuan and Aili Chen and Kyle Richardson and Yanghua Xiao and Deqing Yang}, title = {{S}elf{G}oal: {Y}our {L}anguage {A}gents {A}lready {K}now {H}ow to {A}chieve {H}igh-level {G}oals}, year = {2024}, eprint = {2406.04784}, note = {arXiv:2406.04784v1} }
PDF
Language agents powered by large language models (LLMs) are increasingly valuable as decision-making tools in domains such as gaming and programming. However, these agents often face challenges in achieving high-level goals without detailed instructions and in adapting to environments where feedback is delayed. In this paper, we present SelfGoal, a novel automatic approach designed to enhance agents' capabilities to achieve high-level goals with limited human prior and environmental feedback. The core concept of SelfGoal involves adaptively breaking down a high-level goal into a tree structure of more practical subgoals during the interaction with environments while identifying the most useful subgoals and progressively updating this structure. Experimental results demonstrate that SelfGoal significantly enhances the performance of language agents across various tasks, including competitive, cooperative, and deferred feedback environments. Project page: https://selfgoal-agent.github.io.
MMPolymer: A Multimodal Multitask Pretraining Framework for Polymer Property Prediction
Fanmeng Wang, Wentao Guo, Minjie Cheng, Shen Yuan, Hongteng Xu, Zhifeng Gao
Jun 10 2024 cs.LG cond-mat.soft cs.AI arXiv:2406.04727v2

@misc{2406.04727, author = {Fanmeng Wang and Wentao Guo and Minjie Cheng and Shen Yuan and Hongteng Xu and Zhifeng Gao}, title = {{MMP}olymer: {A} {M}ultimodal {M}ultitask {P}retraining {F}ramework for {P}olymer {P}roperty {P}rediction}, year = {2024}, eprint = {2406.04727}, note = {arXiv:2406.04727v2} }
PDF
Polymers are high-molecular-weight compounds constructed by the covalent bonding of numerous identical or similar monomers, so their 3D structures are complex yet exhibit non-negligible regularity. Typically, the properties of a polymer, such as plasticity, conductivity, and bio-compatibility, are highly correlated with its 3D structure. However, existing polymer property prediction methods heavily rely on the information learned from polymer SMILES sequences (P-SMILES strings) while ignoring crucial 3D structural information, resulting in sub-optimal performance. In this work, we propose MMPolymer, a novel multimodal multitask pretraining framework incorporating polymer 1D sequential and 3D structural information to support downstream polymer property prediction tasks. In addition, considering the scarcity of polymer 3D data, we further introduce the "Star Substitution" strategy to extract 3D structural information effectively. During pretraining, in addition to predicting masked tokens and recovering clear 3D coordinates, MMPolymer achieves the cross-modal alignment of latent representations. Then we further fine-tune the pretrained MMPolymer for downstream polymer property prediction tasks in the supervised learning paradigm. Experiments show that MMPolymer achieves state-of-the-art performance in downstream property prediction tasks. Moreover, given the pretrained MMPolymer, utilizing merely a single modality in the fine-tuning phase can also outperform existing methods, showcasing the exceptional capability of MMPolymer in polymer feature extraction and utilization.
FUSU: A Multi-temporal-source Land Use Change Segmentation Dataset for Fine-grained Urban Semantic Understanding
Shuai Yuan, Guancong Lin, Lixian Zhang, Runmin Dong, Jinxiao Zhang, Shuang Chen, Juepeng Zheng, Jie Wang, Haohuan Fu
May 30 2024 cs.CV arXiv:2405.19055v3

@misc{2405.19055, author = {Shuai Yuan and Guancong Lin and Lixian Zhang and Runmin Dong and Jinxiao Zhang and Shuang Chen and Juepeng Zheng and Jie Wang and Haohuan Fu}, title = {{FUSU}: {A} {M}ulti-temporal-source {L}and {U}se {C}hange {S}egmentation {D}ataset for {F}ine-grained {U}rban {S}emantic {U}nderstanding}, year = {2024}, eprint = {2405.19055}, note = {arXiv:2405.19055v3} }
PDF
Fine urban change segmentation using multi-temporal remote sensing images is essential for understanding human-environment interactions in urban areas. Although there have been advances in high-quality land cover datasets that reveal the physical features of urban landscapes, the lack of fine-grained land use datasets hinders a deeper understanding of how human activities are distributed across the landscape and the impact of these activities on the environment, thus constraining proper technique development. To address this, we introduce FUSU, the first fine-grained land use change segmentation dataset for Fine-grained Urban Semantic Understanding. FUSU features the most detailed land use classification system to date, with 17 classes and 30 billion pixels of annotations. It includes bi-temporal high-resolution satellite images with 0.2-0.5 m ground sample distance and monthly optical and radar satellite time series, covering 847 km$^2$ across five urban areas in southern and northern China with different geographical features. The fine-grained land use pixel-wise annotations and high spatial-temporal resolution data provide a robust foundation for developing proper deep learning models to provide contextual insights on human activities and urbanization. To fully leverage FUSU, we propose a unified time-series architecture for both change detection and segmentation. We benchmark FUSU on various methods for several tasks. Dataset and code are available at: https://github.com/yuanshuai0914/FUSU.
Bridging The Gap between Low-rank and Orthogonal Adaptation via Householder Reflection Adaptation
Shen Yuan, Haotian Liu, Hongteng Xu
May 29 2024 cs.LG cs.CV arXiv:2405.17484v2

@misc{2405.17484, author = {Shen Yuan and Haotian Liu and Hongteng Xu}, title = {{B}ridging {T}he {G}ap between {L}ow-rank and {O}rthogonal {A}daptation via {H}ouseholder {R}eflection {A}daptation}, year = {2024}, eprint = {2405.17484}, note = {arXiv:2405.17484v2} }
PDF
While following different technical routes, both low-rank and orthogonal adaptation techniques can efficiently adapt large-scale pre-trained models to specific tasks or domains based on a small number of trainable parameters. In this study, we bridge the gap between these two techniques, proposing a simple but effective adaptation method based on Householder reflections. Given a pre-trained model, our method fine-tunes its layers by multiplying each frozen weight matrix with an orthogonal matrix constructed by a chain of learnable Householder reflections (HRs). This HR-based orthogonal fine-tuning is equivalent to an adaptive low-rank adaptation. Moreover, we show that the orthogonality of the reflection planes corresponding to the HRs impacts the model capacity and regularity. The analysis motivates us to regularize the orthogonality of the HRs, leading to different implementations of the proposed Householder reflection adaptation (HRA) method. Compared with state-of-the-art methods, HRA achieves superior performance with fewer learnable parameters when adapting large language models and conditional image generators. The code of the experiments is available at https://github.com/DaShenZi721/HRA, and the method has been merged into the PEFT package (https://github.com/huggingface/peft).
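As a rough illustration of the construction described in this abstract, the sketch below multiplies a frozen weight matrix by a chain of Householder reflections. It is not taken from the HRA repository or the PEFT package. The helper names (`householder`, `matmul`, `adapt`), the right-multiplication order, and the plain-list matrices are all assumptions made for the example. Each reflection is $H = I - 2uu^\top/(u^\top u)$, and a product of such reflections is orthogonal.

```python
def householder(u):
    """Return the reflection H = I - 2 u u^T / (u^T u) as a list of rows."""
    norm_sq = sum(x * x for x in u)
    n = len(u)
    return [
        [(1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] / norm_sq for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def adapt(weight, vectors):
    """Multiply the frozen `weight` on the right by one reflection per vector.

    In a real fine-tuning setup the vectors would be the trainable
    parameters; here they are fixed numbers chosen for illustration.
    """
    adapted = weight
    for u in vectors:
        adapted = matmul(adapted, householder(u))
    return adapted


# Hypothetical 2x2 frozen weight and a single reflection vector.
frozen = [[1, 2], [3, 4]]
adapted = adapt(frozen, [[1, 0]])
```

A chain of r reflections differs from the identity by a matrix of rank at most r, which is one way to see the connection to low-rank adaptation that the abstract mentions.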
Memory-efficient High-resolution OCT Volume Synthesis with Cascaded Amortized Latent Diffusion Models
Kun Huang, Xiao Ma, Yuhan Zhang, Na Su, Songtao Yuan, Yong Liu, Qiang Chen, Huazhu Fu
May 28 2024 eess.IV cs.CV arXiv:2405.16516v1

@misc{2405.16516, author = {Kun Huang and Xiao Ma and Yuhan Zhang and Na Su and Songtao Yuan and Yong Liu and Qiang Chen and Huazhu Fu}, title = {{M}emory-efficient {H}igh-resolution {OCT} {V}olume {S}ynthesis with {C}ascaded {A}mortized {L}atent {D}iffusion {M}odels}, year = {2024}, eprint = {2405.16516}, note = {arXiv:2405.16516v1} }
PDF
Optical coherence tomography (OCT) image analysis plays an important role in the field of ophthalmology. Current successful analysis models rely on available large datasets, which can be challenging to obtain for certain tasks. The use of deep generative models to create realistic data emerges as a promising approach. However, due to limitations in hardware resources, it is still difficult to synthesize high-resolution OCT volumes. In this paper, we introduce a cascaded amortized latent diffusion model (CA-LDM) that can synthesize high-resolution OCT volumes in a memory-efficient way. First, we propose non-holistic autoencoders to efficiently build a bidirectional mapping between high-resolution volume space and low-resolution latent space. In tandem with autoencoders, we propose cascaded diffusion processes to synthesize high-resolution OCT volumes with a global-to-local refinement process, amortizing the memory and computational demands. Experiments on a public high-resolution OCT dataset show that our synthetic data have realistic high-resolution and global features, surpassing the capabilities of existing methods. Moreover, performance gains on two downstream fine-grained segmentation tasks demonstrate the benefit of the proposed method in training deep learning models for medical imaging tasks. The code is publicly available at: https://github.com/nicetomeetu21/CA-LDM.
Magnetic-Guided Flexible Origami Robot toward Long-Term Phototherapy of H. pylori in the Stomach
Sishen Yuan, Baijia Liang, Po Wa Wong, Mingjing Xu, Chi Hsuan Li, Zhen Li, Hongliang Ren
May 14 2024 eess.SY cs.SY arXiv:2405.07216v1

@misc{2405.07216, author = {Sishen Yuan and Baijia Liang and Po Wa Wong and Mingjing Xu and Chi Hsuan Li and Zhen Li and Hongliang Ren}, title = {{M}agnetic-{G}uided {F}lexible {O}rigami {R}obot toward {L}ong-{T}erm {P}hototherapy of {H}. pylori in the {S}tomach}, year = {2024}, eprint = {2405.07216}, note = {arXiv:2405.07216v1} }
PDF
Helicobacter pylori, a pervasive bacterial infection associated with gastrointestinal disorders such as gastritis, peptic ulcer disease, and gastric cancer, impacts approximately 50% of the global population. The efficacy of standard clinical eradication therapies is diminishing due to the rise of antibiotic-resistant strains, necessitating alternative treatment strategies. Photodynamic therapy (PDT) emerges as a promising prospect in this context. This study presents the development and implementation of a magnetically-guided origami robot, incorporating flexible printed circuit units for sustained and stable phototherapy of Helicobacter pylori. Each integrated unit is equipped with wireless charging capabilities, producing an optimal power output that can concurrently illuminate up to 15 LEDs at their maximum intensity. Crucially, these units can be remotely manipulated via a magnetic field, facilitating both translational and rotational movements. We propose an open-loop manual control sequence that allows the formation of a stable, compliant triangular structure through the interaction of internal magnets. This adaptable configuration is uniquely designed to withstand the dynamic squeezing environment prevalent in real-world gastric applications. The research herein represents a significant stride in leveraging technology for innovative medical solutions, particularly in the management of antibiotic-resistant Helicobacter pylori infections.
Chained Flexible Capsule Endoscope: Unraveling the Conundrum of Size Limitations and Functional Integration for Gastrointestinal Transitivity
Sishen Yuan, Guang Li, Baijia Liang, Lailu Li, Qingzhuo Zheng, Shuang Song, Zhen Li, Hongliang Ren
May 14 2024 physics.med-ph cs.SY eess.SY arXiv:2405.07218v1

@misc{2405.07218, author = {Sishen Yuan and Guang Li and Baijia Liang and Lailu Li and Qingzhuo Zheng and Shuang Song and Zhen Li and Hongliang Ren}, title = {{C}hained {F}lexible {C}apsule {E}ndoscope: {U}nraveling the {C}onundrum of {S}ize {L}imitations and {F}unctional {I}ntegration for {G}astrointestinal {T}ransitivity}, year = {2024}, eprint = {2405.07218}, note = {arXiv:2405.07218v1} }
PDF
Capsule endoscopes, predominantly serving diagnostic functions, provide lucid internal imagery but are devoid of surgical or therapeutic capabilities. Consequently, despite lesion detection, physicians frequently resort to traditional endoscopic or open surgical procedures for treatment, resulting in more complex, potentially risky interventions. To surmount these limitations, this study introduces a chained flexible capsule endoscope (FCE) design concept, specifically conceived to navigate the inherent volume constraints of capsule endoscopes whilst augmenting their therapeutic functionalities. The FCE's distinctive flexibility originates from a conventional rotating joint design and the incision pattern in the flexible material. In vitro experiments validated the passive navigation ability of the FCE in rugged intestinal tracts. Further, the FCE demonstrates consistent reptile-like peristalsis under the influence of an external magnetic field, and possesses the capability for film expansion and disintegration under high-frequency electromagnetic stimulation. These findings illuminate a promising path toward amplifying the therapeutic capacities of capsule endoscopes without necessitating a size compromise.
MaskMatch: Boosting Semi-Supervised Learning Through Mask Autoencoder-Driven Feature Learning
Wenjin Zhang, Keyi Li, Sen Yang, Chenyang Gao, Wanzhao Yang, Sifan Yuan, Ivan Marsic
May 13 2024 cs.CV arXiv:2405.06227v1

@misc{2405.06227, author = {Wenjin Zhang and Keyi Li and Sen Yang and Chenyang Gao and Wanzhao Yang and Sifan Yuan and Ivan Marsic}, title = {{M}ask{M}atch: {B}oosting {S}emi-{S}upervised {L}earning {T}hrough {M}ask {A}utoencoder-{D}riven {F}eature {L}earning}, year = {2024}, eprint = {2405.06227}, note = {arXiv:2405.06227v1} }
PDF
Conventional methods in semi-supervised learning (SSL) often face challenges related to limited data utilization, mainly due to their reliance on threshold-based techniques for selecting high-confidence unlabeled data during training. Various efforts (e.g., FreeMatch) have been made to enhance data utilization by tweaking the thresholds, yet none have managed to use 100% of the available data. To overcome this limitation and improve SSL performance, we introduce MaskMatch, a novel algorithm that fully utilizes unlabeled data to boost semi-supervised learning. MaskMatch integrates a self-supervised learning strategy, i.e., Masked Autoencoder (MAE), that uses all available data to enforce the visual representation learning. This enables the SSL algorithm to leverage all available data, including samples typically filtered out by traditional methods. In addition, we propose a synthetic data training approach to further increase data utilization and improve generalization. These innovations lead MaskMatch to achieve state-of-the-art results on challenging datasets. For instance, on CIFAR-100 with 2 labels per class, STL-10 with 4 labels per class, and Euro-SAT with 2 labels per class, MaskMatch achieves low error rates of 18.71%, 9.47%, and 3.07%, respectively. The code will be made publicly available.
UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model
Shuai Yuan, Lei Luo, Zhuo Hui, Can Pu, Xiaoyu Xiang, Rakesh Ranjan, Denis Demandolx
May 07 2024 cs.CV cs.AI cs.RO arXiv:2405.02608v1

@misc{2405.02608, author = {Shuai Yuan and Lei Luo and Zhuo Hui and Can Pu and Xiaoyu Xiang and Rakesh Ranjan and Denis Demandolx}, title = {{U}n{SAMF}low: {U}nsupervised {O}ptical {F}low {G}uided by {S}egment {A}nything {M}odel}, year = {2024}, eprint = {2405.02608}, note = {arXiv:2405.02608v1} }
PDF
Traditional unsupervised optical flow methods are vulnerable to occlusions and motion boundaries due to lack of object-level information. Therefore, we propose UnSAMFlow, an unsupervised flow network that also leverages object information from the latest foundation model Segment Anything Model (SAM). We first include a self-supervised semantic augmentation module tailored to SAM masks. We also analyze the poor gradient landscapes of traditional smoothness losses and propose a new smoothness definition based on homography instead. A simple yet effective mask feature module has also been added to further aggregate features on the object level. With all these adaptations, our method produces clear optical flow estimation with sharp boundaries around objects, which outperforms state-of-the-art methods on both KITTI and Sintel datasets. Our method also generalizes well across domains and runs very efficiently.
Achievability Bounds on Unequal Error Protection Codes
Liuquan Yao, Shuai Yuan, Yuan Li, Huazi Zhang, Jun Wang, Guiying Yan, Zhiming Ma
May 07 2024 cs.IT math.IT arXiv:2405.03288v3

@misc{2405.03288, author = {Liuquan Yao and Shuai Yuan and Yuan Li and Huazi Zhang and Jun Wang and Guiying Yan and Zhiming Ma}, title = {{A}chievability {B}ounds on {U}nequal {E}rror {P}rotection {C}odes}, year = {2024}, eprint = {2405.03288}, note = {arXiv:2405.03288v3} }
PDF
Unequal error protection (UEP) codes can facilitate the transmission of messages with different protection levels. In this paper, we study the achievability bounds on UEP by generalizing the Gilbert-Varshamov (GV) bound. For the first time, we show that under certain conditions, UEP asymptotically enhances the code rate compared with time-sharing (TS) strategies.
From Persona to Personalization: A Survey on Role-Playing Language Agents
Jiangjie Chen, Xintao Wang, Rui Xu, Siyu Yuan, Yikai Zhang, Wei Shi, Jian Xie, Shuang Li, Ruihan Yang, Tinghui Zhu, Aili Chen, Nianqi Li, Lida Chen, Caiyu Hu, Siye Wu, Scott Ren, Ziquan Fu, Yanghua Xiao
Apr 30 2024 cs.CL cs.AI arXiv:2404.18231v2

@misc{2404.18231, author = {Jiangjie Chen and Xintao Wang and Rui Xu and Siyu Yuan and Yikai Zhang and Wei Shi and Jian Xie and Shuang Li and Ruihan Yang and Tinghui Zhu and Aili Chen and Nianqi Li and Lida Chen and Caiyu Hu and Siye Wu and Scott Ren and Ziquan Fu and Yanghua Xiao}, title = {{F}rom {P}ersona to {P}ersonalization: {A} {S}urvey on {R}ole-{P}laying {L}anguage {A}gents}, year = {2024}, eprint = {2404.18231}, note = {arXiv:2404.18231v2} }
PDF
Recent advancements in large language models (LLMs) have significantly boosted the rise of Role-Playing Language Agents (RPLAs), i.e., specialized AI systems designed to simulate assigned personas. By harnessing multiple advanced abilities of LLMs, including in-context learning, instruction following, and social intelligence, RPLAs achieve a remarkable sense of human likeness and vivid role-playing performance. RPLAs can mimic a wide range of personas, ranging from historical figures and fictional characters to real-life individuals. Consequently, they have catalyzed numerous AI applications, such as emotional companions, interactive video games, personalized assistants and copilots, and digital clones. In this paper, we conduct a comprehensive survey of this field, illustrating the evolution and recent progress in RPLAs integrating with cutting-edge LLM technologies. We categorize personas into three types: 1) Demographic Persona, which leverages statistical stereotypes; 2) Character Persona, focused on well-established figures; and 3) Individualized Persona, customized through ongoing user interactions for personalized services. We begin by presenting a comprehensive overview of current methodologies for RPLAs, followed by the details for each persona type, covering corresponding data sourcing, agent construction, and evaluation. Afterward, we discuss the fundamental risks, existing limitations, and future prospects of RPLAs. Additionally, we provide a brief review of RPLAs in AI applications, which reflects practical user demands that shape and drive RPLA research. Through this work, we aim to establish a clear taxonomy of RPLA research and applications, facilitate future research in this critical and ever-evolving field, and pave the way for a future where humans and RPLAs coexist in harmony.
Evaluating Character Understanding of Large Language Models via Character Profiling from Fictional Works
Xinfeng Yuan, Siyu Yuan, Yuhan Cui, Tianhe Lin, Xintao Wang, Rui Xu, Jiangjie Chen, Deqing Yang
Apr 23 2024 cs.CL arXiv:2404.12726v3

@misc{2404.12726, author = {Xinfeng Yuan and Siyu Yuan and Yuhan Cui and Tianhe Lin and Xintao Wang and Rui Xu and Jiangjie Chen and Deqing Yang}, title = {{E}valuating {C}haracter {U}nderstanding of {L}arge {L}anguage {M}odels via {C}haracter {P}rofiling from {F}ictional {W}orks}, year = {2024}, eprint = {2404.12726}, note = {arXiv:2404.12726v3} }
PDF
Large language models (LLMs) have demonstrated impressive performance and spurred numerous AI applications, in which role-playing agents (RPAs) are particularly popular, especially for fictional characters. The prerequisite for these RPAs lies in the capability of LLMs to understand characters from fictional works. Previous efforts have evaluated this capability via basic classification tasks or characteristic imitation, failing to capture the nuanced character understanding with LLMs. In this paper, we propose evaluating LLMs' character understanding capability via the character profiling task, i.e., summarizing character profiles from corresponding materials, a widely adopted yet understudied practice for RPA development. Specifically, we construct the CroSS dataset from literature experts and assess the generated profiles by comparing them with ground truth references and evaluating their applicability in downstream tasks. Our experiments, which cover various summarization methods and LLMs, have yielded promising results. These results strongly validate the character understanding capability of LLMs. Resources are available at https://github.com/Joanna0123/character_profiling.
"A good pun is its own reword": Can Large Language Models Understand Puns?
Zhijun Xu, Siyu Yuan, Lingjie Chen, Deqing Yang
Apr 23 2024 cs.CL arXiv:2404.13599v2

@misc{2404.13599, author = {Zhijun Xu and Siyu Yuan and Lingjie Chen and Deqing Yang}, title = {"{A} good pun is its own reword": {C}an {L}arge {L}anguage {M}odels {U}nderstand {P}uns?}, year = {2024}, eprint = {2404.13599}, note = {arXiv:2404.13599v2} }
PDF
Puns play a vital role in academic research due to their distinct structure and clear definition, which aid in the comprehensive analysis of linguistic humor. However, the understanding of puns in large language models (LLMs) has not been thoroughly examined, limiting their use in creative writing and humor creation. In this paper, we leverage three popular tasks, i.e., pun recognition, explanation and generation to systematically evaluate the capabilities of LLMs in pun understanding. In addition to adopting the automated evaluation metrics from prior research, we introduce new evaluation methods and metrics that are better suited to the in-context learning paradigm of LLMs. These new metrics offer a more rigorous assessment of an LLM's ability to understand puns and align more closely with human cognition than previous metrics. Our findings reveal the "lazy pun generation" pattern and identify the primary challenges LLMs encounter in understanding puns.
Character is Destiny: Can Large Language Models Simulate Persona-Driven Decisions in Role-Playing?
Rui Xu, Xintao Wang, Jiangjie Chen, Siyu Yuan, Xinfeng Yuan, Jiaqing Liang, Zulong Chen, Xiaoqing Dong, Yanghua Xiao
Apr 19 2024 cs.AI arXiv:2404.12138v1

@misc{2404.12138, author = {Rui Xu and Xintao Wang and Jiangjie Chen and Siyu Yuan and Xinfeng Yuan and Jiaqing Liang and Zulong Chen and Xiaoqing Dong and Yanghua Xiao}, title = {{C}haracter is {D}estiny: {C}an {L}arge {L}anguage {M}odels {S}imulate {P}ersona-{D}riven {D}ecisions in {R}ole-{P}laying?}, year = {2024}, eprint = {2404.12138}, note = {arXiv:2404.12138v1} }
PDF
Can Large Language Models substitute humans in making important decisions? Recent research has unveiled the potential of LLMs to role-play assigned personas, mimicking their knowledge and linguistic habits. However, imitative decision-making requires a more nuanced understanding of personas. In this paper, we benchmark the ability of LLMs in persona-driven decision-making. Specifically, we investigate whether LLMs can predict characters' decisions provided with the preceding stories in high-quality novels. Leveraging character analyses written by literary experts, we construct a dataset LIFECHOICE comprising 1,401 character decision points from 395 books. Then, we conduct comprehensive experiments on LIFECHOICE, with various LLMs and methods for LLM role-playing. The results demonstrate that state-of-the-art LLMs exhibit promising capabilities in this task, yet there is substantial room for improvement. Hence, we further propose the CHARMAP method, which achieves a 6.01% increase in accuracy via persona-based memory retrieval. We will make our datasets and code publicly available.
O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation
Muer Tie, Julong Wei, Zhengjun Wang, Ke Wu, Shansuai Yuan, Kaizhao Zhang, Jie Jia, Jieru Zhao, Zhongxue Gan, Wenchao Ding
Apr 11 2024 cs.CV arXiv:2404.06836v1

@misc{2404.06836, author = {Muer Tie and Julong Wei and Zhengjun Wang and Ke Wu and Shansuai Yuan and Kaizhao Zhang and Jie Jia and Jieru Zhao and Zhongxue Gan and Wenchao Ding}, title = {{O}2{V}-{M}apping: {O}nline {O}pen-{V}ocabulary {M}apping with {N}eural {I}mplicit {R}epresentation}, year = {2024}, eprint = {2404.06836}, note = {arXiv:2404.06836v1} }
PDF
Online construction of open-ended language scenes is crucial for robotic applications, where open-vocabulary interactive scene understanding is required. Recently, neural implicit representation has provided a promising direction for online interactive mapping. However, implementing open-vocabulary scene understanding capability into online neural implicit mapping still faces three challenges: lack of local scene updating ability, blurry spatial hierarchical semantic segmentation, and difficulty in maintaining multi-view consistency. To this end, we propose O2V-mapping, which utilizes voxel-based language and geometric features to create an open-vocabulary field, thus allowing for local updates during the online training process. Additionally, we leverage a foundational model for image segmentation to extract language features on object-level entities, achieving clear segmentation boundaries and hierarchical semantic features. For the purpose of preserving consistency in 3D object properties across different viewpoints, we propose a spatial adaptive voxel adjustment mechanism and a multi-view weight selection method. Extensive experiments on open-vocabulary object localization and semantic segmentation demonstrate that O2V-mapping achieves online construction of language scenes while enhancing accuracy, outperforming the previous SOTA method.
GraSAME: Injecting Token-Level Structural Information to Pretrained Language Models via Graph-guided Self-Attention Mechanism
Shuzhou Yuan, Michael Färber
Apr 11 2024 cs.CL arXiv:2404.06911v1

@misc{2404.06911, author = {Shuzhou Yuan and Michael Färber}, title = {{G}ra{SAME}: {I}njecting {T}oken-{L}evel {S}tructural {I}nformation to {P}retrained {L}anguage {M}odels via {G}raph-guided {S}elf-{A}ttention {M}echanism}, year = {2024}, eprint = {2404.06911}, note = {arXiv:2404.06911v1} }
PDF
Pretrained Language Models (PLMs) benefit from external knowledge stored in graph structures for various downstream tasks. However, bridging the modality gap between graph structures and text remains a significant challenge. Traditional methods like linearizing graphs for PLMs lose vital graph connectivity, whereas Graph Neural Networks (GNNs) require cumbersome processes for integration into PLMs. In this work, we propose a novel graph-guided self-attention mechanism, GraSAME. GraSAME seamlessly incorporates token-level structural information into PLMs without necessitating additional alignment or concatenation efforts. As an end-to-end, lightweight multimodal module, GraSAME follows a multi-task learning strategy and effectively bridges the gap between graph and textual modalities, facilitating dynamic interactions between GNNs and PLMs. Our experiments on the graph-to-text generation task demonstrate that GraSAME outperforms baseline models and achieves results comparable to state-of-the-art (SOTA) models on WebNLG datasets. Furthermore, compared to SOTA models, GraSAME eliminates the need for extra pre-training tasks to adjust graph inputs and reduces the number of trainable parameters by over 100 million.
Incremental Joint Learning of Depth, Pose and Implicit Scene Representation on Monocular Camera in Large-scale Scenes
Tianchen Deng, Nailin Wang, Chongdi Wang, Shenghai Yuan, Jingchuan Wang, Danwei Wang, Weidong Chen
Apr 10 2024 cs.CV cs.RO arXiv:2404.06050v2

@misc{2404.06050, author = {Tianchen Deng and Nailin Wang and Chongdi Wang and Shenghai Yuan and Jingchuan Wang and Danwei Wang and Weidong Chen}, title = {{I}ncremental {J}oint {L}earning of {D}epth, {P}ose and {I}mplicit {S}cene {R}epresentation on {M}onocular {C}amera in {L}arge-scale {S}cenes}, year = {2024}, eprint = {2404.06050}, note = {arXiv:2404.06050v2} }
PDF
Dense scene reconstruction for photo-realistic view synthesis has various applications, such as VR/AR and autonomous vehicles. However, most existing methods have difficulties in large-scale scenes due to three core challenges: (a) inaccurate depth input. Accurate depth input is impossible to get in real-world large-scale scenes. (b) inaccurate pose estimation. Most existing approaches rely on accurate pre-estimated camera poses. (c) insufficient scene representation capability. A single global radiance field lacks the capacity to effectively scale to large-scale scenes. To this end, we propose an incremental joint learning framework, which can achieve accurate depth, pose estimation, and large-scale scene reconstruction. A vision transformer-based network is adopted as the backbone to enhance performance in scale information estimation. For pose estimation, a feature-metric bundle adjustment (FBA) method is designed for accurate and robust camera tracking in large-scale scenes. In terms of implicit scene representation, we propose an incremental scene representation method to construct the entire large-scale scene as multiple local radiance fields to enhance the scalability of 3D scene representation. Extended experiments have been conducted to demonstrate the effectiveness and accuracy of our method in depth estimation, pose estimation, and large-scale scene reconstruction.
Salient Sparse Visual Odometry With Pose-Only Supervision
Siyu Chen, Kangcheng Liu, Chen Wang, Shenghai Yuan, Jianfei Yang, Lihua Xie
Apr 09 2024 cs.CV cs.RO arXiv:2404.04677v1

@misc{2404.04677, author = {Siyu Chen and Kangcheng Liu and Chen Wang and Shenghai Yuan and Jianfei Yang and Lihua Xie}, title = {{S}alient {S}parse {V}isual {O}dometry {W}ith {P}ose-{O}nly {S}upervision}, year = {2024}, eprint = {2404.04677}, doi = {10.1109/LRA.2024.3384757}, note = {arXiv:2404.04677v1} }
PDF
Visual Odometry (VO) is vital for the navigation of autonomous systems, providing accurate position and orientation estimates at reasonable costs. While traditional VO methods excel in some conditions, they struggle with challenges like variable lighting and motion blur. Deep learning-based VO, though more adaptable, can face generalization problems in new environments. Addressing these drawbacks, this paper presents a novel hybrid visual odometry (VO) framework that leverages pose-only supervision, offering a balanced solution between robustness and the need for extensive labeling. We propose two cost-effective and innovative designs: a self-supervised homographic pre-training for enhancing optical flow learning from pose-only labels and a random patch-based salient point detection strategy for more accurate optical flow patch extraction. These designs eliminate the need for dense optical flow labels for training and significantly improve the generalization capability of the system in diverse and challenging environments. Our pose-only supervised method achieves competitive performance on standard datasets and greater robustness and generalization ability in extreme and unseen scenarios, even compared to dense optical flow-supervised state-of-the-art methods.
MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators
Shenghai Yuan, Jinfa Huang, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, Jiebo Luo
Apr 09 2024 cs.CV arXiv:2404.05014v1

@misc{2404.05014, author = {Shenghai Yuan and Jinfa Huang and Yujun Shi and Yongqi Xu and Ruijie Zhu and Bin Lin and Xinhua Cheng and Li Yuan and Jiebo Luo}, title = {{M}agic{T}ime: {T}ime-lapse {V}ideo {G}eneration {M}odels as {M}etamorphic {S}imulators}, year = {2024}, eprint = {2404.05014}, note = {arXiv:2404.05014v1} }
PDF
Recent advances in Text-to-Video generation (T2V) have achieved remarkable success in synthesizing high-quality general videos from textual descriptions. A largely overlooked problem in T2V is that existing models have not adequately encoded physical knowledge of the real world, thus generated videos tend to have limited motion and poor variations. In this paper, we propose MagicTime, a metamorphic time-lapse video generation model, which learns real-world physics knowledge from time-lapse videos and implements metamorphic generation. First, we design a MagicAdapter scheme to decouple spatial and temporal training, encode more physical knowledge from metamorphic videos, and transform pre-trained T2V models to generate metamorphic videos. Second, we introduce a Dynamic Frames Extraction strategy to adapt to metamorphic time-lapse videos, which have a wider variation range and cover dramatic object metamorphic processes, thus embodying more physical knowledge than general videos. Finally, we introduce a Magic Text-Encoder to improve the understanding of metamorphic video prompts. Furthermore, we create a time-lapse video-text dataset called ChronoMagic, specifically curated to unlock the metamorphic video generation ability. Extensive experiments demonstrate the superiority and effectiveness of MagicTime for generating high-quality and dynamic metamorphic videos, suggesting time-lapse video generation is a promising path toward building metamorphic simulators of the physical world.
HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models
Yimu Wang, Shuai Yuan, Xiangru Jian, Wei Pang, Mushi Wang, Ning Yu
Apr 09 2024 cs.CV cs.CL cs.IR cs.LG arXiv:2404.05083v1

@misc{2404.05083, author = {Yimu Wang and Shuai Yuan and Xiangru Jian and Wei Pang and Mushi Wang and Ning Yu}, title = {{H}a{VTR}: {I}mproving {V}ideo-{T}ext {R}etrieval {T}hrough {A}ugmentation {U}sing {L}arge {F}oundation {M}odels}, year = {2024}, eprint = {2404.05083}, note = {arXiv:2404.05083v1} }
PDF
While recent progress in video-text retrieval has been driven by the exploration of powerful model architectures and training strategies, the representation learning ability of video-text retrieval models is still limited due to low-quality and scarce training data annotations. To address this issue, we present a novel video-text learning paradigm, HaVTR, which augments video and text data to learn more generalized features. Specifically, we first adopt a simple augmentation method, which generates self-similar data by randomly duplicating or dropping subwords and frames. In addition, inspired by the recent advancement in visual and language generative models, we propose a more powerful augmentation method through textual paraphrasing and video stylization using large language models (LLMs) and visual generative models (VGMs). Further, to bring richer information into video and text, we propose a hallucination-based augmentation method, where we use LLMs and VGMs to generate and add new relevant information to the original data. Benefiting from the enriched data, extensive experiments on several video-text retrieval benchmarks demonstrate the superiority of HaVTR over existing methods.
HGS-Mapping: Online Dense Mapping Using Hybrid Gaussian Representation in Urban Scenes
Ke Wu, Kaizhao Zhang, Zhiwei Zhang, Shanshuai Yuan, Muer Tie, Julong Wei, Zijun Xu, Jieru Zhao, Zhongxue Gan, Wenchao Ding
Apr 01 2024 cs.CV arXiv:2403.20159v1

@misc{2403.20159, author = {Ke Wu and Kaizhao Zhang and Zhiwei Zhang and Shanshuai Yuan and Muer Tie and Julong Wei and Zijun Xu and Jieru Zhao and Zhongxue Gan and Wenchao Ding}, title = {{HGS}-{M}apping: {O}nline {D}ense {M}apping {U}sing {H}ybrid {G}aussian {R}epresentation in {U}rban {S}cenes}, year = {2024}, eprint = {2403.20159}, note = {arXiv:2403.20159v1} }
PDF
Online dense mapping of urban scenes forms a fundamental cornerstone for scene understanding and navigation of autonomous vehicles. Recent advancements in mapping methods are mainly based on NeRF, whose rendering speed is too slow to meet online requirements. 3D Gaussian Splatting (3DGS), with its rendering speed hundreds of times faster than NeRF, holds greater potential in online dense mapping. However, integrating 3DGS into a street-view dense mapping framework still faces two challenges: incomplete reconstruction due to the absence of geometric information beyond the LiDAR coverage area, and extensive computation for reconstruction in large urban scenes. To this end, we propose HGS-Mapping, an online dense mapping framework in unbounded large-scale scenes. To attain complete construction, our framework introduces Hybrid Gaussian Representation, which models different parts of the entire scene using Gaussians with distinct properties. Furthermore, we employ a hybrid Gaussian initialization mechanism and an adaptive update method to achieve high-fidelity and rapid reconstruction. To the best of our knowledge, we are the first to integrate Gaussian representation into online dense mapping of urban scenes. Our approach achieves SOTA reconstruction accuracy while employing only 66% of the number of Gaussians, leading to 20% faster reconstruction speed.
FairCLIP: Harnessing Fairness in Vision-Language Learning
Yan Luo, Min Shi, Muhammad Osama Khan, Muhammad Muneeb Afzal, Hao Huang, Shuaihang Yuan, Yu Tian, Luo Song, Ava Kouhana, Tobias Elze, Yi Fang, Mengyu Wang
Apr 01 2024 cs.CV arXiv:2403.19949v2

@misc{2403.19949, author = {Yan Luo and Min Shi and Muhammad Osama Khan and Muhammad Muneeb Afzal and Hao Huang and Shuaihang Yuan and Yu Tian and Luo Song and Ava Kouhana and Tobias Elze and Yi Fang and Mengyu Wang}, title = {{F}air{CLIP}: {H}arnessing {F}airness in {V}ision-{L}anguage {L}earning}, year = {2024}, eprint = {2403.19949}, note = {arXiv:2403.19949v2} }
PDF
Fairness is a critical concern in deep learning, especially in healthcare, where these models influence diagnoses and treatment decisions. Although fairness has been investigated in the vision-only domain, the fairness of medical vision-language (VL) models remains unexplored due to the scarcity of medical VL datasets for studying fairness. To bridge this research gap, we introduce the first fair vision-language medical dataset Harvard-FairVLMed that provides detailed demographic attributes, ground-truth labels, and clinical notes to facilitate an in-depth examination of fairness within VL foundation models. Using Harvard-FairVLMed, we conduct a comprehensive fairness analysis of two widely-used VL models (CLIP and BLIP2), pre-trained on both natural and medical domains, across four different protected attributes. Our results highlight significant biases in all VL models, with Asian, Male, Non-Hispanic, and Spanish being the preferred subgroups across the protected attributes of race, gender, ethnicity, and language, respectively. In order to alleviate these biases, we propose FairCLIP, an optimal-transport-based approach that achieves a favorable trade-off between performance and fairness by reducing the Sinkhorn distance between the overall sample distribution and the distributions corresponding to each demographic group. As the first VL dataset of its kind, Harvard-FairVLMed holds the potential to catalyze advancements in the development of machine learning models that are both ethically aware and clinically effective. Our dataset and code are available at https://ophai.hms.harvard.edu/datasets/harvard-fairvlmed10k.
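For readers unfamiliar with the Sinkhorn distance mentioned above, the following is a minimal NumPy sketch of Sinkhorn iterations between two histograms. It is not the FairCLIP implementation: the function `sinkhorn`, the five-bin grid, the toy histograms `overall` and `group`, the squared-distance ground cost, and the values of `eps` and `n_iters` are all hypothetical choices for illustration, and the quantity returned is the transport cost under the entropy-regularised plan, which is only one of several conventions for the term.

```python
import numpy as np

def sinkhorn(a, b, cost, eps=0.5, n_iters=200):
    """Entropy-regularised optimal transport between histograms a and b.

    Returns the transport cost <plan, cost> and the transport plan.
    """
    kernel = np.exp(-cost / eps)             # Gibbs kernel of the ground cost
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)               # rescale to match column marginals to b
        u = a / (kernel @ v)                 # rescale to match row marginals to a
    plan = u[:, None] * kernel * v[None, :]
    return float(np.sum(plan * cost)), plan

# Hypothetical 1-D feature histograms on a shared grid of 5 bins.
grid = np.linspace(0.0, 1.0, 5)
cost = (grid[:, None] - grid[None, :]) ** 2  # squared-distance ground cost
overall = np.array([0.4, 0.3, 0.2, 0.05, 0.05])
group = overall[::-1].copy()                 # a deliberately shifted subgroup

gap, plan = sinkhorn(overall, group, cost)
no_gap, _ = sinkhorn(overall, overall, cost)
```

In this toy setting `gap` exceeds `no_gap`, so a training objective that penalises such a cost would push a subgroup's distribution toward the overall one, which is the intuition the abstract describes.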
SA-GS: Scale-Adaptive Gaussian Splatting for Training-Free Anti-Aliasing
Xiaowei Song, Jv Zheng, Shiran Yuan, Huan-ang Gao, Jingwei Zhao, Xiang He, Weihao Gu, Hao Zhao
Mar 29 2024 cs.CV arXiv:2403.19615v1

@misc{2403.19615, author = {Xiaowei Song and Jv Zheng and Shiran Yuan and Huan-ang Gao and Jingwei Zhao and Xiang He and Weihao Gu and Hao Zhao}, title = {{SA}-{GS}: {S}cale-{A}daptive {G}aussian {S}platting for {T}raining-{F}ree {A}nti-{A}liasing}, year = {2024}, eprint = {2403.19615}, note = {arXiv:2403.19615v1} }
PDF
In this paper, we present a Scale-adaptive method for Anti-aliasing Gaussian Splatting (SA-GS). While the state-of-the-art method Mip-Splatting requires modifying the training procedure of Gaussian splatting, our method functions at test time and is training-free. Specifically, SA-GS can be applied to any pretrained Gaussian splatting field as a plugin to significantly improve the field's anti-aliasing performance. The core technique is to apply 2D scale-adaptive filters to each Gaussian during test time. As pointed out by Mip-Splatting, observing Gaussians at different frequencies leads to mismatches between the Gaussian scales during training and testing. Mip-Splatting resolves this issue using 3D smoothing and 2D Mip filters, which are unfortunately not aware of testing frequency. In this work, we show that a 2D scale-adaptive filter that is informed of testing frequency can effectively match the Gaussian scale, thus making the Gaussian primitive distribution remain consistent across different testing frequencies. When scale inconsistency is eliminated, sampling rates smaller than the scene frequency result in conventional jaggedness, and we propose to integrate the projected 2D Gaussian within each pixel during testing. This integration is actually a limiting case of super-sampling, which significantly improves anti-aliasing performance over vanilla Gaussian Splatting. Through extensive experiments using various settings and both bounded and unbounded scenes, we show SA-GS performs comparably with or better than Mip-Splatting. Note that super-sampling and integration are only effective when our scale-adaptive filtering is activated. Our code, data, and models are available at https://github.com/zsy1987/SA-GS.