
Showing 1–50 of 697 results for author: Gao, Z

  1. arXiv:2410.16261  [pdf, other]

    cs.CV

    Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance

    Authors: Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang

    Abstract: Multimodal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a broad spectrum of domains. However, the large model scale and associated high computational costs pose significant challenges for training and deploying MLLMs on consumer-grade GPUs or edge devices, thereby hindering their widespread application. In this work, we introduce Mini-Inter… ▽ More

    Submitted 21 October, 2024; originally announced October 2024.

    Comments: Technical report

  2. arXiv:2410.15010  [pdf, other]

    cs.LG cs.AI

    FlexMol: A Flexible Toolkit for Benchmarking Molecular Relational Learning

    Authors: Sizhe Liu, Jun Xia, Lecheng Zhang, Yuchen Liu, Yue Liu, Wenjie Du, Zhangyang Gao, Bozhen Hu, Cheng Tan, Hongxin Xiang, Stan Z. Li

    Abstract: Molecular relational learning (MRL) is crucial for understanding the interaction behaviors between molecular pairs, a critical aspect of drug discovery and development. However, the large feasible model space of MRL poses significant challenges to benchmarking, and existing MRL frameworks face limitations in flexibility and scope. To address these challenges, avoid repetitive coding efforts, and e… ▽ More

    Submitted 19 October, 2024; originally announced October 2024.

  3. arXiv:2410.14169  [pdf, other]

    cs.CV

    DaRePlane: Direction-aware Representations for Dynamic Scene Reconstruction

    Authors: Ange Lou, Benjamin Planche, Zhongpai Gao, Yamin Li, Tianyu Luan, Hao Ding, Meng Zheng, Terrence Chen, Ziyan Wu, Jack Noble

    Abstract: Numerous recent approaches to modeling and re-rendering dynamic scenes leverage plane-based explicit representations, addressing slow training times associated with models like neural radiance fields (NeRF) and Gaussian splatting (GS). However, merely decomposing 4D dynamic scenes into multiple 2D plane-based representations is insufficient for high-fidelity re-rendering of scenes with complex mot… ▽ More

    Submitted 18 October, 2024; originally announced October 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2403.02265

  4. arXiv:2410.12475  [pdf]

    cs.MA

    Aegis: An Advanced LLM-Based Multi-Agent for Intelligent Functional Safety Engineering

    Authors: Lu Shi, Bin Qi, Jiarui Luo, Yang Zhang, Zhanzhao Liang, Zhaowei Gao, Wenke Deng, Lin Sun

    Abstract: Functional safety is a critical aspect of automotive engineering, encompassing all phases of a vehicle's lifecycle, including design, development, production, operation, and decommissioning. This domain involves highly knowledge-intensive tasks. This paper introduces Aegis: An Advanced LLM-Based Multi-Agent for Intelligent Functional Safety Engineering. Aegis is specifically designed to support co… ▽ More

    Submitted 17 October, 2024; v1 submitted 16 October, 2024; originally announced October 2024.

  5. arXiv:2410.12214  [pdf, other]

    cs.CV cs.AI

    Order-aware Interactive Segmentation

    Authors: Bin Wang, Anwesa Choudhuri, Meng Zheng, Zhongpai Gao, Benjamin Planche, Andong Deng, Qin Liu, Terrence Chen, Ulas Bagci, Ziyan Wu

    Abstract: Interactive segmentation aims to accurately segment target objects with minimal user interactions. However, current methods often fail to accurately separate target objects from the background, due to a limited understanding of order, the relative depth between objects in a scene. To address this issue, we propose OIS: order-aware interactive segmentation, where we explicitly encode the relative d… ▽ More

    Submitted 17 October, 2024; v1 submitted 16 October, 2024; originally announced October 2024.

    Comments: Interactive demo can be found in project page: https://ukaukaaaa.github.io/projects/OIS/index.html

  6. arXiv:2410.10833  [pdf, other]

    cs.DC cs.AI cs.LG

    Online Client Scheduling and Resource Allocation for Efficient Federated Edge Learning

    Authors: Zhidong Gao, Zhenxiao Zhang, Yu Zhang, Tongnian Wang, Yanmin Gong, Yuanxiong Guo

    Abstract: Federated learning (FL) enables edge devices to collaboratively train a machine learning model without sharing their raw data. Due to its privacy-protecting benefits, FL has been deployed in many real-world applications. However, deploying FL over mobile edge networks with constrained resources such as power, bandwidth, and computation suffers from high training latency and low model accuracy, par… ▽ More

    Submitted 28 September, 2024; originally announced October 2024.

    Comments: 13 pages, 6 figures

  7. arXiv:2410.09562  [pdf, other]

    cs.HC

    SituFont: A Just-in-Time Adaptive Intervention System for Enhancing Mobile Readability in Situational Visual Impairments

    Authors: Kun Yue, Mingshan Zhang, Jingruo Chen, Chun Yu, Kexin Nie, Zhiqi Gao, Jinghan Yang, Chen Liang, Yuanchun Shi

    Abstract: Situational visual impairments (SVIs) significantly impact mobile readability, causing user discomfort and hindering information access. This paper introduces SituFont, a novel just-in-time adaptive intervention (JITAI) system designed to enhance mobile text readability by semi-automatically adjusting font parameters in response to real-time contextual changes. Leveraging smartphone sensors and a… ▽ More

    Submitted 12 October, 2024; originally announced October 2024.

  8. arXiv:2410.07577  [pdf, other]

    cs.CV

    3D Vision-Language Gaussian Splatting

    Authors: Qucheng Peng, Benjamin Planche, Zhongpai Gao, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Chen Chen, Ziyan Wu

    Abstract: Recent advancements in 3D reconstruction methods and vision-language models have propelled the development of multi-modal 3D scene understanding, which has vital applications in robotics, autonomous driving, and virtual/augmented reality. However, current multi-modal scene understanding approaches have naively embedded semantic representations into 3D reconstruction methods without striking a bala… ▽ More

    Submitted 9 October, 2024; originally announced October 2024.

    Comments: main paper + supplementary material

  9. arXiv:2410.04974  [pdf, other]

    cs.CV cs.AI

    6DGS: Enhanced Direction-Aware Gaussian Splatting for Volumetric Rendering

    Authors: Zhongpai Gao, Benjamin Planche, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Ziyan Wu

    Abstract: Novel view synthesis has advanced significantly with the development of neural radiance fields (NeRF) and 3D Gaussian splatting (3DGS). However, achieving high quality without compromising real-time rendering remains challenging, particularly for physically-based ray tracing with view-dependent effects. Recently, N-dimensional Gaussians (N-DG) introduced a 6D spatial-angular representation to bett… ▽ More

    Submitted 10 October, 2024; v1 submitted 7 October, 2024; originally announced October 2024.

    Comments: Project: https://gaozhongpai.github.io/6dgs/ and fixed iteration typos

  10. arXiv:2410.04612  [pdf, other]

    cs.LG cs.AI cs.CL

    Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

    Authors: Zhaolin Gao, Wenhao Zhan, Jonathan D. Chang, Gokul Swamy, Kianté Brantley, Jason D. Lee, Wen Sun

    Abstract: Large Language Models (LLMs) have achieved remarkable success at tasks like summarization that involve a single turn of interaction. However, they can still struggle with multi-turn tasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior di… ▽ More

    Submitted 6 October, 2024; originally announced October 2024.

  11. arXiv:2410.03786  [pdf, other]

    cs.HC cs.AI cs.CY

    AI-rays: Exploring Bias in the Gaze of AI Through a Multimodal Interactive Installation

    Authors: Ziyao Gao, Yiwen Zhang, Ling Li, Theodoros Papatheodorou, Wei Zeng

    Abstract: Data surveillance has become more covert and pervasive with AI algorithms, which can result in biased social classifications. Appearance offers intuitive identity signals, but what does it mean to let AI observe and speculate on them? We introduce AI-rays, an interactive installation where AI generates speculative identities from participants' appearance which are expressed through synthesized per… ▽ More

    Submitted 3 October, 2024; originally announced October 2024.

    Comments: Siggraph Asia 2024 Art Paper

  12. arXiv:2410.01707  [pdf, other]

    cs.CL cs.AI

    Interpretable Contrastive Monte Carlo Tree Search Reasoning

    Authors: Zitian Gao, Boye Niu, Xuzheng He, Haotian Xu, Hongzhang Liu, Aiwei Liu, Xuming Hu, Lijie Wen

    Abstract: We propose SC-MCTS*, a novel Monte Carlo Tree Search (MCTS) reasoning algorithm for Large Language Models (LLMs) that significantly improves both reasoning accuracy and speed. Our motivation comes from: 1. Previous MCTS LLM reasoning works often overlooked its biggest drawback, slower speed compared to CoT; 2. Previous research mainly used MCTS as a tool for LLM reasoning on various tasks with limited… ▽ More

    Submitted 11 October, 2024; v1 submitted 2 October, 2024; originally announced October 2024.

  13. arXiv:2410.01669  [pdf, other]

    cs.LG stat.ML

    Sparse Covariance Neural Networks

    Authors: Andrea Cavallo, Zhan Gao, Elvin Isufi

    Abstract: Covariance Neural Networks (VNNs) perform graph convolutions on the covariance matrix of tabular data and achieve success in a variety of applications. However, the empirical covariance matrix on which the VNNs operate may contain many spurious correlations, making VNNs' performance inconsistent due to these noisy estimates and decreasing their computational efficiency. To tackle this issue, we pu… ▽ More

    Submitted 2 October, 2024; originally announced October 2024.

  14. arXiv:2410.00362  [pdf, other]

    cs.CL cs.AI

    FedPT: Federated Proxy-Tuning of Large Language Models on Resource-Constrained Edge Devices

    Authors: Zhidong Gao, Yu Zhang, Zhenxiao Zhang, Yanmin Gong, Yuanxiong Guo

    Abstract: Despite demonstrating superior performance across a variety of linguistic tasks, pre-trained large language models (LMs) often require fine-tuning on specific datasets to effectively address different downstream tasks. However, fine-tuning these LMs for downstream tasks necessitates collecting data from individuals, which raises significant privacy concerns. Federated learning (FL) has emerged as… ▽ More

    Submitted 30 September, 2024; originally announced October 2024.

    Comments: 29 pages, 19 figures

  15. arXiv:2409.19603  [pdf, other]

    cs.CV cs.AI

    One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

    Authors: Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou

    Abstract: We introduce VideoLISA, a video-based multimodal large language model designed to tackle the problem of language-instructed reasoning segmentation in videos. Leveraging the reasoning capabilities and world knowledge of large language models, and augmented by the Segment Anything Model, VideoLISA generates temporally consistent segmentation masks in videos based on language instructions. Existing i… ▽ More

    Submitted 29 September, 2024; originally announced September 2024.

    Comments: Accepted by NeurIPS 2024

  16. arXiv:2409.19513  [pdf, other]

    cs.LG cs.AI

    One Node Per User: Node-Level Federated Learning for Graph Neural Networks

    Authors: Zhidong Gao, Yuanxiong Guo, Yanmin Gong

    Abstract: Graph Neural Networks (GNNs) training often necessitates gathering raw user data on a central server, which raises significant privacy concerns. Federated learning emerges as a solution, enabling collaborative model training without users directly sharing their raw data. However, integrating federated learning with GNNs presents unique challenges, especially when a client represents a graph node a… ▽ More

    Submitted 28 September, 2024; originally announced September 2024.

    Comments: 16 pages, 9 figures

  17. arXiv:2409.19509  [pdf, other]

    cs.LG cs.AI cs.DC

    Heterogeneity-Aware Resource Allocation and Topology Design for Hierarchical Federated Edge Learning

    Authors: Zhidong Gao, Yu Zhang, Yanmin Gong, Yuanxiong Guo

    Abstract: Federated Learning (FL) provides a privacy-preserving framework for training machine learning models on mobile edge devices. Traditional FL algorithms, e.g., FedAvg, impose a heavy communication workload on these devices. To mitigate this issue, Hierarchical Federated Edge Learning (HFEL) has been proposed, leveraging edge servers as intermediaries for model aggregation. Despite its effectiveness,… ▽ More

    Submitted 28 September, 2024; originally announced September 2024.

    Comments: 12 pages, 9 figures

  18. arXiv:2409.17746  [pdf, other]

    eess.AS cs.SD

    Paraformer-v2: An improved non-autoregressive transformer for noise-robust speech recognition

    Authors: Keyu An, Zerui Li, Zhifu Gao, Shiliang Zhang

    Abstract: Attention-based encoder-decoder, e.g. transformer and its variants, generates the output sequence in an autoregressive (AR) manner. Despite its superior performance, AR model is computationally inefficient as its generation requires as many iterations as the output length. In this paper, we propose Paraformer-v2, an improved version of Paraformer, for fast, accurate, and noise-robust non-autoregre… ▽ More

    Submitted 26 September, 2024; originally announced September 2024.

    Comments: NCMMSC 2024 best paper

  19. arXiv:2409.16685  [pdf, other]

    cs.CV

    Skyeyes: Ground Roaming using Aerial View Images

    Authors: Zhiyuan Gao, Wenbin Teng, Gonglin Chen, Jinsen Wu, Ningli Xu, Rongjun Qin, Andrew Feng, Yajie Zhao

    Abstract: Integrating aerial imagery-based scene generation into applications like autonomous driving and gaming enhances realism in 3D environments, but challenges remain in creating detailed content for occluded areas and ensuring real-time, consistent rendering. In this paper, we introduce Skyeyes, a novel framework that can generate photorealistic sequences of ground view images using only aerial view i… ▽ More

    Submitted 25 September, 2024; originally announced September 2024.

  20. arXiv:2409.12612  [pdf, other]

    cs.CV

    Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning

    Authors: Cong Yang, Zuchao Li, Hongzan Jiao, Zhi Gao, Lefei Zhang

    Abstract: Recently, while significant progress has been made in remote sensing image change captioning, existing methods fail to filter out areas unrelated to actual changes, making models susceptible to irrelevant features. In this article, we propose a novel multimodal framework for remote sensing image change captioning, guided by Key Change Features and Instruction-tuned (KCFI). This framework aims to f… ▽ More

    Submitted 19 September, 2024; originally announced September 2024.

  21. arXiv:2409.09724  [pdf, other]

    cs.CV

    MFCLIP: Multi-modal Fine-grained CLIP for Generalizable Diffusion Face Forgery Detection

    Authors: Yaning Zhang, Tianyi Wang, Zitong Yu, Zan Gao, Linlin Shen, Shengyong Chen

    Abstract: The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia, highlighting the urgent need for robust and generalizable face forgery detection (FFD) techniques. Although existing approaches mainly capture face forgery patterns using image modality, other modalities like fine-grained noises and texts are not fully explored, which limits th… ▽ More

    Submitted 15 September, 2024; originally announced September 2024.

  22. arXiv:2409.09715  [pdf, ps, other]

    cs.IT cs.GT

    Generative Semantic Communication via Textual Prompts: Latency Performance Tradeoffs

    Authors: Mengmeng Ren, Li Qiao, Long Yang, Zhen Gao, Jian Chen, Mahdi Boloursaz Mashhadi, Pei Xiao, Rahim Tafazolli, Mehdi Bennis

    Abstract: This paper develops an edge-device collaborative Generative Semantic Communications (Gen SemCom) framework leveraging pre-trained Multi-modal/Vision Language Models (M/VLMs) for ultra-low-rate semantic communication via textual prompts. The proposed framework optimizes the use of M/VLMs on the wireless edge/device to generate high-fidelity textual prompts through visual captioning/question answeri… ▽ More

    Submitted 15 September, 2024; originally announced September 2024.

  23. arXiv:2409.08065  [pdf, other]

    cond-mat.supr-con cond-mat.mtrl-sci cs.AI physics.comp-ph

    AI-accelerated discovery of high critical temperature superconductors

    Authors: Xiao-Qi Han, Zhenfeng Ouyang, Peng-Jie Guo, Hao Sun, Ze-Feng Gao, Zhong-Yi Lu

    Abstract: The discovery of new superconducting materials, particularly those exhibiting high critical temperature ($T_c$), has been a vibrant area of study within the field of condensed matter physics. Conventional approaches primarily rely on physical intuition to search for potential superconductors within the existing databases. However, the known materials only scratch the surface of the extensive array… ▽ More

    Submitted 12 September, 2024; originally announced September 2024.

    Comments: 11 pages, 7 figures, 4 tables

  24. arXiv:2409.07994  [pdf, other]

    cs.NI

    Directional WPT Charging for Routing-Asymmetric WRSNs with a Mobile Charger

    Authors: Zhenguo Gao, Qi Zhang, Qingyu Gao, Yunlong Zhao, Hsiao-Chun Wu

    Abstract: Mobile Charge Scheduling for wirelessly charging nodes in Wireless Rechargeable Sensor Networks (WRSNs) is a promising but still evolving research area. Existing research mostly assumes a symmetric environment, where the routing costs in opposite directions between two locations are considered identical. However, various factors such as terrain restrictions and wind or water flows may invalidate t… ▽ More

    Submitted 12 September, 2024; originally announced September 2024.

    Comments: 15 pages, 5 figures

  25. arXiv:2409.07462  [pdf, other]

    q-bio.BM cs.LG

    S-MolSearch: 3D Semi-supervised Contrastive Learning for Bioactive Molecule Search

    Authors: Gengmo Zhou, Zhen Wang, Feng Yu, Guolin Ke, Zhewei Wei, Zhifeng Gao

    Abstract: Virtual Screening is an essential technique in the early phases of drug discovery, aimed at identifying promising drug candidates from vast molecular libraries. Recently, ligand-based virtual screening has garnered significant attention due to its efficacy in conducting extensive database screenings without relying on specific protein-binding site information. Obtaining binding affinity data for c… ▽ More

    Submitted 27 August, 2024; originally announced September 2024.

  26. arXiv:2409.05531  [pdf, other]

    cs.CV cs.AI

    HMAFlow: Learning More Accurate Optical Flow via Hierarchical Motion Field Alignment

    Authors: Dianbo Ma, Kousuke Imamura, Ziyan Gao, Xiangjie Wang, Satoshi Yamane

    Abstract: Optical flow estimation is a fundamental and long-standing visual task. In this work, we present a novel method, dubbed HMAFlow, to improve optical flow estimation in challenging scenes, particularly those involving small objects. The proposed model mainly consists of two core components: a Hierarchical Motion Field Alignment (HMA) module and a Correlation Self-Attention (CSA) module. In addition,… ▽ More

    Submitted 15 September, 2024; v1 submitted 9 September, 2024; originally announced September 2024.

    Comments: 11 pages, 6 figures

  27. arXiv:2409.05112  [pdf, other]

    cs.CL

    WaterSeeker: Pioneering Efficient Detection of Watermarked Segments in Large Documents

    Authors: Leyi Pan, Aiwei Liu, Yijian Lu, Zitian Gao, Yichen Di, Lijie Wen, Irwin King, Philip S. Yu

    Abstract: Watermarking algorithms for large language models (LLMs) have attained high accuracy in detecting LLM-generated text. However, existing methods primarily focus on distinguishing fully watermarked text from non-watermarked text, overlooking real-world scenarios where LLMs generate only small sections within large documents. In this scenario, balancing time complexity and detection performance poses… ▽ More

    Submitted 15 October, 2024; v1 submitted 8 September, 2024; originally announced September 2024.

    Comments: 20 pages, 7 figures, 8 tables

    MSC Class: 68T50; ACM Class: I.2.7

  28. arXiv:2409.04977  [pdf]

    cs.LG cs.AI cs.CV

    Enhancing Convolutional Neural Networks with Higher-Order Numerical Difference Methods

    Authors: Qi Wang, Zijun Gao, Mingxiu Sui, Taiyuan Mei, Xiaohan Cheng, Iris Li

    Abstract: With the rise of deep learning technology in practical applications, Convolutional Neural Networks (CNNs) have been able to assist humans in solving many real-world problems. To enhance the performance of CNNs, numerous network architectures have been explored. Some of these architectures are designed based on the accumulated experience of researchers over time, while others are designed through n… ▽ More

    Submitted 8 September, 2024; originally announced September 2024.

  29. arXiv:2409.04290  [pdf, other]

    cs.LG cs.AI

    CoxKAN: Kolmogorov-Arnold Networks for Interpretable, High-Performance Survival Analysis

    Authors: William Knottenbelt, Zeyu Gao, Rebecca Wray, Woody Zhidong Zhang, Jiashuai Liu, Mireia Crispin-Ortuzar

    Abstract: Survival analysis is a branch of statistics used for modeling the time until a specific event occurs and is widely used in medicine, engineering, finance, and many other fields. When choosing survival models, there is typically a trade-off between performance and interpretability, where the highest performance is achieved by black-box models based on deep learning. This is a major problem in field… ▽ More

    Submitted 6 September, 2024; originally announced September 2024.

  30. arXiv:2409.04022  [pdf, other]

    cs.DC cs.LG

    Heterogeneity-Aware Cooperative Federated Edge Learning with Adaptive Computation and Communication Compression

    Authors: Zhenxiao Zhang, Zhidong Gao, Yuanxiong Guo, Yanmin Gong

    Abstract: Motivated by the drawbacks of cloud-based federated learning (FL), cooperative federated edge learning (CFEL) has been proposed to improve efficiency for FL over mobile edge networks, where multiple edge servers collaboratively coordinate the distributed model training across a large number of edge devices. However, CFEL faces critical challenges arising from dynamic and heterogeneous device prope… ▽ More

    Submitted 6 September, 2024; originally announced September 2024.

    Comments: 20 pages, 7 figures

  31. arXiv:2409.03258  [pdf, other]

    cs.CL

    GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding

    Authors: Yukun Cao, Shuo Han, Zengyi Gao, Zezhong Ding, Xike Xie, S. Kevin Zhou

    Abstract: Although Large Language Models (LLMs) have demonstrated potential in processing graphs, they struggle with comprehending graphical structure information through prompts of graph description sequences, especially as the graph size increases. We attribute this challenge to the uneven memory performance of LLMs across different positions in graph description sequences, known as ''positional biases''.… ▽ More

    Submitted 17 October, 2024; v1 submitted 5 September, 2024; originally announced September 2024.

  32. arXiv:2409.03155  [pdf, other]

    cs.CL cs.AI

    Debate on Graph: a Flexible and Reliable Reasoning Framework for Large Language Models

    Authors: Jie Ma, Zhitao Gao, Qi Chai, Wangchun Sun, Pinghui Wang, Hongbin Pei, Jing Tao, Lingyun Song, Jun Liu, Chen Zhang, Lizhen Cui

    Abstract: Large Language Models (LLMs) may suffer from hallucinations in real-world applications due to the lack of relevant knowledge. In contrast, knowledge graphs encompass extensive, multi-relational structures that store a vast array of symbolic facts. Consequently, integrating LLMs with knowledge graphs has been extensively explored, with Knowledge Graph Question Answering (KGQA) serving as a critical… ▽ More

    Submitted 4 September, 2024; originally announced September 2024.

    Comments: 12 pages

    ACM Class: I.2.4

  33. arXiv:2409.02920  [pdf, other]

    cs.RO cs.AI cs.CL

    RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins (early version)

    Authors: Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, Ping Luo

    Abstract: Effective collaboration of dual-arm robots and their tool use capabilities are increasingly important areas in the advancement of robotics. These skills play a significant role in expanding robots' ability to operate in diverse real-world environments. However, progress is impeded by the scarcity of specialized training data. This paper introduces RoboTwin, a novel benchmark dataset combining real… ▽ More

    Submitted 4 September, 2024; originally announced September 2024.

    Comments: Project page: https://robotwin-benchmark.github.io/early-version/

  34. arXiv:2409.02310  [pdf, other]

    cs.CV

    Geometry-aware Feature Matching for Large-Scale Structure from Motion

    Authors: Gonglin Chen, Jinsen Wu, Haiwei Chen, Wenbin Teng, Zhiyuan Gao, Andrew Feng, Rongjun Qin, Yajie Zhao

    Abstract: Establishing consistent and dense correspondences across multiple images is crucial for Structure from Motion (SfM) systems. Significant view changes, such as air-to-ground with very sparse view overlap, pose an even greater challenge to the correspondence solvers. We present a novel optimization-based approach that significantly enhances existing feature matching methods by introducing geometry c… ▽ More

    Submitted 25 September, 2024; v1 submitted 3 September, 2024; originally announced September 2024.

  35. arXiv:2408.16690  [pdf, other]

    cs.CV

    Generic Objects as Pose Probes for Few-Shot View Synthesis

    Authors: Zhirui Gao, Renjiao Yi, Chenyang Zhu, Ke Zhuang, Wei Chen, Kai Xu

    Abstract: Radiance fields including NeRFs and 3D Gaussians demonstrate great potential in high-fidelity rendering and scene reconstruction, while they require a substantial number of posed images as inputs. COLMAP is frequently employed for preprocessing to estimate poses, while it necessitates a large number of feature matches to operate effectively, and it struggles with scenes characterized by sparse fea… ▽ More

    Submitted 1 September, 2024; v1 submitted 29 August, 2024; originally announced August 2024.

  36. arXiv:2408.16486  [pdf, other]

    cs.CV

    Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning

    Authors: Zhengqing Gao, Xiang Ao, Xu-Yao Zhang, Cheng-Lin Liu

    Abstract: Adapting pre-trained models to open classes is a challenging problem in machine learning. Vision-language models fully explore the knowledge of text modality, demonstrating strong zero-shot recognition performance, which is naturally suited for various open-set problems. More recently, some research focuses on fine-tuning such models to downstream tasks. Prompt tuning methods achieved huge improve… ▽ More

    Submitted 29 August, 2024; originally announced August 2024.

    Comments: PRCV 2024

  37. arXiv:2408.14427  [pdf, other]

    cs.CV

    Few-Shot 3D Volumetric Segmentation with Multi-Surrogate Fusion

    Authors: Meng Zheng, Benjamin Planche, Zhongpai Gao, Terrence Chen, Richard J. Radke, Ziyan Wu

    Abstract: Conventional 3D medical image segmentation methods typically require learning heavy 3D networks (e.g., 3D-UNet), as well as large amounts of in-domain data with accurate pixel/voxel-level labels to avoid overfitting. These solutions are thus extremely time- and labor-expensive, but also may easily fail to generalize to unseen objects during training. To alleviate this issue, we present MSFSeg, a n… ▽ More

    Submitted 26 August, 2024; originally announced August 2024.

    Comments: Accepted to MICCAI 2024

  38. arXiv:2408.12067   

    eess.SP cs.AI cs.NI

    Distributed Noncoherent Joint Transmission Based on Multi-Agent Reinforcement Learning for Dense Small Cell MISO Systems

    Authors: Shaozhuang Bai, Zhenzhen Gao, Xuewen Liao

    Abstract: We consider a dense small cell (DSC) network where multi-antenna small cell base stations (SBSs) transmit data to single-antenna users over a shared frequency band. To enhance capacity, a state-of-the-art technique known as noncoherent joint transmission (JT) is applied, enabling users to receive data from multiple coordinated SBSs. However, the sum rate maximization problem with noncoherent JT is… ▽ More

    Submitted 11 September, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

    Comments: After thorough discussions with my co-authors, we have identified certain issues with the paper that cannot be resolved through revisions. As a result, we have collectively decided to completely withdraw the paper from arXiv

  39. arXiv:2408.10789  [pdf, other]

    cs.CV

    Learning Part-aware 3D Representations by Fusing 2D Gaussians and Superquadrics

    Authors: Zhirui Gao, Renjiao Yi, Yuhang Huang, Wei Chen, Chenyang Zhu, Kai Xu

    Abstract: Low-level 3D representations, such as point clouds, meshes, NeRFs, and 3D Gaussians, are commonly used to represent 3D objects or scenes. However, humans usually perceive 3D objects or scenes at a higher level as a composition of parts or structures rather than points or voxels. Representing 3D as semantic parts can benefit further understanding and applications. We aim to solve part-aware 3D reco… ▽ More

    Submitted 20 August, 2024; originally announced August 2024.

  40. arXiv:2408.09694  [pdf, other]

    cs.RO

    An Efficient Deep Reinforcement Learning Model for Online 3D Bin Packing Combining Object Rearrangement and Stable Placement

    Authors: Peiwen Zhou, Ziyan Gao, Chenghao Li, Nak Young Chong

    Abstract: This paper presents an efficient deep reinforcement learning (DRL) framework for online 3D bin packing (3D-BPP). The 3D-BPP is an NP-hard problem significant in logistics, warehousing, and transportation, involving the optimal arrangement of objects inside a bin. Traditional heuristic algorithms often fail to address dynamic and physical constraints in real-time scenarios. We introduce a novel DRL… ▽ More

    Submitted 19 August, 2024; originally announced August 2024.

  41. arXiv:2408.09420  [pdf, other]

    q-fin.CP cs.CL cs.LG

    Enhancing Startup Success Predictions in Venture Capital: A GraphRAG Augmented Multivariate Time Series Method

    Authors: Zitian Gao, Yihao Xiao

    Abstract: In the Venture Capital (VC) industry, predicting the success of startups is challenging due to limited financial data and the need for subjective revenue forecasts. Previous methods based on time series analysis or deep learning often fall short as they fail to incorporate crucial inter-company relationships such as competition and collaboration. To address these issues, we propose a novel approach us… ▽ More

    Submitted 21 August, 2024; v1 submitted 18 August, 2024; originally announced August 2024.

  42. arXiv:2408.09095  [pdf, other]

    cs.SE

    Towards Better Answers: Automated Stack Overflow Post Updating

    Authors: Yubo Mai, Zhipeng Gao, Haoye Wang, Tingting Bi, Xing Hu, Xin Xia, Jianling Sun

    Abstract: Utilizing code snippets on Stack Overflow (SO) is a common practice among developers for problem-solving. Although SO code snippets serve as valuable resources, it is important to acknowledge their imperfections: reusing problematic code snippets can lead to the introduction of suboptimal or buggy code into software projects. SO comments often point out weaknesses of a post and provide valuable in… ▽ More

    Submitted 17 August, 2024; originally announced August 2024.

  43. arXiv:2408.06385  [pdf, other]

    cs.SE cs.AI cs.CL

    ViC: Virtual Compiler Is All You Need For Assembly Code Search

    Authors: Zeyu Gao, Hao Wang, Yuanda Wang, Chao Zhang

    Abstract: Assembly code search is vital for reducing the burden on reverse engineers, allowing them to quickly identify specific functions using natural language within vast binary programs. Despite its significance, this critical task is impeded by the complexities involved in building high-quality datasets. This paper explores training a Large Language Model (LLM) to emulate a general compiler. By leverag…

    Submitted 10 August, 2024; originally announced August 2024.

  44. arXiv:2408.06357  [pdf]

    cs.CV cs.AI

    Algorithm Research of ELMo Word Embedding and Deep Learning Multimodal Transformer in Image Description

    Authors: Xiaohan Cheng, Taiyuan Mei, Yun Zi, Qi Wang, Zijun Gao, Haowei Yang

    Abstract: Zero sample learning is an effective method for data deficiency. The existing embedded zero sample learning methods only use the known classes to construct the embedded space, so there is an overfitting of the known classes in the testing process. This project uses category semantic similarity measures to classify multiple tags. This enables it to incorporate unknown classes that have the same mea…

    Submitted 25 July, 2024; originally announced August 2024.

  45. arXiv:2408.05750  [pdf, other]

    cs.CV

    FADE: A Dataset for Detecting Falling Objects around Buildings in Video

    Authors: Zhigang Tu, Zitao Gao, Zhengbo Zhang, Chunluan Zhou, Junsong Yuan, Bo Du

    Abstract: Falling objects from buildings can cause severe injuries to pedestrians due to the great impact force they exert. Although surveillance cameras are installed around some buildings, it is challenging for humans to capture such events in surveillance videos due to the small size and fast motion of falling objects, as well as the complex background. Therefore, it is necessary to develop methods to au…

    Submitted 11 August, 2024; originally announced August 2024.

    Comments: 11 pages, 10 figures

  46. arXiv:2408.05479  [pdf, other]

    cs.CV

    ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack

    Authors: Ziyi Gao, Kai Chen, Zhipeng Wei, Tingshu Mou, Jingjing Chen, Zhiyu Tan, Hao Li, Yu-Gang Jiang

    Abstract: Recent diffusion-based unrestricted attacks generate imperceptible adversarial examples with high transferability compared to previous unrestricted attacks and restricted attacks. However, existing works on diffusion-based unrestricted attacks are mostly focused on images yet are seldom explored in videos. In this paper, we propose the Recursive Token Merging for Video Diffusion-based Unrestricted…

    Submitted 10 August, 2024; originally announced August 2024.

  47. arXiv:2408.04638  [pdf, other]

    cs.CL cs.CY

    Affective Computing in the Era of Large Language Models: A Survey from the NLP Perspective

    Authors: Yiqun Zhang, Xiaocui Yang, Xingle Xu, Zeran Gao, Yijie Huang, Shiyi Mu, Shi Feng, Daling Wang, Yifei Zhang, Kaisong Song, Ge Yu

    Abstract: Affective Computing (AC), integrating computer science, psychology, and cognitive science knowledge, aims to enable machines to recognize, interpret, and simulate human emotions. To create more value, AC can be applied to diverse scenarios, including social media, finance, healthcare, education, etc. Affective Computing (AC) includes two mainstream tasks, i.e., Affective Understanding (AU) and Affe…

    Submitted 30 July, 2024; originally announced August 2024.

  48. arXiv:2408.03544  [pdf, other]

    cs.CL cs.AI

    NatLan: Native Language Prompting Facilitates Knowledge Elicitation Through Language Trigger Provision and Domain Trigger Retention

    Authors: Baixuan Li, Yunlong Fan, Tianyi Ma, Zhiqiang Gao

    Abstract: Multilingual large language models (MLLMs) do not perform as well when answering questions in non-dominant languages as they do in their dominant languages. Although existing translate-then-answer methods alleviate this issue, the mechanisms behind their effectiveness remain unclear. In this study, we analogize the dominant language of MLLMs to the native language of humans and use two human cogni…

    Submitted 15 October, 2024; v1 submitted 7 August, 2024; originally announced August 2024.

  49. arXiv:2408.02987  [pdf]

    cs.LG

    A Differential Smoothness-based Compact-Dynamic Graph Convolutional Network for Spatiotemporal Signal Recovery

    Authors: Pengcheng Gao, Zicheng Gao, Ye Yuan

    Abstract: High-quality spatiotemporal signals are vitally important for real application scenarios such as energy management, traffic planning, and cyber security. Due to uncontrollable factors such as abrupt sensor breakdowns or communication faults, the spatiotemporal signal collected by sensors is always incomplete. A dynamic graph convolutional network (DGCN) is effective for processing spatiotemporal signal…

    Submitted 6 August, 2024; originally announced August 2024.

  50. arXiv:2407.21757  [pdf, other]

    cs.CV cs.MM

    Learning Video Context as Interleaved Multimodal Sequences

    Authors: Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou

    Abstract: Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identify who, relationship, and reason). In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts. Our core idea is to represent videos as i…

    Submitted 12 September, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024