Skip to main content

Showing 1–50 of 423 results for author: Yuan, L

  1. arXiv:2410.14225  [pdf, other

    cs.CL cs.AI

    Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model

    Authors: Li Yuan, Yi Cai, Junsheng Huang

    Abstract: Joint Multimodal Entity-Relation Extraction (JMERE) is a challenging task that aims to extract entities and their relations from text-image pairs in social media posts. Existing methods for JMERE require large amounts of labeled data. However, gathering and annotating fine-grained multimodal data for JMERE poses significant challenges. Initially, we construct diverse and comprehensive multimodal f… ▽ More

    Submitted 18 October, 2024; originally announced October 2024.

    Comments: accepted by ACM MM 2024

  2. arXiv:2410.11842  [pdf, other

    cs.CV cs.AI cs.LG

    MoH: Multi-Head Attention as Mixture-of-Head Attention

    Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan

    Abstract: In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to improve efficiency while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that tr… ▽ More

    Submitted 15 October, 2024; originally announced October 2024.

    Comments: 23 pages, code: https://github.com/SkyworkAI/MoH

  3. arXiv:2410.10179  [pdf, other

    cs.LG cs.CL

    Is Parameter Collision Hindering Continual Learning in LLMs?

    Authors: Shuo Yang, Kun-Peng Ning, Yu-Yang Liu, Jia-Yu Yao, Yong-Hong Tian, Yi-Bing Song, Li Yuan

    Abstract: Large Language Models (LLMs) often suffer from catastrophic forgetting when learning multiple tasks sequentially, making continual learning (CL) essential for their dynamic deployment. Existing state-of-the-art (SOTA) methods, such as O-LoRA, typically focus on constructing orthogonality tasks to decouple parameter interdependence from various domains.In this paper, we reveal that building non-col… ▽ More

    Submitted 14 October, 2024; originally announced October 2024.

  4. arXiv:2410.07348  [pdf, other

    cs.LG cs.AI

    MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts

    Authors: Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan

    Abstract: In this work, we aim to simultaneously enhance the effectiveness and efficiency of Mixture-of-Experts (MoE) methods. To achieve this, we propose MoE++, a general and heterogeneous MoE framework that integrates both Feed-Forward Network~(FFN) and zero-computation experts. Specifically, we introduce three types of zero-computation experts: the zero expert, copy expert, and constant expert, which cor… ▽ More

    Submitted 9 October, 2024; originally announced October 2024.

    Comments: 23 pages, Code: https://github.com/SkyworkAI/MoE-plus-plus

  5. arXiv:2409.15698  [pdf, other

    cs.LG cs.SI

    GraphGI:A GNN Explanation Method using Game Interaction

    Authors: Xingping Xian, Jianlu Liu, Tao Wu, Lin Yuan, Chao Wang, Baiyun Chen

    Abstract: Graph Neural Networks (GNNs) have garnered significant attention and have been extensively utilized across various domains. However, similar to other deep learning models, GNNs are often viewed as black-box models, making it challenging to interpret their prediction mechanisms. Current graph explanation techniques focus on identifying key nodes or edges, attributing the critical data features that… ▽ More

    Submitted 23 September, 2024; originally announced September 2024.

  6. arXiv:2409.13731  [pdf, other

    cs.CL cs.AI

    KAG: Boosting LLMs in Professional Domains via Knowledge Augmented Generation

    Authors: Lei Liang, Mengshu Sun, Zhengke Gui, Zhongshu Zhu, Zhouyu Jiang, Ling Zhong, Yuan Qu, Peilong Zhao, Zhongpu Bo, Jin Yang, Huaidong Xiong, Lin Yuan, Jun Xu, Zaoyang Wang, Zhiqiang Zhang, Wen Zhang, Huajun Chen, Wenguang Chen, Jun Zhou

    Abstract: The recently developed retrieval-augmented generation (RAG) technology has enabled the efficient construction of domain-specific applications. However, it also has limitations, including the gap between vector similarity and the relevance of knowledge reasoning, as well as insensitivity to knowledge logic, such as numerical values, temporal relations, expert rules, and others, which hinder the eff… ▽ More

    Submitted 26 September, 2024; v1 submitted 9 September, 2024; originally announced September 2024.

    Comments: 33 pages

  7. arXiv:2409.07912  [pdf, other

    cs.CE

    Multi-granularity Score-based Generative Framework Enables Efficient Inverse Design of Complex Organics

    Authors: Zijun Chen, Yu Wang, Liuzhenghao Lv, Hao Li, Zongying Lin, Li Yuan, Yonghong Tian

    Abstract: Efficiently retrieving an enormous chemical library to design targeted molecules is crucial for accelerating drug discovery, organic chemistry, and optoelectronic materials. Despite the emergence of generative models to produce novel drug-like molecules, in a more realistic scenario, the complexity of functional groups (e.g., pyrene, acenaphthylene, and bridged-ring systems) and extensive molecula… ▽ More

    Submitted 12 September, 2024; originally announced September 2024.

  8. arXiv:2409.04133  [pdf, other

    cs.CV cs.CY

    Secure Traffic Sign Recognition: An Attention-Enabled Universal Image Inpainting Mechanism against Light Patch Attacks

    Authors: Hangcheng Cao, Longzhi Yuan, Guowen Xu, Ziyang He, Zhengru Fang, Yuguang Fang

    Abstract: Traffic sign recognition systems play a crucial role in assisting drivers to make informed decisions while driving. However, due to the heavy reliance on deep learning technologies, particularly for future connected and autonomous driving, these systems are susceptible to adversarial attacks that pose significant safety risks to both personal and public transportation. Notably, researchers recentl… ▽ More

    Submitted 6 September, 2024; originally announced September 2024.

  9. arXiv:2409.02368  [pdf, other

    cs.CV

    Pluralistic Salient Object Detection

    Authors: Xuelu Feng, Yunsheng Li, Dongdong Chen, Chunming Qiao, Junsong Yuan, Lu Yuan, Gang Hua

    Abstract: We introduce pluralistic salient object detection (PSOD), a novel task aimed at generating multiple plausible salient segmentation results for a given input image. Unlike conventional SOD methods that produce a single segmentation mask for salient objects, this new setting recognizes the inherent complexity of real-world images, comprising multiple objects, and the ambiguity in defining salient ob… ▽ More

    Submitted 3 September, 2024; originally announced September 2024.

  10. arXiv:2409.02048  [pdf, other

    cs.CV

    ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    Authors: Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, Yonghong Tian

    Abstract: Despite recent advancements in neural 3D reconstruction, the dependence on dense multi-view captures restricts their broader applicability. In this work, we propose \textbf{ViewCrafter}, a novel method for synthesizing high-fidelity novel views of generic scenes from single or sparse images with the prior of video diffusion model. Our method takes advantage of the powerful generation capabilities… ▽ More

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: Project page: https://drexubery.github.io/ViewCrafter/

  11. arXiv:2409.01199  [pdf, other

    cs.CV eess.IV

    OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model

    Authors: Liuhan Chen, Zongjian Li, Bin Lin, Bin Zhu, Qian Wang, Shenghai Yuan, Xing Zhou, Xinhua Cheng, Li Yuan

    Abstract: Variational Autoencoder (VAE), compressing videos into latent representations, is a crucial preceding component of Latent Video Diffusion Models (LVDMs). With the same reconstruction quality, the more sufficient the VAE's compression for videos is, the more efficient the LVDMs are. However, most LVDMs utilize 2D image VAE, whose compression for videos is only in the spatial dimension and often ign… ▽ More

    Submitted 9 September, 2024; v1 submitted 2 September, 2024; originally announced September 2024.

    Comments: https://github.com/PKU-YuanGroup/Open-Sora-Plan

  12. arXiv:2408.17065  [pdf, other

    cs.CV

    Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning

    Authors: Zhiyuan Yan, Yandan Zhao, Shen Chen, Xinghe Fu, Taiping Yao, Shouhong Ding, Li Yuan

    Abstract: Three key challenges hinder the development of current deepfake video detection: (1) Temporal features can be complex and diverse: how can we identify general temporal artifacts to enhance model generalization? (2) Spatiotemporal models often lean heavily on one type of artifact and ignore the other: how can we ensure balanced learning from both? (3) Videos are naturally resource-intensive: how ca… ▽ More

    Submitted 30 August, 2024; originally announced August 2024.

  13. arXiv:2408.05432  [pdf, other

    cs.DB

    Simpler is More: Efficient Top-K Nearest Neighbors Search on Large Road Networks

    Authors: Yiqi Wang, Long Yuan, Wenjie Zhang, Xuemin Lin, Zi Chen, Qing Liu

    Abstract: Top-k Nearest Neighbors (kNN) problem on road network has numerous applications on location-based services. As direct search using the Dijkstra's algorithm results in a large search space, a plethora of complex-index-based approaches have been proposed to speedup the query processing. However, even with the current state-of-the-art approach, long query processing delays persist, along with signifi… ▽ More

    Submitted 10 August, 2024; originally announced August 2024.

    Comments: 15 pages, 15 figures

  14. arXiv:2407.19548  [pdf, other

    cs.CV

    Cycle3D: High-quality and Consistent Image-to-3D Generation via Generation-Reconstruction Cycle

    Authors: Zhenyu Tang, Junwu Zhang, Xinhua Cheng, Wangbo Yu, Chaoran Feng, Yatian Pang, Bin Lin, Li Yuan

    Abstract: Recent 3D large reconstruction models typically employ a two-stage process, including first generate multi-view images by a multi-view diffusion model, and then utilize a feed-forward model to reconstruct images to 3D content.However, multi-view diffusion models often produce low-quality and inconsistent images, adversely affecting the quality of the final 3D reconstruction. To address this issue,… ▽ More

    Submitted 28 July, 2024; originally announced July 2024.

    Comments: Project page: https://pku-yuangroup.github.io/Cycle3D/

  15. arXiv:2407.15187  [pdf, other

    cs.CV cs.GR

    HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions

    Authors: Haiyang Zhou, Xinhua Cheng, Wangbo Yu, Yonghong Tian, Li Yuan

    Abstract: 3D scene generation is in high demand across various domains, including virtual reality, gaming, and the film industry. Owing to the powerful generative capabilities of text-to-image diffusion models that provide reliable priors, the creation of 3D scenes using only text prompts has become viable, thereby significantly advancing researches in text-driven 3D scene generation. In order to obtain mul… ▽ More

    Submitted 21 July, 2024; originally announced July 2024.

    Comments: Homepage: https://zhouhyocean.github.io/holodreamer

  16. arXiv:2407.14506  [pdf, other

    cs.CV cs.AI cs.CL

    On Pre-training of Multimodal Language Models Customized for Chart Understanding

    Authors: Wan-Cyuan Fan, Yen-Chun Chen, Mengchen Liu, Lu Yuan, Leonid Sigal

    Abstract: Recent studies customizing Multimodal Large Language Models (MLLMs) for domain-specific tasks have yielded promising results, especially in the field of scientific chart comprehension. These studies generally utilize visual instruction tuning with specialized datasets to enhance question and answer (QA) accuracy within the chart domain. However, they often neglect the fundamental discrepancy betwe… ▽ More

    Submitted 31 July, 2024; v1 submitted 19 July, 2024; originally announced July 2024.

  17. arXiv:2407.10528  [pdf, other

    cs.CV

    Local Action-Guided Motion Diffusion Model for Text-to-Motion Generation

    Authors: Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Runyi Yu, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen

    Abstract: Text-to-motion generation requires not only grounding local actions in language but also seamlessly blending these individual actions to synthesize diverse and realistic global motions. However, existing motion generation methods primarily focus on the direct synthesis of global motions while neglecting the importance of generating and controlling local actions. In this paper, we propose the local… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  18. arXiv:2407.04681  [pdf, other

    cs.CV cs.AI cs.CL cs.LG

    Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge

    Authors: Yuanze Lin, Yunsheng Li, Dongdong Chen, Weijian Xu, Ronald Clark, Philip Torr, Lu Yuan

    Abstract: In recent years, multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets, enabling them to generally understand images well. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs, limiting their ability to answer questions requiring… ▽ More

    Submitted 5 July, 2024; originally announced July 2024.

  19. arXiv:2407.03856  [pdf, other

    cs.LG

    Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation

    Authors: Yi-Chen Li, Fuxiang Zhang, Wenjie Qiu, Lei Yuan, Chengxing Jia, Zongzhang Zhang, Yang Yu, Bo An

    Abstract: Large Language Models (LLMs), trained on a large amount of corpus, have demonstrated remarkable abilities. However, it may not be sufficient to directly apply open-source LLMs like Llama to certain real-world scenarios, since most of them are trained for \emph{general} purposes. Thus, the demands for customizing publicly available LLMs emerge, but are currently under-studied. In this work, we cons… ▽ More

    Submitted 5 October, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

  20. arXiv:2407.03187  [pdf

    cs.CY

    Holistic view of the road transportation system based on real-time data sharing mechanism

    Authors: Li Tao, Dong Xiang, Hao Junfeng, Yin Ping, Xu Xiaoxue, Lai Maokai, Li Yuan, Peng Ting

    Abstract: Traditional manual driving and single-vehicle-based intelligent driving have limitations in real-time and accurate acquisition of the current driving status and intentions of surrounding vehicles, leading to vehicles typically maintaining appropriate safe distances from each other. Yet, accidents still frequently occur, especially in merging areas; meanwhile, it is difficult to comprehensively obt… ▽ More

    Submitted 3 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

  21. arXiv:2407.00634  [pdf, other

    cs.CV cs.LG

    Tarsier: Recipes for Training and Evaluating Large Video Description Models

    Authors: Jiawei Wang, Liping Yuan, Yuchen Zhang, Haomiao Sun

    Abstract: Generating fine-grained video descriptions is a fundamental challenge in video understanding. In this work, we introduce Tarsier, a family of large-scale video-language models designed to generate high-quality video descriptions. Tarsier employs CLIP-ViT to encode frames separately and then uses an LLM to model temporal relationships. Despite its simple architecture, we demonstrate that with a met… ▽ More

    Submitted 24 September, 2024; v1 submitted 30 June, 2024; originally announced July 2024.

  22. arXiv:2406.18522  [pdf, other

    cs.CV cs.CL

    ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

    Authors: Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, Li Yuan

    Abstract: We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of the T2V models (e.g. Sora and Lumiere) in time-lapse video generation. In contrast to existing benchmarks that focus on visual quality and textual relevance of generated videos, ChronoMagic-Bench focuses on the model's ability to generate time-lapse videos with s… ▽ More

    Submitted 1 October, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

    Comments: NeurIPS D&B 2024 (Spotlight)

  23. arXiv:2406.18139  [pdf, other

    cs.CL cs.CV

    LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

    Authors: Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan

    Abstract: Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time efficiency. Unlike single-modality LLMs that manage only textual contexts, the KV cache of long-context MLLMs includes representations from multiple images with temp… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

  24. arXiv:2406.17962  [pdf, other

    cs.CL

    Crafting Customisable Characters with LLMs: Introducing SimsChat, a Persona-Driven Role-Playing Agent Framework

    Authors: Bohao Yang, Dong Liu, Chen Tang, Chenghao Xiao, Kun Zhao, Chao Li, Lin Yuan, Guang Yang, Lanxiao Huang, Chenghua Lin

    Abstract: Large Language Models (LLMs) demonstrate a remarkable ability to comprehend human instructions and generate high-quality text. This capability allows LLMs to function as agents that can emulate human beings at a more sophisticated level, beyond the mere replication of basic human behaviours. However, there is a lack of exploring into leveraging LLMs to craft characters from diverse aspects. In thi… ▽ More

    Submitted 16 August, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

  25. arXiv:2406.17530  [pdf, other

    cs.CV cs.RO

    Point Tree Transformer for Point Cloud Registration

    Authors: Meiling Wang, Guangyan Chen, Yi Yang, Li Yuan, Yufeng Yue

    Abstract: Point cloud registration is a fundamental task in the fields of computer vision and robotics. Recent developments in transformer-based methods have demonstrated enhanced performance in this domain. However, the standard attention mechanism utilized in these methods often integrates many low-relevance points, thereby struggling to prioritize its attention weights on sparse yet meaningful points. Th… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

  26. arXiv:2406.13920  [pdf, other

    cs.LG cs.SI

    Explainable AI Security: Exploring Robustness of Graph Neural Networks to Adversarial Attacks

    Authors: Tao Wu, Canyixing Cui, Xingping Xian, Shaojie Qiao, Chao Wang, Lin Yuan, Shui Yu

    Abstract: Graph neural networks (GNNs) have achieved tremendous success, but recent studies have shown that GNNs are vulnerable to adversarial attacks, which significantly hinders their use in safety-critical scenarios. Therefore, the design of robust GNNs has attracted increasing attention. However, existing research has mainly been conducted via experimental trial and error, and thus far, there remains a… ▽ More

    Submitted 19 June, 2024; originally announced June 2024.

  27. arXiv:2406.13499  [pdf, other

    cs.SI cs.LG

    GraphMU: Repairing Robustness of Graph Neural Networks via Machine Unlearning

    Authors: Tao Wu, Xinwen Cao, Chao Wang, Shaojie Qiao, Xingping Xian, Lin Yuan, Canyixing Cui, Yanbing Liu

    Abstract: Graph Neural Networks (GNNs) have demonstrated significant application potential in various fields. However, GNNs are still vulnerable to adversarial attacks. Numerous adversarial defense methods on GNNs are proposed to address the problem of adversarial attacks. However, these methods can only serve as a defense before poisoning, but cannot repair poisoned GNN. Therefore, there is an urgent need… ▽ More

    Submitted 19 June, 2024; originally announced June 2024.

  28. arXiv:2406.13495  [pdf, other

    cs.CV

    DF40: Toward Next-Generation Deepfake Detection

    Authors: Zhiyuan Yan, Taiping Yao, Shen Chen, Yandan Zhao, Xinghe Fu, Junwei Zhu, Donghao Luo, Li Yuan, Chengjie Wang, Shouhong Ding, Yunsheng Wu

    Abstract: We propose a new comprehensive benchmark to revolutionize the current deepfake detection field to the next generation. Predominantly, existing works identify top-notch detection algorithms and models by adhering to the common practice: training detectors on one specific dataset (e.g., FF++) and testing them on other prevalent deepfake datasets. This protocol is often regarded as a "golden compass"… ▽ More

    Submitted 19 June, 2024; originally announced June 2024.

  29. arXiv:2406.11721  [pdf, other

    cs.CL cs.AI cs.LG

    Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity

    Authors: Bingxiang He, Ning Ding, Cheng Qian, Jia Deng, Ganqu Cui, Lifan Yuan, Huan-ang Gao, Huimin Chen, Zhiyuan Liu, Maosong Sun

    Abstract: Understanding alignment techniques begins with comprehending zero-shot generalization brought by instruction tuning, but little of the mechanism has been understood. Existing work has largely been confined to the task level, without considering that tasks are artificially defined and, to LLMs, merely consist of tokens and representations. This line of research has been limited to examining transfe… ▽ More

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 33 pages, 14 figures

  30. arXiv:2406.04325  [pdf, other

    cs.CV

    ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

    Authors: Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Bin Lin, Zhenyu Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, Jiaqi Wang

    Abstract: We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V annotated dense captions of videos with various lengths and sources, developed through carefully designed data filtering and annotating st… ▽ More

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Project Page: https://sharegpt4video.github.io/

  31. arXiv:2405.20224  [pdf, other

    cs.CV

    EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images

    Authors: Wangbo Yu, Chaoran Feng, Jiye Tang, Xu Jia, Li Yuan, Yonghong Tian

    Abstract: 3D Gaussian Splatting (3D-GS) has demonstrated exceptional capabilities in 3D scene reconstruction and novel view synthesis. However, its training heavily depends on high-quality, sharp images and accurate camera poses. Fulfilling these requirements can be challenging in non-ideal real-world scenarios, where motion-blurred images are commonly encountered in high-speed moving cameras or low-light e… ▽ More

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: Project Page: https://drexubery.github.io/EvaGaussians/

  32. arXiv:2405.19465  [pdf, other

    cs.CV

    RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter

    Authors: Meng Cao, Haoran Tang, Jinfa Huang, Peng Jin, Can Zhang, Ruyang Liu, Long Chen, Xiaodan Liang, Li Yuan, Ge Li

    Abstract: Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning based on large-scale pre-trained visionlanguage models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitively expensive computation costs. To this end, we propose to conduct efficient… ▽ More

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: Accepted by ACL 2024 Findings

  33. arXiv:2405.18752  [pdf, other

    cs.MA eess.SY

    Resilient Average Consensus with Adversaries via Distributed Detection and Recovery

    Authors: Liwei Yuan, Hideaki Ishii

    Abstract: We study the problem of resilient average consensus in multi-agent systems where some of the agents are subject to failures or attacks. The objective of resilient average consensus is for non-faulty/normal agents to converge to the average of their initial values despite the erroneous effects from malicious agents. To this end, we propose a successful distributed iterative resilient average consen… ▽ More

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: 16 pages

  34. arXiv:2405.00587  [pdf, other

    cs.CV

    GraCo: Granularity-Controllable Interactive Segmentation

    Authors: Yian Zhao, Kehan Li, Zesen Cheng, Pengchong Qiao, Xiawu Zheng, Rongrong Ji, Chang Liu, Li Yuan, Jie Chen

    Abstract: Interactive Segmentation (IS) segments specific objects or parts in the image according to user input. Current IS pipelines fall into two categories: single-granularity output and multi-granularity output. The latter aims to alleviate the spatial ambiguity present in the former. However, the multi-granularity output pipeline suffers from limited interaction flexibility and produces redundant resul… ▽ More

    Submitted 16 May, 2024; v1 submitted 1 May, 2024; originally announced May 2024.

    Comments: CVPR2024 Highlight, Project: https://zhao-yian.github.io/GraCo

  35. EvaNet: Elevation-Guided Flood Extent Mapping on Earth Imagery (Extended Version)

    Authors: Mirza Tanzim Sami, Da Yan, Saugat Adhikari, Lyuheng Yuan, Jiao Han, Zhe Jiang, Jalal Khalil, Yang Zhou

    Abstract: Accurate and timely mapping of flood extent from high-resolution satellite imagery plays a crucial role in disaster management such as damage assessment and relief activities. However, current state-of-the-art solutions are based on U-Net, which can-not segment the flood pixels accurately due to the ambiguous pixels (e.g., tree canopies, clouds) that prevent a direct judgement from only the spectr… ▽ More

    Submitted 25 September, 2024; v1 submitted 27 April, 2024; originally announced April 2024.

    Comments: Published at the International Joint Conference on Artificial Intelligence (IJCAI, 2024)

  36. arXiv:2404.17170  [pdf, other

    cs.CV eess.IV

    Image Quality Assessment With Compressed Sampling

    Authors: Ronghua Liao, Chen Hui, Lang Yuan, Haiqi Zhu, Feng Jiang

    Abstract: No-Reference Image Quality Assessment (NR-IQA) aims at estimating image quality in accordance with subjective human perception. However, most methods focus on exploring increasingly complex networks to improve the final performance,accompanied by limitations on input images. Especially when applied to high-resolution (HR) images, these methods offen have to adjust the size of original image to mee… ▽ More

    Submitted 11 September, 2024; v1 submitted 26 April, 2024; originally announced April 2024.

  37. arXiv:2404.14219  [pdf, other

    cs.CL cs.AI

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Authors: Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai , et al. (104 additional authors not shown)

    Abstract: We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version… ▽ More

    Submitted 30 August, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: 24 pages

  38. arXiv:2404.13215  [pdf, other

    physics.app-ph cs.LG

    Machine Learning-Guided Design of Non-Reciprocal and Asymmetric Elastic Chiral Metamaterials

    Authors: Lingxiao Yuan, Emma Lejeune, Harold S. Park

    Abstract: There has been significant recent interest in the mechanics community to design structures that can either violate reciprocity, or exhibit elastic asymmetry or odd elasticity. While these properties are highly desirable to enable mechanical metamaterials to exhibit novel wave propagation phenomena, it remains an open question as to how to design passive structures that exhibit both significant non… ▽ More

    Submitted 19 April, 2024; originally announced April 2024.

  39. arXiv:2404.10237  [pdf, other

    cs.CV cs.CL

    Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models

    Authors: Songtao Jiang, Tuo Zheng, Yan Zhang, Yeying Jin, Li Yuan, Zuozhu Liu

    Abstract: Recent advancements in general-purpose or domain-specific multimodal large language models (LLMs) have witnessed remarkable progress for medical decision-making. However, they are designated for specific classification or generative tasks, and require model training or finetuning on large-scale datasets with sizeable parameters and tremendous computing, hindering their clinical utility across dive… ▽ More

    Submitted 1 September, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

  40. arXiv:2404.09619  [pdf, other

    cs.CV cs.AI

    UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

    Authors: Zhaokun Zhou, Qiulin Wang, Bin Lin, Yiwei Su, Rui Chen, Xin Tao, Amin Zheng, Li Yuan, Pengfei Wan, Di Zhang

    Abstract: As an alternative to expensive expert evaluation, Image Aesthetic Assessment (IAA) stands out as a crucial task in computer vision. However, traditional IAA methods are typically constrained to a single data source or task, restricting the universality and broader application. In this work, to better align with human aesthetics, we propose a Unified Multi-modal Image Aesthetic Assessment (UNIAA) f… ▽ More

    Submitted 15 April, 2024; originally announced April 2024.

  41. arXiv:2404.09292  [pdf, other

    cs.CV cs.AI

    Bridging Data Islands: Geographic Heterogeneity-Aware Federated Learning for Collaborative Remote Sensing Semantic Segmentation

    Authors: Jieyi Tan, Yansheng Li, Sergey A. Bartalev, Bo Dang, Wei Chen, Yongjun Zhang, Liangqi Yuan

    Abstract: Remote sensing semantic segmentation (RSS) is an essential task in Earth Observation missions. Due to data privacy concerns, high-quality remote sensing images with annotations cannot be well shared among institutions, making it difficult to fully utilize RSS data to train a generalized model. Federated Learning (FL), a privacy-preserving collaborative learning technology, is a potential solution.… ▽ More

    Submitted 14 April, 2024; originally announced April 2024.

    Comments: 13 pages,9 figures, 4 tables

  42. arXiv:2404.06037  [pdf, other

    cs.DC

    A Survey of Distributed Graph Algorithms on Massive Graphs

    Authors: Lingkai Meng, Yu Shao, Long Yuan, Longbin Lai, Peng Cheng, Xue Li, Wenyuan Yu, Wenjie Zhang, Xuemin Lin, Jingren Zhou

    Abstract: Distributed processing of large-scale graph data has many practical applications and has been widely studied. In recent years, a lot of distributed graph processing frameworks and algorithms have been proposed. While many efforts have been devoted to analyzing these, with most analyzing them based on programming models, less research focuses on understanding their challenges in distributed environ… ▽ More

    Submitted 9 April, 2024; originally announced April 2024.

  43. arXiv:2404.05014  [pdf, other

    cs.CV

    MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators

    Authors: Shenghai Yuan, Jinfa Huang, Yujun Shi, Yongqi Xu, Ruijie Zhu, Bin Lin, Xinhua Cheng, Li Yuan, Jiebo Luo

    Abstract: Recent advances in Text-to-Video generation (T2V) have achieved remarkable success in synthesizing high-quality general videos from textual descriptions. A largely overlooked problem in T2V is that existing models have not adequately encoded physical knowledge of the real world, thus generated videos tend to have limited motion and poor variations. In this paper, we propose \textbf{MagicTime}, a m… ▽ More

    Submitted 7 April, 2024; originally announced April 2024.

  44. arXiv:2404.02078  [pdf, other

    cs.AI cs.CL cs.LG

    Advancing LLM Reasoning Generalists with Preference Trees

    Authors: Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, Maosong Sun

    Abstract: We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B and CodeLlama-70B, Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 1… ▽ More

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: Models and data are available at https://github.com/OpenBMB/Eurus

  45. arXiv:2403.20041  [pdf

    cs.CL

    Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs

    Authors: Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie

    Abstract: The Large Language Model (LLM) is widely employed for tasks such as intelligent assistants, text summarization, translation, and multi-modality on mobile phones. However, the current methods for on-device LLM deployment maintain slow inference speed, which causes poor user experience. To facilitate high-efficiency LLM deployment on device GPUs, we propose four optimization techniques: (a) a symbol… ▽ More

    Submitted 5 July, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

    Comments: 21 pages, 6 figures, fix "E0M4" spell mistake, fix FLOPS to TFLOPS

  46. arXiv:2403.19963  [pdf, other

    cs.CV

    Efficient Modulation for Vision Networks

    Authors: Xu Ma, Xiyang Dai, Jianwei Yang, Bin Xiao, Yinpeng Chen, Yun Fu, Lu Yuan

    Abstract: In this work, we present efficient modulation, a novel design for efficient vision networks. We revisit the modulation mechanism, which operates input through convolutional context modeling and feature projection layers, and fuses features via element-wise multiplication and an MLP block. We demonstrate that the modulation mechanism is particularly well suited for efficient networks and further ta… ▽ More

    Submitted 28 March, 2024; originally announced March 2024.

    Comments: Accepted by ICLR 2024. Codes are made publically available at https://github.com/ma-xu/EfficientMod

  47. arXiv:2403.17935  [pdf, other

    cs.CV

    OmniVid: A Generative Framework for Universal Video Understanding

    Authors: Junke Wang, Dongdong Chen, Chong Luo, Bo He, Lu Yuan, Zuxuan Wu, Yu-Gang Jiang

    Abstract: The core of video understanding tasks, such as recognition, captioning, and tracking, is to automatically detect objects or actions in a video and analyze their temporal evolution. Despite sharing a common goal, different tasks often rely on distinct model architectures and annotation formats. In contrast, natural language processing benefits from a unified output space, i.e., text sequences, whic… ▽ More

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: Accepted by CVPR 2024

  48. arXiv:2403.16552  [pdf, other

    cs.NE cs.AI cs.CV

    QKFormer: Hierarchical Spiking Transformer using Q-K Attention

    Authors: Chenlin Zhou, Han Zhang, Zhaokun Zhou, Liutao Yu, Liwei Huang, Xiaopeng Fan, Li Yuan, Zhengyu Ma, Huihui Zhou, Yonghong Tian

    Abstract: Spiking Transformers, which integrate Spiking Neural Networks (SNNs) with Transformer architectures, have attracted significant attention due to their potential for energy efficiency and high performance. However, existing models in this domain still suffer from suboptimal performance. We introduce several innovations to improve the performance: i) We propose a novel spike-form Q-K attention mecha… ▽ More

    Submitted 8 October, 2024; v1 submitted 25 March, 2024; originally announced March 2024.

    Comments: Accepted by NeurIPS 2024 (Spotlight). Code and Model: https://github.com/zhouchenlin2096/QKFormer

  49. arXiv:2403.13248  [pdf, other

    cs.CV

    Mora: Enabling Generalist Video Generation via A Multi-Agent Framework

    Authors: Zhengqing Yuan, Yixin Liu, Yihan Cao, Weixiang Sun, Haolong Jia, Ruoxi Chen, Zhaoxu Li, Bin Lin, Li Yuan, Lifang He, Chi Wang, Yanfang Ye, Lichao Sun

    Abstract: Text-to-video generation has made significant strides, but replicating the capabilities of advanced systems like OpenAI Sora remains challenging due to their closed-source nature. Existing open-source methods struggle to achieve comparable performance, often hindered by ineffective agent collaboration and inadequate training data quality. In this paper, we introduce Mora, a novel multi-agent frame… ▽ More

    Submitted 3 October, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

  50. arXiv:2403.12852  [pdf, other

    eess.IV cs.CV

    Generative Enhancement for 3D Medical Images

    Authors: Lingting Zhu, Noel Codella, Dongdong Chen, Zhenchao Jin, Lu Yuan, Lequan Yu

    Abstract: The limited availability of 3D medical image datasets, due to privacy concerns and high collection or annotation costs, poses significant challenges in the field of medical imaging. While a promising alternative is the use of synthesized medical data, there are few solutions for realistic 3D medical image synthesis due to difficulties in backbone design and fewer 3D training samples compared to 2D… ▽ More

    Submitted 24 May, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

    Comments: 20 pages, 8 figures