Showing 1–50 of 201 results for author: Yue, X

  1. arXiv:2410.16153  [pdf, other]

    cs.CL cs.CV

    Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

    Authors: Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, Graham Neubig

    Abstract: Despite recent advances in multimodal large language models (MLLMs), their development has predominantly focused on English- and western-centric datasets and tasks, leaving most of the world's languages and diverse cultural contexts underrepresented. This paper introduces Pangea, a multilingual multimodal LLM trained on PangeaIns, a diverse 6M instruction dataset spanning 39 languages. PangeaIns f…

    Submitted 21 October, 2024; originally announced October 2024.

    Comments: 52 pages, 27 figures

  2. arXiv:2410.13824  [pdf, other]

    cs.CV cs.CL

    Harnessing Webpage UIs for Text-Rich Visual Understanding

    Authors: Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue

    Abstract: Text-rich visual understanding, the ability to process environments where dense textual content is integrated with visuals, is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Despite lacking dire…

    Submitted 18 October, 2024; v1 submitted 17 October, 2024; originally announced October 2024.

  3. arXiv:2410.13754  [pdf, other]

    cs.AI cs.LG cs.MM

    MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

    Authors: Jinjie Ni, Yifan Song, Deepanway Ghosal, Bo Li, David Junhao Zhang, Xiang Yue, Fuzhao Xue, Zian Zheng, Kaichen Zhang, Mahir Shah, Kabir Jain, Yang You, Michael Shieh

    Abstract: Perceiving and generating diverse modalities are crucial for AI models to effectively learn from and engage with real-world signals, necessitating reliable evaluations for their development. We identify two major issues in current evaluations: (1) inconsistent standards, shaped by different communities with varying protocols and maturity levels; and (2) significant query, grading, and generalizati…

    Submitted 18 October, 2024; v1 submitted 17 October, 2024; originally announced October 2024.

  4. arXiv:2410.13360  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG cs.MM

    Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

    Authors: Haoran Hao, Jiaming Han, Changsheng Li, Yu-Feng Li, Xiangyu Yue

    Abstract: The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, the lack of user-specific knowledge still restricts their application in people's daily lives. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for MLLMs' personalization. Starting from a general MLLM, we turn it into a…

    Submitted 17 October, 2024; originally announced October 2024.

  5. arXiv:2410.11382  [pdf, other]

    cs.LG math.NA

    Point-Calibrated Spectral Neural Operators

    Authors: Xihang Yue, Linchao Zhu, Yi Yang

    Abstract: Two typical neural models have been extensively studied for operator learning: learning in spatial space via an attention mechanism, or learning in spectral space via a spectral analysis technique such as the Fourier Transform. Spatial learning enables point-level flexibility but lacks a global continuity constraint, while spectral learning enforces a spectral continuity prior but lacks point-wise adaptivity. T…

    Submitted 15 October, 2024; originally announced October 2024.

  6. arXiv:2410.10563  [pdf, other]

    cs.CV

    MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

    Authors: Jiacheng Chen, Tianhao Liang, Sherman Siu, Zhengqing Wang, Kai Wang, Yubo Wang, Yuansheng Ni, Wang Zhu, Ziyan Jiang, Bohan Lyu, Dongfu Jiang, Xuan He, Yuan Liu, Hexiang Hu, Xiang Yue, Wenhu Chen

    Abstract: We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, to address the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a highly diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation. In particular, we collected 505 real…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: Technical report. Project page: https://tiger-ai-lab.github.io/MEGA-Bench/

  7. arXiv:2410.10511  [pdf, other]

    cs.CV

    Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling

    Authors: Wenze Liu, Le Zhuo, Yi Xin, Sheng Xia, Peng Gao, Xiangyu Yue

    Abstract: We introduce a new paradigm for AutoRegressive (AR) image generation, termed Set AutoRegressive Modeling (SAR). SAR generalizes the conventional AR to the next-set setting, i.e., splitting the sequence into arbitrary sets containing multiple tokens, rather than outputting each token in a fixed raster order. To accommodate SAR, we develop a straightforward architecture termed Fully Masked Transform…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: 19 pages, 17 figures, 8 tables, github repo: https://github.com/poppuppy/SAR

  8. arXiv:2410.08531  [pdf, other]

    cs.CV

    Diffusion Models Need Visual Priors for Image Generation

    Authors: Xiaoyu Yue, Zidong Wang, Zeyu Lu, Shuyang Sun, Meng Wei, Wanli Ouyang, Lei Bai, Luping Zhou

    Abstract: Conventional class-guided diffusion models generally succeed in generating images with correct semantic content, but often struggle with texture details. This limitation stems from the usage of class priors, which only provide coarse and limited conditional information. To address this issue, we propose Diffusion on Diffusion (DoD), an innovative multi-stage generation framework that first extract…

    Submitted 11 October, 2024; originally announced October 2024.

    Comments: Preprint

  9. arXiv:2410.08049  [pdf, other]

    cs.CV cs.AI cs.LG

    Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations

    Authors: Yiyuan Zhang, Xiaohan Ding, Xiangyu Yue

    Abstract: This paper proposes the paradigm of large convolutional kernels in designing modern Convolutional Neural Networks (ConvNets). We establish that employing a few large kernels, instead of stacking multiple smaller ones, can be a superior design strategy. Our work introduces a set of architecture design guidelines for large-kernel ConvNets that optimize their efficiency and performance. We propose th…

    Submitted 10 October, 2024; originally announced October 2024.

    Comments: This is the journal version of arXiv:2203.06717 and arXiv:2311.15599

  10. arXiv:2410.06526  [pdf, other]

    cs.DB

    KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

    Authors: Kaijing Ma, Xinrun Du, Yunran Wang, Haoran Zhang, Zhoufutu Wen, Xingwei Qu, Jian Yang, Jiaheng Liu, Minghao Liu, Xiang Yue, Wenhao Huang, Ge Zhang

    Abstract: In this paper, we introduce Knowledge-Orthogonal Reasoning (KOR), which minimizes the impact of domain-specific knowledge for a more accurate evaluation of models' reasoning abilities in out-of-distribution scenarios. Based on this concept, we propose the Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench), encompassing five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual. K…

    Submitted 17 October, 2024; v1 submitted 8 October, 2024; originally announced October 2024.

  11. arXiv:2410.01733  [pdf, other]

    cs.CL

    Visual Perception in Text Strings

    Authors: Qi Jia, Xiang Yue, Shanshan Huang, Ziheng Qin, Yizhu Liu, Bill Yuchen Lin, Yang You

    Abstract: Understanding visual semantics embedded in consecutive characters is a crucial capability for both large language models (LLMs) and multi-modal large language models (MLLMs). This type of artifact possesses the unique characteristic that identical information can be readily formulated in both texts and images, making them a significant proxy for analyzing modern LLMs' and MLLMs' capabilities in mo…

    Submitted 2 October, 2024; originally announced October 2024.

  12. arXiv:2410.01623  [pdf, other]

    cs.LG cs.AI

    Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?

    Authors: Xi Chen, Kaituo Feng, Changsheng Li, Xunhao Lai, Xiangyu Yue, Ye Yuan, Guoren Wang

    Abstract: Low-rank training has emerged as a promising approach for reducing memory usage in training Large Language Models (LLMs). Previous methods either rely on decomposing weight matrices (e.g., LoRA), or seek to decompose gradient matrices (e.g., GaLore) to ensure reduced memory consumption. However, both of them constrain the training in a low-rank subspace, thus inevitably leading to sub-optimal perf…

    Submitted 12 October, 2024; v1 submitted 2 October, 2024; originally announced October 2024.

    Comments: Add further analysis of the scaling factor, code is available at: https://github.com/xichen-fy/Fira

  13. arXiv:2409.18680  [pdf, other]

    cs.SD cs.AI cs.CL cs.MM eess.AS

    Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models

    Authors: Yiming Chen, Xianghu Yue, Xiaoxue Gao, Chen Zhang, Luis Fernando D'Haro, Robby T. Tan, Haizhou Li

    Abstract: Various audio-LLMs (ALLMs) have been explored recently for tackling different audio tasks simultaneously using a single, unified model. While existing evaluations of ALLMs primarily focus on single-audio tasks, real-world applications often involve processing multiple audio streams simultaneously. To bridge this gap, we propose the first multi-audio evaluation (MAE) benchmark that consists of 20 d…

    Submitted 1 October, 2024; v1 submitted 27 September, 2024; originally announced September 2024.

    Comments: EMNLP24 Findings

  14. arXiv:2409.14888  [pdf, other]

    cs.CV

    Advancing Video Quality Assessment for AIGC

    Authors: Xinli Yue, Jianhui Sun, Han Kong, Liangchao Yao, Tianyi Wang, Lei Li, Fengyun Rao, Jing Lv, Fan Xia, Yuetang Deng, Qian Wang, Lingchen Zhao

    Abstract: In recent years, AI generative models have made remarkable progress across various domains, including text generation, image generation, and video generation. However, assessing the quality of text-to-video generation is still in its infancy, and existing evaluation frameworks fall short when compared to those for natural videos. Current video quality assessment (VQA) methods primarily focus on ev…

    Submitted 23 September, 2024; originally announced September 2024.

    Comments: 5 pages, 1 figure

  15. arXiv:2409.14847  [pdf, other]

    cs.CV

    Revisiting Video Quality Assessment from the Perspective of Generalization

    Authors: Xinli Yue, Jianhui Sun, Liangchao Yao, Fan Xia, Yuetang Deng, Tianyi Wang, Lei Li, Fengyun Rao, Jing Lv, Qian Wang, Lingchen Zhao

    Abstract: The increasing popularity of short video platforms such as YouTube Shorts, TikTok, and Kwai has led to a surge in User-Generated Content (UGC), which presents significant challenges for the generalization performance of Video Quality Assessment (VQA) tasks. These challenges not only affect performance on test sets but also impact the ability to generalize across different datasets. While prior res…

    Submitted 23 September, 2024; originally announced September 2024.

    Comments: 13 pages, 4 figures

  16. arXiv:2409.13972  [pdf, other]

    cs.CL

    Can Language Model Understand Word Semantics as A Chatbot? An Empirical Study of Language Model Internal External Mismatch

    Authors: Jinman Zhao, Xueyan Zhang, Xingyu Yue, Weizhe Chen, Zifan Qian, Ruiyu Wang

    Abstract: Current common interactions with language models are through full inference. This approach may not necessarily align with the model's internal knowledge. Studies show discrepancies between prompts and internal representations. Most focus on sentence understanding. We study the discrepancy of word semantics understanding in internal and external mismatch across Encoder-only, Decoder-only, and Encode…

    Submitted 20 September, 2024; originally announced September 2024.

    Comments: 10 pages, 1 figure, 5 tables

  17. arXiv:2409.13665  [pdf, other]

    cs.LG physics.flu-dyn

    DiffFluid: Plain Diffusion Models are Effective Predictors of Flow Dynamics

    Authors: Dongyu Luo, Jianyu Wu, Jing Wang, Hairun Xie, Xiangyu Yue, Shixiang Tang

    Abstract: We show that plain diffusion models with Transformers are effective predictors of fluid dynamics under various working conditions, e.g., Darcy flow and high Reynolds number. Unlike traditional fluid dynamical solvers that depend on complex architectures to extract intricate correlations and learn underlying physical states, our approach formulates the prediction of flow dynamics as the image tr…

    Submitted 20 September, 2024; originally announced September 2024.

  18. arXiv:2409.07641  [pdf, ps, other]

    cs.CL

    SimulBench: Evaluating Language Models with Creative Simulation Tasks

    Authors: Qi Jia, Xiang Yue, Tianyu Zheng, Jie Huang, Bill Yuchen Lin

    Abstract: We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation tasks serve as effective measures of an LLM's general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop an e…

    Submitted 11 September, 2024; originally announced September 2024.

    Comments: Website: https://simulbench.github.io/

  19. arXiv:2409.07224  [pdf, other]

    cs.SD eess.AS

    Analytic Class Incremental Learning for Sound Source Localization with Privacy Protection

    Authors: Xinyuan Qian, Xianghu Yue, Jiadong Wang, Huiping Zhuang, Haizhou Li

    Abstract: Sound Source Localization (SSL) is an enabling technology for applications such as surveillance and robotics. While traditional Signal Processing (SP)-based SSL methods provide analytic solutions under specific signal and noise assumptions, recent Deep Learning (DL)-based methods have significantly outperformed them. However, their success depends on extensive training data and substantial computational…

    Submitted 11 September, 2024; originally announced September 2024.

  20. arXiv:2409.06851  [pdf, other]

    cs.CV cs.AI

    LIME: Less Is More for MLLM Evaluation

    Authors: King Zhu, Qianbo Zang, Shian Jia, Siwei Wu, Feiteng Fang, Yizhi Li, Shawn Gavin, Tuney Zheng, Jiawei Guo, Bo Li, Haoning Wu, Xingwei Qu, Jian Yang, Zachary Liu, Xiang Yue, J. H. Liu, Chenghua Lin, Min Yang, Shiwen Ni, Wenhao Huang, Ge Zhang

    Abstract: Multimodal Large Language Models (MLLMs) are evaluated on various benchmarks, such as image captioning, visual question answering, and reasoning. However, many of these benchmarks include overly simple or uninformative samples, complicating the effective distinction of different MLLMs' performance. Furthermore, evaluating models across numerous benchmarks incurs a significant computational burden.…

    Submitted 13 October, 2024; v1 submitted 10 September, 2024; originally announced September 2024.

  21. arXiv:2409.02813  [pdf, other]

    cs.CL cs.CV

    MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    Authors: Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, Graham Neubig

    Abstract: This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-o…

    Submitted 10 September, 2024; v1 submitted 4 September, 2024; originally announced September 2024.

  22. arXiv:2408.10539  [pdf, other]

    cs.CV

    Training Matting Models without Alpha Labels

    Authors: Wenze Liu, Zixuan Ye, Hao Lu, Zhiguo Cao, Xiangyu Yue

    Abstract: The labelling difficulty has been a longstanding problem in deep image matting. To escape from fine labels, this work explores using rough annotations such as trimaps coarsely indicating the foreground/background as supervision. We show that the cooperation between learned semantics from indicated known regions and proper assumed matting rules can help infer alpha values at transition areas. In…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: 12 pages, 12 figures

  23. arXiv:2408.10479  [pdf, other]

    cs.LG cs.AI

    An End-to-End Reinforcement Learning Based Approach for Micro-View Order-Dispatching in Ride-Hailing

    Authors: Xinlang Yue, Yiran Liu, Fangzhou Shi, Sihong Luo, Chen Zhong, Min Lu, Zhe Xu

    Abstract: Assigning orders to drivers under localized spatiotemporal context (micro-view order-dispatching) is a major task in Didi, as it influences ride-hailing service experience. Existing industrial solutions mainly follow a two-stage pattern that incorporates heuristic or learning-based algorithms with naive combinatorial methods, tackling the uncertainty of both sides' behaviors, including emerging tim…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: 8 pages, 4 figures

  24. arXiv:2406.18583  [pdf, other]

    cs.CV cs.LG

    Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

    Authors: Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao, Hongsheng Li, Peng Gao

    Abstract: Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lu…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Code at: https://github.com/Alpha-VLLM/Lumina-T2X

  25. arXiv:2406.09795  [pdf, other]

    cs.LG math.NA

    DeltaPhi: Learning Physical Trajectory Residual for PDE Solving

    Authors: Xihang Yue, Linchao Zhu, Yi Yang

    Abstract: Although neural operator networks theoretically approximate any operator mapping, the limited generalization capability prevents them from learning correct physical dynamics when potential data biases exist, particularly in the practical PDE solving scenario where the available data amount is restricted or the resolution is extremely low. To address this issue, we propose and formulate the Physica…

    Submitted 14 June, 2024; originally announced June 2024.

  26. arXiv:2406.09412  [pdf, other]

    cs.CV cs.AI cs.LG cs.MM

    Explore the Limits of Omni-modal Pretraining at Scale

    Authors: Yiyuan Zhang, Handong Li, Jing Liu, Xiangyu Yue

    Abstract: We propose to build omni-modal intelligence, which is capable of understanding any modality and learning universal representations. Specifically, we propose a scalable pretraining paradigm, named Multimodal Context (MiCo), which can scale up the numbers of modalities and amount of data, together with the model parameters, in the pretraining process. With MiCo, the pretrained models show significant…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Project Website: https://invictus717.github.io/MiCo/

  27. arXiv:2406.07645  [pdf, other]

    cs.CV cs.MM

    SSNVC: Single Stream Neural Video Compression with Implicit Temporal Information

    Authors: Feng Wang, Haihang Ruan, Zhihuang Xie, Ronggang Wang, Xiangyu Yue

    Abstract: Recently, Neural Video Compression (NVC) techniques have achieved remarkable performance, even surpassing the best traditional lossy video codec. However, most existing NVC methods heavily rely on transmitting Motion Vector (MV) to generate accurate contextual features, which has the following drawbacks. (1) Compressing and transmitting MV requires specialized MV encoder and decoder, which makes m…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by DCC 2024 as Poster. This is the full paper

  28. arXiv:2406.06565  [pdf, other]

    cs.CL cs.AI cs.LG

    MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures

    Authors: Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You

    Abstract: Evaluating large language models (LLMs) is challenging. Traditional ground-truth-based benchmarks fail to capture the comprehensiveness and nuance of real-world queries, while LLM-as-judge benchmarks suffer from grading biases and limited query quantity. Both of them may also become contaminated over time. User-facing evaluation, such as Chatbot Arena, provides reliable signals but is costly and s…

    Submitted 12 October, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: Accepted to NeurIPS 2024

  29. arXiv:2406.03092  [pdf, other]

    cs.CL

    FragRel: Exploiting Fragment-level Relations in the External Memory of Large Language Models

    Authors: Xihang Yue, Linchao Zhu, Yi Yang

    Abstract: To process contexts with unlimited length using Large Language Models (LLMs), recent studies explore hierarchically managing the long text. Only several text fragments are taken from the external memory and passed into the temporary working memory, i.e., LLM's context window. However, existing approaches handle the text fragments in isolation without considering their structural connections, thereby…

    Submitted 5 June, 2024; originally announced June 2024.

  30. arXiv:2406.02582  [pdf, other]

    cs.LG cs.AI physics.ao-ph

    Spatiotemporal Predictions of Toxic Urban Plumes Using Deep Learning

    Authors: Yinan Wang, M. Giselle Fernández-Godino, Nipun Gunawardena, Donald D. Lucas, Xiaowei Yue

    Abstract: Industrial accidents, chemical spills, and structural fires can release large amounts of harmful materials that disperse into urban atmospheres and impact populated areas. Computer models are typically used to predict the transport of toxic plumes by solving fluid dynamical equations. However, these models can be computationally expensive due to the need for many grid cells to simulate turbulent f…

    Submitted 30 May, 2024; originally announced June 2024.

    Comments: 13 pages, 10 figures

    MSC Class: 86-08 ACM Class: I.2.10

  31. arXiv:2406.01574  [pdf, other]

    cs.CL

    MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (Published at NeurIPS 2024 Track Datasets and Benchmarks)

    Authors: Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen

    Abstract: In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in…

    Submitted 7 October, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: This version has been accepted and published at NeurIPS 2024 Track Datasets and Benchmarks (Spotlight)

  32. arXiv:2405.20421  [pdf, other]

    cs.AI

    Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

    Authors: Qianqi Yan, Xuehai He, Xiang Yue, Xin Eric Wang

    Abstract: Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA), achieving high accuracy on existing benchmarks. However, their reliability under robust evaluation is questionable. This study reveals that when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions. To address thi…

    Submitted 4 October, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

  33. arXiv:2405.17461  [pdf, other]

    cs.LG cs.CV

    EMR-Merging: Tuning-Free High-Performance Model Merging

    Authors: Chenyu Huang, Peng Ye, Tao Chen, Tong He, Xiangyu Yue, Wanli Ouyang

    Abstract: The success of the pretrain-finetune paradigm has brought about the release of numerous model weights. In this case, merging models finetuned on different tasks to enable a single model with multi-task capabilities is gaining increasing attention for its practicability. Existing model merging methods usually suffer from (1) significant performance degradation or (2) requiring tuning by additional data or t…

    Submitted 27 September, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: NeurIPS 2024

  34. arXiv:2405.15071  [pdf, other]

    cs.CL

    Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization

    Authors: Boshi Wang, Xiang Yue, Yu Su, Huan Sun

    Abstract: We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types, composition and comparison, we consistently find that transformers can learn implicit reasoning, but only through grokking, i.e., extended training far beyond overfitting. The levels of generalizati…

    Submitted 26 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: 22 pages, 16 figures. Code and data: https://github.com/OSU-NLP-Group/GrokkedTransformer

  35. arXiv:2405.13581  [pdf, other]

    cs.CV cs.AI

    Safety Alignment for Vision Language Models

    Authors: Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng

    Abstract: Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to an LLM can realize Vision Language Models (VLMs). However, existing research shows that the visual modality of VLMs is vulnerable, with attackers easily bypassing LLMs' safety alignment through visual modality features to launch attacks. To address this issue, we enhance the e…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 23 pages, 15 figures

  36. arXiv:2405.10514  [pdf, other]

    cs.IT eess.SP

    Secrecy Performance Analysis of Multi-Functional RIS-Assisted NOMA Networks

    Authors: Yingjie Pei, Wanli Ni, Jin Xu, Xinwei Yue, Xiaofeng Tao, Dusit Niyato

    Abstract: Although reconfigurable intelligent surface (RIS) can improve the secrecy communication performance of wireless users, it still faces challenges such as limited coverage and double-fading effect. To address these issues, in this paper, we utilize a novel multi-functional RIS (MF-RIS) to enhance the secrecy performance of wireless users, and investigate the physical layer secrecy problem in non-ort…

    Submitted 16 May, 2024; originally announced May 2024.

    Comments: 14 pages, 9 figures, submitted to IEEE transactions on wireless communication

  37. arXiv:2405.03939  [pdf, other]

    cs.CL

    Long Context Alignment with Short Instructions and Synthesized Positions

    Authors: Wenhao Wu, Yizhong Wang, Yao Fu, Xiang Yue, Dawei Zhu, Sujian Li

    Abstract: Effectively handling instructions with extremely long context remains a challenge for Large Language Models (LLMs), typically necessitating high-quality long data and substantial computational resources. This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of LLMs in the phase of alignment without the need for additional effor…

    Submitted 6 May, 2024; originally announced May 2024.

    Comments: preview

  38. arXiv:2405.03548  [pdf, other]

    cs.CL

    MAmmoTH2: Scaling Instructions from the Web

    Authors: Xiang Yue, Tuney Zheng, Ge Zhang, Wenhu Chen

    Abstract: Instruction tuning improves the reasoning abilities of large language models (LLMs), with data quality and scalability being the crucial factors. Most instruction tuning data come from human crowd-sourcing or GPT-4 distillation. We propose a paradigm to efficiently harvest 10 million naturally existing instruction data from the pre-training web corpus to enhance LLM reasoning. Our approach involve…

    Submitted 23 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

  39. arXiv:2404.10662  [pdf, other]

    cs.LG cs.AI

    Continual Offline Reinforcement Learning via Diffusion-based Dual Generative Replay

    Authors: Jinmei Liu, Wenbin Li, Xiangyu Yue, Shilin Zhang, Chunlin Chen, Zhi Wang

    Abstract: We study continual offline reinforcement learning, a practical paradigm that facilitates forward transfer and mitigates catastrophic forgetting to tackle sequential offline tasks. We propose a dual generative replay framework that retains previous knowledge by concurrent replay of generated pseudo-data. First, we decouple the continual learning policy into a diffusion-based generative behavior mod…

    Submitted 18 April, 2024; v1 submitted 16 April, 2024; originally announced April 2024.

  40. arXiv:2404.06393  [pdf, other]

    cs.SD cs.AI eess.AS

    MuPT: A Generative Symbolic Music Pretrained Transformer

    Authors: Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, Lejun Min, Xueling Liu, Tianyu Zhang, Xinrun Du, Shuyue Guo, Yiming Liang, Yizhi Li, Shangda Wu, Junting Zhou, Tianyu Zheng, Ziyang Ma, Fengze Han, Wei Xue, Gus Xia, Emmanouil Benetos, Xiang Yue, Chenghua Lin, Xu Tan, et al. (3 additional authors not shown)

    Abstract: In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the chal…

    Submitted 10 September, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

  41. arXiv:2404.05955  [pdf, other]

    cs.CL cs.AI

    VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

    Authors: Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, Xiang Yue

    Abstract: Multimodal Large Language Models (MLLMs) have shown promise in web-related tasks, but evaluating their performance in the web domain remains a challenge due to the lack of comprehensive benchmarks. Existing benchmarks are either designed for general multimodal tasks, failing to capture the unique characteristics of web pages, or focus on end-to-end web agent tasks, unable to measure fine-grained a…

    Submitted 8 April, 2024; originally announced April 2024.

  42. arXiv:2404.03543  [pdf, other]

    cs.SE cs.AI cs.CL cs.LG

    CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

    Authors: Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi Li, Ruibo Liu, Yue Wang, Shuyue Guo, Xingwei Qu, Xiang Yue, Ge Zhang, Wenhu Chen, Jie Fu

    Abstract: Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks, including debugging, translating, polishing, and requirement switching. Unlike existing benchmarks focusing solely on code generation, CodeEditorBench empha…

    Submitted 6 April, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

  43. arXiv:2404.02060  [pdf, other]

    cs.CL cs.AI

    Long-context LLMs Struggle with Long In-context Learning

    Authors: Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, Wenhu Chen

    Abstract: Large Language Models (LLMs) have made significant strides in handling long sequences. Some models, like Gemini, could even be capable of dealing with millions of tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their true abilities in more challenging, real-world scenarios. We introduce a benchmark…

    Submitted 11 June, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

  44. Secrecy Performance Analysis of RIS Assisted Ambient Backscatter Communication Networks

    Authors: Yingjie Pei, Xinwei Yue, Chongwen Huang, Zhiping Lu

    Abstract: Reconfigurable intelligent surface (RIS) and ambient backscatter communication (AmBC) have been envisioned as two promising technologies due to their high transmission reliability as well as energy efficiency. This paper investigates the secrecy performance of RIS assisted AmBC networks. New closed-form and asymptotic expressions of secrecy outage probability for RIS-AmBC networks are derived by t…

    Submitted 17 March, 2024; originally announced March 2024.

    Comments: This paper has been accepted for publication in IEEE Transactions on Green Communications and Networking

  45. Secure Communication of Active RIS Assisted NOMA Networks

    Authors: Xuehua Li, Yingjie Pei, Xinwei Yue, Yuanwei Liu, Zhiguo Ding

    Abstract: As a revolutionary technology, reconfigurable intelligent surface (RIS) has been deemed an indispensable part of sixth-generation communications due to its inherent ability to regulate wireless channels. However, passive RIS (PRIS) still suffers from some pressing issues, one of which is that the fading of the entire reflection link is proportional to the product of the distances from the…

    Submitted 17 March, 2024; originally announced March 2024.

    Comments: This paper has been accepted for publication by IEEE Transactions on Wireless Communications

  46. Secrecy Outage Probability Analysis for Downlink RIS-NOMA Networks with On-Off Control

    Authors: Yingjie Pei, Xinwei Yue, Wenqiang Yi, Yuanwei Liu, Xuehua Li, Zhiguo Ding

    Abstract: Reconfigurable intelligent surface (RIS) has been regarded as a promising technology since it has the ability to create favorable channel conditions. This paper investigates the secure communications of RIS assisted non-orthogonal multiple access (NOMA) networks, where both external and internal eavesdropping scenarios are taken into consideration. More specifically, novel approximate and asymptot…

    Submitted 17 March, 2024; originally announced March 2024.

    Comments: This paper has been published in IEEE Transactions on Vehicular Technology

    Journal ref: vol. 72, no. 9, pp. 11772-11786, Sep. 2023

  47. arXiv:2403.10073  [pdf, other]

    cs.CV

    Revisiting Adversarial Training under Long-Tailed Distributions

    Authors: Xinli Yue, Ningping Mou, Qian Wang, Lingchen Zhao

    Abstract: Deep neural networks are vulnerable to adversarial attacks, often leading to erroneous outputs. Adversarial training has been recognized as one of the most effective methods to counter such attacks. However, existing adversarial training techniques have predominantly been tested on balanced datasets, whereas real-world data often exhibit a long-tailed distribution, casting doubt on the efficacy of…

    Submitted 15 March, 2024; originally announced March 2024.

    Comments: Accepted to CVPR 2024

  48. arXiv:2403.05628  [pdf, other]

    cs.MM cs.CR

    AMUSE: Adaptive Multi-Segment Encoding for Dataset Watermarking

    Authors: Saeed Ranjbar Alvar, Mohammad Akbari, David Ming Xuan Yue, Yong Zhang

    Abstract: Curating high-quality datasets that play a key role in the emergence of new AI applications requires considerable time, money, and computational resources. So, effective ownership protection of datasets is becoming critical. Recently, to protect the ownership of an image dataset, imperceptible watermarking techniques are used to store ownership information (i.e., watermark) into the individual ima…

    Submitted 18 July, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

  49. arXiv:2403.02502  [pdf, other]

    cs.CL cs.AI cs.LG

    Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents

    Authors: Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, Bill Yuchen Lin

    Abstract: Large Language Models (LLMs) have become integral components in various autonomous agent systems. In this study, we present an exploration-based trajectory optimization approach, referred to as ETO. This learning method is designed to enhance the performance of open LLM agents. Contrary to previous studies that exclusively train on successful expert trajectories, our method allows agents to learn…

    Submitted 10 July, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

    Comments: Accepted to ACL 2024 Main Conference; Camera Ready

  50. arXiv:2403.00669  [pdf, other]

    cs.LG

    Advancing Additive Manufacturing through Deep Learning: A Comprehensive Review of Current Progress and Future Challenges

    Authors: Amirul Islam Saimon, Emmanuel Yangue, Xiaowei Yue, Zhenyu James Kong, Chenang Liu

    Abstract: Additive manufacturing (AM) has already proved itself a potential alternative to widely used subtractive manufacturing due to its extraordinary capacity for manufacturing highly customized products with minimum material wastage. Nevertheless, it is still not considered the primary choice for industry due to some of its major inherent challenges, including complex and dynamic pr…

    Submitted 1 March, 2024; originally announced March 2024.