Showing 1–50 of 1,596 results for author: Xue, C

  1. arXiv:2410.15475  [pdf, other]

    cs.CV

    Generalized Multimodal Fusion via Poisson-Nernst-Planck Equation

    Authors: Jiayu Xiong, Jing Wang, Hengjing Xiang, Jun Xue, Chen Xu, Zhouqiang Jiang

    Abstract: Previous studies have highlighted significant advancements in multimodal fusion. Nevertheless, such methods often encounter challenges regarding the efficacy of feature extraction, data integrity, consistency of feature dimensions, and adaptability across various downstream tasks. This paper proposes a generalized multimodal fusion method (GMF) via the Poisson-Nernst-Planck (PNP) equation, which a… ▽ More

    Submitted 20 October, 2024; originally announced October 2024.

    Comments: NeurIPS 2024 Rejected paper, 28 pages

  2. arXiv:2410.15287  [pdf, other]

    cs.CL

    Training Language Models to Critique With Multi-agent Feedback

    Authors: Tian Lan, Wenwei Zhang, Chengqi Lyu, Shuaibin Li, Chen Xu, Heyan Huang, Dahua Lin, Xian-Ling Mao, Kai Chen

    Abstract: Critique ability, a meta-cognitive capability of humans, presents significant challenges for LLMs to improve. Recent works primarily rely on supervised fine-tuning (SFT) using critiques generated by a single LLM like GPT-4. However, these model-generated critiques often exhibit flaws due to the inherent complexity of the critique. Consequently, fine-tuning LLMs on such flawed critiques typically l… ▽ More

    Submitted 20 October, 2024; originally announced October 2024.

  3. arXiv:2410.14228  [pdf, other]

    cs.NI

    Towards High-Speed Passive Visible Light Communication with Event Cameras and Digital Micro-Mirrors

    Authors: Yanxiang Wang, Yiran Shen, Kenuo Xu, Guangrong Zhao, Mahbub Hassan, Chenren Xu, Wen Hu

    Abstract: Passive visible light communication (VLC) modulates light propagation or reflection to transmit data without directly modulating the light source. Thus, passive VLC provides an alternative to conventional VLC, enabling communication where the light source cannot be directly controlled. There have been ongoing efforts to explore new methods and devices for modulating light propagation or reflection… ▽ More

    Submitted 21 October, 2024; v1 submitted 18 October, 2024; originally announced October 2024.

    Comments: 14 pages, 21 figures, nonacm

  4. arXiv:2410.13409  [pdf, other]

    cs.CL cs.AI

    Attr-Int: A Simple and Effective Entity Alignment Framework for Heterogeneous Knowledge Graphs

    Authors: Linyan Yang, Jingwei Cheng, Chuanhao Xu, Xihao Wang, Jiayi Li, Fu Zhang

    Abstract: Entity alignment (EA) refers to the task of linking entities in different knowledge graphs (KGs). Existing EA methods rely heavily on structural isomorphism. However, in real-world KGs, aligned entities usually have non-isomorphic neighborhood structures, which paralyses the application of these structure-dependent methods. In this paper, we investigate and tackle the problem of entity alignment b… ▽ More

    Submitted 17 October, 2024; originally announced October 2024.

  5. arXiv:2410.13210  [pdf, other]

    cs.CL cs.AI

    FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs

    Authors: Forrest Sheng Bao, Miaoran Li, Renyi Qu, Ge Luo, Erana Wan, Yujia Tang, Weisi Fan, Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Mike Qi, Ruixuan Tu, Chenyu Xu, Matthew Gonzales, Ofer Mendelevitch, Amin Ahmad

    Abstract: Summarization is one of the most common tasks performed by large language models (LLMs), especially in applications like Retrieval-Augmented Generation (RAG). However, existing evaluations of hallucinations in LLM-generated summaries, and evaluations of hallucination detection models both suffer from a lack of diversity and recency in the LLM and LLM families considered. This paper introduces Fait… ▽ More

    Submitted 17 October, 2024; originally announced October 2024.

  6. arXiv:2410.12295  [pdf, other]

    cs.LG cs.AI cs.CV

    Consistency Calibration: Improving Uncertainty Calibration via Consistency among Perturbed Neighbors

    Authors: Linwei Tao, Haolan Guo, Minjing Dong, Chang Xu

    Abstract: Calibration is crucial in deep learning applications, especially in fields like healthcare and autonomous driving, where accurate confidence estimates are vital for decision-making. However, deep neural networks often suffer from miscalibration, with reliability diagrams and Expected Calibration Error (ECE) being the only standard perspective for evaluating calibration performance. In this paper,… ▽ More

    Submitted 16 October, 2024; originally announced October 2024.

  7. arXiv:2410.12051  [pdf, other]

    cs.HC cs.AI cs.ET cs.MM

    Enabling Data-Driven and Empathetic Interactions: A Context-Aware 3D Virtual Agent in Mixed Reality for Enhanced Financial Customer Experience

    Authors: Cindy Xu, Mengyu Chen, Pranav Deshpande, Elvir Azanli, Runqing Yang, Joseph Ligman

    Abstract: In this paper, we introduce a novel system designed to enhance customer service in the financial and retail sectors through a context-aware 3D virtual agent, utilizing Mixed Reality (MR) and Vision Language Models (VLMs). Our approach focuses on enabling data-driven and empathetic interactions that ensure customer satisfaction by introducing situational awareness of the physical location, personal… ▽ More

    Submitted 15 October, 2024; originally announced October 2024.

    Comments: to appear at 1st Workshop on Intelligent XR: Harnessing AI for Next-Generation XR User Experiences at International Symposium on Mixed and Augmented Reality (ISMAR) 2024

    ACM Class: H.5.1; K.4.3

  8. arXiv:2410.11860  [pdf, other]

    cs.HC cs.AI cs.CV

    Comparing Zealous and Restrained AI Recommendations in a Real-World Human-AI Collaboration Task

    Authors: Chengyuan Xu, Kuo-Chin Lien, Tobias Höllerer

    Abstract: When designing an AI-assisted decision-making system, there is often a tradeoff between precision and recall in the AI's recommendations. We argue that careful exploitation of this tradeoff can harness the complementary strengths in the human-AI collaboration to significantly improve team performance. We investigate a real-world video anonymization task for which recall is paramount and more costl… ▽ More

    Submitted 6 October, 2024; originally announced October 2024.

    Comments: 15 pages, 14 figures, accepted to ACM CHI 2023

    ACM Class: H.5.0; I.2.0

    Journal ref: In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, New York, NY, USA, Article 350, 1–15

  9. arXiv:2410.11577  [pdf, other]

    cs.DC cs.LG

    Breaking the Memory Wall for Heterogeneous Federated Learning via Model Splitting

    Authors: Chunlin Tian, Li Li, Kahou Tam, Yebo Wu, Chengzhong Xu

    Abstract: Federated Learning (FL) enables multiple devices to collaboratively train a shared model while preserving data privacy. Ever-increasing model complexity coupled with limited memory resources on the participating devices severely bottlenecks the deployment of FL in real-world scenarios. Thus, a framework that can effectively break the memory wall while jointly taking into account the hardware and s… ▽ More

    Submitted 12 October, 2024; originally announced October 2024.

    Comments: Accepted by TPDS

  10. arXiv:2410.11359  [pdf, other]

    cs.LG cs.RO stat.ML

    DODT: Enhanced Online Decision Transformer Learning through Dreamer's Actor-Critic Trajectory Forecasting

    Authors: Eric Hanchen Jiang, Zhi Zhang, Dinghuai Zhang, Andrew Lizarraga, Chenheng Xu, Yasi Zhang, Siyan Zhao, Zhengjie Xu, Peiyu Yu, Yuer Tang, Deqian Kong, Ying Nian Wu

    Abstract: Advancements in reinforcement learning have led to the development of sophisticated models capable of learning complex decision-making tasks. However, efficiently integrating world models with decision transformers remains a challenge. In this paper, we introduce a novel approach that combines the Dreamer algorithm's ability to generate anticipatory trajectories with the adaptive learning strength… ▽ More

    Submitted 15 October, 2024; originally announced October 2024.

  11. arXiv:2410.10873  [pdf, other]

    cs.CL cs.AI cs.CY

    AuditWen: An Open-Source Large Language Model for Audit

    Authors: Jiajia Huang, Haoran Zhu, Chao Xu, Tianming Zhan, Qianqian Xie, Jimin Huang

    Abstract: Intelligent auditing represents a crucial advancement in modern audit practices, enhancing both the quality and efficiency of audits within the realm of artificial intelligence. With the rise of large language model (LLM), there is enormous potential for intelligent models to contribute to audit domain. However, general LLMs applied in audit domain face the challenges of lacking specialized knowle… ▽ More

    Submitted 8 October, 2024; originally announced October 2024.

    Comments: 18 pages, 1 figure

  12. arXiv:2410.10160  [pdf, other]

    cs.CV

    Will the Inclusion of Generated Data Amplify Bias Across Generations in Future Image Classification Models?

    Authors: Zeliang Zhang, Xin Liang, Mingqian Feng, Susan Liang, Chenliang Xu

    Abstract: As the demand for high-quality training data escalates, researchers have increasingly turned to generative models to create synthetic data, addressing data scarcity and enabling continuous model improvement. However, reliance on self-generated data introduces a critical question: Will this practice amplify bias in future models? While most research has focused on overall performance, the impact on… ▽ More

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: 15 pages, 7 figures

  13. arXiv:2410.09823  [pdf, other]

    cs.LG cs.CL

    Simultaneous Computation and Memory Efficient Zeroth-Order Optimizer for Fine-Tuning Large Language Models

    Authors: Fei Wang, Li Shen, Liang Ding, Chao Xue, Ye Liu, Changxing Ding

    Abstract: Fine-tuning is powerful for adapting large language models to downstream tasks, but it often results in huge memory usages. A promising approach to mitigate this is using Zeroth-Order (ZO) optimization, which estimates gradients to replace First-Order (FO) gradient calculations, albeit with longer training time due to its stochastic nature. By revisiting the Memory-efficient ZO (MeZO) optimizer, w… ▽ More

    Submitted 13 October, 2024; originally announced October 2024.

  14. arXiv:2410.09733  [pdf, other]

    cs.CV

    MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

    Authors: Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, Jiebo Luo

    Abstract: The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling more sophisticated and accurate integration of visual and textual information across various tasks, including image and video captioning, visual question answering, and cross-modal retrieval. Despite VLMs' superior capabilities, researchers lack a comprehensive understanding of their com… ▽ More

    Submitted 13 October, 2024; originally announced October 2024.

    Comments: 21 pages, 15 figures

  15. arXiv:2410.09418  [pdf, other]

    cs.CL

    Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models

    Authors: Yi-Fan Lu, Xian-Ling Mao, Tian Lan, Chen Xu, Heyan Huang

    Abstract: Event extraction has gained extensive research attention due to its broad range of applications. However, the current mainstream evaluation method for event extraction relies on token-level exact match, which misjudges numerous semantic-level correct cases. This reliance leads to a significant discrepancy between the evaluated performance of models under exact match criteria and their real perform… ▽ More

    Submitted 12 October, 2024; originally announced October 2024.

  16. arXiv:2410.09254  [pdf, other]

    cs.CV

    Few Exemplar-Based General Medical Image Segmentation via Domain-Aware Selective Adaptation

    Authors: Chen Xu, Qiming Huang, Yuqi Hou, Jiangxing Wu, Fan Zhang, Hyung Jin Chang, Jianbo Jiao

    Abstract: Medical image segmentation poses challenges due to domain gaps, data modality variations, and dependency on domain knowledge or experts, especially for low- and middle-income countries (LMICs). Whereas for humans, given a few exemplars (with corresponding labels), we are able to segment different medical images even without extensive domain-specific clinical training. In addition, current SAM-bas… ▽ More

    Submitted 11 October, 2024; originally announced October 2024.

    Comments: Accepted in ACCV 2024

  17. arXiv:2410.08723  [pdf, other]

    cs.HC

    Investigating Human-Computer Interaction and Visual Comprehension in Text Generation Process of Natural Language Generation Models

    Authors: Yunchao Wang, Zihang Fu, Chaoqing Xu, Guodao Sun, Ronghua Liang

    Abstract: Natural language generation (NLG) models are becoming a highly sought-after research focus in the field of natural language processing (NLP), demonstrating strong capabilities in text generation tasks such as writing and dialogue generation. Despite the impressive performance of NLG models, their complex architecture and extensive model weights result in a lack of interpretability. This limitation… ▽ More

    Submitted 11 October, 2024; originally announced October 2024.

  18. arXiv:2410.08611  [pdf, other]

    cs.CV cs.AI

    Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models

    Authors: Mengyuan Chen, Junyu Gao, Changsheng Xu

    Abstract: A straightforward pipeline for zero-shot out-of-distribution (OOD) detection involves selecting potential OOD labels from an extensive semantic pool and then leveraging a pre-trained vision-language model to perform classification on both in-distribution (ID) and OOD labels. In this paper, we theorize that enhancing performance requires expanding the semantic pool, while increasing the expected pr… ▽ More

    Submitted 11 October, 2024; originally announced October 2024.

    Comments: 28 pages, accepted by NeurIPS 2024

  19. arXiv:2410.08021  [pdf, other]

    cs.CV

    OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling

    Authors: Linhui Xiao, Xiaoshan Yang, Fang Peng, Yaowei Wang, Changsheng Xu

    Abstract: Constrained by the separate encoding of vision and language, existing grounding and referring segmentation works heavily rely on bulky Transformer-based fusion en-/decoders and a variety of early-stage interaction technologies. Simultaneously, the current mask visual language modeling (MVLM) fails to capture the nuanced referential relationship between image-text in referring tasks. In this paper,… ▽ More

    Submitted 10 October, 2024; originally announced October 2024.

    Comments: Accepted by NeurIPS 2024. The project page: https://github.com/linhuixiao/OneRef

  20. arXiv:2410.07463  [pdf, other]

    cs.CV

    Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation

    Authors: Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu

    Abstract: In this paper, we introduce a novel task called language-guided joint audio-visual editing. Given an audio and image pair of a sounding event, this task aims at generating new audio-visual content by editing the given sounding event conditioned on the language guidance. For instance, we can alter the background environment of a sounding object while keeping its appearance unchanged, or we can add… ▽ More

    Submitted 9 October, 2024; originally announced October 2024.

    Comments: ACCV 2024

  21. arXiv:2410.07160  [pdf, other]

    cs.CV cs.GR

    TextToon: Real-Time Text Toonify Head Avatar from Single Video

    Authors: Luchuan Song, Lele Chen, Celong Liu, Pinxin Liu, Chenliang Xu

    Abstract: We propose TextToon, a method to generate a drivable toonified avatar. Given a short monocular video sequence and a written instruction about the avatar style, our model can generate a high-fidelity toonified avatar that can be driven in real-time by another video with arbitrary identities. Existing related works heavily rely on multi-view modeling to recover geometry via texture embeddings, prese… ▽ More

    Submitted 23 September, 2024; originally announced October 2024.

    Comments: Project Page: https://songluchuan.github.io/TextToon/

  22. arXiv:2410.07023  [pdf, other]

    cs.GT

    Mechanism Design for Exchange Markets

    Authors: Yusen Zheng, Yukun Cheng, Chenyang Xu, Xiaotie Deng

    Abstract: Exchange markets are a significant type of market economy, in which each agent holds a budget and certain (divisible) resources available for trading. Most research on equilibrium in exchange economies is based on an environment of completely free competition. However, the orderly operation of markets also relies on effective economic regulatory mechanisms. This paper initiates the study of the me… ▽ More

    Submitted 9 October, 2024; originally announced October 2024.

  23. arXiv:2410.06764  [pdf, other]

    cs.DS math.OC

    An Optimal Algorithm for the Stacker Crane Problem on Fixed Topologies

    Authors: Yike Chen, Ke Shi, Chao Xu

    Abstract: The Stacker Crane Problem (SCP) is a variant of the Traveling Salesman Problem. In SCP, pairs of pickup and delivery points are designated on a graph, and a crane must visit these points to move objects from each pickup location to its respective delivery point. The goal is to minimize the total distance traveled. SCP is known to be NP-hard, even on tree structures. The only positive results, in t… ▽ More

    Submitted 9 October, 2024; originally announced October 2024.

  24. arXiv:2410.06169  [pdf, other]

    cs.CV

    Quadratic Is Not What You Need For Multimodal Large Language Models

    Authors: Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Zeliang Zhang, Daniel Miranda, Ajinkya Kale, Chenliang Xu

    Abstract: In the past year, the capabilities of Multimodal Large Language Models (MLLMs) have significantly improved across various aspects. However, constrained by the quadratic growth of computation in LLMs as the number of tokens increases, efficiency has become a bottleneck for further scaling MLLMs. Although recent efforts have been made to prune visual tokens or use more lightweight LLMs to reduce com… ▽ More

    Submitted 8 October, 2024; originally announced October 2024.

  25. arXiv:2410.04873  [pdf, other]

    cs.CV cs.RO

    TeX-NeRF: Neural Radiance Fields from Pseudo-TeX Vision

    Authors: Chonghao Zhong, Chao Xu

    Abstract: Neural radiance fields (NeRF) has gained significant attention for its exceptional visual effects. However, most existing NeRF methods reconstruct 3D scenes from RGB images captured by visible light cameras. In practical scenarios like darkness, low light, or bad weather, visible light cameras become ineffective. Therefore, we propose TeX-NeRF, a 3D reconstruction method using only infrared images… ▽ More

    Submitted 7 October, 2024; originally announced October 2024.

  26. arXiv:2410.04652  [pdf, other]

    cs.HC cs.AI cs.CV

    Multimodal 3D Fusion and In-Situ Learning for Spatially Aware AI

    Authors: Chengyuan Xu, Radha Kumaran, Noah Stier, Kangyou Yu, Tobias Höllerer

    Abstract: Seamless integration of virtual and physical worlds in augmented reality benefits from the system semantically "understanding" the physical environment. AR research has long focused on the potential of context awareness, demonstrating novel capabilities that leverage the semantics in the 3D environment for various object-level interactions. Meanwhile, the computer vision community has made leaps i… ▽ More

    Submitted 6 October, 2024; originally announced October 2024.

    Comments: 10 pages, 6 figures, accepted to IEEE ISMAR 2024

    ACM Class: I.4.8; H.5.2

  27. arXiv:2410.03796  [pdf, other]

    cs.LG cs.AI

    Dynamic Evidence Decoupling for Trusted Multi-view Learning

    Authors: Ying Liu, Lihong Liu, Cai Xu, Xiangyu Song, Ziyu Guan, Wei Zhao

    Abstract: Multi-view learning methods often focus on improving decision accuracy, while neglecting the decision uncertainty, limiting their suitability for safety-critical applications. To mitigate this, researchers propose trusted multi-view learning methods that estimate classification probabilities and uncertainty by learning the class distributions for each instance. However, these methods assume that t… ▽ More

    Submitted 3 October, 2024; originally announced October 2024.

  28. arXiv:2410.03137  [pdf, other]

    cs.CL

    SAG: Style-Aligned Article Generation via Model Collaboration

    Authors: Chenning Xu, Fangxun Shu, Dian Jin, Jinghao Wei, Hao Jiang

    Abstract: Large language models (LLMs) have increased the demand for personalized and stylish content generation. However, closed-source models like GPT-4 present limitations in optimization opportunities, while the substantial training costs and inflexibility of open-source alternatives, such as Qwen-72B, pose considerable challenges. Conversely, small language models (SLMs) struggle with understanding com… ▽ More

    Submitted 4 October, 2024; originally announced October 2024.

  29. arXiv:2410.02548  [pdf, other]

    stat.ML cs.LG

    Local Flow Matching Generative Models

    Authors: Chen Xu, Xiuyuan Cheng, Yao Xie

    Abstract: Flow Matching (FM) is a simulation-free method for learning a continuous and invertible flow to interpolate between two distributions, and in particular to generate data from noise in generative modeling. In this paper, we introduce Local Flow Matching (LFM), which learns a sequence of FM sub-models and each matches a diffusion process up to the time of the step size in the data-to-noise direction… ▽ More

    Submitted 3 October, 2024; originally announced October 2024.

  30. arXiv:2410.01408  [pdf, other]

    cs.CV

    SHAP-CAT: A interpretable multi-modal framework enhancing WSI classification via virtual staining and shapley-value-based multimodal fusion

    Authors: Jun Wang, Yu Mao, Nan Guan, Chun Jason Xue

    Abstract: The multimodal model has demonstrated promise in histopathology. However, most multimodal models are based on H&E and genomics, adopting increasingly complex yet black-box designs. In our paper, we propose a novel interpretable multimodal framework named SHAP-CAT, which uses a Shapley-value-based dimension reduction technique for effective multimodal fusion. Starting with two paired modalities --… ▽ More

    Submitted 2 October, 2024; originally announced October 2024.

  31. arXiv:2410.00393  [pdf, other]

    cs.LG cs.AI

    Revisiting Essential and Nonessential Settings of Evidential Deep Learning

    Authors: Mengyuan Chen, Junyu Gao, Changsheng Xu

    Abstract: Evidential Deep Learning (EDL) is an emerging method for uncertainty estimation that provides reliable predictive uncertainty in a single forward pass, attracting significant attention. Grounded in subjective logic, EDL derives Dirichlet concentration parameters from neural networks to construct a Dirichlet probability density function (PDF), modeling the distribution of class probabilities. Despi… ▽ More

    Submitted 1 October, 2024; originally announced October 2024.

    Comments: 22 pages, under review

  32. arXiv:2410.00059  [pdf, other]

    cs.CR cs.AI cs.CV cs.LG

    IDEA: An Inverse Domain Expert Adaptation Based Active DNN IP Protection Method

    Authors: Chaohui Xu, Qi Cui, Jinxin Dong, Weiyang He, Chip-Hong Chang

    Abstract: Illegitimate reproduction, distribution and derivation of Deep Neural Network (DNN) models can inflict economic loss, reputation damage and even privacy infringement. Passive DNN intellectual property (IP) protection methods such as watermarking and fingerprinting attempt to prove the ownership upon IP violation, but they are often too late to stop catastrophic damage of IP abuse and too feeble ag… ▽ More

    Submitted 29 September, 2024; originally announced October 2024.

  33. arXiv:2409.19638  [pdf, other]

    cs.CV cs.AI

    BadHMP: Backdoor Attack against Human Motion Prediction

    Authors: Chaohui Xu, Si Wang, Chip-Hong Chang

    Abstract: Precise future human motion prediction over subsecond horizons from past observations is crucial for various safety-critical applications. To date, only one study has examined the vulnerability of human motion prediction to evasion attacks. In this paper, we propose BadHMP, the first backdoor attack that targets specifically human motion prediction. Our approach involves generating poisoned traini… ▽ More

    Submitted 29 September, 2024; originally announced September 2024.

  34. arXiv:2409.18996  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG cs.MM

    From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models

    Authors: Shengsheng Qian, Zuyi Zhou, Dizhan Xue, Bing Wang, Changsheng Xu

    Abstract: Cross-modal reasoning (CMR), the intricate process of synthesizing and drawing inferences across divergent sensory modalities, is increasingly recognized as a crucial capability in the progression toward more sophisticated and anthropomorphic artificial intelligence systems. Large Language Models (LLMs) represent a class of AI algorithms specifically engineered to parse, produce, and engage with h… ▽ More

    Submitted 18 September, 2024; originally announced September 2024.

    ACM Class: A.1

  35. arXiv:2409.18857  [pdf, other]

    cs.AI

    Mitigating Selection Bias with Node Pruning and Auxiliary Options

    Authors: Hyeong Kyu Choi, Weijie Xu, Chi Xue, Stephanie Eckman, Chandan K. Reddy

    Abstract: Large language models (LLMs) often show unwarranted preference for certain choice options when responding to multiple-choice questions, posing significant reliability concerns in LLM-automated systems. To mitigate this selection bias problem, previous solutions utilized debiasing methods to adjust the model's input and/or output. Our work, in contrast, investigates the model's internal representat… ▽ More

    Submitted 27 September, 2024; originally announced September 2024.

  36. arXiv:2409.18839  [pdf, other]

    cs.CV

    MinerU: An Open-Source Solution for Precise Document Content Extraction

    Authors: Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, Bo Zhang, Liqun Wei, Zhihao Sui, Wei Li, Botian Shi, Yu Qiao, Dahua Lin, Conghui He

    Abstract: Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution f… ▽ More

    Submitted 27 September, 2024; originally announced September 2024.

    Comments: MinerU Technical Report

  37. arXiv:2409.18798  [pdf]

    cs.HC cs.AI cs.LG

    Esports Debut as a Medal Event at 2023 Asian Games: Exploring Public Perceptions with BERTopic and GPT-4 Topic Fine-Tuning

    Authors: Tyreal Yizhou Qian, Bo Yu, Weizhe Li, Chenglong Xu

    Abstract: This study examined the public opinions of esports at the 2023 Asian Games and value co-creation during the event using an LLM-enhanced BERTopic modeling analysis. We identified five major themes representing public perceptions, as well as how major stakeholders co-created value within and beyond the esports ecosystem. Key findings highlighted the strategic use of social media marketing to influen… ▽ More

    Submitted 27 September, 2024; originally announced September 2024.

  38. arXiv:2409.18419  [pdf, other]

    cs.CV cs.LG

    Robust Network Learning via Inverse Scale Variational Sparsification

    Authors: Zhiling Zhou, Zirui Liu, Chengming Xu, Yanwei Fu, Xinwei Sun

    Abstract: While neural networks have made significant strides in many AI tasks, they remain vulnerable to a range of noise types, including natural corruptions, adversarial noise, and low-resolution artifacts. Many existing approaches focus on enhancing robustness against specific noise types, limiting their adaptability to others. Previous studies have addressed general robustness by adopting a spectral pe… ▽ More

    Submitted 26 September, 2024; originally announced September 2024.

    Comments: 21 pages, 7 figures

  39. arXiv:2409.17692  [pdf, other]

    cs.CL cs.AI cs.LG

    MIO: A Foundation Model on Multimodal Tokens

    Authors: Zekun Wang, King Zhu, Chunpu Xu, Wangchunshu Zhou, Jiaheng Liu, Yibo Zhang, Jiashuo Wang, Ning Shi, Siyu Li, Yizhi Li, Haoran Que, Zhaoxiang Zhang, Yuanxing Zhang, Ge Zhang, Ke Xu, Jie Fu, Wenhao Huang

    Abstract: In this paper, we introduce MIO, a novel foundation model built on multimodal tokens, capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner. While the emergence of large language models (LLMs) and multimodal large language models (MM-LLMs) propels advancements in artificial general intelligence through their versatile capabilities, they st… ▽ More

    Submitted 26 September, 2024; originally announced September 2024.

    Comments: Technical Report. Codes and models will be available soon

  40. EAGLE: Egocentric AGgregated Language-video Engine

    Authors: Jing Bi, Yunlong Tang, Luchuan Song, Ali Vosoughi, Nguyen Nguyen, Chenliang Xu

    Abstract: The rapid evolution of egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective. Despite this progress, the fragmentation in tasks like action recognition, procedure learning, and moment retrieval, etc., coupled with inconsistent annotations and isolated model development, hinders a holistic interpretation of video content. In… ▽ More

    Submitted 26 September, 2024; originally announced September 2024.

    Comments: Accepted by ACMMM 24

  41. arXiv:2409.15911  [pdf, other]

    cs.CL cs.SD eess.AS

    A Modular-based Strategy for Mitigating Gradient Conflicts in Simultaneous Speech Translation

    Authors: Xiaoqian Liu, Yangfan Du, Jianjin Wang, Yuan Ge, Chen Xu, Tong Xiao, Guocheng Chen, Jingbo Zhu

    Abstract: Simultaneous Speech Translation (SimulST) involves generating target language text while continuously processing streaming speech input, presenting significant real-time challenges. Multi-task learning is often employed to enhance SimulST performance but introduces optimization conflicts between primary and auxiliary tasks, potentially compromising overall efficiency. The existing model-level conf… ▽ More

    Submitted 17 October, 2024; v1 submitted 24 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  42. arXiv:2409.15087  [pdf]

    eess.IV cs.CV cs.LG

    Towards Accountable AI-Assisted Eye Disease Diagnosis: Workflow Design, External Validation, and Continual Learning

    Authors: Qingyu Chen, Tiarnan D L Keenan, Elvira Agron, Alexis Allot, Emily Guan, Bryant Duong, Amr Elsawy, Benjamin Hou, Cancan Xue, Sanjeeb Bhandari, Geoffrey Broadhead, Chantal Cousineau-Krieger, Ellen Davis, William G Gensheimer, David Grasic, Seema Gupta, Luis Haddock, Eleni Konstantinou, Tania Lamba, Michele Maiberger, Dimosthenis Mantopoulos, Mitul C Mehta, Ayman G Nahri, Mutaz AL-Nawaflh, Arnold Oshinsky , et al. (13 additional authors not shown)

    Abstract: Timely disease diagnosis is challenging due to increasing disease burdens and limited clinician availability. AI shows promise in diagnosis accuracy but faces real-world application issues due to insufficient validation in clinical workflows and diverse populations. This study addresses gaps in medical AI downstream accountability through a case study on age-related macular degeneration (AMD) diag… ▽ More

    Submitted 23 September, 2024; originally announced September 2024.

  43. arXiv:2409.15035  [pdf, other]

    cs.CV cs.CL

    Can CLIP Count Stars? An Empirical Study on Quantity Bias in CLIP

    Authors: Zeliang Zhang, Zhuo Liu, Mingqian Feng, Chenliang Xu

    Abstract: CLIP has demonstrated great versatility in adapting to various downstream tasks, such as image editing and generation, visual question answering, and video understanding. However, CLIP-based applications often suffer from misunderstandings regarding user intent, leading to discrepancies between the required number of objects and the actual outputs in image generation tasks. In this work, we empiri…

    Submitted 23 September, 2024; originally announced September 2024.

    Comments: Short paper. Accepted by the Findings of EMNLP 2024

  44. arXiv:2409.14997  [pdf, other]

    cs.CL

    Enhancing Aspect-based Sentiment Analysis in Tourism Using Large Language Models and Positional Information

    Authors: Chun Xu, Mengmeng Wang, Yan Ren, Shaolin Zhu

    Abstract: Aspect-Based Sentiment Analysis (ABSA) in tourism plays a significant role in understanding tourists' evaluations of specific aspects of attractions, which is crucial for driving innovation and development in the tourism industry. However, traditional pipeline models are afflicted by issues such as error propagation and incomplete extraction of sentiment elements. To alleviate this issue, this pap…

    Submitted 23 September, 2024; originally announced September 2024.

    Comments: 19 pages, 17 figures

  45. arXiv:2409.14961  [pdf, other]

    cs.DC

    UELLM: A Unified and Efficient Approach for LLM Inference Serving

    Authors: Yiyuan He, Minxian Xu, Jingfeng Wu, Wanyi Zheng, Kejiang Ye, Chengzhong Xu

    Abstract: In the context of Machine Learning as a Service (MLaaS) clouds, the extensive use of Large Language Models (LLMs) often requires efficient management of significant query loads. When providing real-time inference services, several challenges arise. Firstly, increasing the number of GPUs may lead to a decrease in inference speed due to heightened communication overhead, while an inadequate number o…

    Submitted 23 September, 2024; v1 submitted 23 September, 2024; originally announced September 2024.

    Comments: 15 pages, 5 figures, ICSOC 2024

  46. DiffSound: Differentiable Modal Sound Rendering and Inverse Rendering for Diverse Inference Tasks

    Authors: Xutong Jin, Chenxi Xu, Ruohan Gao, Jiajun Wu, Guoping Wang, Sheng Li

    Abstract: Accurately estimating and simulating the physical properties of objects from real-world sound recordings is of great practical importance in the fields of vision, graphics, and robotics. However, the progress in these directions has been limited -- prior differentiable rigid or soft body simulation techniques cannot be directly applied to modal sound synthesis due to the high sampling rate of audi…

    Submitted 20 September, 2024; originally announced September 2024.

    Comments: 12 pages, 10 figures. Published in Siggraph 2024. Project page: https://hellojxt.github.io/DiffSound/

  47. arXiv:2409.13430  [pdf, other]

    cs.CV cs.AI

    CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction

    Authors: Zhangchen Ye, Tao Jiang, Chenfeng Xu, Yiming Li, Hang Zhao

    Abstract: Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to improve the accuracy of 3D occupancy predictions. By sampling points along the line of sight of each voxel and integrating the…

    Submitted 25 September, 2024; v1 submitted 20 September, 2024; originally announced September 2024.

    Comments: Accepted to ECCV 2024

  48. arXiv:2409.11678  [pdf]

    cs.IR cs.LG

    An Enhanced-State Reinforcement Learning Algorithm for Multi-Task Fusion in Large-Scale Recommender Systems

    Authors: Peng Liu, Jiawei Zhu, Cong Xu, Ming Zhao, Bin Wang

    Abstract: As the last key stage of Recommender Systems (RSs), Multi-Task Fusion (MTF) is in charge of combining multiple scores predicted by Multi-Task Learning (MTL) into a final score to maximize user satisfaction, which decides the ultimate recommendation results. In recent years, to maximize long-term user satisfaction within a recommendation session, Reinforcement Learning (RL) is widely used for MTF i…

    Submitted 27 September, 2024; v1 submitted 17 September, 2024; originally announced September 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2404.17589

  49. arXiv:2409.11295  [pdf, other]

    cs.CR cs.AI cs.CL cs.LG

    EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage

    Authors: Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, Huan Sun

    Abstract: Generalist web agents have demonstrated remarkable potential in autonomously completing a wide range of tasks on real websites, significantly boosting human productivity. However, web tasks, such as booking flights, usually involve users' PII, which may be exposed to potential privacy risks if web agents accidentally interact with compromised websites, a scenario that remains largely unexplored in…

    Submitted 3 October, 2024; v1 submitted 17 September, 2024; originally announced September 2024.

    Comments: 29 pages

  50. arXiv:2409.10901  [pdf, other]

    cs.CV

    TrajSSL: Trajectory-Enhanced Semi-Supervised 3D Object Detection

    Authors: Philip Jacobson, Yichen Xie, Mingyu Ding, Chenfeng Xu, Masayoshi Tomizuka, Wei Zhan, Ming C. Wu

    Abstract: Semi-supervised 3D object detection is a common strategy employed to circumvent the challenge of manually labeling large-scale autonomous driving perception datasets. Pseudo-labeling approaches to semi-supervised learning adopt a teacher-student framework in which machine-generated pseudo-labels on a large unlabeled dataset are used in combination with a small manually-labeled dataset for training…

    Submitted 17 September, 2024; originally announced September 2024.