Showing 1–50 of 623 results for author: Xiong, C

  1. arXiv:2410.16267  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

    Authors: Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles

    Abstract: We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of the 'temporal encoder' in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP-3-Video to use much f…

    Submitted 21 October, 2024; originally announced October 2024.

  2. arXiv:2410.15531  [pdf, other]

    cs.CL

    Do RAG Systems Cover What Matters? Evaluating and Optimizing Responses with Sub-Question Coverage

    Authors: Kaige Xie, Philippe Laban, Prafulla Kumar Choubey, Caiming Xiong, Chien-Sheng Wu

    Abstract: Evaluating retrieval-augmented generation (RAG) systems remains challenging, particularly for open-ended questions that lack definitive answers and require coverage of multiple sub-topics. In this paper, we introduce a novel evaluation framework based on sub-question coverage, which measures how well a RAG system addresses different facets of a question. We propose decomposing questions into sub-q…

    Submitted 20 October, 2024; originally announced October 2024.

  3. arXiv:2410.14208  [pdf, other]

    cs.CL cs.AI cs.LG

    Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning

    Authors: Xiaochuan Li, Zichun Yu, Chenyan Xiong

    Abstract: Synthetic data has been widely used to train large language models, but their generative nature inevitably introduces noisy, non-informative, and misleading learning signals. In this paper, we propose Montessori-Instruct, a novel data synthesis framework that tailors the data synthesis ability of the teacher language model toward the student language model's learning process. Specifically, we util…

    Submitted 18 October, 2024; originally announced October 2024.

    Comments: Codes and data are open-sourced at https://github.com/cxcscmu/Montessori-Instruct

  4. arXiv:2410.14180  [pdf, other]

    cs.CL

    XForecast: Evaluating Natural Language Explanations for Time Series Forecasting

    Authors: Taha Aksu, Chenghao Liu, Amrita Saha, Sarah Tan, Caiming Xiong, Doyen Sahoo

    Abstract: Time series forecasting aids decision-making, especially for stakeholders who rely on accurate predictions, making it very important to understand and explain these models to ensure informed decisions. Traditional explainable AI (XAI) methods, which underline feature or temporal importance, often require expert knowledge. In contrast, natural language explanations (NLEs) are more accessible to lay…

    Submitted 20 October, 2024; v1 submitted 18 October, 2024; originally announced October 2024.

  5. arXiv:2410.13824  [pdf, other]

    cs.CV cs.CL

    Harnessing Webpage UIs for Text-Rich Visual Understanding

    Authors: Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue

    Abstract: Text-rich visual understanding, the ability to process environments where dense textual content is integrated with visuals, is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Despite lacking dire…

    Submitted 18 October, 2024; v1 submitted 17 October, 2024; originally announced October 2024.

  6. arXiv:2410.13509  [pdf, other]

    cs.CL

    RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards

    Authors: Xinze Li, Sen Mei, Zhenghao Liu, Yukun Yan, Shuo Wang, Shi Yu, Zheni Zeng, Hao Chen, Ge Yu, Zhiyuan Liu, Maosong Sun, Chenyan Xiong

    Abstract: Retrieval-Augmented Generation (RAG) has proven its effectiveness in mitigating hallucinations in Large Language Models (LLMs) by retrieving knowledge from external resources. To adapt LLMs for RAG pipelines, current approaches use instruction tuning to optimize LLMs, improving their ability to utilize retrieved knowledge. This supervised fine-tuning (SFT) approach focuses on equipping LLMs to han…

    Submitted 17 October, 2024; originally announced October 2024.

  7. arXiv:2410.13121  [pdf, other]

    cs.CV cs.AI

    Trust but Verify: Programmatic VLM Evaluation in the Wild

    Authors: Viraj Prabhu, Senthil Purushwalkam, An Yan, Caiming Xiong, Ran Xu

    Abstract: Vision-Language Models (VLMs) often generate plausible but incorrect responses to visual queries. However, reliably quantifying the effect of such hallucinations in free-form responses to open-ended queries is challenging as it requires visually verifying each claim within the response. We propose Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm for evaluating VLM responses to open…

    Submitted 16 October, 2024; originally announced October 2024.

  8. arXiv:2410.11209  [pdf, other]

    cs.CR

    CRUcialG: Reconstruct Integrated Attack Scenario Graphs by Cyber Threat Intelligence Reports

    Authors: Wenrui Cheng, Tiantian Zhu, Tieming Chen, Qixuan Yuan, Jie Ying, Hongmei Li, Chunlin Xiong, Mingda Li, Mingqi Lv, Yan Chen

    Abstract: Cyber Threat Intelligence (CTI) reports are factual records compiled by security analysts through their observations of threat events or their own practical experience with attacks. In order to utilize CTI reports for attack detection, existing methods have attempted to map the content of reports onto system-level attack provenance graphs to clearly depict attack procedures. However, existing stud…

    Submitted 14 October, 2024; originally announced October 2024.

  9. arXiv:2410.10469  [pdf, other]

    cs.LG stat.ML

    Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts

    Authors: Xu Liu, Juncheng Liu, Gerald Woo, Taha Aksu, Yuxuan Liang, Roger Zimmermann, Chenghao Liu, Silvio Savarese, Caiming Xiong, Doyen Sahoo

    Abstract: Time series foundation models have demonstrated impressive performance as zero-shot forecasters. However, achieving effectively unified training on time series remains an open challenge. Existing approaches introduce some level of model specialization to account for the highly heterogeneous nature of time series data. For instance, Moirai pursues unified training by employing multiple input/output…

    Submitted 14 October, 2024; originally announced October 2024.

  10. arXiv:2410.10393  [pdf, other]

    cs.LG stat.ML

    GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation

    Authors: Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, Doyen Sahoo

    Abstract: Time series foundation models excel in zero-shot forecasting, handling diverse tasks without explicit training. However, the advancement of these models has been hindered by the lack of comprehensive benchmarks. To address this gap, we introduce the General Time Series Forecasting Model Evaluation, GIFT-Eval, a pioneering benchmark aimed at promoting evaluation across diverse datasets. GIFT-Eval e…

    Submitted 14 October, 2024; originally announced October 2024.

  11. arXiv:2410.09207  [pdf, other]

    cs.AI cs.CL

    P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains

    Authors: Simeng Han, Aaron Yu, Rui Shen, Zhenting Qi, Martin Riddell, Wenfei Zhou, Yujie Qiao, Yilun Zhao, Semih Yavuz, Ye Liu, Shafiq Joty, Yingbo Zhou, Caiming Xiong, Dragomir Radev, Rex Ying, Arman Cohan

    Abstract: Existing methods for understanding the capabilities of LLMs in logical reasoning rely on binary entailment classification or synthetically derived rationales, which are not sufficient for proper investigation of a model's capabilities. We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains for a set of realistic logical reasoning stories also written by human…

    Submitted 11 October, 2024; originally announced October 2024.

  12. arXiv:2410.07627  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    Automatic Curriculum Expert Iteration for Reliable LLM Reasoning

    Authors: Zirui Zhao, Hanze Dong, Amrita Saha, Caiming Xiong, Doyen Sahoo

    Abstract: Hallucinations (i.e., generating plausible but inaccurate content) and laziness (i.e., excessive refusals or defaulting to "I don't know") persist as major challenges in LLM reasoning. Current efforts to reduce hallucinations primarily focus on factual errors in knowledge-grounded tasks, often neglecting hallucinations related to faulty reasoning. Meanwhile, some approaches render LLMs overly conse…

    Submitted 10 October, 2024; originally announced October 2024.

    Comments: 20 pages

  13. arXiv:2410.04698  [pdf, other]

    cs.CL

    MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs

    Authors: Lei Wang, Shan Dong, Yuhui Xu, Hanze Dong, Yalu Wang, Amrita Saha, Ee-Peng Lim, Caiming Xiong, Doyen Sahoo

    Abstract: Recent large language models (LLMs) have demonstrated versatile capabilities in long-context scenarios. Although some recent benchmarks have been developed to evaluate the long-context capabilities of LLMs, there is a lack of benchmarks evaluating the mathematical reasoning abilities of LLMs over long contexts, which is crucial for LLMs' application in real-world scenarios. In this paper, we intro…

    Submitted 6 October, 2024; originally announced October 2024.

    Comments: Work-in-Progress

  14. arXiv:2410.03727  [pdf, other]

    cs.CL cs.AI cs.LG

    FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"

    Authors: Yifei Ming, Senthil Purushwalkam, Shrey Pandit, Zixuan Ke, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty

    Abstract: Ensuring faithfulness to context in large language models (LLMs) and retrieval-augmented generation (RAG) systems is crucial for reliable deployment in real-world applications, as incorrect or unsupported information can erode user trust. Despite advancements on standard benchmarks, faithfulness hallucination, where models generate responses misaligned with the provided context, remains a significan…

    Submitted 8 October, 2024; v1 submitted 30 September, 2024; originally announced October 2024.

  15. arXiv:2410.02108  [pdf, other]

    cs.CL

    ReGenesis: LLMs can Grow into Reasoning Generalists via Self-Improvement

    Authors: Xiangyu Peng, Congying Xia, Xinyi Yang, Caiming Xiong, Chien-Sheng Wu, Chen Xing

    Abstract: Post-training Large Language Models (LLMs) with explicit reasoning trajectories can enhance their reasoning abilities. However, acquiring such high-quality trajectory data typically demands meticulous supervision from humans or superior models, which can be either expensive or license-constrained. In this paper, we explore how far an LLM can improve its reasoning by self-synthesizing reasoning pat…

    Submitted 2 October, 2024; originally announced October 2024.

  16. arXiv:2409.14664  [pdf, other]

    cs.CL

    Direct Judgement Preference Optimization

    Authors: Peifeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, Shafiq Joty

    Abstract: Auto-evaluation is crucial for assessing response quality and offering feedback for model development. Recent studies have explored training large language models (LLMs) as generative judges to evaluate and critique other models' outputs. In this work, we investigate the idea of learning from both positive and negative data with preference optimization to enhance the evaluation capabilities of LLM…

    Submitted 29 September, 2024; v1 submitted 22 September, 2024; originally announced September 2024.

    Comments: Preprint

  17. IMOST: Incremental Memory Mechanism with Online Self-Supervision for Continual Traversability Learning

    Authors: Kehui Ma, Zhen Sun, Chaoran Xiong, Qiumin Zhu, Kewei Wang, Ling Pei

    Abstract: Traversability estimation is the foundation of path planning for a general navigation system. However, complex and dynamic environments pose challenges for the latest methods using self-supervised learning (SSL) techniques. Firstly, existing SSL-based methods generate sparse annotations lacking detailed boundary information. Secondly, their strategies focus on hard samples for rapid adaptation, lea…

    Submitted 21 September, 2024; originally announced September 2024.

  18. arXiv:2409.09381  [pdf, other]

    eess.AS cs.AI cs.SD

    Text Prompt is Not Enough: Sound Event Enhanced Prompt Adapter for Target Style Audio Generation

    Authors: Chenxu Xiong, Ruibo Fu, Shuchen Shi, Zhengqi Wen, Jianhua Tao, Tao Wang, Chenxing Li, Chunyu Qiang, Yuankun Xie, Xin Qi, Guanjun Li, Zizheng Yang

    Abstract: Current mainstream audio generation methods primarily rely on simple text prompts, often failing to capture the nuanced details necessary for multi-style audio generation. To address this limitation, the Sound Event Enhanced Prompt Adapter is proposed. Unlike traditional static global style transfer, this method extracts style embedding through cross-attention between text and reference audio for…

    Submitted 14 September, 2024; originally announced September 2024.

    Comments: 5 pages, 2 figures, submitted to ICASSP 2025

  19. arXiv:2409.04634  [pdf, other]

    quant-ph

    Mechanically-intermixed indium superconducting connections for microwave quantum interconnects

    Authors: Yves Martin, Neereja Sundaresan, Jae-woong Nah, Rachel Steiner, Marco Turchetti, Kevin Stawiasz, Chi Xiong, Jason S. Orcutt

    Abstract: Superconducting coaxial cables represent critical communication channels for interconnecting superconducting quantum processors. Here, we report mechanically-intermixed indium joins to aluminum coaxial cables for low loss quantum interconnects. We describe an ABCD matrix formalism to characterize the total resonator internal quality factor ($Q_i$) and any contact ($R_{cont}$) or shunt resistance (…

    Submitted 6 September, 2024; originally announced September 2024.

    Comments: 6 pages, 5 figures

  20. arXiv:2409.03215  [pdf, other]

    cs.CL cs.AI cs.LG

    xLAM: A Family of Large Action Models to Empower AI Agent Systems

    Authors: Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, Zhiwei Liu, Yihao Feng, Tulika Awalgaonkar, Rithesh Murthy, Eric Hu, Zeyuan Chen, Ran Xu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong

    Abstract: Autonomous agents powered by large language models (LLMs) have attracted significant research interest. However, the open-source community faces many challenges in developing specialized models for agent tasks, driven by the scarcity of high-quality agent datasets and the absence of standard protocols in this area. We introduce and publicly release xLAM, a series of large action models designed fo…

    Submitted 4 September, 2024; originally announced September 2024.

    Comments: Technical report for the Salesforce xLAM model series

  21. arXiv:2408.12590  [pdf, other]

    cs.CV cs.AI

    xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

    Authors: Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, Senthil Purushwalkam, Le Xue, Yingbo Zhou, Huan Wang, Silvio Savarese, Juan Carlos Niebles, Zeyuan Chen, Ran Xu, Caiming Xiong

    Abstract: We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of vi…

    Submitted 31 August, 2024; v1 submitted 22 August, 2024; originally announced August 2024.

    Comments: Accepted by ECCV24 AI4VA

  22. arXiv:2408.10853  [pdf, other]

    cs.SD cs.AI eess.AS

    Does Current Deepfake Audio Detection Model Effectively Detect ALM-based Deepfake Audio?

    Authors: Yuankun Xie, Chenxu Xiong, Xiaopeng Wang, Zhiyong Wang, Yi Lu, Xin Qi, Ruibo Fu, Yukun Liu, Zhengqi Wen, Jianhua Tao, Guanjun Li, Long Ye

    Abstract: Currently, Audio Language Models (ALMs) are rapidly advancing due to the developments in large language models and audio neural codecs. These ALMs have significantly lowered the barrier to creating deepfake audio, generating highly realistic and diverse types of deepfake audio, which pose severe threats to society. Consequently, effective audio deepfake detection technologies to detect ALM-based a…

    Submitted 20 August, 2024; originally announced August 2024.

  23. arXiv:2408.08872  [pdf, other]

    cs.CV cs.AI cs.CL

    xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

    Authors: Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, et al. (2 additional authors not shown)

    Abstract: This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tas…

    Submitted 28 August, 2024; v1 submitted 16 August, 2024; originally announced August 2024.

  24. arXiv:2408.07060  [pdf, other]

    cs.SE cs.AI cs.CL cs.LG

    Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents

    Authors: Kexun Zhang, Weiran Yao, Zuxin Liu, Yihao Feng, Zhiwei Liu, Rithesh Murthy, Tian Lan, Lei Li, Renze Lou, Jiacheng Xu, Bo Pang, Yingbo Zhou, Shelby Heinecke, Silvio Savarese, Huan Wang, Caiming Xiong

    Abstract: Large language model (LLM) agents have shown great potential in solving real-world software engineering (SWE) problems. The most advanced open-source SWE agent can resolve over 27% of real GitHub issues in SWE-Bench Lite. However, these sophisticated agent frameworks exhibit varying strengths, excelling in certain tasks while underperforming in others. To fully harness the diversity of these agent…

    Submitted 13 August, 2024; originally announced August 2024.

  25. arXiv:2408.06810  [pdf, other]

    cs.AR

    HLSPilot: LLM-based High-Level Synthesis

    Authors: Chenwei Xiong, Cheng Liu, Huawei Li, Xiaowei Li

    Abstract: Large language models (LLMs) have catalyzed an upsurge in automatic code generation, garnering significant attention for register transfer level (RTL) code generation. Despite the potential of RTL code generation with natural language, it remains error-prone and limited to relatively small modules because of the substantial semantic gap between natural language expressions and hardware design inte…

    Submitted 13 August, 2024; originally announced August 2024.

  26. arXiv:2408.00930  [pdf, other]

    cs.LG cs.AI

    Enabling High Data Throughput Reinforcement Learning on GPUs: A Domain Agnostic Framework for Data-Driven Scientific Research

    Authors: Tian Lan, Huan Wang, Caiming Xiong, Silvio Savarese

    Abstract: We introduce WarpSci, a domain agnostic framework designed to overcome crucial system bottlenecks encountered in the application of reinforcement learning to intricate environments with vast datasets featuring high-dimensional observation or action spaces. Notably, our framework eliminates the need for data transfer between the CPU and GPU, enabling the concurrent execution of thousands of simulat…

    Submitted 1 August, 2024; originally announced August 2024.

  27. arXiv:2407.21364  [pdf, other]

    cs.IR

    Personalized Multi-task Training for Recommender System

    Authors: Liangwei Yang, Zhiwei Liu, Jianguo Zhang, Rithesh Murthy, Shelby Heinecke, Huan Wang, Caiming Xiong, Philip S. Yu

    Abstract: In the vast landscape of internet information, recommender systems (RecSys) have become essential for guiding users through a sea of choices aligned with their preferences. These systems have applications in diverse domains, such as news feeds, game suggestions, and shopping recommendations. Personalization is a key technique in RecSys, where modern methods leverage representation learning to enco…

    Submitted 31 July, 2024; originally announced July 2024.

    Comments: 11 pages

  28. arXiv:2407.21018  [pdf, other]

    cs.CL cs.AI

    ThinK: Thinner Key Cache by Query-Driven Pruning

    Authors: Yuhui Xu, Zhanming Jie, Hanze Dong, Lei Wang, Xudong Lu, Aojun Zhou, Amrita Saha, Caiming Xiong, Doyen Sahoo

    Abstract: Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. However, their increased computational and memory demands present significant challenges, especially when handling long sequences. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumptio…

    Submitted 2 October, 2024; v1 submitted 30 July, 2024; originally announced July 2024.

    Comments: 20 pages, 6 figures

  29. arXiv:2407.16604  [pdf, other]

    cs.CL

    Shared Imagination: LLMs Hallucinate Alike

    Authors: Yilun Zhou, Caiming Xiong, Silvio Savarese, Chien-Sheng Wu

    Abstract: Despite the recent proliferation of large language models (LLMs), their training recipes -- model architecture, pre-training data and optimization algorithm -- are often very similar. This naturally raises the question of the similarity among the resulting models. In this paper, we propose a novel setting, imaginary question answering (IQA), to better understand model similarity. In IQA, we ask on…

    Submitted 23 July, 2024; originally announced July 2024.

  30. arXiv:2407.15268  [pdf, other]

    cs.CL

    Fact-Aware Multimodal Retrieval Augmentation for Accurate Medical Radiology Report Generation

    Authors: Liwen Sun, James Zhao, Megan Han, Chenyan Xiong

    Abstract: Multimodal foundation models hold significant potential for automating radiology report generation, thereby assisting clinicians in diagnosing cardiac diseases. However, generated reports often suffer from serious factual inaccuracy. In this paper, we introduce a fact-aware multimodal retrieval-augmented pipeline in generating accurate radiology reports (FactMM-RAG). We first leverage RadGraph to…

    Submitted 21 July, 2024; originally announced July 2024.

  31. arXiv:2407.14933  [pdf, other]

    cs.CL cs.AI cs.LG

    Consent in Crisis: The Rapid Decline of the AI Data Commons

    Authors: Shayne Longpre, Robert Mahari, Ariel Lee, Campbell Lund, Hamidah Oderinwale, William Brannon, Nayan Saxena, Naana Obeng-Marnu, Tobin South, Cole Hunter, Kevin Klyman, Christopher Klamm, Hailey Schoelkopf, Nikhil Singh, Manuel Cherep, Ahmad Anis, An Dinh, Caroline Chitongo, Da Yin, Damien Sileo, Deividas Mataciunas, Diganta Misra, Emad Alghamdi, Enrico Shippole, Jianguo Zhang, et al. (24 additional authors not shown)

    Abstract: General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how co…

    Submitted 24 July, 2024; v1 submitted 20 July, 2024; originally announced July 2024.

    Comments: 41 pages (13 main), 5 figures, 9 tables

  32. arXiv:2407.12259  [pdf, other]

    cs.CL

    In-Context Probing Approximates Influence Function for Data Valuation

    Authors: Cathy Jiao, Gary Gao, Chenyan Xiong

    Abstract: Data valuation quantifies the value of training data, and is used for data attribution (i.e., determining the contribution of training data towards model predictions), and data selection; both of which are important for curating high-quality datasets to train large language models. In our paper, we show that data valuation through in-context probing (i.e., prompting an LLM) approximates influence f…

    Submitted 16 July, 2024; originally announced July 2024.

  33. arXiv:2407.10956  [pdf, other]

    cs.AI cs.CL

    Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

    Authors: Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongshen Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu

    Abstract: Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivit…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: 34 pages, 14 figures, 10 tables

  34. arXiv:2407.02518  [pdf, other]

    cs.SE cs.AI cs.CL cs.CR cs.MA cs.PL

    INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness

    Authors: Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, Doyen Sahoo

    Abstract: Large language models (LLMs) for code are typically trained to align with natural language instructions to closely follow their intentions and requirements. However, in many practical scenarios, it becomes increasingly challenging for these models to navigate the intricate boundary between helpfulness and safety, especially against highly complex yet potentially malicious instructions. In this wor…

    Submitted 23 June, 2024; originally announced July 2024.

  35. arXiv:2407.01370  [pdf, other]

    cs.CL

    Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems

    Authors: Philippe Laban, Alexander R. Fabbri, Caiming Xiong, Chien-Sheng Wu

    Abstract: LLMs and RAG systems are now capable of handling millions of input tokens or more. However, evaluating the output quality of such systems on long-context tasks remains challenging, as tasks like Needle-in-a-Haystack lack complexity. In this work, we argue that summarization can play a central role in such evaluation. We design a procedure to synthesize Haystacks of documents, ensuring that specifi…

    Submitted 1 July, 2024; originally announced July 2024.

  36. arXiv:2406.18518  [pdf, other]

    cs.CL cs.AI cs.LG cs.SE

    APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

    Authors: Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong

    Abstract: The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scal…

    Submitted 26 June, 2024; originally announced June 2024.

  37. arXiv:2406.11548  [pdf, other]

    cs.RO cs.AI cs.CV

    AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation

    Authors: Chuyan Xiong, Chengyu Shen, Xiaoqi Li, Kaichen Zhou, Jiaming Liu, Ruiping Wang, Hao Dong

    Abstract: The ability to reflect on and correct failures is crucial for robotic systems to interact stably with real-life objects. Observing the generalization and reasoning capabilities of Multimodal Large Language Models (MLLMs), previous approaches have aimed to utilize these models to enhance robotic systems accordingly. However, these methods typically focus on high-level planning corrections using an…

    Submitted 16 October, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  38. arXiv:2406.11271  [pdf, other]

    cs.CV cs.LG

    MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

    Authors: Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Kumar Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, Ran Xu, Yejin Choi, Ludwig Schmidt

    Abstract: Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs). Despite the rapid progression of open-source LMMs, there remains a pronounced scarcity of large-scale, diverse open-source multimodal interleaved datasets. In response, we introduce MINT-1T, the most extensive and diverse open-source Multimo…

    Submitted 19 September, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  39. arXiv:2406.10291  [pdf, other]

    cs.AI cs.CL cs.IR

    ResearchArena: Benchmarking LLMs' Ability to Collect and Organize Information as Research Agents

    Authors: Hao Kang, Chenyan Xiong

    Abstract: Large language models (LLMs) have exhibited remarkable performance across various tasks in natural language processing. Nevertheless, challenges still arise when these tasks demand domain-specific expertise and advanced analytical skills, such as conducting research surveys on a designated topic. In this research, we develop ResearchArena, a benchmark that measures LLM agents' ability to conduct a…

    Submitted 12 June, 2024; originally announced June 2024.

  40. arXiv:2406.10290  [pdf, other]

    cs.CL cs.AI cs.LG

    MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases

    Authors: Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese

    Abstract: The deployment of Large Language Models (LLMs) and Large Multimodal Models (LMMs) on mobile devices has gained significant attention due to the benefits of enhanced privacy, stability, and personalization. However, the hardware constraints of mobile devices necessitate the use of models with fewer parameters and model compression techniques like quantization. Currently, there is limited understand…

    Submitted 12 June, 2024; originally announced June 2024.

  41. arXiv:2406.09696  [pdf, other]

    eess.IV cs.CV

    MoME: Mixture of Multimodal Experts for Cancer Survival Prediction

    Authors: Conghao Xiong, Hao Chen, Hao Zheng, Dong Wei, Yefeng Zheng, Joseph J. Y. Sung, Irwin King

    Abstract: Survival analysis, as a challenging task, requires integrating Whole Slide Images (WSIs) and genomic data for comprehensive decision-making. There are two main challenges in this task: significant heterogeneity and complex inter- and intra-modal interactions between the two modalities. Previous approaches utilize co-attention methods, which fuse features from both modalities only once after separa…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 8 + 1/2 pages, early accepted to MICCAI2024

  42. arXiv:2406.06046  [pdf, other]

    cs.CL cs.LG

    MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models

    Authors: Zichun Yu, Spandan Das, Chenyan Xiong

    Abstract: Pretraining data selection has the potential to improve language model pretraining efficiency by utilizing higher-quality data from massive web data corpora. Current data selection methods, which rely on either hand-crafted rules or larger reference models, are conducted statically and do not capture the evolving data preferences during pretraining. In this paper, we introduce model-aware data sel…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: The code is open-sourced at https://github.com/cxcscmu/MATES

  43. arXiv:2406.04975  [pdf, other]

    cs.LG cs.AI

    UniTST: Effectively Modeling Inter-Series and Intra-Series Dependencies for Multivariate Time Series Forecasting

    Authors: Juncheng Liu, Chenghao Liu, Gerald Woo, Yiwei Wang, Bryan Hooi, Caiming Xiong, Doyen Sahoo

    Abstract: Transformer-based models have emerged as powerful tools for multivariate time series forecasting (MTSF). However, existing Transformer models often fall short of capturing both intricate dependencies across variate and temporal dimensions in MTS data. Some recent models are proposed to separately capture variate and temporal dependencies through either two sequential or parallel attention mechanis…

    Submitted 7 June, 2024; originally announced June 2024.

  44. arXiv:2405.20099  [pdf, other]

    cs.CR

    Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks

    Authors: Chen Xiong, Xiangyu Qi, Pin-Yu Chen, Tsung-Yi Ho

    Abstract: Safety, security, and compliance are essential requirements when aligning large language models (LLMs). However, many seemingly aligned LLMs are soon shown to be susceptible to jailbreak attacks. These attacks aim to circumvent the models' safety guardrails and security mechanisms by introducing jailbreak prompts into malicious queries. In response to these challenges, this paper introduces Defens…

    Submitted 30 May, 2024; originally announced May 2024.

  45. arXiv:2405.17418  [pdf, other]

    cs.CV

    Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation

    Authors: Jiaming Liu, Chenxuan Li, Guanqun Wang, Lily Lee, Kaichen Zhou, Sixiang Chen, Chuyan Xiong, Jiaxin Ge, Renrui Zhang, Shanghang Zhang

    Abstract: Robot manipulation policies have shown unsatisfactory action performance when confronted with novel task or object instances. Hence, the capability to automatically detect and self-correct failure action is essential for a practical robotic system. Recently, Multimodal Large Language Models (MLLMs) have shown promise in visual instruction following and demonstrated strong reasoning abilities in va…

    Submitted 27 May, 2024; originally announced May 2024.

  46. arXiv:2405.15638  [pdf, other]

    cs.CV cs.CL

    M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models

    Authors: Hongyu Wang, Jiayu Xu, Senwei Xie, Ruiping Wang, Jialin Li, Zhaojie Xie, Bin Zhang, Chuyan Xiong, Xilin Chen

    Abstract: Multilingual multimodal reasoning is a core component in achieving human-level intelligence. However, most existing benchmarks for multilingual multimodal reasoning struggle to differentiate between models of varying performance; even language models without visual capabilities can easily achieve high scores. This leaves a comprehensive evaluation of leading multilingual multimodal models largely…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: Work in progress

  47. arXiv:2405.09475  [pdf, other]

    gr-qc astro-ph.CO astro-ph.IM hep-ph

    Robust inference of gravitational wave source parameters in the presence of noise transients using normalizing flows

    Authors: Chun-Yu Xiong, Tian-Yang Sun, Jing-Fei Zhang, Xin Zhang

    Abstract: Gravitational wave (GW) detection is of paramount importance in fundamental physics and GW astronomy, yet it presents formidable challenges. One significant challenge is the removal of noise transient artifacts known as "glitches," which greatly impact the search and identification of GWs. Recent research has achieved remarkable results in data denoising, often using effective modeling methods to…

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: 13 pages, 9 figures

  48. arXiv:2405.07863  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    RLHF Workflow: From Reward Modeling to Online RLHF

    Authors: Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang

    Abstract: We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill i…

    Submitted 12 June, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

  49. MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

    Authors: Qi Chen, Xiubo Geng, Corby Rosset, Carolyn Buractaon, Jingwen Lu, Tao Shen, Kun Zhou, Chenyan Xiong, Yeyun Gong, Paul Bennett, Nick Craswell, Xing Xie, Fan Yang, Bryan Tower, Nikhil Rao, Anlei Dong, Wenqi Jiang, Zheng Liu, Mingqin Li, Chuanjie Liu, Zengzhong Li, Rangan Majumder, Jennifer Neville, Andy Oakley, Knut Magne Risvik, et al. (6 additional authors not shown)

    Abstract: Recent breakthroughs in large models have highlighted the critical significance of data scale, labels and modals. In this paper, we introduce MS MARCO Web Search, the first large-scale information-rich web dataset, featuring millions of real clicked query-document labels. This dataset closely mimics real-world web document and query distribution, provides rich information for various kinds of down…

    Submitted 13 May, 2024; originally announced May 2024.

    Comments: 10 pages, 6 figures, for associated dataset, see http://github.com/microsoft/MS-MARCO-Web-Search

  50. arXiv:2405.02826  [pdf, other]

    cs.CR

    Nip in the Bud: Forecasting and Interpreting Post-exploitation Attacks in Real-time through Cyber Threat Intelligence Reports

    Authors: Tiantian Zhu, Jie Ying, Tieming Chen, Chunlin Xiong, Wenrui Cheng, Qixuan Yuan, Aohan Zheng, Mingqi Lv, Yan Chen

    Abstract: Advanced Persistent Threat (APT) attacks have caused significant damage worldwide. Various Endpoint Detection and Response (EDR) systems are deployed by enterprises to fight against potential threats. However, EDR suffers from high false positives. In order not to affect normal operations, analysts need to investigate and filter detection results before taking countermeasures, in which heavy manua…

    Submitted 5 May, 2024; originally announced May 2024.