Attention Is All You Need for LLM-based Code Vulnerability Localization

Authors: Yue Li, Xiao Li, Hao Wu, Yue Zhang, Xiuzhen Cheng, Sheng Zhong, Fengyuan Xu

Abstract: The rapid expansion of software systems and the growing number of reported vulnerabilities have emphasized the importance of accurately identifying vulnerable code segments. Traditional methods for vulnerability localization, such as manual code audits or rule-based tools, are often time-consuming and limited in scope, typically focusing on specific programming languages or types of vulnerabilities. In recent years, the introduction of large language models (LLMs) such as GPT and LLaMA has opened new possibilities for automating vulnerability detection. However, while LLMs show promise in this area, they face challenges, particularly in maintaining accuracy over longer code contexts. This paper introduces LOVA, a novel framework leveraging the self-attention mechanisms inherent in LLMs to enhance vulnerability localization. Our key insight is that self-attention mechanisms assign varying importance to different parts of the input, making it possible to track how much attention the model focuses on specific lines of code. In the context of vulnerability localization, the hypothesis is that vulnerable lines of code will naturally attract higher attention weights because they have a greater influence on the model's output. By systematically tracking changes in attention weights and focusing on specific lines of code, LOVA improves the precision of identifying vulnerable lines across various programming languages. Through rigorous experimentation and evaluation, we demonstrate that LOVA significantly outperforms existing LLM-based approaches, achieving up to a 5.3x improvement in F1-scores. LOVA also demonstrated strong scalability, with up to a 14.6x improvement in smart contract vulnerability localization across languages like C, Python, Java, and Solidity. Its robustness was proven through consistent performance across different LLM architectures.

Submitted 20 October, 2024; originally announced October 2024.
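
The abstract's hypothesis, that vulnerable lines attract more attention, can be illustrated with a toy ranking step. The following is only a minimal sketch and not LOVA's actual procedure: the function name `rank_lines_by_attention`, the per-token weights, and the token-to-line mapping are all hypothetical inputs assumed to have been extracted from a model beforehand.

```python
def rank_lines_by_attention(token_weights, token_lines):
    """Sum per-token attention weights by source line and rank lines, highest first.

    token_weights: hypothetical attention weight for each input token.
    token_lines:   the source line number each token belongs to (same length).
    """
    totals = {}
    for weight, line in zip(token_weights, token_lines):
        totals[line] = totals.get(line, 0.0) + weight
    # Lines with the largest accumulated attention come first.
    return sorted(totals, key=lambda line: totals[line], reverse=True)


# Four tokens spread over three lines; line 2 receives the most attention.
ranking = rank_lines_by_attention([0.1, 0.2, 0.5, 0.2], [1, 1, 2, 3])
print(ranking)
```

Summation is just one aggregation choice; a mean or maximum per line would avoid favoring long lines.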

arXiv:2410.14993 [pdf, other]

Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling

Authors: Hao Wu, Donglin Bai, Shiqi Jiang, Qianxi Zhang, Yifan Yang, Ting Cao, Fengyuan Xu

Abstract: Video understanding has become increasingly important with the rise of multi-modality applications. Understanding continuous video poses considerable challenges due to the fast expansion of streaming video, which contains multi-scale and untrimmed events. We introduce a novel system, C-VUE, to overcome these issues through adaptive state modeling. C-VUE has three key designs. The first is a long-range history modeling technique that uses a video-aware approach to retain historical video information. The second is a spatial redundancy reduction technique, which enhances the efficiency of history modeling based on temporal relations. The third is a parallel training structure that incorporates the frame-weighted loss to understand multi-scale events in long videos. Our C-VUE offers high accuracy and efficiency. It runs at speeds >30 FPS on typical edge devices and outperforms all baselines in accuracy. Moreover, applying C-VUE to a video foundation model as a video encoder in our case study resulted in a 0.46-point enhancement (on a 5-point scale) on the in-distribution dataset, and an improvement ranging from 1.19% to 4% on zero-shot datasets.

Submitted 19 October, 2024; originally announced October 2024.

arXiv:2410.14442 [pdf, other]

A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference

Authors: You Wu, Haoyi Wu, Kewei Tu

Abstract: Recently, sharing key-value (KV) cache across layers has been found effective in efficient inference of large language models (LLMs). To systematically investigate different techniques of cross-layer KV sharing, we propose a unified framework that covers several recent methods and their novel variants. We conduct comprehensive experiments on all the configurations of the framework, evaluating their generation throughput and performance in language modeling and downstream tasks. We find that when reducing the size of the KV cache by 2x, most configurations can achieve competitive performance to and higher throughput than standard transformers, but when further reducing the size of the KV cache, pairing queries of all layers with KVs of upper layers can better maintain performance, although it also introduces additional training cost and prefilling latency. We hope that this work will help users choose the appropriate approach according to their requirements and facilitate research on the acceleration of LLM inference.

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.13798 [pdf, other]

Learning Graph Quantized Tokenizers for Transformers

Authors: Limei Wang, Kaveh Hassani, Si Zhang, Dongqi Fu, Baichuan Yuan, Weilin Cong, Zhigang Hua, Hao Wu, Ning Yao, Bo Long

Abstract: Transformers serve as the backbone architectures of Foundational Models, where a domain-specific tokenizer helps them adapt to various domains. Graph Transformers (GTs) have recently emerged as a leading model in geometric deep learning, outperforming Graph Neural Networks (GNNs) in various graph learning tasks. However, the development of tokenizers for graphs has lagged behind other modalities, with existing approaches relying on heuristics or GNNs co-trained with Transformers. To address this, we introduce GQT (Graph Quantized Tokenizer), which decouples tokenizer training from Transformer training by leveraging multi-task graph self-supervised learning, yielding robust and generalizable graph tokens. Furthermore, the GQT utilizes Residual Vector Quantization (RVQ) to learn hierarchical discrete tokens, resulting in significantly reduced memory requirements and improved generalization capabilities. By combining the GQT with token modulation, a Transformer encoder achieves state-of-the-art performance on 16 out of 18 benchmarks, including large-scale homophilic and heterophilic datasets. The code is available at: https://github.com/limei0307/graph-tokenizer

Submitted 17 October, 2024; originally announced October 2024.

arXiv:2410.13441 [pdf, other]

Instruction-Driven Game Engine: A Poker Case Study

Authors: Hongqiu Wu, Xingyuan Liu, Yan Wang, Hai Zhao

Abstract: The Instruction-Driven Game Engine (IDGE) project aims to democratize game development by enabling a large language model (LLM) to follow free-form game descriptions and generate game-play processes. The IDGE allows users to create games simply by natural language instructions, which significantly lowers the barrier for game development. We approach the learning process for IDGEs as a Next State Prediction task, wherein the model autoregressively predicts the game states given player actions. The computation of game states must be precise; otherwise, slight errors could corrupt the game-play experience. This is challenging because of the gap between stability and diversity. To address this, we train the IDGE in a curriculum manner that progressively increases its exposure to complex scenarios. Our initial progress lies in developing an IDGE for Poker, which not only supports a wide range of poker variants but also allows for highly individualized new poker games through natural language inputs. This work lays the groundwork for future advancements in transforming how games are created and played.

Submitted 17 October, 2024; originally announced October 2024.

Comments: EMNLP 2024 Demo. arXiv admin note: substantial text overlap with arXiv:2404.00276

arXiv:2410.12425 [pdf, other]

Perseus: Leveraging Common Data Patterns with Curriculum Learning for More Robust Graph Neural Networks

Authors: Kaiwen Xia, Huijun Wu, Duanyu Li, Min Xie, Ruibo Wang, Wenzhe Zhang

Abstract: Graph Neural Networks (GNNs) excel at handling graph data but remain vulnerable to adversarial attacks. Existing defense methods typically rely on assumptions like graph sparsity and homophily to either preprocess the graph or guide structure learning. However, preprocessing methods often struggle to accurately distinguish between normal edges and adversarial perturbations, leading to suboptimal results due to the loss of valuable edge information. Robust graph neural network models train directly on graph data affected by adversarial perturbations, without preprocessing. This can cause the model to get stuck in poor local optima, negatively affecting its performance. To address these challenges, we propose Perseus, a novel adversarial defense method based on curriculum learning. Perseus assesses edge difficulty using global homophily and applies a curriculum learning strategy to adjust the learning order, guiding the model to learn the full graph structure while adaptively focusing on common data patterns. This approach mitigates the impact of adversarial perturbations. Experiments show that models trained with Perseus achieve superior performance and are significantly more robust to adversarial attacks.

Submitted 16 October, 2024; originally announced October 2024.

arXiv:2410.12307 [pdf, other]

DAT: Improving Adversarial Robustness via Generative Amplitude Mix-up in Frequency Domain

Authors: Fengpeng Li, Kemou Li, Haiwei Wu, Jinyu Tian, Jiantao Zhou

Abstract: To protect deep neural networks (DNNs) from adversarial attacks, adversarial training (AT) is developed by incorporating adversarial examples (AEs) into model training. Recent studies show that adversarial attacks disproportionately impact the patterns within the phase of the sample's frequency spectrum -- typically containing crucial semantic information -- more than those in the amplitude, resulting in the model's erroneous categorization of AEs. We find that, by mixing the amplitude of training samples' frequency spectrum with those of distractor images for AT, the model can be guided to focus on phase patterns unaffected by adversarial perturbations. As a result, the model's robustness can be improved. Unfortunately, it is still challenging to select appropriate distractor images, which should mix the amplitude without affecting the phase patterns. To this end, in this paper, we propose an optimized Adversarial Amplitude Generator (AAG) to achieve a better tradeoff between improving the model's robustness and retaining phase patterns. Based on this generator, together with an efficient AE production procedure, we design a new Dual Adversarial Training (DAT) strategy. Experiments on various datasets show that our proposed DAT leads to significantly improved robustness against diverse adversarial attacks.

Submitted 16 October, 2024; originally announced October 2024.

Journal ref: NeurIPS 2024

arXiv:2410.12130 [pdf, other]

Iter-AHMCL: Alleviate Hallucination for Large Language Model via Iterative Model-level Contrastive Learning

Authors: Huiwen Wu, Xiaohan Li, Xiaogang Xu, Jiafei Wu, Deyi Zhang, Zhe Liu

Abstract: The development of Large Language Models (LLMs) has significantly advanced various AI applications in commercial and scientific research fields, such as scientific literature summarization, writing assistance, and knowledge graph construction. However, a significant challenge is the high risk of hallucination during LLM inference, which can lead to security concerns like factual inaccuracies, inconsistent information, and fabricated content. To tackle this issue, it is essential to develop effective methods for reducing hallucination while maintaining the original capabilities of the LLM. This paper introduces a novel approach called Iterative Model-level Contrastive Learning (Iter-AHMCL) to address hallucination. This method modifies the representation layers of pre-trained LLMs by using contrastive 'positive' and 'negative' models, trained on data with and without hallucinations. By leveraging the differences between these two models, we create a more straightforward pathway to eliminate hallucinations, and the iterative nature of contrastive learning further enhances performance. Experimental validation on four pre-trained foundation LLMs (LLaMA2, Alpaca, LLaMA3, and Qwen) finetuning with a specially designed dataset shows that our approach achieves an average improvement of 10.1 points on the TruthfulQA benchmark. Comprehensive experiments demonstrate the effectiveness of Iter-AHMCL in reducing hallucination while maintaining the general capabilities of LLMs.

Submitted 15 October, 2024; originally announced October 2024.

arXiv:2410.12122 [pdf, ps, other]

Explicit Representatives and Sizes of Cyclotomic Cosets and their Application to Cyclic Codes over Finite Fields

Authors: Li Zhu, Jinle Liu, Hongfeng Wu

Abstract: Cyclotomic coset is a basic notion which has wide application in various computation problems. Let $q$ be a prime power, and $n$ be a positive integer coprime to $q$. In this paper we determine explicitly the representatives and the sizes of all $q$-cyclotomic cosets modulo $n$ in the general settings. Instead of the $q$-cyclotomic cosets modulo a fixed integer, we consider the profinite spaces of compatible sequences of $q$-cyclotomic cosets modulo $2^{N}n^{\prime}$ for $N \geq 0$, where $n^{\prime}$ is the maximal odd divisor of $n$, with a fixed leader component. We give precise characterization of the structure of these profinite spaces, which reveals the general formula for representatives of cyclotomic cosets. As applications, we determine the generator polynomials of all cyclic codes over finite fields, and further enumerate and depict the self-dual cyclic codes.

Submitted 15 October, 2024; originally announced October 2024.

Comments: 30 pages
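
For readers unfamiliar with the objects in the abstract above: the $q$-cyclotomic coset of $a$ modulo $n$ is the set $\{a, aq, aq^2, \dots\} \bmod n$. The sketch below enumerates these cosets by brute force, which is the standard definition rather than the explicit formulas the paper derives; the function name `cyclotomic_cosets` is chosen here for illustration.

```python
from math import gcd


def cyclotomic_cosets(q, n):
    """Enumerate the q-cyclotomic cosets modulo n by repeated multiplication by q.

    Each coset is listed starting from its smallest element (its leader).
    Requires gcd(q, n) == 1 so that multiplication by q permutes Z/nZ.
    """
    if gcd(q, n) != 1:
        raise ValueError("q and n must be coprime")
    seen = set()
    cosets = []
    for leader in range(n):
        if leader in seen:
            continue
        coset = []
        x = leader
        # Follow the orbit leader -> leader*q -> leader*q^2 -> ... until it closes.
        while x not in seen:
            seen.add(x)
            coset.append(x)
            x = (x * q) % n
        cosets.append(coset)
    return cosets


print(cyclotomic_cosets(2, 7))  # [[0], [1, 2, 4], [3, 6, 5]]
```

The coset sizes always sum to $n$, since the cosets partition the residues modulo $n$.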

arXiv:2410.11986 [pdf, ps, other]

Age-of-Gradient Updates for Federated Learning over Random Access Channels

Authors: Yu Heng Wu, Houman Asgari, Stefano Rini, Andrea Munari

Abstract: This paper studies the problem of federated training of a deep neural network (DNN) over a random access channel (RACH) such as in computer networks, wireless networks, and cellular systems. More precisely, a set of remote users participate in training a centralized DNN model using SGD under the coordination of a parameter server (PS). The local model updates are transmitted from the remote users to the PS over a RACH using a slotted ALOHA protocol. The PS collects the updates from the remote users, accumulates them, and sends central model updates to the users at regular time intervals. We refer to this setting as the RACH-FL setting. The RACH-FL setting crucially addresses the problem of jointly designing a (i) client selection and (ii) gradient compression strategy which addresses the communication constraints between the remote users and the PS when transmission occurs over a RACH. For the RACH-FL setting, we propose a policy, which we term the "age-of-gradient" (AoG) policy in which (i) gradient sparsification is performed using top-K sparsification, (ii) the error correction is performed using memory accumulation, and (iii) the slot transmission probability is obtained by comparing the current local memory magnitude minus the magnitude of the gradient update to a threshold. Intuitively, the AoG measure of "freshness" of the memory state is reminiscent of the concept of age-of-information (AoI) in the context of communication theory and provides a rather natural interpretation of this policy. Numerical simulations show the superior performance of the AoG policy as compared to other RACH-FL policies.

Submitted 15 October, 2024; originally announced October 2024.
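
Components (i) and (ii) of the AoG policy, top-K sparsification with memory accumulation, follow a well-known pattern that can be sketched generically. The code below is an illustration under assumptions, not the paper's implementation: the name `top_k_with_memory` and the plain-list representation are hypothetical, and the slot transmission rule (iii) is omitted because the abstract does not specify it in enough detail.

```python
def top_k_with_memory(gradient, memory, k):
    """Top-K sparsification with error feedback (memory accumulation).

    The new gradient is added to the residual memory, the k entries of largest
    magnitude are selected for transmission, and everything not sent is kept
    in memory for a later round.
    """
    accumulated = [m + g for m, g in zip(memory, gradient)]
    by_magnitude = sorted(range(len(accumulated)),
                          key=lambda i: abs(accumulated[i]), reverse=True)
    keep = set(by_magnitude[:k])
    sparse = [v if i in keep else 0.0 for i, v in enumerate(accumulated)]
    new_memory = [0.0 if i in keep else v for i, v in enumerate(accumulated)]
    return sparse, new_memory


# One round with k = 2: the two largest-magnitude entries are sent,
# the rest stays in memory.
sparse, memory = top_k_with_memory([0.1, -2.0, 0.5, 3.0], [0.0] * 4, k=2)
print(sparse)  # [0.0, -2.0, 0.0, 3.0]
print(memory)  # [0.1, 0.0, 0.5, 0.0]
```

Because unsent entries accumulate in memory, small gradient components are delayed rather than discarded.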

arXiv:2410.11766 [pdf, other]

DPD-NeuralEngine: A 22-nm 6.6-TOPS/W/mm$^2$ Recurrent Neural Network Accelerator for Wideband Power Amplifier Digital Pre-Distortion

Authors: Ang Li, Haolin Wu, Yizhuo Wu, Qinyu Chen, Leo C. N. de Vreede, Chang Gao

Abstract: The increasing adoption of Deep Neural Network (DNN)-based Digital Pre-distortion (DPD) in modern communication systems necessitates efficient hardware implementations. This paper presents DPD-NeuralEngine, an ultra-fast, tiny-area, and power-efficient DPD accelerator based on a Gated Recurrent Unit (GRU) neural network (NN). Leveraging a co-designed software and hardware approach, our 22 nm CMOS implementation operates at 2 GHz, capable of processing I/Q signals up to 250 MSps. Experimental results demonstrate a throughput of 256.5 GOPS and power efficiency of 1.32 TOPS/W with DPD linearization performance measured in Adjacent Channel Power Ratio (ACPR) of -45.3 dBc and Error Vector Magnitude (EVM) of -39.8 dB. To our knowledge, this work represents the first AI-based DPD application-specific integrated circuit (ASIC) accelerator, achieving a power-area efficiency (PAE) of 6.6 TOPS/W/mm$^2$.

Submitted 15 October, 2024; originally announced October 2024.

Comments: 5 pages, 5 figures

arXiv:2410.11419 [pdf, other]

doi 10.1145/3680528.3687576

GS^3: Efficient Relighting with Triple Gaussian Splatting

Authors: Zoubin Bi, Yixin Zeng, Chong Zeng, Fan Pei, Xiang Feng, Kun Zhou, Hongzhi Wu

Abstract: We present a spatial and angular Gaussian based representation and a triple splatting process, for real-time, high-quality novel lighting-and-view synthesis from multi-view point-lit input images. To describe complex appearance, we employ a Lambertian plus a mixture of angular Gaussians as an effective reflectance function for each spatial Gaussian. To generate self-shadow, we splat all spatial Gaussians towards the light source to obtain shadow values, which are further refined by a small multi-layer perceptron. To compensate for other effects like global illumination, another network is trained to compute and add a per-spatial-Gaussian RGB tuple. The effectiveness of our representation is demonstrated on 30 samples with a wide variation in geometry (from solid to fluffy) and appearance (from translucent to anisotropic), as well as using different forms of input data, including rendered images of synthetic/reconstructed objects, photographs captured with a handheld camera and a flash, or from a professional lightstage. We achieve a training time of 40-70 minutes and a rendering speed of 90 fps on a single commodity GPU. Our results compare favorably with state-of-the-art techniques in terms of quality/performance. Our code and data are publicly available at https://GSrelight.github.io/.

Submitted 15 October, 2024; originally announced October 2024.

Comments: Accepted to SIGGRAPH Asia 2024. Project page: https://gsrelight.github.io/

Journal ref: ACM SIGGRAPH Asia 2024 Conference Papers

arXiv:2410.10775 [pdf, other]

Browsing without Third-Party Cookies: What Do You See?

Authors: Maxwell Lin, Shihan Lin, Helen Wu, Karen Wang, Xiaowei Yang

Abstract: Third-party web cookies are often used for privacy-invasive behavior tracking. Partly due to privacy concerns, browser vendors have started to block all third-party cookies in recent years. To understand the effects of such third-party cookieless browsing, we crawled and measured the top 10,000 Tranco websites. We developed a framework to remove third-party cookies and analyze the differences between the appearance of web pages with and without these cookies. We find that disabling third-party cookies has no substantial effect on website appearance including layouts, text, and images. This validates the industry-wide shift towards cookieless browsing as a way to protect user privacy without compromising on the user experience.

Submitted 14 October, 2024; originally announced October 2024.

Comments: To appear in IMC '24

arXiv:2410.10338 [pdf, other]

On Efficient Topology Management in Service-Oriented 6G Networks: An Edge Video Distribution Case Study

Authors: Zied Ennaceur, Mounir Bensalem, Admela Jukan, Claus Keuker, Huanzhuo Wu, Rastin Pries

Abstract: An efficient topology management in future 6G networks is one of the fundamental challenges for a dynamic network creation based on location services, whereby each autonomous network entity, i.e., a sub-network, can be created for a specific application scenario. In this paper, we study the performance of a novel topology changes management system in a sample 6G network being dynamically organized in autonomous sub-networks. We propose and analyze an algorithm for intelligent prediction of topology changes and provide a comparative analysis with topology monitoring based approach. To this end, we present an industrially relevant case study on edge video distribution, as it is envisioned to be implemented in line with the 3GPP and ETSI MEC (Multi-access Edge Computing) standards. For changes prediction, we implement and analyze a novel topology change prediction algorithm, which can automatically optimize, train and, finally, select the best of different machine learning models available, based on the specific scenario under study. For link change scenario, the results show that three selected ML models exhibit high accuracy in detecting changes in link delay and bandwidth using measured throughput and RTT. ANN demonstrates the best performance in identifying cases with no changes, slightly outperforming random forest and XGBoost. For user mobility scenario, XGBoost is more efficient in learning patterns for topology change prediction while delivering much faster results compared to the more computationally demanding deep learning models, such as LSTM and CNN. In terms of cost efficiency, our ML-based approach represents a significantly cost-effective alternative to traditional monitoring approaches.

Submitted 14 October, 2024; originally announced October 2024.

arXiv:2410.10260 [pdf, other]

Slide-based Graph Collaborative Training for Histopathology Whole Slide Image Analysis

Authors: Jun Shi, Tong Shu, Zhiguo Jiang, Wei Wang, Haibo Wu, Yushan Zheng

Abstract: The development of computational pathology lies in the consensus that pathological characteristics of tumors are significant guidance for cancer diagnostics. Most existing research focuses on the inner-contextual information within each WSI yet ignores the possible inter-correlations between slides. As the development of tumors is a continuous process involving a series of histological, morphological, and genetic changes that accumulate over time, the similarities and differences between WSIs across various stages, grades, locations and patients should potentially contribute to the representation of WSIs and deserve to be taken into account in WSI modeling. To verify the advancement of introducing the slide inter-correlations into the representation learning of WSIs, we proposed a generic WSI analysis pipeline SlideGCD that can be adapted to any existing Multiple Instance Learning (MIL) frameworks and improve their performance. With the new paradigm, the prior knowledge of cancer development can participate in the end-to-end workflow, which concurrently initializes and refines the slide representation, as a guide for message passing in the slide-based graph. Extensive comparisons and experiments are conducted to validate the effectiveness and robustness of the proposed pipeline across 4 different tasks, including cancer subtyping, cancer staging, survival prediction, and gene mutation prediction, with 7 representative SOTA WSI analysis frameworks as backbones.

Submitted 14 October, 2024; originally announced October 2024.

arXiv:2410.09738 [pdf]

Can Large Language Models Generate Geospatial Code?

Authors: Shuyang Hou, Zhangxiao Shen, Jianyuan Liang, Anqi Zhao, Zhipeng Gui, Rui Li, Huayi Wu

Abstract: With the growing demand for spatiotemporal data processing and geospatial modeling, automating geospatial code generation has become essential for productivity. Large language models (LLMs) show promise in code generation but face challenges like domain-specific knowledge gaps and "coding hallucinations." This paper introduces GeoCode-Eval (GCE), a framework for assessing LLMs' ability to generate geospatial code across three dimensions: "Cognition and Memory," "Comprehension and Interpretation," and "Innovation and Creation," distributed across eight capability levels. We developed a benchmark dataset, GeoCode-Bench, consisting of 5,000 multiple-choice, 1,500 fill-in-the-blank, 1,500 true/false questions, and 1,000 subjective tasks covering code summarization, generation, completion, and correction. Using GeoCode-Bench, we evaluated three commercial closed-source LLMs, four open-source general-purpose LLMs, and 14 specialized code generation models. We also conducted experiments on few-shot and zero-shot learning, Chain of Thought reasoning, and multi-round majority voting to measure their impact on geospatial code generation. Additionally, we fine-tuned the Code LLaMA-7B model using Google Earth Engine-related JavaScript, creating GEECode-GPT, and evaluated it on subjective tasks. Results show that constructing pre-training and instruction datasets significantly improves code generation, offering insights for optimizing LLMs in specific domains.

Submitted 17 October, 2024; v1 submitted 13 October, 2024; originally announced October 2024.

arXiv:2410.07901 [pdf, other]

Semi-Supervised Video Desnowing Network via Temporal Decoupling Experts and Distribution-Driven Contrastive Regularization

Authors: Hongtao Wu, Yijun Yang, Angelica I Aviles-Rivero, Jingjing Ren, Sixiang Chen, Haoyu Chen, Lei Zhu

Abstract: Snow degradations present formidable challenges to the advancement of computer vision tasks by the undesirable corruption in outdoor scenarios. While current deep learning-based desnowing approaches achieve success on synthetic benchmark datasets, they struggle to restore out-of-distribution real-world snowy videos due to the deficiency of paired real-world training data. To address this bottleneck, we devise a new paradigm for video desnowing in a semi-supervised spirit to involve unlabeled real data for generalizable snow removal. Specifically, we construct a real-world dataset with 85 snowy videos, and then present a Semi-supervised Video Desnowing Network (SemiVDN) equipped with a novel Distribution-driven Contrastive Regularization. The elaborated contrastive regularization mitigates the distribution gap between the synthetic and real data, and consequently maintains the desired snow-invariant background details. Furthermore, based on the atmospheric scattering model, we introduce a Prior-guided Temporal Decoupling Experts module to decompose the physical components that make up a snowy video in a frame-correlated manner. We evaluate our SemiVDN on benchmark datasets and the collected real snowy data. The experimental results demonstrate the superiority of our approach over state-of-the-art image- and video-level desnowing methods.

Submitted 10 October, 2024; originally announced October 2024.

arXiv:2410.07592 [pdf, other]

Diversified and Adaptive Negative Sampling on Knowledge Graphs

Authors: Ran Liu, Zhongzhou Liu, Xiaoli Li, Hao Wu, Yuan Fang

Abstract: In knowledge graph embedding, aside from positive triplets (i.e., facts in the knowledge graph), the negative triplets used for training also have a direct influence on the model performance. In reality, since knowledge graphs are sparse and incomplete, negative triplets often lack explicit labels, and thus they are often obtained from various sampling strategies (e.g., randomly replacing an entity in a positive triplet). An ideal sampled negative triplet should be informative enough to help the model train better. However, existing methods often ignore diversity and adaptiveness in their sampling process, which harms the informativeness of negative triplets. As such, we propose a generative adversarial approach called Diversified and Adaptive Negative Sampling (DANS) on knowledge graphs. DANS is equipped with a two-way generator that generates more diverse negative triplets through two pathways, and an adaptive mechanism that produces more fine-grained examples by localizing the global generator for different entities and relations. On the one hand, the two-way generator increases the overall informativeness with more diverse negative examples; on the other hand, the adaptive mechanism increases the individual sample-wise informativeness with more fine-grained sampling. Finally, we evaluate the performance of DANS on three benchmark knowledge graphs to demonstrate its effectiveness through quantitative and qualitative experiments.

Submitted 9 October, 2024; originally announced October 2024.

Comments: 30 pages, 7 figures, Journal

arXiv:2410.07589 [pdf, other]

No Free Lunch: Retrieval-Augmented Generation Undermines Fairness in LLMs, Even for Vigilant Users

Authors: Mengxuan Hu, Hongyi Wu, Zihan Guan, Ronghang Zhu, Dongliang Guo, Daiqing Qi, Sheng Li

Abstract: Retrieval-Augmented Generation (RAG) is widely adopted for its effectiveness and cost-efficiency in mitigating hallucinations and enhancing the domain-specific generation capabilities of large language models (LLMs). However, is this effectiveness and cost-efficiency truly a free lunch? In this study, we comprehensively investigate the fairness costs associated with RAG by proposing a practical three-level threat model from the perspective of user awareness of fairness. Specifically, varying levels of user fairness awareness result in different degrees of fairness censorship on the external dataset. We examine the fairness implications of RAG using uncensored, partially censored, and fully censored datasets. Our experiments demonstrate that fairness alignment can be easily undermined through RAG without the need for fine-tuning or retraining. Even with fully censored and supposedly unbiased external datasets, RAG can lead to biased outputs. Our findings underscore the limitations of current alignment methods in the context of RAG-based LLMs and highlight the urgent need for new strategies to ensure fairness. We propose potential mitigations and call for further research to develop robust fairness safeguards in RAG-based LLMs.

Submitted 9 October, 2024; originally announced October 2024.

arXiv:2410.06158 [pdf, other]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Authors: Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, Minzhao Zhu

Abstract: We present GR-2, a state-of-the-art generalist robot agent for versatile and generalizable robot manipulation. GR-2 is first pre-trained on a vast number of Internet videos to capture the dynamics of the world. This large-scale pre-training, involving 38 million video clips and over 50 billion tokens, equips GR-2 with the ability to generalize across a wide range of robotic tasks and environments during subsequent policy learning. Following this, GR-2 is fine-tuned for both video generation and action prediction using robot trajectories. It exhibits impressive multi-task learning capabilities, achieving an average success rate of 97.7% across more than 100 tasks. Moreover, GR-2 demonstrates exceptional generalization to new, previously unseen scenarios, including novel backgrounds, environments, objects, and tasks. Notably, GR-2 scales effectively with model size, underscoring its potential for continued growth and application. Project page: https://gr2-manipulation.github.io.

Submitted 8 October, 2024; originally announced October 2024.

Comments: Tech Report. Authors are listed in alphabetical order. Project page: https://gr2-manipulation.github.io

arXiv:2410.06115 [pdf, other]

A physics-based perspective for understanding and utilizing spatial resources of wireless channels

Authors: Hui Xu, Jun Wei Wu, Zhen Jie Qi, Hao Tian Wu, Rui Wen Shao, Qiang Cheng, Jieao Zhu, Linglong Dai, Tie Jun Cui

Abstract: To satisfy the increasing demands for transmission rates of wireless communications, it is necessary to use spatial resources of electromagnetic (EM) waves. In this context, EM information theory (EIT) has become a hot topic by integrating the theoretical framework of deterministic mathematics and stochastic statistics to explore the transmission mechanisms of continuous EM waves. However, the previous studies were primarily focused on frame analysis, with limited exploration of practical applications and a comprehensive understanding of its essential physical characteristics. In this paper, we present a three-dimensional (3-D) line-of-sight channel capacity formula that captures the vector EM physics and accommodates both near- and far-field scenes. Based on the rigorous mathematical equation and the physical mechanism of fast multipole expansion, a channel model is established, and the finite angular spectral bandwidth feature of scattered waves is revealed. To adapt to the feature of the channel, an optimization problem is formulated for determining the mode currents on the transmitter, aiming to obtain the optimal design of the precoder and combiner. We make comprehensive analyses to investigate the relationship among the spatial degree of freedom, noise, and transmitted power, thereby establishing a rigorous upper bound of channel capacity. A series of simulations are conducted to validate the theoretical model and numerical method. This work offers a novel perspective and methodology for understanding and leveraging EIT, and provides a theoretical foundation for the design and optimization of future wireless communications.

Submitted 8 October, 2024; originally announced October 2024.

Comments: 31 pages, 8 figures

arXiv:2410.05993 [pdf, other]

Aria: An Open Multimodal Native Mixture-of-Experts Model

Authors: Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Guoyin Wang, Bei Chen, Junnan Li

Abstract: Information comes in diverse modalities. Multimodal native AI models are essential to integrate real-world information and deliver comprehensive understanding. While proprietary multimodal native models exist, their lack of openness imposes obstacles for adoptions, let alone adaptations. To fill this gap, we introduce Aria, an open multimodal native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. Aria is a mixture-of-experts model with 3.9B and 3.5B activated parameters per visual token and text token, respectively. It outperforms Pixtral-12B and Llama3.2-11B, and is competitive against the best proprietary models on various multimodal tasks. We pre-train Aria from scratch following a 4-stage pipeline, which progressively equips the model with strong capabilities in language understanding, multimodal understanding, long context window, and instruction following. We open-source the model weights along with a codebase that facilitates easy adoptions and adaptations of Aria in real-world applications.

Submitted 10 October, 2024; v1 submitted 8 October, 2024; originally announced October 2024.

arXiv:2410.05863 [pdf, other]

Enhancing Playback Performance in Video Recommender Systems with an On-Device Gating and Ranking Framework

Authors: Yunfei Yang, Zhenghao Qi, Honghuan Wu, Qi Song, Tieyao Zhang, Hao Li, Yimin Tu, Kaiqiao Zhan, Ben Wang

Abstract: Video recommender systems (RSs) have gained increasing attention in recent years. Existing mainstream RSs focus on optimizing the matching function between users and items. However, we noticed that users frequently encounter playback issues such as slow loading or stuttering while browsing the videos, especially in weak network conditions, which will lead to a subpar browsing experience, and may cause users to leave, even when the video content and recommendations are superior. It is quite a serious issue, yet easily overlooked. To tackle this issue, we propose an on-device Gating and Ranking Framework (GRF) that cooperates with server-side RS. Specifically, we utilize a gate model to identify videos that may have playback issues in real-time, and then we employ a ranking model to select the optimal result from a locally-cached pool to replace the stuttering videos. Our solution has been fully deployed on Kwai, a large-scale short video platform with hundreds of millions of users globally. Moreover, it significantly enhances video playback performance and improves overall user experience and retention rates.

Submitted 8 October, 2024; originally announced October 2024.

Comments: CIKM 2024 applied research track, 7 pages

arXiv:2410.05474 [pdf, other]

R-Bench: Are your Large Multimodal Model Robust to Real-world Corruptions?

Authors: Chunyi Li, Jianbo Zhang, Zicheng Zhang, Haoning Wu, Yuan Tian, Wei Sun, Guo Lu, Xiaohong Liu, Xiongkuo Min, Weisi Lin, Guangtao Zhai

Abstract: The outstanding performance of Large Multimodal Models (LMMs) has made them widely applied in vision-related tasks. However, various corruptions in the real world mean that images will not be as ideal as in simulations, presenting significant challenges for the practical application of LMMs. To address this issue, we introduce R-Bench, a benchmark focused on the Real-world Robustness of LMMs. Specifically, we: (a) model the complete link from user capture to LMMs reception, comprising 33 corruption dimensions, including 7 steps according to the corruption sequence, and 7 groups based on low-level attributes; (b) collect reference/distorted image dataset before/after corruption, including 2,970 question-answer pairs with human labeling; (c) propose comprehensive evaluation for absolute/relative robustness and benchmark 20 mainstream LMMs. Results show that while LMMs can correctly handle the original reference images, their performance is not stable when faced with distorted images, and there is a significant gap in robustness compared to the human visual system. We hope that R-Bench will inspire improving the robustness of LMMs, extending them from experimental simulations to the real-world application. Check https://q-future.github.io/R-Bench for details.

Submitted 7 October, 2024; originally announced October 2024.

arXiv:2410.04636 [pdf, other]

Multi-Tiered Self-Contrastive Learning for Medical Microwave Radiometry (MWR) Breast Cancer Detection

Authors: Christoforos Galazis, Huiyi Wu, Igor Goryanin

Abstract: The pursuit of enhanced breast cancer detection and monitoring techniques is a paramount healthcare objective, driving the need for innovative imaging technologies and diagnostic approaches. This study introduces a novel multi-tiered self-contrastive model tailored for the application of microwave radiometry (MWR) breast cancer detection. Our approach encompasses three distinct models: Local-MWR (L-MWR), Regional-MWR (R-MWR), and Global-MWR (G-MWR), each engineered to analyze varying sub-regional comparisons within the breasts. These models are cohesively integrated through the Joint-MWR (J-MWR) network, which leverages the self-contrastive data generated at each analytical level to enhance detection capabilities. Employing a dataset comprising 4,932 cases of female patients, our research showcases the effectiveness of our proposed models. Notably, the J-MWR model distinguishes itself by achieving a Matthews correlation coefficient of 0.74 ± 0.018, surpassing existing MWR neural networks and contrastive methods. These results highlight the significant potential of self-contrastive learning techniques in improving both the diagnostic accuracy and generalizability of MWR-based breast cancer detection processes. Such advancements hold considerable promise for further investigative and clinical endeavors. The source code is available at: https://github.com/cgalaz01/self_contrastive_mwr

Submitted 6 October, 2024; originally announced October 2024.

arXiv:2410.03806 [pdf, other]

Metadata Matters for Time Series: Informative Forecasting with Transformers

Authors: Jiaxiang Dong, Haixu Wu, Yuxuan Wang, Li Zhang, Jianmin Wang, Mingsheng Long

Abstract: Time series forecasting is prevalent in extensive real-world applications, such as financial analysis and energy planning. Previous studies primarily focus on time series modality, endeavoring to capture the intricate variations and dependencies inherent in time series. Beyond numerical time series data, we notice that metadata (e.g., dataset and variate descriptions) also carries valuable information essential for forecasting, which can be used to identify the application scenario and provide more interpretable knowledge than digit sequences. Inspired by this observation, we propose a Metadata-informed Time Series Transformer (MetaTST), which incorporates multiple levels of context-specific metadata into Transformer forecasting models to enable informative time series forecasting. To tackle the unstructured nature of metadata, MetaTST formalizes them into natural languages by pre-designed templates and leverages large language models (LLMs) to encode these texts into metadata tokens as a supplement to classic series tokens, resulting in an informative embedding. Further, a Transformer encoder is employed to communicate series and metadata tokens, which can extend series representations by metadata information for more accurate forecasting. This design also allows the model to adaptively learn context-specific patterns across various scenarios, which is particularly effective in handling large-scale, diverse-scenario forecasting tasks. Experimentally, MetaTST achieves state-of-the-art compared to advanced time series models and LLM-based methods on widely acknowledged short- and long-term forecasting benchmarks, covering both single-dataset individual and multi-dataset joint training settings.

Submitted 4 October, 2024; originally announced October 2024.

arXiv:2410.03777 [pdf, other]

Determine-Then-Ensemble: Necessity of Top-k Union for Large Language Model Ensembling

Authors: Yuxuan Yao, Han Wu, Mingyang Liu, Sichun Luo, Xiongwei Han, Jie Liu, Zhijiang Guo, Linqi Song

Abstract: Large language models (LLMs) exhibit varying strengths and weaknesses across different tasks, prompting recent studies to explore the benefits of ensembling models to leverage their complementary advantages. However, existing LLM ensembling methods often overlook model compatibility and struggle with inefficient alignment of probabilities across the entire vocabulary. In this study, we empirically investigate the factors influencing ensemble performance, identifying model performance, vocabulary size, and response style as key determinants, revealing that compatibility among models is essential for effective ensembling. This analysis leads to the development of a simple yet effective model selection strategy that identifies compatible models. Additionally, we introduce the Union Top-k Ensembling (UniTE), a novel approach that efficiently combines models by focusing on the union of the top-k tokens from each model, thereby avoiding the need for full vocabulary alignment and reducing computational overhead. Extensive evaluations across multiple benchmarks demonstrate that UniTE significantly enhances performance compared to existing methods, offering a more efficient framework for LLM ensembling.

Submitted 3 October, 2024; originally announced October 2024.

arXiv:2410.02743 [pdf, other]

MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

Authors: Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu

Abstract: Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will make our code and data publicly available at https://github.com/ernie-research/MA-RLHF.

Submitted 3 October, 2024; originally announced October 2024.

arXiv:2410.01610 [pdf, other]

Upcycling Instruction Tuning from Dense to Mixture-of-Experts via Parameter Merging

Authors: Tingfeng Hui, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Hua Wu, Sen Su

Abstract: Mixture-of-Experts (MoE) shines brightly in large language models (LLMs) and demonstrates outstanding performance in plentiful natural language processing tasks. However, existing methods transforming LLMs from dense to MoE face significant data requirements and typically rely on large-scale post-training. In this paper, we propose Upcycling Instruction Tuning (UpIT), a data-efficient approach for tuning a dense pre-trained model into a MoE instruction model. Specifically, we first point out that intermediate checkpoints during instruction tuning of the dense model are naturally suitable for specialized experts, and then propose an expert expansion stage to flexibly achieve models with flexible numbers of experts, where genetic algorithm and parameter merging are introduced to ensure sufficient diversity of new extended experts. To ensure that each specialized expert in the MoE model works as expected, we select a small amount of seed data that each expert excels at to pre-optimize the router. Extensive experiments with various data scales and upcycling settings demonstrate the outstanding performance and data efficiency of UpIT, as well as stable improvement in expert or data scaling. Further analysis reveals the importance of ensuring expert diversity in upcycling.

Submitted 2 October, 2024; originally announced October 2024.

Comments: work in progress

arXiv:2410.01240 [pdf]

Automatic deductive coding in discourse analysis: an application of large language models in learning analytics

Authors: Lishan Zhang, Han Wu, Xiaoshan Huang, Tengfei Duan, Hanxiang Du

Abstract: Deductive coding is a common discourse analysis method widely used by learning science and learning analytics researchers for understanding teaching and learning interactions. It often requires researchers to manually label all discourses to be analyzed according to a theoretically guided coding scheme, which is time-consuming and labor-intensive. The emergence of large language models such as GPT has opened a new avenue for automatic deductive coding to overcome the limitations of traditional deductive coding. To evaluate the usefulness of large language models in automatic deductive coding, we employed three different classification methods driven by different artificial intelligence technologies, including the traditional text classification method with text feature engineering, a BERT-like pretrained language model, and a GPT-like pretrained large language model (LLM). We applied these methods to two different datasets and explored the potential of GPT and prompt engineering in automatic deductive coding. By analyzing and comparing the accuracy and Kappa values of these three classification methods, we found that GPT with prompt engineering outperformed the other two methods on both datasets with a limited number of training samples. By providing detailed prompt structures, the reported work demonstrated how large language models can be used in the implementation of automatic deductive coding.

Submitted 2 October, 2024; originally announced October 2024.

Comments: 20 pages

arXiv:2410.00428 [pdf, other]

LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management

Authors: Yi Xiong, Hao Wu, Changxu Shao, Ziqing Wang, Rui Zhang, Yuhong Guo, Junping Zhao, Ke Zhang, Zhenxuan Pan

Abstract: The expanding context windows in large language models (LLMs) have greatly enhanced their capabilities in various applications, but they also introduce significant challenges in maintaining low latency, particularly in Time to First Token (TTFT). This paper identifies that the sharp rise in TTFT as context length increases is predominantly driven by queuing delays, which are caused by the growing demands for GPU Key-Value (KV) cache allocation clashing with the limited availability of KV cache blocks. To address this issue, we propose LayerKV, a simple yet effective plug-in method that effectively reduces TTFT without requiring additional hardware or compromising output performance, while seamlessly integrating with existing parallelism strategies and scheduling techniques. Specifically, LayerKV introduces layer-wise KV block allocation, management, and offloading for fine-grained control over system memory, coupled with an SLO-aware scheduler to optimize overall Service Level Objectives (SLOs). Comprehensive evaluations on representative models, ranging from 7B to 70B parameters, across various GPU configurations, demonstrate that LayerKV improves TTFT latency up to 69x and reduces SLO violation rates by 28.7%, significantly enhancing the user experience.

Submitted 9 October, 2024; v1 submitted 1 October, 2024; originally announced October 2024.

Comments: 11 pages, 7 figures, 1 table

ACM Class: I.2.11; C.4

arXiv:2409.20310 [pdf, other]

A SSM is Polymerized from Multivariate Time Series

Authors: Haixiang Wu

Abstract: For multivariate time series (MTS) tasks, previous state space models (SSMs) followed the modeling paradigm of Transformer-based methods. However, none of them explicitly model the complex dependencies of MTS: the Channel Dependency variations with Time (CDT). In view of this, we delve into the derivation of SSM, which involves approximating continuously updated functions by orthogonal function basis. We then develop Poly-Mamba, a novel method for MTS forecasting. Its core concept is to expand the original orthogonal function basis space into a multivariate orthogonal function space containing variable mixing terms, and make a projection on this space so as to explicitly describe the CDT by weighted coefficients. In Poly-Mamba, we propose the Multivariate Orthogonal Polynomial Approximation (MOPA) as a simplified implementation of this concept. For the simple linear relationship between channels, we propose Linear Channel Mixing (LCM) and generate CDT patterns adaptively for different channels through a proposed Order Combining method. Experiments on six real-world datasets demonstrate that Poly-Mamba outperforms the SOTA methods, especially when dealing with datasets having a large number of channels and complex correlations. The codes and log files will be released at: https://github.com/Joeland4/Poly-Mamba.

Submitted 30 September, 2024; v1 submitted 30 September, 2024; originally announced September 2024.

arXiv:2409.20063 [pdf, other]

Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs

Authors: Zicheng Zhang, Ziheng Jia, Haoning Wu, Chunyi Li, Zijian Chen, Yingjie Zhou, Wei Sun, Xiaohong Liu, Xiongkuo Min, Weisi Lin, Guangtao Zhai

Abstract: With the rising interest in research on Large Multi-modal Models (LMMs) for video understanding, many studies have emphasized general video comprehension capabilities, neglecting the systematic exploration into video quality understanding. To address this oversight, we introduce Q-Bench-Video in this paper, a new benchmark specifically designed to evaluate LMMs' proficiency in discerning video quality. a) To ensure video source diversity, Q-Bench-Video encompasses videos from natural scenes, AI-generated Content (AIGC), and Computer Graphics (CG). b) Building on the traditional multiple-choice questions format with the Yes-or-No and What-How categories, we include Open-ended questions to better evaluate complex scenarios. Additionally, we incorporate the video pair quality comparison question to enhance comprehensiveness. c) Beyond the traditional Technical, Aesthetic, and Temporal distortions, we have expanded our evaluation aspects to include the dimension of AIGC distortions, which addresses the increasing demand for video generation. Finally, we collect a total of 2,378 question-answer pairs and test them on 12 open-source & 5 proprietary LMMs. Our findings indicate that while LMMs have a foundational understanding of video quality, their performance remains incomplete and imprecise, with a notable discrepancy compared to human performance. Through Q-Bench-Video, we seek to catalyze community interest, stimulate further research, and unlock the untapped potential of LMMs to close the gap in video quality understanding.

Submitted 30 September, 2024; originally announced September 2024.

arXiv:2409.19804 [pdf, other]

Does RAG Introduce Unfairness in LLMs? Evaluating Fairness in Retrieval-Augmented Generation Systems

Authors: Xuyang Wu, Shuowei Li, Hsin-Tai Wu, Zhiqiang Tao, Yi Fang

Abstract: Retrieval-Augmented Generation (RAG) methods have recently gained significant attention for their enhanced ability to integrate external knowledge sources in open-domain question answering (QA) tasks. However, it remains unclear how these models address fairness concerns, particularly with respect to sensitive attributes such as gender, geographic location, and other demographic factors. First, as language models evolve to prioritize utility, like improving exact match accuracy, fairness may have been largely overlooked. Second, RAG methods are complex pipelines, making it hard to identify and address biases, as each component is optimized for different goals. In this paper, we aim to empirically evaluate fairness in several RAG methods. We propose a fairness evaluation framework tailored to RAG methods, using scenario-based questions and analyzing disparities across demographic attributes. The experimental results indicate that, despite recent advances in utility-driven optimization, fairness issues persist in both the retrieval and generation stages, highlighting the need for more targeted fairness interventions within RAG pipelines. We will release our dataset and code upon acceptance of the paper.

Submitted 29 September, 2024; originally announced September 2024.

Comments: Under review

arXiv:2409.19691 [pdf, other]

CERD: A Comprehensive Chinese Rhetoric Dataset for Rhetorical Understanding and Generation in Essays

Authors: Nuowei Liu, Xinhao Chen, Hongyi Wu, Changzhi Sun, Man Lan, Yuanbin Wu, Xiaopeng Bai, Shaoguang Mao, Yan Xia

Abstract: Existing rhetorical understanding and generation datasets or corpora primarily focus on single coarse-grained categories or fine-grained categories, neglecting the common interrelations between different rhetorical devices by treating them as independent sub-tasks. In this paper, we propose the Chinese Essay Rhetoric Dataset (CERD), consisting of 4 commonly used coarse-grained categories including metaphor, personification, hyperbole and parallelism and 23 fine-grained categories across both form and content levels. CERD is a manually annotated and comprehensive Chinese rhetoric dataset with five interrelated sub-tasks. Unlike previous work, our dataset aids in understanding various rhetorical devices, recognizing corresponding rhetorical components, and generating rhetorical sentences under given conditions, thereby improving the author's writing proficiency and language usage skills. Extensive experiments are conducted to demonstrate the interrelations between multiple tasks in CERD, as well as to establish a benchmark for future research on rhetoric. The experimental results indicate that Large Language Models achieve the best performance across most tasks, and jointly fine-tuning with multiple tasks further enhances performance.

Submitted 29 September, 2024; originally announced September 2024.

arXiv:2409.19674 [pdf, other]

Alternating Maximization Algorithm for Mismatch Capacity with Oblivious Relaying

Authors: Xinwei Li, Lingyi Chen, Shitong Wu, Huihui Wu, Hao Wu, Wenyi Zhang

Abstract: Reliable communication over a discrete memoryless channel with the help of a relay has aroused interest due to its widespread applications in practical scenarios. By considering the system with a mismatched decoder, previous works have provided optimization models to evaluate the mismatch capacity in these scenarios. The proposed models, however, are difficult due to the complicated structure of the mismatched decoding problem with the information flows in hops given by the relay. Existing methods, such as the grid search, become impractical as they involve finding all roots of a nonlinear system, with the growing size of the alphabet. To address this problem, we reformulate the max-min optimization model as a consistent maximization form, by considering the dual form of the inner minimization problem and the Lagrangian with a fixed multiplier. Based on the proposed formulation, an alternating maximization framework is designed, which provides the closed-form solution with simple iterations in each step by introducing a suitable variable transformation. The effectiveness of the proposed approach is demonstrated by the simulations over practical scenarios, including Quaternary and Gaussian channels. Moreover, the simulation results of the transitional probability also shed light on the promising application attribute to the quantizer design in the relay node.

Submitted 15 October, 2024; v1 submitted 29 September, 2024; originally announced September 2024.

arXiv:2409.19608 [pdf, other]

Causal Deciphering and Inpainting in Spatio-Temporal Dynamics via Diffusion Model

Authors: Yifan Duan, Jian Zhao, pengcheng, Junyuan Mao, Hao Wu, Jingyu Xu, shilong wang, Caoyuan Ma, Kai Wang, Kun Wang, Xuelong Li

Abstract: Spatio-temporal (ST) prediction has garnered considerable attention in earth sciences, in areas such as meteorological prediction and human mobility perception. However, the scarcity of data coupled with the high expenses involved in sensor deployment results in notable data imbalances. Furthermore, models that are excessively customized and devoid of causal connections further undermine the generalizability and interpretability. To this end, we establish a causal framework for ST predictions, termed CaPaint, which aims to identify causal regions in data and endow the model with causal reasoning ability in a two-stage process. Going beyond this process, we utilize the back-door adjustment to specifically address the sub-regions identified as non-causal in the upstream phase. Specifically, we employ a novel image inpainting technique. By using a fine-tuned unconditional Denoising Diffusion Probabilistic Model (DDPM) as the generative prior, we in-fill the masks defined as environmental parts, offering the possibility of reliable extrapolation for potential data distributions. CaPaint overcomes the high complexity dilemma of optimal ST causal discovery models by reducing the data generation complexity from exponential to quasi-linear levels. Extensive experiments conducted on five real-world ST benchmarks demonstrate that integrating the CaPaint concept allows models to achieve improvements ranging from 4.3% to 77.3%. Moreover, compared to traditional mainstream ST augmenters, CaPaint underscores the potential of diffusion models in ST enhancement, offering a novel paradigm for this field. Our project is available at https://anonymous.4open.science/r/12345-DFCC.

Submitted 29 September, 2024; originally announced September 2024.

arXiv:2409.19592 [pdf, other]

DiffCP: Ultra-Low Bit Collaborative Perception via Diffusion Model

Authors: Ruiqing Mao, Haotian Wu, Yukuan Jia, Zhaojun Nan, Yuxuan Sun, Sheng Zhou, Deniz Gündüz, Zhisheng Niu

Abstract: Collaborative perception (CP) is emerging as a promising solution to the inherent limitations of stand-alone intelligence. However, current wireless communication systems are unable to support feature-level and raw-level collaborative algorithms due to their enormous bandwidth demands. In this paper, we propose DiffCP, a novel CP paradigm that utilizes a specialized diffusion model to efficiently compress the sensing information of collaborators. By incorporating both geometric and semantic conditions into the generative model, DiffCP enables feature-level collaboration with an ultra-low communication cost, advancing the practical implementation of CP systems. This paradigm can be seamlessly integrated into existing CP algorithms to enhance a wide range of downstream tasks. Through extensive experimentation, we investigate the trade-offs between communication, computation, and performance. Numerical results demonstrate that DiffCP can significantly reduce communication costs by 14.5-fold while maintaining the same performance as the state-of-the-art algorithm.

Submitted 29 September, 2024; originally announced September 2024.

Comments: 7 pages, 4 figures

arXiv:2409.17525 [pdf]

When A Man Says He Is Pregnant: ERP Evidence for A Rational Account of Speaker-contextualized Language Comprehension

Authors: Hanlin Wu, Zhenguang G. Cai

Abstract: Spoken language is often, if not always, understood in a context that includes the identities of speakers. For instance, we can easily make sense of an utterance such as "I'm going to have a manicure this weekend" or "The first time I got pregnant I had a hard time" when the utterance is spoken by a woman, but it would be harder to understand when it is spoken by a man. Previous event-related potential (ERP) studies have shown mixed results regarding the neurophysiological responses to such speaker-mismatched utterances, with some reporting an N400 effect and others a P600 effect. In an experiment involving 64 participants, we showed that these different ERP effects reflect distinct cognitive processes employed to resolve the speaker-message mismatch. When possible, the message is integrated with the speaker context to arrive at an interpretation, as in the case of violations of social stereotypes (e.g., men getting a manicure), resulting in an N400 effect. However, when such integration is impossible due to violations of biological knowledge (e.g., men getting pregnant), listeners engage in an error correction process to revise either the perceived utterance or the speaker context, resulting in a P600 effect. Additionally, we found that the social N400 effect decreased as a function of the listener's personality trait of openness, while the biological P600 effect remained robust. Our findings help to reconcile the empirical inconsistencies in the literature and provide a rational account of speaker-contextualized language comprehension.

Submitted 26 September, 2024; originally announced September 2024.

arXiv:2409.16904 [pdf, other]

Discriminative Anchor Learning for Efficient Multi-view Clustering

Authors: Yalan Qin, Nan Pu, Hanzhou Wu, Nicu Sebe

Abstract: Multi-view clustering aims to study the complementary information across views and discover the underlying structure. To address the relatively high computational cost of existing approaches, anchor-based methods have been presented recently. Even with acceptable clustering performance, these methods tend to map the original representation from multiple views into a fixed shared graph based on the original dataset. However, most studies ignore the discriminative property of the learned anchors, which undermines the representation capability of the built model. Moreover, the complementary information among anchors across views is neglected when the shared anchor graph is simply learned without considering the quality of view-specific anchors. In this paper, we propose discriminative anchor learning for multi-view clustering (DALMC) to handle the above issues. We learn discriminative view-specific feature representations according to the original dataset and build anchors from different views based on these representations, which increases the quality of the shared anchor graph. The discriminative feature learning and consensus anchor graph construction are integrated into a unified framework so that they refine each other. The optimal anchors from multiple views and the consensus anchor graph are learned with the orthogonal constraints. We give an iterative algorithm to deal with the formulated problem. Extensive experiments on different datasets show the effectiveness and efficiency of our method compared with other methods.

Submitted 25 September, 2024; originally announced September 2024.

Comments: This work has been accepted by TMM

arXiv:2409.16784 [pdf, other]

World Model-based Perception for Visual Legged Locomotion

Authors: Hang Lai, Jiahang Cao, Jiafeng Xu, Hongtao Wu, Yunfeng Lin, Tao Kong, Yong Yu, Weinan Zhang

Abstract: Legged locomotion over various terrains is challenging and requires precise perception of the robot and its surroundings from both proprioception and vision. However, learning directly from high-dimensional visual input is often data-inefficient and intricate. To address this issue, traditional methods attempt to learn a teacher policy with access to privileged information first and then learn a student policy to imitate the teacher's behavior with visual input. Despite some progress, this imitation framework prevents the student policy from achieving optimal performance due to the information gap between inputs. Furthermore, the learning process is unnatural since animals intuitively learn to traverse different terrains based on their understanding of the world without privileged knowledge. Inspired by this natural ability, we propose a simple yet effective method, World Model-based Perception (WMP), which builds a world model of the environment and learns a policy based on the world model. We illustrate that though completely trained in simulation, the world model can make accurate predictions of real-world trajectories, thus providing informative signals for the policy controller. Extensive simulated and real-world experiments demonstrate that WMP outperforms state-of-the-art baselines in traversability and robustness. Videos and Code are available at: https://wmp-loco.github.io/.

Submitted 25 September, 2024; originally announced September 2024.

Comments: under review

arXiv:2409.16295 [pdf, other]

Efficient Training of Self-Supervised Speech Foundation Models on a Compute Budget

Authors: Andy T. Liu, Yi-Cheng Lin, Haibin Wu, Stefan Winkler, Hung-yi Lee

Abstract: Despite their impressive success, training foundation models remains computationally costly. This paper investigates how to efficiently train speech foundation models with self-supervised learning (SSL) under a limited compute budget. We examine critical factors in SSL that impact the budget, including model architecture, model size, and data size. Our goal is to make analytical steps toward understanding the training dynamics of speech foundation models. We benchmark SSL objectives in an entirely comparable setting and find that other factors contribute more significantly to the success of SSL. Our results show that slimmer model architectures outperform common small architectures under the same compute and parameter budget. We demonstrate that the size of the pre-training data remains crucial, even with data augmentation during SSL training, as performance suffers when iterating over limited data. Finally, we identify a trade-off between model size and data size, highlighting an optimal model size for a given compute budget.

Submitted 9 September, 2024; originally announced September 2024.

Comments: To appear in SLT 2024

arXiv:2409.15897 [pdf, ps, other]

ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech

Authors: Jiatong Shi, Jinchuan Tian, Yihan Wu, Jee-weon Jung, Jia Qi Yip, Yoshiki Masuyama, William Chen, Yuning Wu, Yuxun Tang, Massa Baali, Dareen Alharhi, Dong Zhang, Ruifan Deng, Tejes Srivastava, Haibin Wu, Alexander H. Liu, Bhiksha Raj, Qin Jin, Ruihua Song, Shinji Watanabe

Abstract: Neural codecs have become crucial to recent speech and audio generation research. In addition to signal compression capabilities, discrete codecs have also been found to enhance downstream training efficiency and compatibility with autoregressive language models. However, as extensive downstream applications are investigated, challenges have arisen in ensuring fair comparisons across diverse applications. To address these issues, we present a new open-source platform ESPnet-Codec, which is built on ESPnet and focuses on neural codec training and evaluation. ESPnet-Codec offers various recipes in audio, music, and speech for training and evaluation using several widely adopted codec models. Together with ESPnet-Codec, we present VERSA, a standalone evaluation toolkit, which provides a comprehensive evaluation of codec performance over 20 audio evaluation metrics. Notably, we demonstrate that ESPnet-Codec can be integrated into six ESPnet tasks, supporting diverse applications.

Submitted 24 September, 2024; originally announced September 2024.

Comments: Accepted by SLT

arXiv:2409.15781 [pdf, other]

Training Data Attribution: Was Your Model Secretly Trained On Data Created By Mine?

Authors: Likun Zhang, Hao Wu, Lingcui Zhang, Fengyuan Xu, Jin Cao, Fenghua Li, Ben Niu

Abstract: The emergence of text-to-image models has recently sparked significant interest, but it is accompanied by a looming shadow of potential infringement through violation of user terms. Specifically, an adversary may exploit data created by a commercial model to train their own without proper authorization. To address such risk, it is crucial to investigate the attribution of a suspicious model's training data by determining whether its training data originates, wholly or partially, from a specific source model. To trace the generated data, existing methods require applying extra watermarks during either the training or inference phases of the source model. However, these methods are impractical for pre-trained models that have been released, especially when model owners lack security expertise. To tackle this challenge, we propose an injection-free training data attribution method for text-to-image models. It can identify whether a suspicious model's training data stems from a source model, without additional modifications on the source model. The crux of our method lies in the inherent memorization characteristic of text-to-image models. Our core insight is that the memorization of the training dataset is passed down through the data generated by the source model to the model trained on that data, making the source model and the infringing model exhibit consistent behaviors on specific samples. Therefore, our approach involves developing algorithms to uncover these distinct samples and using them as inherent watermarks to verify if a suspicious model originates from the source model. Our experiments demonstrate that our method achieves an accuracy of over 80% in identifying the source of a suspicious model's training data, without interfering with the original training or generation process of the source model.

Submitted 24 September, 2024; originally announced September 2024.

arXiv:2409.15259 [pdf, other]

S$^2$AG-Vid: Enhancing Multi-Motion Alignment in Video Diffusion Models via Spatial and Syntactic Attention-Based Guidance

Authors: Yuanhang Li, Qi Mao, Lan Chen, Zhen Fang, Lei Tian, Xinyan Xiao, Libiao Jin, Hua Wu

Abstract: Recent advancements in text-to-video (T2V) generation using diffusion models have garnered significant attention. However, existing T2V models primarily focus on simple scenes featuring a single object performing a single motion. Challenges arise in scenarios involving multiple objects with distinct motions, often leading to incorrect video-text alignment between subjects and their corresponding motions. To address this challenge, we propose S$^2$AG-Vid, a training-free inference-stage optimization method that improves the alignment of multiple objects with their corresponding motions in T2V models. S$^2$AG-Vid initially applies a spatial position-based, cross-attention (CA) constraint in the early stages of the denoising process, facilitating multiple nouns distinctly attending to the correct subject regions. To enhance the motion-subject binding, we implement a syntax-guided contrastive constraint in the subsequent denoising phase, aimed at improving the correlations between the CA maps of verbs and their corresponding nouns. Both qualitative and quantitative evaluations demonstrate that the proposed framework significantly outperforms baseline approaches, producing higher-quality videos with improved subject-motion consistency.

Submitted 23 September, 2024; originally announced September 2024.

arXiv:2409.14836 [pdf, other]

Orthogonal Finetuning for Direct Preference Optimization

Authors: Chenxu Yang, Ruipeng Jia, Naibin Gu, Zheng Lin, Siyuan Chen, Chao Pang, Weichong Yin, Yu Sun, Hua Wu, Weiping Wang

Abstract: DPO is an effective preference optimization algorithm. However, the DPO-tuned models tend to overfit on the dispreferred samples, manifested as overly long generations lacking diversity. While recent regularization approaches have endeavored to alleviate this issue by modifying the objective function, they achieved that at the cost of alignment performance degradation. In this paper, we innovatively incorporate regularization from the perspective of weight updating to curb alignment overfitting. Through the pilot experiment, we discovered that there exists a positive correlation between overfitting and the hyperspherical energy fluctuation. Hence, we introduce orthogonal finetuning for DPO via a weight-Rotated Preference Optimization (RoPO) method, which merely conducts rotational and magnitude-stretching updates on the weight parameters to maintain the hyperspherical energy invariant, thereby preserving the knowledge encoded in the angle between neurons. Extensive experiments demonstrate that our model aligns perfectly with human preferences while retaining the original expressive capacity using only 0.0086% of the trainable parameters, suggesting an effective regularization against overfitting. Specifically, RoPO outperforms DPO by up to 10 points on MT-Bench and by up to 2.8 points on AlpacaEval 2, while enhancing the generation diversity by an average of 6 points.

Submitted 23 September, 2024; v1 submitted 23 September, 2024; originally announced September 2024.

arXiv:2409.14264 [pdf, ps, other]

The Differential and Boomerang Properties of a Class of Binomials

Authors: Sihem Mesnager, Huawei Wu

Abstract: Let $q$ be an odd prime power with $q\equiv 3\ ({\rm{mod}}\ 4)$. In this paper, we study the differential and boomerang properties of the function $F_{2,u}(x)=x^2\big(1+u\eta(x)\big)$ over $\mathbb{F}_{q}$, where $u\in\mathbb{F}_{q}^*$ and $\eta$ is the quadratic character of $\mathbb{F}_{q}$. We determine the differential uniformity of $F_{2,u}$ for any $u\in\mathbb{F}_{q}^*$ and determine the differential spectra and boomerang uniformity of the locally-APN functions $F_{2,\pm 1}$, thereby disproving a conjecture proposed in \cite{budaghyan2024arithmetization} which states that there exist infinitely many $q$ and $u$ such that $F_{2,u}$ is an APN function.

Submitted 25 September, 2024; v1 submitted 21 September, 2024; originally announced September 2024.

arXiv:2409.14085 [pdf, other]

Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models

Authors: Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kaiwei Chang, Jiawei Du, Ke-Han Lu, Alexander H. Liu, Ho-Lam Chung, Yuan-Kuei Wu, Dongchao Yang, Songxiang Liu, Yi-Chiao Wu, Xu Tan, James Glass, Shinji Watanabe, Hung-yi Lee

Abstract: Neural audio codec models are becoming increasingly important as they serve as tokenizers for audio, enabling efficient transmission or facilitating speech language modeling. The ideal neural audio codec should maintain content, paralinguistics, speaker characteristics, and audio information even at low bitrates. Recently, numerous advanced neural codec models have been proposed. However, codec models are often tested under varying experimental conditions. As a result, we introduce the Codec-SUPERB challenge at SLT 2024, designed to facilitate fair and lightweight comparisons among existing codec models and inspire advancements in the field. This challenge brings together representative speech applications and objective metrics, and carefully selects license-free datasets, sampling them into small sets to reduce evaluation computation costs. This paper presents the challenge's rules, datasets, five participant systems, results, and findings.

Submitted 21 September, 2024; originally announced September 2024.

arXiv:2409.13621 [pdf, other]

Advancing Event Causality Identification via Heuristic Semantic Dependency Inquiry Network

Authors: Haoran Li, Qiang Gao, Hongmei Wu, Li Huang

Abstract: Event Causality Identification (ECI) focuses on extracting causal relations between events in texts. Existing methods for ECI primarily rely on causal features and external knowledge. However, these approaches fall short in two dimensions: (1) causal features between events in a text often lack explicit clues, and (2) external knowledge may introduce bias, while specific problems require tailored analyses. To address these issues, we propose SemDI - a simple and effective Semantic Dependency Inquiry Network for ECI. SemDI captures semantic dependencies within the context using a unified encoder. Then, it utilizes a Cloze Analyzer to generate a fill-in token based on comprehensive context understanding. Finally, this fill-in token is used to inquire about the causal relation between two events. Extensive experiments demonstrate the effectiveness of SemDI, surpassing state-of-the-art methods on three widely used benchmarks. Code is available at https://github.com/hrlics/SemDI.

Submitted 2 October, 2024; v1 submitted 20 September, 2024; originally announced September 2024.

Comments: EMNLP 2024 camera-ready version. Code is released at https://github.com/hrlics/SemDI

arXiv:2409.13321 [pdf, other]

SLaVA-CXR: Small Language and Vision Assistant for Chest X-ray Report Automation

Authors: Jinge Wu, Yunsoo Kim, Daqian Shi, David Cliffton, Fenglin Liu, Honghan Wu

Abstract: Inspired by the success of large language models (LLMs), there is growing research interest in developing LLMs in the medical domain to assist clinicians. However, for hospitals, using closed-source commercial LLMs involves privacy issues, and developing open-source public LLMs requires large-scale computational resources, which are usually limited, especially in resource-efficient regions and low-income countries. We propose an open-source Small Language and Vision Assistant (SLaVA-CXR) that can be used for Chest X-Ray report automation. To efficiently train a small assistant, we first propose the Re$^3$Training method, which simulates the cognitive development of radiologists and optimizes the model in the Recognition, Reasoning, and Reporting training manner. Then, we introduce a data synthesis method, RADEX, which can generate a high-quality and diverse training corpus with privacy regulation compliance. The extensive experiments show that our SLaVA-CXR built on a 2.7B backbone not only outperforms but also achieves 6 times faster inference efficiency than previous state-of-the-art larger models.

Submitted 20 September, 2024; originally announced September 2024.

Showing 1–50 of 1,519 results for author: Wu, H