Showing 1–50 of 334 results for author: Sun, G

  1. arXiv:2410.13184  [pdf, other]

    cs.CL

    Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers

    Authors: Shwai He, Tao Ge, Guoheng Sun, Bowei Tian, Xiaoyang Wang, Ang Li, Dong Yu

    Abstract: Traditional transformer models often allocate a fixed amount of computational resources to every input token, leading to inefficient and unnecessary computation. To address this, the Mixture of Depths (MoD) was introduced to dynamically adjust the computational depth by skipping less important layers. Despite its promise, current MoD approaches remain under-explored and face two main challenges: (…

    Submitted 16 October, 2024; originally announced October 2024.

  2. arXiv:2410.10303  [pdf, other]

    cs.CL

    A Comparative Study of Translation Bias and Accuracy in Multilingual Large Language Models for Cross-Language Claim Verification

    Authors: Aryan Singhal, Veronica Shao, Gary Sun, Ryan Ding, Jonathan Lu, Kevin Zhu

    Abstract: The rise of digital misinformation has heightened interest in using multilingual Large Language Models (LLMs) for fact-checking. This study systematically evaluates translation bias and the effectiveness of LLMs for cross-lingual claim verification across 15 languages from five language families: Romance, Slavic, Turkic, Indo-Aryan, and Kartvelian. Using the XFACT dataset to assess their impact on…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: Accepted to ATTRIB @ NeurIPS 2024

  3. arXiv:2410.10215  [pdf, other]

    cs.CL cs.LG

    SkillAggregation: Reference-free LLM-Dependent Aggregation

    Authors: Guangzhi Sun, Anmol Kagrecha, Potsawee Manakul, Phil Woodland, Mark Gales

    Abstract: Large Language Models (LLMs) are increasingly used to assess NLP tasks due to their ability to generate human-like judgments. Single LLMs were used initially; however, recent work suggests that using multiple LLMs as judges yields improved performance. An important step in exploiting multiple judgements is the combination stage, aggregation. Existing methods in NLP either assign equal weight to all LLM…

    Submitted 14 October, 2024; originally announced October 2024.

  4. arXiv:2410.08723  [pdf, other]

    cs.HC

    Investigating Human-Computer Interaction and Visual Comprehension in Text Generation Process of Natural Language Generation Models

    Authors: Yunchao Wang, Zihang Fu, Chaoqing Xu, Guodao Sun, Ronghua Liang

    Abstract: Natural language generation (NLG) models are becoming a highly sought-after research focus in the field of natural language processing (NLP), demonstrating strong capabilities in text generation tasks such as writing and dialogue generation. Despite the impressive performance of NLG models, their complex architecture and extensive model weights result in a lack of interpretability. This limitation…

    Submitted 11 October, 2024; originally announced October 2024.

  5. arXiv:2410.06713  [pdf, other]

    cs.DC

    SHRINK: Data Compression by Semantic Extraction and Residuals Encoding

    Authors: Guoyou Sun, Panagiotis Karras, Qi Zhang

    Abstract: The distributed data infrastructure in Internet of Things (IoT) ecosystems requires efficient data-series compression methods, along with the ability to feed different accuracy demands. However, the compression performance of existing compression methods degrades sharply when calling for ultra-accurate data recovery. In this paper, we introduce SHRINK, a novel highly accurate data compression meth…

    Submitted 9 October, 2024; originally announced October 2024.

    Comments: 11 pages

  6. arXiv:2410.06682  [pdf, other]

    cs.CV cs.CL eess.IV

    Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization

    Authors: Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang

    Abstract: Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimization (DPO). We propose new m…

    Submitted 10 October, 2024; v1 submitted 9 October, 2024; originally announced October 2024.

  7. arXiv:2410.06052  [pdf, other]

    cs.RO cs.MA

    Concurrent-Learning Based Relative Localization in Shape Formation of Robot Swarms

    Authors: Jinhu Lü, Kunrui Ze, Shuoyu Yue, Kexin Liu, Wei Wang, Guibin Sun

    Abstract: In this paper, we address the shape formation problem for massive robot swarms in environments where external localization systems are unavailable. Achieving this task effectively with solely onboard measurements is still scarcely explored and faces some practical challenges. To solve this challenging problem, we propose the following novel results. Firstly, to estimate the relative positions amon…

    Submitted 11 October, 2024; v1 submitted 8 October, 2024; originally announced October 2024.

  8. arXiv:2410.05357  [pdf, other]

    cs.LG cs.AI cs.CL

    Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild

    Authors: Xinyu Zhao, Guoheng Sun, Ruisi Cai, Yukun Zhou, Pingzhi Li, Peihao Wang, Bowen Tan, Yexiao He, Li Chen, Yi Liang, Beidi Chen, Binhang Yuan, Hongyi Wang, Ang Li, Zhangyang Wang, Tianlong Chen

    Abstract: As Large Language Models (LLMs) excel across tasks and specialized domains, scaling LLMs based on existing models has garnered significant attention, which faces the challenge of decreasing performance when combining disparate models. Various techniques have been proposed for the aggregation of pre-trained LLMs, including model merging, Mixture-of-Experts, and stacking. Despite their merits, a com…

    Submitted 7 October, 2024; originally announced October 2024.

    Comments: 24 pages, 4 figures, accepted to NeurIPS 2024 Datasets and Benchmarks Track

  9. arXiv:2409.18653  [pdf, other]

    cs.CV cs.AI

    When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation

    Authors: Yuli Zhou, Guolei Sun, Yawei Li, Luca Benini, Ender Konukoglu

    Abstract: This study investigates the application and performance of the Segment Anything Model 2 (SAM2) in the challenging task of video camouflaged object segmentation (VCOS). VCOS involves detecting objects that blend seamlessly in the surroundings for videos, due to similar colors and textures, poor light conditions, etc. Compared to the objects in normal scenes, camouflaged objects are much more diffic…

    Submitted 27 September, 2024; originally announced September 2024.

    Comments: Technical report

  10. arXiv:2409.17882  [pdf, other]

    cs.MA

    Multi-UAV Enabled MEC Networks: Optimizing Delay through Intelligent 3D Trajectory Planning and Resource Allocation

    Authors: Zhiying Wang, Tianxi Wei, Gang Sun, Xinyue Liu, Hongfang Yu, Dusit Niyato

    Abstract: Mobile Edge Computing (MEC) reduces the computational burden on terminal devices by shortening the distance between these devices and computing nodes. Integrating Unmanned Aerial Vehicles (UAVs) with enhanced MEC networks can leverage the high mobility of UAVs to flexibly adjust network topology, further expanding the applicability of MEC. However, in highly dynamic and complex real-world environm…

    Submitted 26 September, 2024; originally announced September 2024.

  11. arXiv:2409.16644  [pdf, other]

    eess.AS cs.CL cs.SD

    Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation

    Authors: Siyin Wang, Wenyi Yu, Yudong Yang, Changli Tang, Yixuan Li, Jimin Zhuang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Chao Zhang

    Abstract: Speech quality assessment typically requires evaluating audio from multiple aspects, such as mean opinion score (MOS) and speaker similarity (SIM) etc., which can be challenging to cover using one small model designed for a single task. In this paper, we propose leveraging recently introduced auditory large language models (LLMs) for automatic speech quality assessment. By employing task-specific…

    Submitted 25 September, 2024; originally announced September 2024.

    Comments: submitted to ICASSP 2025

  12. arXiv:2409.10999  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models

    Authors: Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul

    Abstract: Audio language models can understand audio inputs and perform a range of audio-related tasks based on instructions, such as speech recognition and audio captioning, where the instructions are usually textual prompts. Audio language models are mostly initialized from pre-trained audio encoders and large language models (LLMs). Although these pre-trained components were developed to support multiple…

    Submitted 17 September, 2024; originally announced September 2024.

    Comments: 5 pages. Preprint under review

  13. arXiv:2409.09642  [pdf, other]

    eess.AS cs.LG cs.SD

    Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement

    Authors: Yudong Yang, Zhan Liu, Wenyi Yu, Guangzhi Sun, Qiuqiang Kong, Chao Zhang

    Abstract: Diffusion-based generative models have recently achieved remarkable results in speech and vocal enhancement due to their ability to model complex speech data distributions. While these models generalize well to unseen acoustic environments, they may not achieve the same level of fidelity as the discriminative models specifically trained to enhance particular acoustic conditions. In this paper, we…

    Submitted 15 September, 2024; originally announced September 2024.

  14. arXiv:2409.05976  [pdf, other]

    cs.LG cs.DC

    FLoRA: Federated Fine-Tuning Large Language Models with Heterogeneous Low-Rank Adaptations

    Authors: Ziyao Wang, Zheyu Shen, Yexiao He, Guoheng Sun, Hongyi Wang, Lingjuan Lyu, Ang Li

    Abstract: The rapid development of Large Language Models (LLMs) has been pivotal in advancing AI, with pre-trained LLMs being adaptable to diverse downstream tasks through fine-tuning. Federated learning (FL) further enhances fine-tuning in a privacy-aware manner by utilizing clients' local data through in-situ computation, eliminating the need for data movement. However, fine-tuning LLMs, given their massi…

    Submitted 9 September, 2024; originally announced September 2024.

  15. arXiv:2409.00364  [pdf, other]

    cs.IT eess.SP

    Resource Management for IRS-Assisted Full-Duplex Integrated Sensing, Communication and Computing Systems

    Authors: Wanming Hao, Xue Wu, Xingwang Li, Gangcan Sun, Qingqing Wu, Liang Yang

    Abstract: In this paper, we investigate an intelligent reflecting surface (IRS) assisted full-duplex (FD) integrated sensing, communication and computing system. Specifically, an FD base station (BS) provides service for uplink and downlink transmission, and a local cache is connected to the BS through a backhaul link to store data. Meanwhile, active sensing elements are deployed on the IRS to receive targe…

    Submitted 31 August, 2024; originally announced September 2024.

  16. arXiv:2408.17042  [pdf, other]

    cs.DS

    E-Graphs as Circuits, and Optimal Extraction via Treewidth

    Authors: Glenn Sun, Yihong Zhang, Haobin Ni

    Abstract: We solve the optimal extraction problem for e-graphs by first showing a connection between e-graphs and cyclic monotone Boolean circuits, then solving the weighted satisfiability problem for such circuits. The solution is a parameterized algorithm based on treewidth. Additionally, we show how the circuit view of e-graphs allows us to apply simplification techniques that are not possible when opera…

    Submitted 30 August, 2024; originally announced August 2024.

  17. arXiv:2408.15585  [pdf, other]

    cs.SD eess.AS

    Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models

    Authors: Yiyang Zhao, Shuai Wang, Guangzhi Sun, Zehua Chen, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

    Abstract: In this paper, Whisper, a large-scale pre-trained model for automatic speech recognition, is proposed to apply to speaker verification. A partial multi-scale feature aggregation (PMFA) approach is proposed based on a subset of Whisper encoder blocks to derive highly discriminative speaker embeddings. Experimental results demonstrate that using the middle to later blocks of the Whisper encoder keeps…

    Submitted 28 August, 2024; originally announced August 2024.

    Comments: Accepted by Interspeech 2024

  18. arXiv:2408.11080  [pdf]

    cs.CR cs.SE

    ARAP: Demystifying Anti Runtime Analysis Code in Android Apps

    Authors: Dewen Suo, Lei Xue, Runze Tan, Weihao Huang, Guozi Sun

    Abstract: With the continuous growth in the usage of Android apps, ensuring their security has become critically important. An increasing number of malicious apps adopt anti-analysis techniques to evade security measures. Although some research has started to consider anti-runtime analysis (ARA), it is unfortunate that they have not systematically examined ARA techniques. Furthermore, the rapid evolution of…

    Submitted 19 August, 2024; originally announced August 2024.

  19. arXiv:2408.09790  [pdf, other]

    cs.LG

    Structure-enhanced Contrastive Learning for Graph Clustering

    Authors: Xunlian Wu, Jingqi Hu, Anqi Zhang, Yining Quan, Qiguang Miao, Peng Gang Sun

    Abstract: Graph clustering is a crucial task in network analysis with widespread applications, focusing on partitioning nodes into distinct groups with stronger intra-group connections than inter-group ones. Recently, contrastive learning has achieved significant progress in graph clustering. However, most methods suffer from the following issues: 1) an over-reliance on meticulously designed data augmentati…

    Submitted 19 August, 2024; originally announced August 2024.

  20. arXiv:2408.08862  [pdf, other]

    cs.LG

    Visual Agents as Fast and Slow Thinkers

    Authors: Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Ying Nian Wu, Yongfeng Zhang, Dongfang Liu

    Abstract: Achieving human-level intelligence requires refining cognitive distinctions between System 1 and System 2 thinking. While contemporary AI, driven by large language models, demonstrates human-like traits, it falls short of genuine cognition. Transitioning from structured benchmarks to real-world scenarios presents challenges for visual agents, often leading to inaccurate and overly confident respon…

    Submitted 6 September, 2024; v1 submitted 16 August, 2024; originally announced August 2024.

  21. arXiv:2408.08496  [pdf, other]

    cs.NI eess.SP

    Generative AI for Energy Harvesting Internet of Things Network: Fundamental, Applications, and Opportunities

    Authors: Wenwen Xie, Geng Sun, Jiahui Li, Jiacheng Wang, Hongyang Du, Dusit Niyato, Octavia A. Dobre

    Abstract: Internet of Things (IoT) devices are typically powered by small-sized batteries with limited energy storage capacity, requiring regular replacement or recharging. To reduce costs and maintain connectivity in IoT networks, energy harvesting technologies are regarded as a promising solution. Notably, due to its robust analytical and generative capabilities, generative artificial intelligence (GenAI)…

    Submitted 15 August, 2024; originally announced August 2024.

  22. arXiv:2408.05776  [pdf]

    cs.NI eess.SP

    Convergence of Symbiotic Communications and Blockchain for Sustainable and Trustworthy 6G Wireless Networks

    Authors: Haoxiang Luo, Gang Sun, Cheng Chi, Hongfang Yu, Mohsen Guizani

    Abstract: Symbiotic communication (SC) is known as a new wireless communication paradigm, similar to the natural ecosystem population, and can enable multiple communication systems to cooperate and mutualize through service exchange and resource sharing. As a result, SC is seen as an important potential technology for future sixth-generation (6G) communications, solving the problem of lack of spectrum resou…

    Submitted 11 August, 2024; originally announced August 2024.

  23. arXiv:2408.05614  [pdf, other]

    cs.AR cs.ET eess.SY

    ICGMM: CXL-enabled Memory Expansion with Intelligent Caching Using Gaussian Mixture Model

    Authors: Hanqiu Chen, Yitu Wang, Luis Vitorio Cargnini, Mohammadreza Soltaniyeh, Dongyang Li, Gongjin Sun, Pradeep Subedi, Andrew Chang, Yiran Chen, Cong Hao

    Abstract: Compute Express Link (CXL) emerges as a solution for the wide gap between computational speed and data communication rates among host and multiple devices. It fosters a unified and coherent memory space between host and CXL storage devices such as a solid-state drive (SSD) for memory expansion, with a corresponding DRAM implemented as the device cache. However, this introduces challenges such as…

    Submitted 10 August, 2024; originally announced August 2024.

    Comments: This paper is accepted by DAC2024

  24. arXiv:2408.05141  [pdf, other]

    cs.CL cs.IR

    A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning

    Authors: Ye Yuan, Chengwu Liu, Jingyang Yuan, Gongbo Sun, Siqi Li, Ming Zhang

    Abstract: Retrieval-augmented generation (RAG) is a framework enabling large language models (LLMs) to enhance their accuracy and reduce hallucinations by integrating external knowledge bases. In this paper, we introduce a hybrid RAG system enhanced through a comprehensive suite of optimizations that significantly improve retrieval quality, augment reasoning capabilities, and refine numerical computation ab…

    Submitted 2 September, 2024; v1 submitted 9 August, 2024; originally announced August 2024.

    Comments: Technical report for 3rd prize in Task 1 of Meta CRAG KDD Cup 2024

  25. arXiv:2408.03979  [pdf, ps, other]

    cs.SD eess.AS

    Speaker Adaptation for Quantised End-to-End ASR Models

    Authors: Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

    Abstract: End-to-end models have shown superior performance for automatic speech recognition (ASR). However, such models are often very large in size and thus challenging to deploy on resource-constrained edge devices. While quantisation can reduce model sizes, it can lead to increased word error rates (WERs). Although improved quantisation methods were proposed to address the issue of performance degradati…

    Submitted 7 August, 2024; originally announced August 2024.

    Comments: submitted to ASRU 2023 Workshop

  26. arXiv:2407.20840  [pdf, other]

    cs.NI

    Large Language Model (LLM)-enabled Graphs in Dynamic Networking

    Authors: Geng Sun, Yixian Wang, Dusit Niyato, Jiacheng Wang, Xinying Wang, H. Vincent Poor, Khaled B. Letaief

    Abstract: Recent advances in generative artificial intelligence (AI), and particularly the integration of large language models (LLMs), have had considerable impact on multiple domains. Meanwhile, enhancing dynamic network performance is a crucial element in promoting technological advancement and meeting the growing demands of users in many application areas involving networks. In this article, we explore…

    Submitted 30 July, 2024; originally announced July 2024.

    Comments: 10 pages, 6 figures, published to IEEE NETWORK

  27. arXiv:2407.16237  [pdf, other]

    cs.AR cs.AI cs.LG

    OriGen: Enhancing RTL Code Generation with Code-to-Code Augmentation and Self-Reflection

    Authors: Fan Cui, Chenyang Yin, Kexing Zhou, Youwei Xiao, Guangyu Sun, Qiang Xu, Qipeng Guo, Demin Song, Dahua Lin, Xingcheng Zhang, Yun Liang

    Abstract: Recent studies have demonstrated the significant potential of Large Language Models (LLMs) in generating Register Transfer Level (RTL) code, with notable advancements showcased by commercial models such as GPT-4 and Claude3-Opus. However, these proprietary LLMs often raise concerns regarding privacy and security. While open-source LLMs offer solutions to these concerns, they typically underperform…

    Submitted 2 September, 2024; v1 submitted 23 July, 2024; originally announced July 2024.

  28. arXiv:2407.11977  [pdf, other]

    cs.HC cs.AI cs.CY

    Building Better AI Agents: A Provocation on the Utilisation of Persona in LLM-based Conversational Agents

    Authors: Guangzhi Sun, Xiao Zhan, Jose Such

    Abstract: The incorporation of Large Language Models (LLMs) such as the GPT series into diverse sectors including healthcare, education, and finance marks a significant evolution in the field of artificial intelligence (AI). The increasing demand for personalised applications motivated the design of conversational agents (CAs) to possess distinct personas. This paper commences by examining the rationale and…

    Submitted 26 May, 2024; originally announced July 2024.

    Comments: Accepted by The international ACM Conversational User Interfaces (CUI) conference 2024

  29. arXiv:2407.11282  [pdf, other]

    cs.CL

    Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models

    Authors: Qingcheng Zeng, Mingyu Jin, Qinkai Yu, Zhenting Wang, Wenyue Hua, Zihao Zhou, Guangyan Sun, Yanda Meng, Shiqing Ma, Qifan Wang, Felix Juefei-Xu, Kaize Ding, Fan Yang, Ruixiang Tang, Yongfeng Zhang

    Abstract: Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial. One commonly used method to assess the reliability of LLMs' responses is uncertainty estimation, which gauges the likelihood of their answers being correct. While many studies focus on improving the accuracy of uncertainty estimations for LLMs, our research investigates…

    Submitted 19 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

  30. arXiv:2407.10987  [pdf, ps, other]

    cs.NI cs.AI eess.SP

    Adaptive Digital Twin and Communication-Efficient Federated Learning Network Slicing for 5G-enabled Internet of Things

    Authors: Daniel Ayepah-Mensah, Guolin Sun, Yu Pang, Wei Jiang

    Abstract: Network slicing enables industrial Internet of Things (IIoT) networks with multiservice and differentiated resource requirements to meet increasing demands through efficient use and management of network resources. Typically, the network slice orchestrator relies on demand forecasts for each slice to make informed decisions and maximize resource utilization. The new generation of Industry 4.0 has…

    Submitted 22 June, 2024; originally announced July 2024.

    Comments: 8 pages, 7 figures, conference

  31. arXiv:2407.09047  [pdf, other]

    cs.CV

    Cs2K: Class-specific and Class-shared Knowledge Guidance for Incremental Semantic Segmentation

    Authors: Wei Cong, Yang Cong, Yuyang Liu, Gan Sun

    Abstract: Incremental semantic segmentation endeavors to segment newly encountered classes while maintaining knowledge of old classes. However, existing methods either 1) lack guidance from class-specific knowledge (i.e., old class prototypes), leading to a bias towards new classes, or 2) constrain class-shared knowledge (i.e., old model weights) excessively without discrimination, resulting in a preference…

    Submitted 12 July, 2024; originally announced July 2024.

  32. arXiv:2407.08914  [pdf, other]

    cs.NI eess.SP

    Multi-objective Aerial Collaborative Secure Communication Optimization via Generative Diffusion Model-enabled Deep Reinforcement Learning

    Authors: Chuang Zhang, Geng Sun, Jiahui Li, Qingqing Wu, Jiacheng Wang, Dusit Niyato, Yuanwei Liu

    Abstract: Due to their flexibility and low cost, unmanned aerial vehicles (UAVs) are increasingly crucial for enhancing coverage and functionality of wireless networks. However, incorporating UAVs into next-generation wireless communication systems poses significant challenges, particularly in sustaining high-rate and long-range secure communications against eavesdropping attacks. In this work, we consider a UAV…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: This paper has been submitted to IEEE Transactions on Mobile Computing

  33. arXiv:2407.02079  [pdf, other]

    cs.AR

    Theseus: Exploring Efficient Wafer-Scale Chip Design for Large Language Models

    Authors: Jingchen Zhu, Chenhao Xue, Yiqi Chen, Zhao Wang, Guangyu Sun

    Abstract: The emergence of the large language model (LLM) poses an exponential growth of demand for computation throughput, memory capacity, and communication bandwidth. Such a demand growth has significantly surpassed the improvement of corresponding chip designs. With the advancement of fabrication and integration technologies, designers have been developing Wafer-Scale Chips (WSCs) to scale up and exploi…

    Submitted 5 October, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

  34. arXiv:2406.19973  [pdf, other]

    cs.CV cs.LG

    STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical

    Authors: Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, Zhiqiang Tao

    Abstract: Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on building high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data-starving i…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: 10 pages, 6 figures

  35. arXiv:2406.19706  [pdf, other]

    cs.SD eess.AS

    SAML: Speaker Adaptive Mixture of LoRA Experts for End-to-End ASR

    Authors: Qiuming Zhao, Guangzhi Sun, Chao Zhang, Mingxing Xu, Thomas Fang Zheng

    Abstract: Mixture-of-experts (MoE) models have achieved excellent results in many tasks. However, conventional MoE models are often very large, making them challenging to deploy on resource-constrained edge devices. In this paper, we propose a novel speaker adaptive mixture of LoRA experts (SAML) approach, which uses low-rank adaptation (LoRA) modules as experts to reduce the number of trainable parameters…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: 5 pages, accepted by Interspeech 2024. arXiv admin note: substantial text overlap with arXiv:2309.09136

  36. arXiv:2406.15786  [pdf, other]

    cs.LG cs.AI cs.CL

    What Matters in Transformers? Not All Attention is Needed

    Authors: Shwai He, Guoheng Sun, Zheyu Shen, Ang Li

    Abstract: While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different architectures in transformers, such as MLP and Attention layers, is under-explored.…

    Submitted 16 October, 2024; v1 submitted 22 June, 2024; originally announced June 2024.

    Comments: 15 pages, 13 figures, 6 tables

  37. arXiv:2406.15704  [pdf, other]

    cs.CV

    video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

    Authors: Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang

    Abstract: Speech understanding as an element of the more generic video understanding using audio-visual large language models (av-LLMs) is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing, which can understand not only visual frame sequences, audio events and music, but speech as well. To obtain fine-grained temporal information required b…

    Submitted 21 June, 2024; originally announced June 2024.

    Comments: Accepted at ICML 2024. arXiv admin note: substantial text overlap with arXiv:2310.05863

  38. arXiv:2406.11156  [pdf, other]

    cs.IR cs.AI

    DELRec: Distilling Sequential Pattern to Enhance LLM-based Recommendation

    Authors: Guohao Sun, Haoyi Zhang

    Abstract: Sequential recommendation (SR) tasks enhance recommendation accuracy by capturing the connection between users' past interactions and their changing preferences. Conventional models often focus solely on capturing sequential patterns within the training data, neglecting the broader context and semantic information embedded in item titles from external sources. This limits their predictive power an…

    Submitted 18 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

  39. arXiv:2406.08928  [pdf, other]

    cs.CV eess.IV

    Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer

    Authors: Guodong Sun, Junjie Liu, Mingxuan Liu, Moyun Liu, Yang Zhang

    Abstract: Self-supervised monocular depth estimation aims to infer depth information without relying on labeled data. However, the lack of labeled information poses a significant challenge to the model's representation, limiting its ability to capture the intricate details of the scene accurately. Prior information can potentially mitigate this issue, enhancing the model's understanding of scene structure a…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 28 pages, 12 figures

  40. arXiv:2406.07914  [pdf, other]

    cs.SD eess.AS

    Can Large Language Models Understand Spatial Audio?

    Authors: Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Jun Zhang, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang

    Abstract: This paper explores enabling large language models (LLMs) to understand spatial information from multichannel audio, a skill currently lacking in auditory LLMs. By leveraging LLMs' advanced cognitive and inferential abilities, the aim is to enhance understanding of 3D environments via audio. We study 3 spatial audio tasks: sound source localization (SSL), far-field speech recognition (FSR), and lo…

    Submitted 14 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  41. arXiv:2406.05700  [pdf, other]

    cs.CV eess.IV

    HDMba: Hyperspectral Remote Sensing Imagery Dehazing with State Space Model

    Authors: Hang Fu, Genyun Sun, Yinhe Li, Jinchang Ren, Aizhu Zhang, Cheng Jing, Pedram Ghamisi

    Abstract: Haze contamination in hyperspectral remote sensing images (HSI) can lead to spatial visibility degradation and spectral distortion. Haze in HSI exhibits spatial irregularity and inhomogeneous spectral distribution, with few dehazing networks available. Current CNN and Transformer-based dehazing methods fail to balance global scene recovery, local detail retention, and computational efficiency. Ins…

    Submitted 9 June, 2024; originally announced June 2024.

  42. arXiv:2406.03199  [pdf, other]

    cs.CL cs.AI cs.LG

    Bayesian WeakS-to-Strong from Text Classification to Generation

    Authors: Ziyun Cui, Ziyang Zhang, Wen Wu, Guangzhi Sun, Chao Zhang

    Abstract: Advances in large language models raise the question of how alignment techniques will adapt as models become increasingly complex and humans will only be able to supervise them weakly. Weak-to-Strong mimics such a scenario where weak model supervision attempts to harness the full capabilities of a much stronger model. This work extends Weak-to-Strong to WeakS-to-Strong by exploring an ensemble of…

    Submitted 2 October, 2024; v1 submitted 24 May, 2024; originally announced June 2024.

  43. arXiv:2406.01559  [pdf, other]

    cs.CV

    Prototypical Transformer as Unified Motion Learners

    Authors: Cheng Han, Yawen Lu, Guohao Sun, James C. Liang, Zhiwen Cao, Qifan Wang, Qiang Guan, Sohail A. Dianat, Raghuveer M. Rao, Tong Geng, Zhiqiang Tao, Dongfang Liu

    Abstract: In this work, we introduce the Prototypical Transformer (ProtoFormer), a general and unified framework that approaches various motion tasks from a prototype perspective. ProtoFormer seamlessly integrates prototype learning with Transformer by thoughtfully considering motion dynamics, introducing two innovative designs. First, Cross-Attention Prototyping discovers prototypes based on signature moti…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: 21 pages, 10 figures

  44. arXiv:2406.00522  [pdf, other]

    eess.AS cs.SD

    Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning

    Authors: Keqi Deng, Guangzhi Sun, Philip C. Woodland

    Abstract: Wav2Prompt is proposed which allows straightforward integration between spoken input and a text-based large language model (LLM). Wav2Prompt uses a simple training process with only the same data used to train an automatic speech recognition (ASR) model. After training, Wav2Prompt learns continuous representations from speech and uses them as LLM prompts. To avoid task over-fitting issues found in…

    Submitted 1 June, 2024; originally announced June 2024.

  45. arXiv:2405.20568  [pdf, other]

    cs.LG cs.NI

    Generative AI for Deep Reinforcement Learning: Framework, Analysis, and Use Cases

    Authors: Geng Sun, Wenwen Xie, Dusit Niyato, Fang Mei, Jiawen Kang, Hongyang Du, Shiwen Mao

    Abstract: As a form of artificial intelligence (AI) technology based on interactive learning, deep reinforcement learning (DRL) has been widely applied across various fields and has achieved remarkable accomplishments. However, DRL faces certain limitations, including low sample efficiency and poor generalization. Therefore, we present how to leverage generative AI (GAI) to address these issues above and en…

    Submitted 30 May, 2024; originally announced May 2024.

  46. arXiv:2405.18797  [pdf, other]

    cs.NI

    User Association and Channel Allocation in 5G Mobile Asymmetric Multi-band Heterogeneous Networks

    Authors: Miao Dai, Gang Sun, Hongfang Yu, Sheng Wang, Dusit Niyato

    Abstract: With the proliferation of mobile terminals and the continuous upgrading of services, 4G LTE networks are showing signs of weakness. To enhance the capacity of wireless networks, millimeter waves are introduced to drive the evolution of networks towards multi-band 5G heterogeneous networks. The distinct propagation characteristics of mmWaves and microwaves, as well as the vastly different hardware…

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: 17 pages, 5 figures

  47. arXiv:2405.17773  [pdf, other]

    cs.CV

    Towards a Generalist and Blind RGB-X Tracker

    Authors: Yuedong Tan, Zongwei Wu, Yuqian Fu, Zhuyun Zhou, Guolei Sun, Chao Ma, Danda Pani Paudel, Luc Van Gool, Radu Timofte

    Abstract: With the emergence of a single large model capable of successfully solving a multitude of tasks in NLP, there has been growing research interest in achieving similar goals in computer vision. On the one hand, most of these generic models, referred to as generalist vision models, aim at producing unified outputs serving different tasks. On the other hand, some existing models aim to combine differe…

    Submitted 27 May, 2024; originally announced May 2024.

  48. arXiv:2405.13684  [pdf, other]

    cs.CL

    CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models

    Authors: Guangzhi Sun, Potsawee Manakul, Adian Liusie, Kunat Pipatanakul, Chao Zhang, Phil Woodland, Mark Gales

    Abstract: Multimodal foundation models are prone to hallucination, generating outputs that either contradict the input or are not grounded by factual information. Given the diversity in architectures, training data and instruction tuning techniques, there can be large variations in systems' susceptibility to hallucinations. To assess system hallucination robustness, hallucination ranking approaches have bee…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: 21 pages. Preprint

  49. arXiv:2405.13349  [pdf, other]

    cs.DC

    Building a Verifiable Logical Clock for P2P Networks

    Authors: Guangda Sun, Tianyang Tao, Yanpei Guo, Michael Yiqing Hu, Jialin Li

    Abstract: Logical clocks are a fundamental tool to establish causal ordering of events in a distributed system. They have been applied in weakly consistent storage systems, causally ordered broadcast, distributed snapshots, deadlock detection, and distributed system debugging. However, prior logical clock constructs fail to work in an open network with Byzantine participants. In this work, we present Chrono…

    Submitted 13 August, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

  50. arXiv:2405.10489  [pdf, other]

    cs.CV

    MixCut: A Data Augmentation Method for Facial Expression Recognition

    Authors: Jiaxiang Yu, Yiyang Liu, Ruiyang Fan, Guobing Sun

    Abstract: In the facial expression recognition task, researchers often get low expression classification accuracy because of the small number of training samples. To solve this problem, we propose a new data augmentation method named MixCut. In this method, we first interpolate two original training samples at the pixel level in a random ratio to generate new samples. Then, pixel remov…

    Submitted 16 May, 2024; originally announced May 2024.