Showing 1–50 of 544 results for author: Zheng, S

  1. arXiv:2410.15283

    cs.LG eess.SY

    TRIZ Method for Urban Building Energy Optimization: GWO-SARIMA-LSTM Forecasting model

    Authors: Shirong Zheng, Shaobo Liu, Zhenhong Zhang, Dian Gu, Chunqiu Xia, Huadong Pang, Enock Mintah Ampaw

    Abstract: With the advancement of global climate change and sustainable development goals, urban building energy consumption optimization and carbon emission reduction have become the focus of research. Traditional energy consumption prediction methods often lack accuracy and adaptability due to their inability to fully consider complex energy consumption patterns, especially in dealing with seasonal fluctu…

    Submitted 20 October, 2024; originally announced October 2024.

    Comments: 29 pages

  2. arXiv:2410.15040

    cs.AI

    Retrieval Augmented Diffusion Model for Structure-informed Antibody Design and Optimization

    Authors: Zichen Wang, Yaokun Ji, Jianing Tian, Shuangjia Zheng

    Abstract: Antibodies are essential proteins responsible for immune responses in organisms, capable of specifically recognizing antigen molecules of pathogens. Recent advances in generative models have significantly enhanced rational antibody design. However, existing methods mainly create antibodies from scratch without template constraints, leading to model optimization challenges and unnatural sequences.…

    Submitted 19 October, 2024; originally announced October 2024.

  3. arXiv:2410.12957

    cs.SD cs.CV cs.MM eess.AS

    MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization

    Authors: Ruiqi Li, Siqi Zheng, Xize Cheng, Ziang Zhang, Shengpeng Ji, Zhou Zhao

    Abstract: Generating music that aligns with the visual content of a video has been a challenging task, as it requires a deep understanding of visual semantics and involves generating music whose melody, rhythm, and dynamics harmonize with the visual narratives. This paper presents MuVi, a novel framework that effectively addresses these challenges to enhance the cohesion and immersive experience of audio-vi…

    Submitted 16 October, 2024; originally announced October 2024.

    Comments: Working in progress

  4. arXiv:2410.06044

    cs.CV

    HyperDet: Generalizable Detection of Synthesized Images by Generating and Merging A Mixture of Hyper LoRAs

    Authors: Huangsen Cao, Yongwei Wang, Yinfeng Liu, Sixian Zheng, Kangtao Lv, Zhimeng Zhang, Bo Zhang, Xin Ding, Fei Wu

    Abstract: The emergence of diverse generative vision models has recently enabled the synthesis of visually realistic images, underscoring the critical need for effectively detecting these generated images from real photos. Despite advances in this field, existing detection approaches often struggle to accurately identify synthesized images generated by different generative models. In this work, we introduce…

    Submitted 8 October, 2024; originally announced October 2024.

  5. arXiv:2410.03311

    cs.CV cs.LG

    Quo Vadis, Motion Generation? From Large Language Models to Large Motion Models

    Authors: Ye Wang, Sipeng Zheng, Bin Cao, Qianshan Wei, Qin Jin, Zongqing Lu

    Abstract: Inspired by the recent success of LLMs, the field of human motion understanding has increasingly shifted towards the development of large motion models. Despite some progress, current state-of-the-art works remain far from achieving truly generalist models, largely due to the lack of large-scale, high-quality motion data. To address this, we present MotionBase, the first million-level motion gener…

    Submitted 4 October, 2024; originally announced October 2024.

  6. arXiv:2410.02155

    cs.AI cs.CL cs.CV

    From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

    Authors: Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, Zongqing Lu

    Abstract: Multimodal Large Language Models have made significant strides in integrating visual and textual information, yet they often struggle with effectively aligning these modalities. We introduce a novel image tokenizer that bridges this gap by applying the principle of Byte-Pair Encoding (BPE) to visual data. Unlike conventional approaches that rely on separate visual encoders, our method directly inc…

    Submitted 4 October, 2024; v1 submitted 2 October, 2024; originally announced October 2024.

  7. arXiv:2410.00773

    cs.AI cs.CL

    BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

    Authors: Xuwu Wang, Qiwen Cui, Yunzhe Tao, Yiran Wang, Ziwei Chai, Xiaotian Han, Boyi Liu, Jianbo Yuan, Jing Su, Guoyin Wang, Tingkai Liu, Liyu Chen, Tianyi Liu, Tao Sun, Yufeng Zhang, Sirui Zheng, Quanzeng You, Yang Yang, Hongxia Yang

    Abstract: Large language models (LLMs) have become increasingly pivotal across various domains, especially in handling complex data types. This includes structured data processing, as exemplified by ChartQA and ChatGPT-Ada, and multimodal unstructured data processing as seen in Visual Question Answering (VQA). These areas have attracted significant attention from both industry and academia. Despite this, th…

    Submitted 1 October, 2024; originally announced October 2024.

  8. arXiv:2410.00327

    cs.LG cs.AI cs.CE q-bio.QM

    EnzymeFlow: Generating Reaction-specific Enzyme Catalytic Pockets through Flow Matching and Co-Evolutionary Dynamics

    Authors: Chenqing Hua, Yong Liu, Dinghuai Zhang, Odin Zhang, Sitao Luan, Kevin K. Yang, Guy Wolf, Doina Precup, Shuangjia Zheng

    Abstract: Enzyme design is a critical area in biotechnology, with applications ranging from drug development to synthetic biology. Traditional methods for enzyme function prediction or protein binding pocket design often fall short in capturing the dynamic and complex nature of enzyme-substrate interactions, particularly in catalytic processes. To address the challenges, we introduce EnzymeFlow, a generativ…

    Submitted 30 September, 2024; originally announced October 2024.

  9. arXiv:2409.13292

    eess.AS cs.SD

    Exploring Text-Queried Sound Event Detection with Audio Source Separation

    Authors: Han Yin, Jisheng Bai, Yang Xiao, Hui Wang, Siqi Zheng, Yafeng Chen, Rohan Kumar Das, Chong Deng, Jianfeng Chen

    Abstract: In sound event detection (SED), overlapping sound events pose a significant challenge, as certain events can be easily masked by background noise or other events, resulting in poor detection performance. To address this issue, we propose the text-queried SED (TQ-SED) framework. Specifically, we first pre-train a language-queried audio source separation (LASS) model to separate the audio tracks cor…

    Submitted 20 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP2025

  10. arXiv:2409.09763

    cs.RO

    Range-SLAM: Ultra-Wideband-Based Smoke-Resistant Real-Time Localization and Mapping

    Authors: Yi Liu, Zhuozhu Jian, Shengtao Zheng, Houde Liu, Xueqian Wang, Xinlei Chen, Bin Liang

    Abstract: This paper presents Range-SLAM, a real-time, lightweight SLAM system designed to address the challenges of localization and mapping in environments with smoke and other harsh conditions using Ultra-Wideband (UWB) signals. While optical sensors like LiDAR and cameras struggle in low-visibility environments, UWB signals provide a robust alternative for real-time positioning. The proposed system uses…

    Submitted 15 September, 2024; originally announced September 2024.

  11. arXiv:2409.08782

    cs.CV

    Contactless Fingerprint Recognition Using 3D Graph Matching

    Authors: Zhe Cui, Yuwei Jia, Siyang Zheng, Fei Su

    Abstract: Contactless fingerprint is a newly developed type of fingerprint, and has gained lots of attention in recent fingerprint studies. However, most existing contactless fingerprint algorithms treat contactless fingerprints as 2D plain fingerprints, and utilize similar recognition methods as traditional contact-based 2D fingerprints. This recognition approach does not consider the modality difference b…

    Submitted 13 September, 2024; originally announced September 2024.

  12. arXiv:2409.07055

    cs.CL cs.AI cs.CY

    Legal Fact Prediction: Task Definition and Dataset Construction

    Authors: Junkai Liu, Yujie Tong, Hui Huang, Shuyuan Zheng, Muyun Yang, Peicheng Wu, Makoto Onizuka, Chuan Xiao

    Abstract: Legal facts refer to the facts that can be proven by acknowledged evidence in a trial. They form the basis for the determination of court judgments. This paper introduces a novel NLP task: legal fact prediction, which aims to predict the legal fact based on a list of evidence. The predicted facts can instruct the parties and their lawyers involved in a trial to strengthen their submissions and opt…

    Submitted 11 September, 2024; originally announced September 2024.

  13. arXiv:2409.03757

    cs.CV cs.AI cs.CL cs.LG cs.RO

    Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

    Authors: Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liang-Yan Gui, Yu-Xiong Wang

    Abstract: Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understandi…

    Submitted 5 September, 2024; originally announced September 2024.

    Comments: Project page: https://yunzeman.github.io/lexicon3d , Github: https://github.com/YunzeMan/Lexicon3D

  14. arXiv:2409.01787

    cs.CL

    LLM-GAN: Construct Generative Adversarial Network Through Large Language Models For Explainable Fake News Detection

    Authors: Yifeng Wang, Zhouhong Gu, Siwei Zhang, Suhang Zheng, Tao Wang, Tianyu Li, Hongwei Feng, Yanghua Xiao

    Abstract: Explainable fake news detection predicts the authenticity of news items with annotated explanations. Today, Large Language Models (LLMs) are known for their powerful natural language understanding and explanation generation abilities. However, presenting LLMs for explainable fake news detection remains two main challenges. Firstly, fake news appears reasonable and could easily mislead LLMs, leavin…

    Submitted 3 September, 2024; originally announced September 2024.

  15. arXiv:2408.16532

    eess.AS cs.LG cs.MM cs.SD eess.SP

    WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

    Authors: Shengpeng Ji, Ziyue Jiang, Xize Cheng, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Wen Wang, Zhou Zhao

    Abstract: Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domai…

    Submitted 29 August, 2024; originally announced August 2024.

    Comments: Working in progress. arXiv admin note: text overlap with arXiv:2402.12208

  16. arXiv:2408.16315

    cs.HC cs.LG eess.SP

    Passenger hazard perception based on EEG signals for highly automated driving vehicles

    Authors: Ashton Yu Xuan Tan, Yingkai Yang, Xiaofei Zhang, Bowen Li, Xiaorong Gao, Sifa Zheng, Jianqiang Wang, Xinyu Gu, Jun Li, Yang Zhao, Yuxin Zhang, Tania Stathaki

    Abstract: Enhancing the safety of autonomous vehicles is crucial, especially given recent accidents involving automated systems. As passengers in these vehicles, humans' sensory perception and decision-making can be integrated with autonomous systems to improve safety. This study explores neural mechanisms in passenger-vehicle interactions, leading to the development of a Passenger Cognitive Model (PCM) and…

    Submitted 29 August, 2024; originally announced August 2024.

  17. arXiv:2408.13659

    cs.LG cs.AI cs.CE q-bio.QM

    ReactZyme: A Benchmark for Enzyme-Reaction Prediction

    Authors: Chenqing Hua, Bozitao Zhong, Sitao Luan, Liang Hong, Guy Wolf, Doina Precup, Shuangjia Zheng

    Abstract: Enzymes, with their specific catalyzed reactions, are necessary for all aspects of life, enabling diverse biological processes and adaptations. Predicting enzyme functions is essential for understanding biological pathways, guiding drug development, enhancing bioproduct yields, and facilitating evolutionary studies. Addressing the inherent complexities, we introduce a new approach to annotating en…

    Submitted 30 September, 2024; v1 submitted 24 August, 2024; originally announced August 2024.

    Journal ref: 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks

  18. arXiv:2408.12400

    cs.CV cs.AI

    Multi-Style Facial Sketch Synthesis through Masked Generative Modeling

    Authors: Bowen Sun, Guo Lu, Shibao Zheng

    Abstract: The facial sketch synthesis (FSS) model, capable of generating sketch portraits from given facial photographs, holds profound implications across multiple domains, encompassing cross-modal face recognition, entertainment, art, media, among others. However, the production of high-quality sketches remains a formidable task, primarily due to the challenges and flaws associated with three key factors:…

    Submitted 22 August, 2024; originally announced August 2024.

  19. arXiv:2408.12102

    cs.LG cs.CV cs.SD eess.AS

    Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization

    Authors: Luyao Cheng, Hui Wang, Siqi Zheng, Yafeng Chen, Rongjie Huang, Qinglin Zhang, Qian Chen, Xihao Li

    Abstract: Speaker diarization, the process of segmenting an audio stream or transcribed speech content into homogenous partitions based on speaker identity, plays a crucial role in the interpretation and analysis of human speech. Most existing speaker diarization systems rely exclusively on unimodal acoustic information, making the task particularly challenging due to the innate ambiguities of audio signals…

    Submitted 21 August, 2024; originally announced August 2024.

  20. arXiv:2408.09933

    cs.SD cs.AI eess.AS

    SZU-AFS Antispoofing System for the ASVspoof 5 Challenge

    Authors: Yuxiong Xu, Jiafeng Zhong, Sengui Zheng, Zefeng Liu, Bin Li

    Abstract: This paper presents the SZU-AFS anti-spoofing system, designed for Track 1 of the ASVspoof 5 Challenge under open conditions. The system is built with four stages: selecting a baseline model, exploring effective data augmentation (DA) methods for fine-tuning, applying a co-enhancement strategy based on gradient norm aware minimization (GAM) for secondary fine-tuning, and fusing logits scores from…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: 8 pages, 2 figures, ASVspoof 5 Workshop (Interspeech2024 Satellite)

  21. arXiv:2408.05341

    cs.CV cs.AI

    CAR: Contrast-Agnostic Deformable Medical Image Registration with Contrast-Invariant Latent Regularization

    Authors: Yinsong Wang, Siyi Du, Shaoming Zheng, Xinzhe Luo, Chen Qin

    Abstract: Multi-contrast image registration is a challenging task due to the complex intensity relationships between different imaging contrasts. Conventional image registration methods are typically based on iterative optimizations for each input image pair, which is time-consuming and sensitive to contrast variations. While learning-based approaches are much faster during the inference stage, due to gener…

    Submitted 3 August, 2024; originally announced August 2024.

    Comments: 12 pages, 3 figures, 3 tables, accepted by WBIR 2024

  22. arXiv:2408.03194

    eess.IV cs.CV

    SGSR: Structure-Guided Multi-Contrast MRI Super-Resolution via Spatio-Frequency Co-Query Attention

    Authors: Shaoming Zheng, Yinsong Wang, Siyi Du, Chen Qin

    Abstract: Magnetic Resonance Imaging (MRI) is a leading diagnostic modality for a wide range of exams, where multiple contrast images are often acquired for characterizing different tissues. However, acquiring high-resolution MRI typically extends scan time, which can introduce motion artifacts. Super-resolution of MRI therefore emerges as a promising approach to mitigate these challenges. Earlier studies h…

    Submitted 6 August, 2024; originally announced August 2024.

    Comments: The 15th International Workshop on Machine Learning in Medical Imaging (MLMI 2024)

  23. arXiv:2408.03166

    cs.IR

    CADRL: Category-aware Dual-agent Reinforcement Learning for Explainable Recommendations over Knowledge Graphs

    Authors: Shangfei Zheng, Hongzhi Yin, Tong Chen, Xiangjie Kong, Jian Hou, Pengpeng Zhao

    Abstract: Knowledge graphs (KGs) have been widely adopted to mitigate data sparsity and address cold-start issues in recommender systems. While existing KGs-based recommendation methods can predict user preferences and demands, they fall short in generating explicit recommendation paths and lack explainability. As a step beyond the above methods, recent advancements utilize reinforcement learning (RL) to fi…

    Submitted 6 August, 2024; originally announced August 2024.

  24. arXiv:2407.19512

    cs.CV

    Large-scale cervical precancerous screening via AI-assisted cytology whole slide image analysis

    Authors: Honglin Li, Yusuan Sun, Chenglu Zhu, Yunlong Zhang, Shichuan Zhang, Zhongyi Shui, Pingyi Chen, Jingxiong Li, Sunyi Zheng, Can Cui, Lin Yang

    Abstract: Cervical Cancer continues to be the leading gynecological malignancy, posing a persistent threat to women's health on a global scale. Early screening via cytology Whole Slide Image (WSI) diagnosis is critical to prevent this Cancer progression and improve survival rate, but pathologist's single test suffers inevitable false negative due to the immense number of cells that need to be reviewed withi…

    Submitted 28 July, 2024; originally announced July 2024.

  25. arXiv:2407.10782

    cs.RO

    LVCP: LiDAR-Vision Tightly Coupled Collaborative Real-time Relative Positioning

    Authors: Zhuozhu Jian, Qixuan Li, Shengtao Zheng, Xueqian Wang, Xinlei Chen

    Abstract: In air-ground collaboration scenarios without GPS and prior maps, the relative positioning of drones and unmanned ground vehicles (UGVs) has always been a challenge. For a drone equipped with monocular camera and an UGV equipped with LiDAR as an external sensor, we propose a robust and real-time relative pose estimation method (LVCP) based on the tight coupling of vision and LiDAR point cloud info…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: See more details in https://sites.google.com/view/lvcp

  26. arXiv:2407.09774

    cs.CV cs.AI cs.MM

    ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context

    Authors: Sixiao Zheng, Yanwei Fu

    Abstract: Visual storytelling involves generating a sequence of coherent frames from a textual storyline while maintaining consistency in characters and scenes. Existing autoregressive methods, which rely on previous frame-sentence pairs, struggle with high memory usage, slow generation speeds, and limited context integration. To address these issues, we propose ContextualStory, a novel framework designed t…

    Submitted 21 August, 2024; v1 submitted 13 July, 2024; originally announced July 2024.

  27. arXiv:2407.08234

    cs.RO eess.SY

    Model Predictive Control For Mobile Manipulators Based On Neural Dynamics (Extended version)

    Authors: Tao Su, Shiqi Zheng

    Abstract: This article focuses on the trajectory tracking problem of mobile manipulators (MMs). Firstly, we construct a position and orientation model predictive tracking control (POMPTC) scheme for mobile manipulators. The proposed POMPTC scheme can simultaneously minimize the tracking error, joint velocity, and joint acceleration. Moreover, it can achieve synchronous control for the position and orientati…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: This article consists of 13 pages, including the text and the proof process

  28. arXiv:2407.08223

    cs.CL cs.AI

    Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

    Authors: Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, Chen-Yu Lee, Tomas Pfister

    Abstract: Retrieval augmented generation (RAG) combines the generative abilities of large language models (LLMs) with external knowledge sources to provide more accurate and up-to-date responses. Recent RAG advancements focus on improving retrieval outcomes through iterative LLM refinement or self-critique capabilities acquired through additional instruction tuning of LLMs. In this work, we introduce Specul…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: Preprint

  29. arXiv:2407.07582

    cs.CV

    TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data

    Authors: Siyi Du, Shaoming Zheng, Yinsong Wang, Wenjia Bai, Declan P. O'Regan, Chen Qin

    Abstract: Images and structured tables are essential parts of real-world databases. Though tabular-image representation learning is promising to create new insights, it remains a challenging task, as tabular data is typically heterogeneous and incomplete, presenting significant modality disparities with images. Earlier works have mainly focused on simple modality fusion strategies in complete data scenarios…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: 28 pages (including 9 pages of supplementary materials), accepted by ECCV 2024

  30. arXiv:2407.05603

    cs.CV cs.AI

    WSI-VQA: Interpreting Whole Slide Images by Generative Visual Question Answering

    Authors: Pingyi Chen, Chenglu Zhu, Sunyi Zheng, Honglin Li, Lin Yang

    Abstract: Whole slide imaging is routinely adopted for carcinoma diagnosis and prognosis. Abundant experience is required for pathologists to achieve accurate and reliable diagnostic results of whole slide images (WSI). The huge size and heterogeneous features of WSIs make the workflow of pathological reading extremely time-consuming. In this paper, we propose a novel framework (WSI-VQA) to interpret WSIs b…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: Accepted at ECCV 2024

  31. arXiv:2407.05407

    cs.SD cs.AI eess.AS

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Authors: Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, Zhijie Yan

    Abstract: Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role…

    Submitted 9 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

    Comments: work in progress. arXiv admin note: substantial text overlap with arXiv:2407.04051

  32. arXiv:2407.05364

    cs.LG

    PTaRL: Prototype-based Tabular Representation Learning via Space Calibration

    Authors: Hangting Ye, Wei Fan, Xiaozhuang Song, Shun Zheng, He Zhao, Dandan Guo, Yi Chang

    Abstract: Tabular data have been playing a mostly important role in diverse real-world fields, such as healthcare, engineering, finance, etc. With the recent success of deep learning, many tabular machine learning (ML) methods based on deep networks (e.g., Transformer, ResNet) have achieved competitive performance on tabular benchmarks. However, existing deep tabular ML methods suffer from the representatio…

    Submitted 15 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

    Comments: Accepted by ICLR 2024

  33. arXiv:2407.04379

    cs.SD cs.HC eess.AS

    A Mapping Strategy for Interacting with Latent Audio Synthesis Using Artistic Materials

    Authors: Shuoyang Zheng, Anna Xambó Sedó, Nick Bryan-Kinns

    Abstract: This paper presents a mapping strategy for interacting with the latent spaces of generative AI models. Our approach involves using unsupervised feature learning to encode a human control space and mapping it to an audio synthesis model's latent space. To demonstrate how this mapping strategy can turn high-dimensional sensor data into control mechanisms of a deep generative model, we present a proo…

    Submitted 5 July, 2024; originally announced July 2024.

    Report number: XAIxArts/2024/10

  34. arXiv:2407.04051

    cs.SD cs.AI eess.AS

    FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

    Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang, et al. (8 additional authors not shown)

    Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp…

    Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: Work in progress. Authors are listed in alphabetical order by family name

  35. arXiv:2407.02833

    cs.IR cs.CL cs.LG

    LANE: Logic Alignment of Non-tuning Large Language Models and Online Recommendation Systems for Explainable Reason Generation

    Authors: Hongke Zhao, Songming Zheng, Likang Wu, Bowen Yu, Jing Wang

    Abstract: The explainability of recommendation systems is crucial for enhancing user trust and satisfaction. Leveraging large language models (LLMs) offers new opportunities for comprehensive recommendation logic generation. However, in existing related studies, fine-tuning LLM models for recommendation tasks incurs high computational costs and alignment issues with existing systems, limiting the applicatio…

    Submitted 3 July, 2024; originally announced July 2024.

  36. arXiv:2407.02049

    eess.AS cs.CL cs.SD

    Accompanied Singing Voice Synthesis with Fully Text-controlled Melody

    Authors: Ruiqi Li, Zhiqing Hong, Yongqi Wang, Lichao Zhang, Rongjie Huang, Siqi Zheng, Zhou Zhao

    Abstract: Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. Current TTSong methods, inherited from singing voice synthesis (SVS), require melody-related information that can sometimes be impractical, such as music scores or MIDI sequences. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies, achie…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: Working in progress

  37. arXiv:2406.16853

    cs.LG cond-mat.mtrl-sci cs.AI q-bio.BM

    GeoMFormer: A General Architecture for Geometric Molecular Representation Learning

    Authors: Tianlang Chen, Shengjie Luo, Di He, Shuxin Zheng, Tie-Yan Liu, Liwei Wang

    Abstract: Molecular modeling, a central topic in quantum mechanics, aims to accurately calculate the properties and simulate the behaviors of molecular systems. The molecular model is governed by physical laws, which impose geometric constraints such as invariance and equivariance to coordinate rotation and translation. While numerous deep learning approaches have been developed to learn molecular represent…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: 25 pages, 13 tables, 1 figure; ICML 2024 camera ready version

  38. arXiv:2406.16578

    cs.RO cs.AI

    QuadrupedGPT: Towards a Versatile Quadruped Agent in Open-ended Worlds

    Authors: Ye Wang, Yuting Mei, Sipeng Zheng, Qin Jin

    Abstract: While pets offer companionship, their limited intelligence restricts advanced reasoning and autonomous interaction with humans. Considering this, we propose QuadrupedGPT, a versatile agent designed to master a broad range of complex tasks with agility comparable to that of a pet. To achieve this goal, the primary challenges include: i) effectively leveraging multimodal observations for decision-ma…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Under review

  39. arXiv:2406.14485   

    cs.AI cs.HC cs.MM cs.SD eess.AS

    Proceedings of The second international workshop on eXplainable AI for the Arts (XAIxArts)

    Authors: Nick Bryan-Kinns, Corey Ford, Shuoyang Zheng, Helen Kennedy, Alan Chamberlain, Makayla Lewis, Drew Hemment, Zijin Li, Qiong Wu, Lanxi Xiao, Gus Xia, Jeba Rezwana, Michael Clemens, Gabriel Vigliensoni

    Abstract: This second international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. Workshop held at the 16th ACM Conference on Creativity and Cognition (C&C 2024), Chicago, USA.

    Submitted 21 October, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: Proceedings of The second international workshop on eXplainable AI for the Arts (XAIxArts)

    Report number: XAIxArts/2024/0

  40. arXiv:2406.12738

    cs.CL cs.AI

    Large Language Model as a Universal Clinical Multi-task Decoder

    Authors: Yujiang Wu, Hongjian Song, Jiawen Zhang, Xumeng Wen, Shun Zheng, Jiang Bian

    Abstract: The development of effective machine learning methodologies for enhancing the efficiency and accuracy of clinical systems is crucial. Despite significant research efforts, managing a plethora of diversified clinical tasks and adapting to emerging new tasks remain significant challenges. This paper presents a novel paradigm that employs a pre-trained large language model as a universal clinical mul…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Work in progress

  41. FlexCare: Leveraging Cross-Task Synergy for Flexible Multimodal Healthcare Prediction

    Authors: Muhao Xu, Zhenfeng Zhu, Youru Li, Shuai Zheng, Yawei Zhao, Kunlun He, Yao Zhao

    Abstract: Multimodal electronic health record (EHR) data can offer a holistic assessment of a patient's health status, supporting various predictive healthcare tasks. Recently, several studies have embraced the multitask learning approach in the healthcare domain, exploiting the inherent correlations among clinical tasks to predict multiple outcomes simultaneously. However, existing methods necessitate samp…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Accepted by KDD 2024 (Research Track)

  42. arXiv:2406.11274  [pdf, other]

    cs.CL

    Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers

    Authors: Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Hai Yu, Jiaqing Liu, Yukun Ma, Chong Zhang

    Abstract: The Transformer architecture has significantly advanced deep learning, particularly in natural language processing, by effectively managing long-range dependencies. However, as the demand for understanding complex relationships grows, refining the Transformer's architecture becomes critical. This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models by enabling direct attention…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: 7 pages, 1 figure

  43. arXiv:2406.11169   

    eess.AS cs.SD

    Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision

    Authors: Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, Qian Chen, Shiliang Zhang, Wen Wang

    Abstract: Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persisting challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an…

    Submitted 25 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

    Comments: We update this paper to an earlier paper

  44. arXiv:2406.10985  [pdf, other]

    cs.CL

    Taking a Deep Breath: Enhancing Language Modeling of Large Language Models with Sentinel Tokens

    Authors: Weiyao Luo, Suncong Zheng, Heming Xia, Weikang Wang, Yan Lei, Tianyu Liu, Shuang Chen, Zhifang Sui

    Abstract: Large language models (LLMs) have shown promising efficacy across various tasks, becoming powerful tools in numerous aspects of human life. However, Transformer-based LLMs suffer a performance degradation when modeling long-term contexts due to they discard some information to reduce computational overhead. In this work, we propose a simple yet effective method to enable LLMs to take a deep breath…

    Submitted 16 June, 2024; originally announced June 2024.

  45. arXiv:2406.10948  [pdf]

    cs.LG cs.AI

    Incorporating uncertainty quantification into travel mode choice modeling: a Bayesian neural network (BNN) approach and an uncertainty-guided active survey framework

    Authors: Shuwen Zheng, Zhou Fang, Liang Zhao

    Abstract: Existing deep learning approaches for travel mode choice modeling fail to inform modelers about their prediction uncertainty. Even when facing scenarios that are out of the distribution of training data, which implies high prediction uncertainty, these approaches still provide deterministic answers, potentially leading to misguidance. To address this limitation, this study introduces the concept o…

    Submitted 16 June, 2024; originally announced June 2024.

  46. arXiv:2406.10724  [pdf, other]

    eess.IV cs.CV cs.LG

    Beyond the Visible: Jointly Attending to Spectral and Spatial Dimensions with HSI-Diffusion for the FINCH Spacecraft

    Authors: Ian Vyse, Rishit Dagli, Dav Vrat Chadha, John P. Ma, Hector Chen, Isha Ruparelia, Prithvi Seran, Matthew Xie, Eesa Aamer, Aidan Armstrong, Naveen Black, Ben Borstein, Kevin Caldwell, Orrin Dahanaggamaarachchi, Joe Dai, Abeer Fatima, Stephanie Lu, Maxime Michet, Anoushka Paul, Carrie Ann Po, Shivesh Prakash, Noa Prosser, Riddhiman Roy, Mirai Shinjo, Iliya Shofman, et al. (4 additional authors not shown)

    Abstract: Satellite remote sensing missions have gained popularity over the past fifteen years due to their ability to cover large swaths of land at regular intervals, making them ideal for monitoring environmental trends. The FINCH mission, a 3U+ CubeSat equipped with a hyperspectral camera, aims to monitor crop residue cover in agricultural fields. Although hyperspectral imaging captures both spectral and…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: To appear in 38th Annual Small Satellite Conference

  47. arXiv:2406.07661  [pdf, other]

    cs.CV cs.RO

    ROADWork Dataset: Learning to Recognize, Observe, Analyze and Drive Through Work Zones

    Authors: Anurag Ghosh, Robert Tamburo, Shen Zheng, Juan R. Alvarez-Padilla, Hailiang Zhu, Michael Cardei, Nicholas Dunn, Christoph Mertz, Srinivasa G. Narasimhan

    Abstract: Perceiving and navigating through work zones is challenging and under-explored, even with major strides in self-driving research. An important reason is the lack of open datasets for developing new algorithms to address this long-tailed scenario. We propose the ROADWork dataset to learn how to recognize, observe and analyze and drive through work zones. We find that state-of-the-art foundation mod…

    Submitted 11 June, 2024; originally announced June 2024.

  48. arXiv:2406.06542  [pdf, other]

    cs.AR cs.LG

    vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs

    Authors: Size Zheng, Renze Chen, Meng Li, Zihao Ye, Luis Ceze, Yun Liang

    Abstract: IoT devices based on microcontroller units (MCU) provide ultra-low power consumption and ubiquitous computation for near-sensor deep learning models (DNN). However, the memory of MCU is usually 2-3 orders of magnitude smaller than mobile devices, which makes it challenging to map DNNs onto MCUs. Previous work separates memory management and kernel implementation for MCU and relies on coarse-graine…

    Submitted 1 May, 2024; originally announced June 2024.

  49. arXiv:2406.05647  [pdf, other]

    eess.SP cs.ET

    Sustainable Wireless Networks via Reconfigurable Intelligent Surfaces (RISs): Overview of the ETSI ISG RIS

    Authors: Ruiqi Liu, Shuang Zheng, Qingqing Wu, Yifan Jiang, Nan Zhang, Yuanwei Liu, Marco Di Renzo, George C. Alexandropoulos

    Abstract: Reconfigurable Intelligent Surfaces (RISs) are a novel form of ultra-low power devices that are capable to increase the communication data rates as well as the cell coverage in a cost- and energy-efficient way. This is attributed to their programmable operation that enables them to dynamically manipulate the wireless propagation environment, a feature that has lately inspired numerous research inv…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: 7 pages, 5 figures, submitted to an IEEE Magazine

  50. arXiv:2406.04520  [pdf, other]

    cs.CL cs.AI

    NATURAL PLAN: Benchmarking LLMs on Natural Language Planning

    Authors: Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang, Xinyun Chen, Minmin Chen, Azade Nova, Le Hou, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou

    Abstract: We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing 3 key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full information on the task, by providing outputs from tools such as Google Flights, Google Maps, and Google Calendar as contexts to the models. This eliminates the need for…

    Submitted 6 June, 2024; originally announced June 2024.