1057 results for au:Luo_Y in:cs

Benchmarking Smoothness and Reducing High-Frequency Oscillations in Continuous Control Policies
Guilherme Christmann, Ying-Sheng Luo, Hanjaya Mandala, Wei-Chao Chen
Oct 23 2024 cs.RO cs.LG arXiv:2410.16632v1

@misc{2410.16632, author = {Guilherme Christmann and Ying-Sheng Luo and Hanjaya Mandala and Wei-Chao Chen}, title = {{B}enchmarking {S}moothness and {R}educing {H}igh-{F}requency {O}scillations in {C}ontinuous {C}ontrol {P}olicies}, year = {2024}, eprint = {2410.16632}, note = {arXiv:2410.16632v1} }
Reinforcement learning (RL) policies are prone to high-frequency oscillations, which are especially undesirable when deploying to hardware in the real world. In this paper, we identify, categorize, and compare methods from the literature that aim to mitigate high-frequency oscillations in deep RL. We define two broad classes: loss regularization and architectural methods. At their core, these methods incentivize learning a smooth mapping, such that nearby states in the input space produce nearby actions in the output space. We present benchmarks in terms of policy performance and control smoothness on traditional RL environments from Gymnasium and a complex manipulation task, as well as three robotics locomotion tasks that include deployment and evaluation with real-world hardware. Finally, we also propose hybrid methods that combine elements from both loss regularization and architectural methods. We find that the best-performing hybrid outperforms other methods, and improves control smoothness by 26.8% over the baseline, with a worst-case performance degradation of just 2.8%.
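The idea that nearby states should produce nearby actions can be sketched as a penalty term added to a training loss. The following is a minimal illustration only, not one of the benchmarked methods: the linear `policy`, the Gaussian perturbation, and the `sigma` parameter are all assumptions made for the example.

```python
import random

def policy(state, weights):
    # Hypothetical linear policy: one action value per weight row.
    return [sum(w * s for w, s in zip(row, state)) for row in weights]

def smoothness_penalty(state, weights, sigma=0.01, rng=None):
    # Sample a nearby state by adding small Gaussian noise (assumed perturbation model).
    rng = rng or random.Random(0)
    nearby = [s + rng.gauss(0.0, sigma) for s in state]
    # Penalize the squared distance between the two resulting actions.
    a, a_near = policy(state, weights), policy(nearby, weights)
    return sum((x - y) ** 2 for x, y in zip(a, a_near))

state = [0.5, -1.0, 2.0]
weights = [[0.1, 0.2, -0.3], [1.0, 0.0, 0.5]]
penalty = smoothness_penalty(state, weights)
```

A policy with larger weights reacts more strongly to the same perturbation, so its penalty grows; adding such a term to the loss would push training toward smoother mappings.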
Fast Graph Sharpness-Aware Minimization for Enhancing and Accelerating Few-Shot Node Classification
Yihong Luo, Yuhan Chen, Siya Qiu, Yiwei Wang, Chen Zhang, Yan Zhou, Xiaochun Cao, Jing Tang
Oct 23 2024 cs.LG cs.AI arXiv:2410.16845v1

@misc{2410.16845, author = {Yihong Luo and Yuhan Chen and Siya Qiu and Yiwei Wang and Chen Zhang and Yan Zhou and Xiaochun Cao and Jing Tang}, title = {{F}ast {G}raph {S}harpness-{A}ware {M}inimization for {E}nhancing and {A}ccelerating {F}ew-{S}hot {N}ode {C}lassification}, year = {2024}, eprint = {2410.16845}, note = {arXiv:2410.16845v1} }
Graph Neural Networks (GNNs) have shown superior performance in node classification. However, GNNs perform poorly in the Few-Shot Node Classification (FSNC) task that requires robust generalization to make accurate predictions for unseen classes with limited labels. To tackle the challenge, we propose the integration of Sharpness-Aware Minimization (SAM)--a technique designed to enhance model generalization by finding a flat minimum of the loss landscape--into GNN training. The standard SAM approach, however, consists of two forward-backward steps in each training iteration, doubling the computational cost compared to the base optimizer (e.g., Adam). To mitigate this drawback, we introduce a novel algorithm, Fast Graph Sharpness-Aware Minimization (FGSAM), that integrates the rapid training of Multi-Layer Perceptrons (MLPs) with the superior performance of GNNs. Specifically, we utilize GNNs for parameter perturbation while employing MLPs to minimize the perturbed loss so that we can find a flat minimum with good generalization more efficiently. Moreover, our method reutilizes the gradient from the perturbation phase to incorporate graph topology into the minimization process at almost zero additional cost. To further enhance training efficiency, we develop FGSAM+ that executes exact perturbations periodically. Extensive experiments demonstrate that our proposed algorithm outperforms the standard SAM with lower computational costs in FSNC tasks. In particular, our FGSAM+ as a SAM variant offers a faster optimization than the base optimizer in most cases. In addition to FSNC, our proposed methods also demonstrate competitive performance in the standard node classification task for heterophilic graphs, highlighting the broad applicability. The code is available at https://github.com/draym28/FGSAM_NeurIPS24.
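The two forward-backward passes of standard SAM mentioned above can be sketched on a toy problem. This is a generic illustration of plain SAM under assumed settings (a quadratic `loss`, hand-written `grad`, and arbitrary `lr` and `rho`); it is not FGSAM and does not involve GNNs or MLPs.

```python
import math

def loss(w):
    # Toy quadratic loss standing in for a real training objective.
    return 0.5 * sum(x * x for x in w)

def grad(w):
    # Gradient of the toy quadratic loss.
    return list(w)

def sam_step(w, lr=0.1, rho=0.05):
    # First pass: gradient at the current weights.
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g)) + 1e-12
    # Perturb the weights along the gradient direction, scaled to radius rho.
    w_adv = [x + rho * gx / norm for x, gx in zip(w, g)]
    # Second pass: gradient at the perturbed weights.
    g_adv = grad(w_adv)
    # Update the original weights using the perturbed gradient.
    return [x - lr * gx for x, gx in zip(w, g_adv)]

w = [1.0, -2.0]
w_next = sam_step(w)
```

Each call to `sam_step` evaluates `grad` twice, which is the doubled cost relative to a base optimizer that the abstract refers to.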
Tokens on Demand: Token Condensation as Training-free Test-time Adaptation
Zixin Wang, Dong Gong, Sen Wang, Zi Huang, Yadan Luo
Oct 22 2024 cs.CV cs.AI cs.CL cs.LG arXiv:2410.14729v1

@misc{2410.14729, author = {Zixin Wang and Dong Gong and Sen Wang and Zi Huang and Yadan Luo}, title = {{T}okens on {D}emand: {T}oken {C}ondensation as {T}raining-free {T}est-time {A}daptation}, year = {2024}, eprint = {2410.14729}, note = {arXiv:2410.14729v1} }
In this work, we introduce Token Condensation as Adaptation (TCA), a training-free approach designed to mitigate distribution shifts encountered by vision-language models (VLMs) during test-time inference. TCA bridges distribution gaps at the patch level by condensing image tokens that exhibit low attentiveness to the <cls> token. Recognizing the <cls> token may correspond to universal concepts, TCA identifies and tracks the most reliable <cls> tokens that align specifically with target classes from historical data streams. To achieve this, we propose a context token reservoir (CTR), which retains tokens with the lowest uncertainty as ``anchors'' to guide the preservation of class-relevant tokens during inference. These anchors, in turn, act as token-level classifiers to correct VLM predictions and improve visual-text alignment. Utilizing anchors sampled from CTR, TCA condenses tokens through two operations: (1) pruning class-irrelevant tokens that consistently rank low across all attention heads to reach cross-head consensus on their irrelevance, and (2) merging the remaining class-ambiguous tokens into representative centers using coreset selection, maintaining linear computational complexity. As the first method to explore token efficiency in test-time adaptation, TCA consistently demonstrates superior performance across cross-dataset and out-of-distribution adaptation tasks, reducing GFLOPs by 12.2% to 48.9% while achieving accuracy improvements up to 21.4% against the strongest baseline without introducing additional parameters.
SIFM: A Foundation Model for Multi-granularity Arctic Sea Ice Forecasting
Jingyi Xu, Yeqi Luo, Weidong Yang, Keyi Liu, Shengnan Wang, Ben Fei, Lei Bai
Oct 22 2024 cs.LG physics.ao-ph arXiv:2410.14732v1

@misc{2410.14732, author = {Jingyi Xu and Yeqi Luo and Weidong Yang and Keyi Liu and Shengnan Wang and Ben Fei and Lei Bai}, title = {{SIFM}: {A} {F}oundation {M}odel for {M}ulti-granularity {A}rctic {S}ea {I}ce {F}orecasting}, year = {2024}, eprint = {2410.14732}, note = {arXiv:2410.14732v1} }
Arctic sea ice plays a vital role in global climate and has paramount impacts on both polar ecosystems and coastal communities. In the last few years, multiple deep learning based pan-Arctic sea ice concentration (SIC) forecasting methods have emerged and showcased superior performance over physics-based dynamical models. However, previous methods forecast SIC at a fixed temporal granularity, e.g. sub-seasonal or seasonal, thus only leveraging intra-granularity information and overlooking the plentiful inter-granularity correlations. SIC at various temporal granularities exhibits cumulative effects and is naturally consistent, with short-term fluctuations potentially impacting long-term trends and long-term trends providing effective hints for facilitating short-term forecasts in Arctic sea ice. Therefore, in this study, we propose to cultivate temporal multi-granularity that is naturally derived from Arctic sea ice reanalysis data and provide a unified perspective for modeling SIC via our Sea Ice Foundation Model. SIFM is delicately designed to leverage both intra-granularity and inter-granularity information for capturing granularity-consistent representations that promote forecasting skills. Our extensive experiments show that SIFM outperforms off-the-shelf deep learning models for their specific temporal granularity.
A Recommendation Model Utilizing Separation Embedding and Self-Attention for Feature Mining
Wenyi Liu, Rui Wang, Yuanshuai Luo, Jianjun Wei, Zihao Zhao, Junming Huang
Oct 22 2024 cs.IR cs.AI arXiv:2410.15026v1

@misc{2410.15026, author = {Wenyi Liu and Rui Wang and Yuanshuai Luo and Jianjun Wei and Zihao Zhao and Junming Huang}, title = {{A} {R}ecommendation {M}odel {U}tilizing {S}eparation {E}mbedding and {S}elf-{A}ttention for {F}eature {M}ining}, year = {2024}, eprint = {2410.15026}, note = {arXiv:2410.15026v1} }
With the explosive growth of Internet data, users are facing the problem of information overload, which makes it a challenge to efficiently obtain the required resources. Recommendation systems have emerged in this context. By filtering massive amounts of information, they provide users with content that meets their needs, playing a key role in scenarios such as advertising recommendation and product recommendation. However, traditional click-through rate prediction and TOP-K recommendation mechanisms are gradually unable to meet the recommendation needs of modern life scenarios due to high computational complexity, large memory consumption, long feature selection time, and insufficient feature interaction. This paper proposes a recommendation system model based on a separation embedding cross-network. The model uses an embedding neural network layer to transform sparse feature vectors into dense embedding vectors, and can independently perform feature cross operations on different dimensions, thereby improving the accuracy and depth of feature mining. Experimental results show that the model exhibits stronger adaptability and higher prediction accuracy in processing complex data sets, effectively addressing the problems of existing models.
Environment Scan of Generative AI Infrastructure for Clinical and Translational Science
Betina Idnay, Zihan Xu, William G. Adams, Mohammad Adibuzzaman, Nicholas R. Anderson, Neil Bahroos, Douglas S. Bell, Cody Bumgardner, Thomas Campion, Mario Castro, James J. Cimino, I. Glenn Cohen, David Dorr, Peter L Elkin, Jungwei W. Fan, Todd Ferris, David J. Foran, David Hanauer, Mike Hogarth, Kun Huang, et al (37)
Oct 18 2024 cs.CY cs.AI cs.HC arXiv:2410.12793v1

@misc{2410.12793, author = {Betina Idnay and Zihan Xu and William G.~Adams and Mohammad Adibuzzaman and Nicholas R.~Anderson and Neil Bahroos and Douglas S.~Bell and Cody Bumgardner and Thomas Campion and Mario Castro and James J.~Cimino and I.~Glenn Cohen and David Dorr and Peter L Elkin and Jungwei W.~Fan and Todd Ferris and David J.~Foran and David Hanauer and Mike Hogarth and Kun Huang and Jayashree Kalpathy-Cramer and Manoj Kandpal and Niranjan S.~Karnik and Avnish Katoch and Albert M.~Lai and Christophe G.~Lambert and Lang Li and Christopher Lindsell and Jinze Liu and Zhiyong Lu and Yuan Luo and Peter McGarvey and Eneida A.~Mendonca and Parsa Mirhaji and Shawn Murphy and John D.~Osborne and Ioannis C.~Paschalidis and Paul A.~Harris and Fred Prior and Nicholas J.~Shaheen and Nawar Shara and Ida Sim and Umberto Tachinardi and Lemuel R.~Waitman and Rosalind J.~Wright and Adrian H.~Zai and Kai Zheng and Sandra Soo-Jin Lee and Bradley A.~Malin and Karthik Natarajan and W.~Nicholson Price II and Rui Zhang and Yiye Zhang and Hua Xu and Jiang Bian and Chunhua Weng and Yifan Peng}, title = {{E}nvironment {S}can of {G}enerative {AI} {I}nfrastructure for {C}linical and {T}ranslational {S}cience}, year = {2024}, eprint = {2410.12793}, note = {arXiv:2410.12793v1} }
This study reports a comprehensive environmental scan of the generative AI (GenAI) infrastructure in the national network for clinical and translational science across 36 institutions supported by the Clinical and Translational Science Award (CTSA) Program led by the National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health (NIH) in the United States. With the rapid advancement of GenAI technologies, including large language models (LLMs), healthcare institutions face unprecedented opportunities and challenges. This research explores the current status of GenAI integration, focusing on stakeholder roles, governance structures, and ethical considerations by administering a survey among leaders of health institutions (i.e., representing academic medical centers and health systems) to assess the institutional readiness and approach towards GenAI adoption. Key findings indicate a diverse range of institutional strategies, with most organizations in the experimental phase of GenAI deployment. The study highlights significant variations in governance models, with a strong preference for centralized decision-making but notable gaps in workforce training and ethical oversight. Moreover, the results underscore the need for a more coordinated approach to GenAI governance, emphasizing collaboration among senior leaders, clinicians, information technology staff, and researchers. Our analysis also reveals concerns regarding GenAI bias, data security, and stakeholder trust, which must be addressed to ensure the ethical and effective implementation of GenAI technologies. This study offers valuable insights into the challenges and opportunities of GenAI integration in healthcare, providing a roadmap for institutions aiming to leverage GenAI for improved quality of care and operational efficiency.
Hiformer: Hybrid Frequency Feature Enhancement Inverted Transformer for Long-Term Wind Power Prediction
Chongyang Wan, Shunbo Lei, Yuan Luo
Oct 18 2024 cs.LG cs.AI arXiv:2410.13303v1

@misc{2410.13303, author = {Chongyang Wan and Shunbo Lei and Yuan Luo}, title = {{H}iformer: {H}ybrid {F}requency {F}eature {E}nhancement {I}nverted {T}ransformer for {L}ong-{T}erm {W}ind {P}ower {P}rediction}, year = {2024}, eprint = {2410.13303}, note = {arXiv:2410.13303v1} }
The increasing severity of climate change necessitates an urgent transition to renewable energy sources, making the large-scale adoption of wind energy crucial for mitigating environmental impact. However, the inherent uncertainty of wind power poses challenges for grid stability, underscoring the need for accurate wind energy prediction models to enable effective power system planning and operation. While many existing studies on wind power prediction focus on short-term forecasting, they often overlook the importance of long-term predictions. Long-term wind power forecasting is essential for effective power grid dispatch and market transactions, as it requires careful consideration of weather features such as wind speed and direction, which directly influence power output. Consequently, methods designed for short-term predictions may lead to inaccurate results and high computational costs in long-term settings. To address these limitations, we propose a novel approach called Hybrid Frequency Feature Enhancement Inverted Transformer (Hiformer). Hiformer introduces a unique structure that integrates signal decomposition technology with a weather feature extraction technique to enhance the modeling of correlations between meteorological conditions and wind power generation. Additionally, Hiformer employs an encoder-only architecture, which reduces the computational complexity associated with long-term wind power forecasting. Compared to the state-of-the-art methods, Hiformer: (i) can improve the prediction accuracy by up to 52.5\%; and (ii) can reduce computational time by up to 68.5\%.
Movie Gen: A Cast of Media Foundation Models
Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, et al (68)
Oct 18 2024 cs.CV cs.AI cs.LG eess.IV arXiv:2410.13720v1

@misc{2410.13720, author = {Adam Polyak and Amit Zohar and Andrew Brown and Andros Tjandra and Animesh Sinha and Ann Lee and Apoorv Vyas and Bowen Shi and Chih-Yao Ma and Ching-Yao Chuang and David Yan and Dhruv Choudhary and Dingkang Wang and Geet Sethi and Guan Pang and Haoyu Ma and Ishan Misra and Ji Hou and Jialiang Wang and Kiran Jagadeesh and Kunpeng Li and Luxin Zhang and Mannat Singh and Mary Williamson and Matt Le and Matthew Yu and Mitesh Kumar Singh and Peizhao Zhang and Peter Vajda and Quentin Duval and Rohit Girdhar and Roshan Sumbaly and Sai Saketh Rambhatla and Sam Tsai and Samaneh Azadi and Samyak Datta and Sanyuan Chen and Sean Bell and Sharadh Ramaswamy and Shelly Sheynin and Siddharth Bhattacharya and Simran Motwani and Tao Xu and Tianhe Li and Tingbo Hou and Wei-Ning Hsu and Xi Yin and Xiaoliang Dai and Yaniv Taigman and Yaqiao Luo and Yen-Cheng Liu and Yi-Chiao Wu and Yue Zhao and Yuval Kirstain and Zecheng He and Zijian He and Albert Pumarola and Ali Thabet and Artsiom Sanakoyeu and Arun Mallya and Baishan Guo and Boris Araya and Breena Kerr and Carleigh Wood and Ce Liu and Cen Peng and Dimitry Vengertsev and Edgar Schonfeld and Elliot Blanchard and Felix Juefei-Xu and Fraylie Nord and Jeff Liang and John Hoffman and Jonas Kohler and Kaolin Fire and Karthik Sivakumar and Lawrence Chen and Licheng Yu and Luya Gao and Markos Georgopoulos and Rashel Moritz and Sara K.~Sampson and Shikai Li and Simone Parmeggiani and Steve Fine and Tara Fowler and Vladan Petrovic and Yuming Du}, title = {{M}ovie {G}en: {A} {C}ast of {M}edia {F}oundation {M}odels}, year = {2024}, eprint = {2410.13720}, note = {arXiv:2410.13720v1} }
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models. All videos from this paper are available at https://go.fb.me/MovieGenResearchVideos.
$\gamma-$MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models
Yaxin Luo, Gen Luo, Jiayi Ji, Yiyi Zhou, Xiaoshuai Sun, Zhiqiang Shen, Rongrong Ji
Oct 18 2024 cs.CV arXiv:2410.13859v1

@misc{2410.13859, author = {Yaxin Luo and Gen Luo and Jiayi Ji and Yiyi Zhou and Xiaoshuai Sun and Zhiqiang Shen and Rongrong Ji}, title = {$\gamma-${M}o{D}: {E}xploring {M}ixture-of-{D}epth {A}daptation for {M}ultimodal {L}arge {L}anguage {M}odels}, year = {2024}, eprint = {2410.13859}, note = {arXiv:2410.13859v1} }
Despite the significant progress in multimodal large language models (MLLMs), their high computational cost remains a barrier to real-world deployment. Inspired by the mixture of depths (MoDs) in natural language processing, we aim to address this limitation from the perspective of ``activated tokens''. Our key insight is that if most tokens are redundant for the layer computation, then they can be skipped directly via the MoD layer. However, directly converting the dense layers of MLLMs to MoD layers leads to substantial performance degradation. To address this issue, we propose an innovative MoD adaptation strategy for existing MLLMs called $\gamma$-MoD. In $\gamma$-MoD, a novel metric is proposed to guide the deployment of MoDs in the MLLM, namely rank of attention maps (ARank). Through ARank, we can effectively identify which layer is redundant and should be replaced with the MoD layer. Based on ARank, we further propose two novel designs to maximize the computational sparsity of MLLM while maintaining its performance, namely shared vision-language router and masked routing learning. With these designs, more than 90% of the dense layers of the MLLM can be effectively converted to MoD layers. To validate our method, we apply it to three popular MLLMs, and conduct extensive experiments on 9 benchmark datasets. Experimental results not only validate the significant efficiency benefit of $\gamma$-MoD to existing MLLMs but also confirm its generalization ability on various MLLMs. For example, with a minor performance drop, i.e., -1.5%, $\gamma$-MoD can reduce the training and inference time of LLaVA-HR by 31.0% and 53.2%, respectively.
DAQ: Density-Aware Post-Training Weight-Only Quantization For LLMs
Yingsong Luo, Ling Chen
Oct 17 2024 cs.LG cs.AI arXiv:2410.12187v2

@misc{2410.12187, author = {Yingsong Luo and Ling Chen}, title = {{DAQ}: {D}ensity-{A}ware {P}ost-{T}raining {W}eight-{O}nly {Q}uantization {F}or {LLM}s}, year = {2024}, eprint = {2410.12187}, note = {arXiv:2410.12187v2} }
Large language models (LLMs) excel in various tasks but face deployment challenges due to hardware constraints. We propose density-aware post-training weight-only quantization (DAQ), which has two stages: 1) density-centric alignment, which identifies the center of high-density weights and centers the dynamic range on this point to align high-density weight regions with floating-point high-precision regions; 2) learnable dynamic range adjustment, which adjusts the dynamic range by optimizing quantization parameters (i.e., scale and zero-point) based on the impact of weights on the model output. Experiments on LLaMA and LLaMA-2 show that DAQ consistently outperforms the best baseline method, reducing perplexity loss by an average of 22.8% on LLaMA and 19.6% on LLaMA-2. Our code is available at https://github.com/LuoYingSong/DAQ.
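For readers unfamiliar with the quantization parameters named above, the following sketch shows what a scale and zero-point do in plain uniform weight quantization. It uses a simple min/max dynamic range, which is an assumption for illustration; it does not implement DAQ's density-centric alignment or learnable range adjustment, and the function names are hypothetical.

```python
def quantize(weights, n_bits=4):
    # Largest representable integer code for the chosen bit width.
    qmax = 2 ** n_bits - 1
    w_min, w_max = min(weights), max(weights)
    # Scale maps the floating-point range onto the integer grid.
    scale = (w_max - w_min) / qmax
    if scale == 0.0:
        scale = 1.0  # Degenerate case: all weights are equal.
    # Zero-point is the integer code that represents the value 0.0.
    zero_point = round(-w_min / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point

def dequantize(codes, scale, zero_point):
    # Map integer codes back to approximate floating-point weights.
    return [(c - zero_point) * scale for c in codes]

weights = [-1.0, -0.33, 0.0, 0.71, 2.0]
codes, scale, zero_point = quantize(weights)
restored = dequantize(codes, scale, zero_point)
```

Changing where the dynamic range sits (here fixed by `w_min` and `w_max`) changes which weights land close to a grid point, which is the degree of freedom that range-adjustment methods tune.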
IceDiff: High Resolution and High-Quality Sea Ice Forecasting with Generative Diffusion Prior
Jingyi Xu, Siwei Tu, Weidong Yang, Shuhao Li, Keyi Liu, Yeqi Luo, Lipeng Ma, Ben Fei, Lei Bai
Oct 15 2024 physics.ao-ph cs.AI cs.LG arXiv:2410.09111v1

@misc{2410.09111, author = {Jingyi Xu and Siwei Tu and Weidong Yang and Shuhao Li and Keyi Liu and Yeqi Luo and Lipeng Ma and Ben Fei and Lei Bai}, title = {{I}ce{D}iff: {H}igh {R}esolution and {H}igh-{Q}uality {S}ea {I}ce {F}orecasting with {G}enerative {D}iffusion {P}rior}, year = {2024}, eprint = {2410.09111}, note = {arXiv:2410.09111v1} }
Variation of Arctic sea ice has significant impacts on polar ecosystems, transporting routes, coastal communities, and global climate. Tracing the change of sea ice at a finer scale is paramount for both operational applications and scientific studies. Recent pan-Arctic sea ice forecasting methods that leverage advances in artificial intelligence have made promising progress over numerical models. However, forecasting sea ice at higher resolutions is still under-explored. To bridge the gap, we propose a two-stage deep learning framework, IceDiff, to forecast sea ice concentration at finer scales. IceDiff first leverages an independently trained vision transformer to generate coarse yet superior forecasting over previous methods at a regular 25km x 25km grid. This high-quality sea ice forecasting can be utilized as reliable guidance for the next stage. Subsequently, an unconditional diffusion model pre-trained on sea ice concentration maps is utilized for sampling down-scaled sea ice forecasting via a zero-shot guided sampling strategy and a patch-based method. For the first time, IceDiff demonstrates sea ice forecasting with the 6.25km x 6.25km resolution. IceDiff extends the boundary of existing sea ice forecasting models and, more importantly, its capability to generate high-resolution sea ice concentration data is vital for pragmatic usages and research.
Keys to Robust Edits: from Theoretical Insights to Practical Advances
Jianhao Yan, Futing Wang, Yun Luo, Yafu Li, Yue Zhang
Oct 15 2024 cs.CL arXiv:2410.09338v1

@misc{2410.09338, author = {Jianhao Yan and Futing Wang and Yun Luo and Yafu Li and Yue Zhang}, title = {{K}eys to {R}obust {E}dits: from {T}heoretical {I}nsights to {P}ractical {A}dvances}, year = {2024}, eprint = {2410.09338}, note = {arXiv:2410.09338v1} }
Large language models (LLMs) have revolutionized knowledge storage and retrieval, but face challenges with conflicting and outdated information. Knowledge editing techniques have been proposed to address these issues, yet they struggle with robustness tests involving long contexts, paraphrased subjects, and continuous edits. This work investigates the cause of these failures in locate-and-edit methods, offering theoretical insights into their key-value modeling and deriving mathematical bounds for robust and specific edits, leading to a novel 'group discussion' conceptual model for locate-and-edit methods. Empirical analysis reveals that keys used by current methods fail to meet robustness and specificity requirements. To address this, we propose a Robust Edit Pathway (REP) that disentangles editing keys from LLMs' inner representations. Evaluations on LLaMA2-7B and Mistral-7B using the CounterFact dataset show that REP significantly improves robustness across various metrics, both in-domain and out-of-domain, with minimal trade-offs in success rate and locality. Our findings advance the development of reliable and flexible knowledge updating in LLMs.
Deep Transfer Learning: Model Framework and Error Analysis
Yuling Jiao, Huazhen Lin, Yuchen Luo, Jerry Zhijian Yang
Oct 15 2024 cs.LG stat.ML arXiv:2410.09383v1

@misc{2410.09383, author = {Yuling Jiao and Huazhen Lin and Yuchen Luo and Jerry Zhijian Yang}, title = {{D}eep {T}ransfer {L}earning: {M}odel {F}ramework and {E}rror {A}nalysis}, year = {2024}, eprint = {2410.09383}, note = {arXiv:2410.09383v1} }
This paper presents a framework for deep transfer learning, which aims to leverage information from multi-domain upstream data with a large number of samples $n$ to a single-domain downstream task with a considerably smaller number of samples $m$, where $m \ll n$, in order to enhance performance on the downstream task. Our framework has several intriguing features. First, it allows the existence of both shared and specific features among multi-domain data and provides a framework for automatic identification, achieving precise transfer and utilization of information. Second, our model framework explicitly indicates the upstream features that contribute to downstream tasks, establishing a relationship between upstream domains and downstream tasks, thereby enhancing interpretability. Error analysis demonstrates that the transfer under our framework can significantly improve the convergence rate for learning Lipschitz functions in downstream supervised tasks, reducing it from $\tilde{O}(m^{-\frac{1}{2(d+2)}}+n^{-\frac{1}{2(d+2)}})$ ("no transfer") to $\tilde{O}(m^{-\frac{1}{2(d^*+3)}} + n^{-\frac{1}{2(d+2)}})$ ("partial transfer"), and even to $\tilde{O}(m^{-1/2}+n^{-\frac{1}{2(d+2)}})$ ("complete transfer"), where $d^* \ll d$ and $d$ is the dimension of the observed data. Our theoretical findings are substantiated by empirical experiments conducted on image classification datasets, along with a regression dataset.
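As a rough worked illustration of how the $m$-dependent term in the three rates differs, take the hypothetical values $d = 10$, $d^* = 2$, and $m = 10^4$ (none of these come from the paper), and ignore logarithmic factors and the $n$-dependent term:

```latex
\begin{align*}
\text{no transfer:} \quad & m^{-\frac{1}{2(d+2)}} = (10^4)^{-\frac{1}{24}} \approx 0.68 \\
\text{partial transfer:} \quad & m^{-\frac{1}{2(d^*+3)}} = (10^4)^{-\frac{1}{10}} \approx 0.40 \\
\text{complete transfer:} \quad & m^{-\frac{1}{2}} = (10^4)^{-\frac{1}{2}} = 0.01
\end{align*}
```

Under these assumed values, the downstream term shrinks much faster with $m$ once its exponent depends on $d^*$ or on no dimension at all, rather than on the full data dimension $d$.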
Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting
Yifan Luo, Zhennan Zhou, Meitan Wang, Bin Dong
Oct 15 2024 cs.CL cs.AI arXiv:2410.10150v1

@misc{2410.10150, author = {Yifan Luo and Zhennan Zhou and Meitan Wang and Bin Dong}, title = {{J}ailbreak {I}nstruction-{T}uned {LLM}s via end-of-sentence {MLP} {R}e-weighting}, year = {2024}, eprint = {2410.10150}, note = {arXiv:2410.10150v1} }
In this paper, we investigate the safety mechanisms of instruction fine-tuned large language models (LLMs). We discover that re-weighting MLP neurons can significantly compromise a model's safety, especially for MLPs in end-of-sentence inferences. We hypothesize that LLMs evaluate the harmfulness of prompts during end-of-sentence inferences, and MLP layers play a critical role in this process. Based on this hypothesis, we develop 2 novel white-box jailbreak methods: a prompt-specific method and a prompt-general method. The prompt-specific method targets individual prompts and optimizes the attack on the fly, while the prompt-general method is pre-trained offline and can generalize to unseen harmful prompts. Our methods demonstrate robust performance across 7 popular open-source LLMs, with sizes ranging from 2B to 72B. Furthermore, our study provides insights into vulnerabilities in the safety of instruction-tuned LLMs and deepens the understanding of the internal mechanisms of LLMs.
AFlow: Automating Agentic Workflow Generation
Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, Chenglin Wu
Oct 15 2024 cs.AI cs.CL cs.LG cs.SE arXiv:2410.10762v1

@misc{2410.10762, author = {Jiayi Zhang and Jinyu Xiang and Zhaoyang Yu and Fengwei Teng and Xionghui Chen and Jiaqi Chen and Mingchen Zhuge and Xin Cheng and Sirui Hong and Jinlin Wang and Bingnan Zheng and Bang Liu and Yuyu Luo and Chenglin Wu}, title = {{AF}low: {A}utomating {A}gentic {W}orkflow {G}eneration}, year = {2024}, eprint = {2410.10762}, note = {arXiv:2410.10762v1} }
Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains, typically by employing agentic workflows that follow detailed instructions and operational sequences. However, constructing these workflows requires significant human effort, limiting scalability and generalizability. Recent research has sought to automate the generation and optimization of these workflows, but existing methods still rely on initial manual setup and fall short of achieving fully automated and effective workflow generation. To address this challenge, we reformulate workflow optimization as a search problem over code-represented workflows, where LLM-invoking nodes are connected by edges. We introduce AFlow, an automated framework that efficiently explores this space using Monte Carlo Tree Search, iteratively refining workflows through code modification, tree-structured experience, and execution feedback. Empirical evaluations across six benchmark datasets demonstrate AFlow's efficacy, yielding a 5.7% average improvement over state-of-the-art baselines. Furthermore, AFlow enables smaller models to outperform GPT-4o on specific tasks at 4.55% of its inference cost in dollars. The code will be available at https://github.com/geekan/MetaGPT.
Balancing Innovation and Privacy: Data Security Strategies in Natural Language Processing Applications
Shaobo Liu, Guiran Liu, Binrong Zhu, Yuanshuai Luo, Linxiao Wu, Rui Wang
Oct 14 2024 cs.CR cs.AI cs.CL arXiv:2410.08553v1

@misc{2410.08553, author = {Shaobo Liu and Guiran Liu and Binrong Zhu and Yuanshuai Luo and Linxiao Wu and Rui Wang}, title = {{B}alancing {I}nnovation and {P}rivacy: {D}ata {S}ecurity {S}trategies in {N}atural {L}anguage {P}rocessing {A}pplications}, year = {2024}, eprint = {2410.08553}, note = {arXiv:2410.08553v1} }
This research addresses privacy protection in Natural Language Processing (NLP) by introducing a novel algorithm based on differential privacy, aimed at safeguarding user data in common applications such as chatbots, sentiment analysis, and machine translation. With the widespread application of NLP technology, the security and privacy protection of user data have become important issues that need to be solved urgently. This paper proposes a new privacy protection algorithm designed to effectively prevent the leakage of user sensitive information. By introducing a differential privacy mechanism, our model ensures the accuracy and reliability of data analysis results while adding random noise. This method not only reduces the risk caused by data leakage but also achieves effective processing of data while protecting user privacy. Compared to traditional privacy methods like data anonymization and homomorphic encryption, our approach offers significant advantages in terms of computational efficiency and scalability while maintaining high accuracy in data analysis. The proposed algorithm's efficacy is demonstrated through performance metrics such as accuracy (0.89), precision (0.85), and recall (0.88), outperforming other methods in balancing privacy and utility. As privacy protection regulations become increasingly stringent, enterprises and developers must take effective measures to deal with privacy risks. Our research provides an important reference for the application of privacy protection technology in the field of NLP, emphasizing the need to achieve a balance between technological innovation and user privacy. In the future, with the continuous advancement of technology, privacy protection will become a core element of data-driven applications and promote the healthy development of the entire industry.
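The abstract does not specify which differential privacy mechanism is used, so the following is only a sketch of the textbook Laplace mechanism applied to a counting query, to show what "adding random noise" means in practice. The function names, the example data, and the choice of `epsilon` are assumptions for illustration.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential samples is Laplace-distributed
    # with mean 0 and the given scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(values, predicate, epsilon, rng):
    # A counting query changes by at most 1 when one record is added or removed.
    sensitivity = 1.0
    true_count = sum(1 for v in values if predicate(v))
    # Smaller epsilon means more noise and stronger privacy.
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)
messages = ["refund please", "great service", "cancel my account", "refund status"]
noisy = private_count(messages, lambda m: "refund" in m, epsilon=1.0, rng=rng)
```

The noisy answer is unbiased, so averages over many queries stay close to the true count, while any single released value hides whether one particular record was present.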
ViT3D Alignment of LLaMA3: 3D Medical Image Report Generation
Siyou Li, Beining Xu, Yihao Luo, Dong Nie, Le Zhang
Oct 14 2024 eess.IV cs.AI cs.CV arXiv:2410.08588v1

@misc{2410.08588, author = {Siyou Li and Beining Xu and Yihao Luo and Dong Nie and Le Zhang}, title = {{V}i{T}3{D} {A}lignment of {LL}a{MA}3: 3{D} {M}edical {I}mage {R}eport {G}eneration}, year = {2024}, eprint = {2410.08588}, note = {arXiv:2410.08588v1} }
Automatic medical report generation (MRG), which aims to produce detailed text reports from medical images, has emerged as a critical task in this domain. MRG systems can enhance radiological workflows by reducing the time and effort required for report writing, thereby improving diagnostic efficiency. In this work, we present a novel approach for automatic MRG utilizing a multimodal large language model. Specifically, we employ the 3D Vision Transformer (ViT3D) image encoder introduced in M3D-CLIP to process 3D scans and use Asclepius-Llama3-8B as the language model to generate the text reports by auto-regressive decoding. The experiment shows that our model achieved an average Green score of 0.3 on the MRG task validation set and an average accuracy of 0.61 on the visual question answering (VQA) task validation set, outperforming the baseline model. Our approach demonstrates the effectiveness of the ViT3D alignment of LLaMA3 for automatic MRG and VQA tasks by tuning the model on a small dataset.
MKGL: Mastery of a Three-Word Language
Lingbing Guo, Zhongpu Bo, Zhuo Chen, Yichi Zhang, Jiaoyan Chen, Yarong Lan, Mengshu Sun, Zhiqiang Zhang, Yangyifei Luo, Qian Li, Qiang Zhang, Wen Zhang, Huajun Chen
Oct 11 2024 cs.CL cs.AI arXiv:2410.07526v1

@misc{2410.07526, author = {Lingbing Guo and Zhongpu Bo and Zhuo Chen and Yichi Zhang and Jiaoyan Chen and Yarong Lan and Mengshu Sun and Zhiqiang Zhang and Yangyifei Luo and Qian Li and Qiang Zhang and Wen Zhang and Huajun Chen}, title = {{MKGL}: {M}astery of a {T}hree-{W}ord {L}anguage}, year = {2024}, eprint = {2410.07526}, note = {arXiv:2410.07526v1} }
Large language models (LLMs) have significantly advanced performance across a spectrum of natural language processing (NLP) tasks. Yet, their application to knowledge graphs (KGs), which describe facts in the form of triplets and allow minimal hallucinations, remains an underexplored frontier. In this paper, we investigate the integration of LLMs with KGs by introducing a specialized KG Language (KGL), where a sentence precisely consists of an entity noun, a relation verb, and ends with another entity noun. Despite KGL's unfamiliar vocabulary to the LLM, we facilitate its learning through a tailored dictionary and illustrative sentences, and enhance context understanding via real-time KG context retrieval and KGL token embedding augmentation. Our results reveal that LLMs can achieve fluency in KGL, drastically reducing errors compared to conventional KG embedding methods on KG completion. Furthermore, our enhanced LLM shows exceptional competence in generating accurate three-word sentences from an initial entity and interpreting new unseen terms out of KGs.
Learning Recommender Systems with Soft Target: A Decoupled Perspective
Hao Zhang, Mingyue Cheng, Qi Liu, Yucong Luo, Rui Li, Enhong Chen
Oct 10 2024 cs.IR arXiv:2410.06536v1

@misc{2410.06536, author = {Hao Zhang and Mingyue Cheng and Qi Liu and Yucong Luo and Rui Li and Enhong Chen}, title = {{L}earning {R}ecommender {S}ystems with {S}oft {T}arget: {A} {D}ecoupled {P}erspective}, year = {2024}, eprint = {2410.06536}, note = {arXiv:2410.06536v1} }
Learning recommender systems with a multi-class optimization objective is a prevalent setting in recommendation. However, as observed user feedback often accounts for a tiny fraction of the entire item pool, the standard Softmax loss tends to ignore the difference between potential positive feedback and truly negative feedback. To address this challenge, we propose a novel decoupled soft label optimization framework to consider the objectives as two aspects by leveraging soft labels, including target confidence and the latent interest distribution of non-target items. Furthermore, based on our careful theoretical analysis, we design a decoupled loss function to flexibly adjust the importance of these two aspects. To maximize the performance of the proposed method, we additionally present a sensible soft-label generation algorithm that models a label propagation algorithm to explore users' latent interests in unobserved feedback via neighbors. We conduct extensive experiments on various recommendation system models and public datasets; the results demonstrate the effectiveness and generality of the proposed method.
Text-guided Diffusion Model for 3D Molecule Generation
Yanchen Luo, Junfeng Fang, Sihang Li, Zhiyuan Liu, Jiancan Wu, An Zhang, Wenjie Du, Xiang Wang
Oct 08 2024 cs.LG cs.AI physics.chem-ph q-bio.BM arXiv:2410.03803v1

@misc{2410.03803, author = {Yanchen Luo and Junfeng Fang and Sihang Li and Zhiyuan Liu and Jiancan Wu and An Zhang and Wenjie Du and Xiang Wang}, title = {{T}ext-guided {D}iffusion {M}odel for 3{D} {M}olecule {G}eneration}, year = {2024}, eprint = {2410.03803}, note = {arXiv:2410.03803v1} }
The de novo generation of molecules with targeted properties is crucial in biology, chemistry, and drug discovery. Current generative models are limited to using single property values as conditions, struggling with complex customizations described in detailed human language. To address this, we propose the text guidance instead, and introduce TextSMOG, a new Text-guided Small Molecule Generation Approach via 3D Diffusion Model which integrates language and diffusion models for text-guided small molecule generation. This method uses textual conditions to guide molecule generation, enhancing both stability and diversity. Experimental results show TextSMOG's proficiency in capturing and utilizing information from textual descriptions, making it a powerful tool for generating 3D molecular structures in response to complex textual customizations.
Data Playwright: Authoring Data Videos with Annotated Narration
Leixian Shen, Haotian Li, Yun Wang, Tianqi Luo, Yuyu Luo, Huamin Qu
Oct 07 2024 cs.HC arXiv:2410.03093v1

@misc{2410.03093, author = {Leixian Shen and Haotian Li and Yun Wang and Tianqi Luo and Yuyu Luo and Huamin Qu}, title = {{D}ata {P}laywright: {A}uthoring {D}ata {V}ideos with {A}nnotated {N}arration}, year = {2024}, eprint = {2410.03093}, note = {arXiv:2410.03093v1} }
Creating data videos that effectively narrate stories with animated visuals requires substantial effort and expertise. A promising research trend is leveraging the easy-to-use natural language (NL) interaction to automatically synthesize data video components from narrative content like text narrations, or NL commands that specify user-required designs. Nevertheless, previous research has overlooked the integration of narrative content and specific design authoring commands, leading to generated results that lack customization or fail to seamlessly fit into the narrative context. To address these issues, we introduce a novel paradigm for creating data videos, which seamlessly integrates users' authoring and narrative intents in a unified format called annotated narration, allowing users to incorporate NL commands for design authoring as inline annotations within the narration text. Informed by a formative study on users' preference for annotated narration, we develop a prototype system named Data Playwright that embodies this paradigm for effective creation of data videos. Within Data Playwright, users can write annotated narration based on uploaded visualizations. The system's interpreter automatically understands users' inputs and synthesizes data videos with narration-animation interplay, powered by large language models. Finally, users can preview and fine-tune the video. A user study demonstrated that participants can effectively create data videos with Data Playwright by effortlessly articulating their desired outcomes through annotated narration.
Differentiable Interacting Multiple Model Particle Filtering
John-Joseph Brady, Yuhui Luo, Wenwu Wang, Víctor Elvira, Yunpeng Li
Oct 02 2024 stat.ML cs.LG eess.SP arXiv:2410.00620v1

@misc{2410.00620, author = {John-Joseph Brady and Yuhui Luo and Wenwu Wang and Víctor Elvira and Yunpeng Li}, title = {{D}ifferentiable {I}nteracting {M}ultiple {M}odel {P}article {F}iltering}, year = {2024}, eprint = {2410.00620}, note = {arXiv:2410.00620v1} }
We propose a sequential Monte Carlo algorithm for parameter learning when the studied model exhibits random discontinuous jumps in behaviour. To facilitate the learning of high dimensional parameter sets, such as those associated to neural networks, we adopt the emerging framework of differentiable particle filtering, wherein parameters are trained by gradient descent. We design a new differentiable interacting multiple model particle filter to be capable of learning the individual behavioural regimes and the model which controls the jumping simultaneously. In contrast to previous approaches, our algorithm allows control of the computational effort assigned per regime whilst using the probability of being in a given regime to guide sampling. Furthermore, we develop a new gradient estimator that has a lower variance than established approaches and remains fast to compute, for which we prove consistency. We establish new theoretical results of the presented algorithms and demonstrate superior numerical performance compared to the previous state-of-the-art algorithms.
MG-Net: Learn to Customize QAOA with Circuit Depth Awareness
Yang Qian, Xinbiao Wang, Yuxuan Du, Yong Luo, Dacheng Tao
Sep 30 2024 quant-ph cs.AI cs.LG arXiv:2409.18692v1

@misc{2409.18692, author = {Yang Qian and Xinbiao Wang and Yuxuan Du and Yong Luo and Dacheng Tao}, title = {{MG}-{N}et: {L}earn to {C}ustomize {QAOA} with {C}ircuit {D}epth {A}wareness}, year = {2024}, eprint = {2409.18692}, note = {arXiv:2409.18692v1} }
Quantum Approximate Optimization Algorithm (QAOA) and its variants exhibit immense potential in tackling combinatorial optimization challenges. However, their practical realization confronts a dilemma: the requisite circuit depth for satisfactory performance is problem-specific and often exceeds the maximum capability of current quantum devices. To address this dilemma, here we first analyze the convergence behavior of QAOA, uncovering the origins of this dilemma and elucidating the intricate relationship between the employed mixer Hamiltonian, the specific problem at hand, and the permissible maximum circuit depth. Harnessing this understanding, we introduce the Mixer Generator Network (MG-Net), a unified deep learning framework adept at dynamically formulating optimal mixer Hamiltonians tailored to distinct tasks and circuit depths. Systematic simulations, encompassing Ising models and weighted Max-Cut instances with up to 64 qubits, substantiate our theoretical findings, highlighting MG-Net's superior performance in terms of both approximation ratio and efficiency.
VibraForge: A Scalable Prototyping Toolkit For Creating Spatialized Vibrotactile Feedback Systems
Bingjian Huang, Siyi Ren, Yuewen Luo, Qilong Cheng, Hanfeng Cai, Yeqi Sang, Mauricio Sousa, Paul H. Dietz, Daniel Wigdor
Sep 27 2024 cs.HC arXiv:2409.17420v1

@misc{2409.17420, author = {Bingjian Huang and Siyi Ren and Yuewen Luo and Qilong Cheng and Hanfeng Cai and Yeqi Sang and Mauricio Sousa and Paul H.~Dietz and Daniel Wigdor}, title = {{V}ibra{F}orge: {A} {S}calable {P}rototyping {T}oolkit {F}or {C}reating {S}patialized {V}ibrotactile {F}eedback {S}ystems}, year = {2024}, eprint = {2409.17420}, note = {arXiv:2409.17420v1} }
Spatialized vibrotactile feedback systems deliver tactile information by placing multiple vibrotactile actuators on the body. As increasing numbers of actuators are required to adequately convey information in complicated applications, haptic designers find it difficult to create such systems due to limited scalability of existing toolkits. We propose VibraForge, an open-source vibrotactile toolkit that supports up to 128 vibrotactile actuators. Each actuator is encapsulated within a self-contained vibration unit and driven by its own microcontroller. By leveraging a chain-connection method, each unit receives independent vibration commands from a control unit, with fine-grained control over intensity and frequency. We also designed a GUI Editor to expedite the authoring of spatial vibrotactile patterns. Technical evaluations show that vibration units reliably reproduce audio waveforms with low-latency and high-bandwidth data communication. Case studies of phonemic tactile display, virtual reality fitness training, and drone teleoperation demonstrate the potential usage of VibraForge within different domains.
Hierarchical End-to-End Autonomous Driving: Integrating BEV Perception with Deep Reinforcement Learning
Siyi Lu, Lei He, Shengbo Eben Li, Yugong Luo, Jianqiang Wang, Keqiang Li
Sep 27 2024 cs.AI arXiv:2409.17659v1

@misc{2409.17659, author = {Siyi Lu and Lei He and Shengbo Eben Li and Yugong Luo and Jianqiang Wang and Keqiang Li}, title = {{H}ierarchical {E}nd-to-{E}nd {A}utonomous {D}riving: {I}ntegrating {BEV} {P}erception with {D}eep {R}einforcement {L}earning}, year = {2024}, eprint = {2409.17659}, note = {arXiv:2409.17659v1} }
End-to-end autonomous driving offers a streamlined alternative to the traditional modular pipeline, integrating perception, prediction, and planning within a single framework. While Deep Reinforcement Learning (DRL) has recently gained traction in this domain, existing approaches often overlook the critical connection between feature extraction of DRL and perception. In this paper, we bridge this gap by mapping the DRL feature extraction network directly to the perception phase, enabling clearer interpretation through semantic segmentation. By leveraging Bird's-Eye-View (BEV) representations, we propose a novel DRL-based end-to-end driving framework that utilizes multi-sensor inputs to construct a unified three-dimensional understanding of the environment. This BEV-based system extracts and translates critical environmental features into high-level abstract states for DRL, facilitating more informed control. Extensive experimental evaluations demonstrate that our approach not only enhances interpretability but also significantly outperforms state-of-the-art methods in autonomous driving control tasks, reducing the collision rate by 20%.
Context-aware and Style-related Incremental Decoding framework for Discourse-Level Literary Translation
Yuanchang Luo, Jiaxin Guo, Daimeng Wei, Hengchao Shang, Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Shaojun Li, Jinlong Yang, Hao Yang
Sep 26 2024 cs.AI arXiv:2409.16539v2

@misc{2409.16539, author = {Yuanchang Luo and Jiaxin Guo and Daimeng Wei and Hengchao Shang and Zongyao Li and Zhanglin Wu and Zhiqiang Rao and Shaojun Li and Jinlong Yang and Hao Yang}, title = {{C}ontext-aware and {S}tyle-related {I}ncremental {D}ecoding framework for {D}iscourse-{L}evel {L}iterary {T}ranslation}, year = {2024}, eprint = {2409.16539}, note = {arXiv:2409.16539v2} }
This report outlines our approach for the WMT24 Discourse-Level Literary Translation Task, focusing on the Chinese-English language pair in the Constrained Track. Translating literary texts poses significant challenges due to the nuanced meanings, idiomatic expressions, and intricate narrative structures inherent in such works. To address these challenges, we leveraged the Chinese-Llama2 model, specifically enhanced for this task through a combination of Continual Pre-training (CPT) and Supervised Fine-Tuning (SFT). Our methodology includes a novel Incremental Decoding framework, which ensures that each sentence is translated with consideration of its broader context, maintaining coherence and consistency throughout the text. This approach allows the model to capture long-range dependencies and stylistic elements, producing translations that faithfully preserve the original literary quality. Our experiments demonstrate significant improvements in both sentence-level and document-level BLEU scores, underscoring the effectiveness of our proposed framework in addressing the complexities of document-level literary translation.
Ascend HiFloat8 Format for Deep Learning
Yuanyong Luo, Zhongxing Zhang, Richard Wu, Hu Liu, Ying Jin, Kai Zheng, Minmin Wang, Zhanying He, Guipeng Hu, Luyao Chen, Tianchi Hu, Junsong Wang, Minqi Chen, Mikhaylov Dmitry, Korviakov Vladimir, Bobrin Maxim, Yuhao Hu, Guanfu Chen, Zeyi Huang
Sep 26 2024 cs.LG cs.AI cs.AR arXiv:2409.16626v2

@misc{2409.16626, author = {Yuanyong Luo and Zhongxing Zhang and Richard Wu and Hu Liu and Ying Jin and Kai Zheng and Minmin Wang and Zhanying He and Guipeng Hu and Luyao Chen and Tianchi Hu and Junsong Wang and Minqi Chen and Mikhaylov Dmitry and Korviakov Vladimir and Bobrin Maxim and Yuhao Hu and Guanfu Chen and Zeyi Huang}, title = {{A}scend {H}i{F}loat8 {F}ormat for {D}eep {L}earning}, year = {2024}, eprint = {2409.16626}, note = {arXiv:2409.16626v2} }
This preliminary white paper proposes a novel 8-bit floating-point data format HiFloat8 (abbreviated as HiF8) for deep learning. HiF8 features tapered precision. For normal value encoding, it provides 7 exponent values with 3-bit mantissa, 8 exponent values with 2-bit mantissa, and 16 exponent values with 1-bit mantissa. For denormal value encoding, it extends the dynamic range by 7 extra powers of 2, from 31 to 38 binades (notice that FP16 covers 40 binades). Meanwhile, HiF8 encodes all the special values except that positive zero and negative zero are represented by only one bit-pattern. Thanks to the better balance between precision and dynamic range, HiF8 can be simultaneously used in both forward and backward passes of AI training. In this paper, we will describe the definition and rounding methods of HiF8, as well as the tentative training and inference solutions. To demonstrate the efficacy of HiF8, massive simulation results on various neural networks, including traditional neural networks and large language models (LLMs), will also be presented.
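The binade counts quoted in the abstract can be checked with a few lines of arithmetic. The sketch below only re-adds the figures stated above; the variable names are illustrative, and it says nothing about the actual HiF8 bit layout.

```python
# Exponent-value counts per mantissa width, as stated in the abstract:
# mantissa bits -> number of exponent values (binades) at that precision.
normal_binades = {3: 7, 2: 8, 1: 16}

# Extra powers of 2 contributed by the denormal encoding (from the abstract).
denormal_extra = 7

# Dynamic range of FP16 in binades, as quoted in the abstract.
fp16_binades = 40

normal_range = sum(normal_binades.values())   # 7 + 8 + 16
total_range = normal_range + denormal_extra   # normal range plus denormals

print(normal_range, total_range, fp16_binades - total_range)
# prints: 31 38 2
```

So the three precision tiers account for the 31 normal binades, and the denormal extension leaves HiF8 two binades short of FP16.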
Exploring the traditional NMT model and Large Language Model for chat translation
Jinlong Yang, Hengchao Shang, Daimeng Wei, Jiaxin Guo, Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Shaojun Li, Yuhao Xie, Yuanchang Luo, Jiawei Zheng, Bin Wei, Hao Yang
Sep 26 2024 cs.CL cs.AI arXiv:2409.16331v1

@misc{2409.16331, author = {Jinlong Yang and Hengchao Shang and Daimeng Wei and Jiaxin Guo and Zongyao Li and Zhanglin Wu and Zhiqiang Rao and Shaojun Li and Yuhao Xie and Yuanchang Luo and Jiawei Zheng and Bin Wei and Hao Yang}, title = {{E}xploring the traditional {NMT} model and {L}arge {L}anguage {M}odel for chat translation}, year = {2024}, eprint = {2409.16331}, note = {arXiv:2409.16331v1} }
This paper describes the submissions of Huawei Translation Services Center (HW-TSC) to the WMT24 chat translation shared task on English$\leftrightarrow$German (en-de) in both directions. The experiments involved fine-tuning models using chat data and exploring various strategies, including Minimum Bayes Risk (MBR) decoding and self-training. The results show significant performance improvements in certain directions, with the MBR self-training method achieving the best results. The paper also discusses the challenges and potential avenues for further research in the field of chat translation.
Expert-level vision-language foundation model for real-world radiology and comprehensive evaluation
Xiaohong Liu, Guoxing Yang, Yulin Luo, Jiaji Mao, Xiang Zhang, Ming Gao, Shanghang Zhang, Jun Shen, Guangyu Wang
Sep 25 2024 cs.CV arXiv:2409.16183v1

@misc{2409.16183, author = {Xiaohong Liu and Guoxing Yang and Yulin Luo and Jiaji Mao and Xiang Zhang and Ming Gao and Shanghang Zhang and Jun Shen and Guangyu Wang}, title = {{E}xpert-level vision-language foundation model for real-world radiology and comprehensive evaluation}, year = {2024}, eprint = {2409.16183}, note = {arXiv:2409.16183v1} }
Radiology is a vital and complex component of modern clinical workflow and covers many tasks. Recently, vision-language (VL) foundation models in medicine have shown potential in processing multimodal information, offering a unified solution for various radiology tasks. However, existing studies either pre-trained VL models on natural data or did not fully integrate vision-language architecture and pretraining, often neglecting the unique multimodal complexity in radiology images and their textual contexts. Additionally, their practical applicability in real-world scenarios remains underexplored. Here, we present RadFound, a large and open-source vision-language foundation model tailored for radiology, that is trained on the most extensive dataset of over 8.1 million images and 250,000 image-text pairs, covering 19 major organ systems and 10 imaging modalities. To establish expert-level multimodal perception and generation capabilities, RadFound introduces an enhanced vision encoder to capture intra-image local features and inter-image contextual information, and a unified cross-modal learning design tailored to radiology. To fully assess the models' capability, we construct a benchmark, RadVLBench, including radiology interpretation tasks like medical vision-language question-answering, as well as text generation tasks ranging from captioning to report generation. We also propose a human evaluation framework. When evaluated on the real-world benchmark involving three representative modalities, 2D images (chest X-rays), multi-view images (mammograms), and 3D images (thyroid CT scans), RadFound significantly outperforms other VL foundation models on both quantitative metrics and human evaluation. In summary, the development of RadFound represents an advancement in radiology generalists, demonstrating broad applicability potential for integration into clinical workflows.
Multilingual Transfer and Domain Adaptation for Low-Resource Languages of Spain
Yuanchang Luo, Zhanglin Wu, Daimeng Wei, Hengchao Shang, Zongyao Li, Jiaxin Guo, Zhiqiang Rao, Shaojun Li, Jinlong Yang, Yuhao Xie, Jiawei Zheng, Bin Wei, Hao Yang
Sep 25 2024 cs.CL cs.AI arXiv:2409.15924v2

@misc{2409.15924, author = {Yuanchang Luo and Zhanglin Wu and Daimeng Wei and Hengchao Shang and Zongyao Li and Jiaxin Guo and Zhiqiang Rao and Shaojun Li and Jinlong Yang and Yuhao Xie and Jiawei Zheng and Bin Wei and Hao Yang}, title = {{M}ultilingual {T}ransfer and {D}omain {A}daptation for {L}ow-{R}esource {L}anguages of {S}pain}, year = {2024}, eprint = {2409.15924}, note = {arXiv:2409.15924v2} }
This article describes the submission of Huawei Translation Service Center (HW-TSC) to the Translation into Low-Resource Languages of Spain task at WMT 2024. We participated in three translation tasks: Spanish to Aragonese (es-arg), Spanish to Aranese (es-arn), and Spanish to Asturian (es-ast). For these three translation tasks, we use training strategies such as multilingual transfer, regularized dropout, forward translation and back translation, LaBSE denoising, and transductive ensemble learning to train neural machine translation (NMT) models based on the deep Transformer-big architecture. By using these enhancement strategies, our submission achieved a competitive result in the final evaluation.
Revisiting the Solution of Meta KDD Cup 2024: CRAG
Jie Ouyang, Yucong Luo, Mingyue Cheng, Daoyu Wang, Shuo Yu, Qi Liu, Enhong Chen
Sep 25 2024 cs.IR cs.AI cs.CL arXiv:2409.15337v1

@misc{2409.15337, author = {Jie Ouyang and Yucong Luo and Mingyue Cheng and Daoyu Wang and Shuo Yu and Qi Liu and Enhong Chen}, title = {{R}evisiting the {S}olution of {M}eta {KDD} {C}up 2024: {CRAG}}, year = {2024}, eprint = {2409.15337}, note = {arXiv:2409.15337v1} }
This paper presents the solution of our team APEX in the Meta KDD CUP 2024: CRAG Comprehensive RAG Benchmark Challenge. The CRAG benchmark addresses the limitations of existing QA benchmarks in evaluating the diverse and dynamic challenges faced by Retrieval-Augmented Generation (RAG) systems. It provides a more comprehensive assessment of RAG performance and contributes to advancing research in this field. We propose a routing-based domain and dynamic adaptive RAG pipeline, which performs specific processing for the diverse and dynamic nature of the question in all three stages: retrieval, augmentation, and generation. Our method achieved superior performance on CRAG and ranked 2nd for Task 2&3 on the final competition leaderboard. Our implementation is available at this link: https://github.com/USTCAGI/CRAG-in-KDD-Cup2024.
Optimizing News Text Classification with Bi-LSTM and Attention Mechanism for Efficient Data Processing
Bingyao Liu, Jiajing Chen, Rui Wang, Junming Huang, Yuanshuai Luo, Jianjun Wei
Sep 25 2024 cs.CL cs.IR arXiv:2409.15576v1

@misc{2409.15576, author = {Bingyao Liu and Jiajing Chen and Rui Wang and Junming Huang and Yuanshuai Luo and Jianjun Wei}, title = {{O}ptimizing {N}ews {T}ext {C}lassification with {B}i-{LSTM} and {A}ttention {M}echanism for {E}fficient {D}ata {P}rocessing}, year = {2024}, eprint = {2409.15576}, note = {arXiv:2409.15576v1} }
The development of Internet technology has led to a rapid increase in news information. Filtering out valuable content from complex information has become an urgent problem that needs to be solved. In view of the shortcomings of traditional manual classification methods that are time-consuming and inefficient, this paper proposes an automatic classification scheme for news texts based on deep learning. This solution achieves efficient classification and management of news texts by introducing advanced machine learning algorithms, especially an optimization model that combines Bi-directional Long Short-Term Memory Network (Bi-LSTM) and Attention Mechanism. Experimental results show that this solution can not only significantly improve the accuracy and timeliness of classification, but also significantly reduce the need for manual intervention. It has important practical significance for improving the information processing capabilities of the news industry and accelerating the speed of information flow. Through comparative analysis of multiple common models, the effectiveness and advancement of the proposed method are proved, laying a solid foundation for future news text classification research.
Machine Translation Advancements of Low-Resource Indian Languages by Transfer Learning
Bin Wei, Jiawei Zhen, Zongyao Li, Zhanglin Wu, Daimeng Wei, Jiaxin Guo, Zhiqiang Rao, Shaojun Li, Yuanchang Luo, Hengchao Shang, Jinlong Yang, Yuhao Xie, Hao Yang
Sep 25 2024 cs.CL cs.AI arXiv:2409.15879v1

@misc{2409.15879, author = {Bin Wei and Jiawei Zhen and Zongyao Li and Zhanglin Wu and Daimeng Wei and Jiaxin Guo and Zhiqiang Rao and Shaojun Li and Yuanchang Luo and Hengchao Shang and Jinlong Yang and Yuhao Xie and Hao Yang}, title = {{M}achine {T}ranslation {A}dvancements of {L}ow-{R}esource {I}ndian {L}anguages by {T}ransfer {L}earning}, year = {2024}, eprint = {2409.15879}, note = {arXiv:2409.15879v1} }
This paper introduces the submission by Huawei Translation Center (HW-TSC) to the WMT24 Indian Languages Machine Translation (MT) Shared Task. To develop a reliable machine translation system for low-resource Indian languages, we employed two distinct knowledge transfer strategies, taking into account the characteristics of the language scripts and the support available from existing open-source models for Indian languages. For Assamese (as) and Manipuri (mn), we fine-tuned the existing IndicTrans2 open-source model to enable bidirectional translation between English and these languages. For Khasi (kh) and Mizo (mz), we trained a multilingual model as a baseline using bilingual data from these four language pairs, along with about 8kw additional English-Bengali bilingual data, all of which share certain linguistic features. This was followed by fine-tuning to achieve bidirectional translation between English and Khasi, as well as English and Mizo. Our transfer learning experiments produced impressive results: 23.5 BLEU for en-as, 31.8 BLEU for en-mn, 36.2 BLEU for as-en, and 47.9 BLEU for mn-en on their respective test sets. Similarly, the multilingual model transfer learning experiments yielded impressive outcomes, achieving 19.7 BLEU for en-kh, 32.8 BLEU for en-mz, 16.1 BLEU for kh-en, and 33.9 BLEU for mz-en on their respective test sets. These results not only highlight the effectiveness of transfer learning techniques for low-resource languages but also contribute to advancing machine translation capabilities for low-resource Indian languages.
HW-TSC's Submission to the CCMT 2024 Machine Translation Tasks
Zhanglin Wu, Yuanchang Luo, Daimeng Wei, Jiawei Zheng, Bin Wei, Zongyao Li, Hengchao Shang, Jiaxin Guo, Shaojun Li, Weidong Zhang, Ning Xie, Hao Yang
Sep 24 2024 cs.AI cs.CL arXiv:2409.14842v3

@misc{2409.14842, author = {Zhanglin Wu and Yuanchang Luo and Daimeng Wei and Jiawei Zheng and Bin Wei and Zongyao Li and Hengchao Shang and Jiaxin Guo and Shaojun Li and Weidong Zhang and Ning Xie and Hao Yang}, title = {{HW}-{TSC}'s {S}ubmission to the {CCMT} 2024 {M}achine {T}ranslation {T}asks}, year = {2024}, eprint = {2409.14842}, note = {arXiv:2409.14842v3} }
This paper presents the submission of Huawei Translation Services Center (HW-TSC) to machine translation tasks of the 20th China Conference on Machine Translation (CCMT 2024). We participate in the bilingual machine translation task and multi-domain machine translation task. For these two translation tasks, we use training strategies such as regularized dropout, bidirectional training, data diversification, forward translation, back translation, alternated training, curriculum learning, and transductive ensemble learning to train neural machine translation (NMT) models based on the deep Transformer-big architecture. Furthermore, to explore whether large language models (LLMs) can help improve the translation quality of NMT systems, we use supervised fine-tuning to train llama2-13b as an automatic post-editing (APE) model to improve the translation results of the NMT model on the multi-domain machine translation task. By using these strategies, our submission achieves a competitive result in the final evaluation.
Choose the Final Translation from NMT and LLM hypotheses Using MBR Decoding: HW-TSC's Submission to the WMT24 General MT Shared Task
Zhanglin Wu, Daimeng Wei, Zongyao Li, Hengchao Shang, Jiaxin Guo, Shaojun Li, Zhiqiang Rao, Yuanchang Luo, Ning Xie, Hao Yang
Sep 24 2024 cs.AI arXiv:2409.14800v1

@misc{2409.14800, author = {Zhanglin Wu and Daimeng Wei and Zongyao Li and Hengchao Shang and Jiaxin Guo and Shaojun Li and Zhiqiang Rao and Yuanchang Luo and Ning Xie and Hao Yang}, title = {{C}hoose the {F}inal {T}ranslation from {NMT} and {LLM} hypotheses {U}sing {MBR} {D}ecoding: {HW}-{TSC}'s {S}ubmission to the {WMT}24 {G}eneral {MT} {S}hared {T}ask}, year = {2024}, eprint = {2409.14800}, note = {arXiv:2409.14800v1} }
This paper presents the submission of Huawei Translation Services Center (HW-TSC) to the WMT24 general machine translation (MT) shared task, where we participate in the English to Chinese (en2zh) language pair. Similar to previous years' work, we use training strategies such as regularized dropout, bidirectional training, data diversification, forward translation, back translation, alternated training, curriculum learning, and transductive ensemble learning to train the neural machine translation (NMT) model based on the deep Transformer-big architecture. The difference is that we also use continual pre-training, supervised fine-tuning, and contrastive preference optimization to train the large language model (LLM) based MT model. By using Minimum Bayes Risk (MBR) decoding to select the final translation from multiple hypotheses for NMT and LLM-based MT models, our submission receives competitive results in the final evaluation.
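Selecting one translation from a pool of hypotheses with MBR decoding can be sketched as follows. This is a minimal illustration, not the submission's pipeline: the token-overlap F1 utility is an assumed stand-in for whatever metric the authors used, and the names `overlap_f1` and `mbr_select` are hypothetical.

```python
from collections import Counter


def overlap_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1 between two strings (assumed stand-in utility)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    common = sum((h & r).values())  # multiset intersection of tokens
    if common == 0:
        return 0.0
    precision = common / sum(h.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(hypotheses: list[str]) -> str:
    """Return the hypothesis with the highest average utility
    when every other hypothesis is treated as a pseudo-reference."""
    if len(hypotheses) < 2:
        return hypotheses[0]

    def expected_utility(i: int) -> float:
        others = [h for j, h in enumerate(hypotheses) if j != i]
        return sum(overlap_f1(hypotheses[i], o) for o in others) / len(others)

    best = max(range(len(hypotheses)), key=expected_utility)
    return hypotheses[best]


# Candidates could come from both an NMT system and an LLM-based system.
candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog stood there",
]
print(mbr_select(candidates))
# prints: the cat sat on a mat
```

The selected candidate is the one that agrees most with the rest of the pool, which is why MBR tends to favour consensus translations over outliers.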
Graph Neural Network Framework for Sentiment Analysis Using Syntactic Feature
Linxiao Wu, Yuanshuai Luo, Binrong Zhu, Guiran Liu, Rui Wang, Qian Yu
Sep 24 2024 cs.CL cs.AI arXiv:2409.14000v1

@misc{2409.14000, author = {Linxiao Wu and Yuanshuai Luo and Binrong Zhu and Guiran Liu and Rui Wang and Qian Yu}, title = {{G}raph {N}eural {N}etwork {F}ramework for {S}entiment {A}nalysis {U}sing {S}yntactic {F}eature}, year = {2024}, eprint = {2409.14000}, note = {arXiv:2409.14000v1} }
Amidst the swift evolution of social media platforms and e-commerce ecosystems, the domain of opinion mining has surged as a pivotal area of exploration within natural language processing. A specialized segment within this field focuses on extracting nuanced evaluations tied to particular elements within textual contexts. This research advances a composite framework that amalgamates the positional cues of topical descriptors. The proposed system converts syntactic structures into a matrix format, leveraging convolutions and attention mechanisms within a graph to distill salient characteristics. Incorporating the positional relevance of descriptors relative to lexical items enhances the sequential integrity of the input. Trials have substantiated that this integrated graph-centric scheme markedly elevates the efficacy of evaluative categorization, showcasing preeminence.
Revisiting Physical-World Adversarial Attack on Traffic Sign Recognition: A Commercial Systems Perspective
Ningfei Wang, Shaoyuan Xie, Takami Sato, Yunpeng Luo, Kaidi Xu, Qi Alfred Chen
Sep 17 2024 cs.CR cs.CV arXiv:2409.09860v1

@misc{2409.09860, author = {Ningfei Wang and Shaoyuan Xie and Takami Sato and Yunpeng Luo and Kaidi Xu and Qi Alfred Chen}, title = {{R}evisiting {P}hysical-{W}orld {A}dversarial {A}ttack on {T}raffic {S}ign {R}ecognition: {A} {C}ommercial {S}ystems {P}erspective}, year = {2024}, eprint = {2409.09860}, doi = {10.14722/ndss.2025.23090}, note = {arXiv:2409.09860v1} }
Traffic Sign Recognition (TSR) is crucial for safe and correct driving automation. Recent works revealed a general vulnerability of TSR models to physical-world adversarial attacks, which can be low-cost, highly deployable, and capable of causing severe attack effects such as hiding a critical traffic sign or spoofing a fake one. However, so far existing works generally only considered evaluating the attack effects on academic TSR models, leaving the impacts of such attacks on real-world commercial TSR systems largely unclear. In this paper, we conduct the first large-scale measurement of physical-world adversarial attacks against commercial TSR systems. Our testing results reveal that it is possible for existing attack works from academia to have highly reliable (100%) attack success against certain commercial TSR system functionality, but such attack capabilities are not generalizable, leading to much lower-than-expected attack success rates overall. We find that one potential major factor is a spatial memorization design that commonly exists in today's commercial TSR systems. We design new attack success metrics that can mathematically model the impacts of such design on the TSR system-level attack success, and use them to revisit existing attacks. Through these efforts, we uncover 7 novel observations, some of which directly challenge the observations or claims in prior works due to the introduction of the new metrics.
CF-PRNet: Coarse-to-Fine Prototype Refining Network for Point Cloud Completion and Reconstruction
Zhi Chen, Tianqi Wei, Zecheng Zhao, Jia Syuen Lim, Yadan Luo, Hu Zhang, Xin Yu, Scott Chapman, Zi Huang
Sep 16 2024 cs.CV arXiv:2409.08443v1

@misc{2409.08443, author = {Zhi Chen and Tianqi Wei and Zecheng Zhao and Jia Syuen Lim and Yadan Luo and Hu Zhang and Xin Yu and Scott Chapman and Zi Huang}, title = {{CF}-{PRN}et: {C}oarse-to-{F}ine {P}rototype {R}efining {N}etwork for {P}oint {C}loud {C}ompletion and {R}econstruction}, year = {2024}, eprint = {2409.08443}, note = {arXiv:2409.08443v1} }
In modern agriculture, precise monitoring of plants and fruits is crucial for tasks such as high-throughput phenotyping and automated harvesting. This paper addresses the challenge of reconstructing accurate 3D shapes of fruits from partial views, which is common in agricultural settings. We introduce CF-PRNet, a coarse-to-fine prototype refining network that leverages high-resolution 3D data during the training phase but requires only a single RGB-D image for real-time inference. Our approach begins by extracting features from the incomplete point cloud data, constructed from a partial view of a fruit, with a series of convolutional blocks. The extracted features inform the generation of scaling vectors that refine two sequentially constructed 3D mesh prototypes - one coarse and one fine-grained. This progressive refinement facilitates the detailed completion of the final point clouds, achieving detailed and accurate reconstructions. CF-PRNet demonstrates excellent performance metrics with a Chamfer Distance of 3.78, an F1 Score of 66.76%, a Precision of 56.56%, and a Recall of 85.31%, and won first place in the Shape Completion and Reconstruction of Sweet Peppers Challenge.
Apollo: Band-sequence Modeling for High-Quality Audio Restoration
Kai Li, Yi Luo
Sep 16 2024 cs.SD cs.AI eess.AS arXiv:2409.08514v1

@misc{2409.08514, author = {Kai Li and Yi Luo}, title = {{A}pollo: {B}and-sequence {M}odeling for {H}igh-{Q}uality {A}udio {R}estoration}, year = {2024}, eprint = {2409.08514}, note = {arXiv:2409.08514v1} }
Audio restoration has become increasingly significant in modern society, not only due to the demand for high-quality auditory experiences enabled by advanced playback devices, but also because the growing capabilities of generative audio models necessitate high-fidelity audio. Typically, audio restoration is defined as a task of predicting undistorted audio from damaged input, often trained using a GAN framework to balance perception and distortion. Since audio degradation is primarily concentrated in mid- and high-frequency ranges, especially due to codecs, a key challenge lies in designing a generator capable of preserving low-frequency information while accurately reconstructing high-quality mid- and high-frequency content. Inspired by recent advancements in high-sample-rate music separation, speech enhancement, and audio codec models, we propose Apollo, a generative model designed for high-sample-rate audio restoration. Apollo employs an explicit frequency band split module to model the relationships between different frequency bands, allowing for more coherent and higher-quality restored audio. Evaluated on the MUSDB18-HQ and MoisesDB datasets, Apollo consistently outperforms existing SR-GAN models across various bit rates and music genres, particularly excelling in complex scenarios involving mixtures of multiple instruments and vocals. Apollo significantly improves music restoration quality while maintaining computational efficiency. The source code for Apollo is publicly available at https://github.com/JusperLee/Apollo.
DICS: Find Domain-Invariant and Class-Specific Features for Out-of-Distribution Generalization
Qiaowei Miao, Yawei Luo, Yi Yang
Sep 16 2024 cs.CV arXiv:2409.08557v1

@misc{2409.08557, author = {Qiaowei Miao and Yawei Luo and Yi Yang}, title = {{DICS}: {F}ind {D}omain-{I}nvariant and {C}lass-{S}pecific {F}eatures for {O}ut-of-{D}istribution {G}eneralization}, year = {2024}, eprint = {2409.08557}, note = {arXiv:2409.08557v1} }
While deep neural networks have made remarkable progress in various vision tasks, their performance typically deteriorates when tested in out-of-distribution (OOD) scenarios. Many OOD methods focus on extracting domain-invariant features but neglect whether these features are unique to each class. Even if some features are domain-invariant, they cannot serve as key classification criteria if shared across different classes. In OOD tasks, both domain-related and class-shared features act as confounders that hinder generalization. In this paper, we propose a DICS model to extract Domain-Invariant and Class-Specific features, including Domain Invariance Testing (DIT) and Class Specificity Testing (CST), which mitigate the effects of spurious correlations introduced by confounders. DIT learns domain-related features of each source domain and removes them from inputs to isolate domain-invariant class-related features. DIT ensures domain invariance by aligning same-class features across different domains. Then, CST calculates soft labels for those features by comparing them with features learned in previous steps. We optimize the cross-entropy between the soft labels and their true labels, which enhances same-class similarity and different-class distinctiveness, thereby reinforcing class specificity. Extensive experiments on widely-used benchmarks demonstrate the effectiveness of our proposed algorithm. Additional visualizations further demonstrate that DICS effectively identifies the key features of each class in target domains.
OneEdit: A Neural-Symbolic Collaboratively Knowledge Editing System
Ningyu Zhang, Zekun Xi, Yujie Luo, Peng Wang, Bozhong Tian, Yunzhi Yao, Jintian Zhang, Shumin Deng, Mengshu Sun, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, Huajun Chen
Sep 13 2024 cs.AI cs.CL cs.DB cs.IR cs.LG arXiv:2409.07497v1

@misc{2409.07497, author = {Ningyu Zhang and Zekun Xi and Yujie Luo and Peng Wang and Bozhong Tian and Yunzhi Yao and Jintian Zhang and Shumin Deng and Mengshu Sun and Lei Liang and Zhiqiang Zhang and Xiaowei Zhu and Jun Zhou and Huajun Chen}, title = {{O}ne{E}dit: {A} {N}eural-{S}ymbolic {C}ollaboratively {K}nowledge {E}diting {S}ystem}, year = {2024}, eprint = {2409.07497}, note = {arXiv:2409.07497v1} }
Knowledge representation has been a central aim of AI since its inception. Symbolic Knowledge Graphs (KGs) and neural Large Language Models (LLMs) can both represent knowledge. KGs provide highly accurate and explicit knowledge representation but face scalability issues, while LLMs offer expansive coverage of knowledge but incur significant training costs and struggle with precise and reliable knowledge manipulation. To this end, we introduce OneEdit, a neural-symbolic prototype system for collaborative knowledge editing using natural language, which facilitates easy-to-use knowledge management with KG and LLM. OneEdit consists of three modules: 1) The Interpreter serves for user interaction with natural language; 2) The Controller manages editing requests from various users, leveraging the KG with rollbacks to handle knowledge conflicts and prevent toxic knowledge attacks; 3) The Editor utilizes the knowledge from the Controller to edit KG and LLM. We conduct experiments on two new datasets with KGs which demonstrate that OneEdit can achieve superior performance.
DV-FSR: A Dual-View Target Attack Framework for Federated Sequential Recommendation
Qitao Qin, Yucong Luo, Mingyue Cheng, Qingyang Mao, Chenyi Lei
Sep 13 2024 cs.CR cs.IR arXiv:2409.07500v1

@misc{2409.07500, author = {Qitao Qin and Yucong Luo and Mingyue Cheng and Qingyang Mao and Chenyi Lei}, title = {{DV}-{FSR}: {A} {D}ual-{V}iew {T}arget {A}ttack {F}ramework for {F}ederated {S}equential {R}ecommendation}, year = {2024}, eprint = {2409.07500}, note = {arXiv:2409.07500v1} }
Federated recommendation (FedRec) preserves user privacy by enabling decentralized training of personalized models, but this architecture is inherently vulnerable to adversarial attacks. Significant research has been conducted on targeted attacks in FedRec systems, motivated by commercial and social influence considerations. However, much of this work has largely overlooked the differential robustness of recommendation models. Moreover, our empirical findings indicate that existing targeted attack methods achieve only limited effectiveness in Federated Sequential Recommendation (FSR) tasks. Driven by these observations, we focus on investigating targeted attacks in FSR and propose a novel dual-view attack framework, named DV-FSR. This attack method uniquely combines a sampling-based explicit strategy with a contrastive learning-based implicit gradient strategy to orchestrate a coordinated attack. Additionally, we introduce a specific defense mechanism tailored for targeted attacks in FSR, aiming to evaluate the mitigation effects of the attack method we proposed. Extensive experiments validate the effectiveness of our proposed approach on representative sequential models.
Equivariant Filter for Tightly Coupled LiDAR-Inertial Odometry
Anbo Tao, Yarong Luo, Chunxi Xia, Chi Guo, Xingxing Li
Sep 12 2024 cs.RO cs.SY eess.SY arXiv:2409.06948v1

@misc{2409.06948, author = {Anbo Tao and Yarong Luo and Chunxi Xia and Chi Guo and Xingxing Li}, title = {{E}quivariant {F}ilter for {T}ightly {C}oupled {L}i{DAR}-{I}nertial {O}dometry}, year = {2024}, eprint = {2409.06948}, note = {arXiv:2409.06948v1} }
Pose estimation is a crucial problem in simultaneous localization and mapping (SLAM). However, developing a robust and consistent state estimator remains a significant challenge, as the traditional extended Kalman filter (EKF) struggles to handle the model nonlinearity, especially for inertial measurement unit (IMU) and light detection and ranging (LiDAR). To provide a consistent and efficient solution for pose estimation, we propose Eq-LIO, a robust state estimator for tightly coupled LIO systems based on an equivariant filter (EqF). Compared with the invariant Kalman filter based on the $\mathrm{SE}_2(3)$ group structure, the EqF uses the symmetry of the semi-direct product group to couple the system state including IMU bias, navigation state and LiDAR extrinsic calibration state, thereby suppressing linearization error and improving the behavior of the estimator in the event of unexpected state changes. The proposed Eq-LIO has natural consistency and higher robustness, which is theoretically proven with mathematical derivation and experimentally verified through a series of tests on both public and private datasets.
FreeEnhance: Tuning-Free Image Enhancement via Content-Consistent Noising-and-Denoising Process
Yang Luo, Yiheng Zhang, Zhaofan Qiu, Ting Yao, Zhineng Chen, Yu-Gang Jiang, Tao Mei
Sep 12 2024 cs.CV cs.MM arXiv:2409.07451v1

@misc{2409.07451, author = {Yang Luo and Yiheng Zhang and Zhaofan Qiu and Ting Yao and Zhineng Chen and Yu-Gang Jiang and Tao Mei}, title = {{F}ree{E}nhance: {T}uning-{F}ree {I}mage {E}nhancement via {C}ontent-{C}onsistent {N}oising-and-{D}enoising {P}rocess}, year = {2024}, eprint = {2409.07451}, note = {arXiv:2409.07451v1} }
The emergence of text-to-image generation models has led to the recognition that image enhancement, performed as post-processing, would significantly improve the visual quality of the generated images. Using diffusion models to enhance the generated images is nevertheless not trivial, as it requires delicately enriching details while preserving the visual appearance of key content in the original image. In this paper, we propose a novel framework, namely FreeEnhance, for content-consistent image enhancement using the off-the-shelf image diffusion models. Technically, FreeEnhance is a two-stage process that first adds random noise to the input image and then capitalizes on a pre-trained image diffusion model (i.e., Latent Diffusion Models) to denoise and enhance the image details. In the noising stage, FreeEnhance is devised to add lighter noise to the region with higher frequency to preserve the high-frequency patterns (e.g., edge, corner) in the original image. In the denoising stage, we present three target properties as constraints to regularize the predicted noise, enhancing images with high acutance and high visual quality. Extensive experiments conducted on the HPDv2 dataset demonstrate that our FreeEnhance outperforms the state-of-the-art image enhancement models in terms of quantitative metrics and human preference. More remarkably, FreeEnhance also shows higher human preference compared to the commercial image enhancement solution of Magnific AI.
$\mathbb{USCD}$: Improving Code Generation of LLMs by Uncertainty-Aware Selective Contrastive Decoding
Shuai Wang, Liang Ding, Li Shen, Yong Luo, Zheng He, Wei Yu, Dacheng Tao
Sep 11 2024 cs.SE cs.AI arXiv:2409.05923v1

@misc{2409.05923, author = {Shuai Wang and Liang Ding and Li Shen and Yong Luo and Zheng He and Wei Yu and Dacheng Tao}, title = {$\mathbb{{USCD}}$: {I}mproving {C}ode {G}eneration of {LLM}s by {U}ncertainty-{A}ware {S}elective {C}ontrastive {D}ecoding}, year = {2024}, eprint = {2409.05923}, note = {arXiv:2409.05923v1} }
Large language models (LLMs) have shown remarkable capabilities in code generation. However, the effects of hallucinations (e.g., output noise) make it particularly challenging for LLMs to generate high-quality code in one pass. In this work, we propose a simple and effective uncertainty-aware selective contrastive decoding ($\mathbb{USCD}$) mechanism to improve the quality of one-pass code generation in LLMs and reduce the impact of output noise. To be specific, we first elaborately designed a negative prompt (namely lame prompt) to output noise by removing input-output examples from the standard few-shot prompt. Our preliminary study shows that the Jensen-Shannon divergence (JS divergence) between token distribution uncertainty and the output noise is relatively low (approximately $0.25$), indicating their high relevance. Then, we selectively eliminate output noise induced by lame prompts based on the uncertainty of the prediction distribution from the standard prompt. Notably, our proposed plug-and-play mechanism is an inference-only method, enjoying appealing flexibility. Extensive experiments on widely used benchmarks, e.g., HumanEval, MBPP, and MultiPL-E, upon several LLMs (i.e., Incoder-6b, CodeLlama-7b, WizardCoder-15b, StarCoder, and Llama2-7b), demonstrate that our proposed USCD significantly improves one-pass code generation, with an average pass@$1$ score increase of 16.59%. We will release code and data on GitHub.
Revolutionizing Database Q&A with Large Language Models: Comprehensive Benchmark and Evaluation
Yihang Zheng, Bo Li, Zhenghao Lin, Yi Luo, Xuanhe Zhou, Chen Lin, Jinsong Su, Guoliang Li, Shifu Li
Sep 10 2024 cs.DB cs.AI arXiv:2409.04475v1

@misc{2409.04475, author = {Yihang Zheng and Bo Li and Zhenghao Lin and Yi Luo and Xuanhe Zhou and Chen Lin and Jinsong Su and Guoliang Li and Shifu Li}, title = {{R}evolutionizing {D}atabase {Q}\&{A} with {L}arge {L}anguage {M}odels: {C}omprehensive {B}enchmark and {E}valuation}, year = {2024}, eprint = {2409.04475}, note = {arXiv:2409.04475v1} }
The development of Large Language Models (LLMs) has revolutionized Q&A across various industries, including the database domain. However, there is still a lack of a comprehensive benchmark to evaluate the capabilities of different LLMs and their modular components in database Q&A. To this end, we introduce DQA, the first comprehensive database Q&A benchmark. DQA features an innovative LLM-based method for automating the generation, cleaning, and rewriting of database Q&A, resulting in over 240,000 Q&A pairs in English and Chinese. These Q&A pairs cover nearly all aspects of database knowledge, including database manuals, database blogs, and database tools. This inclusion allows for additional assessment of LLMs' Retrieval-Augmented Generation (RAG) and Tool Invocation Generation (TIG) capabilities in the database Q&A task. Furthermore, we propose a comprehensive LLM-based database Q&A testbed on DQA. This testbed is highly modular and scalable, with both basic and advanced components like Question Classification Routing (QCR), RAG, TIG, and Prompt Template Engineering (PTE). Besides, DQA provides a complete evaluation pipeline, featuring diverse metrics and a standardized evaluation process to ensure comprehensiveness, accuracy, and fairness. We use DQA to evaluate the database Q&A capabilities under the proposed testbed comprehensively. The evaluation reveals findings like (i) the strengths and limitations of nine different LLM-based Q&A bots and (ii) the performance impact and potential improvements of various service components (e.g., QCR, RAG, TIG). We hope our benchmark and findings will better guide the future development of LLM-based database Q&A research.
GS-PT: Exploiting 3D Gaussian Splatting for Comprehensive Point Cloud Understanding via Self-supervised Learning
Keyi Liu, Yeqi Luo, Weidong Yang, Jingyi Xu, Zhijun Li, Wen-Ming Chen, Ben Fei
Sep 10 2024 cs.CV arXiv:2409.04963v1

@misc{2409.04963, author = {Keyi Liu and Yeqi Luo and Weidong Yang and Jingyi Xu and Zhijun Li and Wen-Ming Chen and Ben Fei}, title = {{GS}-{PT}: {E}xploiting 3{D} {G}aussian {S}platting for {C}omprehensive {P}oint {C}loud {U}nderstanding via {S}elf-supervised {L}earning}, year = {2024}, eprint = {2409.04963}, note = {arXiv:2409.04963v1} }
Self-supervised learning on point clouds aims to leverage unlabeled 3D data to learn meaningful representations without reliance on manual annotations. However, current approaches face challenges such as limited data diversity and inadequate augmentation for effective feature learning. To address these challenges, we propose GS-PT, which integrates 3D Gaussian Splatting (3DGS) into point cloud self-supervised learning for the first time. Our pipeline utilizes transformers as the backbone for self-supervised pre-training and introduces novel contrastive learning tasks through 3DGS. Specifically, the transformers aim to reconstruct the masked point cloud. 3DGS utilizes multi-view rendered images as input to generate enhanced point cloud distributions and novel view images, facilitating data augmentation and cross-modal contrastive learning. Additionally, we incorporate features from depth maps. By optimizing these tasks collectively, our method enriches the tri-modal self-supervised learning process, enabling the model to leverage the correlation across 3D point clouds and 2D images from various modalities. We freeze the encoder after pre-training and test the model's performance on multiple downstream tasks. Experimental results indicate that GS-PT outperforms the off-the-shelf self-supervised learning methods on various downstream tasks including 3D object classification, real-world classification, and few-shot learning and segmentation.
InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang
Sep 10 2024 cs.AR cs.CL arXiv:2409.04992v1

@misc{2409.04992, author = {Xiurui Pan and Endian Li and Qiao Li and Shengwen Liang and Yizhou Shan and Ke Zhou and Yingwei Luo and Xiaolin Wang and Jie Zhang}, title = {{I}nst{I}nfer: {I}n-{S}torage {A}ttention {O}ffloading for {C}ost-{E}ffective {L}ong-{C}ontext {LLM} {I}nference}, year = {2024}, eprint = {2409.04992}, note = {arXiv:2409.04992v1} }
The widespread adoption of Large Language Models (LLMs) marks a significant milestone in generative AI. Nevertheless, the increasing context length and batch size in offline LLM inference escalate the memory requirement of the key-value (KV) cache, which imposes a huge burden on the GPU VRAM, especially for resource-constrained scenarios (e.g., edge computing and personal devices). Several cost-effective solutions leverage host memory or SSDs to reduce storage costs for offline inference scenarios and improve the throughput. Nevertheless, they suffer from significant performance penalties imposed by intensive KV cache accesses due to limited PCIe bandwidth. To address these issues, we propose InstInfer, a novel LLM inference system that offloads the most performance-critical computation (i.e., attention in decoding phase) and data (i.e., KV cache) parts to Computational Storage Drives (CSDs), which minimize the enormous KV transfer overheads. InstInfer designs a dedicated flash-aware in-storage attention engine with KV cache management mechanisms to exploit the high internal bandwidths of CSDs instead of being limited by the PCIe bandwidth. The optimized P2P transmission between GPU and CSDs further reduces data migration overheads. Experimental results demonstrate that for a 13B model using an NVIDIA A6000 GPU, InstInfer improves throughput for long-sequence inference by up to 11.1$\times$, compared to existing SSD-based solutions such as FlexGen.
Joint Input and Output Coordination for Class-Incremental Learning
Shuai Wang, Yibing Zhan, Yong Luo, Han Hu, Wei Yu, Yonggang Wen, Dacheng Tao
Sep 10 2024 cs.LG cs.AI arXiv:2409.05620v1

@misc{2409.05620, author = {Shuai Wang and Yibing Zhan and Yong Luo and Han Hu and Wei Yu and Yonggang Wen and Dacheng Tao}, title = {{J}oint {I}nput and {O}utput {C}oordination for {C}lass-{I}ncremental {L}earning}, year = {2024}, eprint = {2409.05620}, note = {arXiv:2409.05620v1} }
Incremental learning is nontrivial due to severe catastrophic forgetting. Although storing a small amount of data on old tasks during incremental learning is a feasible solution, current strategies still do not 1) adequately address the class bias problem, 2) alleviate the mutual interference between new and old tasks, or 3) consider the problem of class bias within tasks. This motivates us to propose a joint input and output coordination (JIOC) mechanism to address these issues. This mechanism assigns different weights to different categories of data according to the gradient of the output score, and uses knowledge distillation (KD) to reduce the mutual interference between the outputs of old and new tasks. The proposed mechanism is general and flexible, and can be incorporated into different incremental learning approaches that use memory storage. Extensive experiments show that our mechanism can significantly improve their performance.
ELO-Rated Sequence Rewards: Advancing Reinforcement Learning Models
Qi Ju, Falin Hei, Zhemei Fang, Yunfeng Luo
Sep 06 2024 cs.LG arXiv:2409.03301v1

@misc{2409.03301, author = {Qi Ju and Falin Hei and Zhemei Fang and Yunfeng Luo}, title = {{ELO}-{R}ated {S}equence {R}ewards: {A}dvancing {R}einforcement {L}earning {M}odels}, year = {2024}, eprint = {2409.03301}, note = {arXiv:2409.03301v1} }
Reinforcement Learning (RL) is highly dependent on the meticulous design of the reward function. However, accurately assigning rewards to each state-action pair in Long-Term RL (LTRL) challenges is formidable. Consequently, RL agents are predominantly trained with expert guidance. Drawing on the principles of ordinal utility theory from economics, we propose a novel reward estimation algorithm: ELO-Rating based RL (ERRL). This approach is distinguished by two main features. Firstly, it leverages expert preferences over trajectories instead of cardinal rewards (utilities) to compute the ELO rating of each trajectory as its reward. Secondly, a new reward redistribution algorithm is introduced to mitigate training volatility in the absence of a fixed anchor reward. Our method demonstrates superior performance over several leading baselines in long-term scenarios (extending up to 5000 steps), where conventional RL algorithms falter. Furthermore, we conduct a thorough analysis of how expert preferences affect the outcomes.
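As a rough illustration of the first feature mentioned in this abstract, the sketch below computes ratings for trajectories from pairwise expert preferences using the standard Elo update. It is the textbook Elo rule, not the ERRL algorithm itself: the function names, the K-factor of 32, the initial rating of 1500, and the 400-point scale are all assumptions for illustration, and the paper's reward redistribution step is not modeled.

```python
def expected_score(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Standard Elo probability that the item rated r_a is preferred over r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def elo_ratings(n_items, preferences, k_factor=32.0, initial=1500.0, epochs=1):
    """Rate items from (winner, loser) index pairs with the standard Elo update.

    Each pair says the expert preferred trajectory `winner` over `loser`.
    """
    ratings = [initial] * n_items
    for _ in range(epochs):
        for winner, loser in preferences:
            e_w = expected_score(ratings[winner], ratings[loser])
            delta = k_factor * (1.0 - e_w)  # larger update for surprising outcomes
            ratings[winner] += delta
            ratings[loser] -= delta         # zero-sum: total rating is conserved
    return ratings


if __name__ == "__main__":
    # Hypothetical preferences over three trajectories: 0 > 1, 1 > 2, 0 > 2.
    print(elo_ratings(3, [(0, 1), (1, 2), (0, 2)]))
```

The resulting ratings depend only on the ordering information in the preferences, which is the ordinal-utility idea the abstract refers to.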
Beyond Nash Equilibrium: Achieving Bayesian Perfect Equilibrium with Belief Update Fictitious Play
Qi Ju, Zhemei Fang, Yunfeng Luo
Sep 05 2024 cs.GT arXiv:2409.02706v1

@misc{2409.02706, author = {Qi Ju and Zhemei Fang and Yunfeng Luo}, title = {{B}eyond {N}ash {E}quilibrium: {A}chieving {B}ayesian {P}erfect {E}quilibrium with {B}elief {U}pdate {F}ictitious {P}lay}, year = {2024}, eprint = {2409.02706}, note = {arXiv:2409.02706v1} }
In the domain of machine learning and game theory, the quest for Nash Equilibrium (NE) in extensive-form games with incomplete information is challenging yet crucial for enhancing AI's decision-making support under varied scenarios. Traditional Counterfactual Regret Minimization (CFR) techniques excel in navigating towards NE, focusing on scenarios where opponents deploy optimal strategies. However, the essence of machine learning in strategic game play extends beyond reacting to optimal moves; it encompasses aiding human decision-making in all circumstances. This includes not only crafting responses to optimal strategies but also recovering from suboptimal decisions and capitalizing on opponents' errors. Herein lies the significance of transitioning from NE to Bayesian Perfect Equilibrium (BPE), which accounts for every possible condition, including the irrationality of opponents. To bridge this gap, we propose Belief Update Fictitious Play (BUFP), which innovatively blends fictitious play with belief to target BPE, a more comprehensive equilibrium concept than NE. Specifically, through adjusting iteration stepsizes, BUFP allows for strategic convergence to both NE and BPE. For instance, in our experiments, BUFP(EF) leverages the stepsize of Extensive Form Fictitious Play (EFFP) to achieve BPE, outperforming traditional CFR by securing a 48.53% increase in benefits in scenarios characterized by dominated strategies.
SOAR: Simultaneous Exploration and Photographing with Heterogeneous UAVs for Fast Autonomous Reconstruction
Mingjie Zhang, Chen Feng, Zengzhi Li, Guiyong Zheng, Yiming Luo, Zhu Wang, Jinni Zhou, Shaojie Shen, Boyu Zhou
Sep 05 2024 cs.RO arXiv:2409.02738v1

@misc{2409.02738, author = {Mingjie Zhang and Chen Feng and Zengzhi Li and Guiyong Zheng and Yiming Luo and Zhu Wang and Jinni Zhou and Shaojie Shen and Boyu Zhou}, title = {{SOAR}: {S}imultaneous {E}xploration and {P}hotographing with {H}eterogeneous {UAV}s for {F}ast {A}utonomous {R}econstruction}, year = {2024}, eprint = {2409.02738}, note = {arXiv:2409.02738v1} }
Unmanned Aerial Vehicles (UAVs) have gained significant popularity in scene reconstruction. This paper presents SOAR, a LiDAR-Visual heterogeneous multi-UAV system specifically designed for fast autonomous reconstruction of complex environments. Our system comprises a LiDAR-equipped explorer with a large field-of-view (FoV), alongside photographers equipped with cameras. To ensure rapid acquisition of the scene's surface geometry, we employ a surface frontier-based exploration strategy for the explorer. As the surface is progressively explored, we identify the uncovered areas and generate viewpoints incrementally. These viewpoints are then assigned to photographers through solving a Consistent Multiple Depot Multiple Traveling Salesman Problem (Consistent-MDMTSP), which optimizes scanning efficiency while ensuring task consistency. Finally, photographers utilize the assigned viewpoints to determine optimal coverage paths for acquiring images. We present extensive benchmarks in a realistic simulator, which validate the performance of SOAR compared with classical and state-of-the-art methods. For more details, please see our project page at https://sysu-star.github.io/SOAR.
Explicit Differentiable Slicing and Global Deformation for Cardiac Mesh Reconstruction
Yihao Luo, Dario Sesia, Fanwen Wang, Yinzhe Wu, Wenhao Ding, Jiahao Huang, Fadong Shi, Anoop Shah, Amit Kaural, Jamil Mayet, Guang Yang, ChoonHwai Yap
Sep 04 2024 eess.IV cs.CV arXiv:2409.02070v2

@misc{2409.02070, author = {Yihao Luo and Dario Sesia and Fanwen Wang and Yinzhe Wu and Wenhao Ding and Jiahao Huang and Fadong Shi and Anoop Shah and Amit Kaural and Jamil Mayet and Guang Yang and ChoonHwai Yap}, title = {{E}xplicit {D}ifferentiable {S}licing and {G}lobal {D}eformation for {C}ardiac {M}esh {R}econstruction}, year = {2024}, eprint = {2409.02070}, note = {arXiv:2409.02070v2} }
Mesh reconstruction of the cardiac anatomy from medical images is useful for shape and motion measurements and biophysics simulations to facilitate the assessment of cardiac function and health. However, 3D medical images are often acquired as 2D slices that are sparsely sampled and noisy, and mesh reconstruction on such data is a challenging task. Traditional voxel-based approaches rely on pre- and post-processing that compromises image fidelity, while mesh-level deep learning approaches require mesh annotations that are difficult to get. Therefore, direct cross-domain supervision from 2D images to meshes is a key technique for advancing 3D learning in medical imaging, but it has not been well-developed. While there have been attempts to approximate the optimized meshes' slicing, few existing methods directly use 2D slices to supervise mesh reconstruction in a differentiable manner. Here, we propose a novel explicit differentiable voxelization and slicing (DVS) algorithm that allows gradient backpropagation to a mesh from its slices, facilitating refined mesh optimization directly supervised by the losses defined on 2D images. Further, we propose an innovative framework for extracting patient-specific left ventricle (LV) meshes from medical images by coupling DVS with a graph harmonic deformation (GHD) mesh morphing descriptor of cardiac shape that naturally preserves mesh quality and smoothness during optimization. Experimental results demonstrate that our method achieves state-of-the-art performance in cardiac mesh reconstruction tasks from CT and MRI, with an overall Dice score of 90% on multi-datasets, outperforming existing approaches. The proposed method can further quantify clinically useful parameters such as ejection fraction and global myocardial strains, closely matching the ground truth and surpassing the traditional voxel-based approach in sparse images.
RTop-K: Ultra-Fast Row-Wise Top-K Algorithm and GPU Implementation for Neural Networks
Xi Xie, Yuebo Luo, Hongwu Peng, Caiwen Ding
Sep 04 2024 cs.DC arXiv:2409.00822v2

@misc{2409.00822, author = {Xi Xie and Yuebo Luo and Hongwu Peng and Caiwen Ding}, title = {{RT}op-{K}: {U}ltra-{F}ast {R}ow-{W}ise {T}op-{K} {A}lgorithm and {GPU} {I}mplementation for {N}eural {N}etworks}, year = {2024}, eprint = {2409.00822}, note = {arXiv:2409.00822v2} }
Top-k algorithms are essential in various applications, from high-performance computing and information retrieval to big data and neural network model training. This paper introduces RTop-K, a highly efficient parallel row-wise top-k selection algorithm designed for GPUs. RTop-K employs a Binary Search-based approach to optimize resource allocation and provides a scalable solution that significantly accelerates top-k operations. We perform a theoretical analysis of the effects of early stopping in our algorithm, demonstrating that it maintains the accuracy of neural network models while enhancing performance. Comprehensive tests show that our GPU implementation of RTop-K outperforms other row-wise top-k GPU implementations, with minimal impact on testing accuracy when early stopping is applied. Notably, RTop-K achieves speed increases ranging from 4.245$\times$ to 9.506$\times$ with early stopping, and 3.936$\times$ without early stopping, compared to state-of-the-art implementations. The proposed methods offer significant improvements in the training and inference of Graph Neural Networks (GNNs), addressing critical challenges in latency and throughput on GPU platforms.
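The binary-search idea with early stopping mentioned in the abstract above can be illustrated with a minimal CPU sketch in plain Python. This is not the authors' GPU implementation: the function name `row_topk_binary_search`, the `max_iter` parameter, and the tie-handling rule are assumptions made for illustration only. The sketch searches for a threshold value such that about k elements of a row lie at or above it, and stops after a fixed number of iterations, in which case the result may be approximate.

```python
def row_topk_binary_search(row, k, max_iter=128):
    """Pick k large values from one row by binary searching a threshold.

    Hypothetical illustration, not the RTop-K kernel. With enough
    iterations the result is the exact top-k (as a multiset); with a
    small max_iter (early stopping) it may be approximate.
    """
    if not 0 < k <= len(row):
        raise ValueError("k must be between 1 and len(row)")

    lo, hi = min(row), max(row)
    # Invariant: at least k elements of the row are >= lo.
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        if mid == lo or mid == hi:
            break  # the float interval cannot be split any further
        count = sum(1 for x in row if x >= mid)
        if count == k:
            lo = mid  # exactly k elements pass the threshold
            break
        if count > k:
            lo = mid  # threshold too low, raise it
        else:
            hi = mid  # threshold too high, lower it

    # Take elements strictly above the threshold first, then fill with ties.
    above = [x for x in row if x > lo]
    equal = [x for x in row if x == lo]
    return (above + equal)[:k]


if __name__ == "__main__":
    matrix = [[3.0, 9.0, 1.0, 7.0, 5.0], [2.0, 2.0, 2.0, 1.0, 0.5]]
    for r in matrix:
        print(row_topk_binary_search(r, k=2))
```

Each row is processed independently, which is the property a row-wise GPU implementation would exploit by assigning rows to parallel workers. Lowering `max_iter` trades selection accuracy for fewer passes over the row.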
MV-Match: Multi-View Matching for Domain-Adaptive Identification of Plant Nutrient Deficiencies
Jinhui Yi, Yanan Luo, Marion Deichmann, Gabriel Schaaf, Juergen Gall
Sep 04 2024 cs.CV arXiv:2409.00903v1

@misc{2409.00903, author = {Jinhui Yi and Yanan Luo and Marion Deichmann and Gabriel Schaaf and Juergen Gall}, title = {{MV}-{M}atch: {M}ulti-{V}iew {M}atching for {D}omain-{A}daptive {I}dentification of {P}lant {N}utrient {D}eficiencies}, year = {2024}, eprint = {2409.00903}, note = {arXiv:2409.00903v1} }
An early, non-invasive, and on-site detection of nutrient deficiencies is critical to enable timely actions to prevent major losses of crops caused by lack of nutrients. While acquiring labeled data is very expensive, collecting images from multiple views of a crop is straightforward. Despite its relevance for practical applications, unsupervised domain adaptation where multiple views are available for the labeled source domain as well as the unlabeled target domain is an unexplored research area. In this work, we thus propose an approach that leverages multiple camera views in the source and target domain for unsupervised domain adaptation. We evaluate the proposed approach on two nutrient deficiency datasets. The proposed method achieves state-of-the-art results on both datasets compared to other unsupervised domain adaptation methods. The dataset and source code are available at https://github.com/jh-yi/MV-Match.
ViRED: Prediction of Visual Relations in Engineering Drawings
Chao Gu, Ke Lin, Yiyang Luo, Jiahui Hou, Xiang-Yang Li
Sep 04 2024 cs.CV cs.AI arXiv:2409.00909v1

@misc{2409.00909, author = {Chao Gu and Ke Lin and Yiyang Luo and Jiahui Hou and Xiang-Yang Li}, title = {{V}i{RED}: {P}rediction of {V}isual {R}elations in {E}ngineering {D}rawings}, year = {2024}, eprint = {2409.00909}, note = {arXiv:2409.00909v1} }
To accurately understand engineering drawings, it is essential to establish the correspondence between images and their description tables within the drawings. Existing document understanding methods predominantly focus on text as the main modality, which is not suitable for documents containing substantial image information. In the field of visual relation detection, the structure of the task inherently limits its capacity to assess relationships among all entity pairs in the drawings. To address this issue, we propose a vision-based relation detection model, named ViRED, to identify the associations between tables and circuits in electrical engineering drawings. Our model mainly consists of three parts: a vision encoder, an object encoder, and a relation decoder. We implement ViRED using PyTorch to evaluate its performance. To validate the efficacy of ViRED, we conduct a series of experiments. The experimental results indicate that, within the engineering drawing dataset, our approach attained an accuracy of 96% in the task of relation prediction, marking a substantial improvement over existing methodologies. The results also show that ViRED can perform inference quickly even when there are numerous objects in a single engineering drawing.
Can We Leave Deepfake Data Behind in Training Deepfake Detector?
Jikang Cheng, Zhiyuan Yan, Ying Zhang, Yuhao Luo, Zhongyuan Wang, Chen Li
Sep 02 2024 cs.CV arXiv:2408.17052v1

@misc{2408.17052, author = {Jikang Cheng and Zhiyuan Yan and Ying Zhang and Yuhao Luo and Zhongyuan Wang and Chen Li}, title = {{C}an {W}e {L}eave {D}eepfake {D}ata {B}ehind in {T}raining {D}eepfake {D}etector?}, year = {2024}, eprint = {2408.17052}, note = {arXiv:2408.17052v1} }
The generalization ability of deepfake detectors is vital for their applications in real-world scenarios. One effective solution to enhance this ability is to train the models with manually-blended data, which we term "blendfake", encouraging models to learn generic forgery artifacts such as blending boundaries. Interestingly, current SoTA methods utilize blendfake without incorporating any deepfake data in their training process. This is likely because previous empirical observations suggest that vanilla hybrid training (VHT), which combines deepfake and blendfake data, results in inferior performance to methods using only blendfake data (so-called "1+1<2"). Therefore, a critical question arises: Can we leave deepfake behind and rely solely on blendfake data to train an effective deepfake detector? Intuitively, as deepfakes also contain additional informative forgery clues (e.g., deep generative artifacts), excluding all deepfake data in training deepfake detectors seems counter-intuitive. In this paper, we rethink the role of blendfake in detecting deepfakes and formulate the process from "real to blendfake to deepfake" to be a progressive transition. Specifically, blendfake and deepfake can be explicitly delineated as the oriented pivot anchors between "real-to-fake" transitions. The accumulation of forgery information should be oriented and progressively increasing during this transition process. To this end, we propose an Oriented Progressive Regularizor (OPR) to establish the constraints that compel the distribution of anchors to be discretely arranged. Furthermore, we introduce feature bridging to facilitate the smooth transition between adjacent anchors. Extensive experiments confirm that our design allows leveraging forgery information from both blendfake and deepfake effectively and comprehensively.
MQRLD: A Multimodal Data Retrieval Platform with Query-aware Feature Representation and Learned Index Based on Data Lake
Ming Sheng, Shuliang Wang, Yong Zhang, Kaige Wang, Jingyi Wang, Yi Luo, Rui Hao
Aug 30 2024 cs.DB arXiv:2408.16237v1

@misc{2408.16237, author = {Ming Sheng and Shuliang Wang and Yong Zhang and Kaige Wang and Jingyi Wang and Yi Luo and Rui Hao}, title = {{MQRLD}: {A} {M}ultimodal {D}ata {R}etrieval {P}latform with {Q}uery-aware {F}eature {R}epresentation and {L}earned {I}ndex {B}ased on {D}ata {L}ake}, year = {2024}, eprint = {2408.16237}, note = {arXiv:2408.16237v1} }
Multimodal data has become a crucial element in the realm of big data analytics, driving advancements in data exploration, data mining, and empowering artificial intelligence applications. To support high-quality retrieval for these cutting-edge applications, a robust data retrieval platform should meet the requirements for transparent data storage, rich hybrid queries, effective feature representation, and high query efficiency. However, among the existing platforms, traditional schema-on-write systems, multi-model databases, vector databases, and data lakes, which are the primary options for multimodal data retrieval, struggle to fulfill these requirements simultaneously. Therefore, there is an urgent need to develop a more versatile multimodal data retrieval platform to address these issues. In this paper, we introduce a Multimodal Data Retrieval Platform with Query-aware Feature Representation and Learned Index based on Data Lake (MQRLD). It leverages the transparent storage capabilities of data lakes, integrates the multimodal open API to provide a unified interface that supports rich hybrid queries, introduces a query-aware multimodal data feature representation strategy to obtain effective features, and offers high-dimensional learned indexes to optimize data queries. We conduct a comparative analysis of the query performance of MQRLD against other methods for rich hybrid queries. Our results underscore the superior efficiency of MQRLD in handling multimodal data retrieval tasks, demonstrating its potential to significantly improve retrieval performance in complex environments. We also clarify some potential concerns in the discussion.
Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Dacheng Tao
Aug 29 2024 cs.CV arXiv:2408.15556v1

@misc{2408.15556, author = {Wenbin Wang and Liang Ding and Minyan Zeng and Xiabin Zhou and Li Shen and Yong Luo and Dacheng Tao}, title = {{D}ivide, {C}onquer and {C}ombine: {A} {T}raining-{F}ree {F}ramework for {H}igh-{R}esolution {I}mage {P}erception in {M}ultimodal {L}arge {L}anguage {M}odels}, year = {2024}, eprint = {2408.15556}, note = {arXiv:2408.15556v1} }
Multimodal large language models (MLLMs) have experienced significant advancements recently, but still struggle to recognize and interpret intricate details in high-resolution (HR) images effectively. While state-of-the-art (SOTA) MLLMs claim to process images at 4K resolution, existing MLLM benchmarks only support up to 2K, leaving the capabilities of SOTA models on true HR images largely untested. Furthermore, existing methods for enhancing HR image perception in MLLMs rely on computationally expensive visual instruction tuning. To address these limitations, we introduce HR-Bench, the first deliberately designed benchmark to rigorously evaluate MLLM performance on 4K&8K images. Through extensive experiments, we demonstrate that while downsampling HR images leads to vision information loss, leveraging complementary modalities, e.g., text, can effectively compensate for this loss. Building upon this insight, we propose Divide, Conquer and Combine (DC$^2$), a novel training-free framework for enhancing MLLM perception of HR images. DC$^2$ follows a three-staged approach: 1) Divide: recursively partitioning the HR image into patches and merging similar patches to minimize computational overhead, 2) Conquer: leveraging the MLLM to generate accurate textual descriptions for each image patch, and 3) Combine: utilizing the generated text descriptions to enhance the MLLM's understanding of the overall HR image. Extensive experiments show that: 1) the SOTA MLLM achieves 63% accuracy, which is markedly lower than the 87% accuracy achieved by humans on HR-Bench; 2) our DC$^2$ brings consistent and significant improvements (a relative increase of +6% on HR-Bench and +8% on general multimodal benchmarks). The benchmark and code will be released to facilitate the multimodal R&D community.
Social Welfare Maximization for Federated Learning with Network Effects
Xiang Li, Yuan Luo, Bing Luo, Jianwei Huang
Aug 26 2024 cs.GT arXiv:2408.13223v1

@misc{2408.13223, author = {Xiang Li and Yuan Luo and Bing Luo and Jianwei Huang}, title = {{S}ocial {W}elfare {M}aximization for {F}ederated {L}earning with {N}etwork {E}ffects}, year = {2024}, eprint = {2408.13223}, note = {arXiv:2408.13223v1} }
A proper mechanism design can help federated learning (FL) to achieve good social welfare by coordinating self-interested clients through the learning process. However, existing mechanisms neglect the network effects of client participation, leading to suboptimal incentives and social welfare. This paper addresses this gap by exploring network effects in FL incentive mechanism design. We establish a theoretical model to analyze FL model performance and quantify the impact of network effects on heterogeneous client participation. Our analysis reveals the non-monotonic nature of FL network effects. To leverage such effects, we propose a model trading and sharing (MTS) framework that allows clients to obtain FL models through participation or purchase. To tackle heterogeneous clients' strategic behaviors, we further design a socially efficient model trading and sharing (SEMTS) mechanism. Our mechanism achieves social welfare maximization solely through customer payments, without additional incentive costs. Experimental results on an FL hardware prototype demonstrate up to 148.86% improvement in social welfare compared to existing mechanisms.
Xinyu: An Efficient LLM-based System for Commentary Generation
Yiquan Wu, Bo Tang, Chenyang Xi, Yu Yu, Pengyu Wang, Yifei Liu, Kun Kuang, Haiying Deng, Zhiyu Li, Feiyu Xiong, Jie Hu, Peng Cheng, Zhonghao Wang, Yi Wang, Yi Luo, Mingchuan Yang
Aug 22 2024 cs.CL cs.AI arXiv:2408.11609v2

@misc{2408.11609, author = {Yiquan Wu and Bo Tang and Chenyang Xi and Yu Yu and Pengyu Wang and Yifei Liu and Kun Kuang and Haiying Deng and Zhiyu Li and Feiyu Xiong and Jie Hu and Peng Cheng and Zhonghao Wang and Yi Wang and Yi Luo and Mingchuan Yang}, title = {{X}inyu: {A}n {E}fficient {LLM}-based {S}ystem for {C}ommentary {G}eneration}, year = {2024}, eprint = {2408.11609}, note = {arXiv:2408.11609v2} }
Commentary provides readers with a deep understanding of events by presenting diverse arguments and evidence. However, creating commentary is a time-consuming task, even for skilled commentators. Large language models (LLMs) have simplified the process of natural language generation, but their direct application in commentary creation still faces challenges due to unique task requirements. These requirements can be categorized into two levels: 1) fundamental requirements, which include creating well-structured and logically consistent narratives, and 2) advanced requirements, which involve generating quality arguments and providing convincing evidence. In this paper, we introduce Xinyu, an efficient LLM-based system designed to assist commentators in generating Chinese commentaries. To meet the fundamental requirements, we deconstruct the generation process into sequential steps, proposing targeted strategies and supervised fine-tuning (SFT) for each step. To address the advanced requirements, we present an argument ranking model for arguments and establish a comprehensive evidence database that includes up-to-date events and classic books, thereby strengthening the substantiation of the evidence with retrieval augmented generation (RAG) technology. To evaluate the generated commentaries more fairly, corresponding to the two-level requirements, we introduce a comprehensive evaluation metric that considers five distinct perspectives in commentary generation. Our experiments confirm the effectiveness of our proposed system. We also observe a significant increase in the efficiency of commentators in real-world scenarios, with the average time spent on creating a commentary dropping from 4 hours to 20 minutes. Importantly, such an increase in efficiency does not compromise the quality of the commentaries.
DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation
Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Toshimitsu Uesaka, Keisuke Toyama, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Wei-Hsiang Liao, Simon Dixon, Yuki Mitsufuji
Aug 21 2024 cs.SD cs.AI cs.LG eess.AS arXiv:2408.10807v1

@misc{2408.10807, author = {Yin-Jyun Luo and Kin Wai Cheuk and Woosung Choi and Toshimitsu Uesaka and Keisuke Toyama and Koichi Saito and Chieh-Hsin Lai and Yuhta Takida and Wei-Hsiang Liao and Simon Dixon and Yuki Mitsufuji}, title = {{D}is{M}ix: {D}isentangling {M}ixtures of {M}usical {I}nstruments for {S}ource-level {P}itch and {T}imbre {M}anipulation}, year = {2024}, eprint = {2408.10807}, note = {arXiv:2408.10807v1} }
Existing work on pitch and timbre disentanglement has been mostly focused on single-instrument music audio, excluding the cases where multiple instruments are present. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source, and the collection of which forms a set of per-instrument latent representations underlying the observed mixture. By manipulating the representations, our model samples mixtures with novel combinations of pitch and timbre of the constituent instruments. We can jointly learn the disentangled pitch-timbre representations and a latent diffusion transformer that reconstructs the mixture conditioned on the set of source-level representations. We evaluate the model using both a simple dataset of isolated chords and a realistic dataset of four-part chorales in the style of J.S. Bach, identify the key components for the success of disentanglement, and demonstrate the application of mixture transformation based on source-level attribute manipulation.
SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models
Anke Tang, Li Shen, Yong Luo, Shuai Xie, Han Hu, Lefei Zhang, Bo Du, Dacheng Tao
Aug 20 2024 cs.LG cs.AI arXiv:2408.10174v2

@misc{2408.10174, author = {Anke Tang and Li Shen and Yong Luo and Shuai Xie and Han Hu and Lefei Zhang and Bo Du and Dacheng Tao}, title = {{SMILE}: {Z}ero-{S}hot {S}parse {M}ixture of {L}ow-{R}ank {E}xperts {C}onstruction {F}rom {P}re-{T}rained {F}oundation {M}odels}, year = {2024}, eprint = {2408.10174}, note = {arXiv:2408.10174v2} }
Deep model training on extensive datasets is increasingly becoming cost-prohibitive, prompting the widespread adoption of deep model fusion techniques to leverage knowledge from pre-existing models. From simple weight averaging to more sophisticated methods like AdaMerging, model fusion effectively improves model performance and accelerates the development of new models. However, potential interference between parameters of individual models and the lack of interpretability in the fusion process remain significant challenges. Existing methods often try to resolve the parameter interference issue by evaluating attributes of parameters, such as their magnitude or sign, or by parameter pruning. In this study, we begin by examining the fine-tuning of linear layers through the lens of subspace analysis and explicitly define parameter interference as an optimization problem to shed light on this subject. Subsequently, we introduce an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction, which allows for the upscaling of source models into an MoE model without extra data or further training. Our approach relies on the observation that fine-tuning mostly keeps the important parts from the pre-training, but it uses less significant or unused areas to adapt to new tasks. Also, the issue of parameter interference, which is intrinsically intractable in the original parameter space, can be managed by expanding the dimensions. We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning, and we apply our method to large language models (CLIP models, Flan-T5 models, and Mistral-7B models), highlighting the adaptability and scalability of SMILE. Code is available at https://github.com/tanganke/fusion_bench
Enhancing Exploratory Learning through Exploratory Search with the Emergence of Large Language Models
Yiming Luo, Patrick Cheong-Iao, Shanton Chang
Aug 20 2024 cs.IR cs.AI cs.CL arXiv:2408.08894v1

@misc{2408.08894, author = {Yiming Luo and Patrick Cheong-Iao and Shanton Chang}, title = {{E}nhancing {E}xploratory {L}earning through {E}xploratory {S}earch with the {E}mergence of {L}arge {L}anguage {M}odels}, year = {2024}, eprint = {2408.08894}, note = {arXiv:2408.08894v1} }
In the information era, how learners find, evaluate, and effectively use information has become a challenging issue, especially with the added complexity of large language models (LLMs) that have further confused learners in their information retrieval and search activities. This study attempts to unpack this complexity by combining exploratory search strategies with the theories of exploratory learning to form a new theoretical model of exploratory learning from the perspective of students' learning. Our work adapts Kolb's learning model by incorporating high-frequency exploration and feedback loops, aiming to promote deep cognitive and higher-order cognitive skill development in students. Additionally, this paper discusses and suggests how advanced LLMs integrated into information retrieval and information theory can support students in their exploratory searches, contributing theoretically to promoting student-computer interaction and supporting their learning journeys in the new era with LLMs.
GeneticPrism: Multifaceted Visualization of Scientific Impact Evolutions
Ye Sun, Zipeng Liu, Yuankai Luo, Lei Xia, Lei Shi
Aug 20 2024 cs.DL cs.GR cs.SI arXiv:2408.08912v1

@misc{2408.08912, author = {Ye Sun and Zipeng Liu and Yuankai Luo and Lei Xia and Lei Shi}, title = {{G}enetic{P}rism: {M}ultifaceted {V}isualization of {S}cientific {I}mpact {E}volutions}, year = {2024}, eprint = {2408.08912}, note = {arXiv:2408.08912v1} }
Understanding the evolution of scholarly impact is essential for many real-life decision-making processes in academia, such as research planning, frontier exploration, and award selection. Popular platforms like Google Scholar and Web of Science rely on numerical indicators that are too abstract to convey the context and content of scientific impact, while most existing visualization approaches on mapping science do not consider the presentation of individual scholars' impact evolution using curated self-citation data. This paper builds on our previous work and proposes an integrated pipeline to visualize a scholar's impact evolution from multiple topic facets. A novel 3D prism-shaped visual metaphor is introduced as the overview of a scholar's impact, whilst their scientific evolution on each topic is displayed in a more structured manner. Additional designs by topic chord diagram, streamgraph visualization, and inter-topic flow map, optimized by an elaborate layout algorithm, assist in perceiving the scholar's scientific evolution across topics. A new six-degree-impact glyph metaphor highlights key interdisciplinary works driving the evolution. The proposed visualization methods are evaluated through case studies analyzing the careers of prestigious Turing award laureates and a major visualization venue.
Sequential Federated Learning in Hierarchical Architecture on Non-IID Datasets
Xingrun Yan, Shiyuan Zuo, Rongfei Fan, Han Hu, Li Shen, Puning Zhao, Yong Luo
Aug 20 2024 cs.LG arXiv:2408.09762v1

@misc{2408.09762, author = {Xingrun Yan and Shiyuan Zuo and Rongfei Fan and Han Hu and Li Shen and Puning Zhao and Yong Luo}, title = {{S}equential {F}ederated {L}earning in {H}ierarchical {A}rchitecture on {N}on-{IID} {D}atasets}, year = {2024}, eprint = {2408.09762}, note = {arXiv:2408.09762v1} }
In a real federated learning (FL) system, communication overhead for passing model parameters between the clients and the parameter server (PS) is often a bottleneck. Hierarchical federated learning (HFL), which places multiple edge servers (ESs) between clients and the PS, can partially alleviate communication pressure but still needs the aggregation of model parameters from multiple ESs at the PS. To further reduce communication overhead, we bring sequential FL (SFL) into HFL for the first time, which removes the central PS and enables the model training to be completed only through passing the global model between two adjacent ESs for each iteration, and propose a novel algorithm adaptive to such a combinational framework, referred to as Fed-CHS. Convergence results are derived for strongly convex and non-convex loss functions under various data heterogeneity setups, which show convergence performance comparable to that of algorithms for HFL or SFL alone. Experimental results provide evidence of the superiority of our proposed Fed-CHS on both communication overhead saving and test accuracy over baseline methods.
Geometry Informed Tokenization of Molecules for Language Model Generation
Xiner Li, Limei Wang, Youzhi Luo, Carl Edwards, Shurui Gui, Yuchao Lin, Heng Ji, Shuiwang Ji
Aug 20 2024 cs.AI arXiv:2408.10120v1

@misc{2408.10120, author = {Xiner Li and Limei Wang and Youzhi Luo and Carl Edwards and Shurui Gui and Yuchao Lin and Heng Ji and Shuiwang Ji}, title = {{G}eometry {I}nformed {T}okenization of {M}olecules for {L}anguage {M}odel {G}eneration}, year = {2024}, eprint = {2408.10120}, note = {arXiv:2408.10120v1} }
We consider molecule generation in 3D space using language models (LMs), which requires discrete tokenization of 3D molecular geometries. Although tokenization of molecular graphs exists, that for 3D geometries is largely unexplored. Here, we attempt to bridge this gap by proposing Geo2Seq, which converts molecular geometries into $SE(3)$-invariant 1D discrete sequences. Geo2Seq consists of canonical labeling and invariant spherical representation steps, which together maintain geometric and atomic fidelity in a format conducive to LMs. Our experiments show that, when coupled with Geo2Seq, various LMs excel in molecular geometry generation, especially in controlled generation tasks.
OC3D: Weakly Supervised Outdoor 3D Object Detection with Only Coarse Click Annotation
Qiming Xia, Hongwei Lin, Wei Ye, Hai Wu, Yadan Luo, Shijia Zhao, Xin Li, Chenglu Wen
Aug 16 2024 cs.CV cs.AI arXiv:2408.08092v2

@misc{2408.08092, author = {Qiming Xia and Hongwei Lin and Wei Ye and Hai Wu and Yadan Luo and Shijia Zhao and Xin Li and Chenglu Wen}, title = {{OC}3{D}: {W}eakly {S}upervised {O}utdoor 3{D} {O}bject {D}etection with {O}nly {C}oarse {C}lick {A}nnotation}, year = {2024}, eprint = {2408.08092}, note = {arXiv:2408.08092v2} }
LiDAR-based outdoor 3D object detection has received widespread attention. However, training 3D detectors from the LiDAR point cloud typically relies on expensive bounding box annotations. This paper presents OC3D, an innovative weakly supervised method requiring only coarse clicks on the bird's eye view of the 3D point cloud. A key challenge here is the absence of complete geometric descriptions of the target objects from such simple click annotations. To address this problem, our proposed OC3D adopts a two-stage strategy. In the first stage, we initially design a novel dynamic and static classification strategy and then propose the Click2Box and Click2Mask modules to generate box-level and mask-level pseudo-labels for static and dynamic instances, respectively. In the second stage, we design a Mask2Box module, leveraging the learning capabilities of neural networks to update mask-level pseudo-labels, which contain less information, to box-level pseudo-labels. Experimental results on the widely used KITTI and nuScenes datasets demonstrate that our OC3D with only coarse clicks achieves state-of-the-art performance compared to weakly-supervised 3D detection methods. Combining OC3D with a missing click mining strategy, we propose an OC3D++ pipeline, which requires only 0.2% annotation cost in the KITTI dataset to achieve performance comparable to fully supervised methods. The code will be made publicly available.
COD: Learning Conditional Invariant Representation for Domain Adaptation Regression
Hao-Ran Yang, Chuan-Xian Ren, You-Wei Luo
Aug 14 2024 cs.LG cs.CV arXiv:2408.06638v1

@misc{2408.06638, author = {Hao-Ran Yang and Chuan-Xian Ren and You-Wei Luo}, title = {{COD}: {L}earning {C}onditional {I}nvariant {R}epresentation for {D}omain {A}daptation {R}egression}, year = {2024}, eprint = {2408.06638}, note = {arXiv:2408.06638v1} }
Aiming to generalize the label knowledge from a source domain with continuous outputs to an unlabeled target domain, Domain Adaptation Regression (DAR) is developed for complex practical learning problems. However, due to the continuity problem in regression, existing conditional distribution alignment theory and methods with discrete prior, which are proven to be effective in classification settings, are no longer applicable. In this work, focusing on the feasibility problems in DAR, we establish the sufficiency theory for the regression model, which shows the generalization error can be sufficiently dominated by the cross-domain conditional discrepancy. Further, to characterize conditional discrepancy with continuous conditioning variable, a novel Conditional Operator Discrepancy (COD) is proposed, which admits the metric property on conditional distributions via the kernel embedding theory. Finally, to minimize the discrepancy, a COD-based conditional invariant representation learning model is proposed, and the reformulation is derived to show that reasonable modifications on moment statistics can further improve the discriminability of the adaptation model. Extensive experiments on standard DAR datasets verify the validity of theoretical results and the superiority over SOTA DAR methods.
OpenResearcher: Unleashing AI for Accelerated Scientific Research
Yuxiang Zheng, Shichao Sun, Lin Qiu, Dongyu Ru, Cheng Jiayang, Xuefeng Li, Jifan Lin, Binjie Wang, Yun Luo, Renjie Pan, Yang Xu, Qingkai Min, Zizhao Zhang, Yiwen Wang, Wenjie Li, Pengfei Liu
Aug 14 2024 cs.IR arXiv:2408.06941v1

@misc{2408.06941, author = {Yuxiang Zheng and Shichao Sun and Lin Qiu and Dongyu Ru and Cheng Jiayang and Xuefeng Li and Jifan Lin and Binjie Wang and Yun Luo and Renjie Pan and Yang Xu and Qingkai Min and Zizhao Zhang and Yiwen Wang and Wenjie Li and Pengfei Liu}, title = {{O}pen{R}esearcher: {U}nleashing {AI} for {A}ccelerated {S}cientific {R}esearch}, year = {2024}, eprint = {2408.06941}, note = {arXiv:2408.06941v1} }
The rapid growth of scientific literature imposes significant challenges for researchers endeavoring to stay updated with the latest advancements in their fields and delve into new areas. We introduce OpenResearcher, an innovative platform that leverages Artificial Intelligence (AI) techniques to accelerate the research process by answering diverse questions from researchers. OpenResearcher is built based on Retrieval-Augmented Generation (RAG) to integrate Large Language Models (LLMs) with up-to-date, domain-specific knowledge. Moreover, we develop various tools for OpenResearcher to understand researchers' queries, search from the scientific literature, filter retrieved information, provide accurate and comprehensive answers, and self-refine these answers. OpenResearcher can flexibly use these tools to balance efficiency and effectiveness. As a result, OpenResearcher enables researchers to save time and increase their potential to discover new insights and drive scientific breakthroughs. Demo, video, and code are available at: https://github.com/GAIR-NLP/OpenResearcher.