Exploring Forgetting in Large Language Model Pre-Training

Authors: Chonghua Liao, Ruobing Xie, Xingwu Sun, Haowen Sun, Zhanhui Kang

Abstract: Catastrophic forgetting remains a formidable obstacle to building an omniscient model in large language models (LLMs). Despite the pioneering research on task-level forgetting in LLM fine-tuning, there is scant focus on forgetting during pre-training. We systematically explored the existence and measurement of forgetting in pre-training, questioning traditional metrics such as perplexity (PPL) and introducing new metrics to better detect entity memory retention. Based on our revised assessment of forgetting metrics, we explored low-cost, straightforward methods to mitigate forgetting during the pre-training phase. Further, we carefully analyzed the learning curves, offering insights into the dynamics of forgetting. Extensive evaluations and analyses on forgetting of pre-training could facilitate future research on LLMs.

Submitted 22 October, 2024; originally announced October 2024.

arXiv:2410.16912 [pdf, ps, other]

Measurement of the branching fractions of the decays $Λ_{c}^{+}\rightarrowΛK_{S}^{0}K^{+}$, $Λ_{c}^{+}\rightarrowΛK_{S}^{0}π^{+}$ and $Λ_{c}^{+}\rightarrowΛK^{*+}$

Authors: BESIII Collaboration, M. Ablikim, M. N. Achasov, P. Adlarson, O. Afedulidis, X. C. Ai, R. Aliberti, A. Amoroso, Q. An, Y. Bai, O. Bakina, I. Balossino, Y. Ban, H. -R. Bao, V. Batozskaya, K. Begzsuren, N. Berger, M. Berlowski, M. Bertani, D. Bettoni, F. Bianchi, E. Bianco, A. Bortone, I. Boyko, R. A. Briere, et al. (639 additional authors not shown)

Abstract: Studies are performed of the Cabibbo-favored decay $Λ_{c}^{+}\toΛK_{S}^{0}K^+$ and the singly Cabibbo-suppressed decay $Λ_{c}^{+}\toΛK_{S}^{0}π^+$, based on a sample of $e^{+}e^{-}$ collision data, corresponding to an integrated luminosity of 4.5 fb$^{-1}$, accumulated at center-of-mass energies between $4599.53$ MeV and $4698.82$ MeV with the BESIII detector. The decay $Λ_{c}^{+}\toΛK_{S}^{0}π^+$ is observed for the first time. The branching fractions of $Λ_{c}^{+}\toΛK_{S}^{0}K^+$ and $Λ_{c}^{+}\toΛK_{S}^{0}π^+$ are measured to be $(3.04\pm0.30\pm0.16)\times 10^{-3}$ and $(1.73\pm0.27\pm0.10)\times 10^{-3}$, respectively, where the first uncertainties are statistical and the second are systematic. These results correspond to the most precise measurement of these quantities for both decays. Evidence of a $K^{*+}$ contribution in the $Λ_{c}^{+}\toΛK_{S}^{0}π^+$ decay is found with a statistical significance of $4.7σ$. The branching fraction of $Λ_{c}^{+}\toΛK^{*+}$ is calculated under three possible interference scenarios.

Submitted 22 October, 2024; originally announced October 2024.

arXiv:2410.16830 [pdf, other]

Random spanning trees in random environment

Authors: Luca Makowiec, Michele Salvi, Rongfeng Sun

Abstract: We introduce a new spanning tree model called the random spanning tree in random environment (RSTRE), which interpolates between the uniform spanning tree and the minimum spanning tree as the inverse temperature (disorder strength) $β$ varies. On the complete graph with $n$ vertices and i.i.d. uniform disorder variables on the edges, we identify: (1) a low disorder regime with $β\leq C n/\log n$, where the diameter of the random spanning tree is typically of order $n^{1/2}$, the same as for the uniform spanning tree; (2) a high disorder regime with $β\geq n^{4/3} \log n$, where the diameter is typically of order $n^{1/3}$, the same as for the minimum spanning tree. We conjecture that for $β=n^α$ with $α\in (1, 4/3)$, the diameter is of order $n^{γ+o(1)}$ for some $γ=γ(α)$ strictly between $1/2$ and $1/3$.

Submitted 22 October, 2024; originally announced October 2024.

Comments: 36 pages, 2 figures. Comments are welcome!

MSC Class: 60K35 (Primary) 82B41; 82B44; 05C05 (Secondary)

arXiv:2410.16720 [pdf, other]

NodeOP: Optimizing Node Management for Decentralized Networks

Authors: Angela Tsang, Jiankai Sun, Boo Xie, Azeem Khan, Ender Lu, Fletcher Fan, Maggie Wu, Jing Tang

Abstract: We present NodeOP, a novel framework designed to optimize the management of General Node Operators in decentralized networks. By integrating Agent-Based Modeling (ABM) with a Tendermint Byzantine Fault Tolerance (BFT)-based consensus mechanism, NodeOP addresses key challenges in task allocation, consensus formation, and system stability. Through rigorous mathematical modeling and formal optimization, NodeOP ensures stable equilibrium in node task distribution. We validate the framework via convergence analysis and performance metrics such as transaction throughput, system latency, and fault tolerance. We further demonstrate NodeOP's practical utility through two use cases: decentralized sequencer management in Layer 2 networks and off-chain payment validation. These examples underscore how NodeOP enhances validation efficiency and unlocks new revenue opportunities in large-scale decentralized environments. Our results position NodeOP as a scalable and flexible solution, significantly improving operational efficiency and economic sustainability in decentralized systems.

Submitted 22 October, 2024; originally announced October 2024.

arXiv:2410.16695 [pdf, other]

MPT: A Large-scale Multi-Phytoplankton Tracking Benchmark

Authors: Yang Yu, Yuezun Li, Xin Sun, Junyu Dong

Abstract: Phytoplankton are a crucial component of aquatic ecosystems, and effective monitoring of them can provide valuable insights into ocean environments and ecosystem changes. Traditional phytoplankton monitoring methods are often complex and lack timely analysis. Therefore, deep learning algorithms offer a promising approach for automated phytoplankton monitoring. However, the lack of large-scale, high-quality training samples has become a major bottleneck in advancing phytoplankton tracking. In this paper, we propose a challenging benchmark dataset, Multiple Phytoplankton Tracking (MPT), which covers diverse background information and variations in motion during observation. The dataset includes 27 species of phytoplankton and zooplankton, 14 different backgrounds to simulate diverse and complex underwater environments, and a total of 140 videos. To enable accurate real-time observation of phytoplankton, we introduce a multi-object tracking method, Deviation-Corrected Multi-Scale Feature Fusion Tracker (DSFT), which addresses issues such as focus shifts during tracking and the loss of small target information when computing frame-to-frame similarity. Specifically, we introduce an additional feature extractor to predict the residuals of the standard feature extractor's output, and compute multi-scale frame-to-frame similarity based on features from different layers of the extractor. Extensive experiments on the MPT have demonstrated the validity of the dataset and the superiority of DSFT in tracking phytoplankton, providing an effective solution for phytoplankton monitoring.

Submitted 22 October, 2024; originally announced October 2024.

arXiv:2410.16663 [pdf, other]

FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs

Authors: Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan, Stanislav Kamenev, Han Bao, Ting Hu, Mingkai Wang, Qixin Chang, Siyue Sui, Weihao Sun, Jiaxin Hu, Jun Yao, Zekun Yin, Cheng Qian, Ying Zhang, Yinfei Pan, Yu Yang, Weiguo Liu

Abstract: The FlashAttention series has been widely applied in the inference of large language models (LLMs). However, the FlashAttention series only supports high-end GPU architectures, e.g., Ampere and Hopper. At present, the FlashAttention series is not easily transferable to NPUs and low-resource GPUs. Moreover, the FlashAttention series is inefficient for multi-NPU or multi-GPU inference scenarios. In this work, we propose FastAttention, which pioneers the adaptation of the FlashAttention series for NPUs and low-resource GPUs to boost LLM inference efficiency. Specifically, we take Ascend NPUs and Volta-based GPUs as representatives for designing our FastAttention. We migrate the FlashAttention series to Ascend NPUs by proposing a novel two-level tiling strategy for runtime speedup, a tiling-mask strategy for memory saving, and a tiling-AllReduce strategy for reducing communication overhead. Besides, we adapt FlashAttention for Volta-based GPUs by redesigning the operands layout in shared memory and introducing a simple yet effective CPU-GPU cooperative strategy for efficient memory utilization. On Ascend NPUs, our FastAttention can achieve a 10.7$\times$ speedup compared to the standard attention implementation. Llama-7B within FastAttention reaches up to 5.16$\times$ higher throughput than within the standard attention. On Volta architecture GPUs, FastAttention yields 1.43$\times$ speedup compared to its equivalents in xformers. Pangu-38B within FastAttention brings 1.46$\times$ end-to-end speedup using FasterTransformer. Coupled with the proposed CPU-GPU cooperative strategy, FastAttention supports a maximal input length of 256K on 8 V100 GPUs. All the codes will be made available soon.

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.16638 [pdf, other]

LLMScan: Causal Scan for LLM Misbehavior Detection

Authors: Mengdi Zhang, Kai Kiat Goh, Peixin Zhang, Jun Sun

Abstract: Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful, biased and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM's 'brain' behaves differently when misbehaving. By analyzing the causal contributions of the LLM's input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.16594 [pdf, other]

The Impact of Initial Composition on Massive Star Evolution and Nucleosynthesis

Authors: Christopher West, Alexander Heger, Benoit Cote, Lev Serxner, Haoxuan Sun

Abstract: We study the sensitivity of presupernova evolution and supernova nucleosynthesis yields of massive stars to variations of the initial composition. We use the solar abundances from Lodders (2009), and compute two different initial stellar compositions: i) scaled solar abundances, and ii) the isotopic galactic chemical history model (GCH) developed by West and Heger (2013b). We run a grid of models using the KEPLER stellar evolution code, with 7 initial stellar masses, 12 initial metallicities, and two for each scaling method to explore the effects on nucleosynthesis over a metallicity range of $-4.0\leq[Z]\leq+0.3$. We find that the compositions from the GCH model better reproduce the weak s-process peak than the scaled solar models. The model yields are then used in the OMEGA Galactic Chemical Evolution (GCE) code to assess this result further. We find that initial abundances used in computing stellar structure have more of an impact on GCE results than initial abundances used in the burn network, with the GCH model again being favored when compared to observations. Lastly, a machine learning algorithm was used to verify the free parameter values of the GCH model, which were previously found by West and Heger (2013b) using a stochastic fitting process. The updated model is provided as an accessible tool for further nucleosynthesis studies.

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.16565 [pdf, other]

Search for gravitational waves emitted from SN 2023ixf

Authors: The LIGO Scientific Collaboration, the Virgo Collaboration, the KAGRA Collaboration, A. G. Abac, R. Abbott, I. Abouelfettouh, F. Acernese, K. Ackley, S. Adhicary, N. Adhikari, R. X. Adhikari, V. K. Adkins, D. Agarwal, M. Agathos, M. Aghaei Abchouyeh, O. D. Aguiar, I. Aguilar, L. Aiello, A. Ain, T. Akutsu, S. Albanesi, R. A. Alfaidi, A. Al-Jodah, C. Alléné, A. Allocca, et al. (1758 additional authors not shown)

Abstract: We present the results of a search for gravitational-wave transients associated with core-collapse supernova SN 2023ixf, which was observed in the galaxy Messier 101 via optical emission on 2023 May 19th, during the LIGO-Virgo-KAGRA 15th Engineering Run. We define a five-day on-source window during which an accompanying gravitational-wave signal may have occurred. No gravitational waves have been identified in data when at least two gravitational-wave observatories were operating, which covered $\sim 14\%$ of this five-day window. We report the search detection efficiency for various possible gravitational-wave emission models. Considering the distance to M101 (6.7 Mpc), we derive constraints on the gravitational-wave emission mechanism of core-collapse supernovae across a broad frequency spectrum, ranging from 50 Hz to 2 kHz where we assume the GW emission occurred when coincident data are available in the on-source window. Considering an ellipsoid model for a rotating proto-neutron star, our search is sensitive to gravitational-wave energy $1 \times 10^{-5} M_{\odot} c^2$ and luminosity $4 \times 10^{-5} M_{\odot} c^2/\text{s}$ for a source emitting at 50 Hz. These constraints are around an order of magnitude more stringent than those obtained so far with gravitational-wave data. The constraint on the ellipticity of the proto-neutron star that is formed is as low as $1.04$, at frequencies above $1200$ Hz, surpassing results from SN 2019ejj.

Submitted 21 October, 2024; originally announced October 2024.

Comments: Main paper: 6 pages, 4 figures and 1 table. Total with appendices: 20 pages, 4 figures, and 1 table

Report number: LIGO-P2400125

arXiv:2410.16561 [pdf, ps, other]

Gradient Normalization with(out) Clipping Ensures Convergence of Nonconvex SGD under Heavy-Tailed Noise with Improved Results

Authors: Tao Sun, Xinwang Liu, Kun Yuan

Abstract: This paper investigates Gradient Normalization Stochastic Gradient Descent with Clipping (NSGDC) and its variance reduction variant (NSGDC-VR) for nonconvex optimization under heavy-tailed noise. We present significant improvements in the theoretical results for both algorithms, including the removal of logarithmic factors from the convergence rates and the recovery of the convergence rate to match the deterministic case when the noise variance σ is zero. Additionally, we demonstrate that gradient normalization alone, assuming individual Lipschitz smoothness, is sufficient to ensure convergence of SGD under heavy-tailed noise, eliminating the need for gradient clipping. Furthermore, we introduce accelerated nonconvex algorithms that utilize second-order Lipschitz smoothness to achieve enhanced convergence rates in the presence of heavy-tailed noise. Our findings offer a deeper understanding of how gradient normalization and variance reduction techniques can be optimized for robust performance in challenging optimization scenarios.

Submitted 21 October, 2024; originally announced October 2024.
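
Editor's sketch for the entry above: the abstract contrasts gradient normalization with gradient clipping. The snippet below is a minimal, generic illustration of those two textbook operations, not the paper's NSGDC or NSGDC-VR algorithms. The function names, the `eps` guard, and the `max_norm` parameter are assumptions introduced here for illustration.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def normalized_step(x, grad, lr, eps=1e-12):
    """Generic normalized-gradient step: x <- x - lr * g / ||g||.

    The step length is (almost exactly) lr no matter how large the
    stochastic gradient is, which is why normalization is a natural
    fit for heavy-tailed noise. eps is a hypothetical guard against
    division by zero.
    """
    norm = l2_norm(grad)
    return [xi - lr * gi / (norm + eps) for xi, gi in zip(x, grad)]


def clip_gradient(grad, max_norm):
    """Generic norm clipping: rescale g only when ||g|| exceeds max_norm."""
    norm = l2_norm(grad)
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [gi * scale for gi in grad]
```

With a huge gradient such as `[3e6, 4e6]`, `normalized_step` still moves the iterate by a distance of about `lr`, while `clip_gradient` caps the norm at `max_norm` and leaves small gradients untouched.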

arXiv:2410.16446 [pdf, ps, other]

Lifetimes and Branching Ratios Apparatus (LIBRA)

Authors: L. J. Sun, J. Dopfer, A. Adams, C. Wrede, A. Banerjee, B. A. Brown, J. Chen, E. A. M. Jensen, R. Mahajan, T. Rauscher, C. Sumithrarachchi, L. E. Weghorn, D. Weisshaar, T. Wheeler

Abstract: The Particle X-ray Coincidence Technique (PXCT) was originally developed to measure average lifetimes in the $10^{-17}-10^{-15}$ s range for proton-unbound states populated by electron capture (EC). We have designed and built the Lifetimes and Branching Ratios Apparatus (LIBRA) to be used in the stopped-beam area at the Facility for Rare Isotope Beams that extends PXCT to measure both lifetimes and decay branching ratios of resonances populated by EC/$β^+$ decay. The first application of LIBRA aims to obtain essential nuclear data from $^{60}$Ga EC/$β^+$ decay to constrain the thermonuclear rates of the $^{59}$Cu$(p,γ)^{60}$Zn and $^{59}$Cu$(p,α)^{56}$Ni reactions, and in turn, the strength of the NiCu nucleosynthesis cycle, which is predicted to significantly impact the modeling of Type I X-ray burst light curves and the composition of the burst ashes. Detailed theoretical calculations, Monte Carlo simulations, and performance tests with radioactive sources have been conducted to validate the feasibility of employing LIBRA for the $^{60}$Ga experiment. The method introduced with LIBRA has the potential to measure nearly all essential ingredients for thermonuclear reaction rate calculations in a single experiment, in the absence of direct measurements, which are often impractical for radioactive reactants.

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.16429 [pdf, other]

Pantograph: A Machine-to-Machine Interaction Interface for Advanced Theorem Proving, High Level Reasoning, and Data Extraction in Lean 4

Authors: Leni Aniva, Chuyue Sun, Brando Miranda, Clark Barrett, Sanmi Koyejo

Abstract: Machine-assisted theorem proving refers to the process of conducting structured reasoning to automatically generate proofs for mathematical theorems. Recently, there has been a surge of interest in using machine learning models in conjunction with proof assistants to perform this task. In this paper, we introduce Pantograph, a tool that provides a versatile interface to the Lean 4 proof assistant and enables efficient proof search via powerful search algorithms such as Monte Carlo Tree Search. In addition, Pantograph enables high-level reasoning by enabling a more robust handling of Lean 4's inference steps. We provide an overview of Pantograph's architecture and features. We also report on an illustrative use case: using machine learning models and proof sketches to prove Lean 4 theorems. Pantograph's innovative features pave the way for more advanced machine learning models to perform complex proof searches and high-level reasoning, equipping future researchers to design more versatile and powerful theorem provers.

Submitted 21 October, 2024; originally announced October 2024.

ACM Class: F.4.1; I.2.3; I.2.7

arXiv:2410.16404 [pdf, other]

UVCANDELS: Catalogs of photometric redshifts and galaxy physical properties

Authors: Vihang Mehta, Marc Rafelski, Ben Sunnquist, Harry I. Teplitz, Claudia Scarlata, Xin Wang, Adriano Fontana, Nimish P. Hathi, Kartheik G. Iyer, Anahita Alavi, James Colbert, Norman Grogin, Anton Koekemoer, Kalina V. Nedkova, Matthew Hayes, Laura Prichard, Brian Siana, Brent M. Smith, Rogier Windhorst, Teresa Ashcraft, Micaela Bagley, Ivano Baronchelli, Guillermo Barro, Alex Blanche, Adam Broussard, et al. (54 additional authors not shown)

Abstract: The UltraViolet imaging of the Cosmic Assembly Near-infrared Deep Extragalactic Legacy Survey Fields (UVCANDELS) program provides deep HST F275W and F435W imaging over four CANDELS fields (GOODS-N, GOODS-S, COSMOS, and EGS). We combine this newly acquired UV imaging with existing HST imaging from CANDELS as well as existing ancillary data to obtain robust photometric redshifts and reliable estimates for galaxy physical properties for over 150,000 galaxies in the $\sim$430 arcmin$^2$ UVCANDELS area. Here, we leverage the power of the new UV photometry to not only improve the photometric redshift measurements in these fields, but also constrain the full redshift probability distribution combining multiple redshift fitting tools. Furthermore, using the full UV-to-IR photometric dataset, we measure the galaxy physical properties by fitting templates from population synthesis models with two different parameterizations (flexible and fixed-form) of the star-formation histories (SFHs). Compared to the flexible SFH parameterization, we find that the fixed-form SFHs systematically underestimate the galaxy stellar masses, both at the low- ($\lesssim10^9 M_\odot$) and high- ($\gtrsim10^{10} M_\odot$) mass end, by as much as $\sim0.5$ dex. This underestimation is primarily due to the limited ability of fixed-form SFH parameterization to simultaneously capture the chaotic nature of star-formation in these galaxies.

Submitted 21 October, 2024; originally announced October 2024.

Comments: 22 pages, 6 figures; accepted to ApJS; catalogs available via MAST

arXiv:2410.16322 [pdf, other]

SouLLMate: An Application Enhancing Diverse Mental Health Support with Adaptive LLMs, Prompt Engineering, and RAG Techniques

Authors: Qiming Guo, Jinwen Tang, Wenbo Sun, Haoteng Tang, Yi Shang, Wenlu Wang

Abstract: Mental health issues significantly impact individuals' daily lives, yet many do not receive the help they need even with available online resources. This study aims to provide diverse, accessible, stigma-free, personalized, and real-time mental health support through cutting-edge AI technologies. It makes the following contributions: (1) Conducting an extensive survey of recent mental health support methods to identify prevalent functionalities and unmet needs. (2) Introducing SouLLMate, an adaptive LLM-driven system that integrates LLM technologies, Chain, Retrieval-Augmented Generation (RAG), prompt engineering, and domain knowledge. This system offers advanced features such as Risk Detection and Proactive Guidance Dialogue, and utilizes RAG for personalized profile uploads and Conversational Information Extraction. (3) Developing novel evaluation approaches for preliminary assessments and risk detection via professionally annotated interview data and real-life suicide tendency data. (4) Proposing the Key Indicator Summarization (KIS), Proactive Questioning Strategy (PQS), and Stacked Multi-Model Reasoning (SMMR) methods to enhance model performance and usability through context-sensitive response adjustments, semantic coherence evaluations, and enhanced accuracy of long-context reasoning in language models. This study contributes to advancing mental health support technologies, potentially improving the accessibility and effectiveness of mental health care globally.

Submitted 17 October, 2024; originally announced October 2024.

Comments: 26 pages, 19 figures, 8 tables

arXiv:2410.16271 [pdf, other]

FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors

Authors: Chin-Yang Lin, Chung-Ho Wu, Chang-Han Yeh, Shih-Han Yen, Cheng Sun, Yu-Lun Liu

Abstract: Neural Radiance Fields (NeRF) face significant challenges in few-shot scenarios, primarily due to overfitting and long training times for high-fidelity rendering. Existing methods, such as FreeNeRF and SparseNeRF, use frequency regularization or pre-trained priors but struggle with complex scheduling and bias. We introduce FrugalNeRF, a novel few-shot NeRF framework that leverages weight-sharing voxels across multiple scales to efficiently represent scene details. Our key contribution is a cross-scale geometric adaptation scheme that selects pseudo ground truth depth based on reprojection errors across scales. This guides training without relying on externally learned priors, enabling full utilization of the training data. It can also integrate pre-trained priors, enhancing quality without slowing convergence. Experiments on LLFF, DTU, and RealEstate-10K show that FrugalNeRF outperforms other few-shot NeRF methods while significantly reducing training time, making it a practical solution for efficient and accurate 3D scene reconstruction.

Submitted 21 October, 2024; originally announced October 2024.

Comments: Project page: https://linjohnss.github.io/frugalnerf/

arXiv:2410.16240 [pdf, other]

Nonlinear Magnetics Model for Permanent Magnet Synchronous Machines Capturing Saturation and Temperature Effects

Authors: Kishan Srinivasan, Heath Hofmann, Jing Sun

Abstract: This paper proposes a nonlinear magnetics model for Permanent Magnet Synchronous Machines (PMSMs) that accurately captures the effects of magnetic saturation in the machine iron and variations in rotor temperature on the permanent magnet excitation. The proposed model considers the permanent magnet as a current source rather than the more commonly used flux-linkage source. A comparison of the two modelling approaches is conducted using Finite Element Analysis (FEA) for different machine designs as well as experimental validation, where it is shown that the proposed model has substantially better accuracy. The proposed model decouples magnetic saturation and rotor temperature effects in the current/flux-linkage relationship, allowing for adaptive estimation of the PM excitation.

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.16198 [pdf, other]

Improve Vision Language Model Chain-of-thought Reasoning

Authors: Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang

Abstract: Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes lack robust CoT reasoning data, relying on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLMs on short answers does not generalize well to reasoning tasks that require more detailed responses. To address this, we propose a two-fold approach. First, we distill rationales from the GPT-4o model to enrich the training data and fine-tune VLMs, boosting their CoT performance. Second, we apply reinforcement learning to further calibrate reasoning quality. Specifically, we construct positive (correct) and negative (incorrect) pairs of model-generated reasoning chains, by comparing their predictions with annotated short answers. Using this pairwise data, we apply the Direct Preference Optimization algorithm to refine the model's reasoning abilities. Our experiments demonstrate significant improvements in CoT reasoning on benchmark datasets and better generalization to direct answer prediction as well. This work emphasizes the importance of incorporating detailed rationales in training and leveraging reinforcement learning to strengthen the reasoning capabilities of VLMs.

Submitted 21 October, 2024; originally announced October 2024.

Comments: 10 pages + appendix

MSC Class: 68T07

arXiv:2410.16173 [pdf, other]

Fast Physics-Informed Model Predictive Control Approximation for Lyapunov Stability

Authors: Josue N. Rivera, Jianqi Ruan, XiaoLin Xu, Shuting Yang, Dengfeng Sun, Neera Jain

Abstract: At the forefront of control techniques is Model Predictive Control (MPC). While MPCs are effective, their need to recompute an optimal control for each new state leads to a sparse response to the system and may make their implementation infeasible in small systems with low computational resources. To address these limitations in stability control, this research presents a small deterministic Physics-Informed MPC Surrogate model (PI-MPCS). PI-MPCS was developed to approximate the control by an MPC while encouraging stability and robustness through the integration of the system dynamics and the formation of a Lyapunov stability profile. Empirical results are presented on the task of 2D quadcopter landing. They demonstrate a rapid and precise MPC approximation on a non-linear system along with an estimated twofold speedup in computational requirements when compared against an MPC. PI-MPCS, in addition, displays a level of stable control for in- and out-of-distribution states as encouraged by the discrete dynamics residual and Lyapunov stability loss functions. PI-MPCS is meant to serve as a surrogate to MPC in situations in which the computational resources are limited.

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.16135 [pdf, other]

Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs

Authors: Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen

Abstract: To date, 2:4 sparsity has stood as the only sparse pattern that can be accelerated using sparse tensor cores on GPUs. In practice, 2:4 sparsity often possesses low actual speedups ($\leq 1.3$) and requires fixed sparse ratios, meaning that other ratios, such as 4:8, 8:16, or those exceeding 50% sparsity, do not incur any speedups on GPUs. Recent studies suggest that V:N:M sparsity is promising in addressing these limitations of 2:4 sparsity. However, regarding accuracy, the effects of V:N:M sparsity on broader Transformer models, such as vision Transformers and large language models (LLMs), are largely unexamined. Moreover, some specific issues related to V:N:M sparsity, such as how to select appropriate V and M values, remain unresolved. In this study, we thoroughly investigate the application of V:N:M sparsity in vision models and LLMs across multiple tasks, from pretraining to downstream tasks. We propose three key approaches to enhance the applicability and accuracy of V:N:M-sparse Transformers, including heuristic V and M selection, V:N:M-specific channel permutation, and three-staged LoRA training techniques. Experimental results show that, with our methods, the DeiT-small achieves lossless accuracy at 64:2:5 sparsity, while the DeiT-base maintains accuracy even at 64:2:8 sparsity. In addition, the fine-tuned LLama2-7B at 64:2:5 sparsity performs comparably or better than training-free 2:4 sparse alternatives on downstream tasks. More importantly, V:N:M-sparse Transformers offer a wider range of speedup-accuracy trade-offs compared to 2:4 sparsity.
Overall, our exploration largely facilitates the V:N:M sparsity to act as a truly effective acceleration solution for Transformers in cost-sensitive inference scenarios.
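
For readers unfamiliar with the 2:4 pattern that this abstract takes as its baseline, the following is a minimal sketch of one-shot magnitude pruning to 2:4 sparsity: in every group of four consecutive weights along a row, the two smallest-magnitude entries are zeroed. It is not the V:N:M method of the paper; the function name `prune_2_of_4` and the choice of magnitude as the selection criterion are assumptions made for illustration.

```python
import numpy as np


def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude entries in each group of four
    consecutive weights along the last axis (illustrative helper, not
    taken from the paper)."""
    rows, cols = weights.shape
    if cols % 4 != 0:
        raise ValueError("number of columns must be a multiple of 4")
    groups = weights.reshape(rows, cols // 4, 4)
    # Ascending sort by magnitude: the first two indices are the ones to drop.
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]
    mask = np.ones(groups.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=-1)
    return (groups * mask).reshape(rows, cols)


w = np.array([[1.0, -5.0, 0.5, 3.0, 2.0, 2.5, -0.1, 0.2]])
pruned = prune_2_of_4(w)
```

Each group of four keeps exactly two nonzero weights, which is the fixed 50% ratio the abstract describes as a limitation.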

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.15832 [pdf, other]

Nonlinear Bayesian Filtering with Natural Gradient Gaussian Approximation

Authors: Wenhan Cao, Tianyi Zhang, Zeju Sun, Chang Liu, Stephen S. -T. Yau, Shengbo Eben Li

Abstract: Practical Bayes filters often assume the state distribution of each time step to be Gaussian for computational tractability, resulting in the so-called Gaussian filters. When facing nonlinear systems, Gaussian filters such as the extended Kalman filter (EKF) or unscented Kalman filter (UKF) typically rely on certain linearization techniques, which can introduce large estimation errors. To address this issue, this paper reconstructs the prediction and update steps of Gaussian filtering as solutions to two distinct optimization problems, whose optimal conditions are found to have analytical forms from Stein's lemma. It is observed that the stationary point for the prediction step requires calculating the first two moments of the prior distribution, which is equivalent to that step in existing moment-matching filters. In the update step, instead of linearizing the model to approximate the stationary points, we propose an iterative approach to directly minimize the update step's objective to avoid linearization errors. For the purpose of performing the steepest descent on the Gaussian manifold, we derive its natural gradient that leverages the Fisher information matrix to adjust the gradient direction, accounting for the curvature of the parameter space. Combining this update step with moment matching in the prediction step, we introduce a new iterative filter for nonlinear systems called the Natural Gradient Gaussian Approximation filter, or NANO filter for short. We prove that the NANO filter locally converges to the optimal Gaussian approximation at each time step.
The estimation error is proven exponentially bounded for nearly linear measurement equations and low noise levels through constructing a supermartingale-like inequality across consecutive time steps.

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.15817 [pdf, other]

Large Language Models Empower Personalized Valuation in Auction

Authors: Jie Sun, Tianyu Zhang, Houcheng Jiang, Kexin Huang, Chi Luo, Junkang Wu, Jiancan Wu, An Zhang, Xiang Wang

Abstract: Auctions, a fundamental economic mechanism, encompass the valuation of goods or services and the competitive bidding algorithms within a specific framework, serving to uncover the true market value. However, current research predominantly focuses on the bidding algorithms within a given auction mechanism, often overlooking the advantages of incorporating individual bidders' unique preferences and the semantic information related to the items into the valuation process. Our analysis, both theoretical and empirical, shows that imprecise or noisy valuations can significantly affect the overall utility for participants. To bridge this gap, we propose a personalized valuation framework, namely Semantic-enhanced Personalized Valuation in Auction (SPVA), which integrates Large Language Models (LLMs) to incorporate semantic information into each bidder's unique valuation process. Specifically, SPVA employs a two-stage approach: it first fine-tunes LLMs to encode bidder preferences in personalized valuations, and then constructs a Vickrey auction environment integrated with a bidding algorithm to demonstrate that SPVA's more accurate valuations result in higher profits. Additionally, we have developed a semantic-enhanced dataset comprising over 23,000 samples and introduced new personalized evaluation metrics that reflect both bidder preferences and profit.
Through simulations of various auction scenarios, our method demonstrates its ability to provide accurate valuations and capture bidder preferences, affirming the method's effectiveness in real-world auction settings.
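
As background for the Vickrey auction environment mentioned in this abstract, here is a minimal sketch of the standard sealed-bid second-price rule: the highest bidder wins and pays the second-highest bid. The function name `vickrey_auction` and the tie-breaking by insertion order are assumptions for illustration and are not drawn from the paper's environment.

```python
def vickrey_auction(bids: dict[str, float]) -> tuple[str, float]:
    """Return (winner, price) under the second-price rule.

    Ties are broken in favour of the bidder listed first (an
    illustrative choice, not specified by the source).
    """
    if len(bids) < 2:
        raise ValueError("a second-price auction needs at least two bids")
    # sorted() is stable, so equal bids keep their insertion order.
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price


result = vickrey_auction({"alice": 10.0, "bob": 7.0, "carol": 3.0})
```

Because the price does not depend on the winner's own bid, bidding one's true valuation is a dominant strategy, which is why valuation accuracy translates directly into profit in this setting.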

Submitted 21 October, 2024; originally announced October 2024.

Comments: 14 pages, 5 figures

arXiv:2410.15774 [pdf, other]

Generalizing Motion Planners with Mixture of Experts for Autonomous Driving

Authors: Qiao Sun, Huimin Wang, Jiahao Zhan, Fan Nie, Xin Wen, Leimeng Xu, Kun Zhan, Peng Jia, Xianpeng Lang, Hang Zhao

Abstract: Large real-world driving datasets have sparked significant research into various aspects of data-driven motion planners for autonomous driving. These include data augmentation, model architecture, reward design, training strategies, and planner pipelines. These planners promise better generalizations on complicated and few-shot cases than previous methods. However, experiment results show that many of these approaches produce limited generalization abilities in planning performance due to overly complex designs or training paradigms. In this paper, we review and benchmark previous methods focusing on generalizations. The experimental results indicate that as models are appropriately scaled, many design elements become redundant. We introduce StateTransformer-2 (STR2), a scalable, decoder-only motion planner that uses a Vision Transformer (ViT) encoder and a mixture-of-experts (MoE) causal Transformer architecture. The MoE backbone addresses modality collapse and reward balancing by expert routing during training. Extensive experiments on the NuPlan dataset show that our method generalizes better than previous approaches across different test sets and closed-loop simulations. Furthermore, we assess its scalability on billions of real-world urban driving scenarios, demonstrating consistent accuracy improvements as both data and model size grow.

Submitted 21 October, 2024; originally announced October 2024.

Comments: 7 pages, 3 figures

arXiv:2410.15755 [pdf, other]

Search for New Particles with Flying Quantum Sensors in Space

Authors: Huang Xingming, Wang Yuanhong, Jiang Min, Kang Xiang, Su Haowen, Wang Zehao, Lin Qing, Zheng Wenqiang, Sun Yuan, Liu Liang, Peng Xinhua, Zhao Zhengguo, Du JiangFeng

Abstract: Recent advancements in space science and technologies offer exciting prospects for investigating novel research that is unattainable within terrestrial laboratories. Here we propose the implementation of space-based quantum sensing to explore ultralight new particles beyond the standard model. The central idea involves probing long-range interactions between spin ensembles of space quantum sensors and the particles residing within Earth, mediated by ultralight particles. We show that such interactions can be substantially enhanced in space platforms and thus increase the search sensitivity. In contrast to their terrestrial counterparts, space-based quantum searches exhibit remarkable velocity enhancements, approaching the first cosmic speed, and thus enable the exploration of unexplored parameter space concerning ultralight new particles. Furthermore, the substantial abundance of electrons and nucleons within Earth plays a crucial role in extending the scope of our mission. Our projected search sensitivity can surpass the sensitivities of both terrestrial experiments and proposals by up to approximately 7 orders of magnitude. We also briefly discuss other space missions, including a "space-ground integrated" network of quantum sensors for dark matter searches.

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.15738 [pdf, ps, other]

A Fair Allocation is Approximately Optimal for Indivisible Chores, or Is It?

Authors: Bo Li, Ankang Sun, Shiji Xing

Abstract: In this paper, we study the allocation of indivisible chores and consider the problem of finding a fair allocation that is approximately efficient. We shift our attention from the multiplicative approximation to the additive one. Our results are twofold, with (1) bounding how the optimal social cost escalates resulting from fairness requirements and (2) presenting the hardness of approximation for the problems of finding fair allocations with the minimum social cost. To quantify the escalation, we introduce cost of fairness (CoF), an alternative to the price of fairness (PoF), to bound the difference (vs. ratio for PoF) between the optimal social cost with and without fairness constraints in the worst-case instance. We find that CoF is more informative than PoF for chores in the sense that the PoF is infinity regarding all EQX (equitable up to any item), EQ1 (equitable up to one item) and EF1 (envy-free up to one item), while the CoF is $n$ regarding EQX and 1 regarding EQ1 and EF1, where $n$ is the number of agents. For inapproximability, we present a detailed picture of hardness of approximation. We prove that finding the optimal EQX allocation within an additive approximation factor of $n$ is NP-hard for any $n \geq 2$ where $n$ is the number of agents and the cost functions are normalized to 1. For EQ1 and EF1, the problem is NP-hard when the additive factor is a constant and $n \geq 3$. When $n = 2$, we design additive approximation schemes for EQ1 and EF1.
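
The contrast between the two measures in this abstract can be written out as follows. This is a sketch in assumed notation, not the paper's own: $\mathcal{I}$ denotes the set of instances, $\mathrm{OPT}(I)$ the minimum social cost over all allocations of instance $I$, and $\mathrm{OPT}_F(I)$ the minimum social cost over allocations satisfying a fairness notion $F$ (such as EQX, EQ1 or EF1).

```latex
\[
  \mathrm{PoF}_F \;=\; \sup_{I \in \mathcal{I}} \frac{\mathrm{OPT}_F(I)}{\mathrm{OPT}(I)},
  \qquad
  \mathrm{CoF}_F \;=\; \sup_{I \in \mathcal{I}} \bigl( \mathrm{OPT}_F(I) - \mathrm{OPT}(I) \bigr).
\]
```

Under this reading, the ratio can be unbounded when $\mathrm{OPT}(I)$ is close to zero, while the difference stays finite once cost functions are normalized, which is consistent with the abstract's claim that the PoF is infinite while the CoF is at most $n$.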

Submitted 21 October, 2024; originally announced October 2024.

Comments: Appears in the 20th Conference on Web and Internet Economics (WINE), 2024

ACM Class: F.2.2

arXiv:2410.15732 [pdf, other]

ViMoE: An Empirical Study of Designing Vision Mixture-of-Experts

Authors: Xumeng Han, Longhui Wei, Zhiyang Dou, Zipeng Wang, Chenhui Qiang, Xin He, Yingfei Sun, Zhenjun Han, Qi Tian

Abstract: Mixture-of-Experts (MoE) models embody the divide-and-conquer concept and are a promising approach for increasing model capacity, demonstrating excellent scalability across multiple domains. In this paper, we integrate the MoE structure into the classic Vision Transformer (ViT), naming it ViMoE, and explore the potential of applying MoE to vision through a comprehensive study on image classification. However, we observe that the performance is sensitive to the configuration of MoE layers, making it challenging to obtain optimal results without careful design. The underlying cause is that inappropriate MoE layers lead to unreliable routing and hinder experts from effectively acquiring helpful knowledge. To address this, we introduce a shared expert to learn and capture common information, serving as an effective way to construct stable ViMoE. Furthermore, we demonstrate how to analyze expert routing behavior, revealing which MoE layers are capable of specializing in handling specific information and which are not. This provides guidance for retaining the critical layers while removing redundancies, thereby advancing ViMoE to be more efficient without sacrificing accuracy. We aspire for this work to offer new insights into the design of vision MoE models and provide valuable empirical guidance for future research.

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.15698 [pdf, other]

Solving Continual Offline RL through Selective Weights Activation on Aligned Spaces

Authors: Jifeng Hu, Sili Huang, Li Shen, Zhejian Yang, Shengchao Hu, Shisong Tang, Hechang Chen, Yi Chang, Dacheng Tao, Lichao Sun

Abstract: Continual offline reinforcement learning (CORL) has shown impressive ability in diffusion-based lifelong learning systems by modeling the joint distributions of trajectories. However, most research only focuses on limited continual task settings where the tasks have the same observation and action space, which deviates from the realistic demands of training agents in various environments. In view of this, we propose Vector-Quantized Continual Diffuser, named VQ-CD, to break the barrier of different spaces between various tasks. Specifically, our method contains two complementary sections, where the quantization spaces alignment provides a unified basis for the selective weights activation. In the quantized spaces alignment, we leverage vector quantization to align the different state and action spaces of various tasks, facilitating continual training in the same space. Then, we propose to leverage a unified diffusion model attached by the inverse dynamic model to master all tasks by selectively activating different weights according to the task-related sparse masks. Finally, we conduct extensive experiments on 15 continual learning (CL) tasks, including conventional CL task settings (identical state and action spaces) and general CL task settings (various state and action spaces). Compared with 16 baselines, our method reaches the SOTA performance.

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.15633 [pdf, other]

Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement

Authors: Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun

Abstract: The expansion of large language models to effectively handle instructions with extremely long contexts has yet to be fully investigated. The primary obstacle lies in constructing a high-quality long instruction-following dataset devised for long context alignment. Existing studies have attempted to scale up the available data volume by synthesizing long instruction-following samples. However, indiscriminately increasing the quantity of data without a well-defined strategy for ensuring data quality may introduce low-quality samples and restrict the final performance. To bridge this gap, we aim to address the unique challenge of long-context alignment, i.e., modeling the long-range dependencies for handling instructions and lengthy input contexts. We propose GATEAU, a novel framework designed to identify the influential and high-quality samples enriched with long-range dependency relations by utilizing crafted Homologous Models' Guidance (HMG) and Contextual Awareness Measurement (CAM). Specifically, HMG attempts to measure the difficulty of generating corresponding responses due to the long-range dependencies, using the perplexity scores of the response from two homologous models with different context windows. Also, the role of CAM is to measure the difficulty of understanding the long input contexts due to long-range dependencies by evaluating whether the model's attention is focused on important segments.
Built upon both proposed methods, we select the most challenging samples as the influential data to effectively frame the long-range dependencies, thereby achieving better performance of LLMs. Comprehensive experiments indicate that GATEAU effectively identifies samples enriched with long-range dependency relations and the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.15631 [pdf, other]

Security of Language Models for Code: A Systematic Literature Review

Authors: Yuchen Chen, Weisong Sun, Chunrong Fang, Zhenpeng Chen, Yifei Ge, Tingxu Han, Quanjun Zhang, Yang Liu, Zhenyu Chen, Baowen Xu

Abstract: Language models for code (CodeLMs) have emerged as powerful tools for code-related tasks, outperforming traditional methods and standard machine learning approaches. However, these models are susceptible to security vulnerabilities, drawing increasing research attention from domains such as software engineering, artificial intelligence, and cybersecurity. Despite the growing body of research focused on the security of CodeLMs, a comprehensive survey in this area remains absent. To address this gap, we systematically review 67 relevant papers, organizing them based on attack and defense strategies. Furthermore, we provide an overview of commonly used language models, datasets, and evaluation metrics, and highlight open-source tools and promising directions for future research in securing CodeLMs.

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.15575 [pdf, other]

Neural Search Space in Gboard Decoder

Authors: Yanxiang Zhang, Yuanbo Zhang, Haicheng Sun, Yun Wang, Billy Dou, Gary Sivek, Shumin Zhai

Abstract: Gboard Decoder produces suggestions by looking for paths that best match input touch points on the context-aware search space, which is backed by the language Finite State Transducers (FST). The language FST is currently an N-gram language model (LM). However, N-gram LMs, limited in context length, are known to have a sparsity problem under device model size constraints. In this paper, we propose Neural Search Space, which substitutes the N-gram LM with a Neural Network LM (NN-LM) and dynamically constructs the search space during decoding. Specifically, we integrate the long-range context awareness of the NN-LM into the search space by converting its outputs, given context, into the language FST at runtime. This involves language FST structure redesign, pruning strategy tuning, and data structure optimizations. Online experiments demonstrate improved quality results, reducing Words Modified Ratio by [0.26%, 1.19%] on various locales with acceptable latency increases. This work opens new avenues for further improving keyboard decoding quality by enhancing neural LM more directly.

Submitted 20 October, 2024; originally announced October 2024.

Comments: 10 pages, 7 figures, 3 tables

arXiv:2410.15567 [pdf, other]

Pruning Foundation Models for High Accuracy without Retraining

Authors: Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin

Abstract: Despite the superior performance, it is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations. While pruning is a promising technique to reduce model size and accelerate inference, traditional pruning techniques can hardly be applied to LLMs, as they need to finetune the model on the full dataset for multiple epochs, consuming massive data and hardware resources. To deal with this problem, post-training pruning methods are proposed to prune LLMs in one shot without retraining. However, their accuracy after pruning may suffer from certain performance degradation due to the lack of retraining with massive data. To address this issue, in this paper, we first formulate the post-training problem for layer-wise LLM compression to simultaneously prune multiple weights in LLMs. Next, we provide an optimal solution for this problem and design our post-training pruning algorithm for both unstructured and semi-structured sparsity. Our extensive experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines across various LLM families including transformer-based LLMs and Mamba-based LLMs. Code link: https://github.com/piuzha/APT

Submitted 20 October, 2024; originally announced October 2024.

Comments: Accepted by EMNLP 2024 findings

arXiv:2410.15536 [pdf, other]

GRS: Generating Robotic Simulation Tasks from Real-World Images

Authors: Alex Zook, Fan-Yun Sun, Josef Spjut, Valts Blukis, Stan Birchfield, Jonathan Tremblay

Abstract: We introduce GRS (Generating Robotic Simulation tasks), a novel system to address the challenge of real-to-sim in robotics, computer vision, and AR/VR. GRS enables the creation of digital twin simulations from single real-world RGB-D observations, complete with diverse, solvable tasks for virtual agent training. We use state-of-the-art vision-language models (VLMs) to achieve a comprehensive real-to-sim pipeline. GRS operates in three stages: 1) scene comprehension using SAM2 for object segmentation and VLMs for object description, 2) matching identified objects with simulation-ready assets, and 3) generating contextually appropriate robotic tasks. Our approach ensures simulations align with task specifications by generating test suites designed to verify adherence to the task specification. We introduce a router that iteratively refines the simulation and test code to ensure the simulation is solvable by a robot policy while remaining aligned to the task specification. Our experiments demonstrate the system's efficacy in accurately identifying object correspondence, which allows us to generate task environments that closely match input environments, and enhance automated simulation task generation through our novel router mechanism.

Submitted 20 October, 2024; originally announced October 2024.

arXiv:2410.15529 [pdf, other]

Measurement of gas properties for the ion-TPC of N$ν$DEx experiment

Authors: Tianyu Liang, Meiqiang Zhan, Hulin Wang, Xianglun Wei, Dongliang Zhang, Jun Liu, Chengui Lu, Qiang Hu, Yichen Yang, Chaosong Gao, Le Xiao, Xiangming Sun, Feng Liu, Chengxin Zhao, Hao Qiu, Kai Chen

Abstract: In the N$ν$DEx collaboration, a high-pressure gas TPC is being developed to search for the neutrinoless double beta decay. The use of electronegative $\mathrm{^{82}SeF_{6}}$ gas mandates an ion-TPC. The reconstruction of the $z$ coordinate is to be realized by exploiting the feature of multiple species of charge carriers. As the initial stage of the development, we studied the properties of the $\mathrm{SF_{6}}$ gas, which is non-toxic and has a similar molecular structure to $\mathrm{SeF_{6}}$. In the paper we present the measurement of drift velocities and mobilities of the majority and minority negative charge carriers found in $\mathrm{SF_{6}}$ at a pressure of 750 Torr, slightly higher than the local atmospheric pressure. The reduced fields range between 3.0 and 5.5 Td. It was performed using a laser beam to ionize the gas inside a small TPC, with a drift length of 3.7 cm. A customized charge sensitive amplifier was developed to read out the anode signals induced by the slowly drifting ions. The reconstruction of the $z$ coordinate using the difference in the velocities of the two carriers was also demonstrated.

Submitted 20 October, 2024; originally announced October 2024.

Comments: 10 pages, 8 figures

arXiv:2410.15526 [pdf, other]

SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training

Authors: Jinda Jia, Cong Xie, Hanlin Lu, Daoce Wang, Hao Feng, Chengming Zhang, Baixi Sun, Haibin Lin, Zhi Zhang, Xin Liu, Dingwen Tao

Abstract: Recent years have witnessed a clear trend towards language models with an ever-increasing number of parameters, as well as the growing training overhead and memory usage. Distributed training, particularly through Sharded Data Parallelism (ShardedDP) which partitions optimizer states among workers, has emerged as a crucial technique to mitigate training time and memory usage. Yet, a major challenge in the scalability of ShardedDP is the intensive communication of weights and gradients. While compression techniques can alleviate this issue, they often result in worse accuracy. Driven by this limitation, we propose SDP4Bit (Toward 4Bit Communication Quantization in Sharded Data Parallelism for LLM Training), which effectively reduces the communication of weights and gradients to nearly 4 bits via two novel techniques: quantization on weight differences, and two-level gradient smooth quantization. Furthermore, SDP4Bit presents an algorithm-system co-design with runtime optimization to minimize the computation overhead of compression. In addition to the theoretical guarantees of convergence, we empirically evaluate the accuracy of SDP4Bit on the pre-training of GPT models with up to 6.7 billion parameters, and the results demonstrate a negligible impact on training loss. Furthermore, speed experiments show that SDP4Bit achieves up to 4.08$\times$ speedup in end-to-end throughput on a scale of 128 GPUs.

Submitted 20 October, 2024; originally announced October 2024.

Comments: Accepted by NeurIPS 2024
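
As a rough illustration of what "quantization on weight differences" can mean, the sketch below sends the change in a weight shard as signed 4-bit codes plus one scale, and the receiver adds the dequantized change to its stale copy. This is a generic symmetric uniform quantizer written for this listing, not the SDP4Bit implementation; the function names and the single per-shard scale are assumptions.

```python
def quantize_4bit(values):
    """Symmetric uniform quantization to signed integer codes in [-7, 7].

    Hypothetical helper: one scale for the whole list (an assumption).
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def encode_weight_update(old_weights, new_weights):
    """Sender side: quantize the difference, not the weights themselves."""
    diffs = [n - o for n, o in zip(new_weights, old_weights)]
    return quantize_4bit(diffs)


def apply_weight_update(old_weights, codes, scale):
    """Receiver side: add the dequantized difference to the stale copy."""
    return [o + d for o, d in zip(old_weights, dequantize(codes, scale))]


old = [1.0, -2.0, 0.5, 3.0]
new = [1.01, -2.02, 0.497, 3.0]
codes, scale = encode_weight_update(old, new)
recon = apply_weight_update(old, codes, scale)
max_err = max(abs(a - b) for a, b in zip(recon, new))
```

Because differences between successive weight values are typically much smaller than the weights, the quantization step (and hence the rounding error, at most half a step per entry) is correspondingly smaller than it would be when quantizing the weights directly.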

arXiv:2410.15397 [pdf, other]

IPO: Interpretable Prompt Optimization for Vision-Language Models

Authors: Yingjun Du, Wenfang Sun, Cees G. M. Snoek

Abstract: Pre-trained vision-language models like CLIP have remarkably adapted to various downstream tasks. Nonetheless, their performance heavily depends on the specificity of the input text prompts, which requires skillful prompt template engineering. Instead, current approaches to prompt optimization learn the prompts through gradient descent, where the prompts are treated as adjustable parameters. However, these methods tend to lead to overfitting of the base classes seen during training and produce prompts that are no longer understandable by humans. This paper introduces a simple but interpretable prompt optimizer (IPO), that utilizes large language models (LLMs) to generate textual prompts dynamically. We introduce a Prompt Optimization Prompt that not only guides LLMs in creating effective prompts but also stores past prompts with their performance metrics, providing rich in-context information. Additionally, we incorporate a large multimodal model (LMM) to condition on visual content by generating image descriptions, which enhance the interaction between textual and visual modalities. This allows for the creation of dataset-specific prompts that improve generalization performance, while maintaining human comprehension. Extensive testing across 11 datasets reveals that IPO not only improves the accuracy of existing gradient-descent-based prompt learning methods but also considerably enhances the interpretability of the generated prompts. By leveraging the strengths of LLMs, our approach ensures that the prompts remain human-understandable, thereby facilitating better transparency and oversight for vision-language models.

Submitted 20 October, 2024; originally announced October 2024.

Comments: Accepted by NeurIPS 2024

arXiv:2410.15261 [pdf]

Emerging quantum critical phase in a cluster spin-glass

Authors: Fang Zhang, Tao Feng, Yurong Ruan, Xiaoyuan Ye, Bing Wen, Liang Zhou, Minglin He, Zhaotong Zhuang, Liusuo Wu, Hongtao He, Peijie Sun, Zhiyang Yu, Weishu Liu, Wenqing Zhang

Abstract: Magnetic frustration has been recognized as pivotal to investigating new phases of matter in correlation-driven Kondo breakdown quantum phase transitions that are not clearly associated with broken symmetry. The nature of these new phases, however, remains underexplored. Here, we report quantum criticalities emerging from a cluster spin-glass in the heavy-fermion metal TiFe$_x$Cu$_{2x-1}$Sb, where frustration originates from intrinsic disorder. Specific heat and magnetic Grüneisen parameter measurements under varying magnetic fields exhibit quantum critical scaling, indicating a quantum critical point near 0.13 Tesla. As the magnetic field increases, the cluster spin-glass phase is progressively suppressed. Upon crossing the quantum critical point, resistivity and Hall effect measurements reveal enhanced screening of local moments and an expanding Fermi surface, consistent with the Kondo breakdown scenario.

Submitted 19 October, 2024; originally announced October 2024.

Comments: 18 pages, 4 figures, with Supplementary Information

arXiv:2410.15252 [pdf, other]

Lossless KV Cache Compression to 2%

Authors: Zhen Yang, J. N. Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang

Abstract: Large language models have revolutionized data processing in numerous domains, with their ability to handle extended context reasoning receiving notable recognition. To speed up inference, maintaining a key-value (KV) cache memory is essential. Nonetheless, the growing demands for KV cache memory create significant hurdles for efficient implementation. This work introduces a novel architecture, Cross-Layer Latent Attention (CLLA), aimed at compressing the KV cache to less than 2% of its original size while maintaining comparable performance levels. CLLA integrates multiple aspects of KV cache compression, including attention head/dimension reduction, layer sharing, and quantization techniques, into a cohesive framework. Our extensive experiments demonstrate that CLLA achieves lossless performance on most tasks while utilizing minimal KV cache, marking a significant advancement in practical KV cache compression.

Submitted 19 October, 2024; originally announced October 2024.

arXiv:2410.15130 [pdf, ps, other]

Seminorm estimates and joint ergodicity for pairwise independent Hardy sequences

Authors: Sebastián Donoso, Andreas Koutsogiannis, Borys Kuca, Wenbo Sun, Konstantinos Tsinas

Abstract: We develop a robust structure theory for multiple ergodic averages of commuting transformations along Hardy sequences of polynomial growth. We then apply it to derive a number of novel results on joint ergodicity, recurrence and convergence. Specifically, we construct a suitable generalization of Host-Kra and box seminorms that quantitatively controls the aforementioned averages subject to necessary nondegeneracy conditions on the Hardy sequences. Combining this with a variant of the seminorm smoothing argument, we obtain Host-Kra seminorm estimates for averages along all pairwise independent Hardy sequences. In conjunction with joint ergodicity criteria and classical equidistribution results, these estimates yield a number of novel joint ergodicity results that reach far beyond the state of the art. In particular, we prove joint ergodicity for (a) pairwise independent Hardy sequences and weakly mixing transformations, (b) strongly independent Hardy sequences and ergodic transformations, (c) irrationally strongly independent Hardy sequences and totally ergodic transformations. We also positively resolve the joint ergodicity classification problem for all pairwise independent Hardy sequences, of which the aforementioned three families are special cases. Lastly, we use these joint ergodicity results to provide new recurrence results for multidimensional patterns along strongly independent Hardy sequences. While building on recent technical advances (e.g. PET coefficient tracking schemes and joint ergodicity criteria), our work introduces a number of technical developments of its own, including the theory of generalized box seminorms, an ergodic version of the quantitative concatenation argument, improved simultaneous Taylor approximations for Hardy sequences, and a generalization of the seminorm smoothing argument.

Submitted 19 October, 2024; originally announced October 2024.

Comments: 105 pages. Comments welcome!

MSC Class: Primary: 37A44; Secondary: 11B30; 28D05

arXiv:2410.15100 [pdf]

A Flat Plasmonic Biosensing Interface on Optical Fiber End-Facet via SPP-MIM Hybridization

Authors: Chenjia He, Xiaqing Sun, Hao Zhong, Qingfeng Meng, Xuetong Zhou, Sihang Liu, Li Zheng, Xiangyang Kong, Shengfu Chen, Shengce Tao, Tian Yang

Abstract: We found that the specific dispersion of metal-insulator-metal (MIM) waveguide allows the hybridization of surface plasmon polaritons (SPPs) and the waveguide, which is not possible with dielectric waveguides. The SPP-MIM hybridization structure forms such a meta-film that integrates the previously incompatible respective merits of SPR and LSPR, including flat interfaces, high sensitivities, short evanescent fields and easy coupling with confined light. On the other hand, to achieve stable and reproducible performance is one of the greatest unresolved challenges for the development of nanophotonic biosensors. We point out that the key is to obtain well-controlled biomolecular behaviors using simple physical interfaces, for which the SPP-MIM meta-film provides a capable solution. We embed the SPP-MIM meta-film with a plasmonic crystal cavity and integrate it on a single-mode fiber's end-facet to detect biomolecular interactions. This device demonstrates highly reproducible sensorgrams and convincing detection of biotinylated proteins at down to 30 fM, with the sensorgrams following the Langmuir model. By unprecedentedly having both high sensitivity and high reproducibility, our device proposal provides a comprehensive solution for optical fiber-tip plasmonic devices to turn into a useful industrial biosensing technology.

Submitted 19 October, 2024; originally announced October 2024.

Comments: article + supplementary information

arXiv:2410.15020 [pdf, other]

Iterative Methods via Locally Evolving Set Process

Authors: Baojian Zhou, Yifan Sun, Reza Babanezhad Harikandeh, Xingzhi Guo, Deqing Yang, Yanghua Xiao

Abstract: Given the damping factor $α$ and precision tolerance $ε$, \citet{andersen2006local} introduced Approximate Personalized PageRank (APPR), the \textit{de facto local method} for approximating the PPR vector, with runtime bounded by $Θ(1/(αε))$ independent of the graph size. Recently, \citet{fountoulakis2022open} asked whether faster local algorithms could be developed using $\tilde{O}(1/(\sqrtαε))$ operations. By noticing that APPR is a local variant of Gauss-Seidel, this paper explores the question of \textit{whether standard iterative solvers can be effectively localized}. We propose to use the \textit{locally evolving set process}, a novel framework to characterize the algorithm locality, and demonstrate that many standard solvers can be effectively localized. Let $\overline{\operatorname{vol}}{ (S_t)}$ and $\overlineγ_{t}$ be the running average of volume and the residual ratio of active nodes $\textstyle S_{t}$ during the process. We show $\overline{\operatorname{vol}}{ (S_t)}/\overlineγ_{t} \leq 1/ε$ and prove APPR admits a new runtime bound $\tilde{O}(\overline{\operatorname{vol}}(S_t)/(α\overlineγ_{t}))$ mirroring the actual performance. Furthermore, when the geometric mean of residual reduction is $Θ(\sqrtα)$, then there exists $c \in (0,2)$ such that the local Chebyshev method has runtime $\tilde{O}(\overline{\operatorname{vol}}(S_{t})/(\sqrtα(2-c)))$ without the monotonicity assumption. Numerical results confirm the efficiency of this novel framework and show up to a hundredfold speedup over corresponding standard solvers on real-world graphs.

Submitted 19 October, 2024; originally announced October 2024.

Comments: 58 pages, 15 figures, NeurIPS 2024
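
For readers unfamiliar with APPR, the following is a minimal sketch of the push-style local method the abstract refers to, in its commonly described lazy-walk form: a node is "active" while its residual is at least $ε$ times its degree, and each push moves an $α$ fraction of that residual into the estimate. The adjacency-dict representation, the function name `appr_push`, and the FIFO processing order are assumptions made for illustration; this is not the paper's locally evolving set framework or its accelerated solvers.

```python
from collections import deque


def appr_push(graph, source, alpha, eps):
    """Approximate personalized PageRank by local pushes (lazy-walk variant).

    graph: dict mapping node -> list of neighbours (undirected, no isolated nodes).
    Returns (p, r): sparse dicts holding the estimate and the residual.
    """
    p = {}
    r = {source: 1.0}
    queue = deque([source])
    queued = {source}
    while queue:
        u = queue.popleft()
        queued.discard(u)
        deg_u = len(graph[u])
        ru = r.get(u, 0.0)
        if ru < eps * deg_u:
            continue  # residual too small: node is not active
        # Push: alpha * ru goes to the estimate, half of the rest stays at u,
        # and the other half is spread evenly over u's neighbours.
        p[u] = p.get(u, 0.0) + alpha * ru
        r[u] = (1 - alpha) * ru / 2
        share = (1 - alpha) * ru / (2 * deg_u)
        for v in graph[u]:
            r[v] = r.get(v, 0.0) + share
            if v not in queued and r[v] >= eps * len(graph[v]):
                queue.append(v)
                queued.add(v)
        if u not in queued and r[u] >= eps * deg_u:
            queue.append(u)
            queued.add(u)
    return p, r


# Small example graph: a 6-cycle with one chord (0-3).
graph = {
    0: [1, 5, 3],
    1: [0, 2],
    2: [1, 3],
    3: [2, 4, 0],
    4: [3, 5],
    5: [4, 0],
}
alpha, eps = 0.15, 1e-4
p, r = appr_push(graph, source=0, alpha=alpha, eps=eps)
```

Each push on a node of degree $d$ adds at least $αεd$ to the estimate, whose total mass never exceeds 1, which is where the $1/(αε)$ bound on total work comes from. Only nodes reached by pushes ever appear in `p` or `r`, so the cost does not depend on the size of the graph.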

arXiv:2410.14961 [pdf, other]

LangGFM: A Large Language Model Alone Can be a Powerful Graph Foundation Model

Authors: Tianqianjin Lin, Pengwei Yan, Kaisong Song, Zhuoren Jiang, Yangyang Kang, Jun Lin, Weikang Yuan, Junjie Cao, Changlong Sun, Xiaozhong Liu

Abstract: Graph foundation models (GFMs) have recently gained significant attention. However, the unique data processing and evaluation setups employed by different studies hinder a deeper understanding of their progress. Additionally, current research tends to focus on specific subsets of graph learning tasks, such as structural tasks, node-level tasks, or classification tasks. As a result, they often incorporate specialized modules tailored to particular task types, losing their applicability to other graph learning tasks and contradicting the original intent of foundation models to be universal. Therefore, to enhance consistency, coverage, and diversity across domains, tasks, and research interests within the graph learning community in the evaluation of GFMs, we propose GFMBench, a systematic and comprehensive benchmark comprising 26 datasets. Moreover, we introduce LangGFM, a novel GFM that relies entirely on large language models. By revisiting and exploring the effective graph textualization principles, as well as repurposing successful techniques from graph augmentation and graph self-supervised learning within the language space, LangGFM achieves performance on par with or exceeding the state of the art across GFMBench, which can offer us new perspectives, experiences, and baselines to drive forward the evolution of GFMs.

Submitted 18 October, 2024; originally announced October 2024.

Comments: under review

arXiv:2410.14940 [pdf, other]

Baichuan Alignment Technical Report

Authors: Mingan Lin, Fan Yang, Yanjun Shen, Haoze Sun, Tianpeng Li, Tao Zhang, Chenzheng Zhu, Tao Zhang, Miao Zheng, Xu Li, Yijie Zhou, Mingyang Chen, Yanzhao Qin, Youquan Li, Hao Liang, Fei Li, Yadong Li, Mang Wang, Guosheng Dong, Kun Fang, Jianhua Xu, Bin Cui, Wentao Zhang, Zenan Zhou, Weipeng Chen

Abstract: We introduce Baichuan Alignment, a detailed analysis of the alignment techniques employed in the Baichuan series of models. This represents the industry's first comprehensive account of alignment methodologies, offering valuable insights for advancing AI research. We investigate the critical components that enhance model performance during the alignment process, including optimization methods, data strategies, capability enhancements, and evaluation processes. The process spans three key stages: Prompt Augmentation System (PAS), Supervised Fine-Tuning (SFT), and Preference Alignment. The problems encountered, the solutions applied, and the improvements made are thoroughly recorded. Through comparisons across well-established benchmarks, we highlight the technological advancements enabled by Baichuan Alignment. Baichuan-Instruct is an internal model, while Qwen2-Nova-72B and Llama3-PBM-Nova-70B are instruct versions of the Qwen2-72B and Llama-3-70B base models, optimized through Baichuan Alignment. Baichuan-Instruct demonstrates significant improvements in core capabilities, with user experience gains ranging from 17% to 28%, and performs exceptionally well on specialized benchmarks. In open-source benchmark evaluations, both Qwen2-Nova-72B and Llama3-PBM-Nova-70B consistently outperform their respective official instruct versions across nearly all datasets. This report aims to clarify the key technologies behind the alignment process, fostering a deeper understanding within the community. Llama3-PBM-Nova-70B model is available at https://huggingface.co/PKU-Baichuan-MLSystemLab/Llama3-PBM-Nova-70B.

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.14932 [pdf, other]

Can AI weather models predict out-of-distribution gray swan tropical cyclones?

Authors: Y. Qiang Sun, Pedram Hassanzadeh, Mohsen Zand, Ashesh Chattopadhyay, Jonathan Weare, Dorian S. Abbot

Abstract: Predicting gray swan weather extremes, which are possible but so rare that they are absent from the training dataset, is a major concern for AI weather/climate models. An important open question is whether AI models can extrapolate from weaker weather events present in the training set to stronger, unseen weather extremes. To test this, we train independent versions of the AI model FourCastNet on the 1979-2015 ERA5 dataset with all data, or with Category 3-5 tropical cyclones (TCs) removed, either globally or only over the North Atlantic or Western Pacific basin. We then test these versions of FourCastNet on 2018-2023 Category 5 TCs (gray swans). All versions yield similar accuracy for global weather, but the one trained without Category 3-5 TCs cannot accurately forecast Category 5 TCs, indicating that these models cannot extrapolate from weaker storms. The versions trained without Category 3-5 TCs in one basin show some skill forecasting Category 5 TCs in that basin, suggesting that FourCastNet can generalize across tropical basins. This is encouraging and surprising because regional information is implicitly encoded in inputs. No version satisfies gradient-wind balance, implying that enforcing such physical constraints may not improve generalizability to gray swans. Given that current state-of-the-art AI weather/climate models have similar learning strategies, we expect our findings to apply to other models and extreme events. Our work demonstrates that novel learning strategies are needed for AI weather/climate models to provide early warning or estimated statistics for the rarest, most impactful weather extremes.

Submitted 22 October, 2024; v1 submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.14900 [pdf, other]

DRACO: Differentiable Reconstruction for Arbitrary CBCT Orbits

Authors: Chengze Ye, Linda-Sophie Schneider, Yipeng Sun, Mareike Thies, Siyuan Mei, Andreas Maier

Abstract: This paper introduces a novel method for reconstructing cone beam computed tomography (CBCT) images for arbitrary orbits using a differentiable shift-variant filtered backprojection (FBP) neural network. Traditional CBCT reconstruction methods for arbitrary orbits, like iterative reconstruction algorithms, are computationally expensive and memory-intensive. The proposed method addresses these challenges by employing a shift-variant FBP algorithm optimized for arbitrary trajectories through a deep learning approach that adapts to a specific orbit geometry. This approach overcomes the limitations of existing techniques by integrating known operators into the learning model, minimizing the number of parameters, and improving the interpretability of the model. The proposed method is a significant advancement in interventional medical imaging, particularly for robotic C-arm CT systems, enabling faster and more accurate CBCT reconstructions with customized orbits. In particular, this method can also be used for the analytical reconstruction of non-continuous orbits like circular plus arc. The experimental results demonstrate that the proposed method significantly accelerates the reconstruction process compared to conventional iterative algorithms. It achieves comparable or superior image quality, as evidenced by metrics such as the mean squared error (MSE), the peak signal-to-noise ratio (PSNR), and the structural similarity index measure (SSIM). The validation experiments show that the method can handle data from different trajectories, demonstrating its flexibility and robustness across different scan geometries. Our method demonstrates a significant improvement, particularly for the sinusoidal trajectory, achieving a 38.6% reduction in MSE, a 7.7% increase in PSNR, and a 5.0% improvement in SSIM. Furthermore, the computation time for reconstruction was reduced by more than 97%.

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.14882 [pdf]

Multi-diseases detection with memristive system on chip

Authors: Zihan Wang, Daniel W. Yang, Zerui Liu, Evan Yan, Heming Sun, Ning Ge, Miao Hu, Wei Wu

Abstract: This study presents the first implementation of multilayer neural networks on a memristor/CMOS integrated system on chip (SoC) to simultaneously detect multiple diseases. To overcome limitations in medical data, generative AI techniques are used to enhance the dataset, improving the classifier's robustness and diversity. The system achieves notable performance with low latency, high accuracy (91.82%), and energy efficiency, facilitated by end-to-end execution on a memristor-based SoC with ten 256x256 crossbar arrays and an integrated on-chip processor. This research showcases the transformative potential of memristive in-memory computing hardware in accelerating machine learning applications for medical diagnostics.

Submitted 18 October, 2024; originally announced October 2024.

Comments: 14 pages, 5 figures

ACM Class: C.1.3; I.2.0

arXiv:2410.14853 [pdf, other]

DFlow: Diverse Dialogue Flow Simulation with Large Language Models

Authors: Wanyu Du, Song Feng, James Gung, Lijia Sun, Yi Zhang, Saab Mansour, Yanjun Qi

Abstract: Developing language model-based dialogue agents requires effective data to train models that can follow specific task logic. However, most existing data augmentation methods focus on increasing diversity in language, topics, or dialogue acts at the utterance level, largely neglecting a critical aspect of task logic diversity at the dialogue level. This paper proposes a novel data augmentation method designed to enhance the diversity of synthetic dialogues by focusing on task execution logic. Our method uses LLMs to generate decision tree-structured task plans, which enables the derivation of diverse dialogue trajectories for a given task. Each trajectory, referred to as a "dialog flow", guides the generation of a multi-turn dialogue that follows a unique trajectory. We apply this method to generate a task-oriented dialogue dataset comprising 3,886 dialogue flows across 15 different domains. We validate the effectiveness of this dataset using the next action prediction task, where models fine-tuned on our dataset outperform strong baselines, including GPT-4. Upon acceptance of this paper, we plan to release the code and data publicly.

Submitted 18 October, 2024; originally announced October 2024.

Comments: 16 pages

arXiv:2410.14804 [pdf, other]

SMILES: Discovery of Higher Ionizing Photon Production Efficiency in Overdense Regions

Authors: Yongda Zhu, Stacey Alberts, Jianwei Lyu, Jane Morrison, George H. Rieke, Yang Sun, Jakob M. Helton, Zhiyuan Ji, Rachana Bhatawdekar, Nina Bonaventura, Andrew J. Bunker, Xiaojing Lin, Marcia J. Rieke, Pierluigi Rinaldi, Irene Shivaei, Christopher N. A. Willmer, Junyu Zhang

Abstract: The topology of reionization and the environments where galaxies efficiently produce ionizing photons are key open questions. For the first time, we investigate the correlation between ionizing photon production efficiency, $ξ_{\rm ion}$, and galaxy overdensity, $\log(1+δ)$. We analyze the ionizing properties of 93 galaxies between $0.7 < z < 6.9$ using JWST NIRSpec medium-resolution spectra from the Systematic Mid-infrared Instrument (MIRI) Legacy Extragalactic Survey (SMILES) program. Among these, 67 galaxies have H$α$ coverage, spanning $0.7 < z < 3.7$. The galaxy overdensity, $\log(1+δ)$, is measured using the JADES photometric catalog, which covers the SMILES footprint. For the subset with H$α$ coverage, we find that $\logξ_{\rm ion}$ is positively correlated with $\log(1+δ)$, with a slope of $0.94_{-0.46}^{+0.46}$. Additionally, the mean $ξ_{\rm ion}$ for galaxies in overdense regions ($\log(1+δ) > 0.1$) is 2.43 times that of galaxies in lower density regions ($\log(1+δ) < 0.1$). This strong correlation is found to be independent of redshift evolution. Furthermore, our results confirm the robust correlations between $ξ_{\rm ion}$ and the rest-frame equivalent widths of the [O III] or H$α$ emission lines. Our results suggest that galaxies in high-density regions are efficient producers of ionizing photons.

Submitted 18 October, 2024; originally announced October 2024.

Comments: 14 pages, 7 figures, 1 table. Submitted to AAS journals. The machine-readable table will be made available upon acceptance

arXiv:2410.14795 [pdf, other]

Cross-Document Event-Keyed Summarization

Authors: William Walden, Pavlo Kuchmiichuk, Alexander Martin, Chihsheng Jin, Angela Cao, Claire Sun, Curisia Allen, Aaron Steven White

Abstract: Event-keyed summarization (EKS) requires generating a summary about a specific event described in a document, given the document and an event representation extracted from it. In this work, we extend EKS to the cross-document setting (CDEKS), in which summaries must synthesize information from accounts of the same event given by multiple sources. We introduce SEAMUS (Summaries of Events Across Multiple Sources), a high-quality dataset for CDEKS based on an expert reannotation of the FAMUS dataset for cross-document argument extraction. We present a suite of baselines on SEAMUS, covering both smaller, fine-tuned models, as well as zero- and few-shot prompted LLMs, along with detailed ablations, and a human evaluation study, showing SEAMUS to be a valuable benchmark for this new task.

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.14660 [pdf, other]

A Large Language Model-Driven Reward Design Framework via Dynamic Feedback for Reinforcement Learning

Authors: Shengjie Sun, Runze Liu, Jiafei Lyu, Jing-Wen Yang, Liangpeng Zhang, Xiu Li

Abstract: Large Language Models (LLMs) have shown significant potential in designing reward functions for Reinforcement Learning (RL) tasks. However, obtaining high-quality reward code often involves human intervention, numerous LLM queries, or repetitive RL training. To address these issues, we propose CARD, an LLM-driven Reward Design framework that iteratively generates and improves reward function code. Specifically, CARD includes a Coder that generates and verifies the code, while an Evaluator provides dynamic feedback to guide the Coder in improving the code, eliminating the need for human feedback. In addition to process feedback and trajectory feedback, we introduce Trajectory Preference Evaluation (TPE), which evaluates the current reward function based on trajectory preferences. If the code fails the TPE, the Evaluator provides preference feedback, avoiding RL training at every iteration and making the reward function better aligned with the task objective. Empirical results on Meta-World and ManiSkill2 demonstrate that our method achieves an effective balance between task performance and token efficiency, outperforming or matching the baselines across all tasks. On 10 out of 12 tasks, CARD shows better or comparable performance to policies trained with expert-designed rewards, and our method even surpasses the oracle on 3 tasks.

Submitted 18 October, 2024; originally announced October 2024.

arXiv:2410.14605 [pdf, ps, other]

Universal sums via products of Ramanujan's theta functions

Authors: Nasser Abdo Saeed Bulkhali, Zhi-Wei Sun

Abstract: An integer-valued polynomial $P(x,y,z)$ is said to be universal (over $\mathbb Z$) if each nonnegative integer can be written as $P(x,y,z)$ with $x,y,z\in\mathbb Z$. In this paper, we mainly introduce a new technique to determine the universality of some sums in the form $x(a_1x+a_2)/2+y(b_1y+b_2)/2+z(c_1z+c_2)/2$ conjectured by Sun, using various identities of Ramanujan's theta functions.

Submitted 18 October, 2024; originally announced October 2024.

Comments: 20 pages

MSC Class: 11D72; 11E20; 11E25; 11F27; 14H42
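The definition of universality in the abstract above can be illustrated with a small brute-force search. This is only a sketch, not the paper's theta-function technique: the function names `term` and `missing_up_to` and the search parameters `limit` and `bound` are hypothetical, and a finite search can expose integers that are not represented but can never prove universality.

```python
from itertools import product


def term(coeffs, v):
    """Value of v*(a1*v + a2)/2 for coeffs = (a1, a2).

    Assumes the term is integer-valued, i.e. v*(a1*v + a2) is even
    for every integer v, so the floor division below is exact.
    """
    a1, a2 = coeffs
    return v * (a1 * v + a2) // 2


def missing_up_to(a, b, c, limit, bound):
    """Nonnegative integers n <= limit NOT of the form
    x(a1 x + a2)/2 + y(b1 y + b2)/2 + z(c1 z + c2)/2
    with |x|, |y|, |z| <= bound (a hypothetical search window).
    """
    window = range(-bound, bound + 1)
    # Distinct values each one-variable term can take inside the window.
    xs = {term(a, v) for v in window}
    ys = {term(b, v) for v in window}
    zs = {term(c, v) for v in window}
    hit = set()
    for s, t, u in product(xs, ys, zs):
        n = s + t + u
        if 0 <= n <= limit:
            hit.add(n)
    return sorted(set(range(limit + 1)) - hit)


# Three triangular numbers x(x+1)/2: no gaps, consistent with Gauss's theorem.
print(missing_up_to((1, 1), (1, 1), (1, 1), limit=200, bound=25))
# Three squares x(2x+0)/2 = x^2: gaps appear, so this sum is not universal.
print(missing_up_to((2, 0), (2, 0), (2, 0), limit=30, bound=6))
```

The first call returns an empty list because every nonnegative integer is a sum of three triangular numbers. The second returns the integers up to 30 of the form $4^a(8b+7)$, which are not sums of three squares.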

arXiv:2410.14210 [pdf, other]

Shape Transformation Driven by Active Contour for Class-Imbalanced Semi-Supervised Medical Image Segmentation

Authors: Yuliang Gu, Yepeng Liu, Zhichao Sun, Jinchi Zhu, Yongchao Xu, Laurent Najman

Abstract: Annotating 3D medical images demands expert knowledge and is time-consuming. As a result, semi-supervised learning (SSL) approaches have gained significant interest in 3D medical image segmentation. The significant size differences among various organs in the human body lead to imbalanced class distribution, which is a major challenge in the real-world application of these SSL approaches. To address this issue, we develop a novel Shape Transformation driven by Active Contour (STAC) that enlarges smaller organs to alleviate imbalanced class distribution across different organs. Inspired by curve evolution theory in active contour methods, STAC employs a signed distance function (SDF) as the level set function to implicitly represent the shape of organs, and deforms voxels in the direction of the steepest descent of the SDF (i.e., the normal vector). To ensure that the voxels far from expansion organs remain unchanged, we design an SDF-based weight function to control the degree of deformation for each voxel. We then use STAC as a data-augmentation process during the training stage. Experimental results on two benchmark datasets demonstrate that the proposed method significantly outperforms some state-of-the-art methods. Source code is publicly available at https://github.com/GuGuLL123/STAC.

Submitted 18 October, 2024; originally announced October 2024.

Journal ref: 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Dec 2024, Lisbon, Portugal

Showing 1–50 of 28,992 results for author: Sun