-
On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability
Authors:
Kevin Wang,
Junbo Li,
Neel P. Bhatt,
Yihan Xi,
Qiang Liu,
Ufuk Topcu,
Zhangyang Wang
Abstract:
Recent advancements in Large Language Models (LLMs) have showcased their ability to perform complex reasoning tasks, but their effectiveness in planning remains underexplored. In this study, we evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks, focusing on three key aspects: feasibility, optimality, and generalizability. Through empirical evaluations on constraint-heavy tasks (e.g., $\textit{Barman}$, $\textit{Tyreworld}$) and spatially complex environments (e.g., $\textit{Termes}$, $\textit{Floortile}$), we highlight o1-preview's strengths in self-evaluation and constraint-following, while also identifying bottlenecks in decision-making and memory management, particularly in tasks requiring robust spatial reasoning. Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints and managing state transitions in structured environments. However, the model often generates suboptimal solutions with redundant actions and struggles to generalize effectively in spatially complex tasks. This pilot study provides foundational insights into the planning limitations of LLMs, offering key directions for future research on improving memory management, decision-making, and generalization in LLM-based planning. Code available at https://github.com/VITA-Group/o1-planning.
Submitted 13 October, 2024; v1 submitted 29 September, 2024;
originally announced September 2024.
-
A new baseline for edge detection: Make Encoder-Decoder great again
Authors:
Yachuan Li,
Xavier Soria Pomab,
Yongke Xi,
Guanlin Li,
Chaozhi Yang,
Qian Xiao,
Yun Bai,
Zongmin LI
Abstract:
The performance of deep-learning-based edge detectors has far exceeded that of humans, but their huge computational cost and complex training strategies hinder further development and application. In this paper, we eliminate these complexities with a vanilla encoder-decoder based detector. First, we design a bilateral encoder to decouple the extraction of location features and semantic features. Since the location branch no longer provides cues for the semantic branch, the richness of features can be further compressed, which is the key to making our model more compact. We propose a cascaded feature fusion decoder, where the location features are progressively refined by semantic features. The refined location features are the only basis for generating the edge map. The coarse original location features and the semantic features are kept from direct contact with the final result, so the noise in the location features and the location error in the semantic features can be suppressed in the generated edge map. The proposed New Baseline for Edge Detection (NBED) achieves superior performance consistently across multiple edge detection benchmarks, even compared with methods that have huge computational cost and complex training strategies. The ODS of NBED on BSDS500 is 0.838, achieving state-of-the-art performance. Our study shows that what really matters in current edge detection is high-quality features, and that we can make the encoder-decoder based detector great again even without complex training strategies and huge computational cost. The code is available at https://github.com/Li-yachuan/NBED.
Submitted 23 September, 2024;
originally announced September 2024.
-
A Survey on Diffusion Models for Recommender Systems
Authors:
Jianghao Lin,
Jiaqi Liu,
Jiachen Zhu,
Yunjia Xi,
Chengkai Liu,
Yangtian Zhang,
Yong Yu,
Weinan Zhang
Abstract:
While traditional recommendation techniques have made significant strides in the past decades, they still suffer from limited generalization performance caused by factors like inadequate collaborative signals, weak latent representations, and noisy data. In response, diffusion models (DMs) have emerged as promising solutions for recommender systems due to their robust generative capabilities, solid theoretical foundations, and improved training stability. To this end, in this paper, we present the first comprehensive survey on diffusion models for recommendation, and draw a bird's-eye view from the perspective of the whole pipeline in real-world recommender systems. We systematically categorize existing research works into three primary domains: (1) diffusion for data engineering & encoding, focusing on data augmentation and representation enhancement; (2) diffusion as recommender models, employing diffusion models to directly estimate user preferences and rank items; and (3) diffusion for content presentation, utilizing diffusion models to generate personalized content such as fashion and advertisement creatives. Our taxonomy highlights the unique strengths of diffusion models in capturing complex data distributions and generating high-quality, diverse samples that closely align with user preferences. We also summarize the core characteristics of the adapting diffusion models for recommendation, and further identify key areas for future exploration, which helps establish a roadmap for researchers and practitioners seeking to advance recommender systems through the innovative application of diffusion models. To further facilitate the research community of recommender systems based on diffusion models, we actively maintain a GitHub repository for papers and other related resources in this rising direction https://github.com/CHIANGEL/Awesome-Diffusion-for-RecSys.
Submitted 15 September, 2024; v1 submitted 8 September, 2024;
originally announced September 2024.
-
Efficient and Deployable Knowledge Infusion for Open-World Recommendations via Large Language Models
Authors:
Yunjia Xi,
Weiwen Liu,
Jianghao Lin,
Muyan Weng,
Xiaoling Cai,
Hong Zhu,
Jieming Zhu,
Bo Chen,
Ruiming Tang,
Yong Yu,
Weinan Zhang
Abstract:
Recommender systems (RSs) play a pervasive role in today's online services, yet their closed-loop nature constrains their access to open-world knowledge. Recently, large language models (LLMs) have shown promise in bridging this gap. However, previous attempts to directly implement LLMs as recommenders fall short in meeting the requirements of industrial RSs, particularly in terms of online inference latency and offline resource efficiency. Thus, we propose REKI to acquire two types of external knowledge about users and items from LLMs. Specifically, we introduce factorization prompting to elicit accurate knowledge reasoning on user preferences and items. We develop individual knowledge extraction and collective knowledge extraction tailored for different scales of scenarios, effectively reducing offline resource consumption. Subsequently, generated knowledge undergoes efficient transformation and condensation into augmented vectors through a hybridized expert-integrated network, ensuring compatibility. The obtained vectors can then be used to enhance any conventional recommendation model. We also ensure efficient inference by preprocessing and prestoring the knowledge from LLMs. Experiments demonstrate that REKI outperforms state-of-the-art baselines and is compatible with lots of recommendation algorithms and tasks. Now, REKI has been deployed to Huawei's news and music recommendation platforms and gained a 7% and 1.99% improvement during the online A/B test.
Submitted 19 August, 2024;
originally announced August 2024.
-
Posterior Covariance Structures in Gaussian Processes
Authors:
Difeng Cai,
Edmond Chow,
Yuanzhe Xi
Abstract:
In this paper, we present a comprehensive analysis of the posterior covariance field in Gaussian processes, with applications to the posterior covariance matrix. The analysis is based on the Gaussian prior covariance but the approach also applies to other covariance kernels. Our geometric analysis reveals how the Gaussian kernel's bandwidth parameter and the spatial distribution of the observations influence the posterior covariance as well as the corresponding covariance matrix, enabling straightforward identification of areas with high or low covariance in magnitude. Drawing inspiration from the a posteriori error estimation techniques in adaptive finite element methods, we also propose several estimators to efficiently measure the absolute posterior covariance field, which can be used for efficient covariance matrix approximation and preconditioning. We conduct a wide range of experiments to illustrate our theoretical findings and their practical applications.
Submitted 14 August, 2024;
originally announced August 2024.
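For orientation, the posterior covariance field discussed in the abstract above can be written in the standard Gaussian process form. This is the textbook expression, not a result quoted from the paper; the symbols $X$ (observation locations), $σ^2$ (noise variance), and $\ell$ (bandwidth of the Gaussian kernel) are notation assumed here for illustration.

$$
k_{\mathrm{post}}(x, x') = k(x, x') - k(x, X)\,\bigl[K(X, X) + σ^2 I\bigr]^{-1} k(X, x'),
\qquad
k(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\ell^2}\right)
$$

The second term is what the observations subtract from the prior covariance, so its size depends on how close $x$ and $x'$ are to the observation set $X$ relative to the bandwidth $\ell$.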
-
A Decoding Acceleration Framework for Industrial Deployable LLM-based Recommender Systems
Authors:
Yunjia Xi,
Hangyu Wang,
Bo Chen,
Jianghao Lin,
Menghui Zhu,
Weiwen Liu,
Ruiming Tang,
Weinan Zhang,
Yong Yu
Abstract:
Recently, increasing attention has been paid to LLM-based recommender systems, but their deployment is still under exploration in the industry. Most deployments utilize LLMs as feature enhancers, generating augmentation knowledge in the offline stage. However, in recommendation scenarios, involving numerous users and items, even offline generation with LLMs consumes considerable time and resources. This generation inefficiency stems from the autoregressive nature of LLMs, and a promising direction for acceleration is speculative decoding, a Draft-then-Verify paradigm that increases the number of generated tokens per decoding step. In this paper, we first identify that recommendation knowledge generation is suitable for retrieval-based speculative decoding. Then, we discern two characteristics: (1) extensive items and users in RSs bring retrieval inefficiency, and (2) RSs exhibit high diversity tolerance for text generated by LLMs. Based on the above insights, we propose a Decoding Acceleration Framework for LLM-based Recommendation (dubbed DARE), with Customized Retrieval Pool to improve retrieval efficiency and Relaxed Verification to increase the acceptance rate of draft tokens, respectively. Extensive experiments demonstrate that DARE achieves a 3-5x speedup and is compatible with various frameworks and backbone LLMs. DARE has also been deployed to online advertising scenarios within a large-scale commercial environment, achieving a 3.45x speedup while maintaining the downstream performance.
Submitted 10 August, 2024;
originally announced August 2024.
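The Draft-then-Verify idea named in the abstract above can be illustrated with a toy sketch. Everything here is hypothetical and is not DARE's implementation: `retrieve_draft`, `verify`, `toy_top_k`, and the `RANKED` table are invented names, the "target model" is a lookup table, and accepting a draft token when it falls in the target model's top-k is only one assumed way that a relaxed verification rule might look.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for a target LLM: ranked next-token candidates per last token.
RANKED: Dict[str, List[str]] = {
    "user": ["likes", "enjoys"],
    "likes": ["sci-fi", "action"],
    "enjoys": ["sci-fi", "drama"],
    "sci-fi": ["movies", "novels"],
    "movies": ["<eos>"],
    "novels": ["<eos>"],
}


def toy_top_k(prefix: Sequence[str], k: int) -> List[str]:
    """Return the toy model's k best next tokens for the given prefix."""
    return RANKED.get(prefix[-1], ["<eos>"])[:k]


def retrieve_draft(context: Sequence[str], pool: Sequence[Sequence[str]],
                   max_len: int = 4) -> List[str]:
    """Draft step: copy the tokens that follow the last context token in the pool."""
    if not context:
        return []
    last = context[-1]
    for seq in pool:
        for i, tok in enumerate(seq):
            if tok == last and i + 1 < len(seq):
                return list(seq[i + 1:i + 1 + max_len])
    return []


def verify(context: Sequence[str], draft: Sequence[str],
           top_k_fn: Callable[[Sequence[str], int], List[str]],
           k: int = 1) -> List[str]:
    """Verify step: keep draft tokens while each is among the model's top-k.

    k=1 is strict greedy verification; k>1 is the assumed relaxed variant.
    """
    accepted: List[str] = []
    for tok in draft:
        candidates = top_k_fn(list(context) + accepted, k)
        if tok in candidates:
            accepted.append(tok)
        else:
            # First mismatch: fall back to the model's own best token and stop.
            accepted.append(candidates[0])
            break
    else:
        # Whole draft accepted: the model still contributes one extra token.
        accepted.append(top_k_fn(list(context) + accepted, k)[0])
    return accepted


pool = [["user", "enjoys", "sci-fi", "novels"]]
draft = retrieve_draft(["user"], pool)
strict = verify(["user"], draft, toy_top_k, k=1)
relaxed = verify(["user"], draft, toy_top_k, k=2)
```

In this toy run the strict check rejects the first draft token and yields a single token for the step, while the relaxed check accepts the whole draft, which is the kind of acceptance-rate gain the abstract attributes to relaxed verification.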
-
Respiratory Subtraction for Pulmonary Microwave Ablation Evaluation
Authors:
Wan Li,
Xinyun Zhong,
Wei Li,
Song Zhang,
Moheng Rong,
Yan Xi,
Peng Yuan,
Zechen Wang,
Xiaolei Jiang,
Rongxi Yi,
Hui Tang,
Yang Chen,
Chaohui Tong,
Zhan Wu,
Feng Wang
Abstract:
Currently, lung cancer is a leading cause of global cancer mortality, often necessitating minimally invasive interventions. Microwave ablation (MWA) is extensively utilized for both primary and secondary lung tumors. Although numerous clinical guidelines and standards for MWA have been established, the clinical evaluation of ablation surgery remains challenging and requires long-term patient follow-up for confirmation. In this paper, we propose a method termed respiratory subtraction to evaluate lung tumor ablation therapy performance based on pre- and post-operative image guidance. Initially, preoperative images undergo coarse rigid registration to their corresponding postoperative positions, followed by further non-rigid registration. Subsequently, subtraction images are generated by subtracting the registered preoperative images from the postoperative ones. Furthermore, to enhance the clinical assessment of MWA treatment performance, we devise a quantitative analysis metric to evaluate ablation efficacy by comparing differences between tumor areas and treatment areas. To the best of our knowledge, this is the pioneering work in the field to facilitate the assessment of MWA surgery performance on pulmonary tumors. Extensive experiments involving 35 clinical cases further validate the efficacy of the respiratory subtraction method. The experimental results confirm the effectiveness of the respiratory subtraction method and the proposed quantitative evaluation metric in assessing lung tumor treatment.
Submitted 8 August, 2024;
originally announced August 2024.
-
Hearing the shape of a drum by knocking around
Authors:
Xing Wang,
Emmett L. Wyman,
Yakun Xi
Abstract:
We study a variation of Kac's question, "Can one hear the shape of a drum?" if we allow ourselves access to some additional information. In particular, we allow ourselves to "hear" the local Weyl counting function at each point on the manifold and ask if this is enough to uniquely recover the Riemannian metric. This is physically equivalent to asking whether one can determine the shape of a drum if one is allowed to knock at any place on the drum. We show that the answer to this question is "yes" provided the Laplace-Beltrami spectrum of the drum is simple. We also provide a counterexample illustrating why this hypothesis is necessary.
Submitted 26 July, 2024;
originally announced July 2024.
-
Numerical simulations of attachment-line boundary layer in hypersonic flow, Part II: the features of three-dimensional turbulent boundary layer
Authors:
Youcheng Xi,
Bowen Yan,
Guangwen Yang,
Song Fu
Abstract:
In this study, we investigate the characteristics of three-dimensional turbulent boundary layers influenced by transverse flow and pressure gradients. Our findings reveal that even without assuming an infinite sweep, a fully developed turbulent boundary layer over the present swept blunt body maintains spanwise homogeneity, consistent with infinite sweep assumptions. We critically examine the law-of-the-wall and temperature-velocity relationships, typically applied to two-dimensional turbulent boundary layers, in three-dimensional contexts. Results show that with transverse velocity and pressure gradient, streamwise velocity adheres to classical velocity transformation relationships and the predictive accuracy of classical temperature-velocity relationships diminishes because of pressure gradient. We show that near-wall streak structures persist and correspond with energetic structures in the outer region, though three-dimensional effects redistribute energy to align more with the external flow direction. Analysis of shear Reynolds stress and mean flow shear directions reveals in near-wall regions with low transverse flow velocity, but significant deviations at higher transverse velocities. The introduction of transverse pressure gradients together with the transverse velocities alters the velocity profile and mean flow shear directions, with shear Reynolds stress experiencing similar changes but with a lag increasing with transverse. Consistent directional alignment in outer regions suggests a partitioned relationship between shear Reynolds stress and mean flow shear: nonlinear in the inner region and approximately linear in the outer region.
Submitted 22 July, 2024;
originally announced July 2024.
-
Numerical simulations of attachment-line boundary layer in hypersonic flow, Part I: roughness-induced subcritical transitions
Authors:
Youcheng Xi,
Bowen Yan,
Guangwen Yang,
Xinguo Sha,
Dehua Zhu,
Song Fu
Abstract:
The attachment-line boundary layer is critical in hypersonic flows because of its significant impact on heat transfer and aerodynamic performance. In this study, high-fidelity numerical simulations are conducted to analyze the subcritical roughness-induced laminar-turbulent transition at the leading-edge attachment-line boundary layer of a blunt swept body under hypersonic conditions. This simulation represents a significant advancement by successfully reproducing the complete leading-edge contamination process induced by surface roughness elements in a realistic configuration, thereby providing previously unattainable insights. Two roughness elements of different heights are examined. For the lower-height roughness element, additional unsteady perturbations are required to trigger a transition in the wake, suggesting that the flow field around the roughness element acts as a disturbance amplifier for upstream perturbations. Conversely, a higher roughness element can independently induce the transition. A low-frequency absolute instability is detected behind the roughness, leading to the formation of streaks. The secondary instabilities of these streaks are identified as the direct cause of the final transition.
Submitted 22 July, 2024;
originally announced July 2024.
-
Low latency carbon budget analysis reveals a large decline of the land carbon sink in 2023
Authors:
Piyu Ke,
Philippe Ciais,
Stephen Sitch,
Wei Li,
Ana Bastos,
Zhu Liu,
Yidi Xu,
Xiaofan Gui,
Jiang Bian,
Daniel S Goll,
Yi Xi,
Wanjing Li,
Michael O'Sullivan,
Jeffeson Goncalves de Souza,
Pierre Friedlingstein,
Frederic Chevallier
Abstract:
In 2023, the CO2 growth rate was 3.37 +/- 0.11 ppm at Mauna Loa, 86% above the previous year, and hitting a record high since observations began in 1958, while global fossil fuel CO2 emissions only increased by 0.6 +/- 0.5%. This implies an unprecedented weakening of land and ocean sinks, and raises the question of where and why this reduction happened. Here we show a global net land CO2 sink of 0.44 +/- 0.21 GtC yr-1, the weakest since 2003. We used dynamic global vegetation models, satellite fire emissions, an atmospheric inversion based on OCO-2 measurements, and emulators of ocean biogeochemical and data-driven models to deliver a fast-track carbon budget in 2023. Those models ensured consistency with previous carbon budgets. Regional flux anomalies from 2015-2022 are consistent between top-down and bottom-up approaches, with the largest abnormal carbon loss in the Amazon during the drought in the second half of 2023 (0.31 +/- 0.19 GtC yr-1), extreme fire emissions of 0.58 +/- 0.10 GtC yr-1 in Canada and a loss in South-East Asia (0.13 +/- 0.12 GtC yr-1). Since 2015, land CO2 uptake north of 20 degrees N declined by half to 1.13 +/- 0.24 GtC yr-1 in 2023. Meanwhile, the tropics recovered from the 2015-16 El Nino carbon loss, gained carbon during the La Nina years (2020-2023), then switched to a carbon loss during the 2023 El Nino (0.56 +/- 0.23 GtC yr-1). The ocean sink was stronger than normal in the equatorial eastern Pacific due to reduced upwelling from La Nina's retreat in early 2023 and the development of El Nino later. Land regions exposed to extreme heat in 2023 contributed a gross carbon loss of 1.73 GtC yr-1, indicating that record warming in 2023 had a strong negative impact on the capacity of terrestrial ecosystems to mitigate climate change.
Submitted 17 July, 2024;
originally announced July 2024.
-
Achieving Peta-Ohm Resistance for Semi-Insulating 4H-SiC Devices by Atomic Layer Deposition
Authors:
Yuying Xi,
Helios Y. Li,
Guohui Li,
Qingmei Su,
Kaili Mao,
Bingshe Xu,
Yuying Hao,
Nicholas X. Fang,
Yanxia Cui
Abstract:
Growing demands for precise current measurements, such as atto-ampere-level measurement of cross-cellular biological current transduction, have spotlighted a pressing need for low-noise resistors with ultra-high resistance immune to voltage fluctuations. Traditional semi-insulating materials, however, struggle to provide consistent resistance across varying voltages. To bridge this gap, we introduce a design that integrates semi-insulating 4H-SiC with atomic-level metal oxide interlayers and electrodes. The strategic adjustment of surface states via atomic-scale metal oxide layers optimizes the work functions on 4H-SiC surfaces, validated through density functional theory simulations. This design transcends conventional limitations, establishing ideal Ohmic behavior and maintaining Peta-Ohm-level resistance, unaffected by voltage variations. These on-chip devices with fine-tuned resistance are compatible with integrated circuit manufacturing processes, making them ideally suited for applications in precision electronics.
Submitted 14 July, 2024;
originally announced July 2024.
-
MemoCRS: Memory-enhanced Sequential Conversational Recommender Systems with Large Language Models
Authors:
Yunjia Xi,
Weiwen Liu,
Jianghao Lin,
Bo Chen,
Ruiming Tang,
Weinan Zhang,
Yong Yu
Abstract:
Conversational recommender systems (CRSs) aim to capture user preferences and provide personalized recommendations through multi-round natural language dialogues. However, most existing CRS models mainly focus on dialogue comprehension and preferences mining from the current dialogue session, overlooking user preferences in historical dialogue sessions. The preferences embedded in the user's historical dialogue sessions and the current session exhibit continuity and sequentiality, and we refer to CRSs with this characteristic as sequential CRSs. In this work, we leverage memory-enhanced LLMs to model the preference continuity, primarily focusing on addressing two key issues: (1) redundancy and noise in historical dialogue sessions, and (2) the cold-start users problem. To this end, we propose a Memory-enhanced Conversational Recommender System Framework with Large Language Models (dubbed MemoCRS) consisting of user-specific memory and general memory. User-specific memory is tailored to each user for their personalized interests and implemented by an entity-based memory bank to refine preferences and retrieve relevant memory, thereby reducing the redundancy and noise of historical sessions. The general memory, encapsulating collaborative knowledge and reasoning guidelines, can provide shared knowledge for users, especially cold-start users. With the two kinds of memory, LLMs are empowered to deliver more precise and tailored recommendations for each user. Extensive experiments on both Chinese and English datasets demonstrate the effectiveness of MemoCRS.
Submitted 6 July, 2024;
originally announced July 2024.
-
Romanization Encoding For Multilingual ASR
Authors:
Wen Ding,
Fei Jia,
Hainan Xu,
Yu Xi,
Junjie Lai,
Boris Ginsburg
Abstract:
We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. By adopting romanization encoding alongside a balanced concatenated tokenizer within a FastConformer-RNNT framework equipped with a Roman2Char module, we significantly reduce vocabulary and output dimensions, enabling larger training batches and reduced memory consumption. Our method decouples acoustic modeling and language modeling, enhancing the flexibility and adaptability of the system. In our study, applying this method to Mandarin-English ASR resulted in a remarkable 63.51% vocabulary reduction and notable performance gains of 13.72% and 15.03% on SEAME code-switching benchmarks. Ablation studies on Mandarin-Korean and Mandarin-Japanese highlight our method's strong capability to address the complexities of other script-heavy languages, paving the way for more versatile and effective multilingual ASR systems.
Submitted 5 July, 2024;
originally announced July 2024.
-
Semi-supervised Learning for Code-Switching ASR with Large Language Model Filter
Authors:
Yu Xi,
Wen Ding,
Kai Yu,
Junjie Lai
Abstract:
The code-switching (CS) phenomenon occurs when words or phrases from different languages are alternated in a single sentence. Due to data scarcity, building an effective CS Automatic Speech Recognition (ASR) system remains challenging. In this paper, we propose to enhance CS-ASR systems by utilizing rich unsupervised monolingual speech data within a semi-supervised learning framework, particularly when access to CS data is limited. To achieve this, we establish a general paradigm for applying noisy student training (NST) to the CS-ASR task. Specifically, we introduce the LLM-Filter, which leverages well-designed prompt templates to activate the correction capability of large language models (LLMs) for monolingual data selection and pseudo-label refinement during NST. Our experiments on the supervised ASRU-CS and unsupervised AISHELL-2 and LibriSpeech datasets show that our method not only achieves significant improvements over supervised and semi-supervised learning baselines for the CS task, but also attains better performance compared with the fully-supervised oracle upper-bound on the CS English part. Additionally, we further investigate the influence of accent on the AESRC dataset and demonstrate that our method can achieve additional benefits when the monolingual data contains relevant linguistic characteristics.
Submitted 20 September, 2024; v1 submitted 4 July, 2024;
originally announced July 2024.
-
Topologically nontrivial $1/3$-magnetization plateau state in a spin-1/2 trimer chain
Authors:
Y. Y. Han,
B. C. Yu,
Z. Du,
L. S. Ling,
L. Zhang,
W. Tong,
C. Y. Xi,
J. L. Zhang,
T. Shang,
Li Pi,
Long Ma
Abstract:
Topologically nontrivial Haldane phase is theoretically proposed to be realized in the 1/3-magnetization ($M$) plateau of spin-1/2 trimer systems. However, the spin excitation gap, typical characteristic of Haldane phase, is not yet experimentally verified. Here, we report the nuclear magnetic resonance investigations into the low-energy spin dynamics in the $S=1/2$ spin-trimer antiferromagnetic chain compound Na$_2$Cu$_3$Ge$_{4-x}$Si$_{x}$O$_{12}$ ($x=0, 0.1\sim1.5$). In the parent compound ($x=0$), the spin-lattice relaxation rate (1/$T_1$) shows significantly different temperature dependence when the external magnetic field is increased above the critical field of $μ_0$$H_{c}$ = 29 T. The spin excitation gap is evidenced from the thermally activated behavior of $1/T_1(T)$ in the 1/3-$M$ plateau state. By substituting Ge$^{4+}$ with Si$^{4+}$, the critical field for the 1/3-$M$ plateau significantly decreases, e.g. $μ_0H_{c}=17$ T in $x=1.0$ samples, which results from the suppressed inter-trimer coupling $J_2$. The gapped spin excitation is confirmed again above 17 T, whose size shows temperature-dependent behavior for $μ_0H\geq25.72$ T. These observations provide further insights into the Haldane physics.
Submitted 3 July, 2024;
originally announced July 2024.
-
Expressive Gaussian Human Avatars from Monocular RGB Video
Authors:
Hezhen Hu,
Zhiwen Fan,
Tianhao Wu,
Yihan Xi,
Seoyoung Lee,
Georgios Pavlakos,
Zhangyang Wang
Abstract:
Nuanced expressiveness, particularly through fine-grained hand and facial expressions, is pivotal for enhancing the realism and vitality of digital human representations. In this work, we focus on investigating the expressiveness of human avatars when learned from monocular RGB video, a setting that introduces new challenges in capturing and animating fine-grained details. To this end, we introduce EVA, a drivable human model that meticulously sculpts fine details based on 3D Gaussians and SMPL-X, an expressive parametric human model. Focused on enhancing expressiveness, our work makes three key contributions. First, we highlight the critical importance of aligning the SMPL-X model with RGB frames for effective avatar learning. Recognizing the limitations of current SMPL-X prediction methods for in-the-wild videos, we introduce a plug-and-play module that significantly ameliorates misalignment issues. Second, we propose a context-aware adaptive density control strategy, which adaptively adjusts the gradient thresholds to accommodate the varied granularity across body parts. Last but not least, we develop a feedback mechanism that predicts per-pixel confidence to better guide the learning of 3D Gaussians. Extensive experiments on two benchmarks demonstrate the superiority of our framework both quantitatively and qualitatively, especially on the fine-grained hand and facial details. See the project website at https://evahuman.github.io.
Submitted 3 July, 2024;
originally announced July 2024.
-
Straggler-tolerant stationary methods for linear systems
Authors:
Vassilis Kalantzis,
Yuanzhe Xi,
Lior Horesh,
Yousef Saad
Abstract:
In this paper, we consider the iterative solution of linear algebraic equations under the condition that matrix-vector products with the coefficient matrix are computed only partially. At the same time, non-computed entries are set to zero. We assume that both the number of computed entries and their associated row index set are random variables, with the row index set sampled uniformly given the number of computed entries. This model of computations is realized in hybrid cloud computing architectures following the controller-worker distributed model under the influence of straggling workers. We propose straggler-tolerant Richardson iteration and Chebyshev semi-iterative schemes, and prove sufficient conditions for their convergence in expectation. Numerical experiments verify the presented theoretical results as well as the effectiveness of the proposed schemes on a few sparse matrix problems.
Submitted 12 October, 2024; v1 submitted 1 July, 2024;
originally announced July 2024.
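As a rough illustration of the setting described in this abstract, the following minimal sketch runs a Richardson iteration in which each step only receives a random subset of the rows of the matrix-vector product. The function name `partial_richardson`, the step size `omega`, and the choice to leave rows that did not arrive unchanged are assumptions made for illustration; they are not taken from the paper, whose schemes and convergence conditions may differ.

```python
import random


def partial_richardson(A, b, omega, num_iters, seed=0):
    """Richardson iteration x <- x + omega * (b - A x), restricted per step
    to a random subset of rows (hypothetical sketch, not the paper's scheme).

    Assumption: rows whose matrix-vector entry was not returned by a
    straggling worker are simply left unchanged in that step.
    """
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    for _ in range(num_iters):
        # The number of computed entries is random; given that number,
        # the row index set is sampled uniformly.
        m = rng.randint(1, n)
        rows = rng.sample(range(n), m)
        # Residual entries are evaluated only on the rows that "arrived",
        # all from the same iterate x.
        residual = {i: b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in rows}
        for i, r in residual.items():
            x[i] += omega * r
    return x


# Small symmetric positive definite test system whose solution is [1, 1, 1].
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]
x = partial_richardson(A, b, omega=0.3, num_iters=2000)
```

For a symmetric positive definite matrix and a step size below 2 divided by the largest eigenvalue, each masked step cannot increase the quadratic energy, which is why this toy version still converges; the Chebyshev variant mentioned in the abstract is not sketched here.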
-
Instruct-IPT: All-in-One Image Processing Transformer via Weight Modulation
Authors:
Yuchuan Tian,
Jianhong Han,
Hanting Chen,
Yuanyuan Xi,
Guoyang Zhang,
Jie Hu,
Chao Xu,
Yunhe Wang
Abstract:
Due to the unaffordable size and intensive computation costs of low-level vision models, All-in-One models that are designed to address a handful of low-level vision tasks simultaneously have become popular. However, existing All-in-One models are limited in terms of the range of tasks and performance. To overcome these limitations, we propose Instruct-IPT -- an All-in-One Image Processing Transformer that can effectively address manifold image restoration tasks with large inter-task gaps, such as denoising, deblurring, deraining, dehazing, and desnowing. Rather than popular feature adaptation methods, we propose weight modulation that adapts weights to specific tasks. Firstly, we identify task-sensitive weights via a toy experiment and introduce task-specific biases on top of them. Secondly, we conduct rank analysis for a good compression strategy and perform low-rank decomposition on the biases. Thirdly, we propose synchronous training that updates the task-general backbone model and the task-specific biases simultaneously. In this way, the model is instructed to learn general and task-specific knowledge. Via our simple yet effective method that instructs the IPT to be a task expert, Instruct-IPT can better cooperate between tasks with distinct characteristics at modest cost. Further, we propose to maneuver Instruct-IPT with text instructions for better user interfaces. We have conducted experiments on Instruct-IPT to demonstrate the effectiveness of our method on manifold tasks, and we have effectively extended our method to diffusion denoisers as well. The code is available at https://github.com/huawei-noah/Pretrained-IPT.
Submitted 30 June, 2024;
originally announced July 2024.
-
Text-aware Speech Separation for Multi-talker Keyword Spotting
Authors:
Haoyu Li,
Baochen Yang,
Yu Xi,
Linfeng Yu,
Tian Tan,
Hao Li,
Kai Yu
Abstract:
For noisy environments, ensuring the robustness of keyword spotting (KWS) systems is essential. While much research has focused on noisy KWS, less attention has been paid to multi-talker mixed speech scenarios. Unlike the usual cocktail party problem where multi-talker speech is separated using speaker clues, the key challenge here is to extract the target speech for KWS based on text clues. To address it, this paper proposes a novel Text-aware Permutation Determinization Training method for multi-talker KWS with a clue-based Speech Separation front-end (TPDT-SS). Our research highlights the critical role of SS front-ends and shows that incorporating keyword-specific clues into these models can greatly enhance the effectiveness. TPDT-SS shows remarkable success in addressing permutation problems in mixed keyword speech, thereby greatly boosting the performance of the backend. Additionally, fine-tuning our system on unseen mixed speech results in further performance improvement.
Submitted 18 June, 2024;
originally announced June 2024.
-
HoLLMwood: Unleashing the Creativity of Large Language Models in Screenwriting via Role Playing
Authors:
Jing Chen,
Xinyu Zhu,
Cheng Yang,
Chufan Shi,
Yadong Xi,
Yuxiang Zhang,
Junjie Wang,
Jiashu Pu,
Rongsheng Zhang,
Yujiu Yang,
Tian Feng
Abstract:
Generative AI has demonstrated unprecedented creativity in the field of computer vision, yet such phenomena have not been observed in natural language processing. In particular, large language models (LLMs) can hardly produce written works at the level of human experts due to the extremely high complexity of literature writing. In this paper, we present HoLLMwood, an automated framework for unleashing the creativity of LLMs and exploring their potential in screenwriting, which is a highly demanding task. Mimicking the human creative process, we assign LLMs to different roles involved in the real-world scenario. In addition to the common practice of treating LLMs as ${Writer}$, we also apply LLMs as ${Editor}$, who is responsible for providing feedback and revision advice to ${Writer}$. Besides, to enrich the characters and deepen the plots, we introduce a role-playing mechanism and adopt LLMs as ${Actors}$ that can communicate and interact with each other. Evaluations on automatically generated screenplays show that HoLLMwood substantially outperforms strong baselines in terms of coherence, relevance, interestingness and overall quality.
Submitted 17 June, 2024;
originally announced June 2024.
-
From Pixels to Progress: Generating Road Network from Satellite Imagery for Socioeconomic Insights in Impoverished Areas
Authors:
Yanxin Xi,
Yu Liu,
Zhicheng Liu,
Sasu Tarkoma,
Pan Hui,
Yong Li
Abstract:
The Sustainable Development Goals (SDGs) aim to resolve societal challenges, such as eradicating poverty and improving the lives of vulnerable populations in impoverished areas. Those areas rely on road infrastructure construction to promote accessibility and economic development. Although public data sources such as OpenStreetMap can be used to monitor road status, data completeness in impoverished areas is limited. Meanwhile, the development of deep learning techniques and satellite imagery shows excellent potential for earth monitoring. To tackle the challenge of road network assessment in impoverished areas, we develop a systematic road extraction framework combining an encoder-decoder architecture and morphological operations on satellite imagery, offering an integrated workflow for interdisciplinary researchers. Extensive experiments of road network extraction on real-world data in impoverished regions achieve a 42.7% enhancement in the F1-score over the baseline methods and reconstruct about 80% of the actual roads. We also propose a comprehensive road network dataset covering an area of approximately 794,178 km$^2$ and 17.048 million people in 382 impoverished counties in China. The generated dataset is further utilized to conduct socioeconomic analysis in impoverished counties, showing that road network construction positively impacts regional economic development. The technical appendix, code, and generated dataset can be found at https://github.com/tsinghua-fib-lab/Road_network_extraction_impoverished_counties.
Submitted 17 June, 2024;
originally announced June 2024.
-
DisCo: Towards Harmonious Disentanglement and Collaboration between Tabular and Semantic Space for Recommendation
Authors:
Kounianhua Du,
Jizheng Chen,
Jianghao Lin,
Yunjia Xi,
Hangyu Wang,
Xinyi Dai,
Bo Chen,
Ruiming Tang,
Weinan Zhang
Abstract:
Recommender systems play important roles in various applications such as e-commerce, social media, etc. Conventional recommendation methods usually model the collaborative signals within the tabular representation space. Despite the personalization modeling and the efficiency, the latent semantic dependencies are omitted. Methods that introduce semantics into recommendation then emerge, injecting knowledge from the semantic representation space where the general language understanding is compressed. However, existing semantic-enhanced recommendation methods focus on aligning the two spaces, during which the representations of the two spaces tend to get close while the unique patterns are discarded and not well explored. In this paper, we propose DisCo to Disentangle the unique patterns from the two representation spaces and Collaborate the two spaces for recommendation enhancement, where both the specificity and the consistency of the two spaces are captured. Concretely, we propose 1) a dual-side attentive network to capture the intra-domain patterns and the inter-domain patterns, 2) a sufficiency constraint to preserve the task-relevant information of each representation space and filter out the noise, and 3) a disentanglement constraint to prevent the model from discarding the unique information. These modules strike a balance between disentanglement and collaboration of the two representation spaces to produce informative pattern vectors, which could serve as extra features and be appended to arbitrary recommendation backbones for enhancement. Experimental results validate the superiority of our method against different models and the compatibility of DisCo over different backbones. Various ablation studies and efficiency analysis are also conducted to justify each model component.
Submitted 4 June, 2024; v1 submitted 20 May, 2024;
originally announced June 2024.
-
JUNO Sensitivity to Invisible Decay Modes of Neutrons
Authors:
JUNO Collaboration,
Angel Abusleme,
Thomas Adam,
Kai Adamowicz,
Shakeel Ahmad,
Rizwan Ahmed,
Sebastiano Aiello,
Fengpeng An,
Qi An,
Giuseppe Andronico,
Nikolay Anfimov,
Vito Antonelli,
Tatiana Antoshkina,
João Pedro Athayde Marcondes de André,
Didier Auguste,
Weidong Bai,
Nikita Balashov,
Wander Baldini,
Andrea Barresi,
Davide Basilico,
Eric Baussan,
Marco Bellato,
Marco Beretta,
Antonio Bergnoli,
Daniel Bick
, et al. (635 additional authors not shown)
Abstract:
We explore the decay of bound neutrons into invisible particles (e.g., $n\rightarrow 3 ν$ or $nn \rightarrow 2 ν$) in the JUNO liquid scintillator detector. The invisible decay includes two decay modes: $ n \rightarrow { inv} $ and $ nn \rightarrow { inv} $. The invisible decays of $s$-shell neutrons in $^{12}{\rm C}$ will leave a highly excited residual nucleus. Subsequently, some de-excitation modes of the excited residual nuclei can produce a time- and space-correlated triple coincidence signal in the JUNO detector. Based on a full Monte Carlo simulation informed with the latest available data, we estimate all backgrounds, including inverse beta decay events of the reactor antineutrino $\barν_e$, natural radioactivity, cosmogenic isotopes and neutral current interactions of atmospheric neutrinos. Pulse shape discrimination and multivariate analysis techniques are employed to further suppress backgrounds. With two years of exposure, JUNO is expected to give an order of magnitude improvement compared to the current best limits. After 10 years of data taking, the JUNO expected sensitivities at a 90% confidence level are $τ/B( n \rightarrow { inv} ) > 5.0 \times 10^{31} \, {\rm yr}$ and $τ/B( nn \rightarrow { inv} ) > 1.4 \times 10^{32} \, {\rm yr}$.
Submitted 27 May, 2024;
originally announced May 2024.
-
Spectral-Refiner: Fine-Tuning of Accurate Spatiotemporal Neural Operator for Turbulent Flows
Authors:
Shuhao Cao,
Francesco Brarda,
Ruipeng Li,
Yuanzhe Xi
Abstract:
Recent advancements in operator-type neural networks have shown promising results in approximating the solutions of spatiotemporal Partial Differential Equations (PDEs). However, these neural networks often entail considerable training expenses, and may not always achieve the desired accuracy required in many scientific and engineering disciplines. In this paper, we propose a new Spatiotemporal Fourier Neural Operator (SFNO) that learns maps between Bochner spaces, and a new learning framework to address these issues. This new paradigm leverages wisdom from traditional numerical PDE theory and techniques to refine the pipeline of commonly adopted end-to-end neural operator training and evaluations. Specifically, in the learning problems for the turbulent flow modeling by the Navier-Stokes Equations (NSE), the proposed architecture initiates the training with a few epochs for SFNO, concluding with the freezing of most model parameters. Then, the last linear spectral convolution layer is fine-tuned without the frequency truncation. The optimization uses a negative Sobolev norm for the first time as the loss in operator learning, defined through a reliable functional-type \emph{a posteriori} error estimator whose evaluation is almost exact thanks to the Parseval identity. This design allows the neural operators to effectively tackle low-frequency errors while the relief of the de-aliasing filter addresses high-frequency errors. Numerical experiments on commonly used benchmarks for the 2D NSE demonstrate significant improvements in both computational efficiency and accuracy, compared to end-to-end evaluation and traditional numerical PDE solvers.
Submitted 27 May, 2024;
originally announced May 2024.
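For readers unfamiliar with the loss mentioned in this abstract, the standard Fourier characterization of a negative Sobolev norm on a periodic domain is given below. The domain $\mathbb{T}^d$, the exponent $s$, and the weight $(1+|k|^2)^{-s}$ are textbook conventions assumed here for illustration; the functional actually used in the paper may be defined differently.

```latex
\|f\|_{H^{-s}(\mathbb{T}^d)}^2
  = \sum_{k \in \mathbb{Z}^d} \left(1 + |k|^2\right)^{-s} \, |\hat{f}(k)|^2,
  \qquad s > 0,
\qquad\text{compared with Parseval:}\qquad
\|f\|_{L^2(\mathbb{T}^d)}^2 = \sum_{k \in \mathbb{Z}^d} |\hat{f}(k)|^2 .
```

Because the weights decay as $|k|$ grows, this norm emphasizes low-frequency components, and it can be evaluated directly from Fourier coefficients without returning to physical space.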
-
LDPKiT: Recovering Utility in LDP Schemes by Training with Noise^2
Authors:
Kexin Li,
Yang Xi,
Aastha Mehta,
David Lie
Abstract:
The adoption of large cloud-based models for inference has been hampered by concerns about the privacy leakage of end-user data. One method to mitigate this leakage is to add local differentially private noise to queries before sending them to the cloud, but this degrades utility as a side effect. Our key insight is that knowledge available in the noisy labels returned from performing inference on noisy inputs can be aggregated and used to recover the correct labels. We implement this insight in LDPKiT, which stands for Local Differentially-Private and Utility-Preserving Inference via Knowledge Transfer. LDPKiT uses the noisy labels returned from querying a set of noised inputs to train a local model (noise^2), which is then used to perform inference on the original set of inputs. Our experiments on CIFAR-10, Fashion-MNIST, SVHN, and CARER NLP datasets demonstrate that LDPKiT can improve utility without compromising privacy. For instance, on CIFAR-10, compared to a standard $ε$-LDP scheme with $ε=15$, which provides a weak privacy guarantee, LDPKiT can achieve nearly the same accuracy (within 1% drop) with $ε=7$, offering an enhanced privacy guarantee. Moreover, the benefits of using LDPKiT increase at higher, more privacy-protective noise levels. For Fashion-MNIST and CARER, LDPKiT's accuracy on the sensitive dataset with $ε=7$ not only exceeds the average accuracy of the standard $ε$-LDP scheme with $ε=7$ by roughly 20% and 9% but also outperforms the standard $ε$-LDP scheme with $ε=15$, a scenario with less noise and minimal privacy protection. We also perform Zest distance measurements to demonstrate that the type of distillation performed by LDPKiT is different from a model extraction attack.
Submitted 25 May, 2024;
originally announced May 2024.
-
Efficient Two-Stage Gaussian Process Regression Via Automatic Kernel Search and Subsampling
Authors:
Shifan Zhao,
Jiaying Lu,
Ji Yang,
Edmond Chow,
Yuanzhe Xi
Abstract:
Gaussian Process Regression (GPR) is widely used in statistics and machine learning for prediction tasks requiring uncertainty measures. Its efficacy depends on the appropriate specification of the mean function, covariance kernel function, and associated hyperparameters. Severe misspecifications can lead to inaccurate results and problematic consequences, especially in safety-critical applications. However, a systematic approach to handle these misspecifications is lacking in the literature. In this work, we propose a general framework to address these issues. Firstly, we introduce a flexible two-stage GPR framework that separates mean prediction and uncertainty quantification (UQ) to prevent mean misspecification, which can introduce bias into the model. Secondly, kernel function misspecification is addressed through a novel automatic kernel search algorithm, supported by theoretical analysis, that selects the optimal kernel from a candidate set. Additionally, we propose a subsampling-based warm-start strategy for hyperparameter initialization to improve efficiency and avoid hyperparameter misspecification. With much lower computational cost, our subsampling-based strategy can yield competitive or better performance than training exclusively on the full dataset. Combining all these components, we recommend two GPR methods, one exact and one scalable, designed to match available computational resources and specific UQ requirements. Extensive evaluation on real-world datasets, including UCI benchmarks and a safety-critical medical case study, demonstrates the robustness and precision of our methods.
Submitted 19 September, 2024; v1 submitted 22 May, 2024;
originally announced May 2024.
-
Emergent intelligence of buckling-driven elasto-active structures
Authors:
Yuchen Xi,
Trevor J. Jones,
Richard Huang,
Tom Marzin,
P. -T. Brun
Abstract:
Active systems of self-propelled agents, e.g., birds, fish, and bacteria, can organize their collective motion into myriad autonomous behaviors. Ubiquitous in nature and across length scales, such phenomena are also amenable to artificial settings, e.g., where brainless self-propelled robots orchestrate their movements into spatio-temporal patterns via the application of external cues or when confined within flexible boundaries. Very much like their natural counterparts, these approaches typically require many units to initiate collective motion such that controlling the ensuing dynamics is challenging. Here, we demonstrate a novel yet simple mechanism that leverages nonlinear elasticity to tame near-diffusive motile particles in forming structures capable of directed motion and other emergent intelligent behaviors. Our elasto-active system comprises two centimeter-sized self-propelled microbots connected with elastic beams. These microbots exert forces that suffice to buckle the beam and set the structure in motion. We first rationalize the physics of the interaction between the beam and the microbots. Then we use reduced order models to predict the interactions of our elasto-active structure with boundaries, e.g., walls and constrictions, and demonstrate how they can exhibit intelligent behaviors such as maze navigation. The findings are relevant to designing intelligent materials or soft robots capable of autonomous space exploration, adaptation, and interaction with the surrounding environment.
Submitted 16 April, 2024;
originally announced April 2024.
-
Analysis of a finite element DtN method for scattering resonances of sound hard obstacles
Authors:
Yingxia Xi,
Bo Gong,
Jiguang Sun
Abstract:
Scattering resonances have important applications in many areas of science and engineering. They are the replacement of discrete spectral data for problems on non-compact domains. In this paper, we consider the computation of scattering resonances defined on the exterior to a compact sound hard obstacle. The resonances are the eigenvalues of a holomorphic Fredholm operator function. We truncate the unbounded domain and impose the Dirichlet-to-Neumann (DtN) mapping. The problem is then discretized using the linear Lagrange element. Convergence of the resonances is proved using the abstract approximation theory for holomorphic Fredholm operator functions. The discretization leads to nonlinear algebraic eigenvalue problems, which are solved by the recently developed parallel spectral indicator methods. Numerical examples are presented for validation.
Submitted 14 April, 2024;
originally announced April 2024.
-
MaSkel: A Model for Human Whole-body X-rays Generation from Human Masking Images
Authors:
Yingjie Xi,
Boyuan Cheng,
Jingyao Cai,
Jian Jun Zhang,
Xiaosong Yang
Abstract:
Human whole-body X-rays could offer a valuable reference for various applications, including medical diagnostics, digital animation modeling, and ergonomic design. The traditional method of obtaining X-ray information requires the use of CT (Computed Tomography) scan machines, which emit potentially harmful radiation. It therefore faces a significant limitation in realistic applications because it lacks adaptability and safety. In our work, we propose a new method to directly generate 2D human whole-body X-rays from human masking images. The predicted images will be similar to the real ones, with the same image style and anatomic structure. We employ a data-driven strategy. By leveraging advanced generative techniques, our model MaSkel (Masking image to Skeleton X-rays) can generate a high-quality X-ray image from a human masking image without the need for invasive and harmful radiation exposure, which not only provides a new path to generate highly anatomic and customized data but also reduces health risks. To our knowledge, our model MaSkel is the first work for predicting whole-body X-rays. The work in this paper has two parts. First, to address the data limitation problem, diffusion-based techniques are used for data augmentation, which provides two synthetic datasets for preliminary pretraining. Then we design a two-stage training strategy to train MaSkel. Finally, we make qualitative and quantitative evaluations of the generated X-rays. In addition, we invite some professional doctors to assess our predicted data. These evaluations demonstrate MaSkel's superior ability to generate anatomic X-rays from human masking images. The related code and links to the dataset are available at https://github.com/2022yingjie/MaSkel.
Submitted 13 April, 2024;
originally announced April 2024.
-
Play to Your Strengths: Collaborative Intelligence of Conventional Recommender Models and Large Language Models
Authors:
Yunjia Xi,
Weiwen Liu,
Jianghao Lin,
Chuhan Wu,
Bo Chen,
Ruiming Tang,
Weinan Zhang,
Yong Yu
Abstract:
The rise of large language models (LLMs) has opened new opportunities in Recommender Systems (RSs) by enhancing user behavior modeling and content understanding. However, current approaches that integrate LLMs into RSs solely utilize either LLM or conventional recommender model (CRM) to generate final recommendations, without considering which data segments LLM or CRM excel in. To fill in this gap, we conduct experiments on MovieLens-1M and Amazon-Books datasets, and compare the performance of a representative CRM (DCNv2) and an LLM (LLaMA2-7B) on various groups of data samples. Our findings reveal that LLMs excel in data segments where CRMs exhibit lower confidence and precision, while samples where CRM excels are relatively challenging for LLM, requiring substantial training data and a long training time for comparable performance. This suggests potential synergies in the combination between LLM and CRM. Motivated by these insights, we propose Collaborative Recommendation with conventional Recommender and Large Language Model (dubbed \textit{CoReLLa}). In this framework, we first jointly train LLM and CRM and address the issue of decision boundary shifts through alignment loss. Then, the resource-efficient CRM, with a shorter inference time, handles simple and moderate samples, while LLM processes the small subset of challenging samples for CRM. Our experimental results demonstrate that CoReLLa outperforms state-of-the-art CRM and LLM methods significantly, underscoring its effectiveness in recommendation tasks.
Submitted 24 March, 2024;
originally announced March 2024.
-
RSTAR4D: Rotational Streak Artifact Reduction in 4D CBCT using a Separable 4D CNN
Authors:
Ziheng Deng,
Hua Chen,
Yongzheng Zhou,
Haibo Hu,
Zhiyong Xu,
Jiayuan Sun,
Tianling Lyu,
Yan Xi,
Yang Chen,
Jun Zhao
Abstract:
Four-dimensional cone-beam computed tomography (4D CBCT) provides respiration-resolved images and can be used for image-guided radiation therapy. However, the ability to reveal respiratory motion comes at the cost of image artifacts. As raw projection data are sorted into multiple respiratory phases, the cone-beam projections become much sparser and the reconstructed 4D CBCT images will be covered by severe streak artifacts. Although several deep learning-based methods have been proposed to address this issue, most algorithms employ 2D network models as backbones, neglecting the intrinsic structural priors within 4D CBCT images. In this paper, we first explore the origin and appearance of streak artifacts in 4D CBCT images. We find that streak artifacts exhibit a unique rotational motion along with the patient's respiration, distinguishable from diaphragm-driven respiratory motion in the spatiotemporal domain. Therefore, we propose a novel 4D neural network model, RSTAR4D-Net, designed to address Rotational STreak Artifact Reduction by integrating the spatial and temporal information within 4D CBCT images. Specifically, we overcome the computational and training difficulties of a 4D neural network. The specially designed model adopts an efficient implementation of 4D convolutions to reduce computational costs and thus can process the whole 4D image in one pass. Additionally, a Tetris training strategy pertinent to the separable 4D convolutions is proposed to effectively train the model using limited 4D training samples. Extensive experiments substantiate the effectiveness of our proposed method, and the RSTAR4D-Net shows superior performance compared to other methods. The source code and dynamic demos are available at https://github.com/ivy9092111111/RSTAR.
Submitted 29 September, 2024; v1 submitted 24 March, 2024;
originally announced March 2024.
-
Anderson Acceleration with Truncated Gram-Schmidt
Authors:
Ziyuan Tang,
Tianshi Xu,
Huan He,
Yousef Saad,
Yuanzhe Xi
Abstract:
Anderson Acceleration (AA) is a popular algorithm designed to enhance the convergence of fixed-point iterations. In this paper, we introduce a variant of AA based on a Truncated Gram-Schmidt process (AATGS) which has a few advantages over the classical AA. In particular, an attractive feature of AATGS is that its iterates obey a three-term recurrence in the situation when it is applied to solving symmetric linear problems and this can lead to a considerable reduction of memory and computational costs. We analyze the convergence of AATGS in both full-depth and limited-depth scenarios and establish its equivalence to the classical AA in the linear case. We also report on the effectiveness of AATGS through a set of numerical experiments, ranging from solving nonlinear partial differential equations to tackling nonlinear optimization problems. In particular, the performance of the method is compared with that of the classical AA algorithms.
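For orientation, here is a minimal sketch of the classical, limited-depth Anderson Acceleration that the abstract uses as its baseline. It is not the AATGS variant itself: the truncated Gram-Schmidt step is replaced here by an ordinary least-squares solve, and the function name `anderson`, its parameters, and the test map are all illustrative assumptions.

```python
import numpy as np

def anderson(g, x0, depth=3, tol=1e-10, max_iter=100):
    """Classical Anderson Acceleration for the fixed-point problem x = g(x).

    Keeps the last `depth` differences of iterates and residuals and mixes
    them through a least-squares problem (undamped, i.e. mixing parameter 1).
    Returns the final iterate and the number of iterations used.
    """
    x = np.asarray(x0, dtype=float)
    f = g(x) - x                      # residual of the fixed-point map
    dX, dF = [], []                   # histories of iterate / residual differences
    for k in range(max_iter):
        if np.linalg.norm(f) < tol:
            return x, k
        if dF:
            F = np.column_stack(dF)
            X = np.column_stack(dX)
            # gamma minimizes || f - F @ gamma ||_2
            gamma, *_ = np.linalg.lstsq(F, f, rcond=None)
            x_new = x + f - (X + F) @ gamma
        else:
            x_new = x + f             # first step is a plain fixed-point update
        f_new = g(x_new) - x_new
        dX.append(x_new - x)
        dF.append(f_new - f)
        dX, dF = dX[-depth:], dF[-depth:]   # limited-depth truncation
        x, f = x_new, f_new
    return x, max_iter

# Example: solve x = cos(x) componentwise from five different starting points.
x_star, iters = anderson(np.cos, np.linspace(0.0, 1.0, 5))
```

The history lists hold at most `depth` columns, which is the limited-depth setting mentioned in the abstract. The three-term recurrence for symmetric linear problems is a property of AATGS and is not visible in this sketch.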
Submitted 16 July, 2024; v1 submitted 22 March, 2024;
originally announced March 2024.
-
TDT-KWS: Fast And Accurate Keyword Spotting Using Token-and-duration Transducer
Authors:
Yu Xi,
Hao Li,
Baochen Yang,
Haoyu Li,
Hainan Xu,
Kai Yu
Abstract:
Designing an efficient keyword spotting (KWS) system that delivers exceptional performance on resource-constrained edge devices has long been a subject of significant attention. Existing KWS search algorithms typically follow a frame-synchronous approach, where search decisions are made repeatedly at each frame despite the fact that most frames are keyword-irrelevant. In this paper, we propose TDT-KWS, which leverages token-and-duration Transducers (TDT) for KWS tasks. We also propose a novel KWS task-specific decoding algorithm for Transducer-based models, which supports highly effective frame-asynchronous keyword search in streaming speech scenarios. With evaluations conducted on both the public Hey Snips and self-constructed LibriKWS-20 datasets, our proposed KWS-decoding algorithm produces more accurate results than conventional ASR decoding algorithms. Additionally, TDT-KWS achieves on-par or better wake word detection performance than both RNN-T and traditional TDT-ASR systems while achieving significant inference speed-up. Furthermore, experiments show that TDT-KWS is more robust to noisy environments compared to RNN-T KWS.
Submitted 20 March, 2024;
originally announced March 2024.
-
CoLeCLIP: Open-Domain Continual Learning via Joint Task Prompt and Vocabulary Learning
Authors:
Yukun Li,
Guansong Pang,
Wei Suo,
Chenchen Jing,
Yuling Xi,
Lingqiao Liu,
Hao Chen,
Guoqiang Liang,
Peng Wang
Abstract:
This paper explores the problem of continual learning (CL) of vision-language models (VLMs) in open domains, where the models need to perform continual updating and inference on a streaming of datasets from diverse seen and unseen domains with novel classes. Such a capability is crucial for various applications in open environments, e.g., AI assistants, autonomous driving systems, and robotics. Current CL studies mostly focus on closed-set scenarios in a single domain with known classes. Large pre-trained VLMs like CLIP have demonstrated superior zero-shot recognition ability, and a number of recent studies leverage this ability to mitigate catastrophic forgetting in CL, but they focus on closed-set CL in a single domain dataset. Open-domain CL of large VLMs is significantly more challenging due to 1) large class correlations and domain gaps across the datasets and 2) the forgetting of zero-shot knowledge in the pre-trained VLMs in addition to the knowledge learned from the newly adapted datasets. In this work we introduce a novel approach, termed CoLeCLIP, that learns an open-domain CL model based on CLIP. It addresses these challenges by a joint learning of a set of task prompts and a cross-domain class vocabulary. Extensive experiments on 11 domain datasets show that CoLeCLIP outperforms state-of-the-art methods for open-domain CL under both task- and class-incremental learning settings.
Submitted 15 March, 2024;
originally announced March 2024.
-
A finite element contour integral method for computing the resonances of metallic grating structures with subwavelength holes
Authors:
Yingxia Xi,
Junshan Lin,
Jiguang Sun
Abstract:
We consider the numerical computation of resonances for metallic grating structures with dispersive media and small slit holes. The underlying eigenvalue problem is nonlinear and the mathematical model is multiscale due to the existence of several length scales in problem geometry and material contrast. We discretize the partial differential equation model over the truncated domain using the finite element method and develop a multi-step contour integral eigensolver to compute the resonances. The eigensolver first locates eigenvalues using a spectral indicator and then computes eigenvalues by a subspace projection scheme. The proposed numerical method is robust and scalable and, unlike iterative methods, does not require an initial guess. Numerical examples are presented to demonstrate its effectiveness.
Submitted 7 March, 2024;
originally announced March 2024.
-
Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining
Authors:
Jiarun Liu,
Hao Yang,
Hong-Yu Zhou,
Yan Xi,
Lequan Yu,
Yizhou Yu,
Yong Liang,
Guangming Shi,
Shaoting Zhang,
Hairong Zheng,
Shanshan Wang
Abstract:
Accurate medical image segmentation demands the integration of multi-scale information, spanning from local features to global dependencies. However, it is challenging for existing methods to model long-range global information, where convolutional neural networks (CNNs) are constrained by their local receptive fields, and vision transformers (ViTs) suffer from high quadratic complexity of their attention mechanism. Recently, Mamba-based models have gained great attention for their impressive ability in long sequence modeling. Several studies have demonstrated that these models can outperform popular vision models in various tasks, offering higher accuracy, lower memory consumption, and less computational burden. However, existing Mamba-based models are mostly trained from scratch and do not explore the power of pretraining, which has been proven to be quite effective for data-efficient medical image analysis. This paper introduces a novel Mamba-based model, Swin-UMamba, designed specifically for medical image segmentation tasks, leveraging the advantages of ImageNet-based pretraining. Our experimental results reveal the vital role of ImageNet-based training in enhancing the performance of Mamba-based models. Swin-UMamba outperforms CNNs, ViTs, and the latest Mamba-based models by a large margin. Notably, on the AbdomenMRI, Endoscopy, and Microscopy datasets, Swin-UMamba outperforms its closest counterpart U-Mamba_Enc by an average score of 2.72%.
Submitted 6 March, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
Sliding ferroelectric memories and synapses
Authors:
Xiuzhen Li,
Biao Qin,
Yaxian Wang,
Yue Xi,
Zhiheng Huang,
Mengze Zhao,
Yalin Peng,
Zitao Chen,
Zitian Pan,
Jundong Zhu,
Chenyang Cui,
Rong Yang,
Wei Yang,
Sheng Meng,
Dongxia Shi,
Xuedong Bai,
Can Liu,
Na Li,
Jianshi Tang,
Kaihui Liu,
Luojun Du,
Guangyu Zhang
Abstract:
Ferroelectric materials with switchable electric polarization hold great promise for a plethora of emergent applications, such as post-Moore's law nanoelectronics, beyond-Boltzmann transistors, non-volatile memories, and above-bandgap photovoltaic devices. Recent advances have uncovered an exotic sliding ferroelectric mechanism, which makes it possible to design atomically thin ferroelectrics from non-ferroelectric parent monolayers. Although notable progress has been made in understanding its fundamental properties, functional devices based on sliding ferroelectrics, the key touchstone toward applications, remain elusive. Here, we demonstrate rewritable, non-volatile memory devices at room temperature using a two-dimensional (2D) sliding ferroelectric semiconductor of rhombohedral-stacked bilayer molybdenum disulfide. The 2D sliding ferroelectric memories (SFeMs) show superior performance with a large memory window of >8 V, a high conductance ratio above $10^6$, a long retention time of >10 years, and a programming endurance greater than $10^4$ cycles. Remarkably, flexible SFeMs are achieved with state-of-the-art performance competitive with their rigid counterparts and maintain their performance after bending for over $10^3$ cycles. Furthermore, synapse-specific Hebbian forms of plasticity and image recognition with a high accuracy of 97.81% are demonstrated based on flexible SFeMs. Our work demonstrates sliding ferroelectric memories and synaptic plasticity on both rigid and flexible substrates, highlighting the great potential of sliding ferroelectrics for emerging technological applications in brain-inspired in-memory computing, edge intelligence, and energy-efficient wearable electronics.
Submitted 29 January, 2024;
originally announced January 2024.
-
Contrastive Learning With Audio Discrimination For Customizable Keyword Spotting In Continuous Speech
Authors:
Yu Xi,
Baochen Yang,
Hao Li,
Jiaqi Guo,
Kai Yu
Abstract:
Customizable keyword spotting (KWS) in continuous speech has attracted increasing attention due to its real-world application potential. While contrastive learning (CL) has been widely used to extract keyword representations, previous CL approaches all operate on pre-segmented isolated words and employ only audio-text representations matching strategy. However, for KWS in continuous speech, co-articulation and streaming word segmentation can easily yield similar audio patterns for different texts, which may consequently trigger false alarms. To address this issue, we propose a novel CL with Audio Discrimination (CLAD) approach to learning keyword representation with both audio-text matching and audio-audio discrimination ability. Here, an InfoNCE loss considering both audio-audio and audio-text CL data pairs is employed for each sliding window during training. Evaluations on the open-source LibriPhrase dataset show that the use of sliding-window level InfoNCE loss yields comparable performance compared to previous CL approaches. Furthermore, experiments on the continuous speech dataset LibriSpeech demonstrate that, by incorporating audio discrimination, CLAD achieves significant performance gain over CL without audio discrimination. Meanwhile, compared to two-stage KWS approaches, the end-to-end KWS with CLAD achieves not only better performance, but also significant speed-up.
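As a rough illustration of the InfoNCE objective mentioned above, the sketch below computes the loss for a batch of paired embeddings, treating the matching row as the positive and all other rows as negatives. It is a generic formulation under assumed names (`info_nce`, `anchors`, `positives`, `temperature`). How CLAD combines audio-audio and audio-text pairs per sliding window is not specified in the abstract and is not reproduced here.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss for a batch of paired embeddings.

    Row i of `anchors` is matched with row i of `positives`; every other
    row in the batch serves as a negative for that anchor.
    """
    # Cosine similarity via L2-normalized embeddings
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature
    # Numerically stable log-softmax over each row
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the matching (diagonal) pairs
    return -np.mean(np.diag(log_prob))

# Toy check: aligned pairs should give a lower loss than misaligned pairs.
emb = np.eye(4)
loss_matched = info_nce(emb, emb)
loss_mismatched = info_nce(emb, np.roll(emb, 1, axis=0))
```

In a setting like the one the abstract describes, the anchors could be window-level audio embeddings and the positives either text or audio embeddings of the same keyword. That mapping is an assumption made only for illustration.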
Submitted 12 January, 2024;
originally announced January 2024.
-
Cospectral vertices, walk-regular planar graphs and the echolocation problem
Authors:
Shi-Lei Kong,
Emmett L. Wyman,
Yakun Xi
Abstract:
We study cospectral vertices on finite graphs in relation to the echolocation problem on Riemannian manifolds. First, we prove a computationally simple criterion to determine whether two vertices are cospectral. Then, we use this criterion in conjunction with a computer search to find minimal examples of various types of graphs on which cospectral but non-similar vertices exist, including minimal walk-regular non-vertex-transitive graphs, which turn out to be non-planar. Moreover, as our main result, we classify all finite 3-connected walk-regular planar graphs, proving that such graphs must be vertex-transitive.
Submitted 18 July, 2024; v1 submitted 11 January, 2024;
originally announced January 2024.
-
Parallel Multi-Step Contour Integral Methods for Nonlinear Eigenvalue Problems
Authors:
Yingxia Xi,
Jiguang Sun
Abstract:
We consider nonlinear eigenvalue problems to compute all eigenvalues in a bounded region on the complex plane. Based on domain decomposition and contour integrals, two robust and scalable parallel multi-step methods are proposed. The first method 1) uses the spectral indicator method to find eigenvalues and 2) calls a linear eigensolver to compute the associated eigenvectors. The second method 1) divides the region into subregions and uses the spectral indicator method to decide candidate regions that contain eigenvalues, 2) computes eigenvalues in each candidate subregion using Beyn's method, and 3) verifies each eigenvalue by substituting it back into the system and computing the smallest eigenvalue. Each step of the two methods is carried out in parallel. Both methods are robust and accurate, and do not require prior knowledge of the number and distribution of the eigenvalues in the region. Examples are presented to show the performance of the two methods.
Submitted 17 January, 2024; v1 submitted 20 December, 2023;
originally announced December 2023.
-
Devil in the Landscapes: Inferring Epidemic Exposure Risks from Street View Imagery
Authors:
Zhenyu Han,
Yanxin Xi,
Tong Xia,
Yu Liu,
Yong Li
Abstract:
The built environment supports all daily activities and shapes our health. Leveraging informative street view imagery, previous research has established the profound correlation between the built environment and chronic, non-communicable diseases; however, predicting the exposure risk of infectious diseases remains largely unexplored. Person-to-person contacts and interactions contribute to the complexity of infectious diseases, which are inherently different from non-communicable diseases. In addition, the complex relationships between street view imagery and epidemic exposure hinder accurate predictions. To address these problems, we construct a regional mobility graph informed by the gravity model, based on which we propose a transmission-aware graph convolutional network (GCN) to capture disease transmission patterns arising from human mobility. Experiments show that the proposed model significantly outperforms baseline models by 8.54% in weighted F1, shedding light on a low-cost, scalable approach to assess epidemic exposure risks from street view imagery.
Submitted 8 November, 2023;
originally announced November 2023.
-
Full-length-body CBCT imaging in upright position with robotic-arm system: a simulation study
Authors:
Tong Lin,
Tianling Lyu,
Zhan Wu,
Yan Xi,
Wentao Zhu,
Yang Chen
Abstract:
Upright-position CT scans make full-length-body imaging possible under conditions more relevant to daily situations, but the substantial weight of upright CT scanners increases the risks to the floor's stability and patients' safety. Robotic-arm CBCT systems are expected to be a better solution for this task, but such systems still face challenges including long scanning time and low reconstruction quality. To address these challenges, this paper proposes a novel method to calculate the optimal scanning pitch based on data completeness analysis, which can complete the whole-body scan in the shortest time without a significant decline in image quality. In addition, an FDK-style reconstruction method based on normalized projections is proposed to obtain fast image reconstruction. Extensive experiments demonstrate the effectiveness of the proposed optimal scanning trajectory. Qualitative and quantitative comparisons with FDK and iterative algorithms show that the proposed reconstruction method can obtain high imaging quality with reasonable computation costs. The method proposed in this paper is expected to promote the application of robotic-arm CBCT systems in orthopedic functional analysis.
Submitted 9 November, 2023;
originally announced November 2023.
-
ClickPrompt: CTR Models are Strong Prompt Generators for Adapting Language Models to CTR Prediction
Authors:
Jianghao Lin,
Bo Chen,
Hangyu Wang,
Yunjia Xi,
Yanru Qu,
Xinyi Dai,
Kangning Zhang,
Ruiming Tang,
Yong Yu,
Weinan Zhang
Abstract:
Click-through rate (CTR) prediction has become increasingly indispensable for various Internet applications. Traditional CTR models convert the multi-field categorical data into ID features via one-hot encoding, and extract the collaborative signals among features. Such a paradigm suffers from the problem of semantic information loss. Another line of research explores the potential of pretrained language models (PLMs) for CTR prediction by converting input data into textual sentences through hard prompt templates. Although semantic signals are preserved, they generally fail to capture the collaborative information (e.g., feature interactions, pure ID features), not to mention the unacceptable inference overhead brought by the huge model size. In this paper, we aim to model both the semantic knowledge and collaborative knowledge for accurate CTR estimation, and meanwhile address the inference inefficiency issue. To benefit from both worlds and close their gaps, we propose a novel model-agnostic framework (i.e., ClickPrompt), where we incorporate CTR models to generate interaction-aware soft prompts for PLMs. We design a prompt-augmented masked language modeling (PA-MLM) pretraining task, where PLM has to recover the masked tokens based on the language context, as well as the soft prompts generated by CTR model. The collaborative and semantic knowledge from ID and textual features would be explicitly aligned and interacted via the prompt interface. Then, we can either tune the CTR model with PLM for superior performance, or solely tune the CTR model without PLM for inference efficiency. Experiments on four real-world datasets validate the effectiveness of ClickPrompt compared with existing baselines.
Submitted 26 June, 2024; v1 submitted 13 October, 2023;
originally announced October 2023.
-
IFT: Image Fusion Transformer for Ghost-free High Dynamic Range Imaging
Authors:
Hailing Wang,
Wei Li,
Yuanyuan Xi,
Jie Hu,
Hanting Chen,
Longyu Li,
Yunhe Wang
Abstract:
Multi-frame high dynamic range (HDR) imaging aims to reconstruct ghost-free images with photo-realistic details from content-complementary but spatially misaligned low dynamic range (LDR) images. Existing HDR algorithms are prone to producing ghosting artifacts as their methods fail to capture long-range dependencies between LDR frames with large motion in dynamic scenes. To address this issue, we propose a novel image fusion transformer, referred to as IFT, which presents a fast global patch searching (FGPS) module followed by a self-cross fusion module (SCF) for ghost-free HDR imaging. The FGPS searches the patches from supporting frames that have the closest dependency to each patch of the reference frame for long-range dependency modeling, while the SCF conducts intra-frame and inter-frame feature fusion on the patches obtained by the FGPS with linear complexity to input resolution. By matching similar patches between frames, objects with large motion ranges in dynamic scenes can be aligned, which can effectively alleviate the generation of artifacts. In addition, the proposed FGPS and SCF can be integrated into various deep HDR methods as efficient plug-in modules. Extensive experiments on multiple benchmarks show that our method achieves state-of-the-art performance both quantitatively and qualitatively.
Submitted 8 October, 2023; v1 submitted 26 September, 2023;
originally announced September 2023.
-
Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023
Authors:
Haotian Wang,
Yuxuan Xi,
Hang Chen,
Jun Du,
Yan Song,
Qing Wang,
Hengshun Zhou,
Chenxi Wang,
Jiefeng Ma,
Pengfei Hu,
Ya Jiang,
Shi Cheng,
Jie Zhang,
Yuzhe Weng
Abstract:
In this paper, we propose a novel framework for recognizing both discrete and dimensional emotions. In our framework, deep features extracted from foundation models are used as robust acoustic and visual representations of raw video. Three different structures based on attention-guided feature gathering (AFG) are designed for deep feature fusion. Then, we introduce a joint decoding structure for emotion classification and valence regression in the decoding stage. A multi-task loss based on uncertainty is also designed to optimize the whole process. Finally, by combining three different structures on the posterior probability level, we obtain the final predictions of discrete and dimensional emotions. When tested on the dataset of multimodal emotion recognition challenge (MER 2023), the proposed framework yields consistent improvements in both emotion classification and valence regression. Our final system achieves state-of-the-art performance and ranks third on the leaderboard on MER-MULTI sub-challenge.
Submitted 10 September, 2023;
originally announced September 2023.
-
Real-time Monitoring for the Next Core-Collapse Supernova in JUNO
Authors:
Angel Abusleme,
Thomas Adam,
Shakeel Ahmad,
Rizwan Ahmed,
Sebastiano Aiello,
Muhammad Akram,
Abid Aleem,
Fengpeng An,
Qi An,
Giuseppe Andronico,
Nikolay Anfimov,
Vito Antonelli,
Tatiana Antoshkina,
Burin Asavapibhop,
João Pedro Athayde Marcondes de André,
Didier Auguste,
Weidong Bai,
Nikita Balashov,
Wander Baldini,
Andrea Barresi,
Davide Basilico,
Eric Baussan,
Marco Bellato,
Marco Beretta,
Antonio Bergnoli
, et al. (606 additional authors not shown)
Abstract:
The core-collapse supernova (CCSN) is considered one of the most energetic astrophysical events in the universe. The early and prompt detection of neutrinos before (pre-SN) and during the supernova (SN) burst presents a unique opportunity for multi-messenger observations of CCSN events. In this study, we describe the monitoring concept and present the sensitivity of the system to pre-SN and SN neutrinos at the Jiangmen Underground Neutrino Observatory (JUNO), a 20 kton liquid scintillator detector currently under construction in South China. The real-time monitoring system is designed to ensure both prompt alert speed and comprehensive coverage of progenitor stars. It incorporates prompt monitors on the electronic board as well as online monitors at the data acquisition stage. Assuming a false alert rate of 1 per year, this monitoring system exhibits sensitivity to pre-SN neutrinos up to a distance of approximately 1.6 (0.9) kiloparsecs and SN neutrinos up to about 370 (360) kiloparsecs for a progenitor mass of 30 solar masses, considering both normal and inverted mass ordering scenarios. The pointing ability of the CCSN is evaluated by analyzing the accumulated event anisotropy of inverse beta decay interactions from pre-SN or SN neutrinos. This, along with the early alert, can play a crucial role in facilitating follow-up multi-messenger observations of the next galactic or nearby extragalactic CCSN.
Submitted 4 December, 2023; v1 submitted 13 September, 2023;
originally announced September 2023.
-
Falconer distance problem on Riemannian manifolds
Authors:
Changbiao Jian,
Bochen Liu,
Yakun Xi
Abstract:
We prove that on a $d$-dimensional Riemannian manifold, the distance set of a Borel set $E$ has a positive Lebesgue measure if $\dim_{\mathcal H} E>\frac d2+\frac14+\frac{3}{8d+4}.$ Moreover, on a Riemannian manifold with constant sectional curvature, we show that the distance set of $E$ has a positive Lebesgue measure if $\dim_{\mathcal{H}}(E)>\frac d2+\frac14+\frac{1-(-1)^d}{8d}.$
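To make the two dimensional thresholds concrete, the following sketch evaluates them exactly for small $d$ using rational arithmetic. The function names are illustrative; the formulas are copied directly from the abstract.

```python
from fractions import Fraction

def general_threshold(d):
    """d/2 + 1/4 + 3/(8d+4): threshold on a general d-dimensional Riemannian manifold."""
    return Fraction(d, 2) + Fraction(1, 4) + Fraction(3, 8 * d + 4)

def constant_curvature_threshold(d):
    """d/2 + 1/4 + (1-(-1)^d)/(8d): threshold under constant sectional curvature."""
    return Fraction(d, 2) + Fraction(1, 4) + Fraction(1 - (-1) ** d, 8 * d)

# d = 2: 7/5 in general versus 5/4 under constant curvature
# d = 3: 13/7 in general versus 11/6 under constant curvature
thresholds = {d: (general_threshold(d), constant_curvature_threshold(d)) for d in (2, 3)}
```

For even $d$ the last term of the constant-curvature bound vanishes, leaving $\frac d2+\frac14$, while for odd $d$ it equals $\frac{1}{4d}$. In both cases shown, the constant-curvature threshold is strictly smaller than the general one.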
Submitted 22 January, 2024; v1 submitted 3 September, 2023;
originally announced September 2023.
-
EFormer: Enhanced Transformer towards Semantic-Contour Features of Foreground for Portraits Matting
Authors:
Zitao Wang,
Qiguang Miao,
Peipei Zhao,
Yue Xi
Abstract:
The portrait matting task aims to extract an alpha matte with complete semantics and finely detailed contours. Compared with CNN-based approaches, transformers with self-attention modules have a better capacity to capture long-range dependencies and low-frequency semantic information of a portrait. However, recent research shows that the self-attention mechanism struggles to model high-frequency contour information and capture fine contour details, which can lead to bias when predicting the portrait's contours. To address this issue, we propose EFormer to enhance the model's attention towards both the low-frequency semantic and high-frequency contour features. For the high-frequency contours, our research demonstrates that a cross-attention module between different resolutions can guide our model to allocate attention appropriately to these contour regions. Building on this, we can successfully extract the high-frequency detail information around the portrait's contours, which was previously ignored by self-attention. Based on the cross-attention module, we further build a semantic and contour detector (SCD) to accurately capture both the low-frequency semantic and high-frequency contour features. We also design a contour-edge extraction branch and a semantic extraction branch to extract refined high-frequency contour features and complete low-frequency semantic information, respectively. Finally, we fuse the two kinds of features and use a segmentation head to generate a predicted portrait matte. Experiments on the VideoMatte240K (JPEG SD Format) and Adobe Image Matting (AIM) datasets demonstrate that EFormer outperforms previous portrait matting methods.
Submitted 30 November, 2023; v1 submitted 24 August, 2023;
originally announced August 2023.
-
Prototypical Kernel Learning and Open-set Foreground Perception for Generalized Few-shot Semantic Segmentation
Authors:
Kai Huang,
Feigege Wang,
Ye Xi,
Yutao Gao
Abstract:
Generalized Few-shot Semantic Segmentation (GFSS) extends Few-shot Semantic Segmentation (FSS) to simultaneously segment unseen classes and seen classes during evaluation. Previous works leverage an additional branch or prototypical aggregation to eliminate the constrained setting of FSS. However, representation division and embedding prejudice, which heavily degrade GFSS performance, have not been considered together. We address the aforementioned problems by joining prototypical kernel learning and open-set foreground perception. Specifically, a group of learnable kernels is proposed to perform segmentation, with each kernel in charge of a stuff class. Then, we merge prototypical learning into the update of base-class kernels, which is consistent with the prototype knowledge aggregation of few-shot novel classes. In addition, a foreground contextual perception module, cooperating with conditional-bias-based inference, is adopted to perform class-agnostic as well as open-set foreground detection, thereby mitigating the embedding prejudice and preventing novel targets from being misclassified as background. Moreover, we also adapt our method to Class Incremental Few-shot Semantic Segmentation (CIFSS), which takes the knowledge of novel classes in an incremental stream. Extensive experiments on the PASCAL-5i and COCO-20i datasets demonstrate that our method performs better than the previous state of the art.
Submitted 18 August, 2023; v1 submitted 9 August, 2023;
originally announced August 2023.