932 results for au:Luo_J in:cs

Learning from others' mistakes: Finetuning machine translation models with span-level error annotations
Lily H. Zhang, Hamid Dadkhahi, Mara Finkelstein, Firas Trabelsi, Jiaming Luo, Markus Freitag
Oct 23 2024 cs.CL cs.LG arXiv:2410.16509v1

@misc{2410.16509, author = {Lily H.~Zhang and Hamid Dadkhahi and Mara Finkelstein and Firas Trabelsi and Jiaming Luo and Markus Freitag}, title = {{L}earning from others' mistakes: {F}inetuning machine translation models with span-level error annotations}, year = {2024}, eprint = {2410.16509}, note = {arXiv:2410.16509v1} }
Despite growing interest in incorporating feedback to improve language models, most efforts focus only on sequence-level annotations. In this work, we explore the potential of utilizing fine-grained span-level annotations from offline datasets to improve model quality. We develop a simple finetuning algorithm, called Training with Annotations (TWA), to directly train machine translation models on such annotated data. TWA utilizes targeted span-level error information while also flexibly learning what to penalize within a span. Moreover, TWA considers the overall trajectory of a sequence when deciding which non-error spans to utilize as positive signals. Experiments on English-German and Chinese-English machine translation show that TWA outperforms baselines such as Supervised FineTuning on sequences filtered for quality and Direct Preference Optimization on pairs constructed from the same data.
GALA: Graph Diffusion-based Alignment with Jigsaw for Source-free Domain Adaptation
Junyu Luo, Yiyang Gu, Xiao Luo, Wei Ju, Zhiping Xiao, Yusheng Zhao, Jingyang Yuan, Ming Zhang
Oct 23 2024 cs.LG cs.AI arXiv:2410.16606v1

@misc{2410.16606, author = {Junyu Luo and Yiyang Gu and Xiao Luo and Wei Ju and Zhiping Xiao and Yusheng Zhao and Jingyang Yuan and Ming Zhang}, title = {{GALA}: {G}raph {D}iffusion-based {A}lignment with {J}igsaw for {S}ource-free {D}omain {A}daptation}, year = {2024}, eprint = {2410.16606}, doi = {10.1109/TPAMI.2024.3416372}, note = {arXiv:2410.16606v1} }
Source-free domain adaptation is a crucial machine learning topic, as it has numerous applications in the real world, particularly with respect to data privacy. Existing approaches predominantly focus on Euclidean data, such as images and videos, while the exploration of non-Euclidean graph data remains scarce. Recent graph neural network (GNN) approaches can suffer from serious performance decline due to domain shift and label scarcity in source-free adaptation scenarios. In this study, we propose a novel method named Graph Diffusion-based Alignment with Jigsaw (GALA), tailored for source-free graph domain adaptation. To achieve domain alignment, GALA employs a graph diffusion model to reconstruct source-style graphs from target data. Specifically, a score-based graph diffusion model is trained using source graphs to learn the generative source styles. Then, we introduce perturbations to target graphs via a stochastic differential equation instead of sampling from a prior, followed by the reverse process to reconstruct source-style graphs. We feed the source-style graphs into an off-the-shelf GNN and introduce class-specific thresholds with curriculum learning, which can generate accurate and unbiased pseudo-labels for target graphs. Moreover, we develop a simple yet effective graph-mixing strategy named graph jigsaw to combine confident graphs and unconfident graphs, which can enhance generalization capabilities and robustness via consistency learning. Extensive experiments on benchmark datasets validate the effectiveness of GALA.
SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation
Junyu Luo, Xiao Luo, Xiusi Chen, Zhiping Xiao, Wei Ju, Ming Zhang
Oct 22 2024 cs.CL cs.AI arXiv:2410.14745v1

@misc{2410.14745, author = {Junyu Luo and Xiao Luo and Xiusi Chen and Zhiping Xiao and Wei Ju and Ming Zhang}, title = {{S}emi{E}vol: {S}emi-supervised {F}ine-tuning for {LLM} {A}daptation}, year = {2024}, eprint = {2410.14745}, note = {arXiv:2410.14745v1} }
Supervised fine-tuning (SFT) is crucial in adapting large language models (LLMs) to a specific domain or task. However, only a limited amount of labeled data is available in practical applications, which poses a severe challenge for SFT in yielding satisfactory results. Therefore, a data-efficient framework that can fully exploit labeled and unlabeled data for LLM fine-tuning is highly anticipated. Towards this end, we introduce a semi-supervised fine-tuning framework named SemiEvol for LLM adaptation in a propagate-and-select manner. For knowledge propagation, SemiEvol adopts a bi-level approach, propagating knowledge from labeled data to unlabeled data through both in-weight and in-context methods. For knowledge selection, SemiEvol incorporates a collaborative learning mechanism, selecting higher-quality pseudo-response samples. We conducted experiments using GPT-4o-mini and Llama-3.1 on seven general or domain-specific datasets, demonstrating significant improvements in model performance on target data. Furthermore, we compared SemiEvol with SFT and self-evolution methods, highlighting its practicality in hybrid data scenarios.
Focus on BEV: Self-calibrated Cycle View Transformation for Monocular Birds-Eye-View Segmentation
Jiawei Zhao, Qixing Jiang, Xuede Li, Junfeng Luo
Oct 22 2024 cs.CV arXiv:2410.15932v1

@misc{2410.15932, author = {Jiawei Zhao and Qixing Jiang and Xuede Li and Junfeng Luo}, title = {{F}ocus on {BEV}: {S}elf-calibrated {C}ycle {V}iew {T}ransformation for {M}onocular {B}irds-{E}ye-{V}iew {S}egmentation}, year = {2024}, eprint = {2410.15932}, note = {arXiv:2410.15932v1} }
Birds-Eye-View (BEV) segmentation aims to establish a spatial mapping from the perspective view to the top view and estimate the semantic maps from monocular images. Recent studies have encountered difficulties in view transformation due to the disruption of BEV-agnostic features in image space. To tackle this issue, we propose a novel FocusBEV framework consisting of (i) a self-calibrated cross view transformation module to suppress the BEV-agnostic image areas and focus on the BEV-relevant areas in the view transformation stage, (ii) a plug-and-play ego-motion-based temporal fusion module to exploit the spatiotemporal structure consistency in BEV space with a memory bank, and (iii) an occupancy-agnostic IoU loss to mitigate both semantic and positional uncertainties. Experimental evidence demonstrates that our approach achieves new state-of-the-art on two popular benchmarks, i.e., 29.2% mIoU on nuScenes and 35.2% mIoU on Argoverse.
Reinforcement Learning with Euclidean Data Augmentation for State-Based Continuous Control
Jinzhu Luo, Dingyang Chen, Qi Zhang
Oct 18 2024 cs.LG cs.AI arXiv:2410.12983v1

@misc{2410.12983, author = {Jinzhu Luo and Dingyang Chen and Qi Zhang}, title = {{R}einforcement {L}earning with {E}uclidean {D}ata {A}ugmentation for {S}tate-{B}ased {C}ontinuous {C}ontrol}, year = {2024}, eprint = {2410.12983}, note = {arXiv:2410.12983v1} }
Data augmentation creates new data points by transforming the original ones for a reinforcement learning (RL) agent to learn from, which has been shown to be effective for the objective of improving the data efficiency of RL for continuous control. Prior work towards this objective has been largely restricted to perturbation-based data augmentation where new data points are created by perturbing the original ones, which has been impressively effective for tasks where the RL agent observes control states as images with perturbations including random cropping, shifting, etc. This work focuses on state-based control, where the RL agent can directly observe raw kinematic and task features, and considers an alternative data augmentation applied to these features based on Euclidean symmetries under transformations like rotations. We show that the default state features used in existing benchmark tasks that are based on joint configurations are not amenable to Euclidean transformations. We therefore advocate using state features based on configurations of the limbs (i.e., the rigid bodies connected by the joints) that instead provide rich augmented data under Euclidean transformations. With minimal hyperparameter tuning, we show this new Euclidean data augmentation strategy significantly improves both data efficiency and asymptotic performance of RL on a wide range of continuous control tasks.
Aegis:An Advanced LLM-Based Multi-Agent for Intelligent Functional Safety Engineering
Lu Shi, Bin Qi, Jiarui Luo, Yang Zhang, Zhanzhao Liang, Zhaowei Gao, Wenke Deng, Lin Sun
Oct 17 2024 cs.MA arXiv:2410.12475v2

@misc{2410.12475, author = {Lu Shi and Bin Qi and Jiarui Luo and Yang Zhang and Zhanzhao Liang and Zhaowei Gao and Wenke Deng and Lin Sun}, title = {{A}egis:{A}n {A}dvanced {LLM}-{B}ased {M}ulti-{A}gent for {I}ntelligent {F}unctional {S}afety {E}ngineering}, year = {2024}, eprint = {2410.12475}, note = {arXiv:2410.12475v2} }
Functional safety is a critical aspect of automotive engineering, encompassing all phases of a vehicle's lifecycle, including design, development, production, operation, and decommissioning. This domain involves highly knowledge-intensive tasks. This paper introduces Aegis: An Advanced LLM-Based Multi-Agent for Intelligent Functional Safety Engineering. Aegis is specifically designed to support complex functional safety tasks within the automotive sector. It is tailored to perform Hazard Analysis and Risk Assessment (HARA), document Functional Safety Requirements (FSR), and plan test cases for Automatic Emergency Braking (AEB) systems. The most advanced version, Aegis-Max, leverages Retrieval-Augmented Generation (RAG) and reflective mechanisms to enhance its capability in managing complex, knowledge-intensive tasks. Additionally, targeted prompt refinement by professional functional safety practitioners can significantly optimize Aegis's performance in the functional safety domain. This paper demonstrates the potential of Aegis to improve the efficiency and effectiveness of functional safety processes in automotive engineering.
t-READi: Transformer-Powered Robust and Efficient Multimodal Inference for Autonomous Driving
Pengfei Hu, Yuhang Qian, Tianyue Zheng, Ang Li, Zhe Chen, Yue Gao, Xiuzhen Cheng, Jun Luo
Oct 15 2024 cs.CV cs.AI cs.DC cs.LG cs.RO arXiv:2410.09747v2

@misc{2410.09747, author = {Pengfei Hu and Yuhang Qian and Tianyue Zheng and Ang Li and Zhe Chen and Yue Gao and Xiuzhen Cheng and Jun Luo}, title = {t-{READ}i: {T}ransformer-{P}owered {R}obust and {E}fficient {M}ultimodal {I}nference for {A}utonomous {D}riving}, year = {2024}, eprint = {2410.09747}, note = {arXiv:2410.09747v2} }
Given the wide adoption of multimodal sensors (e.g., camera, lidar, radar) by autonomous vehicles (AVs), deep analytics to fuse their outputs for a robust perception become imperative. However, existing fusion methods often make two assumptions rarely holding in practice: i) similar data distributions for all inputs and ii) constant availability for all sensors. Because, for example, lidars have various resolutions and failures of radars may occur, such variability often results in significant performance degradation in fusion. To this end, we present t-READi, an adaptive inference system that accommodates the variability of multimodal sensory data and thus enables robust and efficient perception. t-READi identifies variation-sensitive yet structure-specific model parameters; it then adapts only these parameters while keeping the rest intact. t-READi also leverages a cross-modality contrastive learning method to compensate for the loss from missing modalities. Both functions are implemented to maintain compatibility with existing multimodal deep fusion methods. The extensive experiments evidently demonstrate that compared with the status quo approaches, t-READi not only improves the average inference accuracy by more than 6% but also reduces the inference latency by almost 15x with the cost of only 5% extra memory overhead in the worst case under realistic data and modal variations.
Mixture of Experts Made Personalized: Federated Prompt Learning for Vision-Language Models
Jun Luo, Chen Chen, Shandong Wu
Oct 15 2024 cs.LG cs.CL cs.CV arXiv:2410.10114v2

@misc{2410.10114, author = {Jun Luo and Chen Chen and Shandong Wu}, title = {{M}ixture of {E}xperts {M}ade {P}ersonalized: {F}ederated {P}rompt {L}earning for {V}ision-{L}anguage {M}odels}, year = {2024}, eprint = {2410.10114}, note = {arXiv:2410.10114v2} }
Prompt learning for pre-trained Vision-Language Models (VLMs) like CLIP has demonstrated potent applicability across diverse downstream tasks. This lightweight approach has quickly gained traction from federated learning (FL) researchers who seek to efficiently adapt VLMs to heterogeneous scenarios. However, current federated prompt learning methods are habitually restricted to the traditional FL paradigm, where the participating clients are generally only allowed to download a single globally aggregated model from the server. While justifiable for training full-sized models under federated settings, in this work, we argue that this paradigm is ill-suited for lightweight prompts. By facilitating the clients to download multiple pre-aggregated prompts as fixed non-local experts, we propose Personalized Federated Mixture of Adaptive Prompts (pFedMoAP), a novel FL framework that personalizes the prompt learning process through the lens of Mixture of Experts (MoE). pFedMoAP implements a local attention-based gating network that learns to generate enhanced text features for better alignment with local image data on the client, benefiting from both local and downloaded non-local adaptive prompt experts. The non-local experts are sparsely selected from a server-maintained pool, fostering collaborative learning across clients. To evaluate the proposed algorithm, we conduct extensive experiments across 9 datasets under various heterogeneous federated settings. The results show that pFedMoAP consistently outperforms the state-of-the-art alternatives, underscoring its efficacy in personalizing prompt learning for CLIP within the federated learning paradigm.
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models
Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, Jiebo Luo
Oct 15 2024 cs.CV arXiv:2410.09733v1

@misc{2410.09733, author = {Hang Hua and Yunlong Tang and Ziyun Zeng and Liangliang Cao and Zhengyuan Yang and Hangfeng He and Chenliang Xu and Jiebo Luo}, title = {{MMCOMPOSITION}: {R}evisiting the {C}ompositionality of {P}re-trained {V}ision-{L}anguage {M}odels}, year = {2024}, eprint = {2410.09733}, note = {arXiv:2410.09733v1} }
The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling more sophisticated and accurate integration of visual and textual information across various tasks, including image and video captioning, visual question answering, and cross-modal retrieval. Despite VLMs' superior capabilities, researchers lack a comprehensive understanding of their compositionality -- the ability to understand and produce novel combinations of known visual and textual components. Prior benchmarks provide only a relatively rough compositionality evaluation from the perspectives of objects, relations, and attributes while neglecting deeper reasoning about object interactions, counting, and complex compositions. However, compositionality is a critical ability that facilitates coherent reasoning and understanding across modalities for VLMs. To address this limitation, we propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating VLMs' compositionality. Our proposed benchmark serves as a complement to these earlier works. With MMCOMPOSITION, we can quantify and explore the compositionality of the mainstream VLMs. Surprisingly, we find GPT-4o's compositionality inferior to the best open-source model, and we analyze the underlying reasons. Our experimental analysis reveals the limitations of VLMs in fine-grained compositional perception and reasoning, and points to areas for improvement in VLM design and training. Resources available at: https://hanghuacs.github.io/MMComposition/
Small Tunes Transformer: Exploring Macro & Micro-Level Hierarchies for Skeleton-Conditioned Melody Generation
Yishan Lv, Jing Luo, Boyuan Ju, Xinyu Yang
Oct 14 2024 cs.SD eess.AS arXiv:2410.08626v2

@misc{2410.08626, author = {Yishan Lv and Jing Luo and Boyuan Ju and Xinyu Yang}, title = {{S}mall {T}unes {T}ransformer: {E}xploring {M}acro \& {M}icro-{L}evel {H}ierarchies for {S}keleton-{C}onditioned {M}elody {G}eneration}, year = {2024}, eprint = {2410.08626}, note = {arXiv:2410.08626v2} }
Recently, symbolic music generation has become a focus of much deep learning research. Structure, as an important part of music, contributes to improving the quality of music, and an increasing number of works have started to study hierarchical structure. In this study, we delve into the multi-level structures within music from macro-level and micro-level hierarchies. At the macro-level hierarchy, we apply a phrase segmentation algorithm to explore how phrases influence the overall development of music, and at the micro-level hierarchy, we design a skeleton-note extraction strategy to explore how skeleton notes within each phrase guide the melody generation. Furthermore, we propose a novel Phrase-level Cross-Attention mechanism to capture the intrinsic relationship between macro-level hierarchy and micro-level hierarchy. Moreover, in response to the current lack of research on Chinese-style music, we construct our Small Tunes Dataset: a substantial collection of MIDI files comprising 10088 Small Tunes, a category of traditional Chinese Folk Songs. This dataset serves as the focus of our study. We generate Small Tunes songs utilizing the extracted skeleton notes as conditions, and experimental results indicate that our proposed model, Small Tunes Transformer, outperforms other state-of-the-art models. In addition, we design three novel objective evaluation metrics to evaluate music from both rhythm and melody dimensions.
Hull's Parameters of Projective Reed-Muller Code
Yufeng Song, Jinquan Luo
Oct 11 2024 cs.IT math.IT arXiv:2410.07217v1

@misc{2410.07217, author = {Yufeng Song and Jinquan Luo}, title = {{H}ull's {P}arameters of {P}rojective {R}eed-{M}uller {C}ode}, year = {2024}, eprint = {2410.07217}, note = {arXiv:2410.07217v1} }
Projective Reed-Muller codes (PRM codes) are constructed from the family of projective hypersurfaces of a fixed degree over a finite field $\mathbb{F}_q$. In this paper, we completely determine the minimal distance of the hull of any projective Reed-Muller code. Motivated by Nathan Kaplan and Jon-Lark Kim \cite{kaplankim}, we extend their results and calculate the dimension of the hulls of projective Reed-Muller codes in a larger range. We also analyse two special classes of PRM codes apart from the self-dual, self-orthogonal and LCD cases, which Kaplan and Kim \cite[Section 3]{kaplankim} did not consider.
PointOBB-v2: Towards Simpler, Faster, and Stronger Single Point Supervised Oriented Object Detection
Botao Ren, Xue Yang, Yi Yu, Junwei Luo, Zhidong Deng
Oct 11 2024 cs.CV cs.AI arXiv:2410.08210v1

@misc{2410.08210, author = {Botao Ren and Xue Yang and Yi Yu and Junwei Luo and Zhidong Deng}, title = {{P}oint{OBB}-v2: {T}owards {S}impler, {F}aster, and {S}tronger {S}ingle {P}oint {S}upervised {O}riented {O}bject {D}etection}, year = {2024}, eprint = {2410.08210}, note = {arXiv:2410.08210v1} }
Single point supervised oriented object detection has gained attention and made initial progress within the community. Unlike approaches relying on one-shot samples or powerful pretrained models (e.g., SAM), PointOBB has shown promise due to its prior-free nature. In this paper, we propose PointOBB-v2, a simpler, faster, and stronger method to generate pseudo rotated boxes from points without relying on any other prior. Specifically, we first generate a Class Probability Map (CPM) by training the network with non-uniform positive and negative sampling. We show that the CPM is able to learn the approximate object regions and their contours. Then, Principal Component Analysis (PCA) is applied to accurately estimate the orientation and the boundary of objects. By further incorporating a separation mechanism, we resolve the confusion caused by the overlapping on the CPM, enabling its operation in high-density scenarios. Extensive comparisons demonstrate that our method achieves a training speed 15.58x faster and an accuracy improvement of 11.60%/25.15%/21.19% on the DOTA-v1.0/v1.5/v2.0 datasets compared to the previous state-of-the-art, PointOBB. This significantly advances the cutting edge of single point supervised oriented detection in the modular track.
DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
Yiming Huang, Jianwen Luo, Yan Yu, Yitong Zhang, Fangyu Lei, Yifan Wei, Shizhu He, Lifu Huang, Xiao Liu, Jun Zhao, Kang Liu
Oct 11 2024 cs.CL cs.AI arXiv:2410.07331v2

@misc{2410.07331, author = {Yiming Huang and Jianwen Luo and Yan Yu and Yitong Zhang and Fangyu Lei and Yifan Wei and Shizhu He and Lifu Huang and Xiao Liu and Jun Zhao and Kang Liu}, title = {{DA}-{C}ode: {A}gent {D}ata {S}cience {C}ode {G}eneration {B}enchmark for {L}arge {L}anguage {M}odels}, year = {2024}, eprint = {2410.07331}, note = {arXiv:2410.07331v2} }
We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced coding skills in grounding and planning. Second, examples in DA-Code are all based on real and diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex data science programming languages to perform intricate data processing and derive the answers. We set up the benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators meticulously design the evaluation suite to ensure the accuracy and robustness of the evaluation. We develop the DA-Agent baseline. Experiments show that although the baseline performs better than other existing frameworks, using the current best LLMs achieves only 30.5% accuracy, leaving ample room for improvement. We release our benchmark at https://da-code-bench.github.io.
Retrieving, Rethinking and Revising: The Chain-of-Verification Can Improve Retrieval Augmented Generation
Bolei He, Nuo Chen, Xinran He, Lingyong Yan, Zhenkai Wei, Jinchang Luo, Zhen-Hua Ling
Oct 10 2024 cs.CL cs.AI arXiv:2410.05801v1

@misc{2410.05801, author = {Bolei He and Nuo Chen and Xinran He and Lingyong Yan and Zhenkai Wei and Jinchang Luo and Zhen-Hua Ling}, title = {{R}etrieving, {R}ethinking and {R}evising: {T}he {C}hain-of-{V}erification {C}an {I}mprove {R}etrieval {A}ugmented {G}eneration}, year = {2024}, eprint = {2410.05801}, note = {arXiv:2410.05801v1} }
Recent Retrieval Augmented Generation (RAG) aims to enhance Large Language Models (LLMs) by incorporating extensive knowledge retrieved from external sources. However, such an approach encounters some challenges: first, the original queries may not be suitable for precise retrieval, resulting in erroneous contextual knowledge; second, the language model can easily generate answers inconsistent with external references due to its knowledge boundary limitation. To address these issues, we propose the chain-of-verification (CoV-RAG) to enhance the external retrieval correctness and internal generation consistency. Specifically, we integrate the verification module into the RAG, engaging in scoring, judgment, and rewriting. To correct external retrieval errors, CoV-RAG retrieves new knowledge using a revised query. To correct internal generation errors, we unify QA and verification tasks with a Chain-of-Thought (CoT) reasoning during training. Our comprehensive experiments across various LLMs demonstrate the effectiveness and adaptability compared with other strong baselines. In particular, our CoV-RAG can significantly surpass the state-of-the-art baselines using different LLM backbones.
SwiftQueue: Optimizing Low-Latency Applications with Swift Packet Queuing
Siddhant Ray, Xi Jiang, Jack Luo, Nick Feamster, Junchen Jiang
Oct 10 2024 cs.NI cs.LG arXiv:2410.06112v1

@misc{2410.06112, author = {Siddhant Ray and Xi Jiang and Jack Luo and Nick Feamster and Junchen Jiang}, title = {{S}wift{Q}ueue: {O}ptimizing {L}ow-{L}atency {A}pplications with {S}wift {P}acket {Q}ueuing}, year = {2024}, eprint = {2410.06112}, note = {arXiv:2410.06112v1} }
Low Latency, Low Loss, and Scalable Throughput (L4S), as an emerging router-queue management technique, has seen steady deployment in the industry. An L4S-enabled router assigns each packet to the queue based on the packet header marking. Currently, L4S employs per-flow queue selection, i.e., all packets of a flow are marked the same way and thus use the same queues, even though each packet is marked separately. However, this may hurt tail latency and latency-sensitive applications because transient congestion and queue buildups may only affect a fraction of packets in a flow. We present SwiftQueue, a new L4S queue-selection strategy in which a sender uses a novel per-packet latency predictor to pinpoint which packets likely have latency spikes or drops. The insight is that many packet-level latency variations result from complex interactions among recent packets at shared router queues. Yet, these intricate packet-level latency patterns are hard to learn efficiently by traditional models. Instead, SwiftQueue uses a custom Transformer, which is well-studied for its expressiveness on sequential patterns, to predict the next packet's latency based on the latencies of recently received ACKs. Based on the predicted latency of each outgoing packet, SwiftQueue's sender dynamically marks the L4S packet header to assign packets to potentially different queues, even within the same flow. Using real network traces, we show that SwiftQueue is 45-65% more accurate in predicting latency and its variations than state-of-the-art methods. Based on its latency prediction, SwiftQueue reduces the tail latency for L4S-enabled flows by 36-45%, compared with the existing L4S queue-selection method.
Unsupervised Model Diagnosis
Yinong Oliver Wang, Eileen Li, Jinqi Luo, Zhaoning Wang, Fernando De la Torre
Oct 10 2024 cs.CV cs.AI cs.CL cs.LG arXiv:2410.06243v1

@misc{2410.06243, author = {Yinong Oliver Wang and Eileen Li and Jinqi Luo and Zhaoning Wang and Fernando De la Torre}, title = {{U}nsupervised {M}odel {D}iagnosis}, year = {2024}, eprint = {2410.06243}, note = {arXiv:2410.06243v1} }
Ensuring model explainability and robustness is essential for reliable deployment of deep vision systems. Current methods for evaluating robustness rely on collecting and annotating extensive test sets. While this is common practice, the process is labor-intensive and expensive with no guarantee of sufficient coverage across attributes of interest. Recently, model diagnosis frameworks have emerged leveraging user inputs (e.g., text) to assess the vulnerability of the model. However, such dependence on humans can introduce bias and limitations given the domain knowledge of particular users. This paper proposes Unsupervised Model Diagnosis (UMO), which leverages generative models to produce semantic counterfactual explanations without any user guidance. Given a differentiable computer vision model (i.e., the target model), UMO optimizes for the most counterfactual directions in a generative latent space. Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources, such as dictionaries or language models. We validate the framework on multiple vision tasks (e.g., classification, segmentation, keypoint detection). Extensive experiments show that our unsupervised discovery of semantic directions can correctly highlight spurious correlations and visualize the failure mode of target models without any human intervention.
SePPO: Semi-Policy Preference Optimization for Diffusion Alignment
Daoan Zhang, Guangchen Lan, Dong-Jun Han, Wenlin Yao, Xiaoman Pan, Hongming Zhang, Mingxiao Li, Pengcheng Chen, Yu Dong, Christopher Brinton, Jiebo Luo
Oct 08 2024 cs.CV cs.LG arXiv:2410.05255v1

@misc{2410.05255, author = {Daoan Zhang and Guangchen Lan and Dong-Jun Han and Wenlin Yao and Xiaoman Pan and Hongming Zhang and Mingxiao Li and Pengcheng Chen and Yu Dong and Christopher Brinton and Jiebo Luo}, title = {{S}e{PPO}: {S}emi-{P}olicy {P}reference {O}ptimization for {D}iffusion {A}lignment}, year = {2024}, eprint = {2410.05255}, note = {arXiv:2410.05255v1} }
Reinforcement learning from human feedback (RLHF) methods are emerging as a way to fine-tune diffusion models (DMs) for visual generation. However, commonly used on-policy strategies are limited by the generalization capability of the reward model, while off-policy approaches require large amounts of difficult-to-obtain paired human-annotated data, particularly in visual generation tasks. To address the limitations of both on- and off-policy RLHF, we propose a preference optimization method that aligns DMs with preferences without relying on reward models or paired human-annotated data. Specifically, we introduce a Semi-Policy Preference Optimization (SePPO) method. SePPO leverages previous checkpoints as reference models while using them to generate on-policy reference samples, which replace "losing images" in preference pairs. This approach allows us to optimize using only off-policy "winning images." Furthermore, we design a strategy for reference model selection that expands the exploration in the policy space. Notably, we do not simply treat reference samples as negative examples for learning. Instead, we design an anchor-based criterion to assess whether the reference samples are likely to be winning or losing images, allowing the model to selectively learn from the generated reference samples. This approach mitigates performance degradation caused by the uncertainty in reference sample quality. We validate SePPO across both text-to-image and text-to-video benchmarks. SePPO surpasses all previous approaches on the text-to-image benchmarks and also demonstrates outstanding performance on the text-to-video benchmarks. Code will be released at https://github.com/DwanZhang-AI/SePPO.
Residual Policy Learning for Perceptive Quadruped Control Using Differentiable Simulation
Jing Yuan Luo, Yunlong Song, Victor Klemm, Fan Shi, Davide Scaramuzza, Marco Hutter
Oct 07 2024 cs.RO arXiv:2410.03076v1

@misc{2410.03076, author = {Jing Yuan Luo and Yunlong Song and Victor Klemm and Fan Shi and Davide Scaramuzza and Marco Hutter}, title = {{R}esidual {P}olicy {L}earning for {P}erceptive {Q}uadruped {C}ontrol {U}sing {D}ifferentiable {S}imulation}, year = {2024}, eprint = {2410.03076}, note = {arXiv:2410.03076v1} }
First-order Policy Gradient (FoPG) algorithms such as Backpropagation through Time and Analytical Policy Gradients leverage local simulation physics to accelerate policy search, significantly improving sample efficiency in robot control compared to standard model-free reinforcement learning. However, FoPG algorithms can exhibit poor learning dynamics in contact-rich tasks like locomotion. Previous approaches address this issue by alleviating contact dynamics via algorithmic or simulation innovations. In contrast, we propose guiding the policy search by learning a residual over a simple baseline policy. For quadruped locomotion, we find that the role of residual policy learning in FoPG-based training (FoPG RPL) is primarily to improve asymptotic rewards, compared to improving sample efficiency for model-free RL. Additionally, we provide insights on applying FoPGs to pixel-based local navigation, training a point-mass robot to convergence within seconds. Finally, we showcase the versatility of FoPG RPL by using it to train locomotion and perceptive navigation end-to-end on a quadruped in minutes.
On the Cost of Consecutive Estimation Error: Significance-Aware Non-linear Aging
Jiping Luo, Nikolaos Pappas
Oct 07 2024 cs.IT cs.SY eess.SY math.IT arXiv:2410.03637v1

@misc{2410.03637, author = {Jiping Luo and Nikolaos Pappas}, title = {{O}n the {C}ost of {C}onsecutive {E}stimation {E}rror: {S}ignificance-{A}ware {N}on-linear {A}ging}, year = {2024}, eprint = {2410.03637}, note = {arXiv:2410.03637v1} }
PDF
This paper considers the semantics-aware remote state estimation of an asymmetric Markov chain with prioritized states. Due to resource constraints, the sensor needs to trade between estimation quality and communication cost. The aim is to exploit the significance of information through the history of system realizations to determine the optimal timing of transmission, thereby reducing the amount of uninformative data transmitted in the network. To this end, we introduce a new metric, the significance-aware Age of Consecutive Error (AoCE), that captures two semantic attributes: the significance of estimation error and the cost of consecutive error. Different costs and non-linear age functions are assigned to different estimation errors to account for their relative importance to system performance. We identify the optimal transmission problem as a countably infinite state Markov decision process (MDP) with unbounded costs. We first give sufficient conditions on the age functions, source pattern, and channel reliability under which an optimal policy with bounded average cost exists. We show that the optimal policy exhibits a switching structure. That is, the sensor triggers a transmission only when the system has been trapped in an error for a certain number of consecutive time slots. We also provide sufficient conditions under which the switching policy degenerates into a simple threshold policy, i.e., featuring identical thresholds for all estimation errors. Furthermore, we exploit the structural properties and develop a structured policy iteration (SPI) algorithm that considerably reduces computation overhead. Numerical results show that the optimal policy outperforms the classic rule-, distortion- and age-based policies. An important takeaway is that the more semantic attributes we utilize, the fewer transmissions are needed.
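The switching structure described above (transmit only after an error has persisted for a per-error number of consecutive slots) can be sketched as follows. This is an illustrative toy, not the paper's algorithm: the error labels, the threshold values, and the assumption that every transmission succeeds and clears the error are all hypothetical.

```python
from typing import Dict, List, Optional


def should_transmit(error_type: Optional[str], consecutive_slots: int,
                    thresholds: Dict[str, int]) -> bool:
    """Switching rule: send once the current error has lasted long enough."""
    if error_type is None:  # estimate is correct, nothing to fix
        return False
    return consecutive_slots >= thresholds[error_type]


def run_policy(trace: List[Optional[str]], thresholds: Dict[str, int]) -> List[bool]:
    """Apply the rule to a per-slot trace of error types (None means no error)."""
    decisions = []
    current, count = None, 0
    for err in trace:
        if err is None:
            current, count = None, 0
        elif err == current:
            count += 1
        else:
            current, count = err, 1
        send = should_transmit(current, count, thresholds)
        decisions.append(send)
        if send:
            # Assumption: the transmission succeeds and resets the error age.
            current, count = None, 0
    return decisions


# Hypothetical error types: a costly error is corrected at once, a cheap one is tolerated.
thresholds = {"minor": 3, "critical": 1}
trace = ["minor", "minor", "minor", None, "critical", "minor"]
decisions = run_policy(trace, thresholds)
```

Setting every entry of `thresholds` to the same value gives the simple threshold policy mentioned in the abstract.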
D(R, O) Grasp: A Unified Representation of Robot and Object Interaction for Cross-Embodiment Dexterous Grasping
Zhenyu Wei, Zhixuan Xu, Jingxiang Guo, Yiwen Hou, Chongkai Gao, Zhehao Cai, Jiayu Luo, Lin Shao
Oct 03 2024 cs.RO arXiv:2410.01702v3

@misc{2410.01702, author = {Zhenyu Wei and Zhixuan Xu and Jingxiang Guo and Yiwen Hou and Chongkai Gao and Zhehao Cai and Jiayu Luo and Lin Shao}, title = {{D}({R}, {O}) {G}rasp: {A} {U}nified {R}epresentation of {R}obot and {O}bject {I}nteraction for {C}ross-{E}mbodiment {D}exterous {G}rasping}, year = {2024}, eprint = {2410.01702}, note = {arXiv:2410.01702v3} }
PDF
Dexterous grasping is a fundamental yet challenging skill in robotic manipulation, requiring precise interaction between robotic hands and objects. In this paper, we present D(R,O) Grasp, a novel framework that models the interaction between the robotic hand in its grasping pose and the object, enabling broad generalization across various robot hands and object geometries. Our model takes the robot hand's description and object point cloud as inputs and efficiently predicts kinematically valid and stable grasps, demonstrating strong adaptability to diverse robot embodiments and object geometries. Extensive experiments conducted in both simulated and real-world environments validate the effectiveness of our approach, with significant improvements in success rate, grasp diversity, and inference speed across multiple robotic hands. Our method achieves an average success rate of 87.53% in simulation in less than one second, tested across three different dexterous robotic hands. In real-world experiments using the LeapHand, the method also demonstrates an average success rate of 89%. D(R,O) Grasp provides a robust solution for dexterous grasping in complex and varied environments. The code, appendix, and videos are available on our project website at https://nus-lins-lab.github.io/drograspweb/.
PersonaMath: Enhancing Math Reasoning through Persona-Driven Data Augmentation
Jing Luo, Run Luo, Longze Chen, Liang Zhu, Chang Ao, Jiaming Li, Yukun Chen, Xin Cheng, Wen Yang, Jiayuan Su, Chengming Li, Min Yang
Oct 03 2024 cs.CL arXiv:2410.01504v1

@misc{2410.01504, author = {Jing Luo and Run Luo and Longze Chen and Liang Zhu and Chang Ao and Jiaming Li and Yukun Chen and Xin Cheng and Wen Yang and Jiayuan Su and Chengming Li and Min Yang}, title = {{P}ersona{M}ath: {E}nhancing {M}ath {R}easoning through {P}ersona-{D}riven {D}ata {A}ugmentation}, year = {2024}, eprint = {2410.01504}, note = {arXiv:2410.01504v1} }
PDF
While closed-source Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities, open-source models continue to struggle with such tasks. To bridge this gap, we propose a data augmentation approach and introduce PersonaMathQA, a dataset derived from MATH and GSM8K, on which we train the PersonaMath models. Our approach consists of two stages: the first stage is learning from Persona Diversification, and the second stage is learning from Reflection. In the first stage, we regenerate detailed chain-of-thought (CoT) solutions as instructions using a closed-source LLM and introduce a novel persona-driven data augmentation technique to enhance the dataset's quantity and diversity. In the second stage, we incorporate reflection to fully leverage more challenging and valuable questions. Evaluation of our PersonaMath models on MATH and GSM8K reveals that the PersonaMath-7B model (based on LLaMA-2-7B) achieves an accuracy of 24.2% on MATH and 68.7% on GSM8K, surpassing all baseline methods and achieving state-of-the-art performance. Notably, our dataset contains only 70.3K data points (merely 17.8% of MetaMathQA and 27% of MathInstruct), yet our model outperforms these baselines, demonstrating the high quality and diversity of our dataset, which enables more efficient model training. We open-source the PersonaMathQA dataset, PersonaMath models, and our code for public usage.
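As a quick sanity check on the size comparison above, the quoted percentages imply approximate sizes for the two baseline datasets. The sketch below only rearranges the figures stated in the abstract; the variable names are illustrative and the results are rounded estimates, not numbers reported by the authors.

```python
# Figures quoted in the abstract.
persona_math_qa_k = 70.3          # dataset size in thousands of data points
share_of_metamathqa = 0.178       # 17.8% of MetaMathQA
share_of_mathinstruct = 0.27      # 27% of MathInstruct

# size_of_baseline = size_of_ours / share
implied_metamathqa_k = persona_math_qa_k / share_of_metamathqa
implied_mathinstruct_k = persona_math_qa_k / share_of_mathinstruct

print(f"Implied MetaMathQA size:   ~{implied_metamathqa_k:.0f}K")
print(f"Implied MathInstruct size: ~{implied_mathinstruct_k:.0f}K")
```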
Investigating the Impact of Model Complexity in Large Language Models
Jing Luo, Huiyuan Wang, Weiran Huang
Oct 02 2024 cs.LG stat.ML arXiv:2410.00699v1

@misc{2410.00699, author = {Jing Luo and Huiyuan Wang and Weiran Huang}, title = {{I}nvestigating the {I}mpact of {M}odel {C}omplexity in {L}arge {L}anguage {M}odels}, year = {2024}, eprint = {2410.00699}, note = {arXiv:2410.00699v1} }
PDF
Large Language Models (LLMs) based on the pre-training and fine-tuning paradigm have become pivotal in solving natural language processing tasks, consistently achieving state-of-the-art performance. Nevertheless, the theoretical understanding of how model complexity influences fine-tuning performance remains challenging and has not been well explored yet. In this paper, we focus on autoregressive LLMs and propose to employ Hidden Markov Models (HMMs) to model them. Based on the HMM modeling, we investigate the relationship between model complexity and the generalization capability in downstream tasks. Specifically, we consider a popular tuning paradigm for downstream tasks, head tuning, where all pre-trained parameters are frozen and only individual heads are trained atop pre-trained LLMs. Our theoretical analysis reveals that the risk initially increases and then decreases with rising model complexity, showcasing a "double descent" phenomenon. In this case, the initial "descent" is degenerate, signifying that the "sweet spot" where bias and variance are balanced occurs when the model size is zero. Reaching the conclusion presented in this study involves several challenges, primarily revolving around effectively modeling autoregressive LLMs and downstream tasks, as well as conducting a comprehensive risk analysis for multivariate regression. Our research is substantiated by experiments conducted on data generated from HMMs, which provide empirical support for, and align with, our theoretical insights.
Towards Energy- and Cost-Efficient 6G Networks
Tommy Azzino, Aria HasanzadeZonuzy, Jianghong Luo, Navid Abedini, Tao Luo
Oct 01 2024 cs.NI cs.SY eess.SY arXiv:2409.19121v1

@misc{2409.19121, author = {Tommy Azzino and Aria HasanzadeZonuzy and Jianghong Luo and Navid Abedini and Tao Luo}, title = {{T}owards {E}nergy- and {C}ost-{E}fficient 6{G} {N}etworks}, year = {2024}, eprint = {2409.19121}, note = {arXiv:2409.19121v1} }
PDF
As the world enters the journey toward the 6th generation (6G) of wireless technology, the promises of ultra-high data rates, unprecedented low latency, and a massive surge in connected devices require crucial exploration of network energy saving (NES) solutions to minimize the carbon footprint and overall energy usage of future cellular networks. On the other hand, network-controlled repeaters (NCRs) have been introduced by the 3rd Generation Partnership Project (3GPP) as a cost-effective solution to improve network coverage. However, their impact on network power consumption and energy efficiency has not been thoroughly investigated. This paper studies NES schemes for next-generation 6G networks aided by NCRs and proposes optimal NES strategies aiming at maximizing the overall energy efficiency of the network. Repeaters are shown to allow for power savings at the next-generation nodeB (gNB), and offer higher overall energy efficiency (EE) and spectral efficiency (SE), thus providing an energy-efficient and cost-efficient alternative to increase the performance of future 6G networks.
DanceCamAnimator: Keyframe-Based Controllable 3D Dance Camera Synthesis
Zixuan Wang, Jiayi Li, Xiaoyu Qin, Shikun Sun, Songtao Zhou, Jia Jia, Jiebo Luo
Sep 24 2024 cs.CV cs.MM arXiv:2409.14925v1

@misc{2409.14925, author = {Zixuan Wang and Jiayi Li and Xiaoyu Qin and Shikun Sun and Songtao Zhou and Jia Jia and Jiebo Luo}, title = {{D}ance{C}am{A}nimator: {K}eyframe-{B}ased {C}ontrollable 3{D} {D}ance {C}amera {S}ynthesis}, year = {2024}, eprint = {2409.14925}, doi = {10.1145/3664647.3680980}, note = {arXiv:2409.14925v1} }
PDF
Synthesizing camera movements from music and dance is highly challenging due to the contradicting requirements and complexities of dance cinematography. Unlike human movements, which are always continuous, dance camera movements involve both continuous sequences of variable lengths and sudden drastic changes to simulate the switching of multiple cameras. However, in previous works, every camera frame is equally treated and this causes jittering and unavoidable smoothing in post-processing. To solve these problems, we propose to integrate animator dance cinematography knowledge by formulating this task as a three-stage process: keyframe detection, keyframe synthesis, and tween function prediction. Following this formulation, we design a novel end-to-end dance camera synthesis framework DanceCamAnimator, which imitates human animation procedures and shows powerful keyframe-based controllability with variable lengths. Extensive experiments on the DCM dataset demonstrate that our method surpasses previous baselines quantitatively and qualitatively. Code will be available at https://github.com/Carmenw1203/DanceCamAnimator-Official.
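To illustrate what a tween function does in the keyframe formulation above, here is a minimal sketch that fills in a single camera parameter between keyframes. The paper predicts tween functions with a model; the fixed `smoothstep` easing curve, the function names, and the single scalar parameter used here are assumptions chosen only to keep the example small.

```python
from typing import Callable, List, Tuple


def smoothstep(t: float) -> float:
    """Ease-in/ease-out curve mapping [0, 1] to [0, 1] (a stand-in tween function)."""
    return t * t * (3.0 - 2.0 * t)


def interpolate(keyframes: List[Tuple[int, float]], frame: int,
                tween: Callable[[float], float] = smoothstep) -> float:
    """Value of one camera parameter at `frame`, given sorted (frame, value) keyframes."""
    if frame <= keyframes[0][0]:
        return keyframes[0][1]
    if frame >= keyframes[-1][0]:
        return keyframes[-1][1]
    for (f0, v0), (f1, v1) in zip(keyframes, keyframes[1:]):
        if f0 <= frame <= f1:
            t = (frame - f0) / (f1 - f0)          # normalized position in the segment
            return v0 + (v1 - v0) * tween(t)      # tween shapes the in-between motion
    raise ValueError("keyframes must be sorted by frame index")


# Hypothetical keyframes for a camera's x-position.
keys = [(0, 0.0), (10, 1.0), (20, 1.0)]
midpoint = interpolate(keys, 5)
```

Swapping `tween` for a different curve changes how the camera moves between the same keyframes, which is the degree of freedom the tween-prediction stage controls.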
End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting
Yongqi Wang, Shuo Yang, Xinxiao Wu, Jiebo Luo
Sep 20 2024 cs.CV arXiv:2409.12499v1

@misc{2409.12499, author = {Yongqi Wang and Shuo Yang and Xinxiao Wu and Jiebo Luo}, title = {{E}nd-to-end {O}pen-vocabulary {V}ideo {V}isual {R}elationship {D}etection using {M}ulti-modal {P}rompting}, year = {2024}, eprint = {2409.12499}, note = {arXiv:2409.12499v1} }
PDF
Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories by detecting unseen relationships between both seen and unseen objects in videos. Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories, and then feed these trajectories into large-scale pre-trained vision-language models to achieve open-vocabulary classification. Such heavy dependence on the pre-trained trajectory detectors limits their ability to generalize to novel object categories, leading to performance degradation. To address this challenge, we propose to unify object trajectory detection and relationship classification into an end-to-end open-vocabulary framework. Under this framework, we propose a relationship-aware open-vocabulary trajectory detector. It primarily consists of a query-based Transformer decoder, where the visual encoder of CLIP is distilled for frame-wise open-vocabulary object detection, and a trajectory associator. To exploit relationship context during trajectory detection, a relationship query is embedded into the Transformer decoder, and accordingly, an auxiliary relationship loss is designed to enable the decoder to perceive the relationships between objects explicitly. Moreover, we propose an open-vocabulary relationship classifier that leverages the rich semantic knowledge of CLIP to discover novel relationships. To adapt CLIP well to relationship classification, we design a multi-modal prompting method that employs spatio-temporal visual prompting for visual representation and vision-guided language prompting for language input. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our framework. Our framework is also applied to a more difficult cross-dataset scenario to further demonstrate its generalization ability.
Depth from Coupled Optical Differentiation
Junjie Luo, Yuxuan Liu, Emma Alexander, Qi Guo
Sep 18 2024 cs.CV eess.IV arXiv:2409.10725v1

@misc{2409.10725, author = {Junjie Luo and Yuxuan Liu and Emma Alexander and Qi Guo}, title = {{D}epth from {C}oupled {O}ptical {D}ifferentiation}, year = {2024}, eprint = {2409.10725}, note = {arXiv:2409.10725v1} }
PDF
We propose depth from coupled optical differentiation, a low-computation passive-lighting 3D sensing mechanism. It is based on our discovery that per-pixel object distance can be rigorously determined by a coupled pair of optical derivatives of a defocused image using a simple, closed-form relationship. Unlike previous depth-from-defocus (DfD) methods that leverage spatial derivatives of the image to estimate scene depths, the proposed mechanism's use of only optical derivatives makes it significantly more robust to noise. Furthermore, unlike many previous DfD algorithms with requirements on aperture code, this relationship is proved to be universal to a broad range of aperture codes. We build the first 3D sensor based on depth from coupled optical differentiation. Its optical assembly includes a deformable lens and a motorized iris, which enables dynamic adjustments to the optical power and aperture radius. The sensor captures two pairs of images: one pair with a differential change of optical power and the other with a differential change of aperture scale. From the four images, a depth and confidence map can be generated with only 36 floating point operations per output pixel (FLOPOP), more than ten times lower than the previous lowest passive-lighting depth sensing solution to our knowledge. Additionally, the depth map generated by the proposed sensor demonstrates more than twice the working range of previous DfD methods while using significantly lower computation.
Semantics Preserving Emoji Recommendation with Large Language Models
Zhongyi Qiu, Kangyi Qiu, Hanjia Lyu, Wei Xiong, Jiebo Luo
Sep 18 2024 cs.CL cs.SI arXiv:2409.10760v1

@misc{2409.10760, author = {Zhongyi Qiu and Kangyi Qiu and Hanjia Lyu and Wei Xiong and Jiebo Luo}, title = {{S}emantics {P}reserving {E}moji {R}ecommendation with {L}arge {L}anguage {M}odels}, year = {2024}, eprint = {2409.10760}, note = {arXiv:2409.10760v1} }
PDF
Emojis have become an integral part of digital communication, enriching text by conveying emotions, tone, and intent. Existing emoji recommendation methods are primarily evaluated based on their ability to match the exact emoji a user chooses in the original text. However, they ignore the essence of users' behavior on social media in that each text can correspond to multiple reasonable emojis. To better assess a model's ability to align with such real-world emoji usage, we propose a new semantics preserving evaluation framework for emoji recommendation, which measures a model's ability to recommend emojis that maintain the semantic consistency with the user's text. To evaluate how well a model preserves semantics, we assess whether the predicted affective state, demographic profile, and attitudinal stance of the user remain unchanged. If these attributes are preserved, we consider the recommended emojis to have maintained the original semantics. The advanced abilities of Large Language Models (LLMs) in understanding and generating nuanced, contextually relevant output make them well-suited for handling the complexities of semantics preserving emoji recommendation. To this end, we construct a comprehensive benchmark to systematically assess the performance of six proprietary and open-source LLMs using different prompting techniques on our task. Our experiments demonstrate that GPT-4o outperforms other LLMs, achieving a semantics preservation score of 79.23%. Additionally, we conduct case studies to analyze model biases in downstream classification tasks and evaluate the diversity of the recommended emojis.
Shaking the Fake: Detecting Deepfake Videos in Real Time via Active Probes
Zhixin Xie, Jun Luo
Sep 18 2024 cs.CV cs.AI cs.CR arXiv:2409.10889v1

@misc{2409.10889, author = {Zhixin Xie and Jun Luo}, title = {{S}haking the {F}ake: {D}etecting {D}eepfake {V}ideos in {R}eal {T}ime via {A}ctive {P}robes}, year = {2024}, eprint = {2409.10889}, note = {arXiv:2409.10889v1} }
PDF
Real-time deepfake, a type of generative AI, is capable of "creating" non-existent content (e.g., swapping one's face with another) in a video. It has been, very unfortunately, misused to produce deepfake videos (during web conferences, video calls, and identity authentication) for malicious purposes, including financial scams and political misinformation. Deepfake detection, as the countermeasure against deepfake, has attracted considerable attention from the academic community, yet existing works typically rely on learning passive features that may perform poorly beyond seen datasets. In this paper, we propose SFake, a new real-time deepfake detection method that innovatively exploits deepfake models' inability to adapt to physical interference. Specifically, SFake actively sends probes to trigger mechanical vibrations on the smartphone, producing a controllable feature in the footage. Consequently, SFake determines whether the face is swapped by deepfake based on the consistency of the facial area with the probe pattern. We implement SFake, evaluate its effectiveness on a self-built dataset, and compare it with six other detection methods. The results show that SFake outperforms other detection methods with higher detection accuracy, faster processing speed, and lower memory consumption.
DELTA: Dual Consistency Delving with Topological Uncertainty for Active Graph Domain Adaptation
Pengyun Wang, Yadi Cao, Chris Russell, Siyu Heng, Junyu Luo, Yanxin Shen, Xiao Luo
Sep 16 2024 cs.LG cs.SI arXiv:2409.08946v1

@misc{2409.08946, author = {Pengyun Wang and Yadi Cao and Chris Russell and Siyu Heng and Junyu Luo and Yanxin Shen and Xiao Luo}, title = {{DELTA}: {D}ual {C}onsistency {D}elving with {T}opological {U}ncertainty for {A}ctive {G}raph {D}omain {A}daptation}, year = {2024}, eprint = {2409.08946}, note = {arXiv:2409.08946v1} }
PDF
Graph domain adaptation has recently enabled knowledge transfer across different graphs. However, without the semantic information on target graphs, the performance on target graphs is still far from satisfactory. To address the issue, we study the problem of active graph domain adaptation, which selects a small quantity of informative nodes on the target graph for extra annotation. This problem is highly challenging due to the complicated topological relationships and the distribution discrepancy across graphs. In this paper, we propose a novel approach named Dual Consistency Delving with Topological Uncertainty (DELTA) for active graph domain adaptation. Our DELTA consists of an edge-oriented graph subnetwork and a path-oriented graph subnetwork, which can explore topological semantics from complementary perspectives. In particular, our edge-oriented graph subnetwork utilizes the message passing mechanism to learn neighborhood information, while our path-oriented graph subnetwork explores high-order relationships from substructures. To jointly learn from two subnetworks, we roughly select informative candidate nodes with the consideration of consistency across two subnetworks. Then, we aggregate local semantics from each candidate's K-hop subgraph based on node degrees for topological uncertainty estimation. To overcome potential distribution shifts, we compare target nodes and their corresponding source nodes for discrepancy scores as an additional component for fine selection. Extensive experiments on benchmark datasets demonstrate that DELTA outperforms various state-of-the-art approaches.
Learning Brain Tumor Representation in 3D High-Resolution MR Images via Interpretable State Space Models
Qingqiao Hu, Daoan Zhang, Jiebo Luo, Zhenyu Gong, Benedikt Wiestler, Jianguo Zhang, Hongwei Bran Li
Sep 13 2024 cs.CV arXiv:2409.07746v1

@misc{2409.07746, author = {Qingqiao Hu and Daoan Zhang and Jiebo Luo and Zhenyu Gong and Benedikt Wiestler and Jianguo Zhang and Hongwei Bran Li}, title = {{L}earning {B}rain {T}umor {R}epresentation in 3{D} {H}igh-{R}esolution {MR} {I}mages via {I}nterpretable {S}tate {S}pace {M}odels}, year = {2024}, eprint = {2409.07746}, note = {arXiv:2409.07746v1} }
PDF
Learning meaningful and interpretable representations from high-dimensional volumetric magnetic resonance (MR) images is essential for advancing personalized medicine. While Vision Transformers (ViTs) have shown promise in handling image data, their application to 3D multi-contrast MR images faces challenges due to computational complexity and interpretability. To address this, we propose a novel state-space-model (SSM)-based masked autoencoder which scales ViT-like models to handle high-resolution data effectively while also enhancing the interpretability of learned representations. We propose a latent-to-spatial mapping technique that enables direct visualization of how latent features correspond to specific regions in the input volumes in the context of SSM. We validate our method on two key neuro-oncology tasks: identification of isocitrate dehydrogenase mutation status and 1p/19q co-deletion classification, achieving state-of-the-art accuracy. Our results highlight the potential of SSM-based self-supervised learning to transform radiomics analysis by combining efficiency and interpretability.
Translating Step-by-Step: Decomposing the Translation Process for Improved Translation Quality of Long-Form Texts
Eleftheria Briakou, Jiaming Luo, Colin Cherry, Markus Freitag
Sep 12 2024 cs.CL arXiv:2409.06790v1

@misc{2409.06790, author = {Eleftheria Briakou and Jiaming Luo and Colin Cherry and Markus Freitag}, title = {{T}ranslating {S}tep-by-{S}tep: {D}ecomposing the {T}ranslation {P}rocess for {I}mproved {T}ranslation {Q}uality of {L}ong-{F}orm {T}exts}, year = {2024}, eprint = {2409.06790}, note = {arXiv:2409.06790v1} }
PDF
In this paper we present a step-by-step approach to long-form text translation, drawing on established processes in translation studies. Instead of viewing machine translation as a single, monolithic task, we propose a framework that engages language models in a multi-turn interaction, encompassing pre-translation research, drafting, refining, and proofreading, resulting in progressively improved translations. Extensive automatic evaluations using Gemini 1.5 Pro across ten language pairs show that translating step-by-step yields large translation quality improvements over conventional zero-shot prompting approaches and earlier human-like baseline strategies, resulting in state-of-the-art results on WMT2024.
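The multi-turn structure described above can be outlined as a simple loop over the four stages, where each turn sees the full conversation so far. This is only a sketch under stated assumptions: `llm` is a hypothetical callable standing in for any chat model, and the bracketed stage prompts are placeholders rather than the prompts used in the paper.

```python
from typing import Callable, List

# The four stages named in the abstract, in order.
STAGES = ["research", "draft", "refine", "proofread"]


def translate_step_by_step(source_text: str,
                           llm: Callable[[List[str]], str]) -> str:
    """Run the stages as one multi-turn conversation and return the last reply."""
    conversation: List[str] = []
    reply = ""
    for stage in STAGES:
        # Placeholder prompt; a real system would use a stage-specific instruction.
        conversation.append(f"[{stage}] source: {source_text}")
        reply = llm(conversation)      # the model sees all earlier turns
        conversation.append(reply)
    return reply


# Hypothetical stand-in model that just reports how many turns it was shown.
def echo_model(conversation: List[str]) -> str:
    return f"reply after {len(conversation)} messages"


final_text = translate_step_by_step("Guten Morgen", echo_model)
```

Because each stage's output is appended to the conversation, later stages (refining, proofreading) can build on the research notes and draft produced earlier.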
3D Priors-Guided Diffusion for Blind Face Restoration
Xiaobin Lu, Xiaobin Hu, Jun Luo, Ben Zhu, Yaping Ruan, Wenqi Ren
Sep 04 2024 cs.CV cs.AI arXiv:2409.00991v2

@misc{2409.00991, author = {Xiaobin Lu and Xiaobin Hu and Jun Luo and Ben Zhu and Yaping Ruan and Wenqi Ren}, title = {3{D} {P}riors-{G}uided {D}iffusion for {B}lind {F}ace {R}estoration}, year = {2024}, eprint = {2409.00991}, note = {arXiv:2409.00991v2} }
PDF
Blind face restoration endeavors to restore a clear face image from a degraded counterpart. Recent approaches employing Generative Adversarial Networks (GANs) as priors have demonstrated remarkable success in this field. However, these methods encounter challenges in achieving a balance between realism and fidelity, particularly in complex degradation scenarios. To inherit the exceptional realistic generative ability of the diffusion model while remaining constrained by identity-aware fidelity, we propose a novel diffusion-based framework that embeds 3D facial priors as structure and identity constraints into a denoising diffusion process. Specifically, in order to obtain more accurate 3D prior representations, the 3D facial image is reconstructed by a 3D Morphable Model (3DMM) using an initial restored face image that has been processed by a pretrained restoration network. A customized multi-level feature extraction method is employed to exploit both structural and identity information of 3D facial images, which are then mapped into the noise estimation process. In order to enhance the fusion of identity information into the noise estimation, we propose a Time-Aware Fusion Block (TAFB). This module offers a more efficient and adaptive fusion of weights for denoising, considering the dynamic nature of the denoising process in the diffusion model, which involves initial structure refinement followed by texture detail enhancement. Extensive experiments demonstrate that our network performs favorably against state-of-the-art algorithms on synthetic and real-world datasets for blind face restoration. The code is released on our project page at https://github.com/838143396/3Diffusion.
Mirror contrastive loss based sliding window transformer for subject-independent motor imagery based EEG signal recognition
Jing Luo, Qi Mao, Weiwei Shi, Zhenghao Shi, Xiaofan Wang, Xiaofeng Lu, Xinhong Hei
Sep 04 2024 eess.SP cs.AI cs.LG arXiv:2409.00130v1

@misc{2409.00130, author = {Jing Luo and Qi Mao and Weiwei Shi and Zhenghao Shi and Xiaofan Wang and Xiaofeng Lu and Xinhong Hei}, title = {{M}irror contrastive loss based sliding window transformer for subject-independent motor imagery based {EEG} signal recognition}, year = {2024}, eprint = {2409.00130}, note = {arXiv:2409.00130v1} }
PDF
While deep learning models have been extensively utilized in motor imagery based EEG signal recognition, they often operate as black boxes. Motivated by neurological findings indicating that the mental imagery of left or right-hand movement induces event-related desynchronization (ERD) in the contralateral sensorimotor area of the brain, we propose a Mirror Contrastive Loss based Sliding Window Transformer (MCL-SWT) to enhance subject-independent motor imagery-based EEG signal recognition. Specifically, our proposed mirror contrastive loss enhances sensitivity to the spatial location of ERD by contrasting the original EEG signals with their mirror counterparts, i.e., mirror EEG signals generated by interchanging the channels of the left and right hemispheres of the EEG signals. Moreover, we introduce a temporal sliding window transformer that computes self-attention scores from high temporal resolution features, thereby improving model performance with manageable computational complexity. We evaluate the performance of MCL-SWT on subject-independent motor imagery EEG signal recognition tasks, and our experimental results demonstrate that MCL-SWT achieved accuracies of 66.48% and 75.62%, surpassing the state-of-the-art (SOTA) model by 2.82% and 2.17%, respectively. Furthermore, ablation experiments confirm the effectiveness of the proposed mirror contrastive loss. A code demo of MCL-SWT is available at https://github.com/roniusLuo/MCL_SWT.
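The mirroring step described above (interchanging left- and right-hemisphere channels) can be sketched in a few lines. This covers only the data transformation, not the contrastive loss. The three-channel montage, the choice of the C3/C4 pair, and the function name are assumptions for illustration; the linked repository is the place to look for the authors' actual implementation.

```python
from typing import List, Sequence, Tuple


def mirror_eeg(trial: Sequence[Sequence[float]],
               channel_names: Sequence[str],
               swap_pairs: Sequence[Tuple[str, str]]) -> List[List[float]]:
    """Return a copy of `trial` (channels x samples) with left/right channels swapped."""
    index = {name: i for i, name in enumerate(channel_names)}
    mirrored = [list(row) for row in trial]       # midline channels stay unchanged
    for left, right in swap_pairs:
        i, j = index[left], index[right]
        mirrored[i], mirrored[j] = list(trial[j]), list(trial[i])
    return mirrored


# Hypothetical 3-channel trial with 2 samples per channel.
channels = ["C3", "Cz", "C4"]                     # left, midline, right
trial = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mirrored = mirror_eeg(trial, channels, swap_pairs=[("C3", "C4")])
```

A left-hand imagery trial mirrored this way resembles a right-hand one in where its ERD appears, which is what lets the original and mirrored signals be contrasted.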
X-Reflect: Cross-Reflection Prompting for Multimodal Recommendation
Hanjia Lyu, Ryan Rossi, Xiang Chen, Md Mehrab Tanjim, Stefano Petrangeli, Somdeb Sarkhel, Jiebo Luo
Aug 28 2024 cs.IR cs.CL cs.CV arXiv:2408.15172v1

@misc{2408.15172, author = {Hanjia Lyu and Ryan Rossi and Xiang Chen and Md Mehrab Tanjim and Stefano Petrangeli and Somdeb Sarkhel and Jiebo Luo}, title = {{X}-{R}eflect: {C}ross-{R}eflection {P}rompting for {M}ultimodal {R}ecommendation}, year = {2024}, eprint = {2408.15172}, note = {arXiv:2408.15172v1} }
PDF
Large Language Models (LLMs) and Large Multimodal Models (LMMs) have been shown to enhance the effectiveness of enriching item descriptions, thereby improving the accuracy of recommendation systems. However, most existing approaches either rely on text-only prompting or employ basic multimodal strategies that do not fully exploit the complementary information available from both textual and visual modalities. This paper introduces a novel framework, Cross-Reflection Prompting, termed X-Reflect, designed to address these limitations by prompting LMMs to explicitly identify and reconcile supportive and conflicting information between text and images. By capturing nuanced insights from both modalities, this approach generates more comprehensive and contextually richer item representations. Extensive experiments conducted on two widely used benchmarks demonstrate that our method outperforms existing prompting baselines in downstream recommendation accuracy. Additionally, we evaluate the generalizability of our framework across different LMM backbones and the robustness of the prompting strategies, offering insights for optimization. This work underscores the importance of integrating multimodal information and presents a novel solution for improving item understanding in multimodal recommendation systems.
Self-Parameterization Based Multi-Resolution Mesh Convolution Networks
Shi Hezi, Jiang Luo, Zheng Jianmin, Zeng Jun
Aug 27 2024 cs.CV arXiv:2408.13762v1

@misc{2408.13762, author = {Shi Hezi and Jiang Luo and Zheng Jianmin and Zeng Jun}, title = {{S}elf-{P}arameterization {B}ased {M}ulti-{R}esolution {M}esh {C}onvolution {N}etworks}, year = {2024}, eprint = {2408.13762}, howpublished = {Computer Aided Design 2023}, doi = {10.1016/j.cad.2023.103550}, note = {arXiv:2408.13762v1} }
PDF
This paper addresses the challenges of designing mesh convolution neural networks for 3D mesh dense prediction. While deep learning has achieved remarkable success in image dense prediction tasks, directly applying or extending these methods to irregular graph data, such as 3D surface meshes, is nontrivial due to the non-uniform element distribution and irregular connectivity in surface meshes which make it difficult to adapt downsampling, upsampling, and convolution operations. In addition, commonly used multiresolution networks require repeated high-to-low and then low-to-high processes to boost the performance of recovering rich, high-resolution representations. To address these challenges, this paper proposes a self-parameterization-based multi-resolution convolution network that extends existing image dense prediction architectures to 3D meshes. The novelty of our approach lies in two key aspects. First, we construct a multi-resolution mesh pyramid directly from the high-resolution input data and propose area-aware mesh downsampling/upsampling operations that use sequential bijective inter-surface mappings between different mesh resolutions. The inter-surface mapping redefines the mesh, rather than reshaping it, which thus avoids introducing unnecessary errors. Second, we maintain the high-resolution representation in the multi-resolution convolution network, enabling multi-scale fusions to exchange information across parallel multi-resolution subnetworks, rather than through connections of high-to-low resolution subnetworks in series. These features differentiate our approach from most existing mesh convolution networks and enable more accurate mesh dense predictions, which is confirmed in experiments.
Mini-Slot-Assisted Short Packet URLLC: Differential or Coherent Detection?
Canjian Zheng, Fu-Chun Zheng, Jingjing Luo, Pengcheng Zhu, Xiaohu You, Daquan Feng
Aug 27 2024 cs.IT eess.SP math.IT arXiv:2408.14089v1

@misc{2408.14089, author = {Canjian Zheng and Fu-Chun Zheng and Jingjing Luo and Pengcheng Zhu and Xiaohu You and Daquan Feng}, title = {{M}ini-{S}lot-{A}ssisted {S}hort {P}acket {URLLC}: {D}ifferential or {C}oherent {D}etection?}, year = {2024}, eprint = {2408.14089}, note = {arXiv:2408.14089v1} }
PDF
One of the primary challenges in short packet ultra-reliable and low-latency communications (URLLC) is to achieve reliable channel estimation and data detection while minimizing the impact on latency performance. Given the small packet size in mini-slot-assisted URLLC, it is almost impossible for pilot-based coherent detection alone to meet the seemingly contradictory requirements of high channel estimation accuracy, high reliability, low training overhead, and low latency. In this paper, we explore differential modulation both in the frequency domain and in the time domain, and propose adopting an adaptive approach that integrates both differential and coherent detection to achieve mini-slot-assisted short packet URLLC, striking a balance among training overhead, system performance, and computational complexity. Specifically, differential (especially in the frequency domain) and coherent detection schemes can be dynamically activated based on application scenarios, channel statistics, information payloads, mini-slot deployment options, and service requirements. Furthermore, we derive the block error rate (BLER) for pilot-based, frequency domain, and time domain differential OFDM using non-asymptotic information-theoretic bounds. Simulation results validate the feasibility and effectiveness of adaptive differential and coherent detection.
Decoding SEC Actions: Enforcement Trends through Analyzing Blockchain litigation using LLM-based Thematic Factor Mapping
Junliang Luo, Xihan Xiong, William Knottenbelt, Xue Liu
Aug 23 2024 cs.CL arXiv:2408.11961v1

@misc{2408.11961, author = {Junliang Luo and Xihan Xiong and William Knottenbelt and Xue Liu}, title = {{D}ecoding {SEC} {A}ctions: {E}nforcement {T}rends through {A}nalyzing {B}lockchain litigation using {LLM}-based {T}hematic {F}actor {M}apping}, year = {2024}, eprint = {2408.11961}, note = {arXiv:2408.11961v1} }
PDF
The proliferation of blockchain entities (persons or enterprises) exposes them to potential regulatory actions (e.g., being litigated) by regulatory authorities. Regulatory frameworks for crypto assets are actively being developed and refined, increasing the likelihood of such actions. The lack of systematic analysis of the factors driving litigation against blockchain entities leaves companies in need of clarity to navigate compliance risks. This absence of insight also deprives investors of the information needed for informed decision-making. This study focuses on U.S. litigation against blockchain entities, particularly by the U.S. Securities and Exchange Commission (SEC) given its influence on global crypto regulation. Utilizing frontier pretrained language models and large language models, we systematically map all SEC complaints against blockchain companies from 2012 to 2024 to thematic factors conceptualized by our study to delineate the factors driving SEC actions. We quantify the thematic factors and assess their influence on specific legal Acts cited within the complaints on an annual basis, allowing us to discern regulatory emphasis and patterns and to conduct trend analysis.
Rank and Align: Towards Effective Source-free Graph Domain Adaptation
Junyu Luo, Zhiping Xiao, Yifan Wang, Xiao Luo, Jingyang Yuan, Wei Ju, Langechuan Liu, Ming Zhang
Aug 23 2024 cs.LG cs.AI cs.IR arXiv:2408.12185v1

@misc{2408.12185, author = {Junyu Luo and Zhiping Xiao and Yifan Wang and Xiao Luo and Jingyang Yuan and Wei Ju and Langechuan Liu and Ming Zhang}, title = {{R}ank and {A}lign: {T}owards {E}ffective {S}ource-free {G}raph {D}omain {A}daptation}, year = {2024}, eprint = {2408.12185}, doi = {10.24963/ijcai.2024/520}, note = {arXiv:2408.12185v1} }
PDF
Graph neural networks (GNNs) have achieved impressive performance in graph domain adaptation. However, extensive source graphs could be unavailable in real-world scenarios due to privacy and storage concerns. To this end, we investigate an underexplored yet practical problem of source-free graph domain adaptation, which transfers knowledge from source models instead of source graphs to a target domain. To solve this problem, we introduce a novel GNN-based approach called Rank and Align (RNA), which ranks graph similarities with spectral seriation for robust semantics learning, and aligns inharmonic graphs with harmonic graphs which are close to the source domain for subgraph extraction. In particular, to overcome label scarcity, we employ the spectral seriation algorithm to infer the robust pairwise rankings, which can guide semantic learning using a similarity learning objective. To depict distribution shifts, we utilize spectral clustering and the silhouette coefficient to detect harmonic graphs, which the source model can easily classify. To reduce potential domain discrepancy, we extract domain-invariant subgraphs from inharmonic graphs by an adversarial edge sampling process, which guides the invariant learning of GNNs. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our proposed RNA.
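The abstract above mentions the silhouette coefficient without defining it. The sketch below shows the standard definition, s = (b - a) / max(a, b), where a is a sample's mean distance to its own cluster and b is its mean distance to the nearest other cluster. It is a generic illustration, not the RNA implementation: the function name `silhouette`, the toy 1-D points, and the absolute-difference distance are all assumptions, and the abstract does not say how RNA turns this score into a harmonic/inharmonic decision.

```python
from statistics import mean

def silhouette(points, labels, dist):
    """Mean silhouette coefficient over all samples (assumes >= 2 clusters)."""
    scores = []
    clusters = set(labels)
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the sample's own cluster
        same = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        if not same:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = mean(same)
        # b: smallest mean distance to any other cluster
        b = min(
            mean([dist(p, q) for q, l in zip(points, labels) if l == other])
            for other in clusters if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return mean(scores)

# Hypothetical toy data: two well-separated groups on a line.
points = [0.0, 1.0, 10.0, 11.0]
abs_diff = lambda x, y: abs(x - y)
good = silhouette(points, [0, 0, 1, 1], abs_diff)  # clusters match the groups
bad = silhouette(points, [0, 1, 0, 1], abs_diff)   # clusters cut across the groups
```

Scores near 1 indicate compact, well-separated clusters, while negative scores indicate samples that sit closer to another cluster than to their own.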
ISAC-Fi: Enabling Full-fledged Monostatic Sensing over Wi-Fi Communication
Zhe Chen, Chao Hu, Tianyue Zheng, Hangcheng Cao, Yanbing Yang, Yen Chu, Hongbo Jiang, Jun Luo
Aug 20 2024 cs.NI cs.SY eess.SY arXiv:2408.09851v1

@misc{2408.09851, author = {Zhe Chen and Chao Hu and Tianyue Zheng and Hangcheng Cao and Yanbing Yang and Yen Chu and Hongbo Jiang and Jun Luo}, title = {{ISAC}-{F}i: {E}nabling {F}ull-fledged {M}onostatic {S}ensing over {W}i-{F}i {C}ommunication}, year = {2024}, eprint = {2408.09851}, note = {arXiv:2408.09851v1} }
PDF
Whereas Wi-Fi communications have been exploited for sensing purposes for over a decade, the bistatic or multistatic nature of Wi-Fi still poses multiple challenges, hampering real-life deployment of integrated sensing and communication (ISAC) within the Wi-Fi framework. In this paper, we aim to re-design Wi-Fi so that monostatic sensing (mimicking radar) can be achieved over the multistatic communication infrastructure. Specifically, we propose, design, and implement ISAC-Fi as an ISAC-ready Wi-Fi prototype. We first present a novel self-interference cancellation scheme, in order to extract reflected (radio frequency) signals for sensing purposes in the face of transmissions. We then subtly revise the existing Wi-Fi framework so as to seamlessly operate monostatic sensing under the Wi-Fi communication standard. Finally, we offer two ISAC-Fi designs: while a USRP-based one emulates a totally re-designed ISAC-Fi device, another plug-and-play design allows for backward compatibility by attaching an extra module to an arbitrary Wi-Fi device. We perform extensive experiments to validate the efficacy of ISAC-Fi and also to demonstrate its superiority over existing Wi-Fi sensing proposals.
A Versatile Framework for Attributed Network Clustering via K-Nearest Neighbor Augmentation
Yiran Li, Gongyao Guo, Jieming Shi, Renchi Yang, Shiqi Shen, Qing Li, Jun Luo
Aug 13 2024 cs.SI cs.LG arXiv:2408.05459v2

@misc{2408.05459, author = {Yiran Li and Gongyao Guo and Jieming Shi and Renchi Yang and Shiqi Shen and Qing Li and Jun Luo}, title = {{A} {V}ersatile {F}ramework for {A}ttributed {N}etwork {C}lustering via {K}-{N}earest {N}eighbor {A}ugmentation}, year = {2024}, eprint = {2408.05459}, howpublished = {The VLDB Journal (2024) 1-31}, doi = {10.1007/s00778-024-00875-8}, note = {arXiv:2408.05459v2} }
PDF
Attributed networks containing entity-specific information in node attributes are ubiquitous in modeling social networks, e-commerce, bioinformatics, etc. Their inherent network topology ranges from simple graphs to hypergraphs with high-order interactions and multiplex graphs with separate layers. An important graph mining task is node clustering, aiming to partition the nodes of an attributed network into k disjoint clusters such that intra-cluster nodes are closely connected and share similar attributes, while inter-cluster nodes are far apart and dissimilar. It is highly challenging to capture multi-hop connections via nodes or attributes for effective clustering on multiple types of attributed networks. In this paper, we first present AHCKA as an efficient approach to attributed hypergraph clustering (AHC). AHCKA includes a carefully-crafted K-nearest neighbor augmentation strategy for the optimized exploitation of attribute information on hypergraphs, a joint hypergraph random walk model to devise an effective AHC objective, and an efficient solver with speedup techniques for the objective optimization. The proposed techniques are extensible to various types of attributed networks, and thus, we develop ANCKA as a versatile attributed network clustering framework, capable of attributed graph clustering (AGC), attributed multiplex graph clustering (AMGC), and AHC. Moreover, we devise ANCKA with algorithmic designs tailored for GPU acceleration to boost efficiency. We have conducted extensive experiments to compare our methods with 19 competitors on 8 attributed hypergraphs, 16 competitors on 6 attributed graphs, and 16 competitors on 3 attributed multiplex graphs, all demonstrating the superb clustering quality and efficiency of our methods.
Retrieval Augmentation via User Interest Clustering
Hanjia Lyu, Hanqing Zeng, Yinglong Xia, Ren Chen, Jiebo Luo
Aug 08 2024 cs.IR arXiv:2408.03886v1

@misc{2408.03886, author = {Hanjia Lyu and Hanqing Zeng and Yinglong Xia and Ren Chen and Jiebo Luo}, title = {{R}etrieval {A}ugmentation via {U}ser {I}nterest {C}lustering}, year = {2024}, eprint = {2408.03886}, note = {arXiv:2408.03886v1} }
PDF
Many existing industrial recommender systems are sensitive to the patterns of user-item engagement. Light users, who interact less frequently, correspond to a data sparsity problem, making it difficult for the system to accurately learn and represent their preferences. On the other hand, heavy users with rich interaction history often demonstrate a variety of niche interests that are hard to capture precisely under the standard "user-item" similarity measurement. Moreover, implementing these systems in an industrial environment necessitates that they are resource-efficient and scalable to process web-scale data under strict latency constraints. In this paper, we address these challenges by introducing an intermediate "interest" layer between users and items. We propose a novel approach that efficiently constructs user interest and facilitates low computational cost inference by clustering engagement graphs and incorporating user-interest attention. This method enhances the understanding of light users' preferences by linking them with heavy users. By integrating user-interest attention, our approach allows a more personalized similarity metric, adept at capturing the complex dynamics of user-item interactions. The use of interest as an intermediary layer fosters a balance between scalability and expressiveness in the model. Evaluations on two public datasets reveal that our method not only achieves improved recommendation performance but also demonstrates enhanced computational efficiency compared to item-level attention models. Our approach has also been deployed in multiple products at Meta, facilitating short-form video related recommendation.
Evolver: Chain-of-Evolution Prompting to Boost Large Multimodal Models for Hateful Meme Detection
Jinfa Huang, Jinsheng Pan, Zhongwei Wan, Hanjia Lyu, Jiebo Luo
Jul 31 2024 cs.CL cs.CV arXiv:2407.21004v1

@misc{2407.21004, author = {Jinfa Huang and Jinsheng Pan and Zhongwei Wan and Hanjia Lyu and Jiebo Luo}, title = {{E}volver: {C}hain-of-{E}volution {P}rompting to {B}oost {L}arge {M}ultimodal {M}odels for {H}ateful {M}eme {D}etection}, year = {2024}, eprint = {2407.21004}, note = {arXiv:2407.21004v1} }
PDF
Recent advances show that two-stream approaches have achieved outstanding performance in hateful meme detection. However, hateful memes constantly evolve as new memes emerge by fusing progressive cultural ideas, making existing methods obsolete or ineffective. In this work, we explore the potential of Large Multimodal Models (LMMs) for hateful meme detection. To this end, we propose Evolver, which incorporates LMMs via Chain-of-Evolution (CoE) Prompting, by integrating the evolution attribute and in-context information of memes. Specifically, Evolver simulates the evolving and expressing process of memes and reasons through LMMs in a step-by-step manner. First, an evolutionary pair mining module retrieves, from the external curated meme set, the top-k memes most similar to the input meme. Second, an evolutionary information extractor is designed to summarize the semantic regularities between the paired memes for prompting. Finally, a contextual relevance amplifier enhances the in-context hatefulness information to boost the search for evolutionary processes. Extensive experiments on public FHM, MAMI, and HarM datasets show that CoE prompting can be incorporated into existing LMMs to improve their performance. More encouragingly, it can serve as an interpretive tool to promote the understanding of the evolution of social memes.
Performance Study of Various Relay Nodes in 5G Wireless Network
Jianghong Luo, Ashwin Sampath, Navid Abedini, Tao Luo
Jul 30 2024 cs.NI cs.SY eess.SY arXiv:2407.20089v1

@misc{2407.20089, author = {Jianghong Luo and Ashwin Sampath and Navid Abedini and Tao Luo}, title = {{P}erformance {S}tudy of {V}arious {R}elay {N}odes in 5{G} {W}ireless {N}etwork}, year = {2024}, eprint = {2407.20089}, note = {arXiv:2407.20089v1} }
PDF
This paper studies the performance of various types of relay nodes in a 5G wireless network: conventional amplify-forward repeaters, (semi-)smart/smart amplify-forward repeaters with different levels of side information, and half-duplex/full-duplex decode-forward relay nodes with and without spatial reuse. End-to-end effective signal to interference and noise ratios (SINRs) and achievable rates are derived for these different types of relay nodes. Performance and complexity tradeoffs are discussed with a simulation over a Manhattan topology setting. Over-the-air (OTA) test results corroborate the findings in this paper.
Sample Enrichment via Temporary Operations on Subsequences for Sequential Recommendation
Shu Chen, Jinwei Luo, Weike Pan, Jiangxing Yu, Xin Huang, Zhong Ming
Jul 26 2024 cs.IR arXiv:2407.17802v1

@misc{2407.17802, author = {Shu Chen and Jinwei Luo and Weike Pan and Jiangxing Yu and Xin Huang and Zhong Ming}, title = {{S}ample {E}nrichment via {T}emporary {O}perations on {S}ubsequences for {S}equential {R}ecommendation}, year = {2024}, eprint = {2407.17802}, note = {arXiv:2407.17802v1} }
PDF
Sequential recommendation leverages interaction sequences to predict forthcoming user behaviors, crucial for crafting personalized recommendations. However, the true preferences of a user are inherently complex and high-dimensional, while the observed data is merely a simplified and low-dimensional projection of the rich preferences, which often leads to prevalent issues like data sparsity and inaccurate model training. To learn true preferences from the sparse data, most existing works endeavor to introduce some extra information or design some ingenious models. Although they have been shown to be effective, extra information usually increases the cost of data collection, and complex models may result in difficulty in deployment. Innovatively, we avoid the use of extra information or alterations to the model; instead, we fill the transformation space between the observed data and the underlying preferences with randomness. Specifically, we propose a novel model-agnostic and highly generic framework for sequential recommendation called sample enrichment via temporary operations on subsequences (SETO), which temporarily and separately enriches the transformation space via sequence enhancement operations with rationality constraints in training. The transformation space not only exists in the process from input samples to preferences but also in preferences to target samples. We highlight our SETO's effectiveness and versatility over multiple representative and state-of-the-art sequential recommendation models (including six single-domain sequential models and two cross-domain sequential models) across multiple real-world datasets (including three single-domain datasets, three cross-domain datasets and a large-scale industry dataset).
3D Gaussian Splatting: Survey, Technologies, Challenges, and Opportunities
Yanqi Bao, Tianyu Ding, Jing Huo, Yaoli Liu, Yuxin Li, Wenbin Li, Yang Gao, Jiebo Luo
Jul 25 2024 cs.CV arXiv:2407.17418v1

@misc{2407.17418, author = {Yanqi Bao and Tianyu Ding and Jing Huo and Yaoli Liu and Yuxin Li and Wenbin Li and Yang Gao and Jiebo Luo}, title = {3{D} {G}aussian {S}platting: {S}urvey, {T}echnologies, {C}hallenges, and {O}pportunities}, year = {2024}, eprint = {2407.17418}, note = {arXiv:2407.17418v1} }
PDF
3D Gaussian Splatting (3DGS) has emerged as a prominent technique with the potential to become a mainstream method for 3D representations. It can effectively transform multi-view images into explicit 3D Gaussian representations through efficient training, and achieve real-time rendering of novel views. This survey aims to analyze existing 3DGS-related works from multiple intersecting perspectives, including related tasks, technologies, challenges, and opportunities. The primary objective is to provide newcomers with a rapid understanding of the field and to assist researchers in methodically organizing existing technologies and challenges. Specifically, we delve into the optimization, application, and extension of 3DGS, categorizing them based on their focuses or motivations. Additionally, we summarize and classify nine types of technical modules and corresponding improvements identified in existing works. Based on these analyses, we further examine the common challenges and technologies across various tasks, proposing potential research opportunities.
Diff-Shadow: Global-guided Diffusion Model for Shadow Removal
Jinting Luo, Ru Li, Chengzhi Jiang, Mingyan Han, Xiaoming Zhang, Ting Jiang, Haoqiang Fan, Shuaicheng Liu
Jul 24 2024 cs.CV arXiv:2407.16214v1

@misc{2407.16214, author = {Jinting Luo and Ru Li and Chengzhi Jiang and Mingyan Han and Xiaoming Zhang and Ting Jiang and Haoqiang Fan and Shuaicheng Liu}, title = {{D}iff-{S}hadow: {G}lobal-guided {D}iffusion {M}odel for {S}hadow {R}emoval}, year = {2024}, eprint = {2407.16214}, note = {arXiv:2407.16214v1} }
PDF
We propose Diff-Shadow, a global-guided diffusion model for high-quality shadow removal. Previous transformer-based approaches can utilize global information to relate shadow and non-shadow regions but are limited in their synthesis ability and recover images with obvious boundaries. In contrast, diffusion-based methods can generate better content but ignore global information, resulting in inconsistent illumination. In this work, we combine the advantages of diffusion models and global guidance to realize shadow-free restoration. Specifically, we propose a parallel UNets architecture: 1) the local branch performs the patch-based noise estimation in the diffusion process, and 2) the global branch recovers the low-resolution shadow-free images. A Reweight Cross Attention (RCA) module is designed to integrate global contextual information of non-shadow regions into the local branch. We further design a Global-guided Sampling Strategy (GSS) that mitigates patch boundary issues and ensures consistent illumination across shaded and unshaded regions in the recovered image. Comprehensive experiments on three standard public datasets, ISTD, ISTD+, and SRD, have demonstrated the effectiveness of Diff-Shadow. Compared to state-of-the-art methods, our method achieves a significant improvement in terms of PSNR, increasing from 32.33dB to 33.69dB on the SRD dataset. Codes will be released.
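To put the reported PSNR gain in context, the sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE) to convert a dB difference into the implied ratio of mean squared errors. It is only a worked calculation on the two figures quoted in the abstract; the function names and the 8-bit peak value of 255 are assumptions, and the abstract does not state how the paper computes PSNR (for example, which color space or region masks it uses).

```python
import math

def psnr(mse: float, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(psnr_before_db: float, psnr_after_db: float) -> float:
    """Factor by which MSE shrinks when PSNR rises from 'before' to 'after'."""
    return 10.0 ** ((psnr_after_db - psnr_before_db) / 10.0)

# Figures quoted in the abstract for the SRD dataset.
ratio = mse_ratio_for_gain(32.33, 33.69)
print(f"A {33.69 - 32.33:.2f} dB gain implies MSE lower by a factor of about {ratio:.2f}")
```

Under this definition, the 1.36 dB improvement corresponds to a mean squared error roughly 1.37 times smaller, independent of the peak value chosen.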
Downstream-Pretext Domain Knowledge Traceback for Active Learning
Beichen Zhang, Liang Li, Zheng-Jun Zha, Jiebo Luo, Qingming Huang
Jul 23 2024 cs.LG arXiv:2407.14720v1

@misc{2407.14720, author = {Beichen Zhang and Liang Li and Zheng-Jun Zha and Jiebo Luo and Qingming Huang}, title = {{D}ownstream-{P}retext {D}omain {K}nowledge {T}raceback for {A}ctive {L}earning}, year = {2024}, eprint = {2407.14720}, doi = {10.1109/TMM.2024.3391897}, note = {arXiv:2407.14720v1} }
PDF
Active learning (AL) is designed to construct a high-quality labeled dataset by iteratively selecting the most informative samples. Such sampling heavily relies on data representation, while pre-training has recently become popular for robust feature learning. However, as pre-training utilizes low-level pretext tasks that lack annotation, directly using pre-trained representation in AL is inadequate for determining the sampling score. To address this problem, we propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance for selecting diverse and instructive samples near the decision boundary. DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator. The diversity indicator constructs two feature spaces based on the pre-training pretext model and the downstream knowledge from annotation, by which it locates the neighbors of unlabeled data from the downstream space in the pretext space to explore the interaction of samples. With this mechanism, DOKT unifies the data relations of low-level and high-level representations to estimate traceback diversity. Next, in the uncertainty estimator, domain mixing is designed to apply perceptual perturbation to unlabeled samples with similar visual patches in the pretext space. Then the divergence of perturbed samples is measured to estimate the domain uncertainty. As a result, DOKT selects the most diverse and important samples based on these two modules. The experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods and generalizes well to various application scenarios such as semantic segmentation and image captioning.
The VEP Booster: A Closed-Loop AI System for Visual EEG Biomarker Auto-generation
Junwen Luo, Chengyong Jiang, Qingyuan Chen, Dongqi Han, Yansen Wang, Biao Yan, Dongsheng Li, Jiayi Zhang
Jul 23 2024 cs.CV arXiv:2407.15167v1

@misc{2407.15167, author = {Junwen Luo and Chengyong Jiang and Qingyuan Chen and Dongqi Han and Yansen Wang and Biao Yan and Dongsheng Li and Jiayi Zhang}, title = {{T}he {VEP} {B}ooster: {A} {C}losed-{L}oop {AI} {S}ystem for {V}isual {EEG} {B}iomarker {A}uto-generation}, year = {2024}, eprint = {2407.15167}, note = {arXiv:2407.15167v1} }
PDF
Effective visual brain-machine interfaces (BMIs) are based on reliable and stable EEG biomarkers. However, traditional adaptive filter-based approaches may suffer from individual variations in EEG signals, while deep neural network-based approaches may be hindered by the non-stationarity of EEG signals caused by biomarker attenuation and background oscillations. To address these challenges, we propose the Visual Evoked Potential Booster (VEP Booster), a novel closed-loop AI framework that generates reliable and stable EEG biomarkers under visual stimulation protocols. Our system leverages an image generator to refine stimulus images based on real-time feedback from human EEG signals, generating visual stimuli tailored to the preferences of primary visual cortex (V1) neurons and enabling effective targeting of neurons most responsive to stimuli. We validated our approach by implementing a system and employing steady-state visual evoked potential (SSVEP) visual protocols in five human subjects. Our results show significant enhancements in the reliability and utility of EEG biomarkers for all individuals, with the largest improvement in SSVEP response being 105%, the smallest being 28%, and the average increase being 76.5%. These promising results have implications for both clinical and technological applications.
Improving GBDT Performance on Imbalanced Datasets: An Empirical Study of Class-Balanced Loss Functions
Jiaqi Luo, Yuan Yuan, Shixin Xu
Jul 22 2024 cs.LG arXiv:2407.14381v1

@misc{2407.14381, author = {Jiaqi Luo and Yuan Yuan and Shixin Xu}, title = {{I}mproving {GBDT} {P}erformance on {I}mbalanced {D}atasets: {A}n {E}mpirical {S}tudy of {C}lass-{B}alanced {L}oss {F}unctions}, year = {2024}, eprint = {2407.14381}, note = {arXiv:2407.14381v1} }
PDF
Class imbalance remains a significant challenge in machine learning, particularly for tabular data classification tasks. While Gradient Boosting Decision Trees (GBDT) models have proven highly effective for such tasks, their performance can be compromised when dealing with imbalanced datasets. This paper presents the first comprehensive study on adapting class-balanced loss functions to three GBDT algorithms across various tabular classification tasks, including binary, multi-class, and multi-label classification. We conduct extensive experiments on multiple datasets to evaluate the impact of class-balanced losses on different GBDT models, establishing a valuable benchmark. Our results demonstrate the potential of class-balanced loss functions to enhance GBDT performance on imbalanced datasets, offering a robust approach for practitioners facing class imbalance challenges in real-world applications. Additionally, we introduce a Python package that facilitates the integration of class-balanced loss functions into GBDT workflows, making these advanced techniques accessible to a wider audience.
Representation Bias in Political Sample Simulations with Large Language Models
Weihong Qi, Hanjia Lyu, Jiebo Luo
Jul 17 2024 cs.CL arXiv:2407.11409v1

@misc{2407.11409, author = {Weihong Qi and Hanjia Lyu and Jiebo Luo}, title = {{R}epresentation {B}ias in {P}olitical {S}ample {S}imulations with {L}arge {L}anguage {M}odels}, year = {2024}, eprint = {2407.11409}, note = {arXiv:2407.11409v1} }
PDF
This study seeks to identify and quantify biases in simulating political samples with Large Language Models, specifically focusing on vote choice and public opinion. Using the GPT-3.5-Turbo model, we leverage data from the American National Election Studies, German Longitudinal Election Study, Zuobiao Dataset, and China Family Panel Studies to simulate voting behaviors and public opinions. This methodology enables us to examine three types of representation bias: disparities based on the country's language, demographic groups, and political regime types. The findings reveal that simulation performance is generally better for vote choice than for public opinions, more accurate in English-speaking countries, more effective in bipartisan systems than in multi-partisan systems, and stronger in democratic settings than in authoritarian regimes. These results contribute to enhancing our understanding and developing strategies to mitigate biases in AI applications within the field of computational social science.
Rethinking Learned Image Compression: Context is All You Need
Jixiang Luo
Jul 17 2024 eess.IV cs.CV arXiv:2407.11590v3

@misc{2407.11590, author = {Jixiang Luo}, title = {{R}ethinking {L}earned {I}mage {C}ompression: {C}ontext is {A}ll {Y}ou {N}eed}, year = {2024}, eprint = {2407.11590}, note = {arXiv:2407.11590v3} }
PDF
Since LIC has made rapid progress recently compared to traditional methods, this paper attempts to discuss the question about 'Where is the boundary of Learned Image Compression (LIC)?'. Thus this paper splits the above problem into two sub-problems: 1) Where is the boundary of rate-distortion performance of PSNR? 2) How to further improve the compression gain and achieve the boundary? Therefore this paper analyzes the effectiveness of scaling parameters for encoder, decoder and context model, which are the three components of LIC. Then we conclude that scaling for LIC is to scale for context model and decoder within LIC. Extensive experiments demonstrate that overfitting can actually serve as an effective context. By optimizing the context, this paper further improves PSNR and achieves state-of-the-art performance, showing a performance gain of 14.39% with BD-RATE over VVC.
A Framework for QoS of Integration Testing in Satellite Edge Clouds
Guogen Zeng, Juan Luo, Yufeng Zhang, Ying Qiao, Shuyang Teng
Jul 16 2024 cs.SE arXiv:2407.10402v2

@misc{2407.10402, author = {Guogen Zeng and Juan Luo and Yufeng Zhang and Ying Qiao and Shuyang Teng}, title = {{A} {F}ramework for {Q}o{S} of {I}ntegration {T}esting in {S}atellite {E}dge {C}louds}, year = {2024}, eprint = {2407.10402}, note = {arXiv:2407.10402v2} }
PDF
The diversification of satellite communication services imposes varied requirements on network service quality, making quality of service (QoS) testing for microservices running on satellites more complex. Existing testing tools have limitations, potentially offering only single-functionality testing, thus failing to meet the requirements of QoS testing for edge cloud services in mobile satellite scenarios. In this paper, we propose a framework for integrating quality of service testing in satellite edge clouds. More precisely, the framework can integrate changes in satellite network topology, create and manage satellite edge cloud cluster testing environments on heterogeneous edge devices, customize experiments for users, support deployment and scaling of various integrated testing tools, and publish and visualize test results. Our experimental results validate the framework's ability to test key service quality metrics in a satellite edge cloud cluster.
CellAgent: An LLM-driven Multi-Agent Framework for Automated Single-cell Data Analysis
Yihang Xiao, Jinyi Liu, Yan Zheng, Xiaohan Xie, Jianye Hao, Mingzhi Li, Ruitao Wang, Fei Ni, Yuxiao Li, Jintian Luo, Shaoqing Jiao, Jiajie Peng
Jul 16 2024 cs.AI cs.HC q-bio.GN arXiv:2407.09811v1

@misc{2407.09811, author = {Yihang Xiao and Jinyi Liu and Yan Zheng and Xiaohan Xie and Jianye Hao and Mingzhi Li and Ruitao Wang and Fei Ni and Yuxiao Li and Jintian Luo and Shaoqing Jiao and Jiajie Peng}, title = {{C}ell{A}gent: {A}n {LLM}-driven {M}ulti-{A}gent {F}ramework for {A}utomated {S}ingle-cell {D}ata {A}nalysis}, year = {2024}, eprint = {2407.09811}, note = {arXiv:2407.09811v1} }
PDF
Single-cell RNA sequencing (scRNA-seq) data analysis is crucial for biological research, as it enables the precise characterization of cellular heterogeneity. However, manual manipulation of various tools to achieve desired outcomes can be labor-intensive for researchers. To address this, we introduce CellAgent (http://cell.agent4science.cn/), an LLM-driven multi-agent framework, specifically designed for the automatic processing and execution of scRNA-seq data analysis tasks, providing high-quality results with no human intervention. Firstly, to adapt general LLMs to the biological field, CellAgent constructs LLM-driven biological expert roles - planner, executor, and evaluator - each with specific responsibilities. Then, CellAgent introduces a hierarchical decision-making mechanism to coordinate these biological experts, effectively driving the planning and step-by-step execution of complex data analysis tasks. Furthermore, we propose a self-iterative optimization mechanism, enabling CellAgent to autonomously evaluate and optimize solutions, thereby guaranteeing output quality. We evaluate CellAgent on a comprehensive benchmark dataset encompassing dozens of tissues and hundreds of distinct cell types. Evaluation results consistently show that CellAgent effectively identifies the most suitable tools and hyperparameters for single-cell analysis tasks, achieving optimal performance. This automated framework dramatically reduces the workload for science data analyses, bringing us into the "Agent for Science" era.
BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features
Jing Luo, Xinyu Yang, Dorien Herremans
Jul 16 2024 cs.SD cs.AI cs.MM arXiv:2407.10462v1

@misc{2407.10462, author = {Jing Luo and Xinyu Yang and Dorien Herremans}, title = {{B}and{C}ontrol{N}et: {P}arallel {T}ransformers-based {S}teerable {P}opular {M}usic {G}eneration with {F}ine-{G}rained {S}patiotemporal {F}eatures}, year = {2024}, eprint = {2407.10462}, note = {arXiv:2407.10462v1} }
Controllable music generation promotes the interaction between humans and composition systems by projecting the users' intent on their desired music. The challenge of introducing controllability is an increasingly important issue in the symbolic music generation field. When building controllable generative popular multi-instrument music systems, two main challenges typically present themselves, namely weak controllability and poor music quality. To address these issues, we first propose spatiotemporal features as powerful and fine-grained controls to enhance the controllability of the generative model. In addition, an efficient music representation called REMI_Track is designed to convert multitrack music into multiple parallel music sequences and shorten the sequence length of each track with Byte Pair Encoding (BPE) techniques. Subsequently, we release BandControlNet, a conditional model based on parallel Transformers, to tackle the multiple music sequences and generate high-quality music samples that are conditioned to the given spatiotemporal control features. More concretely, the two specially designed modules of BandControlNet, namely structure-enhanced self-attention (SE-SA) and Cross-Track Transformer (CTT), are utilized to strengthen the resulting musical structure and inter-track harmony modeling respectively. Experimental results tested on two popular music datasets of different lengths demonstrate that the proposed BandControlNet outperforms other conditional music generation models on most objective metrics in terms of fidelity and inference speed and shows great robustness in generating long music samples. The subjective evaluations show BandControlNet trained on short datasets can generate music with comparable quality to state-of-the-art models, while outperforming them significantly using longer datasets.
A Joint Approach to Local Updating and Gradient Compression for Efficient Asynchronous Federated Learning
Jiajun Song, Jiajun Luo, Rongwei Lu, Shuzhao Xie, Bin Chen, Zhi Wang
Jul 09 2024 cs.DC cs.LG arXiv:2407.05125v1

@misc{2407.05125, author = {Jiajun Song and Jiajun Luo and Rongwei Lu and Shuzhao Xie and Bin Chen and Zhi Wang}, title = {{A} {J}oint {A}pproach to {L}ocal {U}pdating and {G}radient {C}ompression for {E}fficient {A}synchronous {F}ederated {L}earning}, year = {2024}, eprint = {2407.05125}, note = {arXiv:2407.05125v1} }
Asynchronous Federated Learning (AFL) confronts inherent challenges arising from the heterogeneity of devices (e.g., their computation capacities) and low-bandwidth environments, both potentially causing stale model updates (e.g., local gradients) for global aggregation. Traditional approaches mitigating the staleness of updates typically focus on either adjusting the local updating or gradient compression, but not both. Recognizing this gap, we introduce a novel approach that synergizes local updating with gradient compression. Our research begins by examining the interplay between local updating frequency and gradient compression rate, and their collective impact on convergence speed. The theoretical upper bound shows that the local updating frequency and gradient compression rate of each device are jointly determined by its computing power, communication capabilities and other factors. Building on this foundation, we propose an AFL framework called FedLuck that adaptively optimizes both local update frequency and gradient compression rates. Experiments on image classification and speech recognition show that FedLuck reduces communication consumption by 56% and training time by 55% on average, achieving competitive performance in heterogeneous and low-bandwidth scenarios compared to the baselines.
Gemini: Integrating Full-fledged Sensing upon Millimeter Wave Communications
Yilong Li, Zhe Chen, Jun Luo, Suman Banerjee
Jul 08 2024 cs.NI eess.SP arXiv:2407.04174v5

@misc{2407.04174, author = {Yilong Li and Zhe Chen and Jun Luo and Suman Banerjee}, title = {{G}emini: {I}ntegrating {F}ull-fledged {S}ensing upon {M}illimeter {W}ave {C}ommunications}, year = {2024}, eprint = {2407.04174}, note = {arXiv:2407.04174v5} }
Integrating millimeter wave (mmWave) technology in both communication and sensing is promising as it enables the reuse of existing spectrum and infrastructure without draining resources. Most existing systems piggyback sensing onto conventional communication modes without fully exploiting the potential of integrated sensing and communication (ISAC) in mmWave radios (not full-fledged). In this paper, we design and implement a full-fledged mmWave ISAC system Gemini; it delivers raw channel states to serve a broad category of sensing applications. We first propose the mmWave self-interference cancellation approach to extract the weak reflected signals for near-field sensing purposes. Then, we develop a joint optimization scheduling framework that can be utilized in accurate radar sensing while maximizing the communication throughput. Finally, we design a united fusion sensing algorithm to offer a better sensing performance via combining monostatic and bistatic modes. We evaluate our system in extensive experiments to demonstrate Gemini's capability of simultaneously operating sensing and communication, enabling mmWave ISAC to perform better than the commercial off-the-shelf mmWave radar for 5G cellular networks.
MMedAgent: Learning to Use Medical Tools with Multi-modal Agent
Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, Yixin Wang
Jul 03 2024 cs.CL cs.AI arXiv:2407.02483v2

@misc{2407.02483, author = {Binxu Li and Tiankai Yan and Yuanting Pan and Jie Luo and Ruiyang Ji and Jiayuan Ding and Zhe Xu and Shilong Liu and Haoyu Dong and Zihao Lin and Yixin Wang}, title = {{MM}ed{A}gent: {L}earning to {U}se {M}edical {T}ools with {M}ulti-modal {A}gent}, year = {2024}, eprint = {2407.02483}, note = {arXiv:2407.02483v2} }
Multi-Modal Large Language Models (MLLMs), despite being successful, exhibit limited generality and often fall short when compared to specialized models. Recently, LLM-based agents have been developed to address these challenges by selecting appropriate specialized models as tools based on user inputs. However, such advancements have not been extensively explored within the medical domain. To bridge this gap, this paper introduces the first agent explicitly designed for the medical field, named Multi-modal Medical Agent (MMedAgent). We curate an instruction-tuning dataset comprising six medical tools solving seven tasks across five modalities, enabling the agent to choose the most suitable tools for a given task. Comprehensive experiments demonstrate that MMedAgent achieves superior performance across a variety of medical tasks compared to state-of-the-art open-source methods and even the closed-source model, GPT-4o. Furthermore, MMedAgent exhibits efficiency in updating and integrating new medical tools. Codes and models are all available.
High Spectral-Efficiency, Ultra-low MIMO SDM Transmission over a Field-Deployed Multi-Core OAM Fiber
Junyi Liu, Zengquan Xu, Shuqi Mo, Yuming Huang, Yining Huang, Zhenhua Li, Yuying Guo, Lei Shen, Shuo Xu, Ran Gao, Cheng Du, Qian Feng, Jie Luo, Jie Liu, Siyuan Yu
Jul 03 2024 cs.NI physics.optics arXiv:2407.01552v1

@misc{2407.01552, author = {Junyi Liu and Zengquan Xu and Shuqi Mo and Yuming Huang and Yining Huang and Zhenhua Li and Yuying Guo and Lei Shen and Shuo Xu and Ran Gao and Cheng Du and Qian Feng and Jie Luo and Jie Liu and Siyuan Yu}, title = {{H}igh {S}pectral-{E}fficiency, {U}ltra-low {MIMO} {SDM} {T}ransmission over a {F}ield-{D}eployed {M}ulti-{C}ore {OAM} {F}iber}, year = {2024}, eprint = {2407.01552}, note = {arXiv:2407.01552v1} }
Few-mode multi-core fiber (FM-MCF) based Space-Division Multiplexing (SDM) systems possess the potential to maximize the number of multiplexed spatial channels per fiber by harnessing both the space (fiber cores) and mode (optical mode per core) dimensions. However, to date, no SDM transmissions over field-deployed FM-MCFs in realistic outdoor settings have been reported, which contrasts with SDM schemes demonstrated using single-mode multi-core fibers (SM-MCFs) installed in practical fiber cable ducts. In this paper, we present the successful demonstration of bidirectional SDM transmission over a 5-km field-deployed seven ring-core fiber (7-RCF) with a cladding diameter of 178 ${\mu}$m, achieving a Spectral Efficiency (SE) of 2$\times$201.6 bit/s/Hz. This work establishes a new record for the highest SE attained in SDM demonstrations utilizing field-deployed fiber cables, achieving an approximate 10x increase compared to the SE of reported field-deployed optical fiber cable transmission systems. Notably, these results are realized through the utilization of small-scale modular 4$\times$4 multiple-input multiple-output (MIMO) processing with a time-domain equalization (TDE) tap number not exceeding 15, maintaining a complexity per unit capacity comparable to that of MIMO equalization in SDM demonstrations employing weakly coupled SM-MCF cables. These results underscore the significant potential for achieving heightened SE and expanding capacity per individual fiber using SDM techniques in practical applications.
Research on Reliable and Safe Occupancy Grid Prediction in Underground Parking Lots
JiaQi Luo
Jul 03 2024 cs.AI cs.CV cs.RO arXiv:2407.02197v1

@misc{2407.02197, author = {JiaQi Luo}, title = {{R}esearch on {R}eliable and {S}afe {O}ccupancy {G}rid {P}rediction in {U}nderground {P}arking {L}ots}, year = {2024}, eprint = {2407.02197}, note = {arXiv:2407.02197v1} }
Against the backdrop of advancing science and technology, autonomous vehicle technology has emerged as a focal point of intense scrutiny within the academic community. Nevertheless, the challenge persists in guaranteeing the safety and reliability of this technology when navigating intricate scenarios. While a substantial portion of autonomous driving research is dedicated to testing in open-air environments, such as urban roads and highways, where the myriad variables at play are meticulously examined, enclosed indoor spaces like underground parking lots have, to a significant extent, been overlooked in the scholarly discourse. This discrepancy highlights a gap in understanding the unique challenges these confined settings pose for autonomous navigation systems. This study tackles indoor autonomous driving, particularly in overlooked spaces like underground parking lots. Using CARLA's simulation platform, a realistic parking model is created for data gathering. An occupancy grid network then processes this data to predict vehicle paths and obstacles, enhancing the system's perception in complex indoor environments. Ultimately, this strategy improves safety in autonomous parking operations. The paper meticulously evaluates the model's predictive capabilities, validating its efficacy in the context of underground parking. Our findings confirm that the proposed strategy successfully enhances autonomous vehicle performance in these complex indoor settings. It equips autonomous systems with improved adaptation to underground lots, reinforcing safety measures and dependability. This work paves the way for future advancements and applications by addressing the research shortfall concerning indoor parking environments, serving as a pivotal reference point.
Dataflow-Based Optimization for Quantum Intermediate Representation Programs
Junjie Luo, Haoyu Zhang, Jianjun Zhao
Jul 01 2024 cs.PL arXiv:2406.19592v1

@misc{2406.19592, author = {Junjie Luo and Haoyu Zhang and Jianjun Zhao}, title = {{D}ataflow-{B}ased {O}ptimization for {Q}uantum {I}ntermediate {R}epresentation {P}rograms}, year = {2024}, eprint = {2406.19592}, note = {arXiv:2406.19592v1} }
This paper proposes QDFO, a dataflow-based optimization approach to Microsoft QIR. QDFO consists of two main functions: one is to preprocess the QIR code so that the LLVM optimizer can capture more optimization opportunities, and the other is to optimize the QIR code so that duplicate loading and constructing of qubits and qubit arrays can be avoided. We evaluated our work on the IBM Challenge Dataset, and the results show that our method effectively reduces redundant operations in the QIR code. We also completed a preliminary implementation of QDFO and conducted a case study on real-world code. Our observational study indicates that the LLVM optimizer can further optimize the QIR code preprocessed by our algorithm. Both the experiments and the case study demonstrate the effectiveness of our approach.
ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, Li Yuan
Jun 27 2024 cs.CV cs.CL arXiv:2406.18522v2

@misc{2406.18522, author = {Shenghai Yuan and Jinfa Huang and Yongqi Xu and Yaoyang Liu and Shaofeng Zhang and Yujun Shi and Ruijie Zhu and Xinhua Cheng and Jiebo Luo and Li Yuan}, title = {{C}hrono{M}agic-{B}ench: {A} {B}enchmark for {M}etamorphic {E}valuation of {T}ext-to-{T}ime-lapse {V}ideo {G}eneration}, year = {2024}, eprint = {2406.18522}, note = {arXiv:2406.18522v2} }
We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of the T2V models (e.g. Sora and Lumiere) in time-lapse video generation. In contrast to existing benchmarks that focus on visual quality and textual relevance of generated videos, ChronoMagic-Bench focuses on the model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities, in a free-form text query. For these purposes, ChronoMagic-Bench introduces 1,649 prompts and real-world videos as references, categorized into four major types of time-lapse videos: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization comprehensively evaluates the model's capacity to handle diverse and complex transformations. To accurately align human preference with the benchmark, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the videos' metamorphic attributes and temporal coherence. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses the temporal coherence, ensuring the generated videos maintain logical progression and continuity. Based on ChronoMagic-Bench, we conduct comprehensive manual evaluations of ten representative T2V models, revealing their strengths and weaknesses across different categories of prompts, and providing a thorough evaluation framework that addresses current gaps in video generation research. Moreover, we create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions ensuring high physical pertinence and large metamorphic amplitude. [Homepage](https://pku-yuangroup.github.io/ChronoMagic-Bench/).
Exploiting Data Significance in Remote Estimation of Discrete-State Markov Sources
Jiping Luo, Nikolaos Pappas
Jun 27 2024 cs.IT cs.NI cs.SY eess.SY math.IT arXiv:2406.18270v1

@misc{2406.18270, author = {Jiping Luo and Nikolaos Pappas}, title = {{E}xploiting {D}ata {S}ignificance in {R}emote {E}stimation of {D}iscrete-{S}tate {M}arkov {S}ources}, year = {2024}, eprint = {2406.18270}, note = {arXiv:2406.18270v1} }
We consider the semantics-aware remote estimation of a discrete-state Markov source with normal (low-priority) and alarm (high-priority) states. Erroneously announcing a normal state at the destination when the source is actually in an alarm state (i.e., missed alarm error) incurs a significantly higher cost than falsely announcing an alarm state when the source is in a normal state (i.e., false alarm error). Moreover, successive reception of an estimation error may cause significant lasting impact, e.g., maintenance cost and misoperations. Motivated by this, we assign different costs to different estimation errors and introduce two new age metrics, namely the Age of Missed Alarm (AoMA) and the Age of False Alarm (AoFA), to account for the lasting impact incurred by different estimation errors. Notably, the two age processes evolve dependently and can distinguish between different types of estimation errors and different synced states. The aim is to achieve an optimal trade-off between the cost of estimation error, lasting impact, and communication utilization. The problem is formulated as an average-cost, countably infinite state-space Markov decision process (MDP). We show that the optimal policy exhibits a switching-type structure, making it amenable to policy storage and algorithm design. Notably, when the source is symmetric and states are equally important, the optimal policy has identical thresholds, i.e., threshold-type. Theoretical and numerical results underscore that our approach extends the current understanding of the Age of Incorrect Information (AoII) and the cost of actuation error (CAE), showing that they are specific instances within our broader framework.
PVUW 2024 Challenge on Complex Video Understanding: Methods and Results
Henghui Ding, Chang Liu, Yunchao Wei, Nikhila Ravi, Shuting He, Song Bai, Philip Torr, Deshui Miao, Xin Li, Zhenyu He, Yaowei Wang, Ming-Hsuan Yang, Zhensong Xu, Jiangtao Yao, Chengjing Wu, Ting Liu, Luoqi Liu, Xinyu Liu, Jing Zhang, Kexin Zhang, et al (17)
Jun 26 2024 cs.CV arXiv:2406.17005v1

@misc{2406.17005, author = {Henghui Ding and Chang Liu and Yunchao Wei and Nikhila Ravi and Shuting He and Song Bai and Philip Torr and Deshui Miao and Xin Li and Zhenyu He and Yaowei Wang and Ming-Hsuan Yang and Zhensong Xu and Jiangtao Yao and Chengjing Wu and Ting Liu and Luoqi Liu and Xinyu Liu and Jing Zhang and Kexin Zhang and Yuting Yang and Licheng Jiao and Shuyuan Yang and Mingqi Gao and Jingnan Luo and Jinyu Yang and Jungong Han and Feng Zheng and Bin Cao and Yisi Zhang and Xuanxu Lin and Xingjian He and Bo Zhao and Jing Liu and Feiyu Pan and Hao Fang and Xiankai Lu}, title = {{PVUW} 2024 {C}hallenge on {C}omplex {V}ideo {U}nderstanding: {M}ethods and {R}esults}, year = {2024}, eprint = {2406.17005}, note = {arXiv:2406.17005v1} }
Pixel-level Video Understanding in the Wild Challenge (PVUW) focuses on complex video understanding. In this CVPR 2024 workshop, we add two new tracks, the Complex Video Object Segmentation Track based on the MOSE dataset and the Motion Expression guided Video Segmentation track based on the MeViS dataset. In the two new tracks, we provide additional videos and annotations that feature challenging elements, such as the disappearance and reappearance of objects, inconspicuous small objects, heavy occlusions, and crowded environments in MOSE. Moreover, we provide a new motion expression guided video segmentation dataset MeViS to study the natural language-guided video understanding in complex environments. These new videos, sentences, and annotations enable us to foster the development of a more comprehensive and robust pixel-level understanding of video scenes in complex environments and realistic scenarios. The MOSE challenge had 140 registered teams in total; 65 teams participated in the validation phase and 12 teams made valid submissions in the final challenge phase. The MeViS challenge had 225 registered teams in total; 50 teams participated in the validation phase and 5 teams made valid submissions in the final challenge phase.
Urban-Focused Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing
Xinbo Zhao, Yingxue Zhang, Xin Zhang, Yu Yang, Yiqun Xie, Yanhua Li, Jun Luo
Jun 21 2024 cs.LG arXiv:2406.14054v1

@misc{2406.14054, author = {Xinbo Zhao and Yingxue Zhang and Xin Zhang and Yu Yang and Yiqun Xie and Yanhua Li and Jun Luo}, title = {{U}rban-{F}ocused {M}ulti-{T}ask {O}ffline {R}einforcement {L}earning with {C}ontrastive {D}ata {S}haring}, year = {2024}, eprint = {2406.14054}, note = {arXiv:2406.14054v1} }
Enhancing diverse human decision-making processes in an urban environment is a critical issue across various applications, including ride-sharing vehicle dispatching, public transportation management, and autonomous driving. Offline reinforcement learning (RL) is a promising approach to learn and optimize human urban strategies (or policies) from pre-collected human-generated spatial-temporal urban data. However, standard offline RL faces two significant challenges: (1) data scarcity and data heterogeneity, and (2) distributional shift. In this paper, we introduce MODA -- a Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing approach. MODA addresses the challenges of data scarcity and heterogeneity in a multi-task urban setting through Contrastive Data Sharing among tasks. This technique involves extracting latent representations of human behaviors by contrasting positive and negative data pairs. It then shares data presenting similar representations with the target task, facilitating data augmentation for each task. Moreover, MODA develops a novel model-based multi-task offline RL algorithm. This algorithm constructs a robust Markov Decision Process (MDP) by integrating a dynamics model with a Generative Adversarial Network (GAN). Once the robust MDP is established, any online RL or planning algorithm can be applied. Extensive experiments conducted in a real-world multi-task urban setting validate the effectiveness of MODA. The results demonstrate that MODA exhibits significant improvements compared to state-of-the-art baselines, showcasing its capability in advancing urban decision-making processes. We also made our code available to the research community.
SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding
Junwei Luo, Zhen Pang, Yongjun Zhang, Tingzhu Wang, Linlin Wang, Bo Dang, Jiangwei Lao, Jian Wang, Jingdong Chen, Yihua Tan, Yansheng Li
Jun 17 2024 cs.CV cs.AI arXiv:2406.10100v2

@misc{2406.10100, author = {Junwei Luo and Zhen Pang and Yongjun Zhang and Tingzhu Wang and Linlin Wang and Bo Dang and Jiangwei Lao and Jian Wang and Jingdong Chen and Yihua Tan and Yansheng Li}, title = {{S}ky{S}ense{GPT}: {A} {F}ine-{G}rained {I}nstruction {T}uning {D}ataset and {M}odel for {R}emote {S}ensing {V}ision-{L}anguage {U}nderstanding}, year = {2024}, eprint = {2406.10100}, note = {arXiv:2406.10100v2} }
Remote Sensing Large Multi-Modal Models (RSLMMs) are developing rapidly and showcase significant capabilities in remote sensing imagery (RSI) comprehension. However, due to the limitations of existing datasets, RSLMMs have shortcomings in understanding the rich semantic relations among objects in complex remote sensing scenes. To unlock RSLMMs' complex comprehension ability, we propose a large-scale instruction tuning dataset FIT-RS, containing 1,800,851 instruction samples. FIT-RS covers common interpretation tasks and innovatively introduces several complex comprehension tasks of escalating difficulty, ranging from relation reasoning to image-level scene graph generation. Based on FIT-RS, we build the FIT-RSFG benchmark. Furthermore, we establish a new benchmark to evaluate the fine-grained relation comprehension capabilities of LMMs, named FIT-RSRC. Based on combined instruction data, we propose SkySenseGPT, which achieves outstanding performance on both public datasets and FIT-RSFG, surpassing existing RSLMMs. We hope the FIT-RS dataset can enhance the relation comprehension capability of RSLMMs and provide a large-scale fine-grained data source for the remote sensing community. The dataset will be available at https://github.com/Luo-Z13/SkySenseGPT
A Comprehensive Graph Pooling Benchmark: Effectiveness, Robustness and Generalizability
Pengyun Wang, Junyu Luo, Yanxin Shen, Ming Zhang, Siyu Heng, Xiao Luo
Jun 14 2024 cs.LG cs.AI arXiv:2406.09031v3

@misc{2406.09031, author = {Pengyun Wang and Junyu Luo and Yanxin Shen and Ming Zhang and Siyu Heng and Xiao Luo}, title = {{A} {C}omprehensive {G}raph {P}ooling {B}enchmark: {E}ffectiveness, {R}obustness and {G}eneralizability}, year = {2024}, eprint = {2406.09031}, note = {arXiv:2406.09031v3} }
Graph pooling has gained attention for its ability to obtain effective node and graph representations for various downstream tasks. Despite the recent surge in graph pooling approaches, there is a lack of standardized experimental settings and fair benchmarks to evaluate their performance. To address this issue, we have constructed a comprehensive benchmark that includes 17 graph pooling methods and 28 different graph datasets. This benchmark systematically assesses the performance of graph pooling methods in three dimensions, i.e., effectiveness, robustness, and generalizability. We first evaluate the performance of these graph pooling approaches across different tasks including graph classification, graph regression and node classification. Then, we investigate their performance under potential noise attacks and out-of-distribution shifts in real-world scenarios. We also involve detailed efficiency analysis, backbone analysis, parameter analysis and visualization to provide more evidence. Extensive experiments validate the strong capability and applicability of graph pooling approaches in various scenarios, which can provide valuable insights and guidance for deep geometric learning research. The source code of our benchmark is available at https://github.com/goose315/Graph_Pooling_Benchmark.
STAR: A First-Ever Dataset and A Large-Scale Benchmark for Scene Graph Generation in Large-Size Satellite Imagery
Yansheng Li, Linlin Wang, Tingzhu Wang, Xue Yang, Junwei Luo, Qi Wang, Youming Deng, Wenbin Wang, Xian Sun, Haifeng Li, Bo Dang, Yongjun Zhang, Yi Yu, Junchi Yan
Jun 14 2024 cs.CV cs.AI arXiv:2406.09410v3

@misc{2406.09410, author = {Yansheng Li and Linlin Wang and Tingzhu Wang and Xue Yang and Junwei Luo and Qi Wang and Youming Deng and Wenbin Wang and Xian Sun and Haifeng Li and Bo Dang and Yongjun Zhang and Yi Yu and Junchi Yan}, title = {{STAR}: {A} {F}irst-{E}ver {D}ataset and {A} {L}arge-{S}cale {B}enchmark for {S}cene {G}raph {G}eneration in {L}arge-{S}ize {S}atellite {I}magery}, year = {2024}, eprint = {2406.09410}, note = {arXiv:2406.09410v3} }
Scene graph generation (SGG) in satellite imagery (SAI) helps promote the understanding of geospatial scenarios from perception to cognition. In SAI, objects exhibit great variations in scales and aspect ratios, and there exist rich relationships between objects (even between spatially disjoint objects), which makes it attractive to holistically conduct SGG in large-size very-high-resolution (VHR) SAI. However, such SGG datasets are lacking. Due to the complexity of large-size SAI, mining triplets <subject, relationship, object> heavily relies on long-range contextual reasoning. Consequently, SGG models designed for small-size natural imagery are not directly applicable to large-size SAI. This paper constructs a large-scale dataset for SGG in large-size VHR SAI with image sizes ranging from 512 x 768 to 27,860 x 31,096 pixels, named STAR (Scene graph generaTion in lArge-size satellite imageRy), encompassing over 210K objects and over 400K triplets. To realize SGG in large-size SAI, we propose a context-aware cascade cognition (CAC) framework to understand SAI regarding object detection (OBD), pair pruning and relationship prediction for SGG. We also release a SAI-oriented SGG toolkit with about 30 OBD and 10 SGG methods which need further adaptation by our devised modules on our challenging STAR dataset. The dataset and toolkit are available at: https://linlin-dev.github.io/project/STAR.
Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding
Yue Xu, Kaizhi Yang, Jiebo Luo, Xuejin Chen
Jun 14 2024 cs.CV cs.MM arXiv:2406.08907v1

@misc{2406.08907, author = {Yue Xu and Kaizhi Yang and Jiebo Luo and Xuejin Chen}, title = {{D}ual {A}ttribute-{S}patial {R}elation {A}lignment for 3{D} {V}isual {G}rounding}, year = {2024}, eprint = {2406.08907}, note = {arXiv:2406.08907v1} }
3D visual grounding is an emerging research area dedicated to making connections between the 3D physical world and natural language, which is crucial for achieving embodied intelligence. In this paper, we propose DASANet, a Dual Attribute-Spatial relation Alignment Network that separately models and aligns object attributes and spatial relation features between language and 3D vision modalities. We decompose both the language and 3D point cloud input into two separate parts and design a dual-branch attention module to separately model the decomposed inputs while preserving global context in attribute-spatial feature fusion by cross attentions. Our DASANet achieves the highest grounding accuracy of 65.1% on the Nr3D dataset, 1.3% higher than the best competitor. Besides, the visualization of the two branches proves that our method is efficient and highly interpretable.
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance
Chenwei Lin, Hanjia Lyu, Xian Xu, Jiebo Luo
Jun 14 2024 cs.CV cs.AI cs.CL cs.LG arXiv:2406.09105v1

@misc{2406.09105, author = {Chenwei Lin and Hanjia Lyu and Xian Xu and Jiebo Luo}, title = {{INS}-{MMB}ench: {A} {C}omprehensive {B}enchmark for {E}valuating {LVLM}s' {P}erformance in {I}nsurance}, year = {2024}, eprint = {2406.09105}, note = {arXiv:2406.09105v1} }
Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in various general multimodal applications such as image recognition and visual reasoning, and have also shown promising potential in specialized domains. However, the application potential of LVLMs in the insurance domain, which is characterized by rich application scenarios and abundant multimodal data, has not been effectively explored. There is no systematic review of multimodal tasks in the insurance domain, nor a benchmark specifically designed to evaluate the capabilities of LVLMs in insurance. This gap hinders the development of LVLMs within the insurance domain. In this paper, we systematically review and distill multimodal tasks for four representative types of insurance: auto insurance, property insurance, health insurance, and agricultural insurance. We propose INS-MMBench, the first comprehensive LVLMs benchmark tailored for the insurance domain. INS-MMBench comprises a total of 2.2K thoroughly designed multiple-choice questions, covering 12 meta-tasks and 22 fundamental tasks. Furthermore, we evaluate multiple representative LVLMs, including closed-source models such as GPT-4o and open-source models like BLIP-2. This evaluation not only validates the effectiveness of our benchmark but also provides an in-depth performance analysis of current LVLMs on various multimodal tasks in the insurance domain. We hope that INS-MMBench will facilitate the further application of LVLMs in the insurance domain and inspire interdisciplinary development. Our dataset and evaluation code are available at https://github.com/FDU-INS/INS-MMBench.
1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation
Mingqi Gao, Jingnan Luo, Jinyu Yang, Jungong Han, Feng Zheng
Jun 12 2024 cs.CV arXiv:2406.07043v1

@misc{2406.07043, author = {Mingqi Gao and Jingnan Luo and Jinyu Yang and Jungong Han and Feng Zheng}, title = {1st {P}lace {S}olution for {M}e{V}i{S} {T}rack in {CVPR} 2024 {PVUW} {W}orkshop: {M}otion {E}xpression guided {V}ideo {S}egmentation}, year = {2024}, eprint = {2406.07043}, note = {arXiv:2406.07043v1} }
Motion Expression guided Video Segmentation (MeViS), as an emerging task, poses many new challenges to the field of referring video object segmentation (RVOS). In this technical report, we investigated and validated the effectiveness of static-dominant data and frame sampling on this challenging setting. Our solution achieves a J&F score of 0.5447 in the competition phase and ranks 1st in the MeViS track of the PVUW Challenge. The code is available at: https://github.com/Tapall-AI/MeViS_Track_Solution_2024.