178 results for au:Xiao_S in:cs

Spiking Transformer with Spatial-Temporal Attention
Donghyun Lee, Yuhang Li, Youngeun Kim, Shiting Xiao, Priyadarshini Panda
Oct 01 2024 cs.NE arXiv:2409.19764v1

@misc{2409.19764, author = {Donghyun Lee and Yuhang Li and Youngeun Kim and Shiting Xiao and Priyadarshini Panda}, title = {{S}piking {T}ransformer with {S}patial-{T}emporal {A}ttention}, year = {2024}, eprint = {2409.19764}, note = {arXiv:2409.19764v1} }
Spiking Neural Networks (SNNs) present a compelling and energy-efficient alternative to traditional Artificial Neural Networks (ANNs) due to their sparse binary activation. Leveraging the success of the transformer architecture, the spiking transformer architecture is explored to scale up dataset size and performance. However, existing works only consider the spatial self-attention in spiking transformer, neglecting the inherent temporal context across the timesteps. In this work, we introduce Spiking Transformer with Spatial-Temporal Attention (STAtten), a simple and straightforward architecture designed to integrate spatial and temporal information in self-attention with negligible additional computational load. The STAtten divides the temporal or token index and calculates the self-attention in a cross-manner to effectively incorporate spatial-temporal information. We first verify our spatial-temporal attention mechanism's ability to capture long-term temporal dependencies using sequential datasets. Moreover, we validate our approach through extensive experiments on varied datasets, including CIFAR10/100, ImageNet, CIFAR10-DVS, and N-Caltech101. Notably, our cross-attention mechanism achieves an accuracy of 78.39% on the ImageNet dataset.
FusionANNS: An Efficient CPU/GPU Cooperative Processing Architecture for Billion-scale Approximate Nearest Neighbor Search
Bing Tian, Haikun Liu, Yuhang Tang, Shihai Xiao, Zhuohui Duan, Xiaofei Liao, Xuecang Zhang, Junhua Zhu, Yu Zhang
Sep 26 2024 cs.IR cs.DB cs.OS arXiv:2409.16576v1

@misc{2409.16576, author = {Bing Tian and Haikun Liu and Yuhang Tang and Shihai Xiao and Zhuohui Duan and Xiaofei Liao and Xuecang Zhang and Junhua Zhu and Yu Zhang}, title = {{F}usion{ANNS}: {A}n {E}fficient {CPU}/{GPU} {C}ooperative {P}rocessing {A}rchitecture for {B}illion-scale {A}pproximate {N}earest {N}eighbor {S}earch}, year = {2024}, eprint = {2409.16576}, note = {arXiv:2409.16576v1} }
Approximate nearest neighbor search (ANNS) has emerged as a crucial component of database and AI infrastructure. Ever-increasing vector datasets pose significant challenges in terms of performance, cost, and accuracy for ANNS services. No modern ANNS system addresses these issues simultaneously. We present FusionANNS, a high-throughput, low-latency, cost-efficient, and high-accuracy ANNS system for billion-scale datasets using SSDs and only one entry-level GPU. The key idea of FusionANNS lies in CPU/GPU collaborative filtering and re-ranking mechanisms, which significantly reduce I/O operations across CPUs, GPU, and SSDs to break through the I/O performance bottleneck. Specifically, we propose three novel designs: (1) multi-tiered indexing to avoid data swapping between CPUs and GPU, (2) heuristic re-ranking to eliminate unnecessary I/Os and computations while guaranteeing high accuracy, and (3) redundant-aware I/O deduplication to further improve I/O efficiency. We implement FusionANNS and compare it with the state-of-the-art SSD-based ANNS system, SPANN, and the GPU-accelerated in-memory ANNS system, RUMMY. Experimental results show that FusionANNS achieves 1) 9.4-13.1X higher queries per second (QPS) and 5.7-8.8X higher cost efficiency compared with SPANN, and 2) 2-4.9X higher QPS and 2.3-6.8X higher cost efficiency compared with RUMMY, while guaranteeing low latency and high accuracy.
Lighter And Better: Towards Flexible Context Adaptation For Retrieval Augmented Generation
Zheng Liu, Chenyuan Wu, Ninglu Shao, Shitao Xiao, Chaozhuo Li, Defu Lian
Sep 25 2024 cs.CL arXiv:2409.15699v1

@misc{2409.15699, author = {Zheng Liu and Chenyuan Wu and Ninglu Shao and Shitao Xiao and Chaozhuo Li and Defu Lian}, title = {{L}ighter {A}nd {B}etter: {T}owards {F}lexible {C}ontext {A}daptation {F}or {R}etrieval {A}ugmented {G}eneration}, year = {2024}, eprint = {2409.15699}, note = {arXiv:2409.15699v1} }
The existing Retrieval-Augmented Generation (RAG) systems face significant challenges in terms of cost and effectiveness. On one hand, they need to encode the lengthy retrieved contexts before responding to the input tasks, which imposes substantial computational overhead. On the other hand, directly using generic Large Language Models (LLMs) often leads to sub-optimal answers, while task-specific fine-tuning may compromise the LLMs' general capabilities. To address these challenges, we introduce a novel approach called FlexRAG (Flexible Context Adaptation for RAG). In this approach, the retrieved contexts are compressed into compact embeddings before being encoded by the LLMs. Simultaneously, these compressed embeddings are optimized to enhance downstream RAG performance. A key feature of FlexRAG is its flexibility, which enables effective support for diverse compression ratios and selective preservation of important contexts. Thanks to these technical designs, FlexRAG achieves superior generation quality while significantly reducing running costs. Comprehensive experiments on various question-answering datasets validate our approach as a cost-effective and flexible solution for RAG systems.
Making Text Embedders Few-Shot Learners
Chaofan Li, MingHao Qin, Shitao Xiao, Jianlyu Chen, Kun Luo, Yingxia Shao, Defu Lian, Zheng Liu
Sep 25 2024 cs.IR cs.CL arXiv:2409.15700v1

@misc{2409.15700, author = {Chaofan Li and MingHao Qin and Shitao Xiao and Jianlyu Chen and Kun Luo and Yingxia Shao and Defu Lian and Zheng Liu}, title = {{M}aking {T}ext {E}mbedders {F}ew-{S}hot {L}earners}, year = {2024}, eprint = {2409.15700}, note = {arXiv:2409.15700v1} }
Large language models (LLMs) with decoder-only architectures demonstrate remarkable in-context learning (ICL) capabilities. This feature enables them to effectively handle both familiar and novel tasks by utilizing examples provided within their input context. Recognizing the potential of this capability, we propose leveraging the ICL feature in LLMs to enhance the process of text embedding generation. To this end, we introduce a novel model bge-en-icl, which employs few-shot examples to produce high-quality text embeddings. Our approach integrates task-related examples directly into the query side, resulting in significant improvements across various tasks. Additionally, we have investigated how to effectively utilize LLMs as embedding models, including various attention mechanisms, pooling methods, etc. Our findings suggest that retaining the original framework often yields the best results, underscoring that simplicity is best. Experimental results on the MTEB and AIR-Bench benchmarks demonstrate that our approach sets new state-of-the-art (SOTA) performance. Our model, code and dataset are freely available at https://github.com/FlagOpen/FlagEmbedding .
OmniGen: Unified Image Generation
Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, Zheng Liu
Sep 18 2024 cs.CV cs.AI arXiv:2409.11340v1

@misc{2409.11340, author = {Shitao Xiao and Yueze Wang and Junjie Zhou and Huaying Yuan and Xingrun Xing and Ruiran Yan and Shuting Wang and Tiejun Huang and Zheng Liu}, title = {{O}mni{G}en: {U}nified {I}mage {G}eneration}, year = {2024}, eprint = {2409.11340}, note = {arXiv:2409.11340v1} }
In this work, we introduce OmniGen, a new diffusion model for unified image generation. Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires additional modules such as ControlNet or IP-Adapter to process diverse control conditions. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports other downstream tasks, such as image editing, subject-driven generation, and visual-conditional generation. Additionally, OmniGen can handle classical computer vision tasks by transforming them into image generation tasks, such as edge detection and human pose recognition. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional text encoders. Moreover, it is more user-friendly compared to existing diffusion models, enabling complex tasks to be accomplished through instructions without the need for extra preprocessing steps (e.g., human pose estimation), thereby significantly simplifying the workflow of image generation. 3) Knowledge Transfer: Through learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model's reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and there remain several unresolved issues. We will open-source the related resources at https://github.com/VectorSpaceLab/OmniGen to foster advancements in this field.
From FDG to PSMA: A Hitchhiker's Guide to Multitracer, Multicenter Lesion Segmentation in PET/CT Imaging
Maximilian Rokuss, Balint Kovacs, Yannick Kirchhoff, Shuhan Xiao, Constantin Ulrich, Klaus H. Maier-Hein, Fabian Isensee
Sep 17 2024 eess.IV cs.AI cs.CV arXiv:2409.09478v2

@misc{2409.09478, author = {Maximilian Rokuss and Balint Kovacs and Yannick Kirchhoff and Shuhan Xiao and Constantin Ulrich and Klaus H.~Maier-Hein and Fabian Isensee}, title = {{F}rom {FDG} to {PSMA}: {A} {H}itchhiker's {G}uide to {M}ultitracer, {M}ulticenter {L}esion {S}egmentation in {PET}/{CT} {I}maging}, year = {2024}, eprint = {2409.09478}, note = {arXiv:2409.09478v2} }
Automated lesion segmentation in PET/CT scans is crucial for improving clinical workflows and advancing cancer diagnostics. However, the task is challenging due to physiological variability, different tracers used in PET imaging, and diverse imaging protocols across medical centers. To address this, the autoPET series was created to challenge researchers to develop algorithms that generalize across diverse PET/CT environments. This paper presents our solution for the autoPET III challenge, targeting multitracer, multicenter generalization using the nnU-Net framework with the ResEncL architecture. Key techniques include misalignment data augmentation and multi-modal pretraining across CT, MR, and PET datasets to provide an initial anatomical understanding. We incorporate organ supervision as a multitask approach, enabling the model to distinguish between physiological uptake and tracer-specific patterns, which is particularly beneficial in cases where no lesions are present. Compared to the default nnU-Net, which achieved a Dice score of 57.61, or the larger ResEncL (65.31) our model significantly improved performance with a Dice score of 68.40, alongside a reduction in false positive (FPvol: 7.82) and false negative (FNvol: 10.35) volumes. These results underscore the effectiveness of combining advanced network design, augmentation, pretraining, and multitask learning for PET/CT lesion segmentation. After evaluation on the test set, our approach was awarded the first place in the model-centric category (Team LesionTracer). Code is publicly available at https://github.com/MIC-DKFZ/autopet-3-submission.
Data-Centric Strategies for Overcoming PET/CT Heterogeneity: Insights from the AutoPET III Lesion Segmentation Challenge
Balint Kovacs, Shuhan Xiao, Maximilian Rokuss, Constantin Ulrich, Fabian Isensee, Klaus H. Maier-Hein
Sep 17 2024 eess.IV cs.CV arXiv:2409.10120v1

@misc{2409.10120, author = {Balint Kovacs and Shuhan Xiao and Maximilian Rokuss and Constantin Ulrich and Fabian Isensee and Klaus H.~Maier-Hein}, title = {{D}ata-{C}entric {S}trategies for {O}vercoming {PET}/{CT} {H}eterogeneity: {I}nsights from the {A}uto{PET} {III} {L}esion {S}egmentation {C}hallenge}, year = {2024}, eprint = {2409.10120}, note = {arXiv:2409.10120v1} }
The third autoPET challenge introduced a new data-centric task this year, shifting the focus from model development to improving metastatic lesion segmentation on PET/CT images through data quality and handling strategies. In response, we developed targeted methods to enhance segmentation performance tailored to the characteristics of PET/CT imaging. Our approach encompasses two key elements. First, to address potential alignment errors between CT and PET modalities as well as the prevalence of punctate lesions, we modified the baseline data augmentation scheme and extended it with misalignment augmentation. This adaptation aims to improve segmentation accuracy, particularly for tiny metastatic lesions. Second, to tackle the variability in image dimensions significantly affecting the prediction time, we implemented a dynamic ensembling and test-time augmentation (TTA) strategy. This method optimizes the use of ensembling and TTA within a 5-minute prediction time limit, effectively leveraging the generalization potential for both small and large images. Both of our solutions are designed to be robust across different tracers and institutional settings, offering a general, yet imaging-specific approach to the multi-tracer and multi-institutional challenges of the competition. We made the challenge repository with our modifications publicly available at https://github.com/MIC-DKFZ/miccai2024_autopet3_datacentric.
ReSpike: Residual Frames-based Hybrid Spiking Neural Networks for Efficient Action Recognition
Shiting Xiao, Yuhang Li, Youngeun Kim, Donghyun Lee, Priyadarshini Panda
Sep 04 2024 cs.CV cs.LG arXiv:2409.01564v1

@misc{2409.01564, author = {Shiting Xiao and Yuhang Li and Youngeun Kim and Donghyun Lee and Priyadarshini Panda}, title = {{R}e{S}pike: {R}esidual {F}rames-based {H}ybrid {S}piking {N}eural {N}etworks for {E}fficient {A}ction {R}ecognition}, year = {2024}, eprint = {2409.01564}, note = {arXiv:2409.01564v1} }
Spiking Neural Networks (SNNs) have emerged as a compelling, energy-efficient alternative to traditional Artificial Neural Networks (ANNs) for static image tasks such as image classification and segmentation. However, in the more complex video classification domain, SNN-based methods fall considerably short of ANN-based benchmarks due to the challenges in processing dense frame sequences. To bridge this gap, we propose ReSpike, a hybrid framework that synergizes the strengths of ANNs and SNNs to tackle action recognition tasks with high accuracy and low energy cost. By decomposing film clips into spatial and temporal components, i.e., RGB image Key Frames and event-like Residual Frames, ReSpike leverages ANN for learning spatial information and SNN for learning temporal information. In addition, we propose a multi-scale cross-attention mechanism for effective feature fusion. Compared to state-of-the-art SNN baselines, our ReSpike hybrid architecture demonstrates significant performance improvements (e.g., >30% absolute accuracy improvement on HMDB-51, UCF-101, and Kinetics-400). Furthermore, ReSpike achieves comparable performance with prior ANN approaches while bringing better accuracy-energy tradeoff.
CinePreGen: Camera Controllable Video Previsualization via Engine-powered Diffusion
Yiran Chen, Anyi Rao, Xuekun Jiang, Shishi Xiao, Ruiqing Ma, Zeyu Wang, Hui Xiong, Bo Dai
Sep 02 2024 cs.CV cs.HC arXiv:2408.17424v1

@misc{2408.17424, author = {Yiran Chen and Anyi Rao and Xuekun Jiang and Shishi Xiao and Ruiqing Ma and Zeyu Wang and Hui Xiong and Bo Dai}, title = {{C}ine{P}re{G}en: {C}amera {C}ontrollable {V}ideo {P}revisualization via {E}ngine-powered {D}iffusion}, year = {2024}, eprint = {2408.17424}, note = {arXiv:2408.17424v1} }
With advancements in video generative AI models (e.g., SORA), creators are increasingly using these techniques to enhance video previsualization. However, they face challenges with incomplete and mismatched AI workflows. Existing methods mainly rely on text descriptions and struggle with camera placement, a key component of previsualization. To address these issues, we introduce CinePreGen, a visual previsualization system enhanced with engine-powered diffusion. It features a novel camera and storyboard interface that offers dynamic control, from global to local camera adjustments. This is combined with a user-friendly AI rendering workflow, which aims to achieve consistent results through multi-masked IP-Adapter and engine simulation guidelines. In our comprehensive evaluation study, we demonstrate that our system reduces development viscosity (i.e., the complexity and challenges in the development process), meets users' needs for extensive control and iteration in the design process, and outperforms other AI video production workflows in cinematic camera movement, as shown by our experiments and a within-subjects user study. With its intuitive camera controls and realistic rendering of camera motion, CinePreGen shows great potential for improving video production for both individual creators and industry professionals.
Large Language Models as Foundations for Next-Gen Dense Retrieval: A Comprehensive Empirical Assessment
Kun Luo, Minghao Qin, Zheng Liu, Shitao Xiao, Jun Zhao, Kang Liu
Aug 23 2024 cs.CL arXiv:2408.12194v2

@misc{2408.12194, author = {Kun Luo and Minghao Qin and Zheng Liu and Shitao Xiao and Jun Zhao and Kang Liu}, title = {{L}arge {L}anguage {M}odels as {F}oundations for {N}ext-{G}en {D}ense {R}etrieval: {A} {C}omprehensive {E}mpirical {A}ssessment}, year = {2024}, eprint = {2408.12194}, note = {arXiv:2408.12194v2} }
Pretrained language models like BERT and T5 serve as crucial backbone encoders for dense retrieval. However, these models often exhibit limited generalization capabilities and face challenges in improving in-domain accuracy. Recent research has explored using large language models (LLMs) as retrievers, achieving SOTA performance across various tasks. Despite these advancements, the specific benefits of LLMs over traditional retrievers and the impact of different LLM configurations, such as parameter sizes, pretraining duration, and alignment processes, on retrieval tasks remain unclear. In this work, we conduct a comprehensive empirical study on a wide range of retrieval tasks, including in-domain accuracy, data efficiency, zero-shot generalization, lengthy retrieval, instruction-based retrieval, and multi-task learning. We evaluate over 15 different backbone LLMs and non-LLMs. Our findings reveal that larger models and extensive pretraining consistently enhance in-domain accuracy and data efficiency. Additionally, larger models demonstrate significant potential in zero-shot generalization, lengthy retrieval, instruction-based retrieval, and multi-task learning. These results underscore the advantages of LLMs as versatile and effective backbone encoders in dense retrieval, providing valuable insights for future research and development in this field.
On-the-fly Synthesis for LTL over Finite Traces: An Efficient Approach that Counts
Shengping Xiao, Yongkang Li, Shufang Zhu, Jun Sun, Jianwen Li, Geguang Pu, Moshe Y. Vardi
Aug 15 2024 cs.AI cs.LO arXiv:2408.07324v1

@misc{2408.07324, author = {Shengping Xiao and Yongkang Li and Shufang Zhu and Jun Sun and Jianwen Li and Geguang Pu and Moshe Y.~Vardi}, title = {{O}n-the-fly {S}ynthesis for {LTL} over {F}inite {T}races: {A}n {E}fficient {A}pproach that {C}ounts}, year = {2024}, eprint = {2408.07324}, note = {arXiv:2408.07324v1} }
We present an on-the-fly synthesis framework for Linear Temporal Logic over finite traces (LTLf) based on top-down deterministic automata construction. Existing approaches rely on constructing a complete Deterministic Finite Automaton (DFA) corresponding to the LTLf specification, a process with doubly exponential complexity relative to the formula size in the worst case. In this case, the synthesis procedure cannot be conducted until the entire DFA is constructed. This inefficiency is the main bottleneck of existing approaches. To address this challenge, we first present a method for converting LTLf into Transition-based DFA (TDFA) by directly leveraging LTLf semantics, incorporating intermediate results as direct components of the final automaton to enable parallelized synthesis and automata construction. We then explore the relationship between LTLf synthesis and TDFA games and subsequently develop an algorithm for performing LTLf synthesis using on-the-fly TDFA game solving. This algorithm traverses the state space in a global forward manner combined with a local backward method, along with the detection of strongly connected components. Moreover, we introduce two optimization techniques -- model-guided synthesis and state entailment -- to enhance the practical efficiency of our approach. Experimental results demonstrate that our on-the-fly approach achieves the best performance on the tested benchmarks and effectively complements existing tools and approaches.
ModalChorus: Visual Probing and Alignment of Multi-modal Embeddings via Modal Fusion Map
Yilin Ye, Shishi Xiao, Xingchen Zeng, Wei Zeng
Jul 18 2024 cs.CV cs.AI cs.HC cs.IR arXiv:2407.12315v1

@misc{2407.12315, author = {Yilin Ye and Shishi Xiao and Xingchen Zeng and Wei Zeng}, title = {{M}odal{C}horus: {V}isual {P}robing and {A}lignment of {M}ulti-modal {E}mbeddings via {M}odal {F}usion {M}ap}, year = {2024}, eprint = {2407.12315}, note = {arXiv:2407.12315v1} }
Multi-modal embeddings form the foundation for vision-language models, such as CLIP embeddings, the most widely used text-image embeddings. However, these embeddings are vulnerable to subtle misalignment of cross-modal features, resulting in decreased model performance and diminished generalization. To address this problem, we design ModalChorus, an interactive system for visual probing and alignment of multi-modal embeddings. ModalChorus primarily offers a two-stage process: 1) embedding probing with Modal Fusion Map (MFM), a novel parametric dimensionality reduction method that integrates both metric and nonmetric objectives to enhance modality fusion; and 2) embedding alignment that allows users to interactively articulate intentions for both point-set and set-set alignments. Quantitative and qualitative comparisons for CLIP embeddings with existing dimensionality reduction (e.g., t-SNE and MDS) and data fusion (e.g., data context map) methods demonstrate the advantages of MFM in showcasing cross-modal features over common vision-language datasets. Case studies reveal that ModalChorus can facilitate intuitive discovery of misalignment and efficient re-alignment in scenarios ranging from zero-shot classification to cross-modal retrieval and generation.
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
Chen Ju, Haicheng Wang, Haozhe Cheng, Xu Chen, Zhonghua Zhai, Weilin Huang, Jinsong Lan, Shuai Xiao, Bo Zheng
Jul 17 2024 cs.CV arXiv:2407.11717v1

@misc{2407.11717, author = {Chen Ju and Haicheng Wang and Haozhe Cheng and Xu Chen and Zhonghua Zhai and Weilin Huang and Jinsong Lan and Shuai Xiao and Bo Zheng}, title = {{T}urbo: {I}nformativity-{D}riven {A}cceleration {P}lug-{I}n for {V}ision-{L}anguage {L}arge {M}odels}, year = {2024}, eprint = {2407.11717}, note = {arXiv:2407.11717v1} }
Vision-Language Large Models (VLMs) have recently become a primary backbone of AI due to their impressive performance. However, their expensive computation costs, i.e., throughput and delay, impede their potential in real-world scenarios. To accelerate VLMs, most existing methods focus on the model perspective (pruning, distillation, quantization) but completely overlook data-perspective redundancy. To fill this gap, this paper highlights the severity of data redundancy and designs a plug-and-play Turbo module guided by information degree to prune inefficient tokens from visual or textual data. In pursuit of efficiency-performance trade-offs, information degree takes two crucial factors into consideration: mutual redundancy and semantic value. Concretely, the former evaluates data duplication between sequential tokens, while the latter evaluates each token by its contribution to the overall semantics. As a result, tokens with high information degree carry less redundancy and stronger semantics. For VLM computation, Turbo works as a user-friendly plug-in that sorts data by information degree and uses only the top-ranked tokens to save costs. Its advantages are multifaceted, e.g., general compatibility with various VLMs across understanding and generation, and simple use without re-training and with trivial engineering effort. Experiments on multiple VLM benchmarks demonstrate that Turbo achieves good acceleration with a negligible performance drop.
Enhanced Self-supervised Learning for Multi-modality MRI Segmentation and Classification: A Novel Approach Avoiding Model Collapse
Linxuan Han, Sa Xiao, Zimeng Li, Haidong Li, Xiuchao Zhao, Fumin Guo, Yeqing Han, Xin Zhou
Jul 16 2024 eess.IV cs.AI cs.CV arXiv:2407.10377v2

@misc{2407.10377, author = {Linxuan Han and Sa Xiao and Zimeng Li and Haidong Li and Xiuchao Zhao and Fumin Guo and Yeqing Han and Xin Zhou}, title = {{E}nhanced {S}elf-supervised {L}earning for {M}ulti-modality {MRI} {S}egmentation and {C}lassification: {A} {N}ovel {A}pproach {A}voiding {M}odel {C}ollapse}, year = {2024}, eprint = {2407.10377}, note = {arXiv:2407.10377v2} }
Multi-modality magnetic resonance imaging (MRI) can provide complementary information for computer-aided diagnosis. Traditional deep learning algorithms are suitable for identifying specific anatomical structures, segmenting lesions, and classifying diseases with magnetic resonance images. However, manual labels are limited due to high expense, which hinders further improvement of model accuracy. Self-supervised learning (SSL) can effectively learn feature representations from unlabeled data by pre-training and is demonstrated to be effective in natural image analysis. Most SSL methods ignore the similarity of multi-modality MRI, leading to model collapse. This limits the efficiency of pre-training, causing low accuracy in downstream segmentation and classification tasks. To solve this challenge, we establish and validate a multi-modality MRI masked autoencoder consisting of a hybrid mask pattern (HMP) and a pyramid Barlow twin (PBT) module for SSL on multi-modality MRI analysis. The HMP concatenates three masking steps, forcing the SSL to learn the semantic connections of multi-modality images by reconstructing the masked patches. We have proved that the proposed HMP can avoid model collapse. The PBT module exploits the pyramidal hierarchy of the network to construct a Barlow twin loss between masked and original views, aligning the semantic representations of image patches at different vision scales in latent space. Experiments on BraTS2023, PI-CAI, and lung gas MRI datasets further demonstrate the superiority of our framework over the state-of-the-art. The performance of the segmentation and classification is substantially enhanced, supporting the accurate detection of small lesion areas. The code is available at https://github.com/LinxuanHan/M2-MAE.
SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking
Xingrun Xing, Boyan Gao, Zheng Zhang, David A. Clifton, Shitao Xiao, Li Du, Guoqi Li, Jiajun Zhang
Jul 09 2024 cs.LG cs.NE arXiv:2407.04752v1

@misc{2407.04752, author = {Xingrun Xing and Boyan Gao and Zheng Zhang and David A.~Clifton and Shitao Xiao and Li Du and Guoqi Li and Jiajun Zhang}, title = {{S}pike{LLM}: {S}caling up {S}piking {N}eural {N}etwork to {L}arge {L}anguage {M}odels via {S}aliency-based {S}piking}, year = {2024}, eprint = {2407.04752}, note = {arXiv:2407.04752v1} }
The recent advancements in large language models (LLMs) with billions of parameters have significantly boosted their performance across various real-world applications. However, the inference processes for these models require substantial energy and computational resources, presenting considerable deployment challenges. In contrast, human brains, which contain approximately 86 billion biological neurons, exhibit significantly greater energy efficiency compared to LLMs with a similar number of parameters. Inspired by this, we redesign 7 to 70 billion parameter LLMs using bio-plausible spiking mechanisms, emulating the efficient behavior of the human brain. We propose the first spiking large language model as recent LLMs termed SpikeLLM. Coupled with the proposed model, a novel spike-driven quantization framework named Optimal Brain Spiking is introduced to reduce the energy cost and accelerate inference speed via two essential approaches: first (second)-order differentiation-based salient channel detection, and per-channel salient outlier expansion with Generalized Integrate-and-Fire neurons. Our proposed spike-driven quantization can plug into mainstream quantization training methods. In the OmniQuant pipeline, SpikeLLM reduces WikiText2 perplexity by 25.51% and improves average accuracy on 6 zero-shot datasets by 3.08% on a LLAMA2-7B 4A4W model. In the GPTQ pipeline, SpikeLLM realizes a sparse ternary quantization, which achieves additive in all linear layers. Compared with PB-LLM with similar operations, SpikeLLM also performs significantly better. We will release our code on GitHub.
Methodology of Adapting Large English Language Models for Specific Cultural Contexts
Wenjing Zhang, Siqi Xiao, Xuejiao Lei, Ning Wang, Huazheng Zhang, Meijuan An, Bikun Yang, Zhaoxiang Liu, Kai Wang, Shiguo Lian
Jun 27 2024 cs.CL cs.AI arXiv:2406.18192v2

@misc{2406.18192, author = {Wenjing Zhang and Siqi Xiao and Xuejiao Lei and Ning Wang and Huazheng Zhang and Meijuan An and Bikun Yang and Zhaoxiang Liu and Kai Wang and Shiguo Lian}, title = {{M}ethodology of {A}dapting {L}arge {E}nglish {L}anguage {M}odels for {S}pecific {C}ultural {C}ontexts}, year = {2024}, eprint = {2406.18192}, note = {arXiv:2406.18192v2} }
The rapid growth of large language models (LLMs) has emerged as a prominent trend in the field of artificial intelligence. However, current state-of-the-art LLMs are predominantly based on English. They encounter limitations when directly applied to tasks in specific cultural domains, due to deficiencies in domain-specific knowledge and misunderstandings caused by differences in cultural values. To address this challenge, our paper proposes a rapid adaptation method for large models in specific cultural contexts, which leverages instruction-tuning based on specific cultural knowledge and safety values data. Taking Chinese as the specific cultural context and utilizing the LLaMA3-8B as the experimental English LLM, the evaluation results demonstrate that the adapted LLM significantly enhances its capabilities in domain-specific knowledge and adaptability to safety values, while maintaining its original expertise advantages.
Generative Artificial Intelligence-Guided User Studies: An Application for Air Taxi Services
Shengdi Xiao, Jingjing Li, Tatsuki Fushimi, Yoichi Ochiai
Jun 19 2024 cs.HC cs.AI arXiv:2406.12296v1

@misc{2406.12296, author = {Shengdi Xiao and Jingjing Li and Tatsuki Fushimi and Yoichi Ochiai}, title = {{G}enerative {A}rtificial {I}ntelligence-{G}uided {U}ser {S}tudies: {A}n {A}pplication for {A}ir {T}axi {S}ervices}, year = {2024}, eprint = {2406.12296}, note = {arXiv:2406.12296v1} }
User studies are crucial for meeting user needs. In user studies, real experimental scenarios and participants are constructed and recruited. However, emerging and unfamiliar studies face limitations, including safety concerns and iterative efficiency. To address these challenges, this study utilizes a large language model (LLM) to create generative AI virtual scenarios for user experience. By recruiting real users to evaluate this experience, we can collect feedback that enables rapid iteration in the early design phase. The air taxi is particularly representative of these challenges and has been chosen as the case study for this research. The key contribution was designing a virtual ATJ using OpenAI's GPT-4 model and AI image and video generators. Based on the LLM-generated scripts, key visuals were created for the air taxi, and the ATJ was evaluated by 72 participants. Furthermore, the LLM demonstrated the ability to identify and suggest environments that significantly improve participants' attitudes toward air taxis. Education level and gender significantly influenced participants' attitudes and their satisfaction with the ATJ. Our study confirms the capability of generative AI to support user studies, providing a feasible approach and valuable insights for designing air taxi user experiences in the early design phase.
Learned Image Compression for HE-stained Histopathological Images via Stain Deconvolution
Maximilian Fischer, Peter Neher, Tassilo Wald, Silvia Dias Almeida, Shuhan Xiao, Peter Schüffler, Rickmer Braren, Michael Götz, Alexander Muckenhuber, Jens Kleesiek, Marco Nolden, Klaus Maier-Hein
Jun 19 2024 eess.IV cs.CV arXiv:2406.12623v1

@misc{2406.12623, author = {Maximilian Fischer and Peter Neher and Tassilo Wald and Silvia Dias Almeida and Shuhan Xiao and Peter Schüffler and Rickmer Braren and Michael Götz and Alexander Muckenhuber and Jens Kleesiek and Marco Nolden and Klaus Maier-Hein}, title = {{L}earned {I}mage {C}ompression for {HE}-stained {H}istopathological {I}mages via {S}tain {D}econvolution}, year = {2024}, eprint = {2406.12623}, note = {arXiv:2406.12623v1} }
Processing histopathological Whole Slide Images (WSI) leads to massive storage requirements for clinics worldwide. Even after lossy image compression during image acquisition, additional lossy compression is frequently possible without substantially affecting the performance of deep learning-based (DL) downstream tasks. In this paper, we show that the commonly used JPEG algorithm is not best suited for further compression and we propose Stain Quantized Latent Compression (SQLC), a novel DL-based histopathology data compression approach. SQLC compresses staining and RGB channels before passing them through a compression autoencoder (CAE) in order to obtain quantized latent representations for maximizing the compression. We show that our approach yields superior performance in a classification downstream task, compared to traditional approaches like JPEG, while image quality metrics like the Multi-Scale Structural Similarity Index (MS-SSIM) are largely preserved. Our method is available online.
MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding
Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng Liu
Jun 07 2024 cs.CV cs.AI cs.CL arXiv:2406.04264v2

@misc{2406.04264, author = {Junjie Zhou and Yan Shu and Bo Zhao and Boya Wu and Shitao Xiao and Xi Yang and Yongping Xiong and Bo Zhang and Tiejun Huang and Zheng Liu}, title = {{MLVU}: {A} {C}omprehensive {B}enchmark for {M}ulti-{T}ask {L}ong {V}ideo {U}nderstanding}, year = {2024}, eprint = {2406.04264}, note = {arXiv:2406.04264v2} }
The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To address the above problems, we propose a new benchmark, called MLVU (Multi-task Long Video Understanding Benchmark), for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. 2) The inclusion of various video genres, e.g., movies, surveillance footage, egocentric videos, cartoons, game videos, etc., which reflects the models' LVU performances in different scenarios. 3) The development of diversified evaluation tasks, which enables a comprehensive examination of MLLMs' key abilities in long-video understanding. The empirical study with 20 latest MLLMs reveals significant room for improvement in today's technique, as all existing methods struggle with most of the evaluation tasks and exhibit severe performance degradation when handling longer videos. Additionally, it suggests that factors such as context length, image-understanding quality, and the choice of LLM backbone can play critical roles in future advancements. We anticipate that MLVU will advance the research of long video understanding by providing a comprehensive and in-depth analysis of MLLMs.
VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval
Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong
Jun 07 2024 cs.IR cs.CL cs.CV arXiv:2406.04292v1

@misc{2406.04292, author = {Junjie Zhou and Zheng Liu and Shitao Xiao and Bo Zhao and Yongping Xiong}, title = {{VISTA}: {V}isualized {T}ext {E}mbedding {F}or {U}niversal {M}ulti-{M}odal {R}etrieval}, year = {2024}, eprint = {2406.04292}, note = {arXiv:2406.04292v1} }
Multi-modal retrieval is becoming increasingly popular in practice. However, the existing retrievers are mostly text-oriented and lack the capability to process visual information. Despite the presence of vision-language models like CLIP, the current methods are severely limited in representing text-only and image-only data. In this work, we present a new embedding model, VISTA, for universal multi-modal retrieval. Our work brings forth threefold technical contributions. Firstly, we introduce a flexible architecture which extends a powerful text encoder with image understanding capability by introducing visual token embeddings. Secondly, we develop two data generation strategies, which bring high-quality composed image-text data to facilitate the training of the embedding model. Thirdly, we introduce a multi-stage training algorithm, which first aligns the visual token embedding with the text encoder using massive weakly labeled data, and then develops multi-modal representation capability using the generated composed image-text data. In our experiments, VISTA achieves superior performance across a variety of multi-modal retrieval tasks in both zero-shot and supervised settings. Our model, data, and source code are available at https://github.com/FlagOpen/FlagEmbedding.
SpikeLM: Towards General Spike-Driven Language Modeling via Elastic Bi-Spiking Mechanisms
Xingrun Xing, Zheng Zhang, Ziyi Ni, Shitao Xiao, Yiming Ju, Siqi Fan, Yequan Wang, Jiajun Zhang, Guoqi Li
Jun 06 2024 cs.NE cs.CL cs.LG arXiv:2406.03287v1

@misc{2406.03287, author = {Xingrun Xing and Zheng Zhang and Ziyi Ni and Shitao Xiao and Yiming Ju and Siqi Fan and Yequan Wang and Jiajun Zhang and Guoqi Li}, title = {{S}pike{LM}: {T}owards {G}eneral {S}pike-{D}riven {L}anguage {M}odeling via {E}lastic {B}i-{S}piking {M}echanisms}, year = {2024}, eprint = {2406.03287}, note = {arXiv:2406.03287v1} }
Towards energy-efficient artificial intelligence similar to the human brain, the bio-inspired spiking neural networks (SNNs) have advantages of biological plausibility, event-driven sparsity, and binary activation. Recently, large-scale language models exhibit promising generalization capability, making it a valuable issue to explore more general spike-driven models. However, the binary spikes in existing SNNs fail to encode adequate semantic information, placing technological challenges for generalization. This work proposes the first fully spiking mechanism for general language tasks, including both discriminative and generative ones. Different from previous spikes with 0,1 levels, we propose a more general spike formulation with bi-directional, elastic amplitude, and elastic frequency encoding, while still maintaining the addition nature of SNNs. In a single time step, the spike is enhanced by direction and amplitude information; in spike frequency, a strategy to control spike firing rate is well designed. We plug this elastic bi-spiking mechanism in language modeling, named SpikeLM. It is the first time to handle general language tasks with fully spike-driven models, which achieve much higher accuracy than previously possible. SpikeLM also greatly bridges the performance gap between SNNs and ANNs in language modeling. Our code is available at https://github.com/Xingrun-Xing/SpikeLM.
Enhancing predictive imaging biomarker discovery through treatment effect analysis
Shuhan Xiao, Lukas Klein, Jens Petersen, Philipp Vollmuth, Paul F. Jaeger, Klaus H. Maier-Hein
Jun 05 2024 eess.IV cs.AI cs.CV cs.LG arXiv:2406.02534v1

@misc{2406.02534, author = {Shuhan Xiao and Lukas Klein and Jens Petersen and Philipp Vollmuth and Paul F.~Jaeger and Klaus H.~Maier-Hein}, title = {{E}nhancing predictive imaging biomarker discovery through treatment effect analysis}, year = {2024}, eprint = {2406.02534}, note = {arXiv:2406.02534v1} }
Identifying predictive biomarkers, which forecast individual treatment effectiveness, is crucial for personalized medicine and informs decision-making across diverse disciplines. These biomarkers are extracted from pre-treatment data, often within randomized controlled trials, and have to be distinguished from prognostic biomarkers, which are independent of treatment assignment. Our study focuses on the discovery of predictive imaging biomarkers, aiming to leverage pre-treatment images to unveil new causal relationships. Previous approaches relied on labor-intensive handcrafted or manually derived features, which may introduce biases. In response, we present a new task of discovering predictive imaging biomarkers directly from the pre-treatment images to learn relevant image features. We propose an evaluation protocol for this task to assess a model's ability to identify predictive imaging biomarkers and differentiate them from prognostic ones. It employs statistical testing and a comprehensive analysis of image feature attribution. We explore the suitability of deep learning models originally designed for estimating the conditional average treatment effect (CATE) for this task, which previously have been primarily assessed for the precision of CATE estimation, overlooking the evaluation of imaging biomarker discovery. Our proof-of-concept analysis demonstrates promising results in discovering and validating predictive imaging biomarkers from synthetic outcomes and real-world image datasets.
Compressing Lengthy Context With UltraGist
Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou
May 28 2024 cs.CL arXiv:2405.16635v2

@misc{2405.16635, author = {Peitian Zhang and Zheng Liu and Shitao Xiao and Ninglu Shao and Qiwei Ye and Zhicheng Dou}, title = {{C}ompressing {L}engthy {C}ontext {W}ith {U}ltra{G}ist}, year = {2024}, eprint = {2405.16635}, note = {arXiv:2405.16635v2} }
Compressing lengthy context is a critical but technically challenging problem. In this paper, we propose a new method called UltraGist, which is distinguished for its high-quality compression of lengthy context due to the innovative design of the compression and learning algorithm. UltraGist brings forth the following important benefits. Firstly, it notably contributes to the flexibility of compression, as it can be effectively learned to support a broad range of context lengths and compression ratios. Secondly, it helps to produce fine-grained compression for the lengthy context, where each small segment of the context is progressively processed on top of a tailored cross-attention mechanism. Thirdly, it makes the training process sample-efficient and thus maximizes the use of training data. Finally, it facilitates the efficient running of compression for dynamic context, as the compression result can be progressively generated and hence incrementally updated. UltraGist is evaluated on a wide variety of tasks associated with lengthy context, such as document QA and summarization, few-shot learning, multi-session conversation, etc. Whilst the existing methods fail to handle these challenging scenarios, our approach is able to preserve a near-lossless compression performance throughout all the evaluations. Our data, model, and code have been released at https://github.com/namespace-Pt/UltraGist.
Prototype2Code: End-to-end Front-end Code Generation from UI Design Prototypes
Shuhong Xiao, Yunnong Chen, Jiazhi Li, Liuqing Chen, Lingyun Sun, Tingting Zhou
May 09 2024 cs.SE arXiv:2405.04975v1

@misc{2405.04975, author = {Shuhong Xiao and Yunnong Chen and Jiazhi Li and Liuqing Chen and Lingyun Sun and Tingting Zhou}, title = {{P}rototype2{C}ode: {E}nd-to-end {F}ront-end {C}ode {G}eneration from {UI} {D}esign {P}rototypes}, year = {2024}, eprint = {2405.04975}, note = {arXiv:2405.04975v1} }
UI-to-code technology has streamlined the front-end development process, reducing repetitive tasks for engineers. Prior research mainly uses design prototypes as inputs, with the effectiveness of the generated code heavily dependent on these prototypes' quality, leading to compromised robustness. Moreover, these approaches also exhibit shortcomings in code quality, including issues such as disorganized UI structures and the inability to support responsive layouts. To address these challenges, we introduce Prototype2Code, which achieves end-to-end front-end code generation with business demands. For Prototype2Code, we incorporate design linting into the workflow, addressing the detection of fragmented elements and perceptual groups, enhancing the robustness of the generated outcomes. By optimizing the hierarchical structure and intelligently recognizing UI element types, Prototype2Code generates code that is more readable and structurally clearer. To meet responsive design requirements, Prototype2Code primarily supports the flexbox layout model, ensuring code compatibility across various device sizes. To validate the efficacy, we compare Prototype2Code with the commercial code generation platform CodeFun and Screenshot-to-code based on GPT-4 with vision. Employing structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and mean squared error (MSE) for visual similarity assessment, Prototype2Code's rendered UI effects align most closely with the design prototypes, exhibiting minimal errors. We also conduct a user study with five experienced front-end engineers, inviting them to review and revise code generated by the three methods. As a result, Prototype2Code surpasses other methods in readability, usability, and maintainability, better meeting the business needs of industrial development.
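Two of the visual similarity metrics named in the abstract above, MSE and PSNR, have standard textbook definitions that can be sketched in a few lines. This is an illustration only, not the authors' evaluation code: the function names, the nested-list image representation, and the 8-bit `max_value` default are assumptions made for the example, and SSIM is omitted because it needs windowed statistics.

```python
import math


def mse(a, b):
    """Mean squared error between two equally sized grayscale images.

    Images are given as nested lists of pixel intensities (an assumed format).
    """
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    if len(flat_a) != len(flat_b) or not flat_a:
        raise ValueError("images must be non-empty and have the same size")
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)


def psnr(a, b, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(a, b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


# Hypothetical 2x2 "design prototype" and "rendered UI" patches.
design = [[0, 255], [128, 64]]
rendered = [[0, 250], [130, 64]]
error = mse(design, rendered)
quality_db = psnr(design, rendered)
```

A lower MSE and a higher PSNR both indicate that the rendered output is closer to the reference image.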
An Artificial Intelligence Approach for Interpreting Creative Combinational Designs
Liuqing Chen, Shuhong Xiao, Yunnong Chen, Linyun Sun, Peter R.N. Childs, Ji Han
May 09 2024 cs.CE arXiv:2405.04985v1

@misc{2405.04985, author = {Liuqing Chen and Shuhong Xiao and Yunnong Chen and Linyun Sun and Peter R.N.~Childs and Ji Han}, title = {{A}n {A}rtificial {I}ntelligence {A}pproach for {I}nterpreting {C}reative {C}ombinational {D}esigns}, year = {2024}, eprint = {2405.04985}, note = {arXiv:2405.04985v1} }
Combinational creativity, a form of creativity involving the blending of familiar ideas, is pivotal in design innovation. While most research focuses on how combinational creativity in design is achieved through blending elements, this study focuses on the computational interpretation, specifically identifying the 'base' and 'additive' components that constitute a creative design. To achieve this goal, the authors propose a heuristic algorithm integrating computer vision and natural language processing technologies, and implement multiple approaches based on both discriminative and generative artificial intelligence architectures. A comprehensive evaluation was conducted on a dataset created for studying combinational creativity. Among the implementations of the proposed algorithm, the most effective approach demonstrated a high accuracy in interpretation, achieving 87.5% for identifying 'base' and 80% for 'additive'. We conduct a modular analysis and an ablation experiment to assess the performance of each part in our implementations. Additionally, the study includes an analysis of error cases and bottleneck issues, providing critical insights into the limitations and challenges inherent in the computational interpretation of creative designs.
Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection
Zhanwei Zhang, Minghao Chen, Shuai Xiao, Liang Peng, Hengjia Li, Binbin Lin, Ping Li, Wenxiao Wang, Boxi Wu, Deng Cai
May 01 2024 cs.CV cs.AI arXiv:2404.19384v1

@misc{2404.19384, author = {Zhanwei Zhang and Minghao Chen and Shuai Xiao and Liang Peng and Hengjia Li and Binbin Lin and Ping Li and Wenxiao Wang and Boxi Wu and Deng Cai}, title = {{P}seudo {L}abel {R}efinery for {U}nsupervised {D}omain {A}daptation on {C}ross-dataset 3{D} {O}bject {D}etection}, year = {2024}, eprint = {2404.19384}, note = {arXiv:2404.19384v1} }
Recent self-training techniques have shown notable improvements in unsupervised domain adaptation for 3D object detection (3D UDA). These techniques typically select pseudo labels, i.e., 3D boxes, to supervise models for the target domain. However, this selection process inevitably introduces unreliable 3D boxes, in which 3D points cannot be definitively assigned as foreground or background. Previous techniques mitigate this by reweighting these boxes as pseudo labels, but these boxes can still poison the training process. To resolve this problem, in this paper, we propose a novel pseudo label refinery framework. Specifically, in the selection process, to improve the reliability of pseudo boxes, we propose a complementary augmentation strategy. This strategy involves either removing all points within an unreliable box or replacing it with a high-confidence box. Moreover, the point numbers of instances in high-beam datasets are considerably higher than those in low-beam datasets, also degrading the quality of pseudo labels during the training process. We alleviate this issue by generating additional proposals and aligning RoI features across different domains. Experimental results demonstrate that our method effectively enhances the quality of pseudo labels and consistently surpasses the state-of-the-art methods on six autonomous driving benchmarks. Code will be available at https://github.com/Zhanwei-Z/PERE.
Extending Llama-3's Context Ten-Fold Overnight
Peitian Zhang, Ninglu Shao, Zheng Liu, Shitao Xiao, Hongjin Qian, Qiwei Ye, Zhicheng Dou
May 01 2024 cs.CL arXiv:2404.19553v1

@misc{2404.19553, author = {Peitian Zhang and Ninglu Shao and Zheng Liu and Shitao Xiao and Hongjin Qian and Qiwei Ye and Zhicheng Dou}, title = {{E}xtending {L}lama-3's {C}ontext {T}en-{F}old {O}vernight}, year = {2024}, eprint = {2404.19553}, note = {arXiv:2404.19553v1} }
We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning. The entire training cycle is super efficient, which takes 8 hours on one 8xA800 (80G) GPU machine. The resulting model exhibits superior performance across a broad range of evaluation tasks, such as NIHS, topic retrieval, and long-context language understanding; meanwhile, it also well preserves the original capability over short contexts. The dramatic context extension is mainly attributed to merely 3.5K synthetic training samples generated by GPT-4, which indicates the LLMs' inherent (yet largely underestimated) potential to extend their original context length. In fact, the context length could be extended far beyond 80K with more computation resources. Therefore, the team will publicly release the entire resources (including data, model, data generation pipeline, training code) so as to facilitate the future research from the community: https://github.com/FlagOpen/FlagEmbedding.
Generative AI for Visualization: State of the Art and Future Directions
Yilin Ye, Jianing Hao, Yihan Hou, Zhan Wang, Shishi Xiao, Yuyu Luo, Wei Zeng
Apr 30 2024 cs.LG cs.AI cs.HC arXiv:2404.18144v1

@misc{2404.18144, author = {Yilin Ye and Jianing Hao and Yihan Hou and Zhan Wang and Shishi Xiao and Yuyu Luo and Wei Zeng}, title = {{G}enerative {AI} for {V}isualization: {S}tate of the {A}rt and {F}uture {D}irections}, year = {2024}, eprint = {2404.18144}, note = {arXiv:2404.18144v1} }
Generative AI (GenAI) has witnessed remarkable progress in recent years and demonstrated impressive performance in various generation tasks in different domains such as computer vision and computational design. Many researchers have attempted to integrate GenAI into visualization frameworks, leveraging the superior generative capacity for different operations. Concurrently, recent major breakthroughs in GenAI like diffusion models and large language models have also drastically increased the potential of GenAI4VIS. From a technical perspective, this paper looks back on previous visualization studies leveraging GenAI and discusses the challenges and opportunities for future research. Specifically, we cover the applications of different types of GenAI methods including sequence, tabular, spatial and graph generation techniques for different tasks of visualization which we summarize into four major stages: data enhancement, visual mapping generation, stylization and interaction. For each specific visualization sub-task, we illustrate the typical data and concrete GenAI algorithms, aiming to provide in-depth understanding of the state-of-the-art GenAI4VIS techniques and their limitations. Furthermore, based on the survey, we discuss three major aspects of challenges and research opportunities including evaluation, dataset, and the gap between end-to-end GenAI and generative algorithms. By summarizing different generation algorithms, their current applications and limitations, this paper endeavors to provide useful insights for future GenAI4VIS research.
Tunnel Try-on: Excavating Spatial-temporal Tunnels for High-quality Virtual Try-on in Videos
Zhengze Xu, Mengting Chen, Zhao Wang, Linyu Xing, Zhonghua Zhai, Nong Sang, Jinsong Lan, Shuai Xiao, Changxin Gao
Apr 29 2024 cs.CV arXiv:2404.17571v1

@misc{2404.17571, author = {Zhengze Xu and Mengting Chen and Zhao Wang and Linyu Xing and Zhonghua Zhai and Nong Sang and Jinsong Lan and Shuai Xiao and Changxin Gao}, title = {{T}unnel {T}ry-on: {E}xcavating {S}patial-temporal {T}unnels for {H}igh-quality {V}irtual {T}ry-on in {V}ideos}, year = {2024}, eprint = {2404.17571}, note = {arXiv:2404.17571v1} }
Video try-on is a challenging task and has not been well tackled in previous works. The main obstacle lies in preserving the details of the clothing and modeling the coherent motions simultaneously. Faced with those difficulties, we address video try-on by proposing a diffusion-based framework named "Tunnel Try-on." The core idea is excavating a "focus tunnel" in the input video that gives close-up shots around the clothing regions. We zoom in on the region in the tunnel to better preserve the fine details of the clothing. To generate coherent motions, we first leverage the Kalman filter to construct smooth crops in the focus tunnel and inject the position embedding of the tunnel into attention layers to improve the continuity of the generated videos. In addition, we develop an environment encoder to extract the context information outside the tunnels as supplementary cues. Equipped with these techniques, Tunnel Try-on keeps the fine details of the clothing and synthesizes stable and smooth videos. Demonstrating significant advancements, Tunnel Try-on could be regarded as the first attempt toward the commercial-level application of virtual try-on in videos.
LightOctree: Lightweight 3D Spatially-Coherent Indoor Lighting Estimation
Xuecan Wang, Shibang Xiao, Xiaohui Liang
Apr 08 2024 cs.CV arXiv:2404.03925v1

@misc{2404.03925, author = {Xuecan Wang and Shibang Xiao and Xiaohui Liang}, title = {{L}ight{O}ctree: {L}ightweight 3{D} {S}patially-{C}oherent {I}ndoor {L}ighting {E}stimation}, year = {2024}, eprint = {2404.03925}, note = {arXiv:2404.03925v1} }
We present a lightweight solution for estimating spatially-coherent indoor lighting from a single RGB image. Previous methods for estimating illumination using volumetric representations have overlooked the sparse distribution of light sources in space, necessitating substantial memory and computational resources for achieving high-quality results. We introduce a unified, voxel octree-based illumination estimation framework to produce 3D spatially-coherent lighting. Additionally, a differentiable voxel octree cone tracing rendering layer is proposed to eliminate regular volumetric representation throughout the entire process and ensure the retention of features across different frequency domains. This reduction significantly decreases spatial usage and required floating-point operations without substantially compromising precision. Experimental results demonstrate that our approach achieves high-quality coherent estimation with minimal cost compared to previous methods.
Movable Antenna-Aided Hybrid Beamforming for Multi-User Communications
Yichi Zhang, Yuchen Zhang, Lipeng Zhu, Sa Xiao, Wanbin Tang, Yonina C. Eldar, Rui Zhang
Apr 02 2024 cs.IT eess.SP math.IT arXiv:2404.00953v1

@misc{2404.00953, author = {Yichi Zhang and Yuchen Zhang and Lipeng Zhu and Sa Xiao and Wanbin Tang and Yonina C.~Eldar and Rui Zhang}, title = {{M}ovable {A}ntenna-{A}ided {H}ybrid {B}eamforming for {M}ulti-{U}ser {C}ommunications}, year = {2024}, eprint = {2404.00953}, note = {arXiv:2404.00953v1} }
In this correspondence, we propose a movable antenna (MA)-aided multi-user hybrid beamforming scheme with a sub-connected structure, where multiple movable sub-arrays can independently change their positions within different local regions. To maximize the system sum rate, we jointly optimize the digital beamformer, analog beamformer, and positions of subarrays, under the constraints of unit modulus, finite movable regions, and power budget. Due to the non-concave/non-convex objective function/constraints, as well as the highly coupled variables, the formulated problem is challenging to solve. By employing fractional programming, we develop an alternating optimization framework to solve the problem via a combination of Lagrange multipliers, penalty method, and gradient descent. Numerical results reveal that the proposed MA-aided hybrid beamforming scheme significantly improves the sum rate compared to its fixed-position antenna (FPA) counterpart. Moreover, with sufficiently large movable regions, the proposed scheme with sub-connected MA arrays even outperforms the fully-connected FPA array.
A Moving Mesh Method for Porous Medium Equation by the Onsager Variational Principle
Si Xiao, Xianmin Xu
Apr 01 2024 math.NA cs.NA math-ph math.MP arXiv:2403.20030v1

@misc{2403.20030, author = {Si Xiao and Xianmin Xu}, title = {{A} {M}oving {M}esh {M}ethod for {P}orous {M}edium {E}quation by the {O}nsager {V}ariational {P}rinciple}, year = {2024}, eprint = {2403.20030}, note = {arXiv:2403.20030v1} }
In this paper, we introduce a new approach to solving the porous medium equation using a moving mesh finite element method that leverages the Onsager variational principle as an approximation tool. Both the continuous and discrete problems are formulated based on the Onsager principle. The energy dissipation structure is maintained in the semi-discrete and fully implicit discrete schemes. We also develop a fully decoupled explicit scheme by which only a few linear equations are solved sequentially in each time step. The numerical schemes exhibit an optimal convergence rate when the initial mesh is appropriately selected to ensure accurate approximation of the initial data. Furthermore, the method naturally captures the waiting time phenomena without requiring any manual intervention.
Cell Variational Information Bottleneck Network
Zhonghua Zhai, Chen Ju, Jinsong Lan, Shuai Xiao
Mar 25 2024 cs.CV arXiv:2403.15082v3

@misc{2403.15082, author = {Zhonghua Zhai and Chen Ju and Jinsong Lan and Shuai Xiao}, title = {{C}ell {V}ariational {I}nformation {B}ottleneck {N}etwork}, year = {2024}, eprint = {2403.15082}, note = {arXiv:2403.15082v3} }
In this work, we propose Cell Variational Information Bottleneck Network (cellVIB), a convolutional neural network using an information bottleneck mechanism, which can be combined with the latest feedforward network architecture in an end-to-end training method. Our Cell Variational Information Bottleneck Network is constructed by stacking VIB cells, which generate feature maps with uncertainty. As layers go deeper, the regularization effect will gradually increase, instead of directly adding excessive regular constraints to the output layer of the model as in Deep VIB. Under each VIB cell, the feedforward process learns an independent mean term and a standard deviation term, and predicts the Gaussian distribution based on them. The feedback process is based on the reparameterization trick for effective training. This work performs an extensive analysis on the MNIST dataset to verify the effectiveness of each VIB cell, and provides an insightful analysis on how the VIB cells affect mutual information. Experiments conducted on CIFAR-10 also prove that our cellVIB is robust against noisy labels during training and against corrupted images during testing. Then, we validate our method on the PACS dataset, whose results show that the VIB cells can significantly improve the generalization performance of the basic model. Finally, in a more complex representation learning task, face recognition, our network structure has also achieved very competitive results.
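The reparameterization trick mentioned in the abstract above can be illustrated with a scalar sketch. It shows only the general idea (a Gaussian sample written as a deterministic function of a mean, a standard deviation, and independent noise), not the cellVIB architecture; the function name and the use of Python's `random` module are assumptions made for the example.

```python
import random


def reparameterized_sample(mean, std, rng):
    """Draw z ~ N(mean, std^2) as z = mean + std * eps with eps ~ N(0, 1).

    The randomness lives entirely in eps, so z is a deterministic function
    of mean and std. In an autodiff framework this is what lets gradients
    reach the two learned terms; plain Python has no gradients, so this
    only demonstrates the sampling identity.
    """
    eps = rng.gauss(0.0, 1.0)  # noise that does not depend on the parameters
    return mean + std * eps


rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [reparameterized_sample(2.0, 0.5, rng) for _ in range(10000)]
empirical_mean = sum(draws) / len(draws)
```

With many draws, `empirical_mean` should sit close to the requested mean of 2.0, and setting `std` to zero returns the mean exactly.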
Embarrassingly Simple Scribble Supervision for 3D Medical Segmentation
Karol Gotkowski, Carsten Lüth, Paul F. Jäger, Sebastian Ziegler, Lars Krämer, Stefan Denner, Shuhan Xiao, Nico Disch, Klaus H. Maier-Hein, Fabian Isensee
Mar 20 2024 cs.CV arXiv:2403.12834v1

@misc{2403.12834, author = {Karol Gotkowski and Carsten Lüth and Paul F.~Jäger and Sebastian Ziegler and Lars Krämer and Stefan Denner and Shuhan Xiao and Nico Disch and Klaus H.~Maier-Hein and Fabian Isensee}, title = {{E}mbarrassingly {S}imple {S}cribble {S}upervision for 3{D} {M}edical {S}egmentation}, year = {2024}, eprint = {2403.12834}, note = {arXiv:2403.12834v1} }
Traditionally, segmentation algorithms require dense annotations for training, demanding significant annotation efforts, particularly within the 3D medical imaging field. Scribble-supervised learning emerges as a possible solution to this challenge, promising a reduction in annotation efforts when creating large-scale datasets. Recently, a plethora of methods for optimized learning from scribbles have been proposed, but have so far failed to position scribble annotation as a beneficial alternative. We relate this shortcoming to two major issues: 1) the complex nature of many methods which deeply ties them to the underlying segmentation model, thus preventing a migration to more powerful state-of-the-art models as the field progresses and 2) the lack of a systematic evaluation to validate consistent performance across the broader medical domain, resulting in a lack of trust when applying these methods to new segmentation problems. To address these issues, we propose a comprehensive scribble supervision benchmark consisting of seven datasets covering a diverse set of anatomies and pathologies imaged with varying modalities. We furthermore propose the systematic use of partial losses, i.e. losses that are only computed on annotated voxels. Contrary to most existing methods, these losses can be seamlessly integrated into state-of-the-art segmentation methods, enabling them to learn from scribble annotations while preserving their original loss formulations. Our evaluation using nnU-Net reveals that while most existing methods suffer from a lack of generalization, the proposed approach consistently delivers state-of-the-art performance. Thanks to its simplicity, our approach presents an embarrassingly simple yet effective solution to the challenges of scribble supervision. Source code as well as our extensive scribble benchmarking suite will be made publicly available upon publication.
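The abstract above defines partial losses as losses computed only on annotated voxels. A minimal sketch of that idea with cross-entropy follows; it is not the authors' implementation. The `IGNORE` marker, the function name, and the flat list-of-voxels layout are hypothetical choices made for the example.

```python
import math

IGNORE = -1  # hypothetical label for voxels that carry no scribble annotation


def partial_cross_entropy(probs, labels, ignore_label=IGNORE):
    """Mean cross-entropy over annotated voxels only.

    probs:  per-voxel lists of class probabilities (each list sums to 1).
    labels: per-voxel integer class labels, or ignore_label if unannotated.
    """
    total, count = 0.0, 0
    for p, y in zip(probs, labels):
        if y == ignore_label:
            continue  # unannotated voxel: contributes nothing to the loss
        total += -math.log(max(p[y], 1e-12))  # clamp to avoid log(0)
        count += 1
    return total / count if count else 0.0


# Three voxels, two classes; the middle voxel has no scribble on it.
probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
labels = [0, IGNORE, 1]
loss = partial_cross_entropy(probs, labels)
```

Because unannotated voxels are simply skipped, the underlying loss formula is unchanged, which matches the abstract's point that such losses can be added to an existing segmentation method without altering its original loss formulation.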
Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment
Mengting Chen, Xi Chen, Zhonghua Zhai, Chen Ju, Xuewen Hong, Jinsong Lan, Shuai Xiao
Mar 20 2024 cs.CV arXiv:2403.12965v1

@misc{2403.12965, author = {Mengting Chen and Xi Chen and Zhonghua Zhai and Chen Ju and Xuewen Hong and Jinsong Lan and Shuai Xiao}, title = {{W}ear-{A}ny-{W}ay: {M}anipulable {V}irtual {T}ry-on via {S}parse {C}orrespondence {A}lignment}, year = {2024}, eprint = {2403.12965}, note = {arXiv:2403.12965v1} }
This paper introduces a novel framework for virtual try-on, termed Wear-Any-Way. Different from previous methods, Wear-Any-Way is a customizable solution. Besides generating high-fidelity results, our method lets users precisely manipulate the wearing style. To achieve this goal, we first construct a strong pipeline for standard virtual try-on, supporting single/multiple garment try-on and model-to-model settings in complicated scenarios. To make it manipulable, we propose sparse correspondence alignment, which involves point-based control to guide the generation for specific locations. With this design, Wear-Any-Way achieves state-of-the-art performance for the standard setting and provides a novel interaction form for customizing the wearing style. For instance, it lets users drag the sleeve to roll it up, drag the coat to open it, and use clicks to control the style of tuck, etc. Wear-Any-Way enables more liberated and flexible expressions of the attires, holding profound implications in the fashion industry.
Leveraging Foundation Models for Content-Based Medical Image Retrieval in Radiology
Stefan Denner, David Zimmerer, Dimitrios Bounias, Markus Bujotzek, Shuhan Xiao, Lisa Kausch, Philipp Schader, Tobias Penzkofer, Paul F. Jäger, Klaus Maier-Hein
Mar 12 2024 cs.CV cs.IR arXiv:2403.06567v3

@misc{2403.06567, author = {Stefan Denner and David Zimmerer and Dimitrios Bounias and Markus Bujotzek and Shuhan Xiao and Lisa Kausch and Philipp Schader and Tobias Penzkofer and Paul F.~Jäger and Klaus Maier-Hein}, title = {{L}everaging {F}oundation {M}odels for {C}ontent-{B}ased {M}edical {I}mage {R}etrieval in {R}adiology}, year = {2024}, eprint = {2403.06567}, note = {arXiv:2403.06567v3} }
Content-based image retrieval (CBIR) has the potential to significantly improve diagnostic aid and medical research in radiology. Current CBIR systems face limitations due to their specialization to certain pathologies, limiting their utility. In response, we propose using vision foundation models as powerful and versatile off-the-shelf feature extractors for content-based medical image retrieval. By benchmarking these models on a comprehensive dataset of 1.6 million 2D radiological images spanning four modalities and 161 pathologies, we identify weakly-supervised models as superior, achieving a P@1 of up to 0.594. This performance not only competes with a specialized model but does so without the need for fine-tuning. Our analysis further explores the challenges in retrieving pathological versus anatomical structures, indicating that accurate retrieval of pathological features presents greater difficulty. Despite these challenges, our research underscores the vast potential of foundation models for CBIR in radiology, proposing a shift towards versatile, general-purpose medical image retrieval systems that do not require specific tuning.
UI Semantic Group Detection: Grouping UI Elements with Similar Semantics in Mobile Graphical User Interface
Shuhong Xiao, Yunnong Chen, Yaxuan Song, Liuqing Chen, Lingyun Sun, Yankun Zhen, Yanfang Chang
Mar 11 2024 cs.SE arXiv:2403.04984v1

@misc{2403.04984, author = {Shuhong Xiao and Yunnong Chen and Yaxuan Song and Liuqing Chen and Lingyun Sun and Yankun Zhen and Yanfang Chang}, title = {{UI} {S}emantic {G}roup {D}etection: {G}rouping {UI} {E}lements with {S}imilar {S}emantics in {M}obile {G}raphical {U}ser {I}nterface}, year = {2024}, eprint = {2403.04984}, note = {arXiv:2403.04984v1} }
Texts, widgets, and images on a UI page do not work separately. Instead, they are partitioned into groups to achieve certain interaction functions or visual information. Existing studies on UI element grouping mainly focus on a specific single UI-related software engineering task, and their groups vary in appearance and function. In this case, we propose our semantic component groups that pack adjacent text and non-text elements with similar semantics. In contrast to those task-oriented grouping methods, our semantic component group can be adopted for multiple UI-related software tasks, such as retrieving UI perceptual groups, improving code structure for automatic UI-to-code generation, and generating accessibility data for screen readers. To recognize semantic component groups on a UI page, we propose a robust, deep learning-based vision detector, UISCGD, which extends the SOTA deformable-DETR by incorporating UI element color representation and a learned prior on group distribution. The model is trained on our UI screenshots dataset of 1988 mobile GUIs from more than 200 apps on both iOS and Android platforms. The evaluation shows that our UISCGD performs 6.1% better than the best baseline algorithm and 5.4% better than deformable-DETR, on which it is based.
Learning solution operators of PDEs defined on varying domains via MIONet
Shanshan Xiao, Pengzhan Jin, Yifa Tang
Feb 26 2024 cs.LG cs.NA math.NA arXiv:2402.15097v2

@misc{2402.15097, author = {Shanshan Xiao and Pengzhan Jin and Yifa Tang}, title = {{L}earning solution operators of {PDE}s defined on varying domains via {MION}et}, year = {2024}, eprint = {2402.15097}, note = {arXiv:2402.15097v2} }
In this work, we propose a method to learn the solution operators of PDEs defined on varying domains via MIONet, and theoretically justify this method. We first extend the approximation theory of MIONet to further deal with metric spaces, establishing that MIONet can approximate mappings with multiple inputs in metric spaces. Subsequently, we construct a set consisting of some appropriate regions and provide a metric on this set, thus making it a metric space that satisfies the approximation condition of MIONet. Building upon this theoretical foundation, we are able to learn the solution mapping of a PDE with all the parameters varying, including the parameters of the differential operator, the right-hand side term, the boundary condition, as well as the domain. Without loss of generality, we perform experiments on 2-d Poisson equations, where the domains and the right-hand side terms are varying. The results provide insights into the performance of this method across convex polygons, polar regions with smooth boundary, and predictions for different levels of discretization on one task. We also show the additional result of the fully-parameterized case in the appendix for interested readers. Finally, we point out that this is a meshless method and hence can be flexibly used as a general solver for a type of PDE.
BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models
Kun Luo, Zheng Liu, Shitao Xiao, Kang Liu
Feb 20 2024 cs.CL arXiv:2402.11573v1

@misc{2402.11573, author = {Kun Luo and Zheng Liu and Shitao Xiao and Kang Liu}, title = {{BGE} {L}andmark {E}mbedding: {A} {C}hunking-{F}ree {E}mbedding {M}ethod {F}or {R}etrieval {A}ugmented {L}ong-{C}ontext {L}arge {L}anguage {M}odels}, year = {2024}, eprint = {2402.11573}, note = {arXiv:2402.11573v1} }
Large language models (LLMs) call for extension of context to handle many critical applications. However, the existing approaches are prone to expensive costs and inferior quality of context extension. In this work, we propose Extensible Embedding, which realizes high-quality extension of LLM's context with strong flexibility and cost-effectiveness. Extensible embedding stands as an enhancement of typical token embedding, which represents the information for an extensible scope of context instead of a single token. By leveraging such compact input units of higher information density, the LLM can access a vast scope of context even with a small context window. Extensible embedding is systematically optimized in architecture and training method, which leads to multiple advantages. 1) High flexibility of context extension, which flexibly supports ad-hoc extension of diverse context lengths. 2) Strong sample efficiency of training, which enables the embedding model to be learned in a cost-effective way. 3) Superior compatibility with the existing LLMs, where the extensible embedding can be seamlessly introduced as a plug-in component. Comprehensive evaluations on long-context language modeling and understanding tasks verify extensible embedding as an effective, efficient, flexible, and compatible method to extend the LLM's context.
Extensible Embedding: A Flexible Multipler For LLM's Context Length
Ninglu Shao, Shitao Xiao, Zheng Liu, Peitian Zhang
Feb 20 2024 cs.CL arXiv:2402.11577v1

@misc{2402.11577, author = {Ninglu Shao and Shitao Xiao and Zheng Liu and Peitian Zhang}, title = {{E}xtensible {E}mbedding: {A} {F}lexible {M}ultipler {F}or {LLM}'s {C}ontext {L}ength}, year = {2024}, eprint = {2402.11577}, note = {arXiv:2402.11577v1} }
Large language models (LLMs) call for extension of context to handle many critical applications. However, the existing approaches are prone to expensive costs and inferior quality of context extension. In this work, we propose Extensible Embedding, which realizes high-quality extension of LLM's context with strong flexibility and cost-effectiveness. Extensible embedding stands as an enhancement of typical token embedding, which represents the information for an extensible scope of context instead of a single token. By leveraging such compact input units of higher information density, the LLM can access a vast scope of context even with a small context window. Extensible embedding is systematically optimized in architecture and training method, which leads to multiple advantages. 1) High flexibility of context extension, which flexibly supports ad-hoc extension of diverse context lengths. 2) Strong sample efficiency of training, which enables the embedding model to be learned in a cost-effective way. 3) Superior compatibility with the existing LLMs, where the extensible embedding can be seamlessly introduced as a plug-in component. Comprehensive evaluations on long-context language modeling and understanding tasks verify extensible embedding as an effective, efficient, flexible, and compatible method to extend the LLM's context.
A two-stage solution to quantum process tomography: error analysis and optimal design
Shuixin Xiao, Yuanlong Wang, Jun Zhang, Daoyi Dong, Gary J. Mooney, Ian R. Petersen, Hidehiro Yonezawa
Feb 15 2024 quant-ph cs.SY eess.SY arXiv:2402.08952v1

@misc{2402.08952, author = {Shuixin Xiao and Yuanlong Wang and Jun Zhang and Daoyi Dong and Gary J.~Mooney and Ian R.~Petersen and Hidehiro Yonezawa}, title = {{A} two-stage solution to quantum process tomography: error analysis and optimal design}, year = {2024}, eprint = {2402.08952}, note = {arXiv:2402.08952v1} }
Quantum process tomography is a critical task for characterizing the dynamics of quantum systems and achieving precise quantum control. In this paper, we propose a two-stage solution for both trace-preserving and non-trace-preserving quantum process tomography. Utilizing a tensor structure, our algorithm exhibits a computational complexity of $O(MLd^2)$, where $d$ is the dimension of the quantum system and $M$, $L$ represent the numbers of different input states and measurement operators, respectively. We establish an analytical error upper bound and then design the optimal input states and the optimal measurement operators, which are both based on minimizing the error upper bound and maximizing the robustness characterized by the condition number. Numerical examples and testing on IBM quantum devices are presented to demonstrate the performance and efficiency of our algorithm.
Intelligent Agricultural Management Considering N$_2$O Emission and Climate Variability with Uncertainties
Zhaoan Wang, Shaoping Xiao, Jun Wang, Ashwin Parab, Shivam Patel
Feb 15 2024 cs.LG cs.AI cs.CY arXiv:2402.08832v1

@misc{2402.08832, author = {Zhaoan Wang and Shaoping Xiao and Jun Wang and Ashwin Parab and Shivam Patel}, title = {{I}ntelligent {A}gricultural {M}anagement {C}onsidering {N}$_2${O} {E}mission and {C}limate {V}ariability with {U}ncertainties}, year = {2024}, eprint = {2402.08832}, note = {arXiv:2402.08832v1} }
This study examines how artificial intelligence (AI), especially Reinforcement Learning (RL), can be used in farming to boost crop yields, fine-tune nitrogen use and watering, and reduce nitrate runoff and greenhouse gases, focusing on Nitrous Oxide (N$_2$O) emissions from soil. Facing climate change and limited agricultural knowledge, we use Partially Observable Markov Decision Processes (POMDPs) with a crop simulator to model AI agents' interactions with farming environments. We apply deep Q-learning with Recurrent Neural Network (RNN)-based Q networks for training agents on optimal actions. Also, we develop Machine Learning (ML) models to predict N$_2$O emissions, integrating these predictions into the simulator. Our research tackles uncertainties in N$_2$O emission estimates with a probabilistic ML approach and climate variability through a stochastic weather model, offering a range of emission outcomes to improve forecast reliability and decision-making. By incorporating climate change effects, we enhance agents' climate adaptability, aiming for resilient agricultural practices. Results show these agents can align crop productivity with environmental concerns by penalizing N$_2$O emissions, adapting effectively to climate shifts like warmer temperatures and less rain. This strategy improves farm management under climate change, highlighting AI's role in sustainable agriculture.
ChatScratch: An AI-Augmented System Toward Autonomous Visual Programming Learning for Children Aged 6-12
Liuqing Chen, Shuhong Xiao, Yunnong Chen, Ruoyu Wu, Yaxuan Song, Lingyun Sun
Feb 08 2024 cs.HC cs.AI cs.PL arXiv:2402.04975v1

@misc{2402.04975, author = {Liuqing Chen and Shuhong Xiao and Yunnong Chen and Ruoyu Wu and Yaxuan Song and Lingyun Sun}, title = {{C}hat{S}cratch: {A}n {AI}-{A}ugmented {S}ystem {T}oward {A}utonomous {V}isual {P}rogramming {L}earning for {C}hildren {A}ged 6-12}, year = {2024}, eprint = {2402.04975}, doi = {10.1145/3613904.3642229}, note = {arXiv:2402.04975v1} }
As Computational Thinking (CT) continues to permeate younger age groups in K-12 education, established CT platforms such as Scratch face challenges in catering to these younger learners, particularly those in the elementary school (ages 6-12). Through formative investigation with Scratch experts, we uncover three key obstacles to children's autonomous Scratch learning: artist's block in project planning, bounded creativity in asset creation, and inadequate coding guidance during implementation. To address these barriers, we introduce ChatScratch, an AI-augmented system to facilitate autonomous programming learning for young children. ChatScratch employs structured interactive storyboards and visual cues to overcome artist's block, integrates digital drawing and advanced image generation technologies to elevate creativity, and leverages Scratch-specialized Large Language Models (LLMs) for professional coding guidance. Our study shows that, compared to Scratch, ChatScratch efficiently fosters autonomous programming learning, and contributes to the creation of high-quality, personally meaningful Scratch projects for children.
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu
Feb 06 2024 cs.CL cs.AI cs.LG arXiv:2402.03216v4

@misc{2402.03216, author = {Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu}, title = {{BGE} {M}3-{E}mbedding: {M}ulti-{L}ingual, {M}ulti-{F}unctionality, {M}ulti-{G}ranularity {T}ext {E}mbeddings {T}hrough {S}elf-{K}nowledge {D}istillation}, year = {2024}, eprint = {2402.03216}, note = {arXiv:2402.03216v4} }
In this paper, we present a new embedding model, called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to new state-of-the-art performances on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. The effective training of M3-Embedding involves the following technical contributions. We propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, enabling a large batch size and high training throughput to ensure the discriminativeness of embeddings. To the best of our knowledge, M3-Embedding is the first embedding model which realizes such a strong versatility. The model and code will be publicly available at https://github.com/FlagOpen/FlagEmbedding.
An objective comparison of methods for augmented reality in laparoscopic liver resection by preoperative-to-intraoperative image fusion
Sharib Ali, Yamid Espinel, Yueming Jin, Peng Liu, Bianca Güttner, Xukun Zhang, Lihua Zhang, Tom Dowrick, Matthew J. Clarkson, Shiting Xiao, Yifan Wu, Yijun Yang, Lei Zhu, Dai Sun, Lan Li, Micha Pfeiffer, Shahid Farid, Lena Maier-Hein, Emmanuel Buc, Adrien Bartoli
Jan 30 2024 cs.CV cs.AI cs.GR cs.LG arXiv:2401.15753v2

@misc{2401.15753, author = {Sharib Ali and Yamid Espinel and Yueming Jin and Peng Liu and Bianca Güttner and Xukun Zhang and Lihua Zhang and Tom Dowrick and Matthew J.~Clarkson and Shiting Xiao and Yifan Wu and Yijun Yang and Lei Zhu and Dai Sun and Lan Li and Micha Pfeiffer and Shahid Farid and Lena Maier-Hein and Emmanuel Buc and Adrien Bartoli}, title = {{A}n objective comparison of methods for augmented reality in laparoscopic liver resection by preoperative-to-intraoperative image fusion}, year = {2024}, eprint = {2401.15753}, note = {arXiv:2401.15753v2} }
Augmented reality for laparoscopic liver resection is a visualisation mode that allows a surgeon to localise tumours and vessels embedded within the liver by projecting them on top of a laparoscopic image. Preoperative 3D models extracted from CT or MRI data are registered to the intraoperative laparoscopic images during this process. In terms of 3D-2D fusion, most of the algorithms make use of anatomical landmarks to guide registration. These landmarks include the liver's inferior ridge, the falciform ligament, and the occluding contours. They are usually marked by hand in both the laparoscopic image and the 3D model, which is time-consuming and may contain errors if done by a non-experienced user. Therefore, there is a need to automate this process so that augmented reality can be used effectively in the operating room. We present the Preoperative-to-Intraoperative Laparoscopic Fusion Challenge (P2ILF), held during the Medical Imaging and Computer Assisted Interventions (MICCAI 2022) conference, which investigates the possibilities of detecting these landmarks automatically and using them in registration. The challenge was divided into two tasks: 1) A 2D and 3D landmark detection task and 2) a 3D-2D registration task. The teams were provided with training data consisting of 167 laparoscopic images and 9 preoperative 3D models from 9 patients, with the corresponding 2D and 3D landmark annotations. A total of 6 teams from 4 countries participated, whose proposed methods were evaluated on 16 images and two preoperative 3D models from two patients. All the teams proposed deep learning-based methods for the 2D and 3D landmark segmentation tasks and differentiable rendering-based methods for the registration task. Based on the experimental outcomes, we propose three key hypotheses that determine current limitations and future directions for research in this domain.
TypeDance: Creating Semantic Typographic Logos from Image through Personalized Generation
Shishi Xiao, Liangwei Wang, Xiaojuan Ma, Wei Zeng
Jan 24 2024 cs.AI arXiv:2401.11094v1

@misc{2401.11094, author = {Shishi Xiao and Liangwei Wang and Xiaojuan Ma and Wei Zeng}, title = {{T}ype{D}ance: {C}reating {S}emantic {T}ypographic {L}ogos from {I}mage through {P}ersonalized {G}eneration}, year = {2024}, eprint = {2401.11094}, doi = {10.1145/3613904.3642185}, note = {arXiv:2401.11094v1} }
Semantic typographic logos harmoniously blend typeface and imagery to represent semantic concepts while maintaining legibility. Conventional methods using spatial composition and shape substitution are hindered by the conflicting requirement for achieving seamless spatial fusion between geometrically dissimilar typefaces and semantics. While recent advances made AI generation of semantic typography possible, the end-to-end approaches exclude designer involvement and disregard personalized design. This paper presents TypeDance, an AI-assisted tool incorporating design rationales with the generative model for personalized semantic typographic logo design. It leverages combinable design priors extracted from uploaded image exemplars and supports type-imagery mapping at various structural granularities, achieving diverse aesthetic designs with flexible control. Additionally, we instantiate a comprehensive design workflow in TypeDance, including ideation, selection, generation, evaluation, and iteration. A two-task user evaluation, including imitation and creation, confirmed the usability of TypeDance in design across different usage scenarios.
Flexibly Scaling Large Language Models Contexts Through Extensible Tokenization
Ninglu Shao, Shitao Xiao, Zheng Liu, Peitian Zhang
Jan 17 2024 cs.CL arXiv:2401.07793v1

@misc{2401.07793, author = {Ninglu Shao and Shitao Xiao and Zheng Liu and Peitian Zhang}, title = {{F}lexibly {S}caling {L}arge {L}anguage {M}odels {C}ontexts {T}hrough {E}xtensible {T}okenization}, year = {2024}, eprint = {2401.07793}, note = {arXiv:2401.07793v1} }
Large language models (LLMs) need sufficient context to handle many critical applications, such as retrieval augmented generation and few-shot learning. However, due to the constrained window size, LLMs can only access the information within a limited context. Although the size of the context window can be extended by fine-tuning, this results in a substantial cost in both the training and inference stages. In this paper, we present Extensible Tokenization as an alternative method which realizes the flexible scaling of LLMs' context. Extensible Tokenization stands as a middleware between the tokenized context and the LLM, which transforms the raw token embeddings into the extensible embeddings. Such embeddings provide a more compact representation for the long context, on top of which the LLM is able to perceive more information with the same context window. Extensible Tokenization is also characterized by its flexibility: the scaling factor can be flexibly determined within a feasible scope, leading to the extension of an arbitrary context length at inference time. Besides, Extensible Tokenization is introduced as a drop-in component, which can be seamlessly plugged into not only the LLM itself but also its fine-tuned derivatives, bringing in the extended contextual information while fully preserving the LLM's existing capabilities. We perform comprehensive experiments on long-context language modeling and understanding tasks, which verify Extensible Tokenization as an effective, efficient, flexible, and compatible method to extend LLM's context. Our model and source code will be made publicly available.
Long Context Compression with Activation Beacon
Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou
Jan 09 2024 cs.CL cs.AI arXiv:2401.03462v3

@misc{2401.03462, author = {Peitian Zhang and Zheng Liu and Shitao Xiao and Ninglu Shao and Qiwei Ye and Zhicheng Dou}, title = {{L}ong {C}ontext {C}ompression with {A}ctivation {B}eacon}, year = {2024}, eprint = {2401.03462}, note = {arXiv:2401.03462v3} }
Long context compression is a critical research problem due to its significance in reducing the high computational and memory costs associated with LLMs. In this paper, we propose Activation Beacon, a plug-in module for transformer-based LLMs that targets effective, efficient, and flexible compression of long contexts. To achieve this, our method introduces the following technical designs. 1) We directly compress the activations (i.e. keys and values at every layer), rather than leveraging soft prompts to relay information (which constitute a major bottleneck to encapsulate the complex information within long contexts). 2) We tailor the compression workflow, where each fine-grained input unit is progressively compressed, enabling high-quality compression and efficient computation during both training and inference. 3) We train the model through compression-based auto-regression, making full use of plain texts and instructional data to optimize the model's compression performance. 4) During training, we randomly sample a compression ratio at each step, teaching the model to support a wide range of compression configurations. Extensive evaluations are conducted on various long-context tasks whose lengths (e.g., 128K) may far exceed the maximum training length (20K), such as document understanding, few-shot learning, and Needle-in-a-Haystack. Whilst existing methods struggle to handle these challenging tasks, Activation Beacon maintains a comparable performance to the uncompressed baseline across various scenarios, achieving a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache. Our data, model, and code have been released at https://github.com/FlagOpen/FlagEmbedding/.
Generalized Lagrangian Neural Networks
Shanshan Xiao, Jiawei Zhang, Yifa Tang
Jan 09 2024 math.DS cs.LG cs.NA math.NA arXiv:2401.03728v2

@misc{2401.03728, author = {Shanshan Xiao and Jiawei Zhang and Yifa Tang}, title = {{G}eneralized {L}agrangian {N}eural {N}etworks}, year = {2024}, eprint = {2401.03728}, note = {arXiv:2401.03728v2} }
Incorporating neural networks for the solution of Ordinary Differential Equations (ODEs) represents a pivotal research direction within computational mathematics. Within neural network architectures, the integration of the intrinsic structure of ODEs offers advantages such as enhanced predictive capabilities and reduced data utilization. Among these structural ODE forms, the Lagrangian representation stands out due to its significant physical underpinnings. Building upon this framework, Bhattoo introduced the concept of Lagrangian Neural Networks (LNNs). In this article, we introduce a groundbreaking extension (Generalized Lagrangian Neural Networks) to Lagrangian Neural Networks (LNNs), innovatively tailoring them for non-conservative systems. By leveraging the foundational importance of the Lagrangian within Lagrange's equations, we formulate the model based on the generalized Lagrange's equation. This modification not only enhances prediction accuracy but also guarantees Lagrangian representation in non-conservative systems. Furthermore, we perform various experiments, encompassing 1-dimensional and 2-dimensional examples, along with an examination of the impact of network parameters, which proved the superiority of Generalized Lagrangian Neural Networks (GLNNs).
Learning-based agricultural management in partially observable environments subject to climate variability
Zhaoan Wang, Shaoping Xiao, Junchao Li, Jun Wang
Jan 03 2024 cs.LG arXiv:2401.01273v1

@misc{2401.01273, author = {Zhaoan Wang and Shaoping Xiao and Junchao Li and Jun Wang}, title = {{L}earning-based agricultural management in partially observable environments subject to climate variability}, year = {2024}, eprint = {2401.01273}, note = {arXiv:2401.01273v1} }
Agricultural management, with a particular focus on fertilization strategies, holds a central role in shaping crop yield, economic profitability, and environmental sustainability. While conventional guidelines offer valuable insights, their efficacy diminishes when confronted with extreme weather conditions, such as heatwaves and droughts. In this study, we introduce an innovative framework that integrates Deep Reinforcement Learning (DRL) with Recurrent Neural Networks (RNNs). Leveraging the Gym-DSSAT simulator, we train an intelligent agent to master optimal nitrogen fertilization management. Through a series of simulation experiments conducted on corn crops in Iowa, we compare Partially Observable Markov Decision Process (POMDP) models with Markov Decision Process (MDP) models. Our research underscores the advantages of utilizing sequential observations in developing more efficient nitrogen input policies. Additionally, we explore the impact of climate variability, particularly during extreme weather events, on agricultural outcomes and management. Our findings demonstrate the adaptability of fertilization policies to varying climate conditions. Notably, a fixed policy exhibits resilience in the face of minor climate fluctuations, leading to commendable corn yields, cost-effectiveness, and environmental conservation. However, our study illuminates the need for agent retraining to acquire new optimal policies under extreme weather events. This research charts a promising course toward adaptable fertilization strategies that can seamlessly align with dynamic climate scenarios, ultimately contributing to the optimization of crop management practices.
Making Large Language Models A Better Foundation For Dense Retrieval
Chaofan Li, Zheng Liu, Shitao Xiao, Yingxia Shao
Dec 27 2023 cs.CL arXiv:2312.15503v1

@misc{2312.15503, author = {Chaofan Li and Zheng Liu and Shitao Xiao and Yingxia Shao}, title = {{M}aking {L}arge {L}anguage {M}odels {A} {B}etter {F}oundation {F}or {D}ense {R}etrieval}, year = {2023}, eprint = {2312.15503}, note = {arXiv:2312.15503v1} }
Dense retrieval needs to learn discriminative text embeddings to represent the semantic relationship between query and document. It may benefit from the use of large language models (LLMs), given LLMs' strong capability in semantic understanding. However, LLMs are pre-trained on text generation tasks, whose working pattern is completely different from representing texts as embeddings. As a result, it is imperative to study how to adapt LLMs properly so that they can be effectively initialized as the backbone encoder for dense retrieval. In this paper, we propose a novel approach, called LLaRA (LLM adapted for dense RetrievAl), which works as a post-hoc adaptation of LLM for the dense retrieval application. LLaRA consists of two pretext tasks: EBAE (Embedding-Based Auto-Encoding) and EBAR (Embedding-Based Auto-Regression), where the text embeddings from LLM are used to reconstruct the tokens for the input sentence and predict the tokens for the next sentence, respectively. LLaRA turns out to be simple, lightweight, and highly effective. It is applied to adapt LLaMA-2-7B (base) on the Wikipedia corpus, where it substantially improves the model's fine-tuned performances on a variety of dense retrieval benchmarks, like MSMARCO and BEIR. Our model and code will be made publicly available in the BGE repository.
Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Models
Chen Ju, Haicheng Wang, Zeqian Li, Xu Chen, Zhonghua Zhai, Weilin Huang, Shuai Xiao
Dec 13 2023 cs.CV arXiv:2312.07408v1

@misc{2312.07408, author = {Chen Ju and Haicheng Wang and Zeqian Li and Xu Chen and Zhonghua Zhai and Weilin Huang and Shuai Xiao}, title = {{T}urbo: {I}nformativity-{D}riven {A}cceleration {P}lug-{I}n for {V}ision-{L}anguage {M}odels}, year = {2023}, eprint = {2312.07408}, note = {arXiv:2312.07408v1} }
Vision-Language Large Models (VLMs) have become a primary backbone of AI due to their impressive performance. However, their expensive computation costs, i.e., throughput and delay, impede their potential in real-world scenarios. To achieve acceleration for VLMs, most existing methods focus on the model perspective: pruning, distillation, quantification, but completely overlook the data-perspective redundancy. To fill this gap, this paper draws attention to the severity of data redundancy, and designs one plug-and-play Turbo module guided by information degree to prune inefficient tokens from visual or textual data. In pursuit of efficiency-performance trade-offs, information degree takes two key factors into consideration: mutual redundancy and semantic value. Concretely, the former evaluates the data duplication between sequential tokens, while the latter evaluates each token by its contribution to the overall semantics. As a result, tokens with high information degree carry less redundancy and stronger semantics. For VLMs' calculation, Turbo works as a user-friendly plug-in that sorts data by information degree, utilizing only the top-ranked tokens to save costs. Its advantages are multifaceted, e.g., being generally compatible with various VLMs across understanding and generation, and simple to use without retraining and with trivial engineering effort. On multiple public VLM benchmarks, we conduct extensive experiments to reveal the gratifying acceleration of Turbo, under negligible performance drop.
The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model
Yilin Ye, Qian Zhu, Shishi Xiao, Kang Zhang, Wei Zeng
Dec 05 2023 cs.IR cs.AI cs.CV cs.HC arXiv:2312.01656v2

@misc{2312.01656, author = {Yilin Ye and Qian Zhu and Shishi Xiao and Kang Zhang and Wei Zeng}, title = {{T}he {C}ontemporary {A}rt of {I}mage {S}earch: {I}terative {U}ser {I}ntent {E}xpansion via {V}ision-{L}anguage {M}odel}, year = {2023}, eprint = {2312.01656}, note = {arXiv:2312.01656v2} }
Image search is an essential and user-friendly method to explore vast galleries of digital images. However, existing image search methods heavily rely on proximity measurements like tag matching or image similarity, requiring precise user inputs for satisfactory results. To meet the growing demand for a contemporary image search engine that enables accurate comprehension of users' search intentions, we introduce an innovative user intent expansion framework. Our framework leverages visual-language models to parse and compose multi-modal user inputs to provide more accurate and satisfying results. It comprises two-stage processes: 1) a parsing stage that incorporates a language parsing module with large language models to enhance the comprehension of textual inputs, along with a visual parsing module that integrates an interactive segmentation module to swiftly identify detailed visual elements within images; and 2) a logic composition stage that combines multiple user search intents into a unified logic expression for more sophisticated operations in complex searching scenarios. Moreover, the intent expansion framework enables users to perform flexible contextualized interactions with the search results to further specify or adjust their detailed search intents iteratively. We implemented the framework into an image search system for NFT (non-fungible token) search and conducted a user study to evaluate its usability and novel properties. The results indicate that the proposed framework significantly improves users' image search experience. Particularly the parsing and contextualized interactions prove useful in allowing users to express their search intents more accurately and engage in a more enjoyable iterative search experience.
Enhancing Cross-domain Click-Through Rate Prediction via Explicit Feature Augmentation
Xu Chen, Zida Cheng, Jiangchao Yao, Chen Ju, Weilin Huang, Jinsong Lan, Xiaoyi Zeng, Shuai Xiao
Dec 04 2023 cs.IR arXiv:2312.00078v2

@misc{2312.00078, author = {Xu Chen and Zida Cheng and Jiangchao Yao and Chen Ju and Weilin Huang and Jinsong Lan and Xiaoyi Zeng and Shuai Xiao}, title = {{E}nhancing {C}ross-domain {C}lick-{T}hrough {R}ate {P}rediction via {E}xplicit {F}eature {A}ugmentation}, year = {2023}, eprint = {2312.00078}, note = {arXiv:2312.00078v2} }
Cross-domain CTR (CDCTR) prediction is an important research topic that studies how to leverage meaningful data from a related domain to help CTR prediction in the target domain. Most existing CDCTR works design implicit ways to transfer knowledge across domains, such as parameter-sharing that regularizes the model training in the target domain. More effectively, recent researchers propose explicit techniques to extract user interest knowledge and transfer this knowledge to the target domain. However, these explicit methods mainly face two issues: 1) they usually require a super domain, i.e. an extremely large source domain, to cover most users or items of the target domain, and 2) the extracted user interest knowledge is static no matter what the context is in the target domain. These limitations motivate us to develop a more flexible and efficient technique to explicitly transfer knowledge. In this work, we propose a cross-domain augmentation network (CDAnet) that can perform explicit knowledge transfer between two domains. Specifically, CDAnet contains a designed translation network and an augmentation network which are trained sequentially. The translation network computes latent features from two domains and learns meaningful cross-domain knowledge of each input in the target domain by using a designed cross-supervised feature translator. Later the augmentation network employs the explicit cross-domain knowledge as augmented information to boost the target domain CTR prediction. Through extensive experiments on two public benchmarks and one industrial production dataset, we show CDAnet can learn meaningful translated features and largely improve the performance of CTR prediction. CDAnet has been evaluated in an online A/B test on image2product retrieval in the Taobao app, bringing an absolute 0.11 point CTR improvement, a relative 0.64% deal growth and a relative 1.26% GMV increase.
LM-Cocktail: Resilient Tuning of Language Models via Model Merging
Shitao Xiao, Zheng Liu, Peitian Zhang, Xingrun Xing
Nov 23 2023 cs.CL cs.AI cs.IR arXiv:2311.13534v4

@misc{2311.13534, author = {Shitao Xiao and Zheng Liu and Peitian Zhang and Xingrun Xing}, title = {{LM}-{C}ocktail: {R}esilient {T}uning of {L}anguage {M}odels via {M}odel {M}erging}, year = {2023}, eprint = {2311.13534}, note = {arXiv:2311.13534v4} }
Pre-trained language models are continually fine-tuned to better support downstream applications. However, this operation may result in significant performance degeneration on general tasks beyond the targeted domain. To overcome this problem, we propose LM-Cocktail, which enables the fine-tuned model to stay resilient in general perspectives. Our method is conducted in the form of model merging, where the fine-tuned language model is merged with the pre-trained base model or the peer models from other domains through weighted average. Despite its simplicity, LM-Cocktail is surprisingly effective: the resulting model is able to achieve a strong empirical performance in the whole scope of general tasks while preserving a superior capacity in its targeted domain. We conduct comprehensive experiments with LLama and BGE models on popular benchmarks, including FLAN, MMLU, MTEB, whose results validate the efficacy of our proposed method. The code and checkpoints are available at https://github.com/FlagOpen/FlagEmbedding/tree/master/LM_Cocktail.
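The weighted-average merging described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `merge_models`, the plain-dictionary parameter format, and the equal weights are assumptions made here for illustration, and the abstract does not say how the weights are chosen.

```python
from typing import Dict, List, Sequence

# Hypothetical parameter format: name -> flat list of floats.
Params = Dict[str, List[float]]


def merge_models(models: Sequence[Params], weights: Sequence[float]) -> Params:
    """Return the weighted average of several models' parameters.

    All models are assumed to share the same parameter names and sizes.
    Weights are normalized so that they sum to 1.
    """
    if not models or len(models) != len(weights):
        raise ValueError("need one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]

    merged: Params = {}
    for name, values in models[0].items():
        merged[name] = [
            sum(w * m[name][i] for w, m in zip(norm, models))
            for i in range(len(values))
        ]
    return merged


# Toy example: merge a fine-tuned model with its base model, equal weights.
base = {"layer.weight": [0.0, 2.0]}
tuned = {"layer.weight": [1.0, 4.0]}
merged = merge_models([tuned, base], [0.5, 0.5])
print(merged)  # {'layer.weight': [0.5, 3.0]}
```

In practice the parameters would be tensors from a checkpoint rather than Python lists; the averaging logic is the same per parameter.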
CMFDFormer: Transformer-based Copy-Move Forgery Detection with Continual Learning
Yaqi Liu, Chao Xia, Song Xiao, Qingxiao Guan, Wenqian Dong, Yifan Zhang, Nenghai Yu
Nov 23 2023 cs.CV arXiv:2311.13263v2

@misc{2311.13263, author = {Yaqi Liu and Chao Xia and Song Xiao and Qingxiao Guan and Wenqian Dong and Yifan Zhang and Nenghai Yu}, title = {{CMFDF}ormer: {T}ransformer-based {C}opy-{M}ove {F}orgery {D}etection with {C}ontinual {L}earning}, year = {2023}, eprint = {2311.13263}, note = {arXiv:2311.13263v2} }
Copy-move forgery detection aims at detecting duplicated regions in a suspected forged image, and deep learning based copy-move forgery detection methods are in the ascendant. These deep learning based methods heavily rely on synthetic training data, and the performance will degrade when facing new tasks. In this paper, we propose a Transformer-style copy-move forgery detection network named as CMFDFormer, and provide a novel PCSD (Pooled Cube and Strip Distillation) continual learning framework to help CMFDFormer handle new tasks. CMFDFormer consists of a MiT (Mix Transformer) backbone network and a PHD (Pluggable Hybrid Decoder) mask prediction network. The MiT backbone network is a Transformer-style network which is adopted on the basis of comprehensive analyses with CNN-style and MLP-style backbones. The PHD network is constructed based on self-correlation computation, hierarchical feature integration, a multi-scale cycle fully-connected block and a mask reconstruction block. The PHD network is applicable to feature extractors of different styles for hierarchical multi-scale information extraction, achieving comparable performance. Last but not least, we propose a PCSD continual learning framework to improve the forgery detectability and avoid catastrophic forgetting when handling new tasks. Our continual learning framework restricts intermediate features from the PHD network, and takes advantage of both cube pooling and strip pooling. Extensive experiments on publicly available datasets demonstrate the good performance of CMFDFormer and the effectiveness of the PCSD continual learning framework.
SparseSpikformer: A Co-Design Framework for Token and Weight Pruning in Spiking Transformer
Yue Liu, Shanlin Xiao, Bo Li, Zhiyi Yu
Nov 16 2023 cs.CV cs.AI arXiv:2311.08806v1

@misc{2311.08806, author = {Yue Liu and Shanlin Xiao and Bo Li and Zhiyi Yu}, title = {{S}parse{S}pikformer: {A} {C}o-{D}esign {F}ramework for {T}oken and {W}eight {P}runing in {S}piking {T}ransformer}, year = {2023}, eprint = {2311.08806}, note = {arXiv:2311.08806v1} }
As the third-generation neural network, the Spiking Neural Network (SNN) has the advantages of low power consumption and high energy efficiency, making it suitable for implementation on edge devices. More recently, the most advanced SNN, Spikformer, combines the self-attention module from Transformer with SNN to achieve remarkable performance. However, it adopts larger channel dimensions in MLP layers, leading to an increased number of redundant model parameters. To effectively decrease the computational complexity and weight parameters of the model, we explore the Lottery Ticket Hypothesis (LTH) and discover a very sparse ($\ge$90%) subnetwork that achieves comparable performance to the original network. Furthermore, we also design a lightweight token selector module, which can remove unimportant background information from images based on the average spike firing rate of neurons, selecting only essential foreground image tokens to participate in attention calculation. Based on that, we present SparseSpikformer, a co-design framework aimed at achieving sparsity in Spikformer through token and weight pruning techniques. Experimental results demonstrate that our framework can reduce model parameters by 90% and cut Giga Floating-Point Operations (GFLOPs) by 20% while maintaining the accuracy of the original model.
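The token-selection idea in this abstract (ranking image tokens by the average spike firing rate of their neurons) can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_tokens`, the `spikes[t][n][c]` layout, and the top-k rule with a fixed `keep` count are hypothetical, and the abstract does not specify how the selector decides which tokens to keep.

```python
from typing import List


def select_tokens(spikes: List[List[List[int]]], keep: int) -> List[int]:
    """Return indices of the `keep` tokens with the highest firing rate.

    spikes[t][n][c] is 1 if channel c of token n fired at timestep t, else 0.
    The firing rate of a token is averaged over all timesteps and channels.
    """
    num_steps = len(spikes)
    num_tokens = len(spikes[0])
    num_channels = len(spikes[0][0])

    rates = []
    for n in range(num_tokens):
        fired = sum(
            spikes[t][n][c]
            for t in range(num_steps)
            for c in range(num_channels)
        )
        rates.append(fired / (num_steps * num_channels))

    # Rank tokens by firing rate, keep the top ones, restore original order.
    ranked = sorted(range(num_tokens), key=lambda n: rates[n], reverse=True)
    return sorted(ranked[:keep])


# Toy example: 2 timesteps, 3 tokens, 2 channels; token 1 never fires.
spikes = [
    [[1, 1], [0, 0], [1, 0]],
    [[1, 0], [0, 0], [0, 1]],
]
kept = select_tokens(spikes, keep=2)
print(kept)  # [0, 2]
```

Only the kept tokens would then take part in the attention calculation, which is where the compute saving comes from.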
Throughput Maximization in Multi-Band Optical Networks with Column Generation
Cao Chen, Shilin Xiao, Fen Zhou, Massimo Tornatore
Nov 14 2023 cs.NI arXiv:2311.07335v2

@misc{2311.07335, author = {Cao Chen and Shilin Xiao and Fen Zhou and Massimo Tornatore}, title = {{T}hroughput {M}aximization in {M}ulti-{B}and {O}ptical {N}etworks with {C}olumn {G}eneration}, year = {2023}, eprint = {2311.07335}, note = {arXiv:2311.07335v2} }
Multi-band transmission is a promising technical direction for spectrum and capacity expansion of existing optical networks. Due to the increase in the number of usable wavelengths in multi-band optical networks, the complexity of resource allocation problems becomes a major concern. Moreover, the transmission performance, spectrum width, and cost constraint across optical bands may be heterogeneous. Assuming a worst-case transmission margin in U, L, and C-bands, this paper investigates the problem of throughput maximization in multi-band optical networks, including the optimization of route, wavelength, and band assignment. We propose a low-complexity decomposition approach based on Column Generation (CG) to address the scalability issue faced by traditional methodologies. We numerically compare the results obtained by our CG-based approach to an integer linear programming model, confirming the near-optimal network throughput. Our results also demonstrate the scalability of the CG-based approach when the number of wavelengths increases, with the computation time in the magnitude order of 10 s for cases varying from 75 to 1200 wavelength channels per link in a 14-node network. Code of this publication is available at github.com/cchen000/CG-Multi-Band.
Atom: Neural Traffic Compression with Spatio-Temporal Graph Neural Networks
Paul Almasan, Krzysztof Rusek, Shihan Xiao, Xiang Shi, Xiangle Cheng, Albert Cabellos-Aparicio, Pere Barlet-Ros
Nov 10 2023 cs.NI arXiv:2311.05337v1

@misc{2311.05337, author = {Paul Almasan and Krzysztof Rusek and Shihan Xiao and Xiang Shi and Xiangle Cheng and Albert Cabellos-Aparicio and Pere Barlet-Ros}, title = {{A}tom: {N}eural {T}raffic {C}ompression with {S}patio-{T}emporal {G}raph {N}eural {N}etworks}, year = {2023}, eprint = {2311.05337}, doi = {10.1145/3630049.3630170}, note = {arXiv:2311.05337v1} }
Storing network traffic data is key to efficient network management; however, it is becoming more challenging and costly due to the ever-increasing data transmission rates, traffic volumes, and connected devices. In this paper, we explore the use of neural architectures for network traffic compression. Specifically, we consider a network scenario with multiple measurement points in a network topology. Such measurements can be interpreted as multiple time series that exhibit spatial and temporal correlations induced by network topology, routing, or user behavior. We present \textit{Atom}, a neural traffic compression method that leverages spatial and temporal correlations present in network traffic. \textit{Atom} implements a customized spatio-temporal graph neural network design that effectively exploits both types of correlations simultaneously. The experimental results show that \textit{Atom} can outperform GZIP's compression ratios by 50\%-65\% on three real-world networks.
Two-stage solution for ancilla-assisted quantum process tomography: error analysis and optimal design
Shuixin Xiao, Yuanlong Wang, Daoyi Dong, Jun Zhang
Nov 01 2023 quant-ph cs.SY eess.SY arXiv:2310.20421v1

@misc{2310.20421, author = {Shuixin Xiao and Yuanlong Wang and Daoyi Dong and Jun Zhang}, title = {{T}wo-stage solution for ancilla-assisted quantum process tomography: error analysis and optimal design}, year = {2023}, eprint = {2310.20421}, note = {arXiv:2310.20421v1} }
Quantum process tomography (QPT) is a fundamental task to characterize the dynamics of quantum systems. In contrast to standard QPT, the ancilla-assisted process tomography (AAPT) framework introduces an extra ancilla system so that only a single input state is needed. In this paper, we extend the two-stage solution, a method originally designed for standard QPT, to perform AAPT. Our algorithm has $O(Md_A^2d_B^2)$ computational complexity, where $M$ is the number of types of measurement operators, $d_A$ is the dimension of the quantum system of interest, and $d_B$ is the dimension of the ancilla system. Then we establish an error upper bound and further discuss the optimal design of the input state in AAPT. A numerical example on a phase damping process demonstrates the effectiveness of the optimal design and illustrates the theoretical error analysis.
MCRAGE: Synthetic Healthcare Data for Fairness
Keira Behal, Jiayi Chen, Caleb Fikes, Sophia Xiao
Oct 31 2023 stat.ML cs.LG arXiv:2310.18430v3

@misc{2310.18430, author = {Keira Behal and Jiayi Chen and Caleb Fikes and Sophia Xiao}, title = {{MCRAGE}: {S}ynthetic {H}ealthcare {D}ata for {F}airness}, year = {2023}, eprint = {2310.18430}, note = {arXiv:2310.18430v3} }
In the field of healthcare, electronic health records (EHR) serve as crucial training data for developing machine learning models for diagnosis, treatment, and the management of healthcare resources. However, medical datasets are often imbalanced in terms of sensitive attributes such as race/ethnicity, gender, and age. Machine learning models trained on class-imbalanced EHR datasets perform significantly worse in deployment for individuals of the minority classes compared to those from majority classes, which may lead to inequitable healthcare outcomes for minority groups. To address this challenge, we propose Minority Class Rebalancing through Augmentation by Generative modeling (MCRAGE), a novel approach to augment imbalanced datasets using samples generated by a deep generative model. The MCRAGE process involves training a Conditional Denoising Diffusion Probabilistic Model (CDDPM) capable of generating high-quality synthetic EHR samples from underrepresented classes. We use this synthetic data to augment the existing imbalanced dataset, resulting in a more balanced distribution across all classes, which can be used to train less biased downstream models. We measure the performance of MCRAGE versus alternative approaches using Accuracy, F1 score and AUROC of these downstream models. We provide theoretical justification for our method in terms of recent convergence results for DDPMs.
Enabling energy-Efficient object detection with surrogate gradient descent in spiking neural networks
Jilong Luo, Shanlin Xiao, Yinsheng Chen, Zhiyi Yu
Oct 23 2023 cs.CV cs.AI arXiv:2310.12985v1

@misc{2310.12985, author = {Jilong Luo and Shanlin Xiao and Yinsheng Chen and Zhiyi Yu}, title = {{E}nabling energy-{E}fficient object detection with surrogate gradient descent in spiking neural networks}, year = {2023}, eprint = {2310.12985}, note = {arXiv:2310.12985v1} }
Spiking Neural Networks (SNNs) are a biologically plausible neural network model with significant advantages in both event-driven processing and spatio-temporal information processing, rendering SNNs an appealing choice for energy-efficient object detection. However, the non-differentiability of the biological neuronal dynamics model presents a challenge during the training of SNNs. Furthermore, a suitable decoding strategy for object detection in SNNs is currently lacking. In this study, we introduce the Current Mean Decoding (CMD) method, which solves the regression problem to facilitate the training of deep SNNs for object detection tasks. Based on the gradient surrogate and CMD, we propose the SNN-YOLOv3 model for object detection. Our experiments demonstrate that SNN-YOLOv3 achieves a remarkable performance with an mAP of 61.87% on the PASCAL VOC dataset, requiring only 6 time steps. Compared to SpikingYOLO, we have managed to increase mAP by nearly 10% while reducing energy consumption by two orders of magnitude.
Enhancing Deep Neural Network Training Efficiency and Performance through Linear Prediction
Hejie Ying, Mengmeng Song, Yaohong Tang, Shungen Xiao, Zimin Xiao
Oct 18 2023 cs.LG cs.CV arXiv:2310.10958v2

@misc{2310.10958, author = {Hejie Ying and Mengmeng Song and Yaohong Tang and Shungen Xiao and Zimin Xiao}, title = {{E}nhancing {D}eep {N}eural {N}etwork {T}raining {E}fficiency and {P}erformance through {L}inear {P}rediction}, year = {2023}, eprint = {2310.10958}, note = {arXiv:2310.10958v2} }
Deep neural networks (DNNs) have achieved remarkable success in various fields, including computer vision and natural language processing. However, training an effective DNN model still poses challenges. This paper proposes a method to optimize the training effectiveness of DNNs, with the goal of improving model performance. Firstly, based on the observation that DNN parameters change according to certain laws during the training process, the potential of parameter prediction for improving model training efficiency and performance is identified. Secondly, considering the magnitude of DNN model parameters, hardware limitations and the noise tolerance of Stochastic Gradient Descent (SGD), a Parameter Linear Prediction (PLP) method is exploited to perform DNN parameter prediction. Finally, validations are carried out on some representative backbones. Experimental results show that, compared to normal training, under the same training conditions and epochs, the optimal model obtained with the proposed PLP method achieves on average about 1% accuracy improvement and 0.01 top-1/top-5 error reduction for Vgg16, Resnet18 and GoogLeNet on the CIFAR-100 dataset, which shows the effectiveness of the proposed method on different DNN structures and validates its capacity for enhancing DNN training efficiency and performance.
Retrieve Anything To Augment Large Language Models
Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, Jian-Yun Nie
Oct 12 2023 cs.IR arXiv:2310.07554v2

@misc{2310.07554, author = {Peitian Zhang and Shitao Xiao and Zheng Liu and Zhicheng Dou and Jian-Yun Nie}, title = {{R}etrieve {A}nything {T}o {A}ugment {L}arge {L}anguage {M}odels}, year = {2023}, eprint = {2310.07554}, note = {arXiv:2310.07554v2} }
Large language models (LLMs) face significant challenges stemming from their inherent limitations in knowledge, memory, alignment, and action. These challenges cannot be addressed by LLMs alone, but should rely on assistance from the external world, such as knowledge base, memory store, demonstration examples, and tools. Retrieval augmentation stands as a vital mechanism for bridging the gap between LLMs and the external assistance. However, conventional methods encounter two pressing issues. On the one hand, the general-purpose retrievers are not properly optimized for the retrieval augmentation of LLMs. On the other hand, the task-specific retrievers lack the required versatility, hindering their performance across the diverse retrieval augmentation scenarios. In this work, we present a novel approach, the LLM-Embedder, which comprehensively supports the diverse retrieval augmentation needs of LLMs with one unified embedding model. Training such a unified model is non-trivial, as various retrieval tasks aim to capture distinct semantic relationships, often subject to mutual interference. To address this challenge, we systematically optimize our training methodology. This includes reward formulation based on LLMs' feedback, the stabilization of knowledge distillation, multi-task fine-tuning with explicit instructions, and homogeneous in-batch negative sampling. These optimization strategies contribute to the outstanding empirical performance of the LLM-Embedder. Notably, it yields remarkable enhancements in retrieval augmentation for LLMs, surpassing both general-purpose and task-specific retrievers in various evaluation scenarios. Our checkpoint and source code are publicly available at https://github.com/FlagOpen/FlagEmbedding.
DiskANN++: Efficient Page-based Search over Isomorphic Mapped Graph Index using Query-sensitivity Entry Vertex
Jiongkang Ni, Xiaoliang Xu, Yuxiang Wang, Can Li, Jiajie Yao, Shihai Xiao, Xuecang Zhang
Oct 03 2023 cs.IR cs.DB arXiv:2310.00402v5

@misc{2310.00402, author = {Jiongkang Ni and Xiaoliang Xu and Yuxiang Wang and Can Li and Jiajie Yao and Shihai Xiao and Xuecang Zhang}, title = {{D}isk{ANN}++: {E}fficient {P}age-based {S}earch over {I}somorphic {M}apped {G}raph {I}ndex using {Q}uery-sensitivity {E}ntry {V}ertex}, year = {2023}, eprint = {2310.00402}, note = {arXiv:2310.00402v5} }
Given a vector dataset $\mathcal{X}$ and a query vector $\vec{x}_q$, graph-based Approximate Nearest Neighbor Search (ANNS) aims to build a graph index $G$ and approximately return vectors with minimum distances to $\vec{x}_q$ by searching over $G$. The main drawback of graph-based ANNS is that a graph index would be too large to fit into memory, especially for a large-scale $\mathcal{X}$. To solve this, a Product Quantization (PQ)-based hybrid method called DiskANN was proposed to store a low-dimensional PQ index in memory and retain a graph index on SSD, thus reducing memory overhead while ensuring a high search accuracy. However, it suffers from two I/O issues that significantly affect the overall efficiency: (1) a long routing path from an entry vertex to the query's neighborhood that results in a large number of I/O requests and (2) redundant I/O requests during the routing process. We propose an optimized DiskANN++ to overcome the above issues. Specifically, for the first issue, we present a query-sensitive entry vertex selection strategy to replace DiskANN's static graph-central entry vertex with a dynamically determined entry vertex that is close to the query. For the second I/O issue, we present an isomorphic mapping on DiskANN's graph index to optimize the SSD layout and propose an asynchronously optimized Pagesearch based on the optimized SSD layout as an alternative to DiskANN's beamsearch. Comprehensive experimental studies on eight real-world datasets demonstrate DiskANN++'s superiority in efficiency. We achieve a notable 1.5x to 2.2x improvement in QPS compared to DiskANN, given the same accuracy constraint.
Forgedit: Text Guided Image Editing via Learning and Forgetting
Shiwen Zhang, Shuai Xiao, Weilin Huang
Sep 20 2023 cs.CV arXiv:2309.10556v2

@misc{2309.10556, author = {Shiwen Zhang and Shuai Xiao and Weilin Huang}, title = {{F}orgedit: {T}ext {G}uided {I}mage {E}diting via {L}earning and {F}orgetting}, year = {2023}, eprint = {2309.10556}, note = {arXiv:2309.10556v2} }
Text-guided image editing on real or synthetic images, given only the original image itself and the target text prompt as inputs, is a very general and challenging task. It requires an editing model to estimate by itself which part of the image should be edited, and then perform either rigid or non-rigid editing while preserving the characteristics of the original image. In this paper, we design a novel text-guided image editing method named Forgedit. First, we propose a vision-language joint optimization framework capable of reconstructing the original image in 30 seconds, much faster than the previous SOTA and with much less overfitting. Then we propose a novel vector projection mechanism in the text embedding space of Diffusion Models, which is capable of controlling the identity similarity and editing strength separately. Finally, we discovered a general property of UNet in Diffusion Models, i.e., the UNet encoder learns space and structure, while the UNet decoder learns appearance and identity. With such a property, we design forgetting mechanisms to successfully tackle the fatal and inevitable overfitting issues when fine-tuning Diffusion Models on one image, thus significantly boosting the editing capability of Diffusion Models. Our method, Forgedit, built on Stable Diffusion, achieves new state-of-the-art results on the challenging text-guided image editing benchmark TEdBench, surpassing previous SOTA methods such as Imagic with Imagen in terms of both CLIP score and LPIPS score. Codes are available at https://github.com/witcherofresearch/Forgedit
EGFE: End-to-end Grouping of Fragmented Elements in UI Designs with Multimodal Learning
Liuqing Chen, Yunnong Chen, Shuhong Xiao, Yaxuan Song, Lingyun Sun, Yankun Zhen, Tingting Zhou, Yanfang Chang
Sep 19 2023 cs.SE cs.AI arXiv:2309.09867v1

@misc{2309.09867, author = {Liuqing Chen and Yunnong Chen and Shuhong Xiao and Yaxuan Song and Lingyun Sun and Yankun Zhen and Tingting Zhou and Yanfang Chang}, title = {{EGFE}: {E}nd-to-end {G}rouping of {F}ragmented {E}lements in {UI} {D}esigns with {M}ultimodal {L}earning}, year = {2023}, eprint = {2309.09867}, doi = {10.1145/3597503.3623313}, note = {arXiv:2309.09867v1} }
When translating UI design prototypes to code in industry, automatically generating code from design prototypes can expedite the development of applications and GUI iterations. However, in design prototypes without strict design specifications, UI components may be composed of fragmented elements. Grouping these fragmented elements can greatly improve the readability and maintainability of the generated code. Current methods employ a two-stage strategy that introduces hand-crafted rules to group fragmented elements. Unfortunately, the performance of these methods is not satisfying due to visually overlapped and tiny UI elements. In this study, we propose EGFE, a novel method for automatically End-to-end Grouping Fragmented Elements via UI sequence prediction. To facilitate the UI understanding, we innovatively construct a Transformer encoder to model the relationship between the UI elements with multi-modal representation learning. The evaluation on a dataset of 4606 UI prototypes collected from professional UI designers shows that our method outperforms the state-of-the-art baselines in the precision (by 29.75\%), recall (by 31.07\%), and F1-score (by 30.39\%) at edit distance threshold of 4. In addition, we conduct an empirical study to assess the improvement of the generated front-end code. The results demonstrate the effectiveness of our method on a real software engineering application. Our end-to-end fragmented elements grouping method creates opportunities for improving UI-related software engineering tasks.
C-Pack: Packed Resources For General Chinese Embeddings
Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, Jian-Yun Nie
Sep 15 2023 cs.CL cs.AI cs.IR arXiv:2309.07597v5

@misc{2309.07597, author = {Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff and Defu Lian and Jian-Yun Nie}, title = {{C}-{P}ack: {P}acked {R}esources {F}or {G}eneral {C}hinese {E}mbeddings}, year = {2023}, eprint = {2309.07597}, note = {arXiv:2309.07597v5} }
We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.
DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation
Zhijie Bao, Wei Chen, Shengze Xiao, Kuang Ren, Jiaao Wu, Cheng Zhong, Jiajie Peng, Xuanjing Huang, Zhongyu Wei
Aug 29 2023 cs.CL cs.AI arXiv:2308.14346v1

@misc{2308.14346, author = {Zhijie Bao and Wei Chen and Shengze Xiao and Kuang Ren and Jiaao Wu and Cheng Zhong and Jiajie Peng and Xuanjing Huang and Zhongyu Wei}, title = {{DISC}-{M}ed{LLM}: {B}ridging {G}eneral {L}arge {L}anguage {M}odels and {R}eal-{W}orld {M}edical {C}onsultation}, year = {2023}, eprint = {2308.14346}, note = {arXiv:2308.14346v1} }
We propose DISC-MedLLM, a comprehensive solution that leverages Large Language Models (LLMs) to provide accurate and truthful medical responses in end-to-end conversational healthcare services. To construct high-quality Supervised Fine-Tuning (SFT) datasets, we employ three strategies: utilizing medical knowledge graphs, reconstructing real-world dialogues, and incorporating human-guided preference rephrasing. These datasets are instrumental in training DISC-MedLLM, which surpasses existing medical LLMs in both single-turn and multi-turn consultation scenarios. Extensive experimental results demonstrate the effectiveness of the proposed model in bridging the gap between general language models and real-world medical consultation. Additionally, we release the constructed dataset and model weights to further contribute to research and development. Further details and resources can be found at https://github.com/FudanDISC/DISC-MedLLM
MixBCT: Towards Self-Adapting Backward-Compatible Training
Yu Liang, Yufeng Zhang, Shiliang Zhang, Yaowei Wang, Sheng Xiao, Rong Xiao, Xiaoyu Wang
Aug 15 2023 cs.CV arXiv:2308.06948v2

@misc{2308.06948, author = {Yu Liang and Yufeng Zhang and Shiliang Zhang and Yaowei Wang and Sheng Xiao and Rong Xiao and Xiaoyu Wang}, title = {{M}ix{BCT}: {T}owards {S}elf-{A}dapting {B}ackward-{C}ompatible {T}raining}, year = {2023}, eprint = {2308.06948}, note = {arXiv:2308.06948v2} }
Backward-compatible training circumvents the need for expensive updates to the old gallery database when deploying an advanced new model in the retrieval system. Previous methods achieved backward compatibility by aligning prototypes of the new model with the old one, yet they often overlooked the distribution of old features, limiting their effectiveness when the low quality of the old model results in weak feature discriminability. Instance-based methods like L2 regression take into account the distribution of old features but impose strong constraints on the performance of the new model itself. In this paper, we propose MixBCT, a simple yet highly effective backward-compatible training method that serves as a unified framework for old models of varying qualities. We construct a single loss function applied to mixed old and new features to facilitate backward-compatible training, which adaptively adjusts the constraint domain for new features based on the distribution of old features. We conducted extensive experiments on the large-scale face recognition datasets MS1Mv3 and IJB-C to verify the effectiveness of our method. The experimental results clearly demonstrate its superiority over previous methods. Code is available at https://github.com/yuleung/MixBCT .