subscribe to arXiv mailings

GaussianBlock: Building Part-Aware Compositional and Editable 3D Scene by Primitives and Gaussians

Authors: Shuyi Jiang, Qihao Zhao, Hossein Rahmani, De Wen Soh, Jun Liu, Na Zhao

Abstract: Recently, with the development of Neural Radiance Fields and Gaussian Splatting, 3D reconstruction techniques have achieved remarkably high fidelity. However, the latent representations learnt by these methods are highly entangled and lack interpretability. In this paper, we propose a novel part-aware compositional reconstruction method, called GaussianBlock, that enables semantically coherent and… ▽ More Recently, with the development of Neural Radiance Fields and Gaussian Splatting, 3D reconstruction techniques have achieved remarkably high fidelity. However, the latent representations learnt by these methods are highly entangled and lack interpretability. In this paper, we propose a novel part-aware compositional reconstruction method, called GaussianBlock, that enables semantically coherent and disentangled representations, allowing for precise and physical editing akin to building blocks, while simultaneously maintaining high fidelity. Our GaussianBlock introduces a hybrid representation that leverages the advantages of both primitives, known for their flexible actionability and editability, and 3D Gaussians, which excel in reconstruction quality. Specifically, we achieve semantically coherent primitives through a novel attention-guided centering loss derived from 2D semantic priors, complemented by a dynamic splitting and fusion strategy. Furthermore, we utilize 3D Gaussians that hybridize with primitives to refine structural details and enhance fidelity. Additionally, a binding inheritance strategy is employed to strengthen and maintain the connection between the two. Our reconstructed scenes are evidenced to be disentangled, compositional, and compact across diverse benchmarks, enabling seamless, direct and precise editing while maintaining high quality. △ Less

Submitted 6 October, 2024; v1 submitted 2 October, 2024; originally announced October 2024.

arXiv:2410.00376 [pdf, other]

Frequency Diverse Array-enabled RIS-aided Integrated Sensing and Communication

Authors: Hanyu Yang, Shiqi Gong, Heng Liu, Chengwen Xing, Nan Zhao, Dusit Niyato

Abstract: Integrated sensing and communication (ISAC) has been envisioned as a prospective technology to enable ubiquitous sensing and communications in next-generation wireless networks. In contrast to existing works on reconfigurable intelligent surface (RIS) aided ISAC systems using conventional phased arrays (PAs), this paper investigates a frequency diverse array (FDA)-enabled RIS-aided ISAC system, wh… ▽ More Integrated sensing and communication (ISAC) has been envisioned as a prospective technology to enable ubiquitous sensing and communications in next-generation wireless networks. In contrast to existing works on reconfigurable intelligent surface (RIS) aided ISAC systems using conventional phased arrays (PAs), this paper investigates a frequency diverse array (FDA)-enabled RIS-aided ISAC system, where the FDA aims to provide a distance-angle-dependent beampattern to effectively suppress the clutter, and RIS is employed to establish high-quality links between the BS and users/target. We aim to maximize sum rate by jointly optimizing the BS transmit beamforming vectors, the covariance matrix of the dedicated radar signal, the RIS phase shift matrix, the FDA frequency offsets and the radar receive equalizer, while guaranteeing the required signal-to-clutter-plus-noise ratio (SCNR) of the radar echo signal. To tackle this challenging problem, we first theoretically prove that the dedicated radar signal is unnecessary for enhancing target sensing performance, based on which the original problem is much simplified. Then, we turn our attention to the single-user single-target (SUST) scenario to demonstrate that the FDA-RIS-aided ISAC system always achieves a higher SCNR than its PA-RIS-aided counterpart. Moreover, it is revealed that the SCNR increment exhibits linear growth with the BS transmit power and the number of BS receive antennas. In order to effectively solve this simplified problem, we leverage the fractional programming (FP) theory and subsequently develop an efficient alternating optimization (AO) algorithm based on symmetric alternating direction method of multipliers (SADMM) and successive convex approximation (SCA) techniques. Numerical results demonstrate the superior performance of our proposed algorithm in terms of sum rate and radar SCNR. △ Less

Submitted 30 September, 2024; originally announced October 2024.

Comments: 36 pages, 9 figures

arXiv:2409.14379 [pdf, other]

GroupDiff: Diffusion-based Group Portrait Editing

Authors: Yuming Jiang, Nanxuan Zhao, Qing Liu, Krishna Kumar Singh, Shuai Yang, Chen Change Loy, Ziwei Liu

Abstract: Group portrait editing is highly desirable since users constantly want to add a person, delete a person, or manipulate existing persons. It is also challenging due to the intricate dynamics of human interactions and the diverse gestures. In this work, we present GroupDiff, a pioneering effort to tackle group photo editing with three dedicated contributions: 1) Data Engine: Since there is no labele… ▽ More Group portrait editing is highly desirable since users constantly want to add a person, delete a person, or manipulate existing persons. It is also challenging due to the intricate dynamics of human interactions and the diverse gestures. In this work, we present GroupDiff, a pioneering effort to tackle group photo editing with three dedicated contributions: 1) Data Engine: Since there is no labeled data for group photo editing, we create a data engine to generate paired data for training. The training data engine covers the diverse needs of group portrait editing. 2) Appearance Preservation: To keep the appearance consistent after editing, we inject the images of persons from the group photo into the attention modules and employ skeletons to provide intra-person guidance. 3) Control Flexibility: Bounding boxes indicating the locations of each person are used to reweight the attention matrix so that the features of each person can be injected into the correct places. This inter-person guidance provides flexible manners for manipulation. Extensive experiments demonstrate that GroupDiff exhibits state-of-the-art performance compared to existing methods. GroupDiff offers controllability for editing and maintains the fidelity of the original photos. △ Less

Submitted 22 September, 2024; originally announced September 2024.

Comments: ECCV 2024

arXiv:2407.21335 [pdf, other]

doi 10.1145/3664647.3680700

On-the-fly Point Feature Representation for Point Clouds Analysis

Authors: Jiangyi Wang, Zhongyao Cheng, Na Zhao, Jun Cheng, Xulei Yang

Abstract: Point cloud analysis is challenging due to its unique characteristics of unorderness, sparsity and irregularity. Prior works attempt to capture local relationships by convolution operations or attention mechanisms, exploiting geometric information from coordinates implicitly. These methods, however, are insufficient to describe the explicit local geometry, e.g., curvature and orientation. In this… ▽ More Point cloud analysis is challenging due to its unique characteristics of unorderness, sparsity and irregularity. Prior works attempt to capture local relationships by convolution operations or attention mechanisms, exploiting geometric information from coordinates implicitly. These methods, however, are insufficient to describe the explicit local geometry, e.g., curvature and orientation. In this paper, we propose On-the-fly Point Feature Representation (OPFR), which captures abundant geometric information explicitly through Curve Feature Generator module. This is inspired by Point Feature Histogram (PFH) from computer vision community. However, the utilization of vanilla PFH encounters great difficulties when applied to large datasets and dense point clouds, as it demands considerable time for feature generation. In contrast, we introduce the Local Reference Constructor module, which approximates the local coordinate systems based on triangle sets. Owing to this, our OPFR only requires extra 1.56ms for inference (65x faster than vanilla PFH) and 0.012M more parameters, and it can serve as a versatile plug-and-play module for various backbones, particularly MLP-based and Transformer-based backbones examined in this study. Additionally, we introduce the novel Hierarchical Sampling module aimed at enhancing the quality of triangle sets, thereby ensuring robustness of the obtained geometric features. Our proposed method improves overall accuracy (OA) on ModelNet40 from 90.7% to 94.5% (+3.8%) for classification, and OA on S3DIS Area-5 from 86.4% to 90.0% (+3.6%) for semantic segmentation, respectively, building upon PointNet++ backbone. When integrated with Point Transformer backbone, we achieve state-of-the-art results on both tasks: 94.8% OA on ModelNet40 and 91.7% OA on S3DIS Area-5. △ Less

Submitted 12 August, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

Comments: Accepted by ACM MM 2024

arXiv:2407.05256 [pdf, other]

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Authors: Pengkun Jiao, Na Zhao, Jingjing Chen, Yu-Gang Jiang

Abstract: Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene. While language and vision foundation models have achieved success in handling various open-vocabulary tasks with abundant training data, OV-3DDet faces a significant challenge due to the limited availability of training data. Although some pioneer… ▽ More Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories within any new 3D scene. While language and vision foundation models have achieved success in handling various open-vocabulary tasks with abundant training data, OV-3DDet faces a significant challenge due to the limited availability of training data. Although some pioneering efforts have integrated vision-language models (VLM) knowledge into OV-3DDet learning, the full potential of these foundational models has yet to be fully exploited. In this paper, we unlock the textual and visual wisdom to tackle the open-vocabulary 3D detection task by leveraging the language and vision foundation models. We leverage a vision foundation model to provide image-wise guidance for discovering novel classes in 3D scenes. Specifically, we utilize a object detection vision foundation model to enable the zero-shot discovery of objects in images, which serves as the initial seeds and filtering guidance to identify novel 3D objects. Additionally, to align the 3D space with the powerful vision-language space, we introduce a hierarchical alignment approach, where the 3D feature space is aligned with the vision-language feature space using a pre-trained VLM at the instance, category, and scene levels. Through extensive experimentation, we demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models in advancing open-vocabulary 3D object detection in real-world scenarios. △ Less

Submitted 17 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

Comments: ECCV2024

arXiv:2407.04051 [pdf, other]

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Authors: Keyu An, Qian Chen, Chong Deng, Zhihao Du, Changfeng Gao, Zhifu Gao, Yue Gu, Ting He, Hangrui Hu, Kai Hu, Shengpeng Ji, Yabin Li, Zerui Li, Heng Lu, Haoneng Luo, Xiang Lv, Bin Ma, Ziyang Ma, Chongjia Ni, Changhe Song, Jiaqi Shi, Xian Shi, Hao Wang, Wen Wang, Yuxuan Wang , et al. (8 additional authors not shown)

Abstract: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, sp… ▽ More This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM. △ Less

Submitted 10 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: Work in progress. Authors are listed in alphabetical order by family name

arXiv:2406.16560 [pdf]

GNNTAL:A Novel Model for Identifying Critical Nodes in Complex Networks

Authors: Hao Wang, Ting Luo, Shuang-ping Yang, Ming Jing, Jian Wang, Na Zhao

Abstract: Identification of critical nodes is a prominent topic in the study of complex networks. Numerous methods have been proposed, yet most exhibit inherent limitations. Traditional approaches primarily analyze specific structural features of the network; however, node influence is typically the result of a combination of multiple factors. Machine learning-based methods struggle to effectively represent… ▽ More Identification of critical nodes is a prominent topic in the study of complex networks. Numerous methods have been proposed, yet most exhibit inherent limitations. Traditional approaches primarily analyze specific structural features of the network; however, node influence is typically the result of a combination of multiple factors. Machine learning-based methods struggle to effectively represent the complex characteristics of network structures through suitable embedding techniques and require substantial data for training, rendering them prohibitively costly for large-scale networks. To address these challenges, this paper presents an active learning model based on GraphSAGE and Transformer, named GNNTAL. This model is initially pre-trained on random or synthetic networks and subsequently fine-tuned on real-world networks by selecting a few representative nodes using K-Means clustering and uncertainty sampling. This approach offers two main advantages: (1) it significantly reduces training costs; (2) it simultaneously incorporates both local and global features. A series of comparative experiments conducted on twelve real-world networks demonstrate that GNNTAL achieves superior performance. Additionally, this paper proposes an influence maximization method based on the predictions of the GNNTAL model, which achieves optimal performance without the need for complex computations. Finally, the paper analyses certain limitations of the GNNTAL model and suggests potential solutions. △ Less

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.11311 [pdf, other]

Syn-to-Real Unsupervised Domain Adaptation for Indoor 3D Object Detection

Authors: Yunsong Wang, Na Zhao, Gim Hee Lee

Abstract: The use of synthetic data in indoor 3D object detection offers the potential of greatly reducing the manual labor involved in 3D annotations and training effective zero-shot detectors. However, the complicated domain shifts across syn-to-real indoor datasets remains underexplored. In this paper, we propose a novel Object-wise Hierarchical Domain Alignment (OHDA) framework for syn-to-real unsupervi… ▽ More The use of synthetic data in indoor 3D object detection offers the potential of greatly reducing the manual labor involved in 3D annotations and training effective zero-shot detectors. However, the complicated domain shifts across syn-to-real indoor datasets remains underexplored. In this paper, we propose a novel Object-wise Hierarchical Domain Alignment (OHDA) framework for syn-to-real unsupervised domain adaptation in indoor 3D object detection. Our approach includes an object-aware augmentation strategy to effectively diversify the source domain data, and we introduce a two-branch adaptation framework consisting of an adversarial training branch and a pseudo labeling branch, in order to simultaneously reach holistic-level and class-level domain alignment. The pseudo labeling is further refined through two proposed schemes specifically designed for indoor UDA. Our adaptation results from synthetic dataset 3D-FRONT to real-world datasets ScanNetV2 and SUN RGB-D demonstrate remarkable mAP25 improvements of 9.7% and 9.1% over Source-Only baselines, respectively, and consistently outperform the methods adapted from 2D and 3D outdoor scenarios. The code will be publicly available upon paper acceptance. △ Less

Submitted 26 August, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11283 [pdf, other]

Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding

Authors: Yunsong Wang, Na Zhao, Gim Hee Lee

Abstract: The field of self-supervised 3D representation learning has emerged as a promising solution to alleviate the challenge presented by the scarcity of extensive, well-annotated datasets. However, it continues to be hindered by the lack of diverse, large-scale, real-world 3D scene datasets for source data. To address this shortfall, we propose Generalizable Representation Learning (GRL), where we devi… ▽ More The field of self-supervised 3D representation learning has emerged as a promising solution to alleviate the challenge presented by the scarcity of extensive, well-annotated datasets. However, it continues to be hindered by the lack of diverse, large-scale, real-world 3D scene datasets for source data. To address this shortfall, we propose Generalizable Representation Learning (GRL), where we devise a generative Bayesian network to produce diverse synthetic scenes with real-world patterns, and conduct pre-training with a joint objective. By jointly learning a coarse-to-fine contrastive learning task and an occlusion-aware reconstruction task, the model is primed with transferable, geometry-informed representations. Post pre-training on synthetic data, the acquired knowledge of the model can be seamlessly transferred to two principal downstream tasks associated with 3D scene understanding, namely 3D object detection and 3D semantic segmentation, using real-world benchmark datasets. A thorough series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches. △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.09305 [pdf, other]

Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation

Authors: Yufan Zhou, Ruiyi Zhang, Kaizhi Zheng, Nanxuan Zhao, Jiuxiang Gu, Zichao Wang, Xin Eric Wang, Tong Sun

Abstract: In subject-driven text-to-image generation, recent works have achieved superior performance by training the model on synthetic datasets containing numerous image pairs. Trained on these datasets, generative models can produce text-aligned images for specific subject from arbitrary testing image in a zero-shot manner. They even outperform methods which require additional fine-tuning on testing imag… ▽ More In subject-driven text-to-image generation, recent works have achieved superior performance by training the model on synthetic datasets containing numerous image pairs. Trained on these datasets, generative models can produce text-aligned images for specific subject from arbitrary testing image in a zero-shot manner. They even outperform methods which require additional fine-tuning on testing images. However, the cost of creating such datasets is prohibitive for most researchers. To generate a single training pair, current methods fine-tune a pre-trained text-to-image model on the subject image to capture fine-grained details, then use the fine-tuned model to create images for the same subject based on creative text prompts. Consequently, constructing a large-scale dataset with millions of subjects can require hundreds of thousands of GPU hours. To tackle this problem, we propose Toffee, an efficient method to construct datasets for subject-driven editing and generation. Specifically, our dataset construction does not need any subject-level fine-tuning. After pre-training two generative models, we are able to generate infinite number of high-quality samples. We construct the first large-scale dataset for subject-driven image editing and generation, which contains 5 million image pairs, text prompts, and masks. Our dataset is 5 times the size of previous largest dataset, yet our cost is tens of thousands of GPU hours lower. To test the proposed dataset, we also propose a model which is capable of both subject-driven image editing and generation. By simply training the model on our proposed dataset, it obtains competitive results, illustrating the effectiveness of the proposed dataset construction framework. △ Less

Submitted 7 August, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.08152 [pdf, other]

CT3D++: Improving 3D Object Detection with Keypoint-induced Channel-wise Transformer

Authors: Hualian Sheng, Sijia Cai, Na Zhao, Bing Deng, Qiao Liang, Min-Jian Zhao, Jieping Ye

Abstract: The field of 3D object detection from point clouds is rapidly advancing in computer vision, aiming to accurately and efficiently detect and localize objects in three-dimensional space. Current 3D detectors commonly fall short in terms of flexibility and scalability, with ample room for advancements in performance. In this paper, our objective is to address these limitations by introducing two fram… ▽ More The field of 3D object detection from point clouds is rapidly advancing in computer vision, aiming to accurately and efficiently detect and localize objects in three-dimensional space. Current 3D detectors commonly fall short in terms of flexibility and scalability, with ample room for advancements in performance. In this paper, our objective is to address these limitations by introducing two frameworks for 3D object detection with minimal hand-crafted design. Firstly, we propose CT3D, which sequentially performs raw-point-based embedding, a standard Transformer encoder, and a channel-wise decoder for point features within each proposal. Secondly, we present an enhanced network called CT3D++, which incorporates geometric and semantic fusion-based embedding to extract more valuable and comprehensive proposal-aware information. Additionally, CT3D ++ utilizes a point-to-key bidirectional encoder for more efficient feature encoding with reduced computational cost. By replacing the corresponding components of CT3D with these novel modules, CT3D++ achieves state-of-the-art performance on both the KITTI dataset and the large-scale Way\-mo Open Dataset. The source code for our frameworks will be made accessible at https://github.com/hlsheng1/CT3D-plusplus. △ Less

Submitted 12 June, 2024; originally announced June 2024.

Comments: 19 pages, 8 figures

arXiv:2406.07670 [pdf]

Design and Control of a Compact Series Elastic Actuator Module for Robots in MRI Scanners

Authors: Binghan He, Naichen Zhao, David Y. Guo, Charles H. Paxson, Ronald S. Fearing

Abstract: In this study, we introduce a novel MRI-compatible rotary series elastic actuator module utilizing velocity-sourced ultrasonic motors for force-controlled robots operating within MRI scanners. Unlike previous MRI-compatible SEA designs, our module incorporates a transmission force sensing series elastic actuator structure, with four off-the-shelf compression springs strategically placed between th… ▽ More In this study, we introduce a novel MRI-compatible rotary series elastic actuator module utilizing velocity-sourced ultrasonic motors for force-controlled robots operating within MRI scanners. Unlike previous MRI-compatible SEA designs, our module incorporates a transmission force sensing series elastic actuator structure, with four off-the-shelf compression springs strategically placed between the gearbox housing and the motor housing. This design features a compact size, thus expanding possibilities for a wider range of MRI robotic applications. To achieve precise torque control, we develop a controller that incorporates a disturbance observer tailored for velocity-sourced motors. This controller enhances the robustness of torque control in our actuator module, even in the presence of varying external impedance, thereby augmenting its suitability for MRI-guided medical interventions. Experimental validation demonstrates the actuator's torque control performance in both 3 Tesla MRI and non-MRI environments, achieving a settling time of 0.1 seconds and a steady-state error within 2% of its maximum output torque. Notably, our force controller exhibits consistent performance across low and high external impedance scenarios, in contrast to conventional controllers for velocity-sourced series elastic actuators, which struggle with steady-state performance under low external impedance conditions. △ Less

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2405.20195 [pdf, other]

Using Large Language Models for Humanitarian Frontline Negotiation: Opportunities and Considerations

Authors: Zilin Ma, Susannah, Su, Nathan Zhao, Linn Bieske, Blake Bullwinkel, Yanyi Zhang, Sophia, Yang, Ziqing Luo, Siyao Li, Gekai Liao, Boxiang Wang, Jinglun Gao, Zihan Wen, Claude Bruderlein, Weiwei Pan

Abstract: Humanitarian negotiations in conflict zones, called \emph{frontline negotiation}, are often highly adversarial, complex, and high-risk. Several best-practices have emerged over the years that help negotiators extract insights from large datasets to navigate nuanced and rapidly evolving scenarios. Recent advances in large language models (LLMs) have sparked interest in the potential for AI to aid d… ▽ More Humanitarian negotiations in conflict zones, called \emph{frontline negotiation}, are often highly adversarial, complex, and high-risk. Several best-practices have emerged over the years that help negotiators extract insights from large datasets to navigate nuanced and rapidly evolving scenarios. Recent advances in large language models (LLMs) have sparked interest in the potential for AI to aid decision making in frontline negotiation. Through in-depth interviews with 13 experienced frontline negotiators, we identified their needs for AI-assisted case analysis and creativity support, as well as concerns surrounding confidentiality and model bias. We further explored the potential for AI augmentation of three standard tools used in frontline negotiation planning. We evaluated the quality and stability of our ChatGPT-based negotiation tools in the context of two real cases. Our findings highlight the potential for LLMs to enhance humanitarian negotiations and underscore the need for careful ethical and practical considerations. △ Less

Submitted 30 May, 2024; v1 submitted 30 May, 2024; originally announced May 2024.

arXiv:2405.16099 [pdf, other]

Improving 3D Occupancy Prediction through Class-balancing Loss and Multi-scale Representation

Authors: Huizhou Chen, Jiangyi Wang, Yuxin Li, Na Zhao, Jun Cheng, Xulei Yang

Abstract: 3D environment recognition is essential for autonomous driving systems, as autonomous vehicles require a comprehensive understanding of surrounding scenes. Recently, the predominant approach to define this real-life problem is through 3D occupancy prediction. It attempts to predict the occupancy states and semantic labels for all voxels in 3D space, which enhances the perception capability. Birds-… ▽ More 3D environment recognition is essential for autonomous driving systems, as autonomous vehicles require a comprehensive understanding of surrounding scenes. Recently, the predominant approach to define this real-life problem is through 3D occupancy prediction. It attempts to predict the occupancy states and semantic labels for all voxels in 3D space, which enhances the perception capability. Birds-Eye-View(BEV)-based perception has achieved the SOTA performance for this task. Nonetheless, this architecture fails to represent various scales of BEV features. In this paper, inspired by the success of UNet in semantic segmentation tasks, we introduce a novel UNet-like Multi-scale Occupancy Head module to relieve this issue. Furthermore, we propose the class-balancing loss to compensate for rare classes in the dataset. The experimental results on nuScenes 3D occupancy challenge dataset show the superiority of our proposed approach over baseline and SOTA methods. △ Less

Submitted 25 May, 2024; originally announced May 2024.

Comments: 5 pages, 3 figures, accepted by IEEE CAI 2024

arXiv:2405.15217 [pdf, other]

NIVeL: Neural Implicit Vector Layers for Text-to-Vector Generation

Authors: Vikas Thamizharasan, Difan Liu, Matthew Fisher, Nanxuan Zhao, Evangelos Kalogerakis, Michal Lukac

Abstract: The success of denoising diffusion models in representing rich data distributions over 2D raster images has prompted research on extending them to other data representations, such as vector graphics. Unfortunately due to their variable structure and scarcity of vector training data, directly applying diffusion models on this domain remains a challenging problem. Using workarounds like optimization… ▽ More The success of denoising diffusion models in representing rich data distributions over 2D raster images has prompted research on extending them to other data representations, such as vector graphics. Unfortunately due to their variable structure and scarcity of vector training data, directly applying diffusion models on this domain remains a challenging problem. Using workarounds like optimization via Score Distillation Sampling (SDS) is also fraught with difficulty, as vector representations are non trivial to directly optimize and tend to result in implausible geometries such as redundant or self-intersecting shapes. NIVeL addresses these challenges by reinterpreting the problem on an alternative, intermediate domain which preserves the desirable properties of vector graphics -- mainly sparsity of representation and resolution-independence. This alternative domain is based on neural implicit fields expressed in a set of decomposable, editable layers. Based on our experiments, NIVeL produces text-to-vector graphics results of significantly better quality than the state-of-the-art. △ Less

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.12469 [pdf, other]

doi 10.1145/3620665.3640403

Last-Level Cache Side-Channel Attacks Are Feasible in the Modern Public Cloud (Extended Version)

Authors: Zirui Neil Zhao, Adam Morrison, Christopher W. Fletcher, Josep Torrellas

Abstract: Last-level cache side-channel attacks have been mostly demonstrated in highly-controlled, quiescent local environments. Hence, it is unclear whether such attacks are feasible in a production cloud environment. In the cloud, side channels are flooded with noise from activities of other tenants and, in Function-as-a-Service (FaaS) workloads, the attacker has a very limited time window to mount the a… ▽ More Last-level cache side-channel attacks have been mostly demonstrated in highly-controlled, quiescent local environments. Hence, it is unclear whether such attacks are feasible in a production cloud environment. In the cloud, side channels are flooded with noise from activities of other tenants and, in Function-as-a-Service (FaaS) workloads, the attacker has a very limited time window to mount the attack. In this paper, we show that such attacks are feasible in practice, although they require new techniques. We present an end-to-end, cross-tenant attack on a vulnerable ECDSA implementation in the public FaaS Google Cloud Run environment. We introduce several new techniques to improve every step of the attack. First, to speed-up the generation of eviction sets, we introduce L2-driven candidate address filtering and a Binary Search-based algorithm for address pruning. Second, to monitor victim memory accesses with high time resolution, we introduce Parallel Probing. Finally, we leverage power spectral density from signal processing to easily identify the victim's target cache set in the frequency domain. Overall, using these mechanisms, we extract a median value of 81% of the secret ECDSA nonce bits from a victim container in 19 seconds on average. △ Less

Submitted 20 May, 2024; originally announced May 2024.

Journal ref: 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2024), Volume 2, pages 582-600, La Jolla, CA, USA, May 2024

arXiv:2405.10317 [pdf, other]

Text-to-Vector Generation with Neural Path Representation

Authors: Peiying Zhang, Nanxuan Zhao, Jing Liao

Abstract: Vector graphics are widely used in digital art and highly favored by designers due to their scalability and layer-wise properties. However, the process of creating and editing vector graphics requires creativity and design expertise, making it a time-consuming task. Recent advancements in text-to-vector (T2V) generation have aimed to make this process more accessible. However, existing T2V methods… ▽ More Vector graphics are widely used in digital art and highly favored by designers due to their scalability and layer-wise properties. However, the process of creating and editing vector graphics requires creativity and design expertise, making it a time-consuming task. Recent advancements in text-to-vector (T2V) generation have aimed to make this process more accessible. However, existing T2V methods directly optimize control points of vector graphics paths, often resulting in intersecting or jagged paths due to the lack of geometry constraints. To overcome these limitations, we propose a novel neural path representation by designing a dual-branch Variational Autoencoder (VAE) that learns the path latent space from both sequence and image modalities. By optimizing the combination of neural paths, we can incorporate geometric constraints while preserving expressivity in generated SVGs. Furthermore, we introduce a two-stage path optimization method to improve the visual and topological quality of generated SVGs. In the first stage, a pre-trained text-to-image diffusion model guides the initial generation of complex vector graphics through the Variational Score Distillation (VSD) process. In the second stage, we refine the graphics using a layer-wise image vectorization strategy to achieve clearer elements and structure. We demonstrate the effectiveness of our method through extensive experiments and showcase various applications. The project page is https://intchous.github.io/T2V-NPR. △ Less

Submitted 20 May, 2024; v1 submitted 16 May, 2024; originally announced May 2024.

Comments: Accepted by SIGGRAPH 2024. Project page: https://intchous.github.io/T2V-NPR

arXiv:2404.19702 [pdf, other]

GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting

Authors: Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, Zexiang Xu

Abstract: We propose GS-LRM, a scalable large reconstruction model that can predict high-quality 3D Gaussian primitives from 2-4 posed sparse images in 0.23 seconds on single A100 GPU. Our model features a very simple transformer-based architecture; we patchify input posed images, pass the concatenated multi-view image tokens through a sequence of transformer blocks, and decode final per-pixel Gaussian para… ▽ More We propose GS-LRM, a scalable large reconstruction model that can predict high-quality 3D Gaussian primitives from 2-4 posed sparse images in 0.23 seconds on single A100 GPU. Our model features a very simple transformer-based architecture; we patchify input posed images, pass the concatenated multi-view image tokens through a sequence of transformer blocks, and decode final per-pixel Gaussian parameters directly from these tokens for differentiable rendering. In contrast to previous LRMs that can only reconstruct objects, by predicting per-pixel Gaussians, GS-LRM naturally handles scenes with large variations in scale and complexity. We show that our model can work on both object and scene captures by training it on Objaverse and RealEstate10K respectively. In both scenarios, the models outperform state-of-the-art baselines by a wide margin. We also demonstrate applications of our model in downstream 3D generation tasks. Our project webpage is available at: https://sai-bi.github.io/project/gs-lrm/ . △ Less

Submitted 30 April, 2024; originally announced April 2024.

Comments: Project webpage: https://sai-bi.github.io/project/gs-lrm/

arXiv:2404.13522 [pdf, other]

Error Analysis of Shapley Value-Based Model Explanations: An Informative Perspective

Authors: Ningsheng Zhao, Jia Yuan Yu, Krzysztof Dzieciolowski, Trang Bui

Abstract: Shapley value attribution (SVA) is an increasingly popular explainable AI (XAI) method, which quantifies the contribution of each feature to the model's output. However, recent work has shown that most existing methods to implement SVAs have some drawbacks, resulting in biased or unreliable explanations that fail to correctly capture the true intrinsic relationships between features and model outp… ▽ More Shapley value attribution (SVA) is an increasingly popular explainable AI (XAI) method, which quantifies the contribution of each feature to the model's output. However, recent work has shown that most existing methods to implement SVAs have some drawbacks, resulting in biased or unreliable explanations that fail to correctly capture the true intrinsic relationships between features and model outputs. Moreover, the mechanism and consequences of these drawbacks have not been discussed systematically. In this paper, we propose a novel error theoretical analysis framework, in which the explanation errors of SVAs are decomposed into two components: observation bias and structural bias. We further clarify the underlying causes of these two biases and demonstrate that there is a trade-off between them. Based on this error analysis framework, we develop two novel concepts: over-informative and underinformative explanations. We demonstrate how these concepts can be effectively used to understand potential errors of existing SVA methods. In particular, for the widely deployed assumption-based SVAs, we find that they can easily be under-informative due to the distribution drift caused by distributional assumptions. We propose a measurement tool to quantify such a distribution drift. Finally, our experiments illustrate how different existing SVA methods can be over- or under-informative. Our work sheds light on how errors incur in the estimation of SVAs and encourages new less error-prone methods. △ Less

Submitted 29 May, 2024; v1 submitted 21 April, 2024; originally announced April 2024.

arXiv:2404.05717 [pdf, other]

SwapAnything: Enabling Arbitrary Object Swapping in Personalized Visual Editing

Authors: Jing Gu, Nanxuan Zhao, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Yilin Wang, Xin Eric Wang

Abstract: Effective editing of personal content holds a pivotal role in enabling individuals to express their creativity, weaving captivating narratives within their visual stories, and elevate the overall quality and impact of their visual content. Therefore, in this work, we introduce SwapAnything, a novel framework that can swap any objects in an image with personalized concepts given by the reference, w… ▽ More Effective editing of personal content holds a pivotal role in enabling individuals to express their creativity, weaving captivating narratives within their visual stories, and elevate the overall quality and impact of their visual content. Therefore, in this work, we introduce SwapAnything, a novel framework that can swap any objects in an image with personalized concepts given by the reference, while keeping the context unchanged. Compared with existing methods for personalized subject swapping, SwapAnything has three unique advantages: (1) precise control of arbitrary objects and parts rather than the main subject, (2) more faithful preservation of context pixels, (3) better adaptation of the personalized concept to the image. First, we propose targeted variable swapping to apply region control over latent feature maps and swap masked variables for faithful context preservation and initial semantic concept swapping. Then, we introduce appearance adaptation, to seamlessly adapt the semantic concept into the original image in terms of target location, shape, style, and content during the image generation process. Extensive results on both human and automatic evaluation demonstrate significant improvements of our approach over baseline methods on personalized swapping. Furthermore, SwapAnything shows its precise and faithful swapping abilities across single object, multiple objects, partial object, and cross-domain swapping tasks. SwapAnything also achieves great performance on text-based swapping and tasks beyond swapping such as object insertion. △ Less

Submitted 3 October, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

Comments: ECCV 2024, 23 pages, 14 figures, 3 tables

arXiv:2403.11868 [pdf, other]

View-Consistent 3D Editing with Gaussian Splatting

Authors: Yuxuan Wang, Xuanyu Yi, Zike Wu, Na Zhao, Long Chen, Hanwang Zhang

Abstract: The advent of 3D Gaussian Splatting (3DGS) has revolutionized 3D editing, offering efficient, high-fidelity rendering and enabling precise local manipulations. Currently, diffusion-based 2D editing models are harnessed to modify multi-view rendered images, which then guide the editing of 3DGS models. However, this approach faces a critical issue of multi-view inconsistency, where the guidance imag… ▽ More The advent of 3D Gaussian Splatting (3DGS) has revolutionized 3D editing, offering efficient, high-fidelity rendering and enabling precise local manipulations. Currently, diffusion-based 2D editing models are harnessed to modify multi-view rendered images, which then guide the editing of 3DGS models. However, this approach faces a critical issue of multi-view inconsistency, where the guidance images exhibit significant discrepancies across views, leading to mode collapse and visual artifacts of 3DGS. To this end, we introduce View-consistent Editing (VcEdit), a novel framework that seamlessly incorporates 3DGS into image editing processes, ensuring multi-view consistency in edited guidance images and effectively mitigating mode collapse issues. VcEdit employs two innovative consistency modules: the Cross-attention Consistency Module and the Editing Consistency Module, both designed to reduce inconsistencies in edited images. By incorporating these consistency modules into an iterative pattern, VcEdit proficiently resolves the issue of multi-view inconsistency, facilitating high-quality 3DGS editing across a diverse range of scenes. Further video results are shown in http://vcedit.github.io. △ Less

Submitted 23 August, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

Comments: accepted to ECCV 2024

arXiv:2403.00644 [pdf, other]

Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks

Authors: Yuhao Liu, Zhanghan Ke, Fang Liu, Nanxuan Zhao, Rynson W. H. Lau

Abstract: Diffusion models trained on large-scale datasets have achieved remarkable progress in image synthesis. However, due to the randomness in the diffusion process, they often struggle with handling diverse low-level tasks that require details preservation. To overcome this limitation, we present a new Diff-Plugin framework to enable a single pre-trained diffusion model to generate high-fidelity result… ▽ More Diffusion models trained on large-scale datasets have achieved remarkable progress in image synthesis. However, due to the randomness in the diffusion process, they often struggle with handling diverse low-level tasks that require details preservation. To overcome this limitation, we present a new Diff-Plugin framework to enable a single pre-trained diffusion model to generate high-fidelity results across a variety of low-level tasks. Specifically, we first propose a lightweight Task-Plugin module with a dual branch design to provide task-specific priors, guiding the diffusion process in preserving image content. We then propose a Plugin-Selector that can automatically select different Task-Plugins based on the text instruction, allowing users to edit images by indicating multiple low-level tasks with natural language. We conduct extensive experiments on 8 low-level vision tasks. The results demonstrate the superiority of Diff-Plugin over existing methods, particularly in real-world scenarios. Our ablations further validate that Diff-Plugin is stable, schedulable, and supports robust training across different dataset sizes. △ Less

Submitted 28 May, 2024; v1 submitted 1 March, 2024; originally announced March 2024.

Comments: Accepted to CVPR2024. Replaced some celebrity images to avoid copyright disputes

arXiv:2403.00095 [pdf]

Solving Jigsaw Puzzles using Iterative Random Sampling: Parallels with Development of Skill Mastery

Authors: Neil Zhao, Diana Zheng

Abstract: Skill mastery is a priority for success in all fields. We present a parallel between the development of skill mastery and the process of solving jigsaw puzzles. We show that iterative random sampling solves jigsaw puzzles in two phases: a lag phase that is characterized by little change and occupies the majority of the time, and a growth phase that marks rapid and imminent puzzle completion. Chang… ▽ More Skill mastery is a priority for success in all fields. We present a parallel between the development of skill mastery and the process of solving jigsaw puzzles. We show that iterative random sampling solves jigsaw puzzles in two phases: a lag phase that is characterized by little change and occupies the majority of the time, and a growth phase that marks rapid and imminent puzzle completion. Changes in the proportions of the number of single pieces and larger pieces can be overlaid on the timeline and progression of skill mastery. An emphasis is placed on the development of connections between pieces, which serves as an indicator of increasing puzzle completion and increasing skill mastery. Our manuscript provides a straightforward visual of skill mastery in the context of a common recreational activity. △ Less

Submitted 29 February, 2024; originally announced March 2024.

Comments: 26 pages, 15 figures, 1 table

arXiv:2402.03549 [pdf]

AnaMoDiff: 2D Analogical Motion Diffusion via Disentangled Denoising

Authors: Maham Tanveer, Yizhi Wang, Ruiqi Wang, Nanxuan Zhao, Ali Mahdavi-Amiri, Hao Zhang

Abstract: We present AnaMoDiff, a novel diffusion-based method for 2D motion analogies that is applied to raw, unannotated videos of articulated characters. Our goal is to accurately transfer motions from a 2D driving video onto a source character, with its identity, in terms of appearance and natural movement, well preserved, even when there may be significant discrepancies between the source and driving c… ▽ More We present AnaMoDiff, a novel diffusion-based method for 2D motion analogies that is applied to raw, unannotated videos of articulated characters. Our goal is to accurately transfer motions from a 2D driving video onto a source character, with its identity, in terms of appearance and natural movement, well preserved, even when there may be significant discrepancies between the source and driving characters in their part proportions and movement speed and styles. Our diffusion model transfers the input motion via a latent optical flow (LOF) network operating in a noised latent space, which is spatially aware, efficient to process compared to the original RGB videos, and artifact-resistant through the diffusion denoising process even amid dense movements. To accomplish both motion analogy and identity preservation, we train our denoising model in a feature-disentangled manner, operating at two noise levels. While identity-revealing features of the source are learned via conventional noise injection, motion features are learned from LOF-warped videos by only injecting noise with large values, with the stipulation that motion properties involving pose and limbs are encoded by higher-level features. Experiments demonstrate that our method achieves the best trade-off between motion analogy and identity preservation. △ Less

Submitted 5 February, 2024; originally announced February 2024.

arXiv:2401.15560 [pdf]

An Analysis of Letter Dynamics in the English Alphabet

Authors: Neil Zhao, Diana Zheng

Abstract: The frequency with which the letters of the English alphabet appear in writings has been applied to the field of cryptography, the development of keyboard mechanics, and the study of linguistics. We expanded on the statistical analysis of the English alphabet by examining the average frequency which each letter appears in different categories of writings. We evaluated news articles, novels, plays,… ▽ More The frequency with which the letters of the English alphabet appear in writings has been applied to the field of cryptography, the development of keyboard mechanics, and the study of linguistics. We expanded on the statistical analysis of the English alphabet by examining the average frequency which each letter appears in different categories of writings. We evaluated news articles, novels, plays, scientific publications and calculated the frequency of each letter of the alphabet, the information density of each letter, and the overall letter distribution. Furthermore, we developed a metric known as distance, d that can be used to algorithmically recognize different categories of writings. The results of our study can be applied to information transmission, large data curation, and linguistics. △ Less

Submitted 27 January, 2024; originally announced January 2024.

Comments: 22 pages, 6 figures, 5 tables

MSC Class: 94A15

arXiv:2401.12023 [pdf]

A Simulation of Optimal Dryness When Moving in the Rain or Snow Using MATLAB

Authors: Neil Zhao, Emilee Brockner, Asia Winslow, Megan Seraydarian

Abstract: The classic question of whether one should walk or run in the rain to remain the least wet has inspired a myriad of solutions ranging from physically performing test runs in raining conditions to mathematically modeling human movement through rain. This manuscript approaches the classical problem by simulating movement through rainfall using MATLAB. Our simulation was generalizable to include snow… ▽ More The classic question of whether one should walk or run in the rain to remain the least wet has inspired a myriad of solutions ranging from physically performing test runs in raining conditions to mathematically modeling human movement through rain. This manuscript approaches the classical problem by simulating movement through rainfall using MATLAB. Our simulation was generalizable to include snowfall as well. An increase in walking speed resulted in a corresponding decrease in raindrop and snowflake collisions. When raindrops or snowflakes were given a horizontal movement vector due to wind, a local minimum in collisions was achieved when moving in parallel with the same horizontal speed as the raindrop; no local minimum was detected with antiparallel movement. In general, our simulation revealed that the faster one moves, the drier one remains. △ Less

Submitted 22 January, 2024; originally announced January 2024.

Comments: 15 pages, 9 figures

MSC Class: 68U20

arXiv:2401.05011 [pdf, other]

Dual-Perspective Knowledge Enrichment for Semi-Supervised 3D Object Detection

Authors: Yucheng Han, Na Zhao, Weiling Chen, Keng Teck Ma, Hanwang Zhang

Abstract: Semi-supervised 3D object detection is a promising yet under-explored direction to reduce data annotation costs, especially for cluttered indoor scenes. A few prior works, such as SESS and 3DIoUMatch, attempt to solve this task by utilizing a teacher model to generate pseudo-labels for unlabeled samples. However, the availability of unlabeled samples in the 3D domain is relatively limited compared… ▽ More Semi-supervised 3D object detection is a promising yet under-explored direction to reduce data annotation costs, especially for cluttered indoor scenes. A few prior works, such as SESS and 3DIoUMatch, attempt to solve this task by utilizing a teacher model to generate pseudo-labels for unlabeled samples. However, the availability of unlabeled samples in the 3D domain is relatively limited compared to its 2D counterpart due to the greater effort required to collect 3D data. Moreover, the loose consistency regularization in SESS and restricted pseudo-label selection strategy in 3DIoUMatch lead to either low-quality supervision or a limited amount of pseudo labels. To address these issues, we present a novel Dual-Perspective Knowledge Enrichment approach named DPKE for semi-supervised 3D object detection. Our DPKE enriches the knowledge of limited training data, particularly unlabeled data, from two perspectives: data-perspective and feature-perspective. Specifically, from the data-perspective, we propose a class-probabilistic data augmentation method that augments the input data with additional instances based on the varying distribution of class probabilities. Our DPKE achieves feature-perspective knowledge enrichment by designing a geometry-aware feature matching method that regularizes feature-level similarity between object proposals from the student and teacher models. Extensive experiments on the two benchmark datasets demonstrate that our DPKE achieves superior performance over existing state-of-the-art approaches under various label ratio conditions. The source code will be made available to the public. △ Less

Submitted 10 January, 2024; originally announced January 2024.

Comments: Code is available at https://github.com/tingxueronghua/DPKE

arXiv:2312.14216 [pdf, other]

DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models

Authors: Brian Nlong Zhao, Yuhang Xiao, Jiashu Xu, Xinyang Jiang, Yifan Yang, Dongsheng Li, Laurent Itti, Vibhav Vineet, Yunhao Ge

Abstract: The popularization of Text-to-Image (T2I) diffusion models enables the generation of high-quality images from text descriptions. However, generating diverse customized images with reference visual attributes remains challenging. This work focuses on personalizing T2I diffusion models at a more abstract concept or category level, adapting commonalities from a set of reference images while creating… ▽ More The popularization of Text-to-Image (T2I) diffusion models enables the generation of high-quality images from text descriptions. However, generating diverse customized images with reference visual attributes remains challenging. This work focuses on personalizing T2I diffusion models at a more abstract concept or category level, adapting commonalities from a set of reference images while creating new instances with sufficient variations. We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts, enabling the generation of novel images by sampling prompts from the learned distribution. These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions. We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D. Finally we demonstrate effectiveness of our approach through quantitative analysis including automatic evaluation and human assessment. Project website: https://briannlongzhao.github.io/DreamDistribution △ Less

Submitted 21 December, 2023; originally announced December 2023.

arXiv:2312.11306 [pdf]

Human-machine cooperation: optimization of drug retrieval sequencing in automated drug dispensing systems

Authors: Mengge Yuan, Kan Wu, Ning Zhao

Abstract: Automated drug dispensing systems (ADDSs) are increasingly in demand in today's pharmacies, primarily driven by the growing ageing population. Recognizing the practical challenges faced by pharmacies implementing ADDSs, this study aims to optimize the layout design and sequencing issues within a human-machine cooperation environment to enhance the system throughput of ADDSs. Specifically, we devel… ▽ More Automated drug dispensing systems (ADDSs) are increasingly in demand in today's pharmacies, primarily driven by the growing ageing population. Recognizing the practical challenges faced by pharmacies implementing ADDSs, this study aims to optimize the layout design and sequencing issues within a human-machine cooperation environment to enhance the system throughput of ADDSs. Specifically, we develop models for drug retrieval sequencing under different system layout designs, taking into account the stochastic sorting time of pharmacists. The prescription order arrival pattern follows a successive arrival mode. To assess the efficiency of ADDSs with one input/output point and two input/output points, we propose dual command retrieval sequencing models that optimize the retrieval sequence of drugs in adjacent prescription orders. Notably, our models incorporate the stochastic sorting time of pharmacists to analyze its impact on ADDS performance. Through experimental comparisons of average picking times for prescription orders under various operational conditions, we demonstrate that a system layout design incorporating two input/output points significantly enhances the efficiency of prescription order fulfilment within a human-machine cooperation environment. Furthermore, our proposed retrieval sequencing method outperforms dynamic programming, greedy, and random strategies in terms of improving prescription order-picking efficiency. By addressing the layout design and sequencing challenges, our research contributes to the field of intelligent warehousing, particularly in smart pharmacies. The findings provide valuable insights for healthcare facilities and organizations seeking to optimize ADDS performance and enhance drug dispensing efficiency. △ Less

Submitted 16 January, 2024; v1 submitted 18 December, 2023; originally announced December 2023.

arXiv:2312.10078 [pdf, other]

Early ChatGPT User Portrait through the Lens of Data

Authors: Yuyang Deng, Ni Zhao, Xin Huang

Abstract: Since its launch, ChatGPT has achieved remarkable success as a versatile conversational AI platform, drawing millions of users worldwide and garnering widespread recognition across academic, industrial, and general communities. This paper aims to point a portrait of early GPT users and understand how they evolved. Specific questions include their topics of interest and their potential careers; and… ▽ More Since its launch, ChatGPT has achieved remarkable success as a versatile conversational AI platform, drawing millions of users worldwide and garnering widespread recognition across academic, industrial, and general communities. This paper aims to point a portrait of early GPT users and understand how they evolved. Specific questions include their topics of interest and their potential careers; and how this changes over time. We conduct a detailed analysis of real-world ChatGPT datasets with multi-turn conversations between users and ChatGPT. Through a multi-pronged approach, we quantify conversation dynamics by examining the number of turns, then gauge sentiment to understand user sentiment variations, and finally employ Latent Dirichlet Allocation (LDA) to discern overarching topics within the conversation. By understanding shifts in user demographics and interests, we aim to shed light on the changing nature of human-AI interaction and anticipate future trends in user engagement with language models. △ Less

Submitted 10 December, 2023; originally announced December 2023.

Comments: 6 pages, 5 figures, 2023 IEEE International Conference on Big Data (BigData), to be published

Report number: SP02207

arXiv:2312.09336 [pdf, other]

doi 10.1145/3576915.3623065

DECLASSIFLOW: A Static Analysis for Modeling Non-Speculative Knowledge to Relax Speculative Execution Security Measures (Full Version)

Authors: Rutvik Choudhary, Alan Wang, Zirui Neil Zhao, Adam Morrison, Christopher W. Fletcher

Abstract: Speculative execution attacks undermine the security of constant-time programming, the standard technique used to prevent microarchitectural side channels in security-sensitive software such as cryptographic code. Constant-time code must therefore also deploy a defense against speculative execution attacks to prevent leakage of secret data stored in memory or the processor registers. Unfortunately… ▽ More Speculative execution attacks undermine the security of constant-time programming, the standard technique used to prevent microarchitectural side channels in security-sensitive software such as cryptographic code. Constant-time code must therefore also deploy a defense against speculative execution attacks to prevent leakage of secret data stored in memory or the processor registers. Unfortunately, contemporary defenses, such as speculative load hardening (SLH), can only satisfy this strong security guarantee at a very high performance cost. This paper proposes DECLASSIFLOW, a static program analysis and protection framework to efficiently protect constant-time code from speculative leakage. DECLASSIFLOW models "attacker knowledge" -- data which is inherently transmitted (or, implicitly declassified) by the code's non-speculative execution -- and statically removes protection on such data from points in the program where it is already guaranteed to leak non-speculatively. Overall, DECLASSIFLOW ensures that data which never leaks during the non-speculative execution does not leak during speculative execution, but with lower overhead than conservative protections like SLH. △ Less

Submitted 14 December, 2023; originally announced December 2023.

Journal ref: In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS '23). Association for Computing Machinery, New York, NY, USA, 2053-2067

arXiv:2312.06488 [pdf, other]

Performance-lossless Black-box Model Watermarking

Authors: Na Zhao, Kejiang Chen, Weiming Zhang, Nenghai Yu

Abstract: With the development of deep learning, high-value and high-cost models have become valuable assets, and related intellectual property protection technologies have become a hot topic. However, existing model watermarking work in black-box scenarios mainly originates from training-based backdoor methods, which probably degrade primary task performance. To address this, we propose a branch backdoor-b… ▽ More With the development of deep learning, high-value and high-cost models have become valuable assets, and related intellectual property protection technologies have become a hot topic. However, existing model watermarking work in black-box scenarios mainly originates from training-based backdoor methods, which probably degrade primary task performance. To address this, we propose a branch backdoor-based model watermarking protocol to protect model intellectual property, where a construction based on a message authentication scheme is adopted as the branch indicator after a comparative analysis with secure cryptographic technologies primitives. We prove the lossless performance of the protocol by reduction. In addition, we analyze the potential threats to the protocol and provide a secure and feasible watermarking instance for language models. △ Less

Submitted 14 April, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

arXiv:2311.15302 [pdf]

A Quick Response Algorithm for Dynamic Autonomous Mobile Robot Routing Problem with Time Windows

Authors: Lulu Cheng, Ning Zhao, Mengge Yuan, Kan Wu

Abstract: This paper investigates the optimization problem of scheduling autonomous mobile robots (AMRs) in hospital settings, considering dynamic requests with different priorities. The primary objective is to minimize the daily service cost by dynamically planning routes for the limited number of available AMRs. The total cost consists of AMR's purchase cost, transportation cost, delay penalty cost, and l… ▽ More This paper investigates the optimization problem of scheduling autonomous mobile robots (AMRs) in hospital settings, considering dynamic requests with different priorities. The primary objective is to minimize the daily service cost by dynamically planning routes for the limited number of available AMRs. The total cost consists of AMR's purchase cost, transportation cost, delay penalty cost, and loss of denial of service. To address this problem, we have established a two-stage mathematical programming model. In the first stage, a tabu search algorithm is employed to plan prior routes for all known medical requests. The second stage involves planning for real-time received dynamic requests using the efficient insertion algorithm with decision rules, which enables quick response based on the time window and demand constraints of the dynamic requests. One of the main contributions of this study is to make resource allocation decisions based on the present number of service AMRs for dynamic requests with different priorities. Computational experiments using Lackner instances demonstrate the efficient insertion algorithm with decision rules is very fast and robust in solving the dynamic AMR routing problem with time windows and request priority. Additionally, we provide managerial insights concerning the AMR's safety stock settings, which can aid in decision-making processes. △ Less

Submitted 26 November, 2023; originally announced November 2023.

arXiv:2311.11574 [pdf, other]

A Framework on Complex Matrix Derivatives with Special Structure Constraints for Wireless Systems

Authors: Xin Ju, Shiqi Gong, Nan Zhao, Chengwen Xing, Arumugam Nallanathan, Dusit Niyato

Abstract: Matrix-variate optimization plays a central role in advanced wireless system designs. In this paper, we aim to explore optimal solutions of matrix variables under two special structure constraints using complex matrix derivatives, including diagonal structure constraints and constant modulus constraints, both of which are closely related to the state-of-the-art wireless applications. Specifically,… ▽ More Matrix-variate optimization plays a central role in advanced wireless system designs. In this paper, we aim to explore optimal solutions of matrix variables under two special structure constraints using complex matrix derivatives, including diagonal structure constraints and constant modulus constraints, both of which are closely related to the state-of-the-art wireless applications. Specifically, for diagonal structure constraints mostly considered in the uplink multi-user single-input multiple-output (MU-SIMO) system and the amplitude-adjustable intelligent reflecting surface (IRS)-aided multiple-input multiple-output (MIMO) system, the capacity maximization problem, the mean-squared error (MSE) minimization problem and their variants are rigorously investigated. By leveraging complex matrix derivatives, the optimal solutions of these problems are directly obtained in closed forms. Nevertheless, for constant modulus constraints with the intrinsic nature of element-wise decomposability, which are often seen in the hybrid analog-digital MIMO system and the fully-passive IRS-aided MIMO system, we firstly explore inherent structures of the element-wise phase derivatives associated with different optimization problems. Then, we propose a novel alternating optimization (AO) algorithm with the aid of several arbitrary feasible solutions, which avoids the complicated matrix inversion and matrix factorization involved in conventional element-wise iterative algorithms. Numerical simulations reveal that the proposed algorithm can dramatically reduce the computational complexity without loss of system performance. △ Less

Submitted 20 November, 2023; originally announced November 2023.

arXiv:2310.17327 [pdf, ps, other]

Near-Field Positioning and Attitude Sensing Based on Electromagnetic Propagation Modeling

Authors: Ang Chen, Li Chen, Yunfei Chen, Nan Zhao, Changsheng You

Abstract: Positioning and sensing over wireless networks are imperative for many emerging applications. However, since traditional wireless channel models over-simplify the user equipment (UE) as a point target, they cannot be used for sensing the attitude of the UE, which is typically described by the spatial orientation. In this paper, a comprehensive electromagnetic propagation modeling (EPM) based on el… ▽ More Positioning and sensing over wireless networks are imperative for many emerging applications. However, since traditional wireless channel models over-simplify the user equipment (UE) as a point target, they cannot be used for sensing the attitude of the UE, which is typically described by the spatial orientation. In this paper, a comprehensive electromagnetic propagation modeling (EPM) based on electromagnetic theory is developed to precisely model the near-field channel. For the noise-free case, the EPM model establishes the non-linear functional dependence of observed signals on both the position and attitude of the UE. To address the difficulty in the non-linear coupling, we first propose to divide the distance domain into three regions, separated by the defined Phase ambiguity distance and Spacing constraint distance. Then, for each region, we obtain the closed-form solutions for joint position and attitude estimation with low complexity. Next, to investigate the impact of random noise on the joint estimation performance, the Ziv-Zakai bound (ZZB) is derived to yield useful insights. The expected Cramér-Rao bound (ECRB) is further provided to obtain the simplified closed-form expressions for the performance lower bounds. Our numerical results demonstrate that the derived ZZB can provide accurate predictions of the performance of estimators in all signal-to-noise ratio (SNR) regimes. More importantly, we achieve the millimeter-level accuracy in position estimation and attain the 0.1-level accuracy in attitude estimation. △ Less

Submitted 13 May, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

Comments: 19 pages, 13 figures. Accepted by IEEE Journal on Selected Areas in Communications

arXiv:2310.13730 [pdf, other]

Localizing and Editing Knowledge in Text-to-Image Generative Models

Authors: Samyadeep Basu, Nanxuan Zhao, Vlad Morariu, Soheil Feizi, Varun Manjunatha

Abstract: Text-to-Image Diffusion Models such as Stable-Diffusion and Imagen have achieved unprecedented quality of photorealism with state-of-the-art FID scores on MS-COCO and other generation benchmarks. Given a caption, image generation requires fine-grained knowledge about attributes such as object structure, style, and viewpoint amongst others. Where does this information reside in text-to-image genera… ▽ More Text-to-Image Diffusion Models such as Stable-Diffusion and Imagen have achieved unprecedented quality of photorealism with state-of-the-art FID scores on MS-COCO and other generation benchmarks. Given a caption, image generation requires fine-grained knowledge about attributes such as object structure, style, and viewpoint amongst others. Where does this information reside in text-to-image generative models? In our paper, we tackle this question and understand how knowledge corresponding to distinct visual attributes is stored in large-scale text-to-image diffusion models. We adapt Causal Mediation Analysis for text-to-image models and trace knowledge about distinct visual attributes to various (causal) components in the (i) UNet and (ii) text-encoder of the diffusion model. In particular, we show that unlike generative large-language models, knowledge about different attributes is not localized in isolated components, but is instead distributed amongst a set of components in the conditional UNet. These sets of components are often distinct for different visual attributes. Remarkably, we find that the CLIP text-encoder in public text-to-image models such as Stable-Diffusion contains only one causal state across different visual attributes, and this is the first self-attention layer corresponding to the last subject token of the attribute in the caption. This is in stark contrast to the causal states in other language models which are often the mid-MLP layers. Based on this observation of only one causal state in the text-encoder, we introduce a fast, data-free model editing method Diff-QuickFix which can effectively edit concepts in text-to-image models. DiffQuickFix can edit (ablate) concepts in under a second with a closed-form update, providing a significant 1000x speedup and comparable editing performance to existing fine-tuning based editing methods. △ Less

Submitted 20 October, 2023; originally announced October 2023.

Comments: 61 pages

arXiv:2309.12318 [pdf]

Stochastic scheduling of autonomous mobile robots at hospitals

Authors: Lulu Cheng, Ning Zhao, Mengge Yuan, Kan Wu

Abstract: This paper studies the scheduling of autonomous mobile robots (AMRs) at hospitals where the stochastic travel times and service times of AMRs are affected by the surrounding environment. The routes of AMRs are planned to minimize the daily cost of the hospital (including the AMR fixed cost, penalty cost of violating the time window, and transportation cost). To efficiently generate high-quality so… ▽ More This paper studies the scheduling of autonomous mobile robots (AMRs) at hospitals where the stochastic travel times and service times of AMRs are affected by the surrounding environment. The routes of AMRs are planned to minimize the daily cost of the hospital (including the AMR fixed cost, penalty cost of violating the time window, and transportation cost). To efficiently generate high-quality solutions, some properties are identified and incorporated into an improved tabu search (I-TS) algorithm for problem-solving. Experimental evaluations demonstrate that the I-TS algorithm outperforms existing methods by producing high-quality solutions. Based on the characteristics of healthcare requests and the AMR working environment, scheduling AMRs reasonably can effectively provide medical services, improve the utilization of medical resources, and reduce hospital costs. △ Less

Submitted 23 November, 2023; v1 submitted 30 July, 2023; originally announced September 2023.

arXiv:2309.12302 [pdf, other]

Text-Guided Vector Graphics Customization

Authors: Peiying Zhang, Nanxuan Zhao, Jing Liao

Abstract: Vector graphics are widely used in digital art and valued by designers for their scalability and layer-wise topological properties. However, the creation and editing of vector graphics necessitate creativity and design expertise, leading to a time-consuming process. In this paper, we propose a novel pipeline that generates high-quality customized vector graphics based on textual prompts while pres… ▽ More Vector graphics are widely used in digital art and valued by designers for their scalability and layer-wise topological properties. However, the creation and editing of vector graphics necessitate creativity and design expertise, leading to a time-consuming process. In this paper, we propose a novel pipeline that generates high-quality customized vector graphics based on textual prompts while preserving the properties and layer-wise information of a given exemplar SVG. Our method harnesses the capabilities of large pre-trained text-to-image models. By fine-tuning the cross-attention layers of the model, we generate customized raster images guided by textual prompts. To initialize the SVG, we introduce a semantic-based path alignment method that preserves and transforms crucial paths from the exemplar SVG. Additionally, we optimize path parameters using both image-level and vector-level losses, ensuring smooth shape deformation while aligning with the customized raster image. We extensively evaluate our method using multiple metrics from vector-level, image-level, and text-level perspectives. The evaluation results demonstrate the effectiveness of our pipeline in generating diverse customizations of vector graphics with exceptional quality. The project page is https://intchous.github.io/SVGCustomization. △ Less

Submitted 21 September, 2023; originally announced September 2023.

Comments: Accepted by SIGGRAPH Asia 2023. Project page: https://intchous.github.io/SVGCustomization

arXiv:2309.11228 [pdf, other]

Towards Robust Few-shot Point Cloud Semantic Segmentation

Authors: Yating Xu, Na Zhao, Gim Hee Lee

Abstract: Few-shot point cloud semantic segmentation aims to train a model to quickly adapt to new unseen classes with only a handful of support set samples. However, the noise-free assumption in the support set can be easily violated in many practical real-world settings. In this paper, we focus on improving the robustness of few-shot point cloud segmentation under the detrimental influence of noisy suppor… ▽ More Few-shot point cloud semantic segmentation aims to train a model to quickly adapt to new unseen classes with only a handful of support set samples. However, the noise-free assumption in the support set can be easily violated in many practical real-world settings. In this paper, we focus on improving the robustness of few-shot point cloud segmentation under the detrimental influence of noisy support sets during testing time. To this end, we first propose a Component-level Clean Noise Separation (CCNS) representation learning to learn discriminative feature representations that separates the clean samples of the target classes from the noisy samples. Leveraging the well separated clean and noisy support samples from our CCNS, we further propose a Multi-scale Degree-based Noise Suppression (MDNS) scheme to remove the noisy shots from the support set. We conduct extensive experiments on various noise settings on two benchmark datasets. Our results show that the combination of CCNS and MDNS significantly improves the performance. Our code is available at https://github.com/Pixie8888/R3DFSSeg. △ Less

Submitted 20 September, 2023; originally announced September 2023.

Comments: BMVC 2023

arXiv:2309.11222 [pdf, other]

Generalized Few-Shot Point Cloud Segmentation Via Geometric Words

Authors: Yating Xu, Conghui Hu, Na Zhao, Gim Hee Lee

Abstract: Existing fully-supervised point cloud segmentation methods suffer in the dynamic testing environment with emerging new classes. Few-shot point cloud segmentation algorithms address this problem by learning to adapt to new classes at the sacrifice of segmentation accuracy for the base classes, which severely impedes its practicality. This largely motivates us to present the first attempt at a more… ▽ More Existing fully-supervised point cloud segmentation methods suffer in the dynamic testing environment with emerging new classes. Few-shot point cloud segmentation algorithms address this problem by learning to adapt to new classes at the sacrifice of segmentation accuracy for the base classes, which severely impedes its practicality. This largely motivates us to present the first attempt at a more practical paradigm of generalized few-shot point cloud segmentation, which requires the model to generalize to new categories with only a few support point clouds and simultaneously retain the capability to segment base classes. We propose the geometric words to represent geometric components shared between the base and novel classes, and incorporate them into a novel geometric-aware semantic representation to facilitate better generalization to the new classes without forgetting the old ones. Moreover, we introduce geometric prototypes to guide the segmentation with geometric prior knowledge. Extensive experiments on S3DIS and ScanNet consistently illustrate the superior performance of our method over baseline methods. Our code is available at: https://github.com/Pixie8888/GFS-3DSeg_GWs. △ Less

Submitted 20 September, 2023; originally announced September 2023.

Comments: Accepted by ICCV 2023

arXiv:2309.05956 [pdf, other]

Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation

Authors: Yunhao Ge, Jiashu Xu, Brian Nlong Zhao, Neel Joshi, Laurent Itti, Vibhav Vineet

Abstract: We propose a new paradigm to automatically generate training data with accurate labels at scale using the text-to-image synthesis frameworks (e.g., DALL-E, Stable Diffusion, etc.). The proposed approach1 decouples training data generation into foreground object generation, and contextually coherent background generation. To generate foreground objects, we employ a straightforward textual template,… ▽ More We propose a new paradigm to automatically generate training data with accurate labels at scale using the text-to-image synthesis frameworks (e.g., DALL-E, Stable Diffusion, etc.). The proposed approach1 decouples training data generation into foreground object generation, and contextually coherent background generation. To generate foreground objects, we employ a straightforward textual template, incorporating the object class name as input prompts. This is fed into a text-to-image synthesis framework, producing various foreground images set against isolated backgrounds. A foreground-background segmentation algorithm is then used to generate foreground object masks. To generate context images, we begin by creating language descriptions of the context. This is achieved by applying an image captioning method to a small set of images representing the desired context. These textual descriptions are then transformed into a diverse array of context images via a text-to-image synthesis framework. Subsequently, we composite these with the foreground object masks produced in the initial step, utilizing a cut-and-paste method, to formulate the training data. We demonstrate the advantages of our approach on five object detection and segmentation datasets, including Pascal VOC and COCO. We found that detectors trained solely on synthetic data produced by our method achieve performance comparable to those trained on real data (Fig. 1). Moreover, a combination of real and synthetic data yields even much better results. Further analysis indicates that the synthetic data distribution complements the real data distribution effectively. Additionally, we emphasize the compositional nature of our data generation approach in out-of-distribution and zero-shot data generation scenarios. We open-source our code at https://github.com/gyhandy/Text2Image-for-Detection △ Less

Submitted 12 September, 2023; originally announced September 2023.

Comments: Code in https://github.com/gyhandy/Text2Image-for-Detection

arXiv:2308.12163 [pdf, other]

NPF-200: A Multi-Modal Eye Fixation Dataset and Method for Non-Photorealistic Videos

Authors: Ziyu Yang, Sucheng Ren, Zongwei Wu, Nanxuan Zhao, Junle Wang, Jing Qin, Shengfeng He

Abstract: Non-photorealistic videos are in demand with the wave of the metaverse, but lack of sufficient research studies. This work aims to take a step forward to understand how humans perceive non-photorealistic videos with eye fixation (\ie, saliency detection), which is critical for enhancing media production, artistic design, and game user experience. To fill in the gap of missing a suitable dataset fo… ▽ More Non-photorealistic videos are in demand with the wave of the metaverse, but lack of sufficient research studies. This work aims to take a step forward to understand how humans perceive non-photorealistic videos with eye fixation (\ie, saliency detection), which is critical for enhancing media production, artistic design, and game user experience. To fill in the gap of missing a suitable dataset for this research line, we present NPF-200, the first large-scale multi-modal dataset of purely non-photorealistic videos with eye fixations. Our dataset has three characteristics: 1) it contains soundtracks that are essential according to vision and psychological studies; 2) it includes diverse semantic content and videos are of high-quality; 3) it has rich motions across and within videos. We conduct a series of analyses to gain deeper insights into this task and compare several state-of-the-art methods to explore the gap between natural images and non-photorealistic data. Additionally, as the human attention system tends to extract visual and audio features with different frequencies, we propose a universal frequency-aware multi-modal non-photorealistic saliency detection model called NPSNet, demonstrating the state-of-the-art performance of our task. The results uncover strengths and weaknesses of multi-modal network design and multi-domain training, opening up promising directions for future works. {Our dataset and code can be found at \url{https://github.com/Yangziyu/NPF200}}. △ Less

Submitted 23 August, 2023; originally announced August 2023.

Comments: Accepted by ACM MM 2023

arXiv:2308.06928 [pdf, other]

Refining 6-DoF Grasps with Context-Specific Classifiers

Authors: Tasbolat Taunyazov, Heng Zhang, John Patrick Eala, Na Zhao, Harold Soh

Abstract: In this work, we present GraspFlow, a refinement approach for generating context-specific grasps. We formulate the problem of grasp synthesis as a sampling problem: we seek to sample from a context-conditioned probability distribution of successful grasps. However, this target distribution is unknown. As a solution, we devise a discriminator gradient-flow method to evolve grasps obtained from a si… ▽ More In this work, we present GraspFlow, a refinement approach for generating context-specific grasps. We formulate the problem of grasp synthesis as a sampling problem: we seek to sample from a context-conditioned probability distribution of successful grasps. However, this target distribution is unknown. As a solution, we devise a discriminator gradient-flow method to evolve grasps obtained from a simpler distribution in a manner that mimics sampling from the desired target distribution. Unlike existing approaches, GraspFlow is modular, allowing grasps that satisfy multiple criteria to be obtained simply by incorporating the relevant discriminators. It is also simple to implement, requiring minimal code given existing auto-differentiation libraries and suitable discriminators. Experiments show that GraspFlow generates stable and executable grasps on a real-world Panda robot for a diverse range of objects. In particular, in 60 trials on 20 different household objects, the first attempted grasp was successful 94% of the time, and 100% grasp success was achieved by the second grasp. Moreover, incorporating a functional discriminator for robot-human handover improved the functional aspect of the grasp by up to 33%. △ Less

Submitted 14 August, 2023; originally announced August 2023.

Comments: IROS 2023, Code and Datasets are available at https://github.com/tasbolat1/graspflow

arXiv:2308.03059 [pdf, other]

doi 10.1145/3592111

Language-based Photo Color Adjustment for Graphic Designs

Authors: Zhenwei Wang, Nanxuan Zhao, Gerhard Hancke, Rynson W. H. Lau

Abstract: Adjusting the photo color to associate with some design elements is an essential way for a graphic design to effectively deliver its message and make it aesthetically pleasing. However, existing tools and previous works face a dilemma between the ease of use and level of expressiveness. To this end, we introduce an interactive language-based approach for photo recoloring, which provides an intuiti… ▽ More Adjusting the photo color to associate with some design elements is an essential way for a graphic design to effectively deliver its message and make it aesthetically pleasing. However, existing tools and previous works face a dilemma between the ease of use and level of expressiveness. To this end, we introduce an interactive language-based approach for photo recoloring, which provides an intuitive system that can assist both experts and novices on graphic design. Given a graphic design containing a photo that needs to be recolored, our model can predict the source colors and the target regions, and then recolor the target regions with the source colors based on the given language-based instruction. The multi-granularity of the instruction allows diverse user intentions. The proposed novel task faces several unique challenges, including: 1) color accuracy for recoloring with exactly the same color from the target design element as specified by the user; 2) multi-granularity instructions for parsing instructions correctly to generate a specific result or multiple plausible ones; and 3) locality for recoloring in semantically meaningful local regions to preserve original image semantics. To address these challenges, we propose a model called LangRecol with two main components: the language-based source color prediction module and the semantic-palette-based photo recoloring module. We also introduce an approach for generating a synthetic graphic design dataset with instructions to enable model training. We evaluate our model via extensive experiments and user studies. We also discuss several practical applications, showing the effectiveness and practicality of our approach. Code and data for this paper are at: https://zhenwwang.github.io/langrecol. △ Less

Submitted 6 August, 2023; originally announced August 2023.

Comments: 15 pages, 19 figures. Accepted by SIGGRAPH 2023. Project page: https://zhenwwang.github.io/langrecol

arXiv:2308.00543 [pdf, other]

On the Performance Tradeoff of an ISAC System with Finite Blocklength

Authors: Xiao Shen, Na Zhao, Yuan Shen

Abstract: Integrated sensing and communication (ISAC) has been proposed as a promising paradigm in the future wireless networks, where the spectral and hardware resources are shared to provide a considerable performance gain. It is essential to understand how sensing and communication (S\&C) influences each other to guide the practical algorithm and system design in ISAC. In this paper, we investigate the p… ▽ More Integrated sensing and communication (ISAC) has been proposed as a promising paradigm in the future wireless networks, where the spectral and hardware resources are shared to provide a considerable performance gain. It is essential to understand how sensing and communication (S\&C) influences each other to guide the practical algorithm and system design in ISAC. In this paper, we investigate the performance tradeoff between S\&C in a single-input single-output (SISO) ISAC system with finite blocklength. In particular, we present the system model and the ISAC scheme, after which the rate-error tradeoff is introduced as the performance metric. Then we derive the achievability and converse bounds for the rate-error tradeoff, determining the boundary of the joint S\&C performance. Furthermore, we develop the asymptotic analysis at large blocklength regime, where the performance tradeoff between S\&C is proved to vanish as the blocklength tends to infinity. Finally, our theoretical analysis is consolidated by simulation results. △ Less

Submitted 1 August, 2023; originally announced August 2023.

Comments: Accepted by ICC 2023

arXiv:2307.16215 [pdf]

The Multi-Trip Autonomous Mobile Robot Scheduling Problem with Time Windows in a Stochastic Environment at Smart Hospitals

Authors: Lulu Cheng, Ning Zhao, Kan Wu, Zhibin Chen

Abstract: Autonomous mobile robots (AMRs) play a crucial role in transportation and service tasks at hospitals, contributing to enhanced efficiency and meeting medical demands. This paper investigates the optimization problem of scheduling strategies for AMRs at smart hospitals, where the service and travel times of AMRs are stochastic. A stochastic mixed-integer programming model is formulated to minimize… ▽ More Autonomous mobile robots (AMRs) play a crucial role in transportation and service tasks at hospitals, contributing to enhanced efficiency and meeting medical demands. This paper investigates the optimization problem of scheduling strategies for AMRs at smart hospitals, where the service and travel times of AMRs are stochastic. A stochastic mixed-integer programming model is formulated to minimize the total cost of the hospital by reducing the number of AMRs and travel distance while satisfying constraints such as AMR battery state of charge, AMR capacity, and time windows for medical requests. To address this objective, some properties of the solutions with time window constraints are identified. The variable neighborhood search (VNS) algorithm is adjusted by incorporating the properties of the AMR scheduling problem to solve the model. Experimental results demonstrate that VNS generates high-quality solutions. Both enhanced efficiency and the meeting of medical demands are achieved through intelligently arranging the driving routes of AMRs for both charging and service requests, resulting in substantial cost reductions for hospitals and enhanced utilization of medical resources. △ Less

Submitted 23 November, 2023; v1 submitted 30 July, 2023; originally announced July 2023.

arXiv:2307.08631 [pdf, ps, other]

Dual-Functional MIMO Beamforming Optimization for RIS-Aided Integrated Sensing and Communication

Authors: Xin Zhao, Heng Liu, Shiqi Gong, Xin Ju, Chengwen Xing, Nan Zhao

Abstract: Aiming at providing wireless communication systems with environment-perceptive capacity, emerging integrated sensing and communication (ISAC) technologies face multiple difficulties, especially in balancing the performance trade-off between the communication and radar functions. In this paper, we introduce a reconfigurable intelligent surface (RIS) to assist both data transmission and target detec… ▽ More Aiming at providing wireless communication systems with environment-perceptive capacity, emerging integrated sensing and communication (ISAC) technologies face multiple difficulties, especially in balancing the performance trade-off between the communication and radar functions. In this paper, we introduce a reconfigurable intelligent surface (RIS) to assist both data transmission and target detection in a dual-functional ISAC system. To formulate a general optimization framework, diverse communication performance metrics have been taken into account including famous capacity maximization and mean-squared error (MSE) minimization. Whereas the target detection process is modeled as a general likelihood ratio test (GLRT) due to the practical limitations, and the monotonicity of the corresponding detection probability is proved. For the single-user and single-target (SUST) scenario, the minimum transmit power of the ISAC transceiver has been revealed. By exploiting the optimal conditions of the BS design, we validate that the BS is able to realize the maximum power allocation scheme and derive the optimal BS precoder in a semi-closed form. Moreover, an alternating direction method of multipliers (ADMM) based RIS design is proposed to address the optimization of unit-modulus RIS phase shifts. For the sake of further enhancing computational efficiency, we also develop a low-complexity RIS design based on Riemannian gradient descent. Furthermore, the ISAC transceiver design for the multiple-users and multiple-targets (MUMT) scenario is also investigated, where a zero-forcing (ZF) radar receiver is adopted to cancel the interferences. Then optimal BS precoder is derived under the maximum power allocation scheme, and the RIS phase shifts can be optimized by extending the proposed ADMM-based RIS design. Numerical simulation results verify the performance of our proposed transceiver designs. △ Less

Submitted 17 July, 2023; originally announced July 2023.

Comments: 30 pages, 8 figures, manuscript submitted to IEEE TCOM

arXiv:2305.18286 [pdf, other]

Photoswap: Personalized Subject Swapping in Images

Authors: Jing Gu, Yilin Wang, Nanxuan Zhao, Tsu-Jui Fu, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Xin Eric Wang

Abstract: In an era where images and visual content dominate our digital landscape, the ability to manipulate and personalize these images has become a necessity. Envision seamlessly substituting a tabby cat lounging on a sunlit window sill in a photograph with your own playful puppy, all while preserving the original charm and composition of the image. We present Photoswap, a novel approach that enables th… ▽ More In an era where images and visual content dominate our digital landscape, the ability to manipulate and personalize these images has become a necessity. Envision seamlessly substituting a tabby cat lounging on a sunlit window sill in a photograph with your own playful puppy, all while preserving the original charm and composition of the image. We present Photoswap, a novel approach that enables this immersive image editing experience through personalized subject swapping in existing images. Photoswap first learns the visual concept of the subject from reference images and then swaps it into the target image using pre-trained diffusion models in a training-free manner. We establish that a well-conceptualized visual subject can be seamlessly transferred to any image with appropriate self-attention and cross-attention manipulation, maintaining the pose of the swapped subject and the overall coherence of the image. Comprehensive experiments underscore the efficacy and controllability of Photoswap in personalized subject swapping. Furthermore, Photoswap significantly outperforms baseline methods in human ratings across subject swapping, background preservation, and overall quality, revealing its vast application potential, from entertainment to professional editing. △ Less

Submitted 29 May, 2023; originally announced May 2023.

Comments: 14 pages

arXiv:2305.04451 [pdf, other]

doi 10.1145/3588432.3591568

FashionTex: Controllable Virtual Try-on with Text and Texture

Authors: Anran Lin, Nanxuan Zhao, Shuliang Ning, Yuda Qiu, Baoyuan Wang, Xiaoguang Han

Abstract: Virtual try-on attracts increasing research attention as a promising way for enhancing the user experience for online cloth shopping. Though existing methods can generate impressive results, users need to provide a well-designed reference image containing the target fashion clothes that often do not exist. To support user-friendly fashion customization in full-body portraits, we propose a multi-mo… ▽ More Virtual try-on attracts increasing research attention as a promising way for enhancing the user experience for online cloth shopping. Though existing methods can generate impressive results, users need to provide a well-designed reference image containing the target fashion clothes that often do not exist. To support user-friendly fashion customization in full-body portraits, we propose a multi-modal interactive setting by combining the advantages of both text and texture for multi-level fashion manipulation. With the carefully designed fashion editing module and loss functions, FashionTex framework can semantically control cloth types and local texture patterns without annotated pairwise training data. We further introduce an ID recovery module to maintain the identity of input portrait. Extensive experiments have demonstrated the effectiveness of our proposed pipeline. △ Less

Submitted 8 May, 2023; originally announced May 2023.

Comments: Accepted to SIGGRAPH 2023 (Conference Proceedings)

arXiv:2303.14001 [pdf, other]

Grid-guided Neural Radiance Fields for Large Urban Scenes

Authors: Linning Xu, Yuanbo Xiangli, Sida Peng, Xingang Pan, Nanxuan Zhao, Christian Theobalt, Bo Dai, Dahua Lin

Abstract: Purely MLP-based neural radiance fields (NeRF-based methods) often suffer from underfitting with blurred renderings on large-scale scenes due to limited model capacity. Recent approaches propose to geographically divide the scene and adopt multiple sub-NeRFs to model each region individually, leading to linear scale-up in training costs and the number of sub-NeRFs as the scene expands. An alternat… ▽ More Purely MLP-based neural radiance fields (NeRF-based methods) often suffer from underfitting with blurred renderings on large-scale scenes due to limited model capacity. Recent approaches propose to geographically divide the scene and adopt multiple sub-NeRFs to model each region individually, leading to linear scale-up in training costs and the number of sub-NeRFs as the scene expands. An alternative solution is to use a feature grid representation, which is computationally efficient and can naturally scale to a large scene with increased grid resolutions. However, the feature grid tends to be less constrained and often reaches suboptimal solutions, producing noisy artifacts in renderings, especially in regions with complex geometry and texture. In this work, we present a new framework that realizes high-fidelity rendering on large urban scenes while being computationally efficient. We propose to use a compact multiresolution ground feature plane representation to coarsely capture the scene, and complement it with positional encoding inputs through another NeRF branch for rendering in a joint learning fashion. We show that such an integration can utilize the advantages of two alternative solutions: a light-weighted NeRF is sufficient, under the guidance of the feature grid representation, to render photorealistic novel views with fine details; and the jointly optimized ground feature planes, can meanwhile gain further refinements, forming a more accurate and compact feature space and output much more natural rendering results. △ Less

Submitted 24 March, 2023; originally announced March 2023.

Comments: CVPR2023, Project page at https://city-super.github.io/gridnerf/

Showing 1–50 of 117 results for author: Zhao, N