Showing 1–50 of 185 results for author: Bai, S

  1. arXiv:2410.15805  [pdf, other]

    cs.AI

    RAG4ITOps: A Supervised Fine-Tunable and Comprehensive RAG Framework for IT Operations and Maintenance

    Authors: Tianyang Zhang, Zhuoxuan Jiang, Shengguang Bai, Tianrui Zhang, Lin Lin, Yang Liu, Jiawei Ren

    Abstract: With the ever-increasing demands on Question Answering (QA) systems for IT operations and maintenance, an efficient and supervised fine-tunable framework is necessary to ensure the data security, private deployment and continuous upgrading. Although Large Language Models (LLMs) have notably improved the open-domain QA's performance, how to efficiently handle enterprise-exclusive corpora and build…

    Submitted 21 October, 2024; originally announced October 2024.

    Comments: Accepted by EMNLP 2024 Industry Track

  2. arXiv:2410.09703  [pdf, other]

    quant-ph cs.AI cs.IT cs.LG

    Universal scaling laws in quantum-probabilistic machine learning by tensor network towards interpreting representation and generalization powers

    Authors: Sheng-Chen Bai, Shi-Ju Ran

    Abstract: Interpreting the representation and generalization powers has been a long-standing issue in the field of machine learning (ML) and artificial intelligence. This work contributes to uncovering the emergence of universal scaling laws in quantum-probabilistic ML. We take the generative tensor network (GTN) in the form of a matrix product state as an example and show that with an untrained GTN (such a…

    Submitted 12 October, 2024; originally announced October 2024.

    Comments: 5 pages (main text) + 3 pages (appendices), 5 figures (main text) + 4 figures (appendices)

  3. arXiv:2410.01928  [pdf]

    cs.CV

    Deep learning assisted high resolution microscopy image processing for phase segmentation in functional composite materials

    Authors: Ganesh Raghavendran, Bing Han, Fortune Adekogbe, Shuang Bai, Bingyu Lu, William Wu, Minghao Zhang, Ying Shirley Meng

    Abstract: In the domain of battery research, the processing of high-resolution microscopy images is a challenging task, as it involves dealing with complex images and requires a prior understanding of the components involved. The utilization of deep learning methodologies for image analysis has attracted considerable interest in recent years, with multiple investigations employing such techniques for image…

    Submitted 2 October, 2024; originally announced October 2024.

  4. arXiv:2409.19200  [pdf, other]

    math.OC cs.LG stat.ML

    Faster Acceleration for Steepest Descent

    Authors: Site Bai, Brian Bullins

    Abstract: We propose a new accelerated first-order method for convex optimization under non-Euclidean smoothness assumptions. In contrast to standard acceleration techniques, our approach uses primal-dual iterate sequences taken with respect to differing norms, which are then coupled using an implicitly determined interpolation parameter. For $\ell_p$ norm smooth problems in $d$ dimensions, our method provi…

    Submitted 27 September, 2024; originally announced September 2024.

  5. arXiv:2409.16818  [pdf, other]

    eess.IV cs.CV

    Towards General Text-guided Image Synthesis for Customized Multimodal Brain MRI Generation

    Authors: Yulin Wang, Honglin Xiong, Kaicong Sun, Shuwei Bai, Ling Dai, Zhongxiang Ding, Jiameng Liu, Qian Wang, Qian Liu, Dinggang Shen

    Abstract: Multimodal brain magnetic resonance (MR) imaging is indispensable in neuroscience and neurology. However, due to the accessibility of MRI scanners and their lengthy acquisition time, multimodal MR images are not commonly available. Current MR image synthesis approaches are typically trained on independent datasets for specific tasks, leading to suboptimal performance when applied to novel datasets…

    Submitted 25 September, 2024; originally announced September 2024.

    Comments: 23 pages, 9 figures

  6. arXiv:2409.14163  [pdf, other]

    cs.CV cs.CL cs.LG

    PromptTA: Prompt-driven Text Adapter for Source-free Domain Generalization

    Authors: Haoran Zhang, Shuanghao Bai, Wanqi Zhou, Jingwen Fu, Badong Chen

    Abstract: Source-free domain generalization (SFDG) tackles the challenge of adapting models to unseen target domains without access to source domain data. To deal with this challenging task, recent advances in SFDG have primarily focused on leveraging the text modality of vision-language models such as CLIP. These methods involve developing a transferable linear classifier based on diverse style features ex…

    Submitted 21 September, 2024; originally announced September 2024.

  7. arXiv:2409.13149  [pdf]

    cs.DS cs.RO

    Obstacle-Free Path Planning for Autonomous Drones Using Floyd Algorithm

    Authors: Edward Yao, Philip Yao, Shuju Bai

    Abstract: This research investigates the efficiency of Floyd algorithm for obstacle-free path planning for autonomous aerial vehicles (UAVs) or drones. Floyd algorithm is used to generate the shortest paths for UAVs to fly from any place to the destination in a large-scale field with obstacles which UAVs cannot fly over. The simulation results demonstrated that Floyd algorithm effectively plans the shortest…

    Submitted 19 September, 2024; originally announced September 2024.

    Comments: 9 pages, 8 figures

    ACM Class: I.6.3

  8. arXiv:2409.12191  [pdf, other]

    cs.CV cs.AI cs.CL

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Authors: Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin

    Abstract: We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more eff…

    Submitted 3 October, 2024; v1 submitted 18 September, 2024; originally announced September 2024.

    Comments: Code is available at https://github.com/QwenLM/Qwen2-VL. arXiv admin note: text overlap with arXiv:2408.15262 by other authors

  9. arXiv:2408.12067   

    eess.SP cs.AI cs.NI

    Distributed Noncoherent Joint Transmission Based on Multi-Agent Reinforcement Learning for Dense Small Cell MISO Systems

    Authors: Shaozhuang Bai, Zhenzhen Gao, Xuewen Liao

    Abstract: We consider a dense small cell (DSC) network where multi-antenna small cell base stations (SBSs) transmit data to single-antenna users over a shared frequency band. To enhance capacity, a state-of-the-art technique known as noncoherent joint transmission (JT) is applied, enabling users to receive data from multiple coordinated SBSs. However, the sum rate maximization problem with noncoherent JT is…

    Submitted 11 September, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

    Comments: After thorough discussions with my co-authors, we have identified certain issues with the paper that cannot be resolved through revisions. As a result, we have collectively decided to complete withdraw the paper from arXiv

  10. arXiv:2408.11464  [pdf, other]

    cs.CV

    MambaOcc: Visual State Space Model for BEV-based Occupancy Prediction with Local Adaptive Reordering

    Authors: Yonglin Tian, Songlin Bai, Zhiyao Luo, Yutong Wang, Yisheng Lv, Fei-Yue Wang

    Abstract: Occupancy prediction has attracted intensive attention and shown great superiority in the development of autonomous driving systems. The fine-grained environmental representation brought by occupancy prediction in terms of both geometry and semantic information has facilitated the general perception and safe planning under open scenarios. However, it also brings high computation costs and heavy pa…

    Submitted 21 August, 2024; originally announced August 2024.

  11. arXiv:2408.10519  [pdf, other]

    cs.DC cs.DS

    Almost Optimal Algorithms for Token Collision in Anonymous Networks

    Authors: Sirui Bai, Xinyu Fu, Xudong Wu, Penghui Yao, Chaodong Zheng

    Abstract: In distributed systems, situations often arise where some nodes each holds a collection of tokens, and all nodes collectively need to determine whether all tokens are distinct. For example, if each token represents a logged-in user, the problem corresponds to checking whether there are duplicate logins. Similarly, if each token represents a data object or a timestamp, the problem corresponds to ch…

    Submitted 19 August, 2024; originally announced August 2024.

  12. arXiv:2407.16696  [pdf, other]

    cs.CV

    PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects

    Authors: Junyi Li, Junfeng Wu, Weizhi Zhao, Song Bai, Xiang Bai

    Abstract: We present PartGLEE, a part-level foundation model for locating and identifying both objects and parts in images. Through a unified framework, PartGLEE accomplishes detection, segmentation, and grounding of instances at any granularity in the open world scenario. Specifically, we propose a Q-Former to construct the hierarchical relationship between objects and parts, parsing every object into corr…

    Submitted 23 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV2024, homepage: https://provencestar.github.io/PartGLEE-Vision/

  13. arXiv:2407.13038  [pdf, other]

    cs.CV cs.LG

    Universal Facial Encoding of Codec Avatars from VR Headsets

    Authors: Shaojie Bai, Te-Li Wang, Chenghui Li, Akshay Venkatesh, Tomas Simon, Chen Cao, Gabriel Schwartz, Ryan Wrench, Jason Saragih, Yaser Sheikh, Shih-En Wei

    Abstract: Faithful real-time facial animation is essential for avatar-mediated telepresence in Virtual Reality (VR). To emulate authentic communication, avatar animation needs to be efficient and accurate: able to capture both extreme and subtle expressions within a few milliseconds to sustain the rhythm of natural conversations. The oblique and incomplete views of the face, variability in the donning of he…

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: SIGGRAPH 2024 (ACM Transactions on Graphics (TOG))

    Journal ref: ACM Trans. Graph. 43, 4, Article 93 (July 2024), 22 pages.

  14. arXiv:2407.10671  [pdf, other]

    cs.CL cs.AI

    Qwen2 Technical Report

    Authors: An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, et al. (37 additional authors not shown)

    Abstract: This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, a…

    Submitted 10 September, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

    Comments: 26 pages, 1 figure

  15. arXiv:2406.17005  [pdf, other]

    cs.CV

    PVUW 2024 Challenge on Complex Video Understanding: Methods and Results

    Authors: Henghui Ding, Chang Liu, Yunchao Wei, Nikhila Ravi, Shuting He, Song Bai, Philip Torr, Deshui Miao, Xin Li, Zhenyu He, Yaowei Wang, Ming-Hsuan Yang, Zhensong Xu, Jiangtao Yao, Chengjing Wu, Ting Liu, Luoqi Liu, Xinyu Liu, Jing Zhang, Kexin Zhang, Yuting Yang, Licheng Jiao, Shuyuan Yang, Mingqi Gao, Jingnan Luo, et al. (12 additional authors not shown)

    Abstract: Pixel-level Video Understanding in the Wild Challenge (PVUW) focus on complex video understanding. In this CVPR 2024 workshop, we add two new tracks, Complex Video Object Segmentation Track based on MOSE dataset and Motion Expression guided Video Segmentation track based on MeViS dataset. In the two new tracks, we provide additional videos and annotations that feature challenging elements, such as…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: MOSE Challenge: https://henghuiding.github.io/MOSE/ChallengeCVPR2024, MeViS Challenge: https://henghuiding.github.io/MeViS/ChallengeCVPR2024

  16. arXiv:2406.04322  [pdf, other]

    cs.CV

    DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

    Authors: Qihao Liu, Yi Zhang, Song Bai, Adam Kortylewski, Alan Yuille

    Abstract: We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets (represented by Neural Radiance Fields) from text prompts. Unlike recent 3D generative models that rely on clean and well-aligned 3D data, limiting them to single or few-class generation, our model is directly trained on extensive noisy and unaligned `in-the-wild' 3D assets, mitigating the key challenge…

    Submitted 6 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted to CVPR 2024. Code: https://github.com/qihao067/direct3d Project page: https://direct-3d.github.io/

  17. arXiv:2406.00532  [pdf, other]

    cs.AI cs.LG

    Breast Cancer Diagnosis: A Comprehensive Exploration of Explainable Artificial Intelligence (XAI) Techniques

    Authors: Samita Bai, Sidra Nasir, Rizwan Ahmed Khan, Sheeraz Arif, Alexandre Meyer, Hubert Konik

    Abstract: Breast cancer (BC) stands as one of the most common malignancies affecting women worldwide, necessitating advancements in diagnostic methodologies for better clinical outcomes. This article provides a comprehensive exploration of the application of Explainable Artificial Intelligence (XAI) techniques in the detection and diagnosis of breast cancer. As Artificial Intelligence (AI) technologies cont…

    Submitted 1 June, 2024; originally announced June 2024.

  18. arXiv:2405.08779  [pdf, other]

    cs.LG

    Jacobian Regularizer-based Neural Granger Causality

    Authors: Wanqi Zhou, Shuanghao Bai, Shujian Yu, Qibin Zhao, Badong Chen

    Abstract: With the advancement of neural networks, diverse methods for neural Granger causality have emerged, which demonstrate proficiency in handling complex data, and nonlinear relationships. However, the existing framework of neural Granger causality has several limitations. It requires the construction of separate predictive models for each target variable, and the relationship depends on the sparsity…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: 20 pages, 7 figures, ICML 2024

  19. arXiv:2405.08484  [pdf, other]

    quant-ph cs.LG nlin.CD stat.ML

    Universal replication of chaotic characteristics by classical and quantum machine learning

    Authors: Sheng-Chen Bai, Shi-Ju Ran

    Abstract: Replicating chaotic characteristics of non-linear dynamics by machine learning (ML) has recently drawn wide attentions. In this work, we propose that a ML model, trained to predict the state one-step-ahead from several latest historic states, can accurately replicate the bifurcation diagram and the Lyapunov exponents of discrete dynamic systems. The characteristics for different values of the hype…

    Submitted 14 May, 2024; originally announced May 2024.

    Comments: 8 pages, 4 figures

  20. arXiv:2404.19287  [pdf, other]

    cs.CV

    Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective

    Authors: Wanqi Zhou, Shuanghao Bai, Qibin Zhao, Badong Chen

    Abstract: Pretrained vision-language models (VLMs) like CLIP have shown impressive generalization performance across various downstream tasks, yet they remain vulnerable to adversarial attacks. While prior research has primarily concentrated on improving the adversarial robustness of image encoders to guard against attacks on images, the exploration of text-based and multimodal attacks has largely been over…

    Submitted 17 July, 2024; v1 submitted 30 April, 2024; originally announced April 2024.

    Comments: 16 pages, 14 figures

  21. arXiv:2404.19286  [pdf, other]

    cs.CV

    Soft Prompt Generation for Domain Generalization

    Authors: Shuanghao Bai, Yuedi Zhang, Wanqi Zhou, Zhirong Luan, Badong Chen

    Abstract: Large pre-trained vision language models (VLMs) have shown impressive zero-shot ability on downstream tasks with manually designed prompt. To further adapt VLMs to downstream tasks, soft prompt is proposed to replace manually designed prompt, which undergoes fine-tuning based on specific domain data. Prior prompt learning methods primarily learn a fixed prompt or residuled prompt from training sam…

    Submitted 12 July, 2024; v1 submitted 30 April, 2024; originally announced April 2024.

    Comments: 25 pages, 4 figures, accepted by ECCV 2024

  22. arXiv:2404.14724  [pdf]

    cs.RO

    Tightly Joined Positioning and Control Model for Unmanned Aerial Vehicles Based on Factor Graph Optimization

    Authors: Peiwen Yang, Weisong Wen, Shiyu Bai, Li-Ta Hsu

    Abstract: The execution of flight missions by unmanned aerial vehicles (UAV) primarily relies on navigation. In particular, the navigation pipeline has traditionally been divided into positioning and control, operating in a sequential loop. However, the existing navigation pipeline, where the positioning and control are decoupled, struggles to adapt to ubiquitous uncertainties arising from measurement noise…

    Submitted 22 April, 2024; originally announced April 2024.

  23. arXiv:2404.14471  [pdf, other]

    cs.CV

    Narrative Action Evaluation with Prompt-Guided Multimodal Interaction

    Authors: Shiyi Zhang, Sule Bai, Guangyi Chen, Lei Chen, Jiwen Lu, Junle Wang, Yansong Tang

    Abstract: In this paper, we investigate a new problem called narrative action evaluation (NAE). NAE aims to generate professional commentary that evaluates the execution of an action. Unlike traditional tasks such as score-based action quality assessment and video captioning involving superficial sentences, NAE focuses on creating detailed narratives in natural language. These narratives provide intricate d…

    Submitted 26 April, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

    Comments: Accepted by CVPR 2024

  24. arXiv:2404.10499  [pdf, other]

    cs.CV cs.AI

    Robust Noisy Label Learning via Two-Stream Sample Distillation

    Authors: Sihan Bai, Sanping Zhou, Zheng Qin, Le Wang, Nanning Zheng

    Abstract: Noisy label learning aims to learn robust networks under the supervision of noisy labels, which plays a critical role in deep learning. Existing work either conducts sample selection or label correction to deal with noisy labels during the model training process. In this paper, we design a simple yet effective sample selection framework, termed Two-Stream Sample Distillation (TSSD), for noisy labe…

    Submitted 16 April, 2024; originally announced April 2024.

  25. arXiv:2404.03067  [pdf, other]

    cs.RO cs.CV

    Self-supervised 6-DoF Robot Grasping by Demonstration via Augmented Reality Teleoperation System

    Authors: Xiwen Dengxiong, Xueting Wang, Shi Bai, Yunbo Zhang

    Abstract: Most existing 6-DoF robot grasping solutions depend on strong supervision on grasp pose to ensure satisfactory performance, which could be laborious and impractical when the robot works in some restricted area. To this end, we propose a self-supervised 6-DoF grasp pose detection framework via an Augmented Reality (AR) teleoperation system that can efficiently learn human demonstrations and provide…

    Submitted 3 April, 2024; originally announced April 2024.

  26. arXiv:2404.01853  [pdf, other]

    cs.LG cs.CV

    Pairwise Similarity Distribution Clustering for Noisy Label Learning

    Authors: Sihan Bai

    Abstract: Noisy label learning aims to train deep neural networks using a large amount of samples with noisy labels, whose main challenge comes from how to deal with the inaccurate supervision caused by wrong labels. Existing works either take the label correction or sample selection paradigm to involve more samples with accurate labels into the training process. In this paper, we propose a simple yet effec…

    Submitted 2 April, 2024; originally announced April 2024.

  27. arXiv:2403.08506  [pdf, other]

    cs.LG cs.AI cs.CV

    DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning

    Authors: Sikai Bai, Jie Zhang, Shuaicheng Li, Song Guo, Jingcai Guo, Jun Hou, Tao Han, Xiaocheng Lu

    Abstract: Federated learning (FL) has emerged as a powerful paradigm for learning from decentralized data, and federated domain generalization further considers the test dataset (target domain) is absent from the decentralized training data (source domains). However, most existing FL methods assume that domain labels are provided during training, and their evaluation imposes explicit constraints on the numb…

    Submitted 11 March, 2024; originally announced March 2024.

    Journal ref: The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024

  28. arXiv:2403.08192  [pdf, other]

    cs.CL q-bio.BM

    MoleculeQA: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension

    Authors: Xingyu Lu, He Cao, Zijing Liu, Shengyuan Bai, Leqing Chen, Yuan Yao, Hai-Tao Zheng, Yu Li

    Abstract: Large language models are playing an increasingly significant role in molecular research, yet existing models often generate erroneous information, posing challenges to accurate molecular comprehension. Traditional evaluation metrics for generated content fail to assess a model's accuracy in molecular understanding. To rectify the absence of factual evaluation, we present MoleculeQA, a novel quest…

    Submitted 12 March, 2024; originally announced March 2024.

    Comments: 19 pages, 8 figures

  29. arXiv:2403.06764  [pdf, other]

    cs.CV cs.AI cs.CL

    An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

    Authors: Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, Baobao Chang

    Abstract: In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we i…

    Submitted 2 September, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: Accepted to ECCV 2024 (Oral), code is released at https://github.com/pkunlp-icler/FastV

  30. arXiv:2402.14577  [pdf, other]

    cs.CV

    Debiasing Text-to-Image Diffusion Models

    Authors: Ruifei He, Chuhui Xue, Haoru Tan, Wenqing Zhang, Yingchen Yu, Song Bai, Xiaojuan Qi

    Abstract: Learning-based Text-to-Image (TTI) models like Stable Diffusion have revolutionized the way visual content is generated in various domains. However, recent research has shown that nonnegligible social bias exists in current state-of-the-art TTI systems, which raises important concerns. In this work, we target resolving the social bias in TTI diffusion models. We begin by formalizing the problem se…

    Submitted 22 February, 2024; originally announced February 2024.

  31. arXiv:2401.15865  [pdf, other]

    cs.CV

    LiDAR-PTQ: Post-Training Quantization for Point Cloud 3D Object Detection

    Authors: Sifan Zhou, Liang Li, Xinyu Zhang, Bo Zhang, Shipeng Bai, Miao Sun, Ziyu Zhao, Xiaobo Lu, Xiangxiang Chu

    Abstract: Due to highly constrained computing power and memory, deploying 3D lidar-based detectors on edge devices equipped in autonomous vehicles and robots poses a crucial challenge. Being a convenient and straightforward model compression approach, Post-Training Quantization (PTQ) has been widely adopted in 2D vision tasks. However, applying it directly to 3D lidar-based tasks inevitably leads to perform…

    Submitted 28 January, 2024; originally announced January 2024.

    Comments: Accepted in ICLR 2024

  32. arXiv:2401.11002  [pdf, other]

    cs.CV cs.AI

    Fast Registration of Photorealistic Avatars for VR Facial Animation

    Authors: Chaitanya Patel, Shaojie Bai, Te-Li Wang, Jason Saragih, Shih-En Wei

    Abstract: Virtual Reality (VR) bares promise of social interactions that can feel more immersive than other media. Key to this is the ability to accurately animate a personalized photorealistic avatar, and hence the acquisition of the labels for headset-mounted camera (HMC) images need to be efficient and accurate, while wearing a VR headset. This is challenging due to oblique camera views and differences i…

    Submitted 18 July, 2024; v1 submitted 19 January, 2024; originally announced January 2024.

    Comments: ECCV 2024. Project page: https://chaitanya100100.github.io/FastRegistration/

  33. arXiv:2401.02620  [pdf, other]

    cs.AI cs.GR

    Progress and Prospects in 3D Generative AI: A Technical Overview including 3D human

    Authors: Song Bai, Jie Li

    Abstract: While AI-generated text and 2D images continue to expand its territory, 3D generation has gradually emerged as a trend that cannot be ignored. Since the year 2023 an abundant amount of research papers has emerged in the domain of 3D generation. This growth encompasses not just the creation of 3D objects, but also the rapid development of 3D character and motion generation. Several key factors cont…

    Submitted 4 January, 2024; originally announced January 2024.

  34. arXiv:2401.01885  [pdf, other]

    cs.CV

    From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

    Authors: Evonne Ng, Javier Romero, Timur Bagautdinov, Shaojie Bai, Trevor Darrell, Angjoo Kanazawa, Alexander Richard

    Abstract: We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key behind our method is in combining the benefits of sample diversity from vector quantization with the high-frequency…

    Submitted 3 January, 2024; originally announced January 2024.

  35. arXiv:2401.00616  [pdf, other]

    cs.CV

    GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields

    Authors: Xiao Pan, Zongxin Yang, Shuai Bai, Yi Yang

    Abstract: In this paper, we focus on the One-shot Novel View Synthesis (O-NVS) task which targets synthesizing photo-realistic novel views given only one reference image per scene. Previous One-shot Generalizable Neural Radiance Fields (OG-NeRF) methods solve this task in an inference-time finetuning-free manner, yet suffer the blurry issue due to the encoder-only architecture that highly relies on the limi…

    Submitted 29 March, 2024; v1 submitted 31 December, 2023; originally announced January 2024.

    Comments: Submitted to Journal

  36. arXiv:2312.09589  [pdf, other]

    cs.CV

    Improving Cross-domain Few-shot Classification with Multilayer Perceptron

    Authors: Shuanghao Bai, Wanqi Zhou, Zhirong Luan, Donglin Wang, Badong Chen

    Abstract: Cross-domain few-shot classification (CDFSC) is a challenging and tough task due to the significant distribution discrepancies across different domains. To address this challenge, many approaches aim to learn transferable representations. Multilayer perceptron (MLP) has shown its capability to learn transferable representations in various downstream tasks, such as unsupervised image classification…

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: 5 pages, 4 figures

  37. arXiv:2312.09553  [pdf, other]

    cs.CV

    Prompt-based Distribution Alignment for Unsupervised Domain Adaptation

    Authors: Shuanghao Bai, Min Zhang, Wanqi Zhou, Siteng Huang, Zhirong Luan, Donglin Wang, Badong Chen

    Abstract: Recently, despite the unprecedented success of large pre-trained visual-language models (VLMs) on a wide range of downstream tasks, the real-world unsupervised domain adaptation (UDA) problem is still not well explored. Therefore, in this paper, we first experimentally demonstrate that the unsupervised-trained VLMs can significantly reduce the distribution discrepancy between source and target dom…

    Submitted 26 January, 2024; v1 submitted 15 December, 2023; originally announced December 2023.

    Comments: 13 pages, 6 figures

  38. arXiv:2312.09158  [pdf, other]

    cs.CV

    General Object Foundation Model for Images and Videos at Scale

    Authors: Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai

    Abstract: We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework, GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario for various object perception tasks. Adopting a cohesive learning strategy, GLEE acquires knowledge from diverse data…

    Submitted 14 December, 2023; originally announced December 2023.

    Comments: Project homepage: https://glee-vision.github.io

  39. arXiv:2312.04089  [pdf, other]

    cs.CV

    Open-Vocabulary Segmentation with Semantic-Assisted Calibration

    Authors: Yong Liu, Sule Bai, Guanbin Li, Yitong Wang, Yansong Tang

    Abstract: This paper studies open-vocabulary segmentation (OVS) through calibrating in-vocabulary and domain-biased embedding space with generalized contextual prior of CLIP. As the core of open-vocabulary understanding, alignment of visual content with the semantics of unbounded text has become the bottleneck of this field. To address this challenge, recent works propose to utilize CLIP as an additional cl…

    Submitted 7 December, 2023; originally announced December 2023.

  40. arXiv:2312.02481  [pdf, other]

    cs.CV cs.AI

    Learning to Holistically Detect Bridges from Large-Size VHR Remote Sensing Imagery

    Authors: Yansheng Li, Junwei Luo, Yongjun Zhang, Yihua Tan, Jin-Gang Yu, Song Bai

    Abstract: Bridge detection in remote sensing images (RSIs) plays a crucial role in various applications, but it poses unique challenges compared to the detection of other objects. In RSIs, bridges exhibit considerable variations in terms of their spatial scales and aspect ratios. Therefore, to ensure the visibility and integrity of bridges, it is essential to perform holistic bridge detection in large-size…

    Submitted 4 December, 2023; originally announced December 2023.

    Comments: 16 pages, 11 figures, 6 tables; due to the limitation "The abstract field cannot be longer than 1,920 characters", the abstract appearing here is slightly shorter than that in the PDF file

  41. arXiv:2310.06218  [pdf, other]

    cs.LG cs.AI

    SUBP: Soft Uniform Block Pruning for 1xN Sparse CNNs Multithreading Acceleration

    Authors: Jingyang Xiang, Siqi Li, Jun Chen, Shipeng Bai, Yukai Ma, Guang Dai, Yong Liu

    Abstract: The study of sparsity in Convolutional Neural Networks (CNNs) has become widespread to compress and accelerate models in environments with limited resources. By constraining N consecutive weights along the output channel to be group-wise non-zero, the recent network with 1$\times$N sparsity has received tremendous popularity for its three outstanding advantages: 1) A large amount of storage space…

    Submitted 9 October, 2023; originally announced October 2023.

    Comments: 14 pages, 4 figures, Accepted by 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  42. arXiv:2309.16609  [pdf, other]

    cs.CL

    Qwen Technical Report

    Authors: Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, et al. (23 additional authors not shown)

    Abstract: Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Q…

    Submitted 28 September, 2023; originally announced September 2023.

    Comments: 59 pages, 5 figures

  43. arXiv:2309.07698  [pdf, other]

    cs.CV

    Dataset Condensation via Generative Model

    Authors: David Junhao Zhang, Heng Wang, Chuhui Xue, Rui Yan, Wenqing Zhang, Song Bai, Mike Zheng Shou

    Abstract: Dataset condensation aims to condense a large dataset with a lot of training samples into a small set. Previous methods usually condense the dataset into the pixels format. However, it suffers from slow optimization speed and large number of parameters to be optimized. When increasing image resolutions and classes, the number of learnable parameters grows accordingly, prohibiting condensation meth…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: old work, done in 2022

  44. Ethical Framework for Harnessing the Power of AI in Healthcare and Beyond

    Authors: Sidra Nasir, Rizwan Ahmed Khan, Samita Bai

    Abstract: In the past decade, the deployment of deep learning (Artificial Intelligence (AI)) methods has become pervasive across a spectrum of real-world applications, often in safety-critical contexts. This comprehensive research article rigorously investigates the ethical dimensions intricately linked to the rapid evolution of AI technologies, with a particular focus on the healthcare domain. Delving deep…

    Submitted 31 August, 2023; originally announced September 2023.

    Journal ref: IEEE Access 2024

  45. arXiv:2308.16890  [pdf, other]

    cs.CV cs.CL

    TouchStone: Evaluating Vision-Language Models by Language Models

    Authors: Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang, Junyang Lin, Xinggang Wang, Chang Zhou, Jingren Zhou

    Abstract: Large vision-language models (LVLMs) have recently witnessed rapid advancements, exhibiting a remarkable capacity for perceiving, understanding, and processing visual information by connecting visual receptor with large language models (LLMs). However, current assessments mainly focus on recognizing and reasoning abilities, lacking direct evaluation of conversational skills and neglecting visual s…

    Submitted 4 September, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

    Comments: https://github.com/OFA-Sys/TouchStone

  46. arXiv:2308.12966  [pdf, other]

    cs.CV cs.CL

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Authors: Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou

    Abstract: In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyon…

    Submitted 12 October, 2023; v1 submitted 24 August, 2023; originally announced August 2023.

    Comments: Code, demo and models are available at https://github.com/QwenLM/Qwen-VL

  47. arXiv:2308.07209  [pdf, other]

    cs.LG cs.CV eess.IV

    Unified Data-Free Compression: Pruning and Quantization without Fine-Tuning

    Authors: Shipeng Bai, Jun Chen, Xintian Shen, Yixuan Qian, Yong Liu

    Abstract: Structured pruning and quantization are promising approaches for reducing the inference time and memory footprint of neural networks. However, most existing methods require the original training dataset to fine-tune the model. This not only brings heavy resource consumption but also is not possible for applications with sensitive or proprietary data due to privacy and security concerns. Therefore,…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: ICCV2023

  48. arXiv:2308.06739  [pdf, other]

    cs.CV

    Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks

    Authors: David Junhao Zhang, Mutian Xu, Chuhui Xue, Wenqing Zhang, Xiaoguang Han, Song Bai, Mike Zheng Shou

    Abstract: Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection, and pose additional challenges due to concerns regarding data privacy. Recently, synthetic images generated by text-to-image diffusion models, have shown great potential for benefiting image recognition. Although promising, there has been…

    Submitted 13 August, 2023; originally announced August 2023.

  49. arXiv:2308.04269  [pdf, other]

    cs.CV cs.AI

    Lossy and Lossless (L$^2$) Post-training Model Size Compression

    Authors: Yumeng Shi, Shihao Bai, Xiuying Wei, Ruihao Gong, Jianlei Yang

    Abstract: Deep neural networks have delivered remarkable performance and have been widely used in various visual tasks. However, their huge size causes significant inconvenience for transmission and storage. Many previous studies have explored model size compression. However, these studies often approach various lossy and lossless compression methods in isolation, leading to challenges in achieving high com…

    Submitted 8 August, 2023; originally announced August 2023.

  50. arXiv:2308.00353  [pdf, other]

    cs.CV

    Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

    Authors: Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, Xiaojuan Qi

    Abstract: Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor for the recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Interne…

    Submitted 1 August, 2023; originally announced August 2023.

    Comments: submit to TPAMI