arXiv:2409.19499 [pdf, other]

Fast-UMI: A Scalable and Hardware-Independent Universal Manipulation Interface

Authors: Ziniu Wu, Tianyu Wang, Zhaxizhuoma, Chuyue Guan, Zhongjie Jia, Shuai Liang, Haoming Song, Delin Qu, Dong Wang, Zhigang Wang, Nieqing Cao, Yan Ding, Bin Zhao, Xuelong Li

Abstract: Collecting real-world manipulation trajectory data involving robotic arms is essential for developing general-purpose action policies in robotic manipulation, yet such data remains scarce. Existing methods face limitations such as high costs, labor intensity, hardware dependencies, and complex setup requirements involving SLAM algorithms. In this work, we introduce Fast-UMI, an interface-mediated manipulation system comprising two key components: a handheld device operated by humans for data collection and a robot-mounted device used during policy inference. Our approach employs a decoupled design compatible with a wide range of grippers while maintaining consistent observation perspectives, allowing models trained on handheld-collected data to be directly applied to real robots. By directly obtaining the end-effector pose using existing commercial hardware products, we eliminate the need for complex SLAM deployment and calibration, streamlining data processing. Fast-UMI provides supporting software tools for efficient robot learning data collection and conversion, facilitating rapid, plug-and-play functionality. This system offers an efficient and user-friendly tool for robotic learning data acquisition.

Submitted 28 September, 2024; originally announced September 2024.

arXiv:2408.15428 [pdf, other]

HEAD: A Bandwidth-Efficient Cooperative Perception Approach for Heterogeneous Connected and Autonomous Vehicles

Authors: Deyuan Qu, Qi Chen, Yongqi Zhu, Yihao Zhu, Sergei S. Avedisov, Song Fu, Qing Yang

Abstract: In cooperative perception studies, there is often a trade-off between communication bandwidth and perception performance. While current feature fusion solutions are known for their excellent object detection performance, transmitting the entire sets of intermediate feature maps requires substantial bandwidth. Furthermore, these fusion approaches are typically limited to vehicles that use identical detection models. Our goal is to develop a solution that supports cooperative perception across vehicles equipped with different modalities of sensors. This method aims to deliver better perception performance than late fusion techniques and precision similar to state-of-the-art intermediate fusion, while requiring an order of magnitude less bandwidth. We propose HEAD, a method that fuses features from the classification and regression heads in 3D object detection networks. Our method is compatible with heterogeneous detection networks such as LiDAR PointPillars, SECOND, VoxelNet, and camera Bird's-eye View (BEV) Encoder. Given the naturally smaller feature size in the detection heads, we design a self-attention mechanism to fuse the classification head and a complementary feature fusion layer to fuse the regression head. Our experiments, comprehensively evaluated on the V2V4Real and OPV2V datasets, demonstrate that HEAD is a fusion method that effectively balances communication bandwidth and perception performance.

Submitted 27 August, 2024; originally announced August 2024.

Comments: Accepted by ECCV 2024 Workshop

arXiv:2408.13024 [pdf, other]

Learning 2D Invariant Affordance Knowledge for 3D Affordance Grounding

Authors: Xianqiang Gao, Pingrui Zhang, Delin Qu, Dong Wang, Zhigang Wang, Yan Ding, Bin Zhao, Xuelong Li

Abstract: 3D Object Affordance Grounding aims to predict the functional regions on a 3D object and has laid the foundation for a wide range of applications in robotics. Recent advances tackle this problem via learning a mapping between 3D regions and a single human-object interaction image. However, the geometric structure of the 3D object and the object in the human-object interaction image are not always consistent, leading to poor generalization. To address this issue, we propose to learn generalizable invariant affordance knowledge from multiple human-object interaction images within the same affordance category. Specifically, we introduce the Multi-Image Guided Invariant-Feature-Aware 3D Affordance Grounding (MIFAG) framework. It grounds 3D object affordance regions by identifying common interaction patterns across multiple human-object interaction images. First, the Invariant Affordance Knowledge Extraction Module (IAM) utilizes an iterative updating strategy to gradually extract aligned affordance knowledge from multiple images and integrate it into an affordance dictionary. Then, the Affordance Dictionary Adaptive Fusion Module (ADM) learns comprehensive point cloud representations that consider all affordance candidates in multiple images. Besides, the Multi-Image and Point Affordance (MIPA) benchmark is constructed and our method outperforms existing state-of-the-art methods on various experimental comparisons. Project page: https://goxq.github.io/mifag

Submitted 23 August, 2024; originally announced August 2024.

arXiv:2406.16038 [pdf, other]

LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Rendering and Control

Authors: Delin Qu, Qizhi Chen, Pingrui Zhang, Xianqiang Gao, Bin Zhao, Dong Wang, Xuelong Li

Abstract: This paper aims to advance the progress of physical world interactive scene reconstruction by extending the interactive object reconstruction from single object level to complex scene level. To this end, we first construct one simulated and one real scene-level physical interaction dataset containing 28 scenes with multiple interactive objects per scene. Furthermore, to accurately model the interactive motions of multiple objects in complex scenes, we propose LiveScene, the first scene-level language-embedded interactive neural radiance field that efficiently reconstructs and controls multiple interactive objects in complex scenes. LiveScene introduces an efficient factorization that decomposes the interactive scene into multiple local deformable fields to separately reconstruct individual interactive objects, achieving the first accurate and independent control on multiple interactive objects in a complex scene. Moreover, we introduce an interaction-aware language embedding method that generates varying language embeddings to localize individual interactive objects under different interactive states, enabling arbitrary control of interactive objects using natural language. Finally, we evaluate LiveScene on the constructed datasets OminiSim and InterReal with various simulated and real-world complex scenes. Extensive experimental results demonstrate that the proposed approach achieves SOTA novel view synthesis and language grounding performance, surpassing existing methods by +9.89, +1.30, and +1.99 in PSNR on the CoNeRF Synthetic, OminiSim #challenging, and InterReal #challenging datasets, respectively, and by +65.12 in mIOU on OminiSim. Project page: https://livescenes.github.io.

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2406.03882 [pdf, other]

doi 10.21437/Interspeech.2024-1895

Spontaneous Speech-Based Suicide Risk Detection Using Whisper and Large Language Models

Authors: Ziyun Cui, Chang Lei, Wen Wu, Yinan Duan, Diyang Qu, Ji Wu, Runsen Chen, Chao Zhang

Abstract: The early detection of suicide risk is important because it enables intervention to prevent potential suicide attempts. This paper studies the automatic detection of suicide risk based on spontaneous speech from adolescents, and collects a Mandarin dataset with 15 hours of suicide speech from more than a thousand adolescents aged from ten to eighteen for our experiments. To leverage the diverse acoustic and linguistic features embedded in spontaneous speech, both the Whisper speech model and textual large language models (LLMs) are used for suicide risk detection. Both all-parameter finetuning and parameter-efficient finetuning approaches are used to adapt the pre-trained models for suicide risk detection, and multiple audio-text fusion approaches are evaluated to combine the representations of Whisper and the LLM. The proposed system achieves a detection accuracy of 0.807 and an F1-score of 0.846 on the test set with 119 subjects, indicating promising potential for real suicide risk detection applications.

Submitted 9 July, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

Comments: Accepted by Interspeech 2024

arXiv:2406.02916 [pdf, other]

Real-time Motion Planning for autonomous vehicles in dynamic environments

Authors: Mohammad Dehghani Tezerjani, Dominic Carrillo, Deyuan Qu, Sudip Dhakal, Amir Mirzaeinia, Qing Yang

Abstract: Recent advancements in self-driving car technologies have enabled them to navigate autonomously through various environments. However, one of the critical challenges in autonomous vehicle operation is trajectory planning, especially in dynamic environments with moving obstacles. This research aims to tackle this challenge by proposing a robust algorithm tailored for autonomous cars operating in dynamic environments with moving obstacles. The algorithm introduces two main innovations. Firstly, it defines path density by adjusting the number of waypoints along the trajectory, optimizing their distribution for accuracy in curved areas and reducing computational complexity in straight sections. Secondly, it integrates hierarchical motion planning algorithms, combining global planning with an enhanced $A^*$ graph-based method and local planning using the time elastic band algorithm with moving obstacle detection considering different motion models. The proposed algorithm is adaptable for different vehicle types and mobile robots, making it versatile for real-world applications. Simulation results demonstrate its effectiveness across various conditions, promising safer and more efficient navigation for autonomous vehicles in dynamic environments. These modifications significantly improve trajectory planning capabilities, addressing a crucial aspect of autonomous vehicle technology.

Submitted 5 June, 2024; originally announced June 2024.

Comments: 8 pages

arXiv:2312.16141 [pdf, other]

VirtualPainting: Addressing Sparsity with Virtual Points and Distance-Aware Data Augmentation for 3D Object Detection

Authors: Sudip Dhakal, Dominic Carrillo, Deyuan Qu, Michael Nutt, Qing Yang, Song Fu

Abstract: In recent times, there has been a notable surge in multimodal approaches that decorate raw LiDAR point clouds with camera-derived features to improve object detection performance. However, we found that these methods still grapple with the inherent sparsity of LiDAR point cloud data, primarily because fewer points are enriched with camera-derived features for sparsely distributed objects. To tackle this issue, we present an approach that generates virtual LiDAR points from camera images and enhances these virtual points with semantic labels obtained from image-based segmentation networks, facilitating the detection of sparsely distributed objects, particularly those that are occluded or distant. Furthermore, we integrate a distance-aware data augmentation (DADA) technique to enhance the model's capability to recognize these sparsely distributed objects by generating specialized training samples. Our approach offers a versatile solution that can be seamlessly integrated into various 3D frameworks and 2D semantic segmentation methods, resulting in significantly improved overall detection accuracy. Evaluation on the KITTI and nuScenes datasets demonstrates substantial enhancements in both 3D and bird's-eye view (BEV) detection benchmarks.

Submitted 26 December, 2023; originally announced December 2023.

arXiv:2312.10818 [pdf, other]

Facial Emotion Recognition using CNN in PyTorch

Authors: Deyuan Qu, Sudip Dhakal, Dominic Carrillo

Abstract: In this project, we have implemented a model to recognize real-time facial emotions given the camera images. Current approaches would read all data and input it into their model, which has high space complexity. Our model is based on the Convolutional Neural Network utilizing the PyTorch library. We believe our implementation will significantly improve the space complexity and provide a useful contribution to facial emotion recognition. Our motivation is to gain a clear understanding of deep learning, particularly CNNs, and to analyze real-life scenarios. Therefore, we tuned the model's hyperparameters, such as learning rate, batch size, and number of epochs, to meet our needs. In addition, we also used techniques to optimize the networks, such as activation functions, dropout, and max pooling. Finally, we analyzed the results from two optimizers to observe the relationship between number of epochs and accuracy.

Submitted 17 December, 2023; originally announced December 2023.

arXiv:2312.04822 [pdf, other]

SiCP: Simultaneous Individual and Cooperative Perception for 3D Object Detection in Connected and Automated Vehicles

Authors: Deyuan Qu, Qi Chen, Tianyu Bai, Hongsheng Lu, Heng Fan, Hao Zhang, Song Fu, Qing Yang

Abstract: Cooperative perception for connected and automated vehicles is traditionally achieved through the fusion of feature maps from two or more vehicles. However, the absence of feature maps shared from other vehicles can lead to a significant decline in 3D object detection performance for cooperative perception models compared to standalone 3D detection models. This drawback impedes the adoption of cooperative perception as vehicle resources are often insufficient to concurrently employ two perception models. To tackle this issue, we present Simultaneous Individual and Cooperative Perception (SiCP), a generic framework that supports a wide range of the state-of-the-art standalone perception backbones and enhances them with a novel Dual-Perception Network (DP-Net) designed to facilitate both individual and cooperative perception. In addition to its lightweight nature with only 0.13M parameters, DP-Net is robust and retains crucial gradient information during feature map fusion. As demonstrated in a comprehensive evaluation on the V2V4Real and OPV2V datasets, thanks to DP-Net, SiCP surpasses state-of-the-art cooperative perception solutions while preserving the performance of standalone perception solutions.

Submitted 26 August, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

Comments: Accepted by IROS 2024

arXiv:2311.15766 [pdf, other]

Knowledge Unlearning for LLMs: Tasks, Methods, and Challenges

Authors: Nianwen Si, Hao Zhang, Heyu Chang, Wenlin Zhang, Dan Qu, Weiqiang Zhang

Abstract: In recent years, large language models (LLMs) have spurred a new research paradigm in natural language processing. Despite their excellent capability in knowledge-based question answering and reasoning, their potential to retain faulty or even harmful knowledge poses risks of malicious application. The challenge of mitigating this issue and transforming these models into purer assistants is crucial for their widespread applicability. Unfortunately, retraining LLMs repeatedly to eliminate undesirable knowledge is impractical due to their immense number of parameters. Knowledge unlearning, derived from analogous studies on machine unlearning, presents a promising avenue to address this concern and is notably advantageous in the context of LLMs. It allows for the removal of harmful knowledge in an efficient manner, without affecting unrelated knowledge in the model. To this end, we provide a survey of knowledge unlearning in the era of LLMs. Firstly, we formally define the knowledge unlearning problem and distinguish it from related works. Subsequently, we categorize existing knowledge unlearning methods into three classes: those based on parameter optimization, parameter merging, and in-context learning, and introduce details of these unlearning methods. We further present evaluation datasets used in existing methods, and finally conclude this survey by presenting the ongoing challenges and future directions.

Submitted 7 December, 2023; v1 submitted 27 November, 2023; originally announced November 2023.

Comments: Work in progress

arXiv:2311.11700 [pdf, other]

GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting

Authors: Chi Yan, Delin Qu, Dan Xu, Bin Zhao, Zhigang Wang, Dong Wang, Xuelong Li

Abstract: In this paper, we introduce GS-SLAM, the first method to utilize a 3D Gaussian representation in a Simultaneous Localization and Mapping (SLAM) system. It facilitates a better balance between efficiency and accuracy. Compared to recent SLAM methods employing neural implicit representations, our method utilizes a real-time differentiable splatting rendering pipeline that offers significant speedup to map optimization and RGB-D rendering. Specifically, we propose an adaptive expansion strategy that adds new or deletes noisy 3D Gaussians in order to efficiently reconstruct newly observed scene geometry and improve the mapping of previously observed areas. This strategy is essential to extend the 3D Gaussian representation to reconstruct the whole scene rather than synthesize a static object as in existing methods. Moreover, in the pose tracking process, an effective coarse-to-fine technique is designed to select reliable 3D Gaussian representations to optimize camera pose, resulting in runtime reduction and robust estimation. Our method achieves competitive performance compared with existing state-of-the-art real-time methods on the Replica and TUM-RGBD datasets. Project page: https://gs-slam.github.io/.

Submitted 7 April, 2024; v1 submitted 20 November, 2023; originally announced November 2023.

Comments: Accepted to CVPR 2024 (highlight). Project Page: https://gs-slam.github.io/

arXiv:2311.11013 [pdf, other]

Implicit Event-RGBD Neural SLAM

Authors: Delin Qu, Chi Yan, Dong Wang, Jie Yin, Dan Xu, Bin Zhao, Xuelong Li

Abstract: Implicit neural SLAM has achieved remarkable progress recently. Nevertheless, existing methods face significant challenges in non-ideal scenarios, such as motion blur or lighting variation, which often lead to issues like convergence failures, localization drifts, and distorted mapping. To address these challenges, we propose EN-SLAM, the first event-RGBD implicit neural SLAM framework, which effectively leverages the high rate and high dynamic range advantages of event data for tracking and mapping. Specifically, EN-SLAM proposes a differentiable CRF (Camera Response Function) rendering technique to generate distinct RGB and event camera data via a shared radiance field, which is optimized by learning a unified implicit representation with the captured event and RGBD supervision. Moreover, based on the temporal difference property of events, we propose a temporal aggregating optimization strategy for the event joint tracking and global bundle adjustment, capitalizing on the consecutive difference constraints of events, significantly enhancing tracking accuracy and robustness. Finally, we construct the simulated dataset DEV-Indoors and real captured dataset DEV-Reals containing 6 scenes, 17 sequences with practical motion blur and lighting changes for evaluations. Experimental results show that our method outperforms the SOTA methods in both tracking ATE and mapping ACC with a real-time 17 FPS in various challenging environments. Project page: https://delinqu.github.io/EN-SLAM.

Submitted 17 March, 2024; v1 submitted 18 November, 2023; originally announced November 2023.

Comments: Accepted at CVPR 2024

arXiv:2310.02050 [pdf, other]

Tuning Large language model for End-to-end Speech Translation

Authors: Hao Zhang, Nianwen Si, Yaqi Chen, Wenlin Zhang, Xukui Yang, Dan Qu, Xiaolin Jiao

Abstract: With the emergence of large language models (LLMs), multimodal models based on LLMs have demonstrated significant potential. Models such as LLaSM, X-LLM, and SpeechGPT exhibit an impressive ability to comprehend and generate human instructions. However, their performance often falters when faced with complex tasks like end-to-end speech translation (E2E-ST), a cross-language and cross-modal translation task. In comparison to single-modal models, multimodal models lag behind in these scenarios. This paper introduces LST, a large multimodal model designed to excel at the E2E-ST task. LST consists of a speech frontend, an adapter, and an LLM backend. The training of LST consists of two stages: (1) modality adjustment, where the adapter is tuned to align speech representation with text embedding space, and (2) downstream task fine-tuning, where both the adapter and the LLM are trained to optimize performance on the E2E-ST task. Experimental results on the MuST-C speech translation benchmark demonstrate that LST-13B achieves BLEU scores of 30.39/41.55/35.33 on En-De/En-Fr/En-Es language pairs, surpassing previous models and establishing a new state-of-the-art. Additionally, we conduct an in-depth analysis of single-modal model selection and the impact of training strategies, which lays the foundation for future research. We will open up our code and models after review.

Submitted 3 October, 2023; originally announced October 2023.

arXiv:2304.10321 [pdf, other]

doi 10.1109/LSP.2022.3140693

DropDim: A Regularization Method for Transformer Networks

Authors: Hao Zhang, Dan Qu, Keji Shao, Xukui Yang

Abstract: We introduce DropDim, a structured dropout method designed for regularizing the self-attention mechanism, which is a key component of the transformer. In contrast to the general dropout method, which randomly drops neurons, DropDim drops part of the embedding dimensions. In this way, the semantic information can be completely discarded. Thus, the excessive co-adaptation between different embedding dimensions can be broken, and the self-attention is forced to encode meaningful features with a certain number of embedding dimensions erased. Experiments on a wide range of tasks executed on the MuST-C English-German dataset show that DropDim can effectively improve model performance, reduce over-fitting, and show complementary effects with other regularization methods. When combined with label smoothing, the WER can be reduced from 19.1% to 15.1% on the ASR task, and the BLEU value can be increased from 26.90 to 28.38 on the MT task. On the ST task, the model can reach a BLEU score of 22.99, an increase of 1.86 BLEU points compared to the strong baseline.

Submitted 20 April, 2023; originally announced April 2023.

Journal ref: IEEE SIGNAL PROCESSING LETTERS, VOL. 29, 2022
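
The DropDim abstract above describes dropping whole embedding dimensions rather than individual neurons. A minimal pure-Python sketch of that idea follows; it is not the authors' implementation. The function name `drop_dim`, the per-dimension Bernoulli sampling, and the 1/(1-p) rescaling borrowed from standard inverted dropout are all assumptions made for illustration.

```python
import random


def drop_dim(embeddings, p, rng=None):
    """Zero out whole embedding dimensions across every position of a sequence.

    embeddings: list of equal-length lists of floats (sequence_length x dim).
    p: probability of dropping each dimension, with 0 <= p < 1.
    Returns (masked_embeddings, keep_mask).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    rng = rng or random.Random()
    dim = len(embeddings[0]) if embeddings else 0
    # One keep/drop decision per dimension, shared by all sequence positions.
    keep = [rng.random() >= p for _ in range(dim)]
    # Assumption: rescale surviving dimensions as in inverted dropout.
    scale = 1.0 / (1.0 - p)
    masked = [
        [value * scale if kept else 0.0 for value, kept in zip(row, keep)]
        for row in embeddings
    ]
    return masked, keep
```

Because the mask is sampled once per dimension, a dropped dimension is zero at every position in the sequence, which is what distinguishes this sketch from element-wise dropout.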

arXiv:2304.10309 [pdf, other]

doi 10.1109/TASLP.2023.3244521

Improving Speech Translation by Cross-Modal Multi-Grained Contrastive Learning

Authors: Hao Zhang, Nianwen Si, Yaqi Chen, Wenlin Zhang, Xukui Yang, Dan Qu, Wei-Qiang Zhang

Abstract: The end-to-end speech translation (E2E-ST) model has gradually become a mainstream paradigm due to its low latency and less error propagation. However, it is non-trivial to train such a model well due to the task complexity and data scarcity. The speech-and-text modality differences result in the E2E-ST model performance usually being inferior to the corresponding machine translation (MT) model. Based on the above observation, existing methods often use sharing mechanisms to carry out implicit knowledge transfer by imposing various constraints. However, the final model often performs worse on the MT task than the MT model trained alone, which means that the knowledge transfer ability of this method is also limited. To deal with these problems, we propose the FCCL (Fine- and Coarse-Granularity Contrastive Learning) approach for E2E-ST, which makes explicit knowledge transfer through cross-modal multi-grained contrastive learning. A key ingredient of our approach is applying contrastive learning at both sentence- and frame-level to give a comprehensive guide for extracting speech representations containing rich semantic information. In addition, we adopt a simple whitening method to alleviate the representation degeneration in the MT model, which adversely affects contrastive learning. Experiments on the MuST-C benchmark show that our proposed approach significantly outperforms the state-of-the-art E2E-ST baselines on all eight language pairs. Further analysis indicates that FCCL can free up its capacity from learning grammatical structure information and force more layers to learn semantic information.

Submitted 20 April, 2023; originally announced April 2023.

Journal ref: IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 31, 2023

arXiv:2304.10295 [pdf, other]

Decouple Non-parametric Knowledge Distillation For End-to-end Speech Translation

Authors: Hao Zhang, Nianwen Si, Yaqi Chen, Wenlin Zhang, Xukui Yang, Dan Qu, Zhen Li

Abstract: Existing techniques often attempt to make knowledge transfer from a powerful machine translation (MT) to speech translation (ST) model with some elaborate techniques, which often requires transcription as extra input during training. However, transcriptions are not always available, and how to improve the ST model performance without transcription, i.e., data efficiency, has rarely been studied in the literature. In this paper, we propose Decoupled Non-parametric Knowledge Distillation (DNKD) from a data perspective to improve the data efficiency. Our method follows the knowledge distillation paradigm. However, instead of obtaining the teacher distribution from a sophisticated MT model, we construct it from a non-parametric datastore via k-Nearest-Neighbor (kNN) retrieval, which removes the dependence on transcription and MT model. Then we decouple the classic knowledge distillation loss into target and non-target distillation to enhance the effect of the knowledge among non-target logits, which is the prominent "dark knowledge". Experiments on the MuST-C corpus show that the proposed method can achieve consistent improvement over the strong baseline without requiring any transcription.

Submitted 20 April, 2023; originally announced April 2023.

Comments: Accepted by ICASSP 2023

arXiv:2303.18125 [pdf, other]

Towards Nonlinear-Motion-Aware and Occlusion-Robust Rolling Shutter Correction

Authors: Delin Qu, Yizhen Lao, Zhigang Wang, Dong Wang, Bin Zhao, Xuelong Li

Abstract: This paper addresses the problem of rolling shutter correction in complex nonlinear and dynamic scenes with extreme occlusion. Existing methods suffer from two main drawbacks. Firstly, they face challenges in estimating the accurate correction field due to the uniform velocity assumption, leading to significant image correction errors under complex motion. Secondly, the drastic occlusion in dynamic scenes prevents current solutions from achieving better image quality because of the inherent difficulties in aligning and aggregating multiple frames. To tackle these challenges, we model the curvilinear trajectory of pixels analytically and propose a geometry-based Quadratic Rolling Shutter (QRS) motion solver, which precisely estimates the high-order correction field of individual pixels. Besides, to reconstruct high-quality occlusion frames in dynamic scenes, we present a 3D video architecture that effectively Aligns and Aggregates multi-frame context, namely, RSA2-Net. We evaluate our method across a broad range of cameras and video sequences, demonstrating its significant superiority. Specifically, our method surpasses the state-of-the-art by +4.98, +0.77, and +4.33 of PSNR on Carla-RS, Fastec-RS, and BS-RSC datasets, respectively. Code is available at https://github.com/DelinQu/qrsc.

Submitted 15 August, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

Comments: accepted at ICCV 2023

arXiv:2209.08503 [pdf, other]

Revisiting Rolling Shutter Bundle Adjustment: Toward Accurate and Fast Solution

Authors: Bangyan Liao, Delin Qu, Yifei Xue, Huiqing Zhang, Yizhen Lao

Abstract: We propose a robust and fast bundle adjustment solution that estimates the 6-DoF pose of the camera and the geometry of the environment based on measurements from a rolling shutter (RS) camera. This tackles the challenges in the existing works, namely relying on additional sensors, high frame rate video as input, restrictive assumptions on camera motion, readout direction, and poor efficiency. To this end, we first investigate the influence of normalization to the image point on RSBA performance and show its better approximation in modelling the real 6-DoF camera motion. Then we present a novel analytical model for the visual residual covariance, which can be used to standardize the reprojection error during the optimization, consequently improving the overall accuracy. More importantly, the combination of normalization and covariance standardization weighting in RSBA (NW-RSBA) can avoid common planar degeneracy without needing to constrain the filming manner. Besides, we propose an acceleration strategy for NW-RSBA based on the sparsity of its Jacobian matrix and Schur complement. Extensive synthetic and real data experiments verify the effectiveness and efficiency of the proposed solution over the state-of-the-art works. We also demonstrate that the proposed method can be easily implemented and plugged into well-known GSSfM and GSSLAM systems as complete RSSfM and RSSLAM solutions.

Submitted 18 April, 2023; v1 submitted 18 September, 2022; originally announced September 2022.

Comments: Accepted to CVPR 2023

arXiv:2101.07116 [pdf, other]

LNSMM: Eye Gaze Estimation With Local Network Share Multiview Multitask

Authors: Yong Huang, Ben Chen, Daiming Qu

Abstract: Eye gaze estimation has become increasingly significant in computer vision. In this paper, we systematically study the mainstream eye gaze estimation methods and propose a novel methodology to estimate eye gaze points and eye gaze directions simultaneously. First, we construct a local sharing network for feature extraction in gaze point and gaze direction estimation, which can reduce network computational parameters and converge quickly. Second, we propose a Multiview Multitask Learning (MTL) framework: for gaze directions, a coplanar constraint is proposed for the left and right eyes; for gaze points, three-view data input indirectly introduces eye position information, and a cross-view pooling module is designed. We also propose a joint loss that handles both gaze point and gaze direction estimation. Finally, we collect a three-view dataset of gaze points to complement existing public datasets. Experiments show that our method achieves state-of-the-art results compared with current mainstream methods on both the gaze point and gaze direction indicators.

Submitted 18 January, 2021; originally announced January 2021.

arXiv:2009.13631 [pdf, other]

Tempura: A General Cost Based Optimizer Framework for Incremental Data Processing (Extended Version)

Authors: Zuozhi Wang, Kai Zeng, Botong Huang, Wei Chen, Xiaozong Cui, Bo Wang, Ji Liu, Liya Fan, Dachuan Qu, Zhenyu Hou, Tao Guan, Chen Li, Jingren Zhou

Abstract: Incremental processing is widely-adopted in many applications, ranging from incremental view maintenance, stream computing, to recently emerging progressive data warehouse and intermittent query processing. Despite many algorithms developed on this topic, none of them can produce an incremental plan that always achieves the best performance, since the optimal plan is data dependent. In this paper, we develop a novel cost-based optimizer framework, called Tempura, for optimizing incremental data processing. We propose an incremental query planning model called TIP based on the concept of time-varying relations, which can formally model incremental processing in its most general form. We give a full specification of Tempura, which can not only unify various existing techniques to generate an optimal incremental plan, but also allow the developer to add their rewrite rules. We study how to explore the plan space and search for an optimal incremental plan. We conduct a thorough experimental evaluation of Tempura in various incremental processing scenarios to show its effectiveness and efficiency.

Submitted 28 September, 2020; originally announced September 2020.

Comments: 19 pages, 8 figures. The short version of this paper is accepted at VLDB 2021 (PVLDB Volume 14, Issue 1)

ACM Class: H.2.4

arXiv:1905.00005 [pdf, ps, other]

Optimal Preamble Length for Spectral Efficiency in Grant-Free RA with Massive MIMO

Authors: Jie Ding, Daiming Qu, Hao Jiang

Abstract: Grant-free random access (RA) with massive MIMO is a promising RA technique for massive access with low signaling overhead. In the grant-free RA with massive MIMO, preamble length has a critical impact on the performance of the system. In this paper, the optimal preamble length is investigated to maximize spectral efficiency (SE) of the grant-free RA with massive MIMO, where effects of the preamble length on the preamble collision and preamble overhead as well as channel estimation accuracy are taken into account. Simulation results agree well with our analyses and confirm the existence of optimal preamble length for SE maximization in the grant-free RA with massive MIMO. Moreover, properties of the optimal preamble length with respect to system parameters are revealed. Compared to the granted access, it is shown that longer preamble length is required for SE maximization in the grant-free RA.

Submitted 29 April, 2019; originally announced May 2019.

Comments: Accepted By IEEE ICEIC 2019. arXiv admin note: text overlap with arXiv:1805.08345

arXiv:1810.04458

Cluster Pairwise Error Probability and Construction of Parity-Check-Concatenated Polar Codes

Authors: Tao Wang, Daiming Qu, Tao Jiang

Abstract: A successive cancellation list (SCL) decoder with limited list size for polar codes cannot be analyzed as a successive cancellation (SC) decoder, nor as a maximum likelihood (ML) decoder, due to the complicated decoding errors caused by path elimination. To address this issue, an analytical tool, named cluster pairwise error probability (CPEP), is proposed in this paper to measure the competitiveness of the correct path against the error paths in an SCL decoder. It is shown that the sum of CPEPs over error paths could be used as an indicator of the probability of the correct path being eliminated from the decoder list. Then, we use CPEP to explain the error performance gain of parity-check-concatenated (PCC) polar codes, and apply CPEP as the optimization criterion in the construction of PCC polar codes, aiming to reduce the elimination probability of the correct path in an SCL decoder with limited list size. Simulation results show that the constructed CRC-PCC polar codes outperform their counterparts of CRC-concatenated polar codes over various codeword lengths, code rates and puncturing patterns.

Submitted 21 March, 2019; v1 submitted 10 October, 2018; originally announced October 2018.

Comments: There are some errors in Algorithm 1 of Page 6

arXiv:1809.07535 [pdf, ps, other]

Multiple Preambles for High Success Rate of Grant-Free Random Access with Massive MIMO

Authors: Hao Jiang, Daiming Qu, Jie Ding, Tao Jiang

Abstract: Grant-free random access (RA) with massive MIMO is a promising RA technique with low signaling overhead that provides significant benefits in increasing the channel reuse efficiency. Since user equipment (UE) detection and channel estimation in grant-free RA rely solely on the received preambles, preamble designs that enable high success rate of UE detection and channel estimation are very much in need to ensure the performance gain of grant-free RA with massive MIMO. In this paper, a super preamble consisting of multiple consecutive preambles is proposed for high success rate of grant-free RA with massive MIMO. With the proposed approach, the success of UE detection and channel estimation for a RA UE depends on two conditions: 1) it is a solvable UE; 2) its super preamble is detected. Accordingly, we theoretically analyze the solvable rate of RA UEs with multiple preambles and propose a reliable UE detection algorithm to obtain the super preambles of RA UEs by exploiting the quasi-orthogonality characteristic of massive MIMO. Theoretical analysis and simulation results show that turning a preamble into a super preamble consisting of two or three shorter preambles, the success rate of UE detection and channel estimation could be significantly increased using the proposed approach.

Submitted 20 September, 2018; originally announced September 2018.

arXiv:1806.09836 [pdf, ps, other]

Virtual Carrier Sensing Based Random Access in Massive MIMO Systems

Authors: Jie Ding, Daiming Qu, Hao Jiang, Tao Jiang

Abstract: The 5th generation mobile communication systems aim to support massive access for future wireless applications. Unfortunately, wireless resource scarcity in random access (RA) is a fundamental bottleneck for enabling massive access. To address this problem, we propose a virtual carrier sensing (VCS) based RA scheme in massive MIMO systems. The essence of the proposed scheme lies in exploiting wireless spatial resources of uplink channels occupied by assigned user equipments (UEs) to increase channel resources for RA. With the proposed scheme, RA UEs are able to exploit the spatial resources that are approximately orthogonal to those of assigned UEs, thus sharing the uplink channel resource with assigned UEs without causing significant interference to them. Specifically, to ensure RA UEs avoid serious interference with assigned UEs, base station (BS) sends tailored virtual carriers to RA UEs on behalf of assigned UEs. RA UEs then conduct VCS to determine whether or not the uplink channel resource is available for RA. Closed-form approximations for probability of channel availability and uplink achievable rate with the proposed scheme are derived. Theoretical analysis and simulation results show that the proposed scheme is able to significantly increase channel resources of RA for massive access.

Submitted 26 June, 2018; originally announced June 2018.

arXiv:1805.08345 [pdf, ps, other]

Success Probability of Grant-Free Random Access with Massive MIMO

Authors: Jie Ding, Daiming Qu, Hao Jiang, Tao Jiang

Abstract: Massive MIMO opens up new avenues for enabling highly efficient random access (RA) by offering abundance of spatial degrees of freedom. In this paper, we investigate the grant-free RA with massive MIMO and derive the analytic expressions of success probability of the grant-free RA for conjugate beamforming and zero-forcing beamforming techniques. With the derived analytic expressions, we further shed light on the impact of system parameters on the success probability. Simulation results verify the accuracy of the analyses. It is confirmed that the grant-free RA with massive MIMO is an attractive RA technique with low signaling overhead that could simultaneously accommodate a number of RA users, which is multiple times the number of RA channels, with close-to-one success probability. In addition, when the number of antennas in massive MIMO is sufficiently large, we show that the number of orthogonal preambles would dominate the success probability.

Submitted 21 May, 2018; originally announced May 2018.

arXiv:1802.03706 [pdf, other]

FDM-Structured Preamble Optimization for Channel Estimation in MIMO-OQAM/FBMC Systems

Authors: Wenfeng Liu, Da Chen, Kai Luo, Tao Jiang, Daiming Qu

Abstract: In this paper, we consider the problem of preamble design in multiple-input multiple-output (MIMO) systems employing offset quadrature amplitude modulation based filter bank multicarrier (OQAM/FBMC) and propose a preamble optimization method for the frequency division multiplexing (FDM)-structured preamble. Specifically, we formulate an optimization problem to determine the frequency division multiplexed preambles, where the objective is to minimize the mean square error (MSE) of the channel estimation, subject to the constraint on the transmit energy. For two transmit antennas, we find the relationship between preambles and the intrinsic interference from neighboring symbols to achieve the minimum channel estimation MSE, and derive the optimal closed-form solution. For more than two transmit antennas, the constrained preamble optimization problem is nonconvex quadratic. Therefore, we convert the original optimization problem into a quadratically constrained quadratic program (QCQP) and obtain the suboptimal solution by relaxing the nonconvex constraint. Simulation results demonstrate that, in terms of MSE and bit error rate (BER) performances, the proposed method outperforms the conventional FDM preamble design method at all signal-to-noise ratio (SNR) regimes and outperforms the interference approximation method-complex (IAM-C) preamble design method at low to medium SNR regimes with lower preamble overhead.

Submitted 11 February, 2018; originally announced February 2018.

Comments: 11 pages, 7 figures

arXiv:1711.10797 [pdf, ps, other]

doi 10.1109/TVT.2017.2774836

Downlink Precoding with Mixed Statistical and Imperfect Instantaneous CSI for Massive MIMO Systems

Authors: Shuang Qiu, Da Chen, Daiming Qu, Kai Luo, Tao Jiang

Abstract: In this paper, the feasibility of a new downlink transmission mode in massive multi-input multi-output (MIMO) systems is investigated with two types of users, i.e., the users with only statistical channel state information (CSI) and the users with imperfect instantaneous CSI. The problem of downlink precoding design with mixed utilization of statistical and imperfect instantaneous CSI is addressed. We first theoretically analyze the impact of the mutual interference between the two types of users on their achievable rate. Then, considering the mutual interference suppression, we propose an extended zero-forcing (eZF) and an extended maximum ratio transmission (eMRT) precoding methods to minimize the total transmit power of base station and to maximize the received signal power of users, respectively. Thanks to the exploitation of statistical CSI, pilot-based channel estimation is avoided enabling more active users, higher system sum rate and shorter transmission delay. Finally, simulations are performed to validate the accuracy of the theoretical analysis and the advantages of the proposed precoding methods.

Submitted 30 November, 2017; v1 submitted 29 November, 2017; originally announced November 2017.

Comments: 14 pages, 9 figures, transactions

arXiv:1601.00413 [pdf, ps, other]

Improving Bandwidth Efficiency of FBMC-OQAM Through Virtual Symbols

Authors: Daiming Qu, Fang Wang, Tao Jiang, Behrouz Farhang-Boroujeny

Abstract: Filter bank multicarrier (FBMC) systems that are based on offset quadrature amplitude modulation (OQAM), namely, FBMC-OQAM, have been criticized for their inefficiency in the use of spectral resources, because of the long ramp-up and ramp-down tails at the beginning and the end of each data packet, respectively. We propose a novel method for shortening these tails. By appending a set of virtual (i.e., non-data-carrying) symbols to the beginning and the end of each packet, and cleverly selecting these symbols, we show that the ramp-up and ramp-down tails in FBMC-OQAM can be suppressed to such an extent that they are deemed negligible and thus may be ignored. This shortens the length of the signal burst in each FBMC-OQAM packet and hence improves its bandwidth efficiency, viz., the same data is transmitted over a shorter period of time. We develop an optimization method that allows computation of virtual symbols for each data packet. Simulation results show that, compared to existing methods, the proposed tail-shortening approach leads to a superior out-of-band (OOB) emission performance and a much lower error vector magnitude (EVM) for the demodulated symbols.

Submitted 4 January, 2016; originally announced January 2016.

Showing 1–28 of 28 results for author: Qu, D