Foundation Models for Remote Sensing and Earth Observation: A Survey

Authors: Aoran Xiao, Weihao Xuan, Junjue Wang, Jiaxing Huang, Dacheng Tao, Shijian Lu, Naoto Yokoya

Abstract: Remote Sensing (RS) is a crucial technology for observing, monitoring, and interpreting our planet, with broad applications across geoscience, economics, humanitarian fields, etc. While artificial intelligence (AI), particularly deep learning, has achieved significant advances in RS, unique challenges persist in developing more intelligent RS systems, including the complexity of Earth's environments, diverse sensor modalities, distinctive feature patterns, varying spatial and spectral resolutions, and temporal dynamics. Meanwhile, recent breakthroughs in large Foundation Models (FMs) have expanded AI's potential across many domains due to their exceptional generalizability and zero-shot transfer capabilities. However, their success has largely been confined to natural data like images and video, with degraded performance and even failures for RS data of various non-optical modalities. This has inspired growing interest in developing Remote Sensing Foundation Models (RSFMs) to address the complex demands of Earth Observation (EO) tasks, spanning the surface, atmosphere, and oceans. This survey systematically reviews the emerging field of RSFMs. It begins with an outline of their motivation and background, followed by an introduction of their foundational concepts. It then categorizes and reviews existing RSFM studies including their datasets and technical contributions across Visual Foundation Models (VFMs), Visual-Language Models (VLMs), Large Language Models (LLMs), and beyond. In addition, we benchmark these models against publicly available datasets, discuss existing challenges, and propose future research directions in this rapidly evolving field.

Submitted 21 October, 2024; originally announced October 2024.

arXiv:2410.05326 [pdf, other]

Early-Cycle Internal Impedance Enables ML-Based Battery Cycle Life Predictions Across Manufacturers

Authors: Tyler Sours, Shivang Agarwal, Marc Cormier, Jordan Crivelli-Decker, Steffen Ridderbusch, Stephen L. Glazier, Connor P. Aiken, Aayush R. Singh, Ang Xiao, Omar Allam

Abstract: Predicting the end-of-life (EOL) of lithium-ion batteries across different manufacturers presents significant challenges due to variations in electrode materials, manufacturing processes, cell formats, and a lack of generally available data. Methods that construct features solely on voltage-capacity profile data typically fail to generalize across cell chemistries. This study introduces a methodology that combines traditional voltage-capacity features with Direct Current Internal Resistance (DCIR) measurements, enabling more accurate and generalizable EOL predictions. The use of early-cycle DCIR data captures critical degradation mechanisms related to internal resistance growth, enhancing model robustness. Models are shown to successfully predict the number of cycles to EOL for unseen manufacturers of varied electrode composition with a mean absolute error (MAE) of 150 cycles. This cross-manufacturer generalizability reduces the need for extensive new data collection and retraining, enabling manufacturers to optimize new battery designs using existing datasets. Additionally, a novel DCIR-compatible dataset is released as part of ongoing efforts to enrich the growing ecosystem of cycling data and accelerate battery materials development.

Submitted 5 October, 2024; originally announced October 2024.

Comments: 17 pages, 7 figures

arXiv:2409.20548 [pdf, other]

Robi Butler: Remote Multimodal Interactions with Household Robot Assistant

Authors: Anxing Xiao, Nuwan Janaka, Tianrun Hu, Anshul Gupta, Kaixin Li, Cunjun Yu, David Hsu

Abstract: In this paper, we introduce Robi Butler, a novel household robotic system that enables multimodal interactions with remote users. Building on the advanced communication interfaces, Robi Butler allows users to monitor the robot's status, send text or voice instructions, and select target objects by hand pointing. At the core of our system is a high-level behavior module, powered by Large Language Models (LLMs), that interprets multimodal instructions to generate action plans. These plans are composed of a set of open vocabulary primitives supported by Vision Language Models (VLMs) that handle both text and pointing queries. The integration of the above components allows Robi Butler to ground remote multimodal instructions in the real-world home environment in a zero-shot manner. We demonstrate the effectiveness and efficiency of this system using a variety of daily household tasks that involve remote users giving multimodal instructions. Additionally, we conducted a user study to analyze how multimodal interactions affect efficiency and user experience during remote human-robot interaction and discuss the potential improvements.

Submitted 30 September, 2024; originally announced September 2024.

arXiv:2409.18084 [pdf, other]

GSON: A Group-based Social Navigation Framework with Large Multimodal Model

Authors: Shangyi Luo, Ji Zhu, Peng Sun, Yuhong Deng, Cunjun Yu, Anxing Xiao, Xueqian Wang

Abstract: As the number of service robots and autonomous vehicles in human-centered environments grows, their requirements go beyond simply navigating to a destination. They must also take into account dynamic social contexts and ensure respect and comfort for others in shared spaces, which poses significant challenges for perception and planning. In this paper, we present a group-based social navigation framework GSON to enable mobile robots to perceive and exploit the social group of their surroundings by leveraging the visual reasoning capability of the Large Multimodal Model (LMM). For perception, we apply visual prompting techniques to zero-shot extract the social relationship among pedestrians and combine the result with a robust pedestrian detection and tracking pipeline to alleviate the problem of low inference speed of the LMM. Given the perception result, the planning system is designed to avoid disrupting the current social structure. We adopt a social structure-based mid-level planner as a bridge between global path planning and local motion planning to preserve the global context and reactive response. The proposed method is validated on real-world mobile robot navigation tasks involving complex social structure understanding and reasoning. Experimental results demonstrate the effectiveness of the system in these scenarios compared with several baselines.

Submitted 26 September, 2024; originally announced September 2024.

arXiv:2409.17604 [pdf, other]

RmGPT: Rotating Machinery Generative Pretrained Model

Authors: Yilin Wang, Yifei Yu, Kong Sun, Peixuan Lei, Yuxuan Zhang, Enrico Zio, Aiguo Xia, Yuanxiang Li

Abstract: In industry, the reliability of rotating machinery is critical for production efficiency and safety. Current methods of Prognostics and Health Management (PHM) often rely on task-specific models, which face significant challenges in handling diverse datasets with varying signal characteristics, fault modes and operating conditions. Inspired by advancements in generative pretrained models, we propose RmGPT, a unified model for diagnosis and prognosis tasks. RmGPT introduces a novel token-based framework, incorporating Signal Tokens, Prompt Tokens, Time-Frequency Task Tokens and Fault Tokens to handle heterogeneous data within a unified model architecture. We leverage self-supervised learning for robust feature extraction and introduce a next signal token prediction pretraining strategy, alongside efficient prompt learning for task-specific adaptation. Extensive experiments demonstrate that RmGPT significantly outperforms state-of-the-art algorithms, achieving near-perfect accuracy in diagnosis tasks and exceptionally low errors in prognosis tasks. Notably, RmGPT excels in few-shot learning scenarios, achieving 92% accuracy in 16-class one-shot experiments, highlighting its adaptability and robustness. This work establishes RmGPT as a powerful PHM foundation model for rotating machinery, advancing the scalability and generalizability of PHM solutions.

Submitted 26 September, 2024; originally announced September 2024.

arXiv:2409.02244 [pdf, other]

Therapy as an NLP Task: Psychologists' Comparison of LLMs and Human Peers in CBT

Authors: Zainab Iftikhar, Sean Ransom, Amy Xiao, Jeff Huang

Abstract: Wider access to therapeutic care is one of the biggest challenges in mental health treatment. Due to institutional barriers, some people seeking mental health support have turned to large language models (LLMs) for personalized therapy, even though these models are largely unsanctioned and untested. We investigate the potential and limitations of using LLMs as providers of evidence-based therapy by using mixed methods clinical metrics. Using HELPERT, a prompt run on a large language model using the same process and training as a comparative group of peer counselors, we replicated publicly accessible mental health conversations rooted in Cognitive Behavioral Therapy (CBT) to compare session dynamics and counselor's CBT-based behaviors between original peer support sessions and their reconstructed HELPERT sessions. Two licensed, CBT-trained clinical psychologists evaluated the sessions using the Cognitive Therapy Rating Scale and provided qualitative feedback. Our findings show that the peer sessions are characterized by empathy, small talk, therapeutic alliance, and shared experiences but often exhibit therapist drift. Conversely, HELPERT reconstructed sessions exhibit minimal therapist drift and higher adherence to CBT methods but display a lack of collaboration, empathy, and cultural understanding. Through CTRS ratings and psychologists' feedback, we highlight the importance of human-AI collaboration for scalable mental health. Our work outlines the ethical implication of imparting human-like subjective qualities to LLMs in therapeutic settings, particularly the risk of deceptive empathy, which may lead to unrealistic patient expectations and potential harm.

Submitted 3 September, 2024; originally announced September 2024.

ACM Class: I.2.7; J.4

arXiv:2409.01491 [pdf, other]

EarthGen: Generating the World from Top-Down Views

Authors: Ansh Sharma, Albert Xiao, Praneet Rathi, Rohit Kundu, Albert Zhai, Yuan Shen, Shenlong Wang

Abstract: In this work, we present a novel method for extensive multi-scale generative terrain modeling. At the core of our model is a cascade of superresolution diffusion models that can be combined to produce consistent images across multiple resolutions. Pairing this concept with a tiled generation method yields a scalable system that can generate thousands of square kilometers of realistic Earth surfaces at high resolution. We evaluate our method on a dataset collected from Bing Maps and show that it outperforms super-resolution baselines on the extreme super-resolution task of 1024x zoom. We also demonstrate its ability to create diverse and coherent scenes via an interactive gigapixel-scale generated map. Finally, we demonstrate how our system can be extended to enable novel content creation applications including controllable world generation and 3D scene generation.

Submitted 7 September, 2024; v1 submitted 2 September, 2024; originally announced September 2024.

ACM Class: J.2; I.4.8

arXiv:2408.09085 [pdf, other]

Segment Anything with Multiple Modalities

Authors: Aoran Xiao, Weihao Xuan, Heli Qi, Yun Xing, Naoto Yokoya, Shijian Lu

Abstract: Robust and accurate segmentation of scenes has become one core functionality in various visual recognition and navigation tasks. This has inspired the recent development of Segment Anything Model (SAM), a foundation model for general mask segmentation. However, SAM is largely tailored for single-modal RGB images, limiting its applicability to multi-modal data captured with widely-adopted sensor suites, such as LiDAR plus RGB, depth plus RGB, thermal plus RGB, etc. We develop MM-SAM, an extension and expansion of SAM that supports cross-modal and multi-modal processing for robust and enhanced segmentation with different sensor suites. MM-SAM features two key designs, namely, unsupervised cross-modal transfer and weakly-supervised multi-modal fusion, enabling label-efficient and parameter-efficient adaptation toward various sensor modalities. It addresses three main challenges: 1) adaptation toward diverse non-RGB sensors for single-modal processing, 2) synergistic processing of multi-modal data via sensor fusion, and 3) mask-free training for different downstream tasks. Extensive experiments show that MM-SAM consistently outperforms SAM by large margins, demonstrating its effectiveness and robustness across various sensors and data modalities.

Submitted 16 August, 2024; originally announced August 2024.

Comments: Project page: https://xiaoaoran.github.io/projects/MM-SAM

arXiv:2406.16112 [pdf, ps, other]

Greedy randomized Bregman-Kaczmarz method for constrained nonlinear systems of equations

Authors: Aqin Xiao, Junfeng Yin

Abstract: A greedy randomized nonlinear Bregman-Kaczmarz method by sampling the working index with residual information is developed for the solution of the constrained nonlinear system of equations. Theoretical analyses prove the convergence of the greedy randomized nonlinear Bregman-Kaczmarz method and its relaxed version. Numerical experiments verify the effectiveness of the proposed method, which converges faster than the existing nonlinear Bregman-Kaczmarz methods.

Submitted 23 June, 2024; originally announced June 2024.

arXiv:2406.09813 [pdf, other]

Diffuse X-ray Explorer: a high-resolution X-ray spectroscopic sky surveyor on the China Space Station

Authors: Hai Jin, Junjie Mao, Liubiao Chen, Naihui Chen, Wei Cui, Bo Gao, Jinjin Li, Xinfeng Li, Jiejia Liu, Jia Quan, Chunyang Jiang, Guole Wang, Le Wang, Qian Wang, Sifan Wang, Aimin Xiao, Shuo Zhang

Abstract: DIffuse X-ray Explorer (DIXE) is a proposed high-resolution X-ray spectroscopic sky surveyor on the China Space Station (CSS). DIXE will focus on studying hot baryons in the Milky Way. Galactic hot baryons like the X-ray emitting Milky Way halo and eROSITA bubbles are best observed in the sky survey mode with a large field of view. DIXE will take advantage of the orbital motion of the CSS to scan a large fraction of the sky. High-resolution X-ray spectroscopy, enabled by superconducting microcalorimeters based on the transition-edge sensor (TES) technology, will probe the physical properties (e.g., temperature, density, elemental abundances, kinematics) of the Galactic hot baryons. This will complement the high-resolution imaging data obtained with the eROSITA mission. Here we present the preliminary design of DIXE. The payload consists mainly of a detector assembly and a cryogenic cooling system. The key components of the detector assembly are a microcalorimeter array and frequency-domain multiplexing readout electronics. To provide a working temperature for the detector assembly, the cooling system consists of an adiabatic demagnetization refrigerator and a mechanical cryocooler system.

Submitted 14 June, 2024; originally announced June 2024.

Comments: 12 pages, 6 figures, the full version is published by Journal of Low Temperature Physics

arXiv:2405.02794 [pdf, other]

Octopi: Object Property Reasoning with Large Tactile-Language Models

Authors: Samson Yu, Kelvin Lin, Anxing Xiao, Jiafei Duan, Harold Soh

Abstract: Physical reasoning is important for effective robot manipulation. Recent work has investigated both vision and language modalities for physical reasoning; vision can reveal information about objects in the environment and language serves as an abstraction and communication medium for additional context. Although these works have demonstrated success on a variety of physical reasoning tasks, they are limited to physical properties that can be inferred from visual or language inputs. In this work, we investigate combining tactile perception with language, which enables embodied systems to obtain physical properties through interaction and apply commonsense reasoning. We contribute a new dataset PhysiCLeAR, which comprises both physical/property reasoning tasks and annotated tactile videos obtained using a GelSight tactile sensor. We then introduce Octopi, a system that leverages both tactile representation learning and large vision-language models to predict and reason about tactile inputs with minimal language fine-tuning. Our evaluations on PhysiCLeAR show that Octopi is able to effectively use intermediate physical property predictions to improve its performance on various tactile-related tasks. PhysiCLeAR and Octopi are available at https://github.com/clear-nus/octopi.

Submitted 4 June, 2024; v1 submitted 4 May, 2024; originally announced May 2024.

Comments: Accepted at Robotics: Science and Systems (R:SS 2024)

arXiv:2404.14953 [pdf, other]

Dynamic pricing with Bayesian updates from online reviews

Authors: José Correa, Mathieu Mari, Andrew Xia

Abstract: When launching new products, firms face uncertainty about market reception. Online reviews provide valuable information not only to consumers but also to firms, allowing firms to adjust the product characteristics, including its selling price. In this paper, we consider a pricing model with online reviews in which the quality of the product is uncertain, and both the seller and the buyers Bayesianly update their beliefs to make purchasing & pricing decisions. We model the seller's pricing problem as a basic bandits' problem and show a close connection with the celebrated Catalan numbers, allowing us to efficiently compute the overall future discounted reward of the seller. With this tool, we analyze and compare the optimal static and dynamic pricing strategies in terms of the probability of effectively learning the quality of the product.

Submitted 23 April, 2024; originally announced April 2024.

arXiv:2402.03631 [pdf, other]

CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model

Authors: Aoran Xiao, Weihao Xuan, Heli Qi, Yun Xing, Ruijie Ren, Xiaoqin Zhang, Ling Shao, Shijian Lu

Abstract: The recent Segment Anything Model (SAM) has demonstrated remarkable zero-shot capability and flexible geometric prompting in general image segmentation. However, SAM often struggles when handling various unconventional images, such as aerial, medical, and non-RGB images. This paper presents CAT-SAM, a ConditionAl Tuning network that adapts SAM toward various unconventional target tasks with just few-shot target samples. CAT-SAM freezes the entire SAM and adapts its mask decoder and image encoder simultaneously with a small number of learnable parameters. The core design is a prompt bridge structure that enables decoder-conditioned joint tuning of the heavyweight image encoder and the lightweight mask decoder. The bridging maps the prompt token of the mask decoder to the image encoder, fostering synergic adaptation of the encoder and the decoder with mutual benefits. We develop two representative tuning strategies for the image encoder which leads to two CAT-SAM variants: one injecting learnable prompt tokens in the input space and the other inserting lightweight adapter networks. Extensive experiments over 11 unconventional tasks show that both CAT-SAM variants achieve superior target segmentation performance consistently even under the very challenging one-shot adaptation setup. Project page: https://xiaoaoran.github.io/projects/CAT-SAM

Submitted 15 July, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: ECCV 2024

arXiv:2401.08407 [pdf, other]

Cross-Domain Few-Shot Segmentation via Iterative Support-Query Correspondence Mining

Authors: Jiahao Nie, Yun Xing, Gongjie Zhang, Pei Yan, Aoran Xiao, Yap-Peng Tan, Alex C. Kot, Shijian Lu

Abstract: Cross-Domain Few-Shot Segmentation (CD-FSS) poses the challenge of segmenting novel categories from a distinct domain using only limited exemplars. In this paper, we undertake a comprehensive study of CD-FSS and uncover two crucial insights: (i) the necessity of a fine-tuning stage to effectively transfer the learned meta-knowledge across domains, and (ii) the overfitting risk during the naïve fine-tuning due to the scarcity of novel category examples. With these insights, we propose a novel cross-domain fine-tuning strategy that addresses the challenging CD-FSS tasks. We first design Bi-directional Few-shot Prediction (BFP), which establishes support-query correspondence in a bi-directional manner, crafting augmented supervision to reduce the overfitting risk. Then we further extend BFP into Iterative Few-shot Adaptor (IFA), which is a recursive framework to capture the support-query correspondence iteratively, targeting maximal exploitation of supervisory signals from the sparse novel category samples. Extensive empirical evaluations show that our method significantly outperforms the state-of-the-arts (+7.8%), which verifies that IFA tackles the cross-domain challenges and mitigates the overfitting simultaneously. The code is available at: https://github.com/niejiahao1998/IFA.

Submitted 13 March, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

Comments: Accepted by CVPR 2024

arXiv:2401.08344 [pdf, other]

Large-population asymptotics for the maximum of diffusive particles with mean-field interaction in the noises

Authors: Nikolaos Kolliopoulos, David Sanchez, Amy Xiao

Abstract: We study the $N \to \infty$ limit of the normalized largest component in some systems of $N$ diffusive particles with mean-field interaction. By applying a universal time change, the interaction in noises is transferred to the drift terms, and the asymptotic behavior of the maximum becomes well-understood due to existing results in the literature. We expect that the normalized maximum in the original setting has the same limiting distribution as that of i.i.d. copies of a solution to the corresponding McKean-Vlasov SDE and we present some results and numerical simulations that support this conjecture.

Submitted 16 January, 2024; originally announced January 2024.

Comments: 12 pages

MSC Class: 60K35; 60H10; 60F05; 60G70

arXiv:2311.17406 [pdf, other]

LLM-State: Open World State Representation for Long-horizon Task Planning with Large Language Model

Authors: Siwei Chen, Anxing Xiao, David Hsu

Abstract: This work addresses the problem of long-horizon task planning with the Large Language Model (LLM) in an open-world household environment. Existing works fail to explicitly track key objects and attributes, leading to erroneous decisions in long-horizon tasks, or rely on highly engineered state features and feedback, which is not generalizable. We propose an open state representation that provides continuous expansion and updating of object attributes from the LLM's inherent capabilities for context understanding and historical action reasoning. Our proposed representation maintains a comprehensive record of an object's attributes and changes, enabling robust retrospective summary of the sequence of actions leading to the current state. This allows continuously updating the world model to enhance context understanding for decision-making in task planning. We validate our model through experiments across simulated and real-world task planning scenarios, demonstrating significant improvements over baseline methods in a variety of tasks requiring long-horizon state tracking and reasoning. Video demonstration: https://youtu.be/QkN-8pxV3Mo

Submitted 22 April, 2024; v1 submitted 29 November, 2023; originally announced November 2023.

arXiv:2311.06711 [pdf, ps, other]

Optimal $L^\infty(L^2)$ and $L^1(L^2)$ a posteriori error estimates for the fully discrete approximations of time fractional parabolic differential equations

Authors: Jiliang Cao, Wansheng Wang, Aiguo Xiao

Abstract: We derive optimal order a posteriori error estimates in the $L^\infty(L^2)$ and $L^1(L^2)$-norms for the fully discrete approximations of time fractional parabolic differential equations. For the discretization in time, we use the $L1$ methods, while for the spatial discretization, we use standard conforming finite element methods. The linear and quadratic space-time reconstructions are introduced, which are generalizations of the elliptic space reconstruction. Then the related a posteriori error estimates for the linear and quadratic space-time reconstructions play key roles in deriving global and pointwise final error estimates. Numerical experiments verify and complement our theoretical results.

Submitted 11 November, 2023; originally announced November 2023.

Comments: 22 pages

arXiv:2310.12997 [pdf]

doi 10.1117/12.2677526

Parking Spot Classification based on surround view camera system

Authors: Andy Xiao, Deep Doshi, Lihao Wang, Harsha Gorantla, Thomas Heitzmann, Peter Groth

Abstract: Surround-view fisheye cameras are commonly used for near-field sensing in automated driving scenarios, including urban driving and auto valet parking. Four fisheye cameras, one on each side, are sufficient to cover 360° around the vehicle capturing the entire near-field region. Based on surround view cameras, there has been much research on parking slot detection with main focus on the occupancy status in recent years, but little work on whether the free slot is compatible with the mission of the ego vehicle or not. For instance, some spots are handicap or electric vehicles accessible only. In this paper, we tackle parking spot classification based on the surround view camera system. We adapt the object detection neural network YOLOv4 with a novel polygon bounding box model that is well-suited for various shaped parking spaces, such as slanted parking slots. To the best of our knowledge, we present the first detailed study on parking spot detection and classification on fisheye cameras for auto valet parking scenarios. The results prove that our proposed classification approach is effective to distinguish between regular, electric vehicle, and handicap parking spots.

Submitted 5 October, 2023; originally announced October 2023.

Comments: SPIE Optical Engineering + Applications, 2023, San Diego, California, United States. Proc. SPIE 12675, Applications of Machine Learning 2023

arXiv:2310.12141 [pdf, other]

A phase transition and critical phenomenon for the two-dimensional random field Ising model

Authors: Jian Ding, Fenglin Huang, Aoteng Xia

Abstract: We study the random field Ising model in a two-dimensional box with side length $N$ where the external field is given by independent normal variables with mean $0$ and variance $ε^2$. Our primary result is the following phase transition at $T = T_c$: for $ε\ll N^{-7/8}$ the boundary influence (i.e., the difference between the spin averages at the center of the box with the plus and the minus boundary conditions) decays as $N^{-1/8}$ and thus the disorder essentially has no effect on the boundary influence; for $ε\gg N^{-7/8}$, the boundary influence decays as $N^{-\frac{1}{8}}e^{-Θ(ε^{8/7}\, N)}$ (i.e., the disorder contributes a factor of $e^{-Θ(ε^{8/7}\, N)}$ to the decay rate). For a natural notion of the correlation length, i.e., the minimal size of the box where the boundary influence shrinks by a factor of $2$ from that with no external field, we also prove the following: as $ε\downarrow 0$ the correlation length transits from $Θ(ε^{-8/7})$ at $T_c$ to $e^{Θ(ε^{-4/3}\,\,)}$ for $T < T_c$.

Submitted 4 March, 2024; v1 submitted 18 October, 2023; originally announced October 2023.

Comments: 65 pages; minor revision throughout over previous version

MSC Class: 60K35; 82B44

arXiv:2310.09078 [pdf, other]

DNFS-VNE: Deep Neuro Fuzzy System Driven Virtual Network Embedding

Authors: Ailing Xiao, Ning Chen, Sheng Wu, Peiying Zhang, Linling Kuang, Chunxiao Jiang

Abstract: By decoupling substrate resources, network virtualization (NV) is a promising solution for meeting diverse demands and ensuring differentiated quality of service (QoS). In particular, virtual network embedding (VNE) is a critical enabling technology that enhances the flexibility and scalability of network deployment by addressing the coupling of Internet processes and services. However, in the existing deep neural networks (DNNs)-based works, the black-box nature DNNs limits the analysis, development, and improvement of systems. For example, in the industrial Internet of Things (IIoT), there is a conflict between decision interpretability and the opacity of DNN-based methods. In recent times, interpretable deep learning (DL) represented by deep neuro fuzzy systems (DNFS) combined with fuzzy inference has shown promising interpretability to further exploit the hidden value in the data. Motivated by this, we propose a DNFS-based VNE algorithm that aims to provide an interpretable NV scheme. Specifically, data-driven convolutional neural networks (CNNs) are used as fuzzy implication operators to compute the embedding probabilities of candidate substrate nodes through entailment operations. And, the identified fuzzy rule patterns are cached into the weights by forward computation and gradient back-propagation (BP). Moreover, the fuzzy rule base is constructed based on Mamdani-type linguistic rules using linguistic labels. In addition, the DNFS-driven five-block structure-based policy network serves as the agent for deep reinforcement learning (DRL), which optimizes VNE decision-making through interaction with the environment. Finally, the effectiveness of evaluation indicators and fuzzy rules is verified by simulation experiments.

Submitted 3 July, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

arXiv:2309.13505 [pdf, other]

Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation

Authors: Yun Xing, Jian Kang, Aoran Xiao, Jiahao Nie, Ling Shao, Shijian Lu

Abstract: Vision-Language Pre-training has demonstrated its remarkable zero-shot recognition ability and potential to learn generalizable visual representations from language supervision. Taking a step ahead, language-supervised semantic segmentation enables spatial localization of textual inputs by learning pixel grouping solely from image-text pairs. Nevertheless, the state-of-the-art suffers from clear semantic gaps between visual and textual modality: plenty of visual concepts appeared in images are missing in their paired captions. Such semantic misalignment circulates in pre-training, leading to inferior zero-shot performance in dense predictions due to insufficient visual concepts captured in textual representations. To close such semantic gap, we propose Concept Curation (CoCu), a pipeline that leverages CLIP to compensate for the missing semantics. For each image-text pair, we establish a concept archive that maintains potential visually-matched concepts with our proposed vision-driven expansion and text-to-vision-guided ranking. Relevant concepts can thus be identified via cluster-guided sampling and fed into pre-training, thereby bridging the gap between visual and textual semantics. Extensive experiments over a broad suite of 8 segmentation benchmarks show that CoCu achieves superb zero-shot transfer performance and greatly boosts language-supervised segmentation baseline by a large margin, suggesting the value of bridging semantic gap in pre-training data.

Submitted 4 January, 2024; v1 submitted 23 September, 2023; originally announced September 2023.

Comments: NeurIPS 2023. Code is available at https://github.com/xing0047/rewrite

arXiv:2309.06041 [pdf, other]

GVD-Exploration: An Efficient Autonomous Robot Exploration Framework Based on Fast Generalized Voronoi Diagram Extraction

Authors: Dingfeng Chen, Anxing Xiao, Meiyuan Zou, Wenzheng Chi, Jiankun Wang, Lining Sun

Abstract: Rapidly-exploring Random Trees (RRTs) are a popular technique for autonomous exploration of mobile robots. However, the random sampling used by RRTs can result in inefficient and inaccurate frontiers extraction, which affects the exploration performance. To address the issues of slow path planning and high path cost, we propose a framework that uses a generalized Voronoi diagram (GVD) based multi-choice strategy for robot exploration. Our framework consists of three components: a novel mapping model that uses an end-to-end neural network to construct GVDs of the environments in real time; a GVD-based heuristic scheme that accelerates frontiers extraction and reduces frontiers redundancy; and a multi-choice frontiers assignment scheme that considers different types of frontiers and enables the robot to make rational decisions during the exploration process. We evaluate our method on simulation and real-world experiments and show that it outperforms RRT-based exploration methods in terms of efficiency and robustness.

Submitted 12 September, 2023; originally announced September 2023.

Comments: 11 pages, 10 figures

arXiv:2309.03005 [pdf, ps, other]

On multi-step extended maximum residual Kaczmarz method for solving large inconsistent linear systems

Authors: Aqin Xiao, Junfeng Yin, Ning Zheng

Abstract: A multi-step extended maximum residual Kaczmarz method is presented for the solution of the large inconsistent linear system of equations by using the multi-step iterations technique. Theoretical analysis proves the proposed method is convergent and gives an upper bound on its convergence rate. Numerical experiments show that the proposed method is effective and outperforms the existing extended Kaczmarz methods in terms of the number of iteration steps and the computational costs.

Submitted 6 September, 2023; originally announced September 2023.

arXiv:2309.02780 [pdf, other]

GRASS: Unified Generation Model for Speech-to-Semantic Tasks

Authors: Aobo Xia, Shuyu Lei, Yushu Yang, Xiang Guo, Hua Chai

Abstract: This paper explores the instruction fine-tuning technique for speech-to-semantic tasks by introducing a unified end-to-end (E2E) framework that generates target text conditioned on a task-related prompt for audio data. We pre-train the model using large and diverse data, where instruction-speech pairs are constructed via a text-to-speech (TTS) system. Extensive experiments demonstrate that our proposed model achieves state-of-the-art (SOTA) results on many benchmarks covering speech named entity recognition, speech sentiment analysis, speech question answering, and more, after fine-tuning. Furthermore, the proposed model achieves competitive performance in zero-shot and few-shot scenarios. To facilitate future work on instruction fine-tuning for speech-to-semantic tasks, we release our instruction dataset and code.

Submitted 11 September, 2023; v1 submitted 6 September, 2023; originally announced September 2023.

arXiv:2307.15283 [pdf, ps, other]

On averaging block Kaczmarz methods for solving nonlinear systems of equations

Authors: Aqin Xiao, Junfeng Yin

Abstract: A class of averaging block nonlinear Kaczmarz methods is developed for the solution of the nonlinear system of equations. The convergence theory of the proposed method is established under suitable assumptions and the upper bounds of the convergence rate for the proposed method with both constant stepsize and adaptive stepsize are derived. Numerical experiments are presented to verify the efficiency of the proposed method, which outperforms the existing nonlinear Kaczmarz methods in terms of the number of iteration steps and computational costs.

Submitted 27 July, 2023; originally announced July 2023.

arXiv:2305.19812 [pdf, other]

A Survey of Label-Efficient Deep Learning for 3D Point Clouds

Authors: Aoran Xiao, Xiaoqin Zhang, Ling Shao, Shijian Lu

Abstract: In the past decade, deep neural networks have achieved significant progress in point cloud learning. However, collecting large-scale precisely-annotated training data is extremely laborious and expensive, which hinders the scalability of existing point cloud datasets and poses a bottleneck for efficient exploration of point cloud data in various tasks and applications. Label-efficient learning offers a promising solution by enabling effective deep network training with much-reduced annotation efforts. This paper presents the first comprehensive survey of label-efficient learning of point clouds. We address three critical questions in this emerging research field: i) the importance and urgency of label-efficient learning in point cloud processing, ii) the subfields it encompasses, and iii) the progress achieved in this area. To achieve this, we propose a taxonomy that organizes label-efficient learning methods based on the data prerequisites provided by different types of labels. We categorize four typical label-efficient learning approaches that significantly reduce point cloud annotation efforts: data augmentation, domain transfer learning, weakly-supervised learning, and pretrained foundation models. For each approach, we outline the problem setup and provide an extensive literature review that showcases relevant progress and challenges. Finally, we share insights into current research challenges and potential future directions. A project associated with this survey has been built at https://github.com/xiaoaoran/3D_label_efficient_learning.

Submitted 17 June, 2024; v1 submitted 31 May, 2023; originally announced May 2023.

Comments: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

arXiv:2304.00690 [pdf, other]

3D Semantic Segmentation in the Wild: Learning Generalized Models for Adverse-Condition Point Clouds

Authors: Aoran Xiao, Jiaxing Huang, Weihao Xuan, Ruijie Ren, Kangcheng Liu, Dayan Guan, Abdulmotaleb El Saddik, Shijian Lu, Eric Xing

Abstract: Robust point cloud parsing under all-weather conditions is crucial to level-5 autonomy in autonomous driving. However, how to learn a universal 3D semantic segmentation (3DSS) model is largely neglected as most existing benchmarks are dominated by point clouds captured under normal weather. We introduce SemanticSTF, an adverse-weather point cloud dataset that provides dense point-level annotations and allows to study 3DSS under various adverse weather conditions. We study all-weather 3DSS modeling under two setups: 1) domain adaptive 3DSS that adapts from normal-weather data to adverse-weather data; 2) domain generalizable 3DSS that learns all-weather 3DSS models from normal-weather data. Our studies reveal the challenge while existing 3DSS methods encounter adverse-weather data, showing the great value of SemanticSTF in steering the future endeavor along this very meaningful research direction. In addition, we design a domain randomization technique that alternatively randomizes the geometry styles of point clouds and aggregates their embeddings, ultimately leading to a generalizable model that can improve 3DSS under various adverse weather effectively. The SemanticSTF and related codes are available at https://github.com/xiaoaoran/SemanticSTF.

Submitted 2 April, 2023; originally announced April 2023.

Comments: CVPR2023

arXiv:2303.10300 [pdf, other]

doi 10.1103/PhysRevE.108.034901

Designing the pressure-dependent shear modulus using tessellated granular metamaterials

Authors: Jerry Zhang, Dong Wang, Weiwei Jin, Annie Xia, Nidhi Pashine, Rebecca Kramer-Bottiglio, Mark D. Shattuck, Corey S. O'Hern

Abstract: Jammed packings of granular materials display complex mechanical response. For example, the ensemble-averaged shear modulus $\left\langle G \right\rangle$ increases as a power-law in pressure $p$ for static packings of soft spherical particles that can rearrange during compression. We seek to design granular materials with shear moduli that can either increase or decrease with pressure without particle rearrangements even in the large-system limit. To do this, we construct tessellated granular metamaterials by joining multiple particle-filled cells together. We focus on cells that contain a small number of bidisperse disks in two dimensions. We first study the mechanical properties of individual disk-filled cells with three types of boundaries: periodic boundary conditions (PBC), fixed-length walls (FXW), and flexible walls (FLW). Hypostatic jammed packings are found for cells with FLW, but not in cells with PBC and FXW, and they are stabilized by quartic modes of the dynamical matrix. The shear modulus of a single cell depends linearly on $p$. We find that the slope of the shear modulus with pressure, $λ_c < 0$ for all packings in single cells with PBC where the number of particles per cell $N \ge 6$. In contrast, single cells with FXW and FLW can possess $λ_c > 0$, as well as $λ_c < 0$, for $N \le 16$. We show that we can force the mechanical properties of multi-cell granular metamaterials to possess those of single cells by constraining the endpoints of the outer walls and enforcing an affine shear response. These studies demonstrate that tessellated granular metamaterials provide a novel platform for the design of soft materials with specified mechanical properties.

Submitted 10 September, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

Journal ref: Phys. Rev. E 108, 034901 (2023)

arXiv:2303.06624 [pdf, other]

Collaborative Trolley Transportation System with Autonomous Nonholonomic Robots

Authors: Bingyi Xia, Hao Luan, Ziqi Zhao, Xuheng Gao, Peijia Xie, Anxing Xiao, Jiankun Wang, Max Q. -H. Meng

Abstract: Cooperative object transportation using multiple robots has been intensively studied in the control and robotics literature, but most approaches are either only applicable to omnidirectional robots or lack a complete navigation and decision-making framework that operates in real time. This paper presents an autonomous nonholonomic multi-robot system and an end-to-end hierarchical autonomy framework for collaborative luggage trolley transportation. This framework finds kinematic-feasible paths, computes online motion plans, and provides feedback that enables the multi-robot system to handle long lines of luggage trolleys and navigate obstacles and pedestrians while dealing with multiple inherently complex and coupled constraints. We demonstrate the designed collaborative trolley transportation system through practical transportation tasks, and the experiment results reveal their effectiveness and reliability in complex and dynamic environments.

Submitted 21 July, 2023; v1 submitted 12 March, 2023; originally announced March 2023.

arXiv:2303.05223 [pdf, other]

LEAP: The latent exchangeability prior for borrowing information from historical data

Authors: Ethan M. Alt, Xiuya Chang, Xun Jiang, Qing Liu, May Mo, H. Amy Xia, Joseph G. Ibrahim

Abstract: It is becoming increasingly popular to elicit informative priors on the basis of historical data. Popular existing priors, including the power prior, commensurate prior, and robust meta-analytic prior provide blanket discounting. Thus, if only a subset of participants in the historical data are exchangeable with the current data, these priors may not be appropriate. In order to combat this issue, propensity score (PS) approaches have been proposed. However, PS approaches are only concerned with the covariate distribution, whereas exchangeability is typically assessed with parameters pertaining to the outcome. In this paper, we introduce the latent exchangeability prior (LEAP), where observations in the historical data are classified into exchangeable and non-exchangeable groups. The LEAP discounts the historical data by identifying the most relevant subjects from the historical data. We compare our proposed approach against alternative approaches in simulations and present a case study using our proposed prior to augment a control arm in a phase 3 clinical trial in plaque psoriasis with an unbalanced randomization scheme.

Submitted 9 March, 2023; originally announced March 2023.

arXiv:2302.10654 [pdf, ps, other]

On the rate of normal approximation for Poisson continuum percolation

Authors: Tiffany Y. Y. Lo, Aihua Xia

Abstract: It is known that the number of points in the largest cluster of a percolating Poisson process restricted to a large finite box is asymptotically normal. In this note, we establish a rate of convergence for the statement. As each point in the largest cluster is determined by points as far as the diameter of the box, known results in the literature of normal approximation for Poisson functionals cannot be directly applied. To disentangle the long-range dependence of the largest cluster, we use the fact that the second largest cluster has comparatively shorter range of dependence to restrict the range of dependence, apply a recently established result in Chen, Röllin and Xia (2021) to obtain a Berry-Esseen type bound for the normal approximation of the number of points belonging to clusters that have a restricted range of dependence, and then estimate the gap between this quantity and the number of points in the largest cluster.

Submitted 7 September, 2023; v1 submitted 21 February, 2023; originally announced February 2023.

Comments: 10 pages. This version contains a correction to an error in Lemma 2.2 in the previous versions

MSC Class: primary 60K35; 60F05; secondary 60D05; 60G57; 82B43; 62E20

arXiv:2210.08818 [pdf]

doi 10.4271/2022-01-0107

The Digital Foundation Platform -- A Multi-layered SOA Architecture for Intelligent Connected Vehicle Operating System

Authors: David Yu, Andy Xiao

Abstract: Legacy AD/ADAS development from OEMs centers around developing functions on ECUs using services provided by AUTOSAR Classic Platform (CP) to meet automotive-grade and mass-production requirements. The AUTOSAR CP couples hardware and software components statically and encounters challenges to provide sufficient capacities for the processing of high-level intelligent driving functions, whereas the new platform, AUTOSAR Adaptive Platform (AP) is designed to support dynamically communication and provide richer services and function abstractions for those resource-intensive (memory, CPU) applications. Yet for both platforms, application development and the supporting system software are still closely coupled together, and this makes application development and the enhancement less scalable and flexible, resulting in longer development cycles and slower time-to-market. This paper presents a multi-layered, service-oriented intelligent driving operating system foundation (we named it as Digital Foundation Platform) that provides abstractions for easier adoption of heterogeneous computing hardware. It features a multi-layer SOA software architecture with each layer providing adaptive service API at north-bound for application developers. The proposed Digital Foundation Platform (DFP) has significant advantages of decoupling hardware, operating system core, middle-ware, functional software and application software development. It provides SOA at multiple layers and enables application developers from OEMs, to customize and develop new applications or enhance existing applications with new features, either in autonomous domain or intelligent cockpit domain, with great agility, and less code through re-usability, and thus reduce the time-to-market.

Submitted 17 October, 2022; originally announced October 2022.

Comments: WCX SAE World Congress Experience 2022

arXiv:2210.05128 [pdf, ps, other]

On fast greedy block Kaczmarz methods for solving large consistent linear systems

Authors: Aqin Xiao, Junfeng Yin, Ning Zheng

Abstract: A class of fast greedy block Kaczmarz methods combined with general greedy strategy and average technique are proposed for solving large consistent linear systems. Theoretical analysis of the convergence of the proposed method is given in detail. Numerical experiments show that the proposed methods are efficient and faster than the existing methods.

Submitted 16 October, 2022; v1 submitted 11 October, 2022; originally announced October 2022.

Comments: 11 pages, 1 figure
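
As background for the Kaczmarz entries in this listing, the following is a minimal sketch of the classical single-row Kaczmarz iteration with greedy (maximal-residual) row selection for a consistent system $Ax = b$. It is not the block method proposed in the paper above; the function name `max_residual_kaczmarz` and its parameters are hypothetical and chosen only for illustration.

```python
def max_residual_kaczmarz(A, b, iterations=100, tol=1e-12):
    """Classical Kaczmarz with greedy (maximal-residual) row selection.

    A is a list of rows and b a list of right-hand sides. The system is
    assumed consistent and every row of A is assumed nonzero.
    (Illustrative baseline only, not the block variant from the paper.)
    """
    x = [0.0] * len(A[0])
    for _ in range(iterations):
        # Residual of every equation at the current iterate.
        residuals = [bi - sum(a * xj for a, xj in zip(row, x))
                     for row, bi in zip(A, b)]
        # Greedy choice: the equation that is currently violated the most.
        i = max(range(len(A)), key=lambda k: abs(residuals[k]))
        if abs(residuals[i]) < tol:
            break
        # Orthogonal projection of x onto the hyperplane of equation i.
        step = residuals[i] / sum(a * a for a in A[i])
        x = [xj + step * a for xj, a in zip(x, A[i])]
    return x


# Small consistent system whose exact solution is x = (1, 2).
solution = max_residual_kaczmarz([[2.0, 0.0], [0.0, 3.0]], [2.0, 6.0])
```

Block variants process several rows per step instead of one; how the rows are chosen and averaged is specific to each paper and is not reproduced here.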

arXiv:2209.13998 [pdf, other]

Long range order for three-dimensional random field Ising model throughout the entire low temperature regime

Authors: Jian Ding, Yu Liu, Aoteng Xia

Abstract: For $d\geq 3$, we study the Ising model on $\mathbb Z^d$ with random field given by $\{εh_v: v\in \mathbb Z^d\}$ where $h_v$'s are independent normal variables with mean 0 and variance 1. We show that for any $T < T_c$ (here $T_c$ is the critical temperature without disorder), long range order exists as long as $ε$ is sufficiently small depending on $T$. Our work extends previous results of Imbrie (1985) and Bricmont--Kupiainen (1988) from the very low temperature regime to the entire low temperature regime.

Submitted 28 September, 2022; originally announced September 2022.

Comments: 36 pages

MSC Class: 60K35; 82B44

arXiv:2208.00223 [pdf, other]

PolarMix: A General Data Augmentation Technique for LiDAR Point Clouds

Authors: Aoran Xiao, Jiaxing Huang, Dayan Guan, Kaiwen Cui, Shijian Lu, Ling Shao

Abstract: LiDAR point clouds, which are usually scanned by rotating LiDAR sensors continuously, capture precise geometry of the surrounding environment and are crucial to many autonomous detection and navigation tasks. Though many 3D deep architectures have been developed, efficient collection and annotation of large amounts of point clouds remain one major challenge in the analytic and understanding of point cloud data. This paper presents PolarMix, a point cloud augmentation technique that is simple and generic but can mitigate the data constraint effectively across different perception tasks and scenarios. PolarMix enriches point cloud distributions and preserves point cloud fidelity via two cross-scan augmentation strategies that cut, edit, and mix point clouds along the scanning direction. The first is scene-level swapping which exchanges point cloud sectors of two LiDAR scans that are cut along the azimuth axis. The second is instance-level rotation and paste which crops point instances from one LiDAR scan, rotates them by multiple angles (to create multiple copies), and paste the rotated point instances into other scans. Extensive experiments show that PolarMix achieves superior performance consistently across different perception tasks and scenarios. In addition, it can work as plug-and-play for various 3D deep architectures and also performs well for unsupervised domain adaptation.

Submitted 30 July, 2022; originally announced August 2022.

arXiv:2205.13211 [pdf, ps, other]

Convergence rate for geometric statistics of point processes with fast decay dependence

Authors: Tianshu Cong, Aihua Xia

Abstract: [Błaszczyszyn, Yogeshwaran and Yukich (2019)] established central limit theorems for geometric statistics of point processes having fast decay dependence. As limit theorems are of limited use unless we understand their errors involved in the approximation, in this paper, we consider the rates of a normal approximation in terms of the Wasserstein distance for statistics of point processes on $\mathbb{R}^d$ satisfying fast decay dependence. We demonstrate the use of the theorems for statistics arising from two families of point processes: the rarified Gibbs point processes and the determinantal point processes with fast decay kernels.

Submitted 26 May, 2022; originally announced May 2022.

Comments: 42 pages

MSC Class: primary 60F05; secondary 60D05; 60G55; 62E20; 05C80

arXiv:2205.03967 [pdf, other]

doi 10.1111/rssc.12596

The saturated pairwise interaction Gibbs point process as a joint species distribution model

Authors: Ian Flint, Nick Golding, Peter Vesk, Yan Wang, Aihua Xia

Abstract: In an effort to effectively model observed patterns in the spatial configuration of individuals of multiple species in nature, we introduce the saturated pairwise interaction Gibbs point process. Its main strength lies in its ability to model both attraction and repulsion within and between species, over different scales. As such, it is particularly well-suited to the study of associations in complex ecosystems. Based on the existing literature, we provide an easy to implement fitting procedure as well as a technique to make inference for the model parameters. We also prove that under certain hypotheses the point process is locally stable, which allows us to use the well-known `coupling from the past' algorithm to draw samples from the model. Different numerical experiments show the robustness of the model. We study three different ecological datasets, demonstrating in each one that our model helps disentangle competing ecological effects on species' distribution.

Submitted 20 August, 2022; v1 submitted 8 May, 2022; originally announced May 2022.

Comments: 36 pages, 14 figures

Journal ref: Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 71(5), 2022, pages 1721-1752

arXiv:2204.06456 [pdf, other]

doi 10.1103/PhysRevA.107.L031302

Non-equilibrium dynamics of fluctuations in an ultra-cold atomic mixture

Authors: Apoorva Hegde, Robert Ott, Andy Xia, Valentin Kasper, Jürgen Berges, Fred Jendrzejewski

Abstract: We investigate an ultra-cold mixture of Bose gases interacting via spin-changing collisions by studying the dynamics of spin fluctuations. The experimental implementation employs $^{23}$Na and $^{7}$Li atoms, which are prepared out of equilibrium across a wide range of initial conditions. We identify three regimes in the dynamics of the system for different initial states: a long-lived metastable regime, an instability range with strong growth of fluctuations, and a fast relaxing regime approaching thermal equilibrium. Theoretical modelling of the data allows us to reconstruct effective potentials which characterize the different dynamical regimes of the system.

Submitted 13 April, 2022; originally announced April 2022.

Comments: 9 pages, 5 figures

arXiv:2204.03875 [pdf, other]

Deterministic, Near-Linear $\varepsilon$-Approximation Algorithm for Geometric Bipartite Matching

Authors: Pankaj K. Agarwal, Hsien-Chih Chang, Sharath Raghvendra, Allen Xiao

Abstract: Given point sets $A$ and $B$ in $\mathbb{R}^d$ where $A$ and $B$ have equal size $n$ for some constant dimension $d$ and a parameter $\varepsilon>0$, we present the first deterministic algorithm that computes, in $n\cdot(\varepsilon^{-1} \log n)^{O(d)}$ time, a perfect matching between $A$ and $B$ whose cost is within a $(1+\varepsilon)$ factor of the optimal under any $\ell_p$-norm. Although a Monte-Carlo algorithm with a similar running time is proposed by Raghvendra and Agarwal [J. ACM 2020], the best-known deterministic $\varepsilon$-approximation algorithm takes $\Omega(n^{3/2})$ time. Our algorithm constructs a (refinement of a) tree cover of $\mathbb{R}^d$, and we develop several new tools to apply a tree-cover based approach to compute an $\varepsilon$-approximate perfect matching.

Submitted 8 April, 2022; originally announced April 2022.

Comments: The conference version of the paper is accepted to STOC 2022

arXiv:2203.10026 [pdf, other]

Unbiased Subclass Regularization for Semi-Supervised Semantic Segmentation

Authors: Dayan Guan, Jiaxing Huang, Aoran Xiao, Shijian Lu

Abstract: Semi-supervised semantic segmentation learns from small amounts of labelled images and large amounts of unlabelled images, which has witnessed impressive progress with the recent advance of deep neural networks. However, it often suffers from a severe class-bias problem while exploring the unlabelled images, largely due to the clear pixel-wise class imbalance in the labelled images. This paper presents an unbiased subclass regularization network (USRN) that alleviates the class imbalance issue by learning class-unbiased segmentation from balanced subclass distributions. We build the balanced subclass distributions by clustering pixels of each original class into multiple subclasses of similar sizes, which provide class-balanced pseudo supervision to regularize the class-biased segmentation. In addition, we design an entropy-based gate mechanism to coordinate learning between the original classes and the clustered subclasses which facilitates subclass regularization effectively by suppressing unconfident subclass predictions. Extensive experiments over multiple public benchmarks show that USRN achieves superior performance as compared with the state-of-the-art.

Submitted 26 March, 2022; v1 submitted 18 March, 2022; originally announced March 2022.

Comments: Accepted to CVPR 2022. Code is available at https://github.com/Dayan-Guan/USRN

arXiv:2203.04541 [pdf, other]

PUTN: A Plane-fitting based Uneven Terrain Navigation Framework

Authors: Zhuozhu Jian, Zihong Lu, Xiao Zhou, Bin Lan, Anxing Xiao, Xueqian Wang, Bin Liang

Abstract: Autonomous navigation of ground robots has been widely used in indoor structured 2D environments, but there are still many challenges in outdoor 3D unstructured environments, especially in rough, uneven terrains. This paper proposes a plane-fitting based uneven terrain navigation framework (PUTN) to solve this problem. The implementation of PUTN is divided into three steps. First, based on Rapidly-exploring Random Trees (RRT), an improved sample-based algorithm called Plane Fitting RRT* (PF-RRT*) is proposed to obtain a sparse trajectory. Each sampling point corresponds to a custom traversability index and a fitted plane on the point cloud. These planes are connected in series to form a traversable strip. Second, Gaussian Process Regression is used to generate traversability of the dense trajectory interpolated from the sparse trajectory, and the sampling tree is used as the training set. Finally, local planning is performed using nonlinear model predictive control (NMPC). By adding the traversability index and uncertainty to the cost function, and adding obstacles generated by the real-time point cloud to the constraint function, a safe motion planning algorithm with smooth speed and strong robustness is available. Experiments in real scenarios are conducted to verify the effectiveness of the method. The source code is released for the reference of the community.

Submitted 27 September, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

Comments: Accepted by IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2022

arXiv:2203.03927 [pdf, other]

Quadruped Guidance Robot for the Visually Impaired: A Comfort-Based Approach

Authors: Yanbo Chen, Zhengzhe Xu, Zhuozhu Jian, Gengpan Tang, Yunong Yangli, Anxing Xiao, Xueqian Wang, Bin Liang

Abstract: Guidance robots that can guide people and avoid various obstacles could potentially be owned by more visually impaired people at a fairly low cost. Most of the previous guidance robots for the visually impaired ignored the human response behavior and comfort, treating the human as an appendage dragged by the robot, which can lead to imprecise guidance of the human and sudden changes in the traction force experienced by the human. In this paper, we propose a novel quadruped guidance robot system with a comfort-based concept. We design a controllable traction device that can adjust the length and force between human and robot to ensure comfort. To allow the human to be guided safely and comfortably to the target position in complex environments, our proposed human motion planner can plan the traction force with the force-based human motion model. To track the planned force, we also propose a robot motion planner that can generate the specific robot motion command and design the force control device. Our system has been deployed on the Unitree Laikago quadrupedal platform and validated in real-world scenarios.

Submitted 23 June, 2023; v1 submitted 8 March, 2022; originally announced March 2022.

Comments: IEEE International Conference on Robotics and Automation (ICRA) 2023

arXiv:2202.13589 [pdf, other]

doi 10.1109/TPAMI.2023.3262786

Unsupervised Point Cloud Representation Learning with Deep Neural Networks: A Survey

Authors: Aoran Xiao, Jiaxing Huang, Dayan Guan, Xiaoqin Zhang, Shijian Lu, Ling Shao

Abstract: Point cloud data have been widely explored due to their superior accuracy and robustness under various adverse situations. Meanwhile, deep neural networks (DNNs) have achieved very impressive success in various applications such as surveillance and autonomous driving. The convergence of point cloud and DNNs has led to many deep point cloud models, largely trained under the supervision of large-scale and densely-labelled point cloud data. Unsupervised point cloud representation learning, which aims to learn general and useful point cloud representations from unlabelled point cloud data, has recently attracted increasing attention due to the constraint in large-scale point cloud labelling. This paper provides a comprehensive review of unsupervised point cloud representation learning using DNNs. It first describes the motivation, general pipelines as well as terminologies of the recent studies. Relevant background including widely adopted point cloud datasets and DNN architectures is then briefly presented. This is followed by an extensive discussion of existing unsupervised point cloud representation learning methods according to their technical approaches. We also quantitatively benchmark and discuss the reviewed methods over multiple widely adopted point cloud datasets. Finally, we share our humble opinion about several challenges and problems that could be pursued in future research in unsupervised point cloud representation learning. A project associated with this survey has been built at https://github.com/xiaoaoran/3d_url_survey.

Submitted 26 March, 2023; v1 submitted 28 February, 2022; originally announced February 2022.

Comments: IEEE Transactions on Pattern Analysis and Machine Intelligence

arXiv:2111.09983 [pdf, other]

Towards Measuring Fairness in Speech Recognition: Casual Conversations Dataset Transcriptions

Authors: Chunxi Liu, Michael Picheny, Leda Sarı, Pooja Chitkara, Alex Xiao, Xiaohui Zhang, Mark Chou, Andres Alvarado, Caner Hazirbas, Yatharth Saraf

Abstract: It is well known that many machine learning systems demonstrate bias towards specific groups of individuals. This problem has been studied extensively in the Facial Recognition area, but much less so in Automatic Speech Recognition (ASR). This paper presents initial Speech Recognition results on "Casual Conversations" -- a publicly released 846 hour corpus designed to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of metadata, including age, gender, and skin tone. The entire corpus has been manually transcribed, allowing for detailed ASR evaluations across these metadata. Multiple ASR models are evaluated, including models trained on LibriSpeech, 14,000 hour transcribed, and over 2 million hour untranscribed social media videos. Significant differences in word error rate across gender and skin tone are observed at times for all models. We are releasing human transcripts from the Casual Conversations dataset to encourage the community to develop a variety of techniques to reduce these statistical biases.

Submitted 18 November, 2021; originally announced November 2021.

Comments: Submitted to ICASSP 2022. Our dataset will be publicly available at (https://ai.facebook.com/datasets/casual-conversations-downloads) for general use. We also would like to note that considering the limitations of our dataset, we limit the use of it for only evaluation purposes (see license agreement)

arXiv:2111.05948 [pdf, other]

Scaling ASR Improves Zero and Few Shot Learning

Authors: Alex Xiao, Weiyi Zheng, Gil Keren, Duc Le, Frank Zhang, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Abdelrahman Mohamed

Abstract: With 4.5 million hours of English speech from 10 different sources across 120 countries and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition. We propose data selection techniques to efficiently scale training data to find the most valuable samples in massive datasets. To efficiently scale model sizes, we leverage various optimizations such as sparse transducer loss and model sharding. By training 1-10B parameter universal English ASR models, we push the limits of speech recognition performance across many domains. Furthermore, our models learn powerful speech representations with zero and few-shot capabilities on novel domains and styles of speech, exceeding previous results across multiple in-house and public benchmarks. For speakers with disorders due to brain damage, our best zero-shot and few-shot models achieve 22% and 60% relative improvement on the AphasiaBank test set, respectively, while realizing the best performance on public social media videos. Furthermore, the same universal model reaches equivalent performance with 500x less in-domain data on the SPGISpeech financial-domain dataset.

Submitted 29 November, 2021; v1 submitted 10 November, 2021; originally announced November 2021.

arXiv:2110.06648 [pdf, other]

Robotic Autonomous Trolley Collection with Progressive Perception and Nonlinear Model Predictive Control

Authors: Anxing Xiao, Hao Luan, Ziqi Zhao, Yue Hong, Jieting Zhao, Weinan Chen, Jiankun Wang, Max Q. -H. Meng

Abstract: Autonomous mobile manipulation robots that can collect trolleys are widely used to liberate human resources and fight epidemics. Most prior robotic trolley collection solutions only detect trolleys with 2D poses or are merely based on specific marks and lack the formal design of planning algorithms. In this paper, we present a novel mobile manipulation system with applications in luggage trolley collection. The proposed system integrates a compact hardware design and a progressive perception and planning framework, enabling the system to efficiently and robustly collect trolleys in dynamic and complex environments. For the perception, we first develop a 3D trolley detection method that combines object detection and keypoint estimation. Then, a docking process in a short distance is achieved with an accurate point cloud plane detection method and a novel manipulator design. On the planning side, we formulate the robot's motion planning under a nonlinear model predictive control framework with control barrier functions to improve obstacle avoidance capabilities while maintaining the target in the sensors' field of view at close distances. We demonstrate our design and framework by deploying the system on actual trolley collection tasks, and their effectiveness and robustness are experimentally validated.

Submitted 1 March, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

Comments: Accepted to the 2022 International Conference on Robotics and Automation (ICRA 2022)

arXiv:2110.05241 [pdf, other]

Streaming Transformer Transducer Based Speech Recognition Using Non-Causal Convolution

Authors: Yangyang Shi, Chunyang Wu, Dilin Wang, Alex Xiao, Jay Mahadeokar, Xiaohui Zhang, Chunxi Liu, Ke Li, Yuan Shangguan, Varun Nagaraja, Ozlem Kalinli, Mike Seltzer

Abstract: This paper improves the streaming transformer transducer for speech recognition by using non-causal convolution. Many works apply the causal convolution to improve streaming transformer ignoring the lookahead context. We propose to use non-causal convolution to process the center block and lookahead context separately. This method leverages the lookahead context in convolution and maintains similar training and decoding efficiency. Given the similar latency, using the non-causal convolution with lookahead context gives better accuracy than causal convolution, especially for open-domain dictation scenarios. Besides, this paper applies talking-head attention and a novel history context compression scheme to further improve the performance. The talking-head attention improves the multi-head self-attention by transferring information among different heads. The history context compression method introduces more extended history context compactly. On our in-house data, the proposed methods improve a small Emformer baseline with lookahead context by relative WERR 5.1%, 14.5%, 8.4% on open-domain dictation, assistant general scenarios, and assistant calling scenarios, respectively.

Submitted 7 October, 2021; originally announced October 2021.

Comments: 5 pages, 3 figures, submit to ICASSP 2022

arXiv:2110.03374 [pdf, other]

Model Adaptation: Historical Contrastive Learning for Unsupervised Domain Adaptation without Source Data

Authors: Jiaxing Huang, Dayan Guan, Aoran Xiao, Shijian Lu

Abstract: Unsupervised domain adaptation aims to align a labeled source domain and an unlabeled target domain, but it requires access to the source data, which often raises concerns in data privacy, data portability and data transmission efficiency. We study unsupervised model adaptation (UMA), also called Unsupervised Domain Adaptation without Source Data, an alternative setting that aims to adapt source-trained models towards target distributions without accessing source data. To this end, we design an innovative historical contrastive learning (HCL) technique that exploits historical source hypothesis to make up for the absence of source data in UMA. HCL addresses the UMA challenge from two perspectives. First, it introduces historical contrastive instance discrimination (HCID) that learns from target samples by contrasting their embeddings which are generated by the currently adapted model and the historical models. With the historical models, HCID encourages UMA to learn instance-discriminative target representations while preserving the source hypothesis. Second, it introduces historical contrastive category discrimination (HCCD) that pseudo-labels target samples to learn category-discriminative target representations. Specifically, HCCD re-weights pseudo labels according to their prediction consistency across the current and historical models. Extensive experiments show that HCL outperforms state-of-the-art methods consistently across a variety of visual tasks and setups.

Submitted 4 June, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: Accepted to Advances in Neural Information Processing Systems 34 (NeurIPS 2021)

arXiv:2110.03174 [pdf, other]

Transferring Voice Knowledge for Acoustic Event Detection: An Empirical Study

Authors: Dawei Liang, Yangyang Shi, Yun Wang, Nayan Singhal, Alex Xiao, Jonathan Shaw, Edison Thomaz, Ozlem Kalinli, Mike Seltzer

Abstract: Detection of common events and scenes from audio is useful for extracting and understanding human contexts in daily life. Prior studies have shown that leveraging knowledge from a relevant domain is beneficial for a target acoustic event detection (AED) process. Inspired by the observation that many human-centered acoustic events in daily life involve voice elements, this paper investigates the potential of transferring high-level voice representations extracted from a public speaker dataset to enrich an AED pipeline. Towards this end, we develop a dual-branch neural network architecture for the joint learning of voice and acoustic features during an AED process and conduct thorough empirical studies to examine the performance on the public AudioSet [1] with different types of inputs. Our main observations are that: 1) Joint learning of audio and voice inputs improves the AED performance (mean average precision) for both a CNN baseline (0.292 vs 0.134 mAP) and a TALNet [2] baseline (0.361 vs 0.351 mAP); 2) Augmenting the extra voice features is critical to maximize the model performance with dual inputs.

Submitted 7 October, 2021; originally announced October 2021.

Comments: Submitted to ICASSP 2022

arXiv:2108.00177 [pdf, other]

Greedy Network Enlarging

Authors: Chuanjian Liu, Kai Han, An Xiao, Yiping Deng, Wei Zhang, Chunjing Xu, Yunhe Wang

Abstract: Recent studies on deep convolutional neural networks present a simple paradigm of architecture design, i.e., models with more MACs typically achieve better accuracy, such as EfficientNet and RegNet. These works try to enlarge all the stages in the model with one unified rule by sampling and statistical methods. However, we observe that some network architectures have similar MACs and accuracies, but their allocations on computations for different stages are quite different. In this paper, we propose to enlarge the capacity of CNN models by improving their width, depth and resolution on stage level. Under the assumption that the top-performing smaller CNNs are a proper subcomponent of the top-performing larger CNNs, we propose a greedy network enlarging method based on the reallocation of computations. By modifying the computations on different stages step by step, the enlarged network will be equipped with optimal allocation and utilization of MACs. On EfficientNet, our method consistently outperforms the original scaling method. In particular, with application of our method on GhostNet, we achieve state-of-the-art 80.9% and 84.3% ImageNet top-1 accuracies under the setting of 600M and 4.4B MACs, respectively.

Submitted 25 November, 2021; v1 submitted 31 July, 2021; originally announced August 2021.

Showing 1–50 of 128 results for author: Xia, A