subscribe to arXiv mailings

AI Foundation Model for Heliophysics: Applications, Design, and Implementation

Authors: Sujit Roy, Talwinder Singh, Marcus Freitag, Johannes Schmude, Rohit Lal, Dinesha Hegde, Soumya Ranjan, Amy Lin, Vishal Gaur, Etienne Eben Vos, Rinki Ghosal, Badri Narayana Patro, Berkay Aydin, Nikolai Pogorelov, Juan Bernabe Moreno, Manil Maskey, Rahul Ramachandran

Abstract: Deep learning-based methods have been widely researched in the areas of language and vision, demonstrating their capacity to understand long sequences of data and their usefulness in numerous helio-physics applications. Foundation models (FMs), which are pre-trained on a large-scale datasets, form the basis for a variety of downstream tasks. These models, especially those based on transformers in… ▽ More Deep learning-based methods have been widely researched in the areas of language and vision, demonstrating their capacity to understand long sequences of data and their usefulness in numerous helio-physics applications. Foundation models (FMs), which are pre-trained on a large-scale datasets, form the basis for a variety of downstream tasks. These models, especially those based on transformers in vision and language, show exceptional potential for adapting to a wide range of downstream applications. In this paper, we provide our perspective on the criteria for designing an FM for heliophysics and associated challenges and applications using the Solar Dynamics Observatory (SDO) dataset. We believe that this is the first study to design an FM in the domain of heliophysics. △ Less

Submitted 30 September, 2024; originally announced October 2024.

Comments: 31 Pages, 12 figures

arXiv:2410.09250 [pdf, other]

Quantum-Trained Convolutional Neural Network for Deepfake Audio Detection

Authors: Chu-Hsuan Abraham Lin, Chen-Yu Liu, Samuel Yen-Chi Chen, Kuan-Cheng Chen

Abstract: The rise of deepfake technologies has posed significant challenges to privacy, security, and information integrity, particularly in audio and multimedia content. This paper introduces a Quantum-Trained Convolutional Neural Network (QT-CNN) framework designed to enhance the detection of deepfake audio, leveraging the computational power of quantum machine learning (QML). The QT-CNN employs a hybrid… ▽ More The rise of deepfake technologies has posed significant challenges to privacy, security, and information integrity, particularly in audio and multimedia content. This paper introduces a Quantum-Trained Convolutional Neural Network (QT-CNN) framework designed to enhance the detection of deepfake audio, leveraging the computational power of quantum machine learning (QML). The QT-CNN employs a hybrid quantum-classical approach, integrating Quantum Neural Networks (QNNs) with classical neural architectures to optimize training efficiency while reducing the number of trainable parameters. Our method incorporates a novel quantum-to-classical parameter mapping that effectively utilizes quantum states to enhance the expressive power of the model, achieving up to 70% parameter reduction compared to classical models without compromising accuracy. Data pre-processing involved extracting essential audio features, label encoding, feature scaling, and constructing sequential datasets for robust model evaluation. Experimental results demonstrate that the QT-CNN achieves comparable performance to traditional CNNs, maintaining high accuracy during training and testing phases across varying configurations of QNN blocks. The QT framework's ability to reduce computational overhead while maintaining performance underscores its potential for real-world applications in deepfake detection and other resource-constrained scenarios. This work highlights the practical benefits of integrating quantum computing into artificial intelligence, offering a scalable and efficient approach to advancing deepfake detection technologies. △ Less

Submitted 11 October, 2024; originally announced October 2024.

arXiv:2410.09239 [pdf, other]

Scaling Gaussian Processes for Learning Curve Prediction via Latent Kronecker Structure

Authors: Jihao Andreas Lin, Sebastian Ament, Maximilian Balandat, Eytan Bakshy

Abstract: A key task in AutoML is to model learning curves of machine learning models jointly as a function of model hyper-parameters and training progression. While Gaussian processes (GPs) are suitable for this task, naïve GPs require $\mathcal{O}(n^3m^3)$ time and $\mathcal{O}(n^2 m^2)$ space for $n$ hyper-parameter configurations and $\mathcal{O}(m)$ learning curve observations per hyper-parameter. Effi… ▽ More A key task in AutoML is to model learning curves of machine learning models jointly as a function of model hyper-parameters and training progression. While Gaussian processes (GPs) are suitable for this task, naïve GPs require $\mathcal{O}(n^3m^3)$ time and $\mathcal{O}(n^2 m^2)$ space for $n$ hyper-parameter configurations and $\mathcal{O}(m)$ learning curve observations per hyper-parameter. Efficient inference via Kronecker structure is typically incompatible with early-stopping due to missing learning curve values. We impose $\textit{latent Kronecker structure}$ to leverage efficient product kernels while handling missing values. In particular, we interpret the joint covariance matrix of observed values as the projection of a latent Kronecker product. Combined with iterative linear solvers and structured matrix-vector multiplication, our method only requires $\mathcal{O}(n^3 + m^3)$ time and $\mathcal{O}(n^2 + m^2)$ space. We show that our GP model can match the performance of a Transformer on a learning curve prediction task. △ Less

Submitted 11 October, 2024; originally announced October 2024.

Comments: Bayesian Decision-making and Uncertainty Workshop at NeurIPS 2024

arXiv:2409.13598 [pdf, other]

Prithvi WxC: Foundation Model for Weather and Climate

Authors: Johannes Schmude, Sujit Roy, Will Trojak, Johannes Jakubik, Daniel Salles Civitarese, Shraddha Singh, Julian Kuehnert, Kumar Ankur, Aman Gupta, Christopher E Phillips, Romeo Kienzler, Daniela Szwarcman, Vishal Gaur, Rajat Shinde, Rohit Lal, Arlindo Da Silva, Jorge Luis Guevara Diaz, Anne Jones, Simon Pfreundschuh, Amy Lin, Aditi Sheshadri, Udaysankar Nair, Valentine Anantharaj, Hendrik Hamann, Campbell Watson , et al. (4 additional authors not shown)

Abstract: Triggered by the realization that AI emulators can rival the performance of traditional numerical weather prediction models running on HPC systems, there is now an increasing number of large AI models that address use cases such as forecasting, downscaling, or nowcasting. While the parallel developments in the AI literature focus on foundation models -- models that can be effectively tuned to addr… ▽ More Triggered by the realization that AI emulators can rival the performance of traditional numerical weather prediction models running on HPC systems, there is now an increasing number of large AI models that address use cases such as forecasting, downscaling, or nowcasting. While the parallel developments in the AI literature focus on foundation models -- models that can be effectively tuned to address multiple, different use cases -- the developments on the weather and climate side largely focus on single-use cases with particular emphasis on mid-range forecasting. We close this gap by introducing Prithvi WxC, a 2.3 billion parameter foundation model developed using 160 variables from the Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2). Prithvi WxC employs an encoder-decoder-based architecture, incorporating concepts from various recent transformer models to effectively capture both regional and global dependencies in the input data. The model has been designed to accommodate large token counts to model weather phenomena in different topologies at fine resolutions. Furthermore, it is trained with a mixed objective that combines the paradigms of masked reconstruction with forecasting. We test the model on a set of challenging downstream tasks namely: Autoregressive rollout forecasting, Downscaling, Gravity wave flux parameterization, and Extreme events estimation. The pretrained model with 2.3 billion parameters, along with the associated fine-tuning workflows, has been publicly released as an open-source contribution via Hugging Face. △ Less

Submitted 20 September, 2024; originally announced September 2024.

arXiv:2407.21075 [pdf, other]

Apple Intelligence Foundation Language Models

Authors: Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek , et al. (130 additional authors not shown)

Abstract: We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used… ▽ More We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development. △ Less

Submitted 29 July, 2024; originally announced July 2024.

arXiv:2407.06766 [pdf, other]

Relational Perspective on Graph Query Languages

Authors: Diego Figueira, Anthony W. Lin, Liat Peterfreund

Abstract: We study a relational perspective of graph database querying. Such a perspective underlies various graph database systems but very few theoretical investigations have been conducted on it. This perspective offers a powerful and unified framework to study graph database querying, by which algorithms and complexity follow from classical results. We provide two concrete applications. The first is q… ▽ More We study a relational perspective of graph database querying. Such a perspective underlies various graph database systems but very few theoretical investigations have been conducted on it. This perspective offers a powerful and unified framework to study graph database querying, by which algorithms and complexity follow from classical results. We provide two concrete applications. The first is querying property graphs. The property graph data model supersedes previously proposed graph models and underlies the new standard GQL for graph query languages. We show that this standard can be, by and large, expressed by extensions of relational calculus with transitive closure operators (FO[TC]) and existential second-order quantifiers (ESO). With this, we obtain optimal data complexity bounds, along with extensions including schema validation. The second application is incorporating data from concrete domains (e.g., numbers) in graph database querying. We use embedded finite model theory and, by exploiting a generic Restricted Quantifier Collapse (RQC) result for FO[TC] and ESO, we obtain optimal data complexity bounds for GQL with arithmetics and comparisons. Moreover, we show that Regular Data Path Querying with operations on data (i.e. using register automata formalisms) can be captured in FO[TC] over embedded finite graphs while preserving nondeterministic logspace data complexity. △ Less

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2406.17871 [pdf, other]

Revisiting the Expressiveness Landscape of Data Graph Queries

Authors: Michael Benedikt, Anthony Widjaja Lin, Di-De Yen

Abstract: The study of graph queries in database theory has spanned more than three decades, resulting in a multitude of proposals for graph query languages. These languages differ in the mechanisms. We can identify three main families of languages, with the canonical representatives being: (1) regular path queries, (2) walk logic, and (3) first-order logic with transitive closure operators. This paper prov… ▽ More The study of graph queries in database theory has spanned more than three decades, resulting in a multitude of proposals for graph query languages. These languages differ in the mechanisms. We can identify three main families of languages, with the canonical representatives being: (1) regular path queries, (2) walk logic, and (3) first-order logic with transitive closure operators. This paper provides a complete picture of the expressive power of these languages in the context of data graphs. Specifically, we consider a graph data model that supports querying over both data and topology. For example, "Does there exist a path between two different persons in a social network with the same last name?". We also show that an extension of (1), augmented with transitive closure operators, can unify the expressivity of (1)--(3) without increasing the query evaluation complexity. △ Less

Submitted 25 June, 2024; originally announced June 2024.

arXiv:2406.16942 [pdf, other]

Enhancing Diagnostic Reliability of Foundation Model with Uncertainty Estimation in OCT Images

Authors: Yuanyuan Peng, Aidi Lin, Meng Wang, Tian Lin, Ke Zou, Yinglin Cheng, Tingkun Shi, Xulong Liao, Lixia Feng, Zhen Liang, Xinjian Chen, Huazhu Fu, Haoyu Chen

Abstract: Inability to express the confidence level and detect unseen classes has limited the clinical implementation of artificial intelligence in the real-world. We developed a foundation model with uncertainty estimation (FMUE) to detect 11 retinal conditions on optical coherence tomography (OCT). In the internal test set, FMUE achieved a higher F1 score of 96.76% than two state-of-the-art algorithms, RE… ▽ More Inability to express the confidence level and detect unseen classes has limited the clinical implementation of artificial intelligence in the real-world. We developed a foundation model with uncertainty estimation (FMUE) to detect 11 retinal conditions on optical coherence tomography (OCT). In the internal test set, FMUE achieved a higher F1 score of 96.76% than two state-of-the-art algorithms, RETFound and UIOS, and got further improvement with thresholding strategy to 98.44%. In the external test sets obtained from other OCT devices, FMUE achieved an accuracy of 88.75% and 92.73% before and after thresholding. Our model is superior to two ophthalmologists with a higher F1 score (95.17% vs. 61.93% &71.72%). Besides, our model correctly predicts high uncertainty scores for samples with ambiguous features, of non-target-category diseases, or with low-quality to prompt manual checks and prevent misdiagnosis. FMUE provides a trustworthy method for automatic retinal anomalies detection in the real-world clinical open set environment. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: All codes are available at https://github.com/yuanyuanpeng0129/FMUE

arXiv:2406.09317 [pdf, other]

Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

Authors: Meng Wang, Tian Lin, Aidi Lin, Kai Yu, Yuanyuan Peng, Lianyu Wang, Cheng Chen, Ke Zou, Huiyu Liang, Man Chen, Xue Yao, Meiqin Zhang, Binwei Huang, Chaoxin Zheng, Peixin Zhang, Wei Chen, Yilong Luo, Yifan Chen, Honghe Xia, Tingkun Shi, Qi Zhang, Jinming Guo, Xiaolin Chen, Jingcheng Wang, Yih Chung Tham , et al. (24 additional authors not shown)

Abstract: Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. To RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources… ▽ More Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. To RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources, encompassing a diverse range of diseases across multiple ethnicities and countries. RetiZero exhibits superior performance in several downstream tasks, including zero-shot disease recognition, image-to-image retrieval, and internal- and cross-domain disease identification. In zero-shot scenarios, RetiZero achieves Top5 accuracy scores of 0.8430 for 15 fundus diseases and 0.7561 for 52 fundus diseases. For image retrieval, it achieves Top5 scores of 0.9500 and 0.8860 for the same disease sets, respectively. Clinical evaluations show that RetiZero's Top3 zero-shot performance surpasses the average of 19 ophthalmologists from Singapore, China and the United States. Furthermore, RetiZero significantly enhances clinicians' accuracy in diagnosing fundus disease. These findings underscore the value of integrating the RetiZero foundation model into clinical settings, where a variety of fundus diseases are encountered. △ Less

Submitted 30 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.06038 [pdf, other]

Navigation and 3D Surface Reconstruction from Passive Whisker Sensing

Authors: Michael A. Lin, Hao Li, Chengyi Xing, Mark R. Cutkosky

Abstract: Whiskers provide a way to sense surfaces in the immediate environment without disturbing it. In this paper we present a method for using highly flexible, curved, passive whiskers mounted along a robot arm to gather sensory data as they brush past objects during normal robot motion. The information is useful both for guiding the robot in cluttered spaces and for reconstructing the exposed faces of… ▽ More Whiskers provide a way to sense surfaces in the immediate environment without disturbing it. In this paper we present a method for using highly flexible, curved, passive whiskers mounted along a robot arm to gather sensory data as they brush past objects during normal robot motion. The information is useful both for guiding the robot in cluttered spaces and for reconstructing the exposed faces of objects. Surface reconstruction depends on accurate localization of contact points along each whisker. We present an algorithm based on Bayesian filtering that rapidly converges to within 1\,mm of the actual contact locations. The piecewise-continuous history of contact locations from each whisker allows for accurate reconstruction of curves on object surfaces. Employing multiple whiskers and traces, we are able to produce an occupancy map of proximal objects. △ Less

Submitted 10 June, 2024; originally announced June 2024.

Comments: arXiv admin note: text overlap with arXiv:2210.12387

arXiv:2406.02778 [pdf, other]

MS-IMAP -- A Multi-Scale Graph Embedding Approach for Interpretable Manifold Learning

Authors: Shay Deutsch, Lionel Yelibi, Alex Tong Lin, Arjun Ravi Kannan

Abstract: Deriving meaningful representations from complex, high-dimensional data in unsupervised settings is crucial across diverse machine learning applications. This paper introduces a framework for multi-scale graph network embedding based on spectral graph wavelets that employs a contrastive learning approach. A significant feature of the proposed embedding is its capacity to establish a correspondence… ▽ More Deriving meaningful representations from complex, high-dimensional data in unsupervised settings is crucial across diverse machine learning applications. This paper introduces a framework for multi-scale graph network embedding based on spectral graph wavelets that employs a contrastive learning approach. A significant feature of the proposed embedding is its capacity to establish a correspondence between the embedding space and the input feature space which aids in deriving feature importance of the original features. We theoretically justify our approach and demonstrate that, in Paley-Wiener spaces on combinatorial graphs, the spectral graph wavelets operator offers greater flexibility and better control over smoothness properties compared to the Laplacian operator. We validate the effectiveness of our proposed graph embedding on a variety of public datasets through a range of downstream tasks, including clustering and unsupervised feature importance. △ Less

Submitted 5 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

arXiv:2405.18457 [pdf, other]

Improving Linear System Solvers for Hyperparameter Optimisation in Iterative Gaussian Processes

Authors: Jihao Andreas Lin, Shreyas Padhy, Bruno Mlodozeniec, Javier Antorán, José Miguel Hernández-Lobato

Abstract: Scaling hyperparameter optimisation to very large datasets remains an open problem in the Gaussian process community. This paper focuses on iterative methods, which use linear system solvers, like conjugate gradients, alternating projections or stochastic gradient descent, to construct an estimate of the marginal likelihood gradient. We discuss three key improvements which are applicable across so… ▽ More Scaling hyperparameter optimisation to very large datasets remains an open problem in the Gaussian process community. This paper focuses on iterative methods, which use linear system solvers, like conjugate gradients, alternating projections or stochastic gradient descent, to construct an estimate of the marginal likelihood gradient. We discuss three key improvements which are applicable across solvers: (i) a pathwise gradient estimator, which reduces the required number of solver iterations and amortises the computational cost of making predictions, (ii) warm starting linear system solvers with the solution from the previous step, which leads to faster solver convergence at the cost of negligible bias, (iii) early stopping linear system solvers after a limited computational budget, which synergises with warm starting, allowing solver progress to accumulate over multiple marginal likelihood steps. These techniques provide speed-ups of up to $72\times$ when solving to tolerance, and decrease the average residual norm by up to $7\times$ when stopping early. △ Less

Submitted 6 June, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

Comments: Preprint. arXiv admin note: text overlap with arXiv:2405.18328

arXiv:2405.18328 [pdf, other]

Warm Start Marginal Likelihood Optimisation for Iterative Gaussian Processes

Authors: Jihao Andreas Lin, Shreyas Padhy, Bruno Mlodozeniec, José Miguel Hernández-Lobato

Abstract: Gaussian processes are a versatile probabilistic machine learning model whose effectiveness often depends on good hyperparameters, which are typically learned by maximising the marginal likelihood. In this work, we consider iterative methods, which use iterative linear system solvers to approximate marginal likelihood gradients up to a specified numerical precision, allowing a trade-off between co… ▽ More Gaussian processes are a versatile probabilistic machine learning model whose effectiveness often depends on good hyperparameters, which are typically learned by maximising the marginal likelihood. In this work, we consider iterative methods, which use iterative linear system solvers to approximate marginal likelihood gradients up to a specified numerical precision, allowing a trade-off between compute time and accuracy of a solution. We introduce a three-level hierarchy of marginal likelihood optimisation for iterative Gaussian processes, and identify that the computational costs are dominated by solving sequential batches of large positive-definite systems of linear equations. We then propose to amortise computations by reusing solutions of linear system solvers as initialisations in the next step, providing a $\textit{warm start}$. Finally, we discuss the necessary conditions and quantify the consequences of warm starts and demonstrate their effectiveness on regression tasks, where warm starts achieve the same results as the conventional procedure while providing up to a $16 \times$ average speed-up among datasets. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Comments: Advances in Approximate Bayesian Inference 2024

arXiv:2405.16166 [pdf, other]

The Power of Hard Attention Transformers on Data Sequences: A Formal Language Theoretic Perspective

Authors: Pascal Bergsträßer, Chris Köcher, Anthony Widjaja Lin, Georg Zetzsche

Abstract: Formal language theory has recently been successfully employed to unravel the power of transformer encoders. This setting is primarily applicable in Natural Languange Processing (NLP), as a token embedding function (where a bounded number of tokens is admitted) is first applied before feeding the input to the transformer. On certain kinds of data (e.g. time series), we want our transformers to be… ▽ More Formal language theory has recently been successfully employed to unravel the power of transformer encoders. This setting is primarily applicable in Natural Languange Processing (NLP), as a token embedding function (where a bounded number of tokens is admitted) is first applied before feeding the input to the transformer. On certain kinds of data (e.g. time series), we want our transformers to be able to handle \emph{arbitrary} input sequences of numbers (or tuples thereof) without \emph{a priori} limiting the values of these numbers. In this paper, we initiate the study of the expressive power of transformer encoders on sequences of data (i.e. tuples of numbers). Our results indicate an increase in expressive power of hard attention transformers over data sequences, in stark contrast to the case of strings. In particular, we prove that Unique Hard Attention Transformers (UHAT) over inputs as data sequences no longer lie within the circuit complexity class $AC^0$ (even without positional encodings), unlike the case of string inputs, but are still within the complexity class $TC^0$ (even with positional encodings). Over strings, UHAT without positional encodings capture only regular languages. In contrast, we show that over data sequences UHAT can capture non-regular properties. Finally, we show that UHAT capture languages definable in an extension of linear temporal logic with unary numeric predicates and arithmetics. △ Less

Submitted 25 May, 2024; originally announced May 2024.

arXiv:2405.06945 [pdf, other]

Direct Learning of Mesh and Appearance via 3D Gaussian Splatting

Authors: Ancheng Lin, Jun Li

Abstract: Accurately reconstructing a 3D scene including explicit geometry information is both attractive and challenging. Geometry reconstruction can benefit from incorporating differentiable appearance models, such as Neural Radiance Fields and 3D Gaussian Splatting (3DGS). However, existing methods encounter efficiency issues due to indirect geometry learning and the paradigm of separately modeling geome… ▽ More Accurately reconstructing a 3D scene including explicit geometry information is both attractive and challenging. Geometry reconstruction can benefit from incorporating differentiable appearance models, such as Neural Radiance Fields and 3D Gaussian Splatting (3DGS). However, existing methods encounter efficiency issues due to indirect geometry learning and the paradigm of separately modeling geometry and surface appearance. In this work, we propose a learnable scene model that incorporates 3DGS with an explicit geometry representation, namely a mesh. Our model learns the mesh and appearance in an end-to-end manner, where we bind 3D Gaussians to the mesh faces and perform differentiable rendering of 3DGS to obtain photometric supervision. The model creates an effective information pathway to supervise the learning of both 3DGS and mesh. Experimental results demonstrate that the learned scene model not only achieves state-of-the-art efficiency and rendering quality but also supports manipulation using the explicit mesh. In addition, our model has a unique advantage in adapting to scene updates, thanks to the end-to-end learning of both mesh and appearance. △ Less

Submitted 26 September, 2024; v1 submitted 11 May, 2024; originally announced May 2024.

arXiv:2404.08887 [pdf, other]

doi 10.1007/978-3-031-56069-9_6

Countering Mainstream Bias via End-to-End Adaptive Local Learning

Authors: Jinhao Pan, Ziwei Zhu, Jianling Wang, Allen Lin, James Caverlee

Abstract: Collaborative filtering (CF) based recommendations suffer from mainstream bias -- where mainstream users are favored over niche users, leading to poor recommendation quality for many long-tail users. In this paper, we identify two root causes of this mainstream bias: (i) discrepancy modeling, whereby CF algorithms focus on modeling mainstream users while neglecting niche users with unique preferen… ▽ More Collaborative filtering (CF) based recommendations suffer from mainstream bias -- where mainstream users are favored over niche users, leading to poor recommendation quality for many long-tail users. In this paper, we identify two root causes of this mainstream bias: (i) discrepancy modeling, whereby CF algorithms focus on modeling mainstream users while neglecting niche users with unique preferences; and (ii) unsynchronized learning, where niche users require more training epochs than mainstream users to reach peak performance. Targeting these causes, we propose a novel end-To-end Adaptive Local Learning (TALL) framework to provide high-quality recommendations to both mainstream and niche users. TALL uses a loss-driven Mixture-of-Experts module to adaptively ensemble experts to provide customized local models for different users. Further, it contains an adaptive weight module to synchronize the learning paces of different users by dynamically adjusting weights in the loss. Extensive experiments demonstrate the state-of-the-art performance of the proposed model. Code and data are provided at \url{https://github.com/JP-25/end-To-end-Adaptive-Local-Leanring-TALL-} △ Less

Submitted 12 April, 2024; originally announced April 2024.

Comments: ECIR 2024

Journal ref: In European Conference on Information Retrieval 2024, vol 14612 (pp. 75-89)

arXiv:2402.14817 [pdf, other]

Cameras as Rays: Pose Estimation via Ray Diffusion

Authors: Jason Y. Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, Shubham Tulsiani

Abstract: Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparsely sampled views (<10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrinsics, we propose a distributed representation of camera pose that treats a camera as a bundle of rays. This representation allows for a tight coupling with spatia… ▽ More Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparsely sampled views (<10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrinsics, we propose a distributed representation of camera pose that treats a camera as a bundle of rays. This representation allows for a tight coupling with spatial image features improving pose precision. We observe that this representation is naturally suited for set-level transformers and develop a regression-based approach that maps image patches to corresponding rays. To capture the inherent uncertainties in sparse-view pose inference, we adapt this approach to learn a denoising diffusion model which allows us to sample plausible modes while improving performance. Our proposed methods, both regression- and diffusion-based, demonstrate state-of-the-art performance on camera pose estimation on CO3D while generalizing to unseen object categories and in-the-wild captures. △ Less

Submitted 4 April, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

Comments: In ICLR 2024 (oral). v2-3: updated references. Project webpage: https://jasonyzhang.com/RayDiffusion

arXiv:2402.09430 [pdf, other]

WiMANS: A Benchmark Dataset for WiFi-based Multi-user Activity Sensing

Authors: Shuokang Huang, Kaihan Li, Di You, Yichong Chen, Arvin Lin, Siying Liu, Xiaohui Li, Julie A. McCann

Abstract: WiFi-based human sensing has exhibited remarkable potential to analyze user behaviors in a non-intrusive and device-free manner, benefiting applications as diverse as smart homes and healthcare. However, most previous works focus on single-user sensing, which has limited practicability in scenarios involving multiple users. Although recent studies have begun to investigate WiFi-based multi-user se… ▽ More WiFi-based human sensing has exhibited remarkable potential to analyze user behaviors in a non-intrusive and device-free manner, benefiting applications as diverse as smart homes and healthcare. However, most previous works focus on single-user sensing, which has limited practicability in scenarios involving multiple users. Although recent studies have begun to investigate WiFi-based multi-user sensing, there remains a lack of benchmark datasets to facilitate reproducible and comparable research. To bridge this gap, we present WiMANS, to our knowledge, the first dataset for multi-user sensing based on WiFi. WiMANS contains over 9.4 hours of dual-band WiFi Channel State Information (CSI), as well as synchronized videos, monitoring simultaneous activities of multiple users. We exploit WiMANS to benchmark the performance of state-of-the-art WiFi-based human sensing models and video-based models, posing new challenges and opportunities for future work. We believe WiMANS can push the boundaries of current studies and catalyze the research on WiFi-based multi-user sensing. △ Less

Submitted 12 March, 2024; v1 submitted 24 January, 2024; originally announced February 2024.

Comments: We present WiMANS, to our knowledge, the first dataset for multi-user activity sensing based on WiFi

arXiv:2402.01695 [pdf, other]

Language-Guided World Models: A Model-Based Approach to AI Control

Authors: Alex Zhang, Khanh Nguyen, Jens Tuyls, Albert Lin, Karthik Narasimhan

Abstract: This paper introduces the concept of Language-Guided World Models (LWMs) -- probabilistic models that can simulate environments by reading texts. Agents equipped with these models provide humans with more extensive and efficient control, allowing them to simultaneously alter agent behaviors in multiple tasks via natural verbal communication. In this work, we take initial steps in developing robust… ▽ More This paper introduces the concept of Language-Guided World Models (LWMs) -- probabilistic models that can simulate environments by reading texts. Agents equipped with these models provide humans with more extensive and efficient control, allowing them to simultaneously alter agent behaviors in multiple tasks via natural verbal communication. In this work, we take initial steps in developing robust LWMs that can generalize to compositionally novel language descriptions. We design a challenging world modeling benchmark based on the game of MESSENGER (Hanjie et al., 2021), featuring evaluation settings that require varying degrees of compositional generalization. Our experiments reveal the lack of generalizability of the state-of-the-art Transformer model, as it offers marginal improvements in simulation quality over a no-text baseline. We devise a more robust model by fusing the Transformer with the EMMA attention mechanism (Hanjie et al., 2021). Our model substantially outperforms the Transformer and approaches the performance of a model with an oracle semantic parsing and grounding capability. To demonstrate the practicality of this model in improving AI safety and transparency, we simulate a scenario in which the model enables an agent to present plans to a human before execution, and to revise plans based on their language feedback. △ Less

Submitted 4 September, 2024; v1 submitted 23 January, 2024; originally announced February 2024.

Comments: SpLU-RoboNLP workshop at ACL 2024

arXiv:2401.02618 [pdf, ps, other]

doi 10.1145/3632864

Regular Abstractions for Array Systems

Authors: Chih-Duo Hong, Anthony W. Lin

Abstract: Verifying safety and liveness over array systems is a highly challenging problem. Array systems naturally capture parameterized systems such as distributed protocols with an unbounded number of processes. Such distributed protocols often exploit process IDs during their computation, resulting in array systems whose element values range over an infinite domain. In this paper, we develop a novel fra… ▽ More Verifying safety and liveness over array systems is a highly challenging problem. Array systems naturally capture parameterized systems such as distributed protocols with an unbounded number of processes. Such distributed protocols often exploit process IDs during their computation, resulting in array systems whose element values range over an infinite domain. In this paper, we develop a novel framework for proving safety and liveness over array systems. The crux of the framework is to overapproximate an array system as a string rewriting system (i.e. over a finite alphabet) by means of a new predicate abstraction that exploits the so-called indexed predicates. This allows us to tap into powerful verification methods for string rewriting systems that have been heavily developed in the last few decades (e.g. regular model checking). We demonstrate how our method yields simple, automatically verifiable proofs of safety and liveness properties for challenging examples, including Dijkstra's self-stabilizing protocol and the Chang-Roberts leader election protocol. △ Less

Submitted 4 January, 2024; originally announced January 2024.

arXiv:2312.10074 [pdf]

STAGER checklist: Standardized Testing and Assessment Guidelines for Evaluating Generative AI Reliability

Authors: Jinghong Chen, Lingxuan Zhu, Weiming Mou, Zaoqu Liu, Quan Cheng, Anqi Lin, Jian Zhang, Peng Luo

Abstract: Generative Artificial Intelligence (AI) holds immense potential in medical applications. Numerous studies have explored the efficacy of various generative AI models within healthcare contexts, but there is a lack of a comprehensive and systematic evaluation framework. Given that some studies evaluating the ability of generative AI for medical applications have deficiencies in their methodological… ▽ More Generative Artificial Intelligence (AI) holds immense potential in medical applications. Numerous studies have explored the efficacy of various generative AI models within healthcare contexts, but there is a lack of a comprehensive and systematic evaluation framework. Given that some studies evaluating the ability of generative AI for medical applications have deficiencies in their methodological design, standardized guidelines for their evaluation are also currently lacking. In response, our objective is to devise standardized assessment guidelines tailored for evaluating the performance of generative AI systems in medical contexts. To this end, we conducted a thorough literature review using the PubMed and Google Scholar databases, focusing on research that tests generative AI capabilities in medicine. Our multidisciplinary team, comprising experts in life sciences, clinical medicine, medical engineering, and generative AI users, conducted several discussion sessions and developed a checklist of 23 items. The checklist is designed to encompass the critical evaluation aspects of generative AI in medical applications comprehensively. This checklist, and the broader assessment framework it anchors, address several key dimensions, including question collection, querying methodologies, and assessment techniques. We aim to provide a holistic evaluation of AI systems. The checklist delineates a clear pathway from question gathering to result assessment, offering researchers guidance through potential challenges and pitfalls. Our framework furnishes a standardized, systematic approach for research involving the testing of generative AI's applicability in medicine. It enhances the quality of research reporting and aids in the evolution of generative AI in medicine and life sciences. △ Less

Submitted 7 December, 2023; originally announced December 2023.

Comments: 11 pages, 0 figure, 2 tables

arXiv:2312.08604 [pdf, other]

Verification of Neural Reachable Tubes via Scenario Optimization and Conformal Prediction

Authors: Albert Lin, Somil Bansal

Abstract: Learning-based approaches for controlling safety-critical systems are rapidly growing in popularity; thus, it is important to assure their performance and safety. Hamilton-Jacobi (HJ) reachability analysis is a popular formal verification tool for providing such guarantees, since it can handle general nonlinear system dynamics, bounded adversarial system disturbances, and state and input constrain… ▽ More Learning-based approaches for controlling safety-critical systems are rapidly growing in popularity; thus, it is important to assure their performance and safety. Hamilton-Jacobi (HJ) reachability analysis is a popular formal verification tool for providing such guarantees, since it can handle general nonlinear system dynamics, bounded adversarial system disturbances, and state and input constraints. However, its computational and memory complexity scales exponentially with the state dimension, making it intractable for large-scale systems. To overcome this challenge, neural approaches, such as DeepReach, have been used to synthesize reachable tubes and safety controllers for high-dimensional systems. However, verifying these neural reachable tubes remains challenging. In this work, we propose two verification methods, based on robust scenario optimization and conformal prediction, to provide probabilistic safety guarantees for neural reachable tubes. Our methods allow a direct trade-off between resilience to outlier errors in the neural tube, which are inevitable in a learning-based approach, and the strength of the probabilistic safety guarantee. Furthermore, we show that split conformal prediction, a widely used method in the machine learning community for uncertainty quantification, reduces to a scenario-based approach, making the two methods equivalent not only for verification of neural reachable tubes but also more generally. To our knowledge, our proof is the first in the literature to show a strong relationship between conformal prediction and scenario optimization. Finally, we propose an outlier-adjusted verification approach that uses the error distribution in neural reachable tubes to recover greater safe volumes. We demonstrate the efficacy of the proposed approaches for the high-dimensional problems of multi-vehicle collision avoidance and rocket landing with no-go zones. △ Less

Submitted 9 April, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

Comments: Accepted to 6th Annual Learning for Dynamics & Control Conference. arXiv admin note: text overlap with arXiv:2209.12336

arXiv:2311.17037 [pdf, other]

Concurrent Stochastic Lossy Channel Games

Authors: Daniel Stan, Muhammad Najib, Anthony Widjaja Lin, Parosh Aziz Abdulla

Abstract: Concurrent stochastic games are an important formalism for the rational verification of probabilistic multi-agent systems, which involves verifying whether a temporal logic property is satisfied in some or all game-theoretic equilibria of such systems. In this work, we study the rational verification of probabilistic multi-agent systems where agents can cooperate by communicating over unbounded lo… ▽ More Concurrent stochastic games are an important formalism for the rational verification of probabilistic multi-agent systems, which involves verifying whether a temporal logic property is satisfied in some or all game-theoretic equilibria of such systems. In this work, we study the rational verification of probabilistic multi-agent systems where agents can cooperate by communicating over unbounded lossy channels. To model such systems, we present concurrent stochastic lossy channel games (CSLCG) and employ an equilibrium concept from cooperative game theory known as the core, which is the most fundamental and widely studied cooperative equilibrium concept. Our main contribution is twofold. First, we show that the rational verification problem is undecidable for systems whose agents have almost-sure LTL objectives. Second, we provide a decidable fragment of such a class of objectives that subsumes almost-sure reachability and safety. Our techniques involve reductions to solving infinite-state zero-sum games with conjunctions of qualitative objectives. To the best of our knowledge, our result represents the first decidability result on the rational verification of stochastic multi-agent systems on infinite arenas. △ Less

Submitted 28 November, 2023; originally announced November 2023.

Comments: To appear at CSL 2024. Extended version

arXiv:2311.15883 [pdf, other]

Characterising and Verifying the Core in Concurrent Multi-Player Mean-Payoff Games (Full Version)

Authors: Julian Gutierrez, Anthony W. Lin, Muhammad Najib, Thomas Steeples, Michael Wooldridge

Abstract: Concurrent multi-player mean-payoff games are important models for systems of agents with individual, non-dichotomous preferences. Whilst these games have been extensively studied in terms of their equilibria in non-cooperative settings, this paper explores an alternative solution concept: the core from cooperative game theory. This concept is particularly relevant for cooperative AI systems, as i… ▽ More Concurrent multi-player mean-payoff games are important models for systems of agents with individual, non-dichotomous preferences. Whilst these games have been extensively studied in terms of their equilibria in non-cooperative settings, this paper explores an alternative solution concept: the core from cooperative game theory. This concept is particularly relevant for cooperative AI systems, as it enables the modelling of cooperation among agents, even when their goals are not fully aligned. Our contribution is twofold. First, we provide a characterisation of the core using discrete geometry techniques and establish a necessary and sufficient condition for its non-emptiness. We then use the characterisation to prove the existence of polynomial witnesses in the core. Second, we use the existence of such witnesses to solve key decision problems in rational verification and provide tight complexity bounds for the problem of checking whether some/every equilibrium in a game satisfies a given LTL or GR(1) specification. Our approach is general and can be adapted to handle other specifications expressed in various fragments of LTL without incurring additional computational costs. △ Less

Submitted 27 November, 2023; originally announced November 2023.

Comments: This is the full version of the paper with the same title that appears in the CSL'24 proceedings

arXiv:2311.04031 [pdf, other]

Ramsey Quantifiers in Linear Arithmetics

Authors: Pascal Bergsträßer, Moses Ganardi, Anthony W. Lin, Georg Zetzsche

Abstract: We study Satisfiability Modulo Theories (SMT) enriched with the so-called Ramsey quantifiers, which assert the existence of cliques (complete graphs) in the graph induced by some formulas. The extended framework is known to have applications in proving program termination (in particular, whether a transitive binary predicate is well-founded), and monadic decomposability of SMT formulas. Our main r… ▽ More We study Satisfiability Modulo Theories (SMT) enriched with the so-called Ramsey quantifiers, which assert the existence of cliques (complete graphs) in the graph induced by some formulas. The extended framework is known to have applications in proving program termination (in particular, whether a transitive binary predicate is well-founded), and monadic decomposability of SMT formulas. Our main result is a new algorithm for eliminating Ramsey quantifiers from three common SMT theories: Linear Integer Arithmetic (LIA), Linear Real Arithmetic (LRA), and Linear Integer Real Arithmetic (LIRA). In particular, if we work only with existentially quantified formulas, then our algorithm runs in polynomial time and produces a formula of linear size. One immediate consequence is that checking well-foundedness of a given formula in the aforementioned theory defining a transitive predicate can be straightforwardly handled by highly optimized SMT-solvers. We show also how this provides a uniform semi-algorithm for verifying termination and liveness with completeness guarantee (in fact, with an optimal computational complexity) for several well-known classes of infinite-state systems, which include succinct timed systems, one-counter systems, and monotonic counter systems. Another immediate consequence is a solution to an open problem on checking monadic decomposability of a given relation in quantifier-free fragments of LRA and LIRA, which is an important problem in automated reasoning and constraint databases. Our result immediately implies decidability of this problem with an optimal complexity (coNP-complete) and enables exploitation of SMT-solvers. It also provides a termination guarantee for the generic monadic decomposition algorithm of Veanes et al. for LIA, LRA, and LIRA. We report encouraging experimental results on a prototype implementation of our algorithms on micro-benchmarks. △ Less

Submitted 7 November, 2023; originally announced November 2023.

arXiv:2311.03901 [pdf, ps, other]

Parikh's Theorem Made Symbolic

Authors: Matthew Hague, Artur Jeż, Anthony W. Lin

Abstract: Parikh's Theorem is a fundamental result in automata theory with numerous applications in computer science: software verification (e.g. infinite-state verification, string constraints, and theory of arrays), verification of cryptographic protocols (e.g. using Horn clauses modulo equational theories) and database querying (e.g. evaluating path-queries in graph databases). Parikh's Theorem states th… ▽ More Parikh's Theorem is a fundamental result in automata theory with numerous applications in computer science: software verification (e.g. infinite-state verification, string constraints, and theory of arrays), verification of cryptographic protocols (e.g. using Horn clauses modulo equational theories) and database querying (e.g. evaluating path-queries in graph databases). Parikh's Theorem states that the letter-counting abstraction of a language recognized by finite automata or context-free grammars is definable in Presburger Arithmetic. Unfortunately, real-world applications typically require large alphabets - which are well-known to be not amenable to explicit treatment of the alphabets. Symbolic automata have proven in the last decade to be an effective algorithmic framework for handling large finite or even infinite alphabets. A symbolic automaton employs an effective boolean algebra, which offers a symbolic representation of character sets and often lends itself to an exponentially more succinct representation of a language. Instead of letter-counting, Parikh's Theorem for symbolic automata amounts to counting the number of times different predicates are satisfied by an input sequence. Unfortunately, naively applying Parikh's Theorem from classical automata theory to symbolic automata yields existential Presburger formulas of exponential size. We provide a new construction for Parikh's Theorem for symbolic automata and grammars, which avoids this exponential blowup: our algorithm computes an existential formula in polynomial-time over (quantifier-free) Presburger and the base theory. In fact, our algorithm extends to the model of parametric symbolic grammars, which are one of the most expressive models of languages over infinite alphabets. We have implemented our algorithm and show it can be used to solve string constraints that are difficult to solve by existing solvers. △ Less

Submitted 31 July, 2024; v1 submitted 7 November, 2023; originally announced November 2023.

Comments: Accepted tp POPL '24

arXiv:2310.20581 [pdf, other]

Stochastic Gradient Descent for Gaussian Processes Done Right

Authors: Jihao Andreas Lin, Shreyas Padhy, Javier Antorán, Austin Tripp, Alexander Terenin, Csaba Szepesvári, José Miguel Hernández-Lobato, David Janz

Abstract: As is well known, both sampling from the posterior and computing the mean of the posterior in Gaussian process regression reduces to solving a large linear system of equations. We study the use of stochastic gradient descent for solving this linear system, and show that when \emph{done right} -- by which we mean using specific insights from the optimisation and kernel communities -- stochastic gra… ▽ More As is well known, both sampling from the posterior and computing the mean of the posterior in Gaussian process regression reduces to solving a large linear system of equations. We study the use of stochastic gradient descent for solving this linear system, and show that when \emph{done right} -- by which we mean using specific insights from the optimisation and kernel communities -- stochastic gradient descent is highly effective. To that end, we introduce a particularly simple \emph{stochastic dual descent} algorithm, explain its design in an intuitive manner and illustrate the design choices through a series of ablation studies. Further experiments demonstrate that our new method is highly competitive. In particular, our evaluations on the UCI regression tasks and on Bayesian optimisation set our approach apart from preconditioned conjugate gradients and variational Gaussian process approximations. Moreover, our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction. △ Less

Submitted 28 April, 2024; v1 submitted 31 October, 2023; originally announced October 2023.

arXiv:2310.09974 [pdf, other]

Algorithmic Contract Design for Crowdsourced Ranking

Authors: Kiriaki Frangias, Andrew Lin, Ellen Vitercik, Manolis Zampetakis

Abstract: Ranking is fundamental to many areas, such as search engine optimization, human feedback for language models, as well as peer grading. Crowdsourcing, which is often used for these tasks, requires proper incentivization to ensure accurate inputs. In this work, we draw on the field of \emph{contract theory} from Economics to propose a novel mechanism that enables a \emph{principal} to accurately ran… ▽ More Ranking is fundamental to many areas, such as search engine optimization, human feedback for language models, as well as peer grading. Crowdsourcing, which is often used for these tasks, requires proper incentivization to ensure accurate inputs. In this work, we draw on the field of \emph{contract theory} from Economics to propose a novel mechanism that enables a \emph{principal} to accurately rank a set of items by incentivizing agents to provide pairwise comparisons of the items. Our mechanism implements these incentives by verifying a subset of each agent's comparisons, a task we assume to be costly. The agent is compensated (for example, monetarily or with class credit) based on the accuracy of these comparisons. Our mechanism achieves the following guarantees: (1) it only requires the principal to verify $O(\log s)$ comparisons, where $s$ is the total number of agents, and (2) it provably achieves higher total utility for the principal compared to ranking the items herself with no crowdsourcing. △ Less

Submitted 24 January, 2024; v1 submitted 15 October, 2023; originally announced October 2023.

arXiv:2310.08873 [pdf, other]

Interactive Navigation in Environments with Traversable Obstacles Using Large Language and Vision-Language Models

Authors: Zhen Zhang, Anran Lin, Chun Wai Wong, Xiangyu Chu, Qi Dou, K. W. Samuel Au

Abstract: This paper proposes an interactive navigation framework by using large language and vision-language models, allowing robots to navigate in environments with traversable obstacles. We utilize the large language model (GPT-3.5) and the open-set Vision-language Model (Grounding DINO) to create an action-aware costmap to perform effective path planning without fine-tuning. With the large models, we ca… ▽ More This paper proposes an interactive navigation framework by using large language and vision-language models, allowing robots to navigate in environments with traversable obstacles. We utilize the large language model (GPT-3.5) and the open-set Vision-language Model (Grounding DINO) to create an action-aware costmap to perform effective path planning without fine-tuning. With the large models, we can achieve an end-to-end system from textual instructions like "Can you pass through the curtains to deliver medicines to me?", to bounding boxes (e.g., curtains) with action-aware attributes. They can be used to segment LiDAR point clouds into two parts: traversable and untraversable parts, and then an action-aware costmap is constructed for generating a feasible path. The pre-trained large models have great generalization ability and do not require additional annotated data for training, allowing fast deployment in the interactive navigation tasks. We choose to use multiple traversable objects such as curtains and grasses for verification by instructing the robot to traverse them. Besides, traversing curtains in a medical scenario was tested. All experimental results demonstrated the proposed framework's effectiveness and adaptability to diverse environments. △ Less

Submitted 12 March, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

Comments: Accepted by 2024 IEEE International Conference on Robotics and Automation (ICRA), 7 pages, 8 figures

arXiv:2310.07916 [pdf, other]

Dynamic Appearance Particle Neural Radiance Field

Authors: Ancheng Lin, Jun Li

Abstract: Neural Radiance Fields (NeRFs) have shown great potential in modelling 3D scenes. Dynamic NeRFs extend this model by capturing time-varying elements, typically using deformation fields. The existing dynamic NeRFs employ a similar Eulerian representation for both light radiance and deformation fields. This leads to a close coupling of appearance and motion and lacks a physical interpretation. In th… ▽ More Neural Radiance Fields (NeRFs) have shown great potential in modelling 3D scenes. Dynamic NeRFs extend this model by capturing time-varying elements, typically using deformation fields. The existing dynamic NeRFs employ a similar Eulerian representation for both light radiance and deformation fields. This leads to a close coupling of appearance and motion and lacks a physical interpretation. In this work, we propose Dynamic Appearance Particle Neural Radiance Field (DAP-NeRF), which introduces particle-based representation to model the motions of visual elements in a dynamic 3D scene. DAP-NeRF consists of superposition of a static field and a dynamic field. The dynamic field is quantised as a collection of {\em appearance particles}, which carries the visual information of a small dynamic element in the scene and is equipped with a motion model. All components, including the static field, the visual features and motion models of the particles, are learned from monocular videos without any prior geometric knowledge of the scene. We develop an efficient computational framework for the particle-based model. We also construct a new dataset to evaluate motion modelling. Experimental results show that DAP-NeRF is an effective technique to capture not only the appearance but also the physically meaningful motions in a 3D dynamic scene. △ Less

Submitted 10 December, 2023; v1 submitted 11 October, 2023; originally announced October 2023.

arXiv:2310.05126 [pdf, other]

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Authors: Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, Qin Jin, Liang He, Xin Alex Lin, Fei Huang

Abstract: Text is ubiquitous in our visual world, conveying crucial information, such as in documents, websites, and everyday photographs. In this work, we propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM). By leveraging the shallow text recognition ability of the MLLM, we only finetuned 1.2% parameters and… ▽ More Text is ubiquitous in our visual world, conveying crucial information, such as in documents, websites, and everyday photographs. In this work, we propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM). By leveraging the shallow text recognition ability of the MLLM, we only finetuned 1.2% parameters and the training cost is much lower than previous work following domain-specific pretraining and finetuning paradigms. Concretely, UReader is jointly finetuned on a wide range of Visually-situated Language Understanding tasks via a unified instruction format. To enhance the visual text and semantic understanding, we further apply two auxiliary tasks with the same format, namely text reading and key points generation tasks. We design a shape-adaptive cropping module before the encoder-decoder architecture of MLLM to leverage the frozen low-resolution vision encoder for processing high-resolution images. Without downstream finetuning, our single model achieves state-of-the-art ocr-free performance in 8 out of 10 visually-situated language understanding tasks, across 5 domains: documents, tables, charts, natural images, and webpage screenshots. Codes and instruction-tuning datasets will be released. △ Less

Submitted 8 October, 2023; originally announced October 2023.

arXiv:2310.03817 [pdf, ps, other]

Logical Languages Accepted by Transformer Encoders with Hard Attention

Authors: Pablo Barcelo, Alexander Kozachinskiy, Anthony Widjaja Lin, Vladimir Podolskii

Abstract: We contribute to the study of formal languages that can be recognized by transformer encoders. We focus on two self-attention mechanisms: (1) UHAT (Unique Hard Attention Transformers) and (2) AHAT (Average Hard Attention Transformers). UHAT encoders are known to recognize only languages inside the circuit complexity class ${\sf AC}^0$, i.e., accepted by a family of poly-sized and depth-bounded boo… ▽ More We contribute to the study of formal languages that can be recognized by transformer encoders. We focus on two self-attention mechanisms: (1) UHAT (Unique Hard Attention Transformers) and (2) AHAT (Average Hard Attention Transformers). UHAT encoders are known to recognize only languages inside the circuit complexity class ${\sf AC}^0$, i.e., accepted by a family of poly-sized and depth-bounded boolean circuits with unbounded fan-ins. On the other hand, AHAT encoders can recognize languages outside ${\sf AC}^0$), but their expressive power still lies within the bigger circuit complexity class ${\sf TC}^0$, i.e., ${\sf AC}^0$-circuits extended by majority gates. We first show a negative result that there is an ${\sf AC}^0$-language that cannot be recognized by an UHAT encoder. On the positive side, we show that UHAT encoders can recognize a rich fragment of ${\sf AC}^0$-languages, namely, all languages definable in first-order logic with arbitrary unary numerical predicates. This logic, includes, for example, all regular languages from ${\sf AC}^0$. We then show that AHAT encoders can recognize all languages of our logic even when we enrich it with counting terms. We apply these results to derive new results on the expressive power of UHAT and AHAT up to permutation of letters (a.k.a. Parikh images). △ Less

Submitted 5 October, 2023; originally announced October 2023.

arXiv:2310.00420 [pdf, other]

An Efficient Algorithm for Clustered Multi-Task Compressive Sensing

Authors: Alexander Lin, Demba Ba

Abstract: This paper considers clustered multi-task compressive sensing, a hierarchical model that solves multiple compressive sensing tasks by finding clusters of tasks that leverage shared information to mutually improve signal reconstruction. The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions. The main bottleneck involves repeated matri… ▽ More This paper considers clustered multi-task compressive sensing, a hierarchical model that solves multiple compressive sensing tasks by finding clusters of tasks that leverage shared information to mutually improve signal reconstruction. The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions. The main bottleneck involves repeated matrix inversion and log-determinant computation for multiple large covariance matrices. We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices. Our approach combines Monte Carlo sampling with iterative linear solvers. Our experiments reveal that compared to the existing baseline, our algorithm can be up to thousands of times faster and an order of magnitude more memory-efficient. △ Less

Submitted 30 September, 2023; originally announced October 2023.

arXiv:2309.07574 [pdf, other]

doi 10.1145/3539618.3591887

MMEAD: MS MARCO Entity Annotations and Disambiguations

Authors: Chris Kamphuis, Aileen Lin, Siwen Yang, Jimmy Lin, Arjen P. de Vries, Faegheh Hasibi

Abstract: MMEAD, or MS MARCO Entity Annotations and Disambiguations, is a resource for entity links for the MS MARCO datasets. We specify a format to store and share links for both document and passage collections of MS MARCO. Following this specification, we release entity links to Wikipedia for documents and passages in both MS MARCO collections (v1 and v2). Entity links have been produced by the REL and… ▽ More MMEAD, or MS MARCO Entity Annotations and Disambiguations, is a resource for entity links for the MS MARCO datasets. We specify a format to store and share links for both document and passage collections of MS MARCO. Following this specification, we release entity links to Wikipedia for documents and passages in both MS MARCO collections (v1 and v2). Entity links have been produced by the REL and BLINK systems. MMEAD is an easy-to-install Python package, allowing users to load the link data and entity embeddings effortlessly. Using MMEAD takes only a few lines of code. Finally, we show how MMEAD can be used for IR research that uses entity information. We show how to improve recall@1000 and MRR@10 on more complex queries on the MS MARCO v1 passage dataset by using this resource. We also demonstrate how entity expansions can be used for interactive search applications. △ Less

Submitted 14 September, 2023; originally announced September 2023.

arXiv:2308.00175 [pdf, ps, other]

Decision Procedures for Sequence Theories (Technical Report)

Authors: Artur Jeż, Anthony W. Lin, Oliver Markgraf, Philipp Rümmer

Abstract: Sequence theories are an extension of theories of strings with an infinite alphabet of letters, together with a corresponding alphabet theory (e.g. linear integer arithmetic). Sequences are natural abstractions of extendable arrays, which permit a wealth of operations including append, map, split, and concatenation. In spite of the growing amount of tool support for theories of sequences by leadin… ▽ More Sequence theories are an extension of theories of strings with an infinite alphabet of letters, together with a corresponding alphabet theory (e.g. linear integer arithmetic). Sequences are natural abstractions of extendable arrays, which permit a wealth of operations including append, map, split, and concatenation. In spite of the growing amount of tool support for theories of sequences by leading SMT-solvers, little is known about the decidability of sequence theories, which is in stark contrast to the state of the theories of strings. We show that the decidable theory of strings with concatenation and regular constraints can be extended to the world of sequences over an alphabet theory that forms a Boolean algebra, while preserving decidability. In particular, decidability holds when regular constraints are interpreted as parametric automata (which extend both symbolic automata and variable automata), but fails when interpreted as register automata (even over the alphabet theory of equality). When length constraints are added, the problem is Turing-equivalent to word equations with length (and regular) constraints. Similar investigations are conducted in the presence of symbolic transducers, which naturally model sequence functions like map, split, filter, etc. We have developed a new sequence solver, SeCo, based on parametric automata, and show its efficacy on two classes of benchmarks: (i) invariant checking on array-manipulating programs and parameterized systems, and (ii) benchmarks on symbolic register automata. △ Less

Submitted 31 July, 2023; originally announced August 2023.

arXiv:2307.09729 [pdf, other]

NTIRE 2023 Quality Assessment of Video Enhancement Challenge

Authors: Xiaohong Liu, Xiongkuo Min, Wei Sun, Yulun Zhang, Kai Zhang, Radu Timofte, Guangtao Zhai, Yixuan Gao, Yuqin Cao, Tengchuan Kou, Yunlong Dong, Ziheng Jia, Yilin Li, Wei Wu, Shuming Hu, Sibin Deng, Pengxiang Xiao, Ying Chen, Kai Li, Kai Zhao, Kun Yuan, Ming Sun, Heng Cong, Hao Wang, Lingzhi Fu , et al. (47 additional authors not shown)

Abstract: This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual… ▽ More This paper reports on the NTIRE 2023 Quality Assessment of Video Enhancement Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2023. This challenge is to address a major challenge in the field of video processing, namely, video quality assessment (VQA) for enhanced videos. The challenge uses the VQA Dataset for Perceptual Video Enhancement (VDPVE), which has a total of 1211 enhanced videos, including 600 videos with color, brightness, and contrast enhancements, 310 videos with deblurring, and 301 deshaked videos. The challenge has a total of 167 registered participants. 61 participating teams submitted their prediction results during the development phase, with a total of 3168 submissions. A total of 176 submissions were submitted by 37 participating teams during the final testing phase. Finally, 19 participating teams submitted their models and fact sheets, and detailed the methods they used. Some methods have achieved better results than baseline methods, and the winning methods have demonstrated superior prediction performance. △ Less

Submitted 18 July, 2023; originally announced July 2023.

arXiv:2307.07816 [pdf, other]

Minimal Random Code Learning with Mean-KL Parameterization

Authors: Jihao Andreas Lin, Gergely Flamich, José Miguel Hernández-Lobato

Abstract: This paper studies the qualitative behavior and robustness of two variants of Minimal Random Code Learning (MIRACLE) used to compress variational Bayesian neural networks. MIRACLE implements a powerful, conditionally Gaussian variational approximation for the weight posterior $Q_{\mathbf{w}}$ and uses relative entropy coding to compress a weight sample from the posterior using a Gaussian coding di… ▽ More This paper studies the qualitative behavior and robustness of two variants of Minimal Random Code Learning (MIRACLE) used to compress variational Bayesian neural networks. MIRACLE implements a powerful, conditionally Gaussian variational approximation for the weight posterior $Q_{\mathbf{w}}$ and uses relative entropy coding to compress a weight sample from the posterior using a Gaussian coding distribution $P_{\mathbf{w}}$. To achieve the desired compression rate, $D_{\mathrm{KL}}[Q_{\mathbf{w}} \Vert P_{\mathbf{w}}]$ must be constrained, which requires a computationally expensive annealing procedure under the conventional mean-variance (Mean-Var) parameterization for $Q_{\mathbf{w}}$. Instead, we parameterize $Q_{\mathbf{w}}$ by its mean and KL divergence from $P_{\mathbf{w}}$ to constrain the compression cost to the desired value by construction. We demonstrate that variational training with Mean-KL parameterization converges twice as fast and maintains predictive performance after compression. Furthermore, we show that Mean-KL leads to more meaningful variational distributions with heavier tails and compressed weight samples which are more robust to pruning. △ Less

Submitted 4 December, 2023; v1 submitted 15 July, 2023; originally announced July 2023.

Comments: ICML Neural Compression Workshop 2023

arXiv:2307.06093 [pdf, other]

Online Laplace Model Selection Revisited

Authors: Jihao Andreas Lin, Javier Antorán, José Miguel Hernández-Lobato

Abstract: The Laplace approximation provides a closed-form model selection objective for neural networks (NN). Online variants, which optimise NN parameters jointly with hyperparameters, like weight decay strength, have seen renewed interest in the Bayesian deep learning community. However, these methods violate Laplace's method's critical assumption that the approximation is performed around a mode of the… ▽ More The Laplace approximation provides a closed-form model selection objective for neural networks (NN). Online variants, which optimise NN parameters jointly with hyperparameters, like weight decay strength, have seen renewed interest in the Bayesian deep learning community. However, these methods violate Laplace's method's critical assumption that the approximation is performed around a mode of the loss, calling into question their soundness. This work re-derives online Laplace methods, showing them to target a variational bound on a mode-corrected variant of the Laplace evidence which does not make stationarity assumptions. Online Laplace and its mode-corrected counterpart share stationary points where 1. the NN parameters are a maximum a posteriori, satisfying the Laplace method's assumption, and 2. the hyperparameters maximise the Laplace evidence, motivating online methods. We demonstrate that these optima are roughly attained in practise by online algorithms using full-batch gradient descent on UCI regression datasets. The optimised hyperparameters prevent overfitting and outperform validation-based early stopping. △ Less

Submitted 9 January, 2024; v1 submitted 12 July, 2023; originally announced July 2023.

Comments: Advances in Approximate Bayesian Inference 2023

arXiv:2307.06055 [pdf, other]

Function-Space Regularization for Deep Bayesian Classification

Authors: Jihao Andreas Lin, Joe Watson, Pascal Klink, Jan Peters

Abstract: Bayesian deep learning approaches assume model parameters to be latent random variables and infer posterior distributions to quantify uncertainty, increase safety and trust, and prevent overconfident and unpredictable behavior. However, weight-space priors are model-specific, can be difficult to interpret and are hard to specify. Instead, we apply a Dirichlet prior in predictive space and perform… ▽ More Bayesian deep learning approaches assume model parameters to be latent random variables and infer posterior distributions to quantify uncertainty, increase safety and trust, and prevent overconfident and unpredictable behavior. However, weight-space priors are model-specific, can be difficult to interpret and are hard to specify. Instead, we apply a Dirichlet prior in predictive space and perform approximate function-space variational inference. To this end, we interpret conventional categorical predictions from stochastic neural network classifiers as samples from an implicit Dirichlet distribution. By adapting the inference, the same function-space prior can be combined with different models without affecting model architecture or size. We illustrate the flexibility and efficacy of such a prior with toy experiments and demonstrate scalability, improved uncertainty quantification and adversarial robustness with large-scale image classification experiments. △ Less

Submitted 12 July, 2023; originally announced July 2023.

Comments: Advances in Approximate Bayesian Inference 2023

arXiv:2307.03093 [pdf, other]

Beyond Intuition, a Framework for Applying GPs to Real-World Data

Authors: Kenza Tazi, Jihao Andreas Lin, Ross Viljoen, Alex Gardner, ST John, Hong Ge, Richard E. Turner

Abstract: Gaussian Processes (GPs) offer an attractive method for regression over small, structured and correlated datasets. However, their deployment is hindered by computational costs and limited guidelines on how to apply GPs beyond simple low-dimensional datasets. We propose a framework to identify the suitability of GPs to a given problem and how to set up a robust and well-specified GP model. The guid… ▽ More Gaussian Processes (GPs) offer an attractive method for regression over small, structured and correlated datasets. However, their deployment is hindered by computational costs and limited guidelines on how to apply GPs beyond simple low-dimensional datasets. We propose a framework to identify the suitability of GPs to a given problem and how to set up a robust and well-specified GP model. The guidelines formalise the decisions of experienced GP practitioners, with an emphasis on kernel design and options for computational scalability. The framework is then applied to a case study of glacier elevation change yielding more accurate results at test time. △ Less

Submitted 17 July, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

Comments: Accepted at the ICML Workshop on Structured Probabilistic Inference and Generative Modelling (2023)

arXiv:2306.11589 [pdf, other]

Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent

Authors: Jihao Andreas Lin, Javier Antorán, Shreyas Padhy, David Janz, José Miguel Hernández-Lobato, Alexander Terenin

Abstract: Gaussian processes are a powerful framework for quantifying uncertainty and for sequential decision-making but are limited by the requirement of solving linear systems. In general, this has a cubic cost in dataset size and is sensitive to conditioning. We explore stochastic gradient algorithms as a computationally efficient method of approximately solving these linear systems: we develop low-varia… ▽ More Gaussian processes are a powerful framework for quantifying uncertainty and for sequential decision-making but are limited by the requirement of solving linear systems. In general, this has a cubic cost in dataset size and is sensitive to conditioning. We explore stochastic gradient algorithms as a computationally efficient method of approximately solving these linear systems: we develop low-variance optimization objectives for sampling from the posterior and extend these to inducing points. Counterintuitively, stochastic gradient descent often produces accurate predictions, even in cases where it does not converge quickly to the optimum. We explain this through a spectral characterization of the implicit bias from non-convergence. We show that stochastic gradient descent produces predictive distributions close to the true posterior both in regions with sufficient data coverage, and in regions sufficiently far away from the data. Experimentally, stochastic gradient descent achieves state-of-the-art performance on sufficiently large-scale or ill-conditioned regression tasks. Its uncertainty estimates match the performance of significantly more expensive baselines on a large-scale Bayesian optimization task. △ Less

Submitted 15 January, 2024; v1 submitted 20 June, 2023; originally announced June 2023.

Journal ref: Advances in Neural Information Processing Systems, 2023

arXiv:2306.05500 [pdf, other]

Word-Level Explanations for Analyzing Bias in Text-to-Image Models

Authors: Alexander Lin, Lucas Monteiro Paes, Sree Harsha Tanneru, Suraj Srinivas, Himabindu Lakkaraju

Abstract: Text-to-image models take a sentence (i.e., prompt) and generate images associated with this input prompt. These models have created award wining-art, videos, and even synthetic datasets. However, text-to-image (T2I) models can generate images that underrepresent minorities based on race and sex. This paper investigates which word in the input prompt is responsible for bias in generated images. We… ▽ More Text-to-image models take a sentence (i.e., prompt) and generate images associated with this input prompt. These models have created award wining-art, videos, and even synthetic datasets. However, text-to-image (T2I) models can generate images that underrepresent minorities based on race and sex. This paper investigates which word in the input prompt is responsible for bias in generated images. We introduce a method for computing scores for each word in the prompt; these scores represent its influence on biases in the model's output. Our method follows the principle of \emph{explaining by removing}, leveraging masked language models to calculate the influence scores. We perform experiments on Stable Diffusion to demonstrate that our method identifies the replication of societal stereotypes in generated images. △ Less

Submitted 3 June, 2023; originally announced June 2023.

Comments: 5 main pages, 3 pages in appendix, and 3 figures

arXiv:2306.03249 [pdf, other]

Probabilistic Unrolling: Scalable, Inverse-Free Maximum Likelihood Estimation for Latent Gaussian Models

Authors: Alexander Lin, Bahareh Tolooshams, Yves Atchadé, Demba Ba

Abstract: Latent Gaussian models have a rich history in statistics and machine learning, with applications ranging from factor analysis to compressed sensing to time series analysis. The classical method for maximizing the likelihood of these models is the expectation-maximization (EM) algorithm. For problems with high-dimensional latent variables and large datasets, EM scales poorly because it needs to inv… ▽ More Latent Gaussian models have a rich history in statistics and machine learning, with applications ranging from factor analysis to compressed sensing to time series analysis. The classical method for maximizing the likelihood of these models is the expectation-maximization (EM) algorithm. For problems with high-dimensional latent variables and large datasets, EM scales poorly because it needs to invert as many large covariance matrices as the number of data points. We introduce probabilistic unrolling, a method that combines Monte Carlo sampling with iterative linear solvers to circumvent matrix inversion. Our theoretical analyses reveal that unrolling and backpropagation through the iterations of the solver can accelerate gradient estimation for maximum likelihood estimation. In experiments on simulated and real data, we demonstrate that probabilistic unrolling learns latent Gaussian models up to an order of magnitude faster than gradient EM, with minimal losses in model performance. △ Less

Submitted 5 June, 2023; originally announced June 2023.

Comments: 29 pages, 4 figures

Journal ref: International Conference on Machine Learning, 2023

arXiv:2305.17110 [pdf, other]

IndustReal: Transferring Contact-Rich Assembly Tasks from Simulation to Reality

Authors: Bingjie Tang, Michael A. Lin, Iretiayo Akinola, Ankur Handa, Gaurav S. Sukhatme, Fabio Ramos, Dieter Fox, Yashraj Narang

Abstract: Robotic assembly is a longstanding challenge, requiring contact-rich interaction and high precision and accuracy. Many applications also require adaptivity to diverse parts, poses, and environments, as well as low cycle times. In other areas of robotics, simulation is a powerful tool to develop algorithms, generate datasets, and train agents. However, simulation has had a more limited impact on as… ▽ More Robotic assembly is a longstanding challenge, requiring contact-rich interaction and high precision and accuracy. Many applications also require adaptivity to diverse parts, poses, and environments, as well as low cycle times. In other areas of robotics, simulation is a powerful tool to develop algorithms, generate datasets, and train agents. However, simulation has had a more limited impact on assembly. We present IndustReal, a set of algorithms, systems, and tools that solve assembly tasks in simulation with reinforcement learning (RL) and successfully achieve policy transfer to the real world. Specifically, we propose 1) simulation-aware policy updates, 2) signed-distance-field rewards, and 3) sampling-based curricula for robotic RL agents. We use these algorithms to enable robots to solve contact-rich pick, place, and insertion tasks in simulation. We then propose 4) a policy-level action integrator to minimize error at policy deployment time. We build and demonstrate a real-world robotic assembly system that uses the trained policies and action integrator to achieve repeatable performance in the real world. Finally, we present hardware and software tools that allow other researchers to reproduce our system and results. For videos and additional details, please see http://sites.google.com/nvidia.com/industreal . △ Less

Submitted 26 May, 2023; originally announced May 2023.

Comments: Accepted to Robotics: Science and Systems (RSS) 2023

arXiv:2305.04926 [pdf, other]

RelPose++: Recovering 6D Poses from Sparse-view Observations

Authors: Amy Lin, Jason Y. Zhang, Deva Ramanan, Shubham Tulsiani

Abstract: We address the task of estimating 6D camera poses from sparse-view image sets (2-8 images). This task is a vital pre-processing stage for nearly all contemporary (neural) reconstruction algorithms but remains challenging given sparse views, especially for objects with visual symmetries and texture-less surfaces. We build on the recent RelPose framework which learns a network that infers distributi… ▽ More We address the task of estimating 6D camera poses from sparse-view image sets (2-8 images). This task is a vital pre-processing stage for nearly all contemporary (neural) reconstruction algorithms but remains challenging given sparse views, especially for objects with visual symmetries and texture-less surfaces. We build on the recent RelPose framework which learns a network that infers distributions over relative rotations over image pairs. We extend this approach in two key ways; first, we use attentional transformer layers to process multiple images jointly, since additional views of an object may resolve ambiguous symmetries in any given image pair (such as the handle of a mug that becomes visible in a third view). Second, we augment this network to also report camera translations by defining an appropriate coordinate system that decouples the ambiguity in rotation estimation from translation prediction. Our final system results in large improvements in 6D pose prediction over prior art on both seen and unseen object categories and also enables pose estimation and 3D reconstruction for in-the-wild objects. △ Less

Submitted 18 December, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

Comments: Project webpage: https://amyxlase.github.io/relpose-plus-plus (Accepted to 3DV 2024)

arXiv:2305.04451 [pdf, other]

doi 10.1145/3588432.3591568

FashionTex: Controllable Virtual Try-on with Text and Texture

Authors: Anran Lin, Nanxuan Zhao, Shuliang Ning, Yuda Qiu, Baoyuan Wang, Xiaoguang Han

Abstract: Virtual try-on attracts increasing research attention as a promising way for enhancing the user experience for online cloth shopping. Though existing methods can generate impressive results, users need to provide a well-designed reference image containing the target fashion clothes that often do not exist. To support user-friendly fashion customization in full-body portraits, we propose a multi-mo… ▽ More Virtual try-on attracts increasing research attention as a promising way for enhancing the user experience for online cloth shopping. Though existing methods can generate impressive results, users need to provide a well-designed reference image containing the target fashion clothes that often do not exist. To support user-friendly fashion customization in full-body portraits, we propose a multi-modal interactive setting by combining the advantages of both text and texture for multi-level fashion manipulation. With the carefully designed fashion editing module and loss functions, FashionTex framework can semantically control cloth types and local texture patterns without annotated pairwise training data. We further introduce an ID recovery module to maintain the identity of input portrait. Extensive experiments have demonstrated the effectiveness of our proposed pipeline. △ Less

Submitted 8 May, 2023; originally announced May 2023.

Comments: Accepted to SIGGRAPH 2023 (Conference Proceedings)

arXiv:2304.03981 [pdf, other]

Uncertainty-inspired Open Set Learning for Retinal Anomaly Identification

Authors: Meng Wang, Tian Lin, Lianyu Wang, Aidi Lin, Ke Zou, Xinxing Xu, Yi Zhou, Yuanyuan Peng, Qingquan Meng, Yiming Qian, Guoyao Deng, Zhiqun Wu, Junhong Chen, Jianhong Lin, Mingzhi Zhang, Weifang Zhu, Changqing Zhang, Daoqiang Zhang, Rick Siow Mong Goh, Yong Liu, Chi Pui Pang, Xinjian Chen, Haoyu Chen, Huazhu Fu

Abstract: Failure to recognize samples from the classes unseen during training is a major limitation of artificial intelligence in the real-world implementation for recognition and classification of retinal anomalies. We established an uncertainty-inspired open-set (UIOS) model, which was trained with fundus images of 9 retinal conditions. Besides assessing the probability of each category, UIOS also calcul… ▽ More Failure to recognize samples from the classes unseen during training is a major limitation of artificial intelligence in the real-world implementation for recognition and classification of retinal anomalies. We established an uncertainty-inspired open-set (UIOS) model, which was trained with fundus images of 9 retinal conditions. Besides assessing the probability of each category, UIOS also calculated an uncertainty score to express its confidence. Our UIOS model with thresholding strategy achieved an F1 score of 99.55%, 97.01% and 91.91% for the internal testing set, external target categories (TC)-JSIEC dataset and TC-unseen testing set, respectively, compared to the F1 score of 92.20%, 80.69% and 64.74% by the standard AI model. Furthermore, UIOS correctly predicted high uncertainty scores, which would prompt the need for a manual check in the datasets of non-target categories retinal diseases, low-quality fundus images, and non-fundus images. UIOS provides a robust method for real-world screening of retinal anomalies. △ Less

Submitted 29 August, 2023; v1 submitted 8 April, 2023; originally announced April 2023.

arXiv:2302.06656 [pdf, other]

Enhancing User Personalization in Conversational Recommenders

Authors: Allen Lin, Ziwei Zhu, Jianling Wang, James Caverlee

Abstract: Conversational recommenders are emerging as a powerful tool to personalize a user's recommendation experience. Through a back-and-forth dialogue, users can quickly hone in on just the right items. Many approaches to conversational recommendation, however, only partially explore the user preference space and make limiting assumptions about how user feedback can be best incorporated, resulting in lo… ▽ More Conversational recommenders are emerging as a powerful tool to personalize a user's recommendation experience. Through a back-and-forth dialogue, users can quickly hone in on just the right items. Many approaches to conversational recommendation, however, only partially explore the user preference space and make limiting assumptions about how user feedback can be best incorporated, resulting in long dialogues and poor recommendation performance. In this paper, we propose a novel conversational recommendation framework with two unique features: (i) a greedy NDCG attribute selector, to enhance user personalization in the interactive preference elicitation process by prioritizing attributes that most effectively represent the actual preference space of the user; and (ii) a user representation refiner, to effectively fuse together the user preferences collected from the interactive elicitation process to obtain a more personalized understanding of the user. Through extensive experiments on four frequently used datasets, we find the proposed framework not only outperforms all the state-of-the-art conversational recommenders (in terms of both recommendation performance and conversation efficiency), but also provides a more personalized experience for the user under the proposed multi-groundtruth multi-round conversational recommendation setting. △ Less

Submitted 13 February, 2023; originally announced February 2023.

Comments: To Appear On TheWebConf (WWW) 2023

arXiv:2211.10580 [pdf, other]

Normal Transformer: Extracting Surface Geometry from LiDAR Points Enhanced by Visual Semantics

Authors: Ancheng Lin, Jun Li

Abstract: High-quality estimation of surface normal can help reduce ambiguity in many geometry understanding problems, such as collision avoidance and occlusion inference. This paper presents a technique for estimating the normal from 3D point clouds and 2D colour images. We have developed a transformer neural network that learns to utilise the hybrid information of visual semantic and 3D geometric data, as… ▽ More High-quality estimation of surface normal can help reduce ambiguity in many geometry understanding problems, such as collision avoidance and occlusion inference. This paper presents a technique for estimating the normal from 3D point clouds and 2D colour images. We have developed a transformer neural network that learns to utilise the hybrid information of visual semantic and 3D geometric data, as well as effective learning strategies. Compared to existing methods, the information fusion of the proposed method is more effective, which is supported by experiments. We have also built a simulation environment of outdoor traffic scenes in a 3D rendering engine to obtain annotated data to train the normal estimator. The model trained on synthetic data is tested on the real scenes in the KITTI dataset. And subsequent tasks built upon the estimated normal directions in the KITTI dataset show that the proposed estimator has advantage over existing methods. △ Less

Submitted 6 July, 2023; v1 submitted 18 November, 2022; originally announced November 2022.

arXiv:2210.12387 [pdf, other]

Whisker-Inspired Tactile Sensing for Contact Localization on Robot Manipulators

Authors: Michael A. Lin, Emilio Reyes, Jeannette Bohg, Mark R. Cutkosky

Abstract: Perceiving the environment through touch is important for robots to reach in cluttered environments, but devising a way to sense without disturbing objects is challenging. This work presents the design and modelling of whisker-inspired sensors that attach to the surface of a robot manipulator to sense its surrounding through light contacts. We obtain a sensor model using a calibration process that… ▽ More Perceiving the environment through touch is important for robots to reach in cluttered environments, but devising a way to sense without disturbing objects is challenging. This work presents the design and modelling of whisker-inspired sensors that attach to the surface of a robot manipulator to sense its surrounding through light contacts. We obtain a sensor model using a calibration process that applies to straight and curved whiskers. We then propose a sensing algorithm using Bayesian filtering to localize contact points. The algorithm combines the accurate proprioceptive sensing of the robot and sensor readings from the deflections of the whiskers. Our results show that our algorithm is able to track contact points with sub-millimeter accuracy, outperforming a baseline method. Finally, we demonstrate our sensor and perception method in a real-world system where a robot moves in between free-standing objects and uses the whisker sensors to track contacts tracing object contours. △ Less

Submitted 22 October, 2022; originally announced October 2022.

Comments: 8 pages, 7 figures, conference

Showing 1–50 of 125 results for author: Lin, A