
Deep Learning for Camera Calibration and Beyond: A Survey

Kang Liao, Lang Nie, Shujuan Huang, Chunyu Lin, Jing Zhang, Yao Zhao, Moncef Gabbouj, Dacheng Tao

Kang Liao, Lang Nie, Shujuan Huang, Chunyu Lin (corresponding author), and Yao Zhao are with the Institute of Information Science, Beijing Jiaotong University (BJTU), Beijing 100044, China, and also with the Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China (e-mail: kang_liao@bjtu.edu.cn, nielang@bjtu.edu.cn, shujuanhuang@bjtu.edu.cn, cylin@bjtu.edu.cn, yzhao@bjtu.edu.cn).

Moncef Gabbouj is with the Department of Computing Sciences, Tampere University, 33101 Tampere, Finland (e-mail: moncef.gabbouj@tuni.fi).

Jing Zhang and Dacheng Tao are with the School of Computer Science, Faculty of Engineering, The University of Sydney, Australia (e-mail: jing.zhang1@sydney.edu.au; dacheng.tao@gmail.com).

Abstract

Camera calibration involves estimating camera parameters to infer geometric features from captured sequences, which is crucial for computer vision and robotics. However, conventional calibration is laborious and requires dedicated data collection. Recent efforts show that learning-based solutions have the potential to replace the repetitive work of manual calibration. Among these solutions, various learning strategies, networks, geometric priors, and datasets have been investigated. In this paper, we provide a comprehensive survey of learning-based camera calibration techniques and analyze their strengths and limitations. Our main calibration categories include the standard pinhole camera model, distortion camera model, cross-view model, and cross-sensor model, following the research trend and extended applications. As there is no benchmark in this community, we collect a holistic calibration dataset that can serve as a public platform to evaluate the generalization of existing methods. It comprises both synthetic and real-world data, with images and videos captured by different cameras in diverse scenes. Toward the end of this paper, we discuss the challenges and provide further research directions. To our knowledge, this is the first survey of learning-based camera calibration (spanning 8 years). The summarized methods, datasets, and benchmarks are available and will be regularly updated at https://github.com/KangLiao929/Awesome-Deep-Camera-Calibration.

Index Terms:
Camera calibration, Deep learning, Computational photography, Multiple view geometry, 3D vision, Robotics.

1 Introduction

Camera calibration is a fundamental and indispensable field in computer vision with a long research history [1, 2, 3, 4], tracing back around 60 years [5]. The first step for many vision and robotics tasks, ranging from computational photography and multi-view geometry to 3D reconstruction, is to calibrate the intrinsic (image sensor and distortion parameters) and/or extrinsic (rotation and translation) camera parameters. Depending on the task type, different techniques are used to calibrate the standard pinhole camera, fisheye lens camera, stereo camera, light field camera, event camera, LiDAR-camera system, etc. Figure 1 shows the popular calibration objectives, models, and extended applications in camera calibration.

Figure 1: Popular calibration objectives, models, and extended applications in camera calibration.
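As a concrete reminder of what the intrinsic and extrinsic parameters mentioned above do, the following minimal sketch projects a 3D point with an ideal, distortion-free pinhole camera. The function name `project_pinhole`, the single focal length shared by both image axes, and all numeric values are illustrative assumptions and are not taken from any method reviewed in this survey.

```python
def project_pinhole(point_world, rotation, translation, focal, principal_point):
    """Project a 3D world point to pixel coordinates (ideal pinhole, no distortion)."""
    # Extrinsics: move the point into the camera frame, X_cam = R * X_world + t.
    x_cam = [
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if x_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    # Intrinsics: perspective division, scale by the focal length (in pixels),
    # then shift by the principal point.
    u = focal * x_cam[0] / x_cam[2] + principal_point[0]
    v = focal * x_cam[1] / x_cam[2] + principal_point[1]
    return u, v


identity_rotation = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project_pinhole(
    point_world=[0.5, -0.25, 2.0],
    rotation=identity_rotation,
    translation=[0.0, 0.0, 0.0],
    focal=800.0,
    principal_point=(320.0, 240.0),
)
print(u, v)  # 520.0 140.0
```

Calibration is the inverse problem: recovering quantities such as `focal`, `rotation`, and `translation` from observations instead of assuming them.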

Traditional methods for camera calibration generally depend on hand-crafted features and model assumptions. These methods can be broadly divided into three categories. The most prevalent one uses a known calibration target (e.g., a checkerboard) that is deliberately moved in the 3D scene [6, 7, 8]. The camera captures the target from different viewpoints, and the checkerboard corners are detected to compute the camera parameters. However, such a procedure requires cumbersome manual interaction and cannot achieve automatic calibration “in the wild”. To pursue better flexibility, the second category, geometric-prior-based calibration, has been widely studied [9, 10, 11, 12]. Specifically, geometric structures such as lines and vanishing points are leveraged to model the 3D-2D correspondence in the scene. However, this type of method relies heavily on structured man-made scenes containing rich geometric priors, leading to poor performance in general environments. The third category is self-calibration [13, 14, 15]. Such a solution takes a sequence of images as input and estimates the camera parameters using multi-view geometry. The accuracy of self-calibration, however, is constrained by the limits of the feature detectors, which can be affected by diverse lighting conditions and textures. The above parametric models are based on the physical interpretation of camera geometry. While they are user-friendly, they tend to be tailored to specific camera models and might not offer optimal accuracy. In contrast, non-parametric models [16, 17, 18] link each image pixel to its 3D observation ray, eliminating the constraints of parametric models.

Since there are many standard techniques for calibrating cameras in industrial/laboratory settings [19, 20], this process is usually ignored in recent developments. However, calibrating from a single image in the wild remains challenging, especially when the images are collected from websites and captured by unknown camera models. This challenge motivates researchers to investigate a new paradigm.

Recently, deep learning has brought new inspirations to camera calibration and its applications. Learning-based methods achieve state-of-the-art performances on various tasks with higher efficiency. In particular, diverse deep neural networks (DNNs) have been developed, such as convolutional neural networks (CNNs), generative adversarial networks (GANs), PointNet, and vision transformers (ViTs), of which the high-level semantic features show more powerful representation capability compared with the hand-crafted features. Moreover, diverse learning strategies have been exploited to boost the geometric perception of neural networks. Learning-based methods offer a flexible and end-to-end camera calibration solution, without manual interventions or calibration targets, which sets them apart from traditional methods. Furthermore, some of these methods achieve camera model-free and label-free calibration, showing promising and meaningful applications.

With the rapid increase in the number of learning-based camera calibration methods, it has become increasingly challenging to keep up with new advances. Consequently, there is an urgent need to analyze existing works and foster a community dedicated to this field. Previous surveys, e.g., [21, 22, 23], focused only on a specific task/camera in camera calibration or on one type of approach. For instance, Salvi et al. [21] reviewed the traditional camera calibration methods in terms of the algorithms. Hughes et al. [22] provided a detailed review of calibrating fisheye cameras with traditional solutions. While Fan et al. [23] discussed both traditional and deep learning methods, their survey only considers calibrating wide-angle cameras. In addition, because only a small number of learning-based methods (around 10 papers) are reviewed in Fan et al. [23], it is difficult for readers to picture the development trend of general camera calibration.

In this paper, we provide a comprehensive and in-depth overview of recent advances in learning-based camera calibration, covering over 100 papers. We also discuss potential directions for further improvements and examine various types of cameras and targets. To facilitate future research on different topics, we categorize the current solutions according to calibration objectives and applications. In addition to fundamental parameters such as focal length, rotation, and translation, we also provide detailed reviews for correcting image distortion (radial distortion and rolling shutter distortion), estimating cross-view mapping, calibrating camera-LiDAR systems, and other applications. Such a trend follows the development of cameras and market demands for virtual reality, autonomous driving, neural rendering, etc.

To the best of our knowledge, this is the first survey of learning-based camera calibration and its extended applications. It makes the following unique contributions. (1) Our work mainly follows recent advances in deep learning-based camera calibration. In-depth analysis and discussion of various aspects are offered, including publications, network architectures, loss functions, datasets, evaluation metrics, learning strategies, implementation platforms, etc. Detailed information on each work is listed in Table I. (2) Besides the calibration algorithms, we comprehensively review the classical camera models and their extended models. In particular, we summarize the redesigned calibration objectives in deep learning, since some traditional calibration objectives have been shown to be hard for neural networks to learn. (3) We collect a dataset containing images and videos captured by different cameras in different environments, which can serve as a platform to evaluate the generalization of existing methods. (4) We discuss the open challenges of learning-based camera calibration and propose some future directions to provide guidance for further research in this field. (5) An open-source repository is created that provides a taxonomy of all reviewed works and benchmarks. The repository will be updated regularly at https://github.com/KangLiao929/Awesome-Deep-Camera-Calibration.

In the following sections, we discuss and analyze various aspects of learning-based camera calibration. The remainder of this paper is organized as follows. In Section 2, we present the learning paradigms and learning strategies of learning-based camera calibration. Subsequently, we introduce and discuss the specific methods based on the standard camera model, distortion model, cross-view model, and cross-sensor model in Sections 3, 4, 5, and 6, respectively (see Figure 2). The collected benchmark for calibration methods is described in Section 7. Finally, we conclude the survey and suggest future directions for this community in Section 8.

2 Preliminaries

Deep learning has brought new inspirations to camera calibration, enabling a fully automatic calibration procedure without manual intervention. Here, we first summarize two prevalent paradigms in learning-based camera calibration: regression-based calibration and reconstruction-based calibration. Then, we review the learning strategies widely used in this research field. The detailed definitions of classical camera models and their corresponding calibration objectives are provided in the supplementary material.

2.1 Learning Paradigm

Driven by different architectures of the neural network, the researchers have developed two main paradigms for learning-based camera calibration and its applications.

Regression-based Calibration. Given an uncalibrated input, regression-based calibration first extracts high-level semantic features using stacked convolutional layers. Then, fully connected layers aggregate the semantic features and form a vector of the estimated calibration objective. The regressed parameters are used to conduct subsequent tasks such as distortion rectification, image warping, camera localization, etc. This paradigm is the earliest and plays a dominant role in learning-based camera calibration and its applications. All the first works on the various objectives have been achieved with this paradigm, e.g., intrinsics: DeepFocal [24], extrinsics: PoseNet [25], radial distortion: Rong et al. [26], rolling shutter distortion: URS-CNN [27], homography matrix: DHN [28], hybrid parameters: Hold-Geoffroy et al. [29], and camera-LiDAR parameters: RegNet [30].
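The regression pipeline described above (convolutional features, aggregation, a fully connected layer that outputs a parameter vector, and a loss against parameter labels) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: one 3x3 filter, global average pooling as the aggregation step, hand-picked weights, and a hypothetical two-element parameter vector. It does not reproduce the architecture of any cited method.

```python
def conv3x3_relu(image, kernel):
    """Valid 3x3 cross-correlation followed by ReLU, on a list-of-lists image."""
    height, width = len(image), len(image[0])
    output = []
    for i in range(height - 2):
        row = []
        for j in range(width - 2):
            response = sum(
                kernel[a][b] * image[i + a][j + b] for a in range(3) for b in range(3)
            )
            row.append(max(response, 0.0))
        output.append(row)
    return output


def global_average(feature_map):
    values = [value for row in feature_map for value in row]
    return sum(values) / len(values)


def regress_parameters(image, kernels, fc_weights, fc_bias):
    """Feature extraction -> pooling -> fully connected layer -> parameter vector."""
    pooled = [global_average(conv3x3_relu(image, kernel)) for kernel in kernels]
    return [
        sum(w * f for w, f in zip(weights, pooled)) + bias
        for weights, bias in zip(fc_weights, fc_bias)
    ]


def l2_loss(prediction, label):
    return sum((p - t) ** 2 for p, t in zip(prediction, label)) / len(label)


# Toy input and hand-picked weights (assumptions for illustration only).
image = [[1.0] * 5 for _ in range(5)]
kernels = [[[1.0] * 3 for _ in range(3)]]
fc_weights = [[1.0], [0.5]]   # two hypothetical calibration parameters
fc_bias = [0.0, 1.0]

prediction = regress_parameters(image, kernels, fc_weights, fc_bias)
loss = l2_loss(prediction, label=[8.0, 5.0])
print(prediction, loss)  # [9.0, 5.5] 0.625
```

In a real system the kernels and fully connected weights would be learned by minimizing this loss over a training set; here they are fixed only to make the data flow visible.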

Reconstruction-based Calibration. On the other hand, the reconstruction-based calibration paradigm discards parameter regression and directly learns the pixel-level mapping function between the uncalibrated input and the target, inspired by conditional image-to-image translation [31] and dense visual perception [32, 33]. A pixel-wise loss is then computed between the reconstructed results and the ground truth. In this regard, most reconstruction-based calibration methods [34, 35, 36, 37] design their network architecture based on a fully convolutional network such as U-Net [38]. Specifically, an encoder-decoder network, with skip connections between the encoder and decoder features at the same spatial resolution, progressively extracts features from low-level to high-level and effectively integrates multi-scale features. At the last convolutional layer, the learned features are aggregated into the target channel, reconstructing the calibrated result at the pixel level.

In contrast to the regression-based paradigm, the reconstruction-based paradigm does not require labels of diverse camera parameters. Besides, the imbalance loss problem can be eliminated since it only optimizes the photometric loss of the calibrated results. Therefore, the reconstruction-based paradigm enables blind camera calibration without a strong camera model assumption.
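To make the contrast above concrete, the sketch below compares the two loss types. It assumes one common reading of the imbalance problem, namely that regressed parameters with very different numeric ranges contribute unequally to a joint loss; the parameter values (a focal length in pixels and a unitless distortion coefficient) and the function names are hypothetical.

```python
def squared_error_terms(prediction, target):
    """Per-parameter squared errors of a regression-based objective."""
    return [(p - t) ** 2 for p, t in zip(prediction, target)]


def photometric_l1(reconstructed, ground_truth):
    """Mean absolute pixel difference of a reconstruction-based objective."""
    differences = [
        abs(a - b)
        for row_a, row_b in zip(reconstructed, ground_truth)
        for a, b in zip(row_a, row_b)
    ]
    return sum(differences) / len(differences)


# Regression: hypothetical [focal length in pixels, distortion coefficient].
predicted_params = [510.0, -0.30]
target_params = [500.0, -0.20]
terms = squared_error_terms(predicted_params, target_params)
focal_share = terms[0] / sum(terms)
print(f"focal length share of the joint loss: {focal_share:.4f}")

# Reconstruction: every pixel lives in the same intensity range,
# so no per-parameter weighting is needed.
pixel_loss = photometric_l1([[0.0, 1.0]], [[0.5, 1.0]])
print(pixel_loss)  # 0.25
```

Under these toy numbers the focal length term accounts for almost the entire regression loss, which is why regression-based methods typically need normalization or per-parameter weights, whereas the pixel-wise loss has a single homogeneous scale.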

2.2 Learning Strategies

In the following, we review the learning-based camera calibration literature regarding different learning strategies.

Supervised Learning. Most learning-based camera calibration methods train their networks with the supervised learning strategy, from the classical methods [24, 25, 28, 39, 26, 40] to the state-of-the-art methods [41, 42, 43, 44, 45]. In terms of the learning paradigm, this strategy supervises the network with the ground truth of the camera parameters (regression-based paradigm) or paired data (reconstruction-based paradigm). In general, these methods synthesize the training dataset from other large-scale datasets via random parameter/transformation sampling and camera model simulation. Some recent works [46, 47, 48, 49] establish their training dataset using a real-world setup and label the captured images with manual annotations, thereby fostering advancements in this research domain.
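The synthesis procedure mentioned above (random parameter sampling followed by camera model simulation) can be illustrated with a small sketch. The one-parameter polynomial radial model, the sampling range for `k1`, and the use of a point grid instead of full images are assumptions made for brevity; individual methods use their own camera models and source datasets.

```python
import random


def distort_point(x, y, k1):
    """One-parameter polynomial radial model on normalized coordinates."""
    r2 = x * x + y * y
    scale = 1.0 + k1 * r2   # k1 < 0 pulls points toward the center (barrel distortion)
    return x * scale, y * scale


def synthesize_sample(points, rng, k1_range=(-0.4, 0.0)):
    """Sample a random distortion parameter and simulate the distorted observation."""
    k1 = rng.uniform(k1_range[0], k1_range[1])
    distorted = [distort_point(x, y, k1) for x, y in points]
    return distorted, k1    # (network input, regression label)


rng = random.Random(0)      # fixed seed so the toy dataset is reproducible
grid = [(x / 2.0, y / 2.0) for y in (-1, 0, 1) for x in (-1, 0, 1)]
dataset = [synthesize_sample(grid, rng) for _ in range(4)]

for distorted, k1 in dataset:
    print(f"k1={k1:+.3f}  corner -> ({distorted[0][0]:+.3f}, {distorted[0][1]:+.3f})")
```

Each pair in `dataset` plays the role of one labeled training example: the distorted observation is the input and the sampled `k1` is the ground truth for a regression-based network.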

Semi-Supervised Learning. Training the network using an annotated dataset under diverse scenarios is an effective learning strategy. However, human annotation can be prone to errors, leading to inconsistent annotation quality or the inclusion of contaminated data. Consequently, enlarging the training dataset to improve performance can be challenging due to the complexity and cost of constructing the dataset. To address this challenge, SS-WPC [50] proposes a semi-supervised method for correcting portraits captured by a wide-angle camera. It employs a surrogate task (segmentation) and a semi-supervised method that utilizes direction and range consistency and regression consistency to leverage both labeled and unlabeled data.

Weakly-Supervised Learning. Although significant progress has been made, data labeling for camera calibration is a notoriously costly process, and obtaining perfect ground-truth labels is challenging. As a result, it is often preferable to use weak supervision with machine learning methods. Weakly supervised learning refers to the process of building prediction models through learning with inadequate supervision. Zhu et al. [51] present a weakly supervised camera calibration method for single-view metrology in unconstrained environments, where there is only one accessible image of a scene composed of objects of uncertain sizes. This work leverages 2D object annotations from large-scale datasets, where people and buildings are frequently present and serve as useful “reference objects” for determining 3D size.

Figure 2: The structural and hierarchical taxonomy of camera calibration with deep learning. Some classical methods are listed under each category.

Unsupervised Learning. Unsupervised learning, commonly referred to as unsupervised machine learning, analyzes and groups unlabeled datasets using machine learning algorithms. UDHN [52] is the first work for a cross-view camera model using unsupervised learning, which estimates the homography matrix of an image pair without projection labels. By minimizing a pixel-wise intensity error that does not require ground truth data, UDHN [52] outperforms previous supervised learning techniques. While preserving superior accuracy and robustness to illumination changes, the proposed unsupervised algorithm can also achieve faster inference. Inspired by this work, an increasing number of methods leverage the unsupervised learning strategy to estimate the homography, such as CA-UDHN [53], BasesHomo [54], HomoGAN [55], and Liu et al. [56]. Besides, UnFishCor [57] removes the need for distortion parameter labels and designs an unsupervised framework for the wide-angle camera.
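The pixel-wise intensity error that drives unsupervised homography estimation can be sketched as follows. This is a simplified illustration, not the UDHN implementation: it uses nearest-neighbor sampling on tiny list-of-lists images, whereas trainable pipelines generally need a differentiable (e.g., bilinear) sampler, and the names `apply_homography` and `photometric_error` are hypothetical.

```python
def apply_homography(h, x, y):
    """Map a pixel (x, y) through a 3x3 homography given as nested lists."""
    xs = h[0][0] * x + h[0][1] * y + h[0][2]
    ys = h[1][0] * x + h[1][1] * y + h[1][2]
    ws = h[2][0] * x + h[2][1] * y + h[2][2]
    return xs / ws, ys / ws


def photometric_error(image_a, image_b, h_b_to_a):
    """Mean absolute intensity difference between image_b and image_a warped
    into the frame of image_b (inverse warping, nearest-neighbor sampling)."""
    total, count = 0.0, 0
    for y in range(len(image_b)):
        for x in range(len(image_b[0])):
            xs, ys = apply_homography(h_b_to_a, x, y)
            xi, yi = round(xs), round(ys)
            # Skip pixels whose source location falls outside image_a.
            if 0 <= xi < len(image_a[0]) and 0 <= yi < len(image_a):
                total += abs(image_a[yi][xi] - image_b[y][x])
                count += 1
    return total / count if count else float("inf")


# Toy pair: image_b is image_a shifted one pixel to the right.
image_a = [[10.0 * y + x for x in range(4)] for y in range(4)]
image_b = [[image_a[y][x - 1] if x >= 1 else 0.0 for x in range(4)] for y in range(4)]

true_shift = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

error_true = photometric_error(image_a, image_b, true_shift)
error_identity = photometric_error(image_a, image_b, identity)
print(error_true, error_identity)  # 0.0 4.5
```

The correct homography drives the error to zero without any homography label being provided, which is the signal an unsupervised network would be trained to minimize.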

Self-supervised Learning. The phrase “self-supervised learning” first appeared in robotics, where training data is automatically labeled by exploiting relationships between various input sensor signals. Compared to supervised learning, self-supervised learning leverages the input data itself as the supervision. Many self-supervised techniques have been presented to learn visual characteristics from massive amounts of unlabeled photos or videos without the need for time-consuming and expensive human annotations. SSR-Net [58] presents a self-supervised deep homography estimation network, which relaxes the need for ground truth annotations and leverages the invertibility constraints of homography. Specifically, SSR-Net [58] utilizes the homography matrix representation in place of the 4-point parameterization typically used by other approaches, in order to apply the invertibility constraints. SIR [59] devises a new self-supervised camera calibration pipeline for wide-angle image rectification, based on the principle that the corrected results of distorted images of the same scene taken with various lenses need to be the same. With self-supervised depth and pose learning as a proxy objective, Fang et al. [60] propose to self-calibrate a range of generic camera models from raw video, offering for the first time a calibration evaluation of camera model parameters learned solely via self-supervision.
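The invertibility constraint mentioned above states that the homography from image A to B composed with the one from B to A should be the identity. A minimal sketch of a loss built on this idea is given below; the squared-difference form, the normalization by the bottom-right entry, and the function names are assumptions, and the exact formulation used by SSR-Net is not reproduced here.

```python
def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def normalize(h):
    """Remove the projective scale ambiguity by fixing h[2][2] to 1."""
    return [[value / h[2][2] for value in row] for row in h]


def invertibility_loss(h_ab, h_ba):
    """Sum of squared deviations of H_ab * H_ba from the identity."""
    product = normalize(matmul3(h_ab, h_ba))
    return sum(
        (product[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(3)
        for j in range(3)
    )


# A translation by (2, -1) and its exact inverse.
h_ab = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.0, 0.0, 1.0]]
h_ba = [[1.0, 0.0, -2.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

consistent = invertibility_loss(h_ab, h_ba)
inconsistent = invertibility_loss(h_ab, identity)
print(consistent, inconsistent)  # 0.0 5.0
```

The constraint is naturally expressed on the 3x3 matrix, which is consistent with the remark above that the matrix representation is used instead of the 4-point parameterization.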

Reinforcement Learning. Rather than optimizing each step in isolation, reinforcement learning maximizes the cumulative reward of a learning process as a whole. To date, DQN-RecNet [61] is the first and only work in camera calibration using reinforcement learning. It applies a deep reinforcement learning technique to tackle fisheye image rectification as a single Markov Decision Process, which is a multi-step gradual calibration scheme. In this setting, the current fisheye image represents the state of the environment. The agent, a Deep Q-Network [62], generates an action that should be executed to correct the distorted image.
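The multi-step, gradual correction scheme can be illustrated with a toy Markov Decision Process. The sketch below is a deliberately simplified stand-in and not DQN-RecNet: the state is a discrete residual distortion level rather than a fisheye image, the agent is a tabular Q-learner rather than a Deep Q-Network, and the reward, action set, and hyperparameters are all assumptions chosen for illustration.

```python
import random

ACTIONS = (-1, +1)   # reduce or increase the residual distortion by one level
MAX_LEVEL = 4        # hypothetical number of discrete distortion levels


def step(level, action):
    """Apply one correction step; the episode ends when the level reaches 0."""
    next_level = min(max(level + action, 0), MAX_LEVEL)
    reward = float(level - next_level)   # positive when distortion decreases
    return next_level, reward, next_level == 0


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(MAX_LEVEL + 1) for a in ACTIONS}
    for _ in range(episodes):
        level = rng.randint(1, MAX_LEVEL)
        for _ in range(50):               # cap on steps per episode
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(level, a)])
            next_level, reward, done = step(level, action)
            best_next = 0.0 if done else max(q[(next_level, a)] for a in ACTIONS)
            q[(level, action)] += alpha * (reward + gamma * best_next - q[(level, action)])
            level = next_level
            if done:
                break
    return q


def greedy_rollout(q, level):
    """Follow the learned greedy policy and record the visited levels."""
    trajectory = [level]
    while level != 0 and len(trajectory) <= 10:
        action = max(ACTIONS, key=lambda a: q[(level, a)])
        level, _, _ = step(level, action)
        trajectory.append(level)
    return trajectory


q_table = train()
print(greedy_rollout(q_table, MAX_LEVEL))
```

With these settings the learned greedy policy is expected to lower the distortion level one step at a time until it reaches zero, mirroring the idea of rectifying an image through a sequence of small actions instead of a single prediction.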

In the following, we will review the specific methods and literature for learning-based camera calibration. The structural and hierarchical taxonomy is shown in Figure 2.

TABLE I: Details of the learning-based camera calibration and its extended applications from 2015 to 2022, including the method abbreviation, publication, calibration objective, network architecture, loss function, dataset, evaluation metrics, learning strategy, platform, and simulation or not (training data). For the learning strategies, SL, USL, WSL, Semi-SL, SSL, and RL denote supervised learning, unsupervised learning, weakly-supervised learning, semi-supervised learning, self-supervised learning, and reinforcement learning, respectively.
| Year | Method | Publication | Objective | Network | Loss Function | Dataset | Evaluation | Learning | Platform | Simulation |
|---|---|---|---|---|---|---|---|---|---|---|
| 2015 | DeepFocal [24] | ICIP | Standard | AlexNet | $\mathcal{L}_{2}$ loss | 1DSfM [63] | Accuracy | SL | Caffe | |
| 2015 | PoseNet [25] | ICCV | Standard | GoogLeNet | $\mathcal{L}_{2}$ loss | Cambridge Landmarks [64] | Accuracy | SL | Caffe | |
| 2016 | DeepHorizon [65] | BMVC | Standard | GoogLeNet | Huber loss | HLW [66] | Accuracy | SL | Caffe | |
| 2016 | DeepVP [39] | CVPR | Standard | AlexNet | Logistic loss | YUD [67], ECD [68], HLW [66] | Accuracy | SL | Caffe | |
| 2016 | Rong et al. [26] | ACCV | Distortion | AlexNet | Softmax loss | ImageNet [69] | Line length | SL | Caffe | |
| 2016 | DHN [28] | RSSW | Cross-View | VGG | $\mathcal{L}_{2}$ loss | MS-COCO [70] | MSE | SL | Caffe | |
| 2017 | CLKN [71] | CVPR | Cross-View | CNNs | Hinge loss | MS-COCO [70] | MSE | SL | Torch | |
| 2017 | HierarchicalNet [72] | ICCVW | Cross-View | VGG | $\mathcal{L}_{2}$ loss | MS-COCO [70] | MSE | SL | TensorFlow | |
| 2017 | URS-CNN [27] | CVPR | Distortion | CNNs | $\mathcal{L}_{2}$ loss | Sun [73], Oxford [74], Zubud [75], LFW [76] | PSNR, RMSE | SL | Torch | |
| 2017 | RegNet [30] | IV | Cross-Sensor | CNNs | $\mathcal{L}_{2}$ loss | KITTI [77] | MAE | SL | Caffe | |
| 2018 | Hold-Geoffroy et al. [29] | CVPR | Standard | DenseNet | Entropy loss | SUN360 [78] | Human sensitivity | SL | - | |
| 2018 | DeepCalib [40] | CVMP | Distortion | Inception-V3 | Logcosh loss | SUN360 [78] | Mean error | SL | TensorFlow | |
| 2018 | FishEyeRecNet [79] | ECCV | Distortion | VGG | $\mathcal{L}_{2}$ loss | ADE20K [80] | PSNR, SSIM | SL | Caffe | |
| 2018 | Shi et al. [81] | ICPR | Distortion | ResNet | $\mathcal{L}_{2}$ loss | ImageNet [69] | MSE | SL | PyTorch | |
| 2018 | DeepFM [82] | ECCV | Cross-View | ResNet | $\mathcal{L}_{2}$ loss | T&T [83], KITTI [77], 1DSfM [63] | F-score, Mean | SL | PyTorch | |
| 2018 | Poursaeed et al. [84] | ECCVW | Cross-View | CNNs | $\mathcal{L}_{1}$, $\mathcal{L}_{2}$ loss | KITTI [77] | EPI-ABS, EPI-SQR | SL | - | |
| 2018 | UDHN [52] | RAL | Cross-View | VGG | $\mathcal{L}_{1}$ loss | MS-COCO [70] | RMSE | USL | TensorFlow | |
| 2018 | PFNet [85] | ACCV | Cross-View | FCN | Smooth $\mathcal{L}_{1}$ loss | MS-COCO [70] | MAE | SL | TensorFlow | |
| 2018 | CalibNet [86] | IROS | Cross-Sensor | ResNet | Point cloud distance, $\mathcal{L}_{2}$ loss | KITTI [77] | Geodesic distance, MAE | SL | TensorFlow | |
| 2018 | Chang et al. [87] | ICRA | Standard | AlexNet | Cross-entropy loss | DeepVP-1M [87] | MSE, Accuracy | SL | Matconvnet | |
| 2019 | Lopez et al. [88] | CVPR | Distortion | DenseNet | Bearing loss | SUN360 [78] | MSE | SL | PyTorch | |
| 2019 | UprightNet [89] | ICCV | Standard | U-Net | Geometry loss | InteriorNet [90], ScanNet [91], SUN360 [78] | Mean error | SL | PyTorch | |
| 2019 | Zhuang et al. [92] | IROS | Distortion | ResNet | $\mathcal{L}_{1}$ loss | KITTI [77] | Mean error, RMSE | SL | PyTorch | |
| 2019 | SSR-Net [58] | PRL | Cross-View | ResNet | $\mathcal{L}_{2}$ loss | MS-COCO [70] | MAE | SSL | PyTorch | |
| 2019 | Abbas et al. [93] | ICCVW | Cross-View | CNNs | Softmax loss | CARLA [94] | AUC [95], Mean error | SL | TensorFlow | |
| 2019 | DR-GAN [34] | TCSVT | Distortion | GANs | Perceptual loss | MS-COCO [70] | PSNR, SSIM | SL | TensorFlow | |
| 2019 | STD [96] | TCSVT | Distortion | GANs+CNNs | Perceptual loss | MS-COCO [70] | PSNR, SSIM | SL | TensorFlow | |
| 2019 | Deep360Up [97] | VR | Standard | DenseNet | Log-cosh loss [98] | SUN360 [78] | Mean error | SL | - | |
| 2019 | UnFishCor [57] | JVCIR | Distortion | VGG | $\mathcal{L}_{1}$ loss | Places2 [99] | PSNR, SSIM | USL | TensorFlow | |
| 2019 | BlindCor [37] | CVPR | Distortion | U-Net | $\mathcal{L}_{2}$ loss | Places2 [99] | MSE | SL | PyTorch | |
| 2019 | RSC-Net [100] | CVPR | Distortion | ResNet | $\mathcal{L}_{1}$ loss | KITTI [77] | Mean error | SL | PyTorch | |
| 2019 | Xue et al. [101] | CVPR | Distortion | ResNet | $\mathcal{L}_{2}$ loss | Wireframes [102], SUNCG [103] | PSNR, SSIM, RPE | SL | PyTorch | |
| 2019 | Zhao et al. [46] | ICCV | Distortion | VGG+U-Net | $\mathcal{L}_{1}$ loss | Self-constructed+BU-4DFE [104] | Mean error | SL | - | |
| 2019 | NeurVPS [105] | NeurIPS | Standard | CNNs | Binary cross entropy, chamfer-$\mathcal{L}_{2}$ loss | ScanNet [91], SU3 [106] | Angle accuracy | SL | PyTorch | |
| 2020 | Sha et al. [107] | CVPR | Cross-View | U-Net | Cross-entropy loss | World Cup 2014 [108] | IoU | SL | TensorFlow | |
| 2020 | Lee et al. [109] | ECCV | Standard | PointNet + CNNs | Cross-entropy loss | Google Street View [110], HLW [66] | Mean error, AUC [95] | SL | - | |
| 2020 | MisCaliDet [111] | ICRA | Distortion | CNNs | $\mathcal{L}_{2}$ loss | KITTI [77] | MSE | SL | TensorFlow | |
| 2020 | DeepPTZ [112] | WACV | Distortion | Inception-V3 | $\mathcal{L}_{1}$ loss | SUN360 [78] | Mean error | SL | PyTorch | |
| 2020 | MHN [113] | CVPR | Cross-View | VGG | Cross-entropy loss | MS-COCO [70], Self-constructed | MAE | SL | TensorFlow | |
| 2020 | Davidson et al. [114] | ECCV | Standard | FCN | Dice loss | SUN360 [78] | Accuracy | SL | - | |
| 2020 | CA-UDHN [53] | ECCV | Cross-View | FCN + ResNet | Triplet loss | Self-constructed | MSE | USL | PyTorch | |
| 2020 | DeepFEPE [115] | IROS | Standard | VGG + PointNet | $\mathcal{L}_{2}$ loss | KITTI [77], ApolloScape [116] | Mean error | SL | PyTorch | |
| 2020 | DDM [35] | TIP | Distortion | GANs | $\mathcal{L}_{1}$ loss | MS-COCO [70] | PSNR, SSIM | SL | TensorFlow | |
| 2020 | Li et al. [117] | TIP | Distortion | CNNs | Cross-entropy, $\mathcal{L}_{1}$ loss | CelebA [118] | Cosine distance | SL | - | |
| 2020 | PSE-GAN [119] | ICPR | Distortion | GANs | $\mathcal{L}_{1}$, WGAN loss | Places2 [99] | MSE | SL | - | |
| 2020 | RDC-Net [120] | ICIP | Distortion | ResNet | $\mathcal{L}_{1}$, $\mathcal{L}_{2}$ loss | ImageNet [69] | PSNR, SSIM | SL | PyTorch | |
| 2020 | FE-GAN [121] | ICASSP | Distortion | GANs | $\mathcal{L}_{1}$, GAN loss | Wireframe [102], LSUN [122] | PSNR, SSIM, RMSE | SSL | PyTorch | |
| 2020 | RDCFace [123] | CVPR | Distortion | ResNet | Cross-entropy, $\mathcal{L}_{2}$ loss | IMDB-Face [124] | Accuracy | SL | - | |
| 2020 | LaRecNet [125] | arXiv | Distortion | ResNet | $\mathcal{L}_{2}$ loss | Wireframes [102], SUNCG [103] | PSNR, SSIM, RPE | SL | PyTorch | |
| 2020 | Baradad et al. [126] | CVPR | Standard | CNNs | $\mathcal{L}_{2}$ loss | ScanNet [91], NYU [127], SUN360 [78] | Mean error, RMS | SL | PyTorch | |
| 2020 | Zheng et al. [128] | CVPR | Standard | CNNs | $\mathcal{L}_{1}$ loss | FocaLens [129] | Mean error, PSNR, SSIM | SL | - | |
| 2020 | Zhu et al. [51] | ECCV | Standard | CNNs + PointNet | $\mathcal{L}_{1}$ loss | SUN360 [78], MS-COCO [70] | Mean error, Accuracy | WSL | PyTorch | |
| 2020 | DeepUnrollNet [49] | CVPR | Distortion | FCN | $\mathcal{L}_{1}$, perceptual, total variation loss | Carla-RS [49], Fastec-RS [49] | PSNR, SSIM | SL | PyTorch | |
| 2020 | RGGNet [130] | RAL | Cross-Sensor | ResNet | Geodesic distance loss | KITTI [77] | MSE, MSEE, MRR | SL | TensorFlow | |
| 2020 | CalibRCNN [131] | IROS | Cross-Sensor | RNNs | $\mathcal{L}_{2}$, Epipolar geometry loss | KITTI [77] | MAE | SL | TensorFlow | |
| 2020 | SSI-Calib [132] | ICRA | Cross-Sensor | CNNs | $\mathcal{L}_{2}$ loss | Pascal VOC 2012 [133] | Mean/standard deviation | SL | TensorFlow | |
| 2020 | SOIC [134] | arXiv | Cross-Sensor | ResNet + PointRCNN | Cost function | KITTI [77] | Mean error | SL | - | |
| 2020 | NetCalib [135] | ICPR | Cross-Sensor | CNNs | $\mathcal{L}_{1}$ loss | KITTI [77] | MAE | SL | PyTorch | |
| 2020 | SRHEN [136] | ACM-MM | Cross-View | CNNs | $\mathcal{L}_{2}$ loss | MS-COCO [70], SUN397 [78] | MACE | SL | - | |
| 2021 | StereoCaliNet [137] | TCI | Standard | U-Net | $\mathcal{L}_{1}$ loss | TAUAgent [138], KITTI [77] | Mean error | SL | PyTorch | |
| 2021 | CTRL-C [139] | ICCV | Standard | Transformer | Cross-entropy, $\mathcal{L}_{1}$ loss | Google Street View [110], SUN360 [78] | Mean error, AUC [95] | SL | PyTorch | |
| 2021 | Wakai et al. [140] | ICCVW | Distortion | DenseNet | Smooth $\mathcal{L}_{1}$ loss | StreetLearn [141] | Mean error, PSNR, SSIM | SL | - | |
| 2021 | OrdinalDistortion [142] | TIP | Distortion | CNNs | Smooth $\mathcal{L}_{1}$ loss | MS-COCO [70] | PSNR, SSIM, MDLD | SL | TensorFlow | |
| 2021 | PolarRecNet [143] | TCSVT | Distortion | VGG + U-Net | $\mathcal{L}_{1}$, $\mathcal{L}_{2}$ loss | MS-COCO [70], LMS [144] | PSNR, SSIM, MSE | SL | PyTorch | |
| 2021 | DQN-RecNet [61] | PRL | Distortion | VGG | $\mathcal{L}_{2}$ loss | Wireframes [102] | PSNR, SSIM, MSE | RL | PyTorch | |
| 2021 | Tan et al. [47] | CVPR | Distortion | U-Net | $\mathcal{L}_{2}$ loss | Self-constructed | Accuracy | SL | PyTorch | |
| 2021 | PCN [145] | CVPR | Distortion | U-Net | $\mathcal{L}_{1}$, $\mathcal{L}_{2}$, GAN loss | Places2 [99] | PSNR, SSIM, FID, CW-SSIM | SL | PyTorch | |
| 2021 | DaRecNet [36] | ICCV | Distortion | U-Net | Smooth $\mathcal{L}_{1}$, $\mathcal{L}_{2}$ loss | ADE20K [80] | PSNR, SSIM | SL | PyTorch | |
| 2021 | DLKFM [146] | CVPR | Cross-View | Siamese-Net | $\mathcal{L}_{2}$ loss | MS-COCO [70], Google Earth, Google Map | MSE | SL | TensorFlow | |
| 2021 | LocalTrans [147] | ICCV | Cross-View | Transformer | $\mathcal{L}_{1}$ loss | MS-COCO [70] | MSE, PSNR, SSIM | SL | PyTorch | |
| 2021 | BasesHomo [54] | ICCV | Cross-View | ResNet | Triplet loss | CA-UDHN [53] | MSE | USL | PyTorch | |
| 2021 | ShuffleHomoNet [148] | ICIP | Cross-View | ShuffleNet | $\mathcal{L}_{2}$ loss | MS-COCO [70] | RMSE | SL | TensorFlow | |
| 2021 | DAMG-Homo [44] | TCSVT | Cross-View | CNNs | $\mathcal{L}_{1}$ loss | MS-COCO [70], UDIS [149] | RMSE, PSNR, SSIM | SL | TensorFlow | |
| 2021 | SA-MobileNet [150] | BMVC | Standard | MobileNet | Cross-entropy loss | SUN360 [78], ADE20K [80], NYU [127] | MAE, Accuracy | SL | TensorFlow | |
| 2021 | SPEC [48] | ICCV | Standard | ResNet | Softargmax-$\mathcal{L}_{2}$ loss | Self-constructed | W-MPJPE, PA-MPJPE | SL | PyTorch | |
| 2021 | DirectionNet [151] | CVPR | Standard | U-Net | Cosine similarity loss | InteriorNet [90], Matterport3D [152] | Mean and median error | SL | TensorFlow | |
| 2021 | JCD [153] | CVPR | Distortion | FCN | Charbonnier [154], perceptual loss | BS-RSCD [153], Fastec-RS [49] | PSNR, SSIM, LPIPS | SL | PyTorch | |
| 2021 | LCCNet [155] | CVPRW | Cross-Sensor | CNNs | Smooth $\mathcal{L}_{1}$, $\mathcal{L}_{2}$ loss | KITTI [77] | MSE | SL | PyTorch | |
| 2021 | CFNet [156] | Sensors | Cross-Sensor | FCN | $\mathcal{L}_{1}$, Charbonnier [154] loss | KITTI [77], KITTI-360 [157] | MAE, MSEE, MRR | SL | PyTorch | |
| 2021 | Fan et al. [158] | ICCV | Distortion | U-Net | $\mathcal{L}_{1}$, perceptual loss | Carla-RS [49], Fastec-RS [49] | PSNR, SSIM, LPIPS | SL | PyTorch | |
| 2021 | SUNet [159] | ICCV | Distortion | DenseNet + ResNet | $\mathcal{L}_{1}$, perceptual loss | Carla-RS [49], Fastec-RS [49] | PSNR, SSIM | SL | PyTorch | |
| 2021 | SemAlign [160] | IROS | Cross-Sensor | CNNs | Semantic alignment loss | KITTI [77] | Mean/median rotation errors | SL | PyTorch | |
| 2022 | DVPD [41] | CVPR | Standard | CNNs | Cross-entropy loss | SU3 [106], ScanNet [91], YUD [67], NYU [127] | Accuracy, AUC [95] | SL | PyTorch | |
| 2022 | Fang et al. [60] | ICRA | Standard | CNNs | $\mathcal{L}_{2}$ loss | KITTI [77], EuRoC [161], OmniCam [162] | MRE, RMSE | SSL | PyTorch | |
| 2022 | CPL [163] | ICASSP | Standard | Inception-V3 | $\mathcal{L}_{1}$ loss | CARLA [94], CyclistDetection [164] | MAE | SL | TensorFlow | |
| 2022 | IHN [165] | CVPR | Cross-View | Siamese-Net | $\mathcal{L}_{1}$ loss | MS-COCO [70], Google Earth, Google Map | MACE | SL | PyTorch | |
| 2022 | HomoGAN [55] | CVPR | Cross-View | GANs | Cross-entropy, WGAN loss | CA-UDHN [53] | Mean error | USL | PyTorch | |
| 2022 | SS-WPC [50] | CVPR | Distortion | Transformer | Cross-entropy, $\mathcal{L}_{1}$ loss | Tan et al. [47] | Accuracy | Semi-SL | PyTorch | |
| 2022 | AW-RSC [166] | CVPR | Distortion | CNNs | Charbonnier [154], perceptual loss | Self-constructed, Fastec-RS [49] | PSNR, SSIM | SL | PyTorch | |
| 2022 | EvUnroll [42] | CVPR | Distortion | U-Net | Charbonnier, perceptual, TV loss | Self-constructed, Fastec-RS [49] | PSNR, SSIM, LPIPS | SL | PyTorch | |
| 2022 | Do et al. [167] | CVPR | Standard | ResNet | $\mathcal{L}_{2}$, Robust angular [168] loss | Self-constructed, 7-SCENES [169] | Median error, Recall | SL | PyTorch | |
| 2022 | DiffPoseNet [170] | CVPR | Standard | CNNs + LSTM | $\mathcal{L}_{2}$ loss | TartanAir [171], KITTI [77], TUM-RGBD [172] | PEE, AEE [173] | SSL | PyTorch | |
| 2022 | SceneSqueezer [174] | CVPR | Standard | Transformer | $\mathcal{L}_{1}$ loss | RobotCar Seasons [175], Cambridge Landmarks [64] | Mean error, Recall [173] | SL | PyTorch | |
FocalPose [176] CVPR Standard CNNs 1subscript1\mathcal{L}_{1}caligraphic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, Huber loss Pix3D[177], CompCars[178], StanfordCars[178] Median error, Accuracy SL PyTorch
DXQ-Net [179] arXiv Cross-Sensor CNNs + RNNs 1subscript1\mathcal{L}_{1}caligraphic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, geodesic loss KITTI[77], KITTI-360[157] MSE SL PyTorch
SST-Calib [45] ITSC Cross-Sensor CNNs 2subscript2\mathcal{L}_{2}caligraphic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT loss KITTI[77] QAD, AEAD SL PyTorch
CCS-Net [180] IROS Distortion U-Net 1subscript1\mathcal{L}_{1}caligraphic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT loss TUM-RGBD[172] MAE, RPE SL PyTorch
FishFormer [43] arXiv Distortion Transformer 2subscript2\mathcal{L}_{2}caligraphic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT loss Place2[99], CelebA[118] PSNR, SSIM, FID SL PyTorch
SIR [59] TIP Distortion ResNet 1subscript1\mathcal{L}_{1}caligraphic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT loss ADE20K[80], WireFrames[102], MS-COCO[70] PSNR, SSIM SSL PyTorch
ATOP [181] TIV Cross-Sensor CNNs Cross entropy loss Self-constructed + KITTI[77] RRE, RTE SL -
FusionNet [182] ICRA Cross-Sensor CNNs+PointNet 2subscript2\mathcal{L}_{2}caligraphic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT loss KITTI[77] MAE SL PyTorch
RGKCNet [183] TIM Cross-Sensor CNNs+PointNet 1subscript1\mathcal{L}_{1}caligraphic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT loss KITTI[77] MSE SL PyTorch
GenCaliNet [184] ECCV Distortion DenseNet 2subscript2\mathcal{L}_{2}caligraphic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT loss StreetLearn[141], SP360[185] MAE, PSNR, SSIM SL -
Liu et al. [56] TPAMI Cross-View ResNet Triplet loss Self-constructed MSE, Accuracy USL PyTorch

3 Standard Model

In learning-based calibration, intrinsic calibration typically targets the focal length and optical center, while extrinsic calibration targets the rotation matrix and translation vector.

3.1 Intrinsics Calibration

Deepfocal [24] is a pioneering work in learning-based camera calibration that aims to estimate the focal length of any image “in the wild”. Specifically, Deepfocal assumed a simple pinhole camera model and regressed the horizontal field of view using a deep convolutional neural network. Given the width $w$ of an image, the relationship between the horizontal field of view $H_{\theta}$ and the focal length $f$ can be described by:

$$H_{\theta} = 2\arctan\left(\frac{w}{2f}\right). \tag{1}$$
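As a small worked example of Eq. (1), the sketch below converts between the horizontal field of view and the focal length. It assumes that $f$ is expressed in pixels (the same unit as the image width $w$) and that angles are in radians; the function names are illustrative and do not come from Deepfocal.

```python
import math

def focal_to_fov(width_px: float, focal_px: float) -> float:
    """Eq. (1): H_theta = 2 * arctan(w / (2 f)), returned in radians."""
    return 2.0 * math.atan(width_px / (2.0 * focal_px))

def fov_to_focal(width_px: float, hfov_rad: float) -> float:
    """Inverse of Eq. (1): f = w / (2 * tan(H_theta / 2)), in pixels."""
    return width_px / (2.0 * math.tan(hfov_rad / 2.0))

# A 640-pixel-wide image with a 90-degree horizontal field of view
# corresponds to a focal length of about 320 pixels.
f = fov_to_focal(640, math.radians(90.0))
print(round(f, 6), round(math.degrees(focal_to_fov(640, f)), 6))
```

Because the mapping is one-to-one for a fixed image width, a network that regresses the field of view implicitly predicts the focal length as well.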

Due to component wear, temperature fluctuations, or external disturbances such as collisions, the calibrated parameters of a camera are susceptible to change over time. To this end, MisCaliDet [111] proposed to identify whether a camera needs to be intrinsically recalibrated. Rather than estimating conventional intrinsic parameters such as the focal length and image center, MisCaliDet introduced a new scalar metric, the average pixel position difference (APPD), to measure the degree of camera miscalibration; it describes the mean value of the pixel position differences over the entire image.

3.2 Extrinsics Calibration

In contrast to intrinsic calibration, extrinsic calibration infers the spatial relationship between the camera and the 3D scene in which it is located. PoseNet [25] was the first to use deep convolutional neural networks to regress the 6-DoF camera pose in real time. PoseNet predicted a pose vector $\mathbf{p}$ composed of the 3D position $\mathbf{x}$ and the orientation of the camera represented by a quaternion $\mathbf{q}$, namely, $\mathbf{p}=[\mathbf{x},\mathbf{q}]$. To construct the training dataset, the labels were automatically calculated from a video of the scene using a structure-from-motion method [186].
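To make the pose vector concrete, the following sketch unpacks a 7-dimensional vector into a 4x4 rigid transform. The quaternion ordering (w, x, y, z) and the camera-to-world convention are assumptions made for illustration; the text above does not specify them, and the helper names are hypothetical.

```python
import numpy as np

def quat_to_rotmat(q):
    """Rotation matrix from a quaternion (w, x, y, z); q is normalized first."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def pose_vector_to_matrix(p):
    """Split p = [x, q] (3 + 4 values) into an assumed camera-to-world transform."""
    p = np.asarray(p, dtype=float)
    T = np.eye(4)
    T[:3, :3] = quat_to_rotmat(p[3:7])  # orientation q
    T[:3, 3] = p[:3]                    # position x
    return T

# Camera at (1, 2, 3), rotated by 90 degrees about the z-axis.
half = np.pi / 4
T = pose_vector_to_matrix([1, 2, 3, np.cos(half), 0, 0, np.sin(half)])
print(np.round(T, 3))
```

Normalizing the quaternion before conversion matters in practice, since a regressed 4-vector is generally not of unit length.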

Inspired by PoseNet [25], subsequent works improved extrinsic calibration in terms of the intermediate representation, interpretability, data format, learning objective, etc. For example, to optimize the geometric pose objective, DeepFEPE [115] designed an end-to-end keypoint-based framework with learnable modules for detection, feature extraction, matching, and outlier rejection. Such a pipeline imitated the traditional baseline, so the final performance can be analyzed and improved through the intermediate differentiable modules. To bridge the domain gap between the extrinsic objective and image features, recent works proposed to first learn an intermediate representation from the input, such as surface geometry [89], depth map [137], directional probability distribution [151], and normal flow [170]. The extrinsics are then inferred from geometric constraints and the learned representation. In this way, the neural networks are gradually guided to perceive the geometry-related features that are crucial for extrinsic estimation. Considering privacy concerns and limited storage, some recent works compressed the scene and exploited point-like features to estimate the extrinsics. For example, Do et al. [167] trained a network to recognize sparse but significant 3D points, dubbed scene landmarks, by encoding their appearance as implicit features. The camera pose can then be calculated using a robust minimal solver followed by a Levenberg-Marquardt-based nonlinear refinement. SceneSqueezer [174] compressed the scene information at three levels: the database frames are clustered using pairwise co-visibility information, a point selection module prunes each cluster based on estimation performance, and learned quantization further compresses the selected points.

3.3 Joint Intrinsic and Extrinsic Calibration

3.3.1 Geometric Representations

Vanishing Points The projections of a set of parallel lines in the world intersect at a vanishing point. The detection of vanishing points is a fundamental and crucial challenge in 3D vision. In general, vanishing points reveal the direction of 3D lines, allowing the agent to deduce 3D scene information from a single 2D image.

DeepVP [39] is the first learning-based work for detecting vanishing points from a single image. It reversed the conventional process by scoring the horizon line candidates according to the vanishing points they contain. Chang et al. [87] reformulated this task as a CNN classification problem using an output layer with 225 discrete possible vanishing point locations. To construct the dataset, the camera view is panned and tilted in steps of 5° from -35° to 35° in the panorama scene from a single GPS location, giving 15 × 15 = 225 images in total. To directly leverage the geometric properties of vanishing points, NeurVPS [105] proposed a canonical conic space and a conic convolution operator that can be implemented as regular convolutions in this space, so that the learning model can compute the global geometric information of vanishing points locally. To overcome the need for a large amount of training data in previous methods, DVPD [41] incorporated two geometric priors into the neural network: the Hough transform and the Gaussian sphere. First, the convolutional features are transformed into a Hough domain, mapping lines to distinct bins. The projection of the Hough bins is then extended to the Gaussian sphere, where lines are transformed into great circles and vanishing points are located at the intersection of these circles. Geometric priors are data-efficient because they eliminate the need to learn this information from data, which enables an interpretable learning framework that generalizes better to domains with slightly different data distributions.
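The Gaussian sphere construction can be illustrated with a few lines of linear algebra. In the sketch below, each image line segment, together with the camera center, spans a plane whose unit normal defines a great circle; two great circles from parallel 3D lines intersect along the shared line direction. This is a plain geometric sketch under an assumed pinhole intrinsic matrix `K`, not the DVPD network, and all names are illustrative.

```python
import numpy as np

def great_circle_normal(p1, p2, K):
    """Unit normal of the plane through the camera center and an image segment."""
    K_inv = np.linalg.inv(K)
    r1 = K_inv @ np.array([p1[0], p1[1], 1.0])  # viewing ray of endpoint 1
    r2 = K_inv @ np.array([p2[0], p2[1], 1.0])  # viewing ray of endpoint 2
    n = np.cross(r1, r2)
    return n / np.linalg.norm(n)

def vanishing_point(seg_a, seg_b, K):
    """Intersect two great circles; return the 3D direction and its image point."""
    d = np.cross(great_circle_normal(*seg_a, K), great_circle_normal(*seg_b, K))
    d /= np.linalg.norm(d)
    vp = K @ d  # vp[2] is close to 0 when the vanishing point lies at infinity
    return d, vp[:2] / vp[2]

# Two parallel 3D lines with direction (1, 0, 1), projected by an assumed camera.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])

def project(X):
    x = K @ np.asarray(X, dtype=float)
    return x[:2] / x[2]

seg_a = (project([0, 1, 4]), project([1, 1, 5]))
seg_b = (project([1, -1, 5]), project([2, -1, 6]))
direction, vp = vanishing_point(seg_a, seg_b, K)
print(np.round(vp, 3))  # image location where the two projected lines meet
```

The recovered direction is defined only up to sign, which is why the example compares the image point rather than the 3D vector.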

Horizon Lines The horizon line is a crucial contextual attribute for various computer vision tasks, especially in image metrology, computational photography, and 3D scene understanding. The projection of the line at infinity onto any plane that is perpendicular to the local gravity vector determines the location of the horizon line.

Given the FoV, pitch, and roll of a camera, it is straightforward to locate the horizon line in the image space. DeepHorizon [65] proposed the first learning-based solution for estimating the horizon line from an image, without requiring any explicit geometric constraints or other cues. To train the network, a new benchmark dataset, Horizon Lines in the Wild (HLW), was constructed, which consists of real-world images with labeled horizon lines. SA-MobileNet [150] proposed an image tilt detection and correction method for smartphones based on a self-attention MobileNet [187]. A spatial self-attention module was devised to learn long-range dependencies and global context within the input images. To ease the regression task, they trained the network to estimate multiple angles within a narrow interval around the ground truth tilt, penalizing only the values that fall outside this narrow range.
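The relation between the FoV, pitch, roll, and the horizon line can be sketched as follows. The code uses the standard result that a plane with normal $\mathbf{n}$ (in camera coordinates) has the vanishing line $\mathbf{l} = K^{-\top}\mathbf{n}$. The sign conventions are assumptions made for illustration: the image y-axis points down, positive pitch tilts the camera upward, the principal point is at the image center, and the function names are hypothetical.

```python
import numpy as np

def horizon_line(width, height, hfov, pitch, roll):
    """Homogeneous horizon line (a, b, c), with a*u + b*v + c = 0 in pixels."""
    f = width / (2.0 * np.tan(hfov / 2.0))  # focal length from the horizontal FoV
    cx, cy = width / 2.0, height / 2.0      # assumed principal point
    K = np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])
    # World "up" direction expressed in the camera frame (assumed convention).
    up = np.array([np.sin(roll) * np.cos(pitch),
                   -np.cos(roll) * np.cos(pitch),
                   np.sin(pitch)])
    return np.linalg.inv(K).T @ up          # vanishing line of the ground plane

def horizon_v(line, u):
    """Vertical pixel coordinate of the horizon at column u (requires b != 0)."""
    a, b, c = line
    return -(a * u + c) / b

level = horizon_line(640, 480, np.radians(90.0), pitch=0.0, roll=0.0)
tilted = horizon_line(640, 480, np.radians(90.0), pitch=np.arctan(0.5), roll=0.0)
print(horizon_v(level, 0.0), horizon_v(tilted, 0.0))
```

With zero pitch and roll the horizon passes through the image center; tilting the camera upward moves the horizon toward the bottom of the image under the assumed conventions.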

Figure 3: Overview of CTRL-C. The figure is from  [139]. It estimates parameters including the zenith VP, FoV, and horizon line for camera calibration from an input image and a set of line segments. Moreover, two auxiliary outputs (vertical and horizontal convergence line scores) guide the network in learning scene geometry for calibration.

3.3.2 Composite Parameters

Calibrating the composite parameters aims to estimate the intrinsic and extrinsic parameters simultaneously. By jointly estimating composite parameters and training on a large-scale panorama dataset [78], Hold-Geoffroy et al. [29] largely outperformed previous methods that calibrate each parameter independently. Moreover, Hold-Geoffroy et al. [29] conducted a human perception study in which the participants were asked to evaluate the realism of 3D objects composited with and without accurate calibration. These data were then used to design a new perceptual measure for calibration errors. In terms of the feature category, Lee et al. [109] and CTRL-C [139] considered both semantic features and geometric cues for camera calibration. They showed that exploiting geometric features helps the network comprehend the underlying perspective structure of an image. The pipeline of CTRL-C is illustrated in Figure 3. In recent literature, more applications have been studied jointly with camera calibration, for example, single view metrology [51], 3D human pose and shape estimation [48], depth estimation [126, 60], object pose estimation [176], and image reflection removal [128].

Considering the heterogeneity and visual implicitness of different camera parameters, CPL [163] estimated the parameters using a novel camera projection loss, exploiting a neural network that embeds the camera model to reconstruct the 3D point cloud. The proposed loss addressed the training imbalance problem by expressing the errors of different camera parameters in terms of a unified metric.
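The benefit of a unified metric can be illustrated with a simplified stand-in. Instead of reconstructing 3D points as CPL does, the sketch below measures how far pixels move when the rays of one pinhole camera are re-imaged by another, so that errors in the focal length and in the optical center are both expressed in pixels. It is an illustrative 2D substitute rather than the CPL loss, and the function name and the `(f, cx, cy)` parameter tuple are assumptions.

```python
import numpy as np

def mean_pixel_displacement(params_a, params_b, width, height, step=16):
    """Mean pixel shift when rays of camera A are re-imaged by camera B.

    Each params tuple is (f, cx, cy) of a pinhole camera, in pixels.
    """
    fa, cxa, cya = params_a
    fb, cxb, cyb = params_b
    u, v = np.meshgrid(np.arange(0, width, step), np.arange(0, height, step))
    # Back-project the pixel grid to normalized rays with camera A ...
    x = (u - cxa) / fa
    y = (v - cya) / fa
    # ... and project the same rays with camera B.
    u_b = fb * x + cxb
    v_b = fb * y + cyb
    return float(np.mean(np.hypot(u_b - u, v_b - v)))

gt = (500.0, 320.0, 240.0)
print(mean_pixel_displacement(gt, (500.0, 323.0, 240.0), 640, 480))  # center error
print(mean_pixel_displacement(gt, (510.0, 320.0, 240.0), 640, 480))  # focal error
```

Because both values are in pixels, they can be summed or compared directly, whereas raw parameter errors (a focal length in pixels versus an angle in radians, for instance) live on different scales.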

3.4 Discussion

3.4.1 Technique Summary

The above methods target automatic calibration without manual intervention or scene assumptions. Early literature [24, 25] studied intrinsic calibration and extrinsic calibration separately. Driven by large-scale datasets and powerful networks, subsequent works [39, 65, 29, 139] considered comprehensive camera calibration, inferring various parameters and geometric representations. To ease the difficulty of learning the camera parameters, some works [89, 137, 151, 170] proposed to learn an intermediate representation. In recent literature, more applications have been studied jointly with camera calibration [51, 48, 126, 60, 128]. This suggests that solving downstream vision tasks, especially 3D tasks, may require prior knowledge of the image formation model. Moreover, some geometric priors [41] can alleviate the data-hungry nature of deep learning, showing the potential to bridge the gap between the calibration target and semantic features.

Interestingly, an increasing number of extrinsic calibration methods [115, 167, 174] revisited and revived the traditional feature point-based solutions. The standard extrinsics that describe the camera motion contain limited degrees of freedom, and thus some local features can represent the spatial correspondence well. Besides, networks designed for point learning, such as PointNet [188] and PointCNN [189], significantly improve the efficiency of calibration models. Such a pipeline also makes learning-based camera calibration more interpretable, which promotes understanding of how the network calibrates and highlights the influence of the intermediate modules.

NeRF, recognized for its groundbreaking capability in synthesizing novel views from 2D images, has seen various advancements recently. Progress in this area includes the incorporation of additional trainable components, prior constraints, enhanced network designs, and novel training strategies. Specifically, NeRF-- [190] marked a pivotal moment, demonstrating the simultaneous optimization of camera parameters and poses during training. It introduced a trainable pinhole camera model, followed by SCNeRF [191], which proposed a more complex camera model comprising a pinhole design, radial distortion, and a pixel-specific noise model. While more trainable parameters offer enhanced representational capabilities, they also complicate training. To counteract this, researchers have utilized prior geometric knowledge as constraints to stabilize optimization, integrating depth estimation [192, 193], multi-view correspondence [191, 194, 193], and GAN-based constraints [195]. Additionally, refinements to NeRF’s architecture have emerged, leveraging Gaussian [196] or sinusoidal activations [197]. Concurrently, training strategies such as coarse-to-fine pipelines [198, 192] and advanced sampling techniques [198, 199, 197] have been proposed, enhancing both reconstruction quality and parameter precision.

Popular frameworks like InstantNGP [200] and NeRFStudio [201] offer features to fine-tune camera parameters. Typically, NeRF methods use outputs from tools like COLMAP or Polycam, with rendering quality tied closely to initial data quality. Minor errors in intrinsics or poses can lead to noticeable rendering artifacts. Current approaches [200, 201] integrate intrinsic parameters, distortion coefficients, and camera poses into NeRF’s training optimization, enhancing both geometric structure learning and parameter optimization. This boosts 3D geometry robustness and refines camera models. However, fine-tuning parameters may complicate optimization and can destabilize training. It may also not work well if there are not enough views to reliably adjust the intrinsics without overfitting.

3.4.2 Future Effort

(1) Explore more model priors. Most learning-based calibration methods study parametric camera models, but their generalization abilities are limited. In contrast, non-parametric models directly model the relationship between the 3D imaging ray and its resulting pixel in the image, encoding valuable priors in the learned semantic features to reason about the camera parameters. Recent works [202, 203, 204] also model pixel-wise information for camera calibration, making minimal assumptions about the camera model and yielding representations that are more interpretable and more in line with human perception.

(2) Decouple different stages in an end-to-end calibration learning model. Most learning-based camera calibration methods include a feature extraction stage and an objective estimation stage. However, how the networks learn the features related to calibration remains ambiguous. Therefore, decoupling the learning process according to the stages of traditional calibration can guide feature extraction. It would be meaningful to extend this idea from extrinsic calibration [115, 167, 174] to more general calibration problems.

(3) Transfer the measurement space from the parameter error to the geometric difference. When various camera parameters are calibrated jointly, the training process suffers from an imbalanced loss optimization problem. The main reason is that different camera parameters correspond to different sample distributions, and a simple normalization strategy cannot unify their error spaces. Therefore, we can formulate a straightforward measurement space in terms of the geometric properties of different camera parameters.

(4) Despite the strides achieved by recent NeRF methods, training NeRF without precise camera parameters remains a challenge, especially in scenarios with sparse views, pronounced movements, low-texture regions, and suboptimal initial values. Contemporary NeRF-based methodologies do optimize camera parameters, yielding notable results. However, they exhibit significant computational demands and lack the generalization seen in current deep-learning calibration techniques. We contend that in the present NeRF-based methods, camera parameters often play a secondary role. Thus, crafting effective calibration algorithms that capitalize on NeRF remains an arduous but promising endeavor.

4 Distortion Model

In learning-based camera calibration, calibrating radial distortion and rolling shutter distortion has gained increasing attention due to the wide use of wide-angle lenses and CMOS sensors. In this part, we mainly review the calibration/rectification of these two distortions.

4.1 Radial Distortion

The literature on learning-based radial distortion calibration can be classified into two main categories: regression-based solutions and reconstruction-based solutions.

Figure 4: Three common learning solutions of the regression-based wide-angle camera calibration: (a) SingleNet, (b) DualNet, (c) SeqNet, where $\mathbf{I}$ is the distortion image and $f$ and $\xi$ denote the focal length and distortion parameters, respectively. The figure is from [40].

4.1.1 Regression-based Solution

Rong et al. [26] and DeepCalib [40] are pioneering works in learning-based wide-angle camera calibration. They treated camera calibration as a supervised classification [26] or regression [40] problem, and used networks with convolutional and fully connected layers to learn the distortion features of the inputs and predict the camera parameters. In particular, DeepCalib [40] explored three learning solutions for wide-angle camera calibration, as illustrated in Figure 4. Their experiments showed that the simplest architecture, SingleNet, achieves the best performance in both accuracy and efficiency. To enhance the distortion perception of networks, subsequent works investigated introducing more diverse features such as semantic features [79] and geometry features [101, 125, 123]. Additionally, some works improved generalization by designing learning strategies such as unsupervised learning [57], self-supervised learning [59], and reinforcement learning [46]. By randomly choosing coefficients for each mini-batch during training, RDC-Net [120] was able to dynamically generate distortion images on-the-fly, which enhanced the rectification performance and prevented the learning model from overfitting. Instead of contributing to deep learning techniques, other works explored vision priors for interpretable calibration. For example, having observed that a radially distorted image exhibits center symmetry, in which the texture far from the image center is distorted more strongly, Shi et al. [81] and PSE-GAN [119] developed a position-aware weight layer (fixed [81] and learnable [119]) that encodes this property and enables the network to explicitly perceive the distortion. Lopez et al. [88] proposed a novel parameterization for radial distortion that is better suited for networks than directly learning the distortion parameters. Furthermore, OrdinalDistortion [142] presented a learning-friendly representation, i.e., ordinal distortion. Compared to the implicit and heterogeneous camera parameters, such a representation can facilitate the distortion perception of the neural network due to its clear relation to the image features.
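The center-symmetry observation can be checked numerically. The sketch below uses the even-order polynomial radial model $\mathbf{x}_d = \mathbf{x}_u (1 + k_1 r^2 + k_2 r^4)$ as an assumed distortion model; the surveyed methods adopt various models, and the coefficient values here are arbitrary. Coordinates are normalized and centered at the distortion center.

```python
import numpy as np

def distort_points(xy, k1, k2=0.0):
    """Apply x_d = x_u * (1 + k1 r^2 + k2 r^4) to centered, normalized points."""
    xy = np.asarray(xy, dtype=float)
    r2 = np.sum(xy ** 2, axis=-1, keepdims=True)  # squared radius per point
    return xy * (1.0 + k1 * r2 + k2 * r2 ** 2)

# Points at increasing distance from the distortion center.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.5, 0.0], [1.0, 0.0]])
shift = np.linalg.norm(distort_points(points, k1=-0.2) - points, axis=1)
print(shift)  # displacement grows with the radius; the center does not move
```

With a single coefficient the displacement equals $|k_1| r^3$, which makes explicit why features near the image border carry most of the distortion signal.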

4.1.2 Reconstruction-based Solution

Inspired by conditional image-to-image translation and dense visual perception, the reconstruction-based solution evolved from the conventional regression-based paradigm. DR-GAN [34] is the first reconstruction-based solution for calibrating radial distortion, which directly models the pixel-wise mapping between the distorted image and the rectified image. It achieved camera-parameter-free training and one-stage rectification. Freed from the assumption of a specific camera model, the reconstruction-based solution showed the potential to calibrate various types of cameras in one learning network. For example, DDM [35] unified different camera models into one domain by presenting the distortion distribution map, which explicitly describes the distortion level of each pixel in a distorted image. Then, the network learned to reconstruct the rectified image using this geometric prior map. To make the mapping function interpretable, subsequent works [96, 37, 46, 121, 145, 47, 50, 143] estimated the displacement field between the distorted image and the rectified image. This approach can eliminate the artifacts generated by pixel-level reconstruction. In particular, FE-GAN [121] integrated the geometry prior, as in Shi et al. [81] and PSE-GAN [119], into their reconstruction-based solution and presented a self-supervised strategy to learn the distortion flow for wide-angle camera calibration, as shown in Figure 5. Most reconstruction-based solutions exploit a U-Net-like architecture to learn the pixel-level mapping. However, the distortion features can be transferred from the encoder to the decoder by the skip connections, leading to a blurry appearance and incomplete correction in the reconstruction results. To address this issue, Li et al. [117] abandoned the skip connections in their rectification network. To keep the feature fusion while restraining the geometric difference, PCN [145] designed a correction layer in the skip connections and applied appearance flows to revise the convolved features in different encoder layers. Having noticed that the previous sampling strategy of the convolution kernel neglected the radial symmetry of distortion, PolarRecNet [143] transformed the distorted image from the Cartesian coordinate domain into the polar coordinate domain.
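Rectification with a displacement field can be sketched as a backward warp: every output pixel looks up the input image at its own location plus the predicted displacement. The code below is a minimal NumPy version with bilinear interpolation for a single-channel image; it is an illustrative assumption about how such a field is applied, not the warping layer of any specific method, and border samples are simply clamped.

```python
import numpy as np

def warp_with_flow(image, flow):
    """Backward warp: out[y, x] samples image at (x + flow[y, x, 0], y + flow[y, x, 1])."""
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling locations, clamped to the image borders.
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx, wy = sx - x0, sy - y0
    # Bilinear interpolation between the four neighbouring pixels.
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

# A horizontal ramp image shifted by half a pixel.
ramp = np.tile(np.arange(6, dtype=float), (4, 1))
flow = np.zeros((4, 6, 2))
flow[..., 0] = 0.5
print(warp_with_flow(ramp, flow)[0])
```

Because the interpolation is differentiable with respect to the flow, a network can be trained end-to-end on the rectified output, which is one reason displacement fields are a convenient intermediate representation.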

Figure 5: Architecture of FE-GAN. The figure is from [121]. It consists of two components: a generator $G=(U,W)$ that rectifies the distortion image $x$, and a discriminator $D=(D_{adv},D_{cls})$. The module $U$ in $G$ predicts the distortion flow $f=U(x)$, while $W$ rectifies the distortion image using $f$.

4.2 Rolling Shutter Distortion

The existing deep learning calibration works on rolling shutter (RS) distortion can be classified into two categories: single-frame-based [27, 100, 42] and multi-frame-based [49, 153, 159, 158, 166]. The single-frame-based solution takes a single rolling shutter image as input and directly learns to correct the distortion using neural networks. The ideal corrected result can be regarded as the global shutter (GS) image. This is an ill-posed problem that requires additional prior assumptions to be well defined. In contrast, the multi-frame-based solution considers consecutive frames (two or more) of a video taken by a rolling shutter camera, in which the strong temporal correlation can be exploited for more reasonable correction.

4.2.1 Single-frame-based Solution

URS-CNN [27] is the first learning-based work for calibrating the rolling shutter camera. In this work, a neural network with long kernels was used to understand how the scene structure and row-wise camera motion interact. To specifically address the nature of the RS effect produced by the row-wise exposure, row-kernel and column-kernel convolutions were leveraged to extract attributes along the horizontal and vertical axes. RSC-Net [100] extended URS-CNN [27] from 2 degrees of freedom (DoF) to 6-DoF and presented a structure-and-motion-aware RS correction model, in which the camera scanline velocity and depth were estimated. Compared to URS-CNN [27], RSC-Net [100] further reasoned about the concealed motion between the scanlines as well as the scene structure, as shown in Figure 6. To bridge the spatiotemporal connection between RS and GS, EvUnroll [42] exploited neuromorphic events to correct the RS effect. Thanks to their high temporal resolution with microsecond-level sensitivity, event cameras can overcome a number of drawbacks of conventional frame-based cameras in dynamic scenes with fast motion.
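The row-wise exposure behind the RS effect can be illustrated with a toy simulation. The sketch assumes a fronto-parallel scene that translates horizontally at constant speed, so that row $y$, which is read out later than row $0$, appears shifted in proportion to $y$. Real RS distortion depends on 6-DoF motion and scene depth, as discussed above; the function name and the wrap-around border handling are simplifications made for illustration.

```python
import numpy as np

def simulate_rolling_shutter(gs_image, shift_per_row):
    """Shift row y of a global shutter image by round(shift_per_row * y) pixels."""
    rs = np.empty_like(gs_image)
    for y in range(gs_image.shape[0]):
        # Later rows are exposed later, so the scene has moved further.
        rs[y] = np.roll(gs_image[y], int(round(shift_per_row * y)))
    return rs

# A vertical line in the GS image becomes a slanted line in the RS image.
gs = np.zeros((5, 8))
gs[:, 2] = 1.0
rs = simulate_rolling_shutter(gs, shift_per_row=1.0)
print(np.argmax(rs, axis=1))  # column of the line in each row
```

Correcting the distortion amounts to estimating and undoing this per-row displacement, which is why row-wise kernels and scanline-dependent motion models are natural design choices.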

Figure 6: Architecture of RSC-Net. The figure is from [100]. It consists of two sub-networks, namely DepthNet and Velocity-Net, for learning an RS depth map and RS camera motion from an input image, respectively. Among them, a 6-DoF camera velocity is regressed, including a 3D translational velocity vector $v$ and a 3D angular velocity vector $w$.

Figure 7: Architecture of AW-RSC. The figure is from [166]. To address current imprecise motion estimation, it attempts to predict multiple displacement fields instead of only one. Additionally, AW-RSC suggests an adaptive warping module that uses the bundle of fields to guide the adaptive warping of the RS features into the GS one.

4.2.2 Multi-frame-based Solution

Most multi-frame-based solutions follow the reconstruction paradigm; they mainly focus on how to represent the dense displacement field between RS and GS images and how to accurately warp the RS domain to the GS domain. DeepUnrollNet [49] was the first to propose an end-to-end network for two consecutive rolling shutter images using a differentiable forward warping module. In this method, a motion estimation network is used to estimate the dense displacement field from a rolling shutter image to its matching global shutter image. The second contribution of DeepUnrollNet [49] is the construction of two novel datasets: the Fastec-RS dataset and the Carla-RS dataset. Furthermore, JCD [153] jointly addressed rolling shutter correction and deblurring (RSCD), since both degradations commonly co-occur in the medium- and long-exposure cases of rolling shutter cameras. It applied bi-directional warping streams to compensate for the displacement while keeping a non-warped deblurring stream to restore details. The authors also contributed a real-world dataset, BS-RSCD, captured with a well-designed beam-splitter acquisition system, which includes both ego-motion and object motion in dynamic scenes. SUNet [159] extended DeepUnrollNet [49] from the middle time of the second frame ($\frac{3\tau}{2}$) to the intermediate time of the two frames ($\tau$). By using PWC-Net [205], SUNet [159] estimated the symmetric undistortion fields and reconstructed the potential GS frames with a time-centered GS image decoder network. To effectively reduce the misalignment between the contexts warped from two consecutive RS images, a context-aware undistortion flow estimator and a symmetric consistency enforcement were designed. To achieve a higher frame rate, Fan et al. [158] generated a GS video from two consecutive RS images based on the scanline-dependent nature of the RS camera. In particular, they first analyzed the inherent connection between the bidirectional RS undistortion flow and the optical flow, demonstrating that the RS undistortion flow map has a more pronounced scanline dependency than the isotropically smooth optical flow map. Then, they developed the bidirectional undistortion flows to describe the pixel-wise RS-aware displacement, and further devised a computation technique for the mutual conversion between RS undistortion flows corresponding to different scanlines. To mitigate the inaccurate displacement field estimation and error-prone warping in previous methods, AW-RSC [166] proposed to predict multiple fields and adaptively warp the learned RS features into their global shutter counterparts. Using a coarse-to-fine approach, these warped features were combined to generate precise global shutter frames, as shown in Figure 7. Compared to previous works [49, 153, 159, 158], the warping operation in AW-RSC [166], which consists of adaptive multi-head attention and a convolutional block, is learnable and effective. In addition, AW-RSC [166] contributed a real-world rolling shutter correction dataset, BS-RSC, in which the RS videos and the corresponding GS ground truth are captured simultaneously with a beam-splitter-based acquisition system.

4.3 Discussion

4.3.1 Technique Summary

The deep learning works on wide-angle camera calibration and rolling shutter calibration share a similar technical pipeline. Most early literature began with the regression-based solution [26, 40, 27]. Subsequent works reframed traditional calibration from a reconstruction perspective [34, 35, 121, 49], directly learning the displacement field to rectify the uncalibrated input. For higher calibration accuracy, more intuitive displacement fields and more effective warping strategies have been developed [145, 166, 153, 158]. To fit the distribution of different distortions, some works designed different shapes of the convolutional kernel [27] or transformed the convolved coordinates [143].

Existing works devoted themselves to designing more powerful networks and introducing more diverse features to improve calibration performance. An increasing number of methods focus on the geometric priors of the distortion [121, 119, 81]. These priors can be directly weighted into the convolutional layers or used to supervise network training, helping the learning model converge faster.

4.3.2 Future Effort

(1) The development of wide-angle camera calibration and rolling shutter camera calibration can promote each other. For instance, the well-studied multi-frame solutions in rolling shutter calibration could inspire wide-angle calibration: the same object observed in different frames could provide useful priors regarding radial distortion. Additionally, the elaborate studies of the displacement field and warping layer [166, 153, 158] have the potential to motivate the development of wide-angle camera calibration and other fields. Furthermore, the investigation of geometric priors in wide-angle calibration could also improve the interpretability of the network in rolling shutter calibration.

(2) Most methods synthesize their training dataset by randomly sampling all camera parameters. However, for images captured by real lenses, the distribution of camera parameters probably lies on a latent manifold [88]. Learning on a label-redundant calibration dataset makes the training process inefficient. Thus, exploring a practical sampling strategy for the synthesized dataset could be a meaningful future direction.

(3) To overcome the ill-posed problem of single-frame calibration, introducing other high-precision sensors, such as event cameras [42], can compensate for the current calibration performance. With the rapid development of vision sensors, joint calibration using multiple sensors is valuable. Consequently, more cross-modal and multi-modal fusion techniques will be investigated along this research direction.

5 Cross-View Model

Existing deep calibration methods can estimate the specific camera parameters of a single camera. In fact, there can be more complicated parameter representations in multi-camera circumstances. For example, in the multi-view model, the fundamental matrix and essential matrix describe the epipolar geometry, and they are intricately entangled with the intrinsics and extrinsics. The homography depicts the pixel-level correspondences between different views. In addition to intrinsics and extrinsics, it is also intertwined with depth. Among these complex parameter representations, homography is the most widely leveraged in practical applications, and its related learning-based methods are the most investigated. To this end, we mainly focus on reviewing deep homography estimation solutions for the cross-view model, which can be divided into three categories: direct, cascaded, and iterative solutions.

Figure 8: Architectures of DHN [28] and UDHN [52]. The figure is from [52]. The supervised approach [28] learns to regress a 4-point parameterization of the homography $\mathbf{\tilde{H}}_{4pt}$ using an $\mathcal{L}_{2}$ loss. The unsupervised approach [52] outputs $\mathbf{\tilde{H}}_{4pt}$ that minimizes the $\mathcal{L}_{1}$ pixel-wise photometric loss of paired inputs (DLT: direct linear transform; PSGG: parameterized sampling grid generator; DS: differentiable sampling).

5.1 Direct Solution

We review the direct deep homography solutions from the perspective of different parameterizations, including the classical 4-pt parameterization and other parameterizations.

5.1.1 4-pt Parameterization

Deep homography estimation was first proposed in DHN [28], where a VGG-style network is adopted to predict the 4-pt parameterization $H_{4pt}$. To train and evaluate the network, a synthetic dataset named Warped MS-COCO is created to provide the ground truth 4-pt parameterization $\hat{H}_{4pt}$. The pipeline is illustrated in Fig. 8(a), and the objective function is formulated as $L_{H}$:

$$L_{H}=\frac{1}{2}\parallel H_{4pt}-\hat{H}_{4pt}\parallel_{2}^{2}. \tag{2}$$

Then the 4-pt parameterization can be solved as a $3\times 3$ homography matrix using normalized DLT [206]. However, DHN is limited to synthetic datasets, where the ground truth can be generated for free; otherwise, it requires costly labeling of real-world datasets. Subsequently, the first unsupervised solution, named UDHN [52], was proposed to address this problem. As shown in Fig. 8(c), it uses the same network architecture as DHN and defines an unsupervised loss function by minimizing the average photometric error, motivated by traditional methods [207]:

$$L_{PW}=\parallel\mathcal{P}(I_{A}(x))-\mathcal{P}(I_{B}(\mathcal{W}(x;p)))\parallel_{1}, \tag{3}$$

where $\mathcal{W}(\cdot;\cdot)$ and $\mathcal{P}(\cdot)$ denote the operations of warping via homography parameters $p$ and extracting an image patch, respectively. $I_{A}$ and $I_{B}$ are the original images with overlapping regions. The input of UDHN is a pair of image patches, but it warps the original images when calculating the loss. In this manner, it avoids the adverse effects of invalid pixels after warping and lifts the magnitude of pixel supervision. To gain accuracy and speed with a tiny model, Chen et al. proposed ShuffleHomoNet [148], which integrates ShuffleNet compressed units [208] and location-aware pooling [84] into a lightweight model. To further handle large displacement, a multi-scale weight-sharing version is exploited by extracting multi-scale feature representations and adaptively fusing multi-scale predictions. However, the homography cannot perfectly align images with parallax caused by non-planar structures with non-overlapping camera centers. To deal with parallax, CA-UDHN [53] designs learnable attention masks to overlook the parallax regions, contributing to better background plane alignment. Besides, the 4-pt homography can be extended to meshflow [56] to realize accurate non-planar alignment.
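The conversion from a 4-pt parameterization to a $3\times 3$ matrix can be illustrated with a minimal sketch. It assumes NumPy and, for brevity, uses the plain DLT without the coordinate normalization of the normalized variant cited above; the function name `four_point_to_homography` is hypothetical and chosen only for illustration.

```python
import numpy as np


def four_point_to_homography(corners, offsets):
    """Solve the 3x3 homography mapping `corners` to `corners + offsets`.

    corners: (4, 2) array of (x, y) patch corners in the first image.
    offsets: (4, 2) array of corner displacements (the 4-pt parameterization).
    Note: plain DLT, no coordinate normalization (an illustrative simplification).
    """
    src = np.asarray(corners, dtype=np.float64)
    dst = src + np.asarray(offsets, dtype=np.float64)

    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on the
        # nine entries of H (stacked row by row).
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows, dtype=np.float64)

    # The solution is the null vector of A, i.e. the right singular vector
    # associated with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    # Fix the scale so that H[2, 2] = 1 (assumes H[2, 2] is not zero).
    return H / H[2, 2]
```

With zero offsets the recovered matrix is the identity, which is a quick sanity check for an implementation of this kind.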

5.1.2 Other Parameterizations

In addition to the 4-pt parameterization, the homography can be parameterized in other forms. To better utilize homography invertibility, Wang et al. proposed SSR-Net [58]. They established the invertibility constraint through a conventional matrix representation in a cyclic manner. Zeng et al. [85] argued that the 4-point parameterization regressed by a fully-connected layer can harm the spatial order of the corners and be susceptible to perturbations, since four points are the minimum requirement to solve the homography. To address these issues, they formulated the parameterization as a perspective field (PF) that models pixel-to-pixel bijection and designed PFNet. This extends the displacements of the four vertices to as many dense pixel points as possible. The homography can then be solved using RANSAC [209] with outlier filtering, enabling robust estimation by utilizing dense correspondences. Nevertheless, dense correspondences lead to a significant increase in the computational complexity of RANSAC. Furthermore, Ye et al. [54] proposed an 8-DOF flow representation without extra post-processing, which has a size of $H\times W\times 2$ and lies in an 8D subspace constrained by the homography. To represent arbitrary homography flows in this subspace, 8 flow bases are defined, and the proposed BasesHomo predicts the coefficients for the flow bases. To obtain desirable bases, BasesHomo first generates 8 homography flows by modifying every single entry of an identity homography matrix except for the last entry. Then, these flows are normalized by their largest flow magnitude, followed by a QR decomposition, so that all bases are normalized and orthogonal.
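The basis construction described above can be sketched as follows. This is an illustrative NumPy approximation rather than the BasesHomo implementation: the perturbation size `delta`, the function names `homography_flow` and `build_flow_bases`, and the pixel-coordinate convention are all assumptions made for this example.

```python
import numpy as np


def homography_flow(H, height, width):
    """Dense flow (height, width, 2) induced by homography H on a pixel grid."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)
    warped = pts @ H.T
    warped = warped[..., :2] / warped[..., 2:3]
    return warped - pts[..., :2]


def build_flow_bases(height, width, delta=1e-3):
    """Return an (height*width*2, 8) matrix whose columns are orthonormal flow bases.

    `delta` is an assumed perturbation size; the source only states that each
    entry of the identity matrix (except the last) is modified.
    """
    flows = []
    for idx in range(8):  # skip the last entry H[2, 2]
        H = np.eye(3)
        H[idx // 3, idx % 3] += delta
        flow = homography_flow(H, height, width)
        # Normalize each flow by its largest per-pixel magnitude.
        max_mag = np.linalg.norm(flow, axis=-1).max()
        flows.append((flow / max_mag).reshape(-1))
    M = np.stack(flows, axis=1)
    # QR decomposition orthonormalizes the eight flattened flows.
    Q, _ = np.linalg.qr(M)
    return Q
```

Given predicted coefficients `c` of length 8, a flow in this subspace would be recovered as `(Q @ c).reshape(height, width, 2)`.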

Figure 9: Architecture of HomoGAN. The figure is from  [55]. In particular, the homography estimation transformer with cascaded encoder-decoder blocks takes a feature pyramid of each image as inputs, and predicts the homography from coarse to fine. Coplanarity-aware GAN imposes coplanarity constraints on the model by predicting soft masks of the dominant plane.

5.2 Cascaded Solution

Direct solutions explore various homography parameterizations with simple network structures, while the cascaded ones focus on complex designs of network architectures.

In HierarchicalNet [72], Nowruzi et al. hold that the warped images can be regarded as the input of another network. Therefore, they stacked the networks sequentially to reduce the error bounds of the estimate. Based on HierarchicalNet, SRHEN [136] introduced the cost volume [205] to the cascaded network, measuring the feature correlation by cosine distance and formulating it as a volume. The stacked networks and cost volume increase the performance, but they cannot handle dynamic scenes. MHN [113] developed a multi-scale neural network and proposed to learn homography estimation and dynamic content detection simultaneously. Moreover, to tackle the cross-resolution problem, LocalTrans [147] formulated it as a multimodal problem and proposed a local transformer network embedded within a multiscale structure to explicitly learn correspondences between the multimodal inputs. These inputs include images with different resolutions, and LocalTrans achieved superior performance on cross-resolution cases with a resolution gap of up to 10x. All the solutions mentioned above leverage image pyramids to progressively enhance the ability to address large displacements. However, every image pair at each level requires a unique feature extraction network, resulting in redundant feature maps. To alleviate this problem, some researchers [210, 149, 44, 55] replaced image pyramids with feature pyramids. Specifically, they warped the feature maps directly instead of the images to avoid excessive feature extraction networks. To address the low-overlap homography estimation problem in real-world images, Nie et al. [149] modified the unsupervised constraint (Eq. 3) to adapt to low-overlap scenes:

$$L^{\prime}_{PW}=\parallel I_{A}(x)\cdot\mathbbm{1}(\mathcal{W}(x;p))-I_{B}(\mathcal{W}(x;p))\parallel_{1}, \tag{4}$$

where $\mathbbm{1}$ is an all-one matrix with the same size as $I_{A}$ or $I_{B}$. It solved the low-overlap problem by taking the original images as network input and masking out the pixels of $I_{A}$ that correspond to the invalid pixels of the warped $I_{B}$. To solve the non-planar homography estimation problem, DAMG-Homo [44] proposed backward multi-grid deformation with contextual correlation to align parallax images. Compared with the traditional cost volume, the proposed contextual correlation helped to reach better accuracy with lower computational complexity. Another way to address the non-planar problem is to focus on the dominant plane. In HomoGAN [55], an unsupervised GAN is proposed to impose a coplanarity constraint on the predicted homography, as shown in Figure 9. To implement this approach, a generator is used to predict masks of aligned regions, while a discriminator is used to determine whether two masked feature maps were produced by a single homography.
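To make the role of the all-one matrix in Eq. 4 concrete, the following sketch evaluates the masked photometric term for grayscale images. It is a simplified illustration, not the implementation of [149]: it uses nearest-neighbor sampling in NumPy (training would require a differentiable sampler), and the function names `warp_with_mask` and `masked_photometric_loss` are hypothetical.

```python
import numpy as np


def warp_with_mask(image, H):
    """Sample a 2-D grayscale `image` at W(x; p) for every pixel x.

    Returns the warped image and a validity mask, which equals the warped
    all-one matrix: 1 where the sample falls inside the image, 0 elsewhere.
    """
    h, w = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)
    warped = pts @ H.T
    u = np.rint(warped[..., 0] / warped[..., 2]).astype(int)
    v = np.rint(warped[..., 1] / warped[..., 2]).astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = np.zeros((h, w), dtype=np.float64)
    out[valid] = image[v[valid], u[valid]]
    return out, valid.astype(np.float64)


def masked_photometric_loss(img_a, img_b, H):
    """L1 difference between the masked I_A and the warped I_B, as in Eq. 4."""
    warped_b, mask = warp_with_mask(img_b, H)
    # Pixels of I_A without a valid counterpart in the warped I_B are zeroed,
    # so they contribute nothing to the loss.
    return np.abs(img_a * mask - warped_b).sum()
```

Without the mask, the non-overlapping region of $I_{A}$ would be compared against zeros and would dominate the loss when the overlap is small.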

5.3 Iterative Solution

Compared with cascaded methods, iterative solutions achieve higher accuracy by iteratively optimizing the last estimation. The Lucas-Kanade (LK) algorithm [207] is usually used in image registration to estimate parameterized warps iteratively, such as affine transformation, optical flow, etc. It aims at the incremental update of the warp parameters $\varDelta p$ in every iteration by minimizing the sum of squared errors between a template image $T$ and an input image $I$:

$$E(\varDelta p)=\parallel T(x)-I(\mathcal{W}(x;p+\varDelta p))\parallel_{2}^{2}. \tag{5}$$

However, when optimizing Eq. 5 using first-order Taylor expansion, $\partial I(\mathcal{W}(x;p))/\partial p$ should be recomputed in every iteration because $I(\mathcal{W}(x;p))$ varies with $p$. To avoid this problem, the inverse compositional (IC) LK algorithm [211], which is equivalent to the LK algorithm, can be used to reformulate the optimization goal as follows:

$$E^{\prime}(\varDelta p)=\parallel T(\mathcal{W}(x;\varDelta p))-I(\mathcal{W}(x;p))\parallel_{2}^{2}. \tag{6}$$

After linearizing Eq. 6 with first-order Taylor expansion, we compute $\partial T(\mathcal{W}(x;0))/\partial p$ instead of $\partial I(\mathcal{W}(x;p))/\partial p$, which does not vary across iterations.
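For readers who want the intermediate step, the linearization of Eq. 6 can be written out as below. This is the standard Gauss-Newton form of the inverse compositional update, given here as a sketch under the assumption that $\mathcal{W}(x;0)=x$ is the identity warp; the symbols $J$ and $G$ are introduced only for this illustration and do not appear elsewhere in the text.

```latex
\begin{aligned}
E^{\prime}(\varDelta p) &\approx \sum_{x} \Big\| T(x) + J(x)\,\varDelta p - I(\mathcal{W}(x;p)) \Big\|_{2}^{2},
\qquad J(x) = \nabla T(x)\,\frac{\partial \mathcal{W}(x;p)}{\partial p}\Big|_{p=0}, \\
\varDelta p &= G^{-1} \sum_{x} J(x)^{\top} \big[ I(\mathcal{W}(x;p)) - T(x) \big],
\qquad G = \sum_{x} J(x)^{\top} J(x), \\
\mathcal{W}(x;p) &\leftarrow \mathcal{W}(x;p) \circ \mathcal{W}(x;\varDelta p)^{-1}.
\end{aligned}
```

Because $J$ and $G$ depend only on the template $T$, they can be precomputed once, which is the practical benefit of the IC formulation noted above.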

To combine the advantages of deep learning with IC-LK iterator, CLKN [71] conducted LK iterative optimization on semantic feature maps extracted by CNNs as follows:

$$E^{f}(\varDelta p)=\parallel F_{T}(\mathcal{W}(x;\varDelta p))-F_{I}(\mathcal{W}(x;p))\parallel_{2}^{2}, \tag{7}$$

where $F_{T}$ and $F_{I}$ are the feature maps of the template and input images. During training, the network is constrained to run a single iteration with a hinge loss, whereas in the testing stage it runs multiple iterations until the stopping condition is met. Besides, CLKN stacked three similar LK networks to further boost the performance by treating the output of the previous LK network as the initial warp parameters of the next one. As Eq. 7 shows, the IC-LK algorithm relies heavily on feature maps, which tend to fail on multimodal images. Instead, DLKFM [146] constructed a single-channel feature map by using the eigenvalues of the local covariance matrix on the output tensor. To learn DLKFM, it designed two special constraint terms to align multimodal feature maps and contribute to convergence.

However, LK-based algorithms can fail if the Jacobian matrix is rank-deficient [212]. Additionally, the IC-LK iterator is untrainable, which means this drawback is theoretically unavoidable. To address this issue, a completely trainable iterative homography network (IHN) [165] was proposed. Inspired by RAFT [213], IHN updates the cost volume to refine the estimated homography using the same estimator repeatedly every iteration. Furthermore, IHN can handle dynamic scenes by producing an inlier mask in the estimator without requiring extra supervision.

5.4 Discussion

5.4.1 Technique Summary

The above works are devoted to exploring different homography parameterizations, such as the 4-pt parameterization [28], perspective field [85], and motion bases representation [54], which contribute to better convergence and performance. Other works tend to design various network architectures. In particular, cascaded and iterative solutions are proposed to refine the performance progressively, and they can be further combined to reach higher accuracy. To make the methods more practical, various challenging problems have been preliminarily addressed, such as cross resolutions [147], multiple modalities [146, 165], dynamic objects [113, 165], and non-planar scenes [53, 55, 44].

5.4.2 Challenge and Future Effort

We summarize the existing challenges as follows:

(1) Many homography estimation solutions are designed for fixed resolutions, while real-world applications often involve much more flexible resolutions. When pre-trained models are applied to images with different resolutions, performance can dramatically drop due to the need for input resizing to satisfy the regulated resolution.

(2) Unlike optical flow estimation, which assumes small motions between images, homography estimation often deals with images that have significantly low-overlap rates. In such cases, existing methods may exhibit inferior performance due to limited receptive fields.

(3) Existing methods address the parallax or dynamic objects by learning to reject outliers in the feature extractor[53], cost volume[214], or estimator[165]. However, it is still unclear which stage is more appropriate for outlier rejection.

Based on the challenges we have discussed, some potential research directions for future efforts can be identified:

(1) To overcome the first challenge, we can design various strategies to enhance resolution robustness, such as resolution-related data augmentation, and continual learning on multiple datasets with different resolutions. Besides, we can also formulate a resolution-free parameterization form. The perspective field [85] is a typical case, which represents the homography as dense correspondences with the same resolution as input images. But it requires RANSAC as the post-processing approach, introducing extra computational complexity, especially in the case of extensive correspondences. Therefore, a resolution-free and efficient parameterization form should be explored.

(2) To enhance the performance at low overlap rates, the main insight is to increase the receptive field of the network. To this end, the cross-attention module of the transformer explicitly leverages long-range correlation to eliminate the short-range inductive bias [215]. On the other hand, we can exploit beneficial variants of the cost volume to integrate feature correlation [44, 165].

(3) As there is no interaction between different image features in the feature extractor, it is reasonable to assume that outlier rejection should occur after feature extraction. It is not possible to identify outliers within a single image as the depth alone cannot be used as an outlier cue. For example, images captured by purely rotated cameras do not contain parallax outliers. Additionally, it seems intuitive to learn the capability of outlier rejection by combining global and local correlation, similar to the insight of RANSAC.

6 Cross-Sensor Model

Multi-sensor calibration estimates intrinsic and extrinsic parameters of multiple sensors like cameras, LiDARs, and IMUs. This ensures that data from different sensors are synchronized and registered in a common coordinate system, allowing them to be fused together for a more accurate representation of the environment. Accurate multi-sensor calibration is crucial for applications like autonomous driving and robotics, where reliable sensor fusion is necessary for safe and efficient operation.

In this part, we mainly review the literature on learning-based camera-LiDAR calibration, i.e., predicting the 6-DoF rigid body transformation between a camera and a 3D LiDAR, without requiring the presence of specific features or landmarks. Like calibration works on other types of cameras/systems, this research field can also be classified into regression-based solutions and flow/reconstruction-based solutions. However, we prefer to follow the specific matching principle in camera-LiDAR calibration and divide the existing learning-based literature into three categories: pixel-level solutions, semantics-level solutions, and object/keypoint-level solutions.

6.1 Pixel-level Solution

The first deep learning technique in camera-LiDAR calibration, RegNet [30], used CNNs to combine feature extraction, feature matching, and global regression to infer the 6-DoF extrinsic parameters. It processed the RGB and LiDAR depth map separately and branched two parallel data network streams. Then, a specific correlation layer was proposed to convolve the stacked LiDAR and RGB features as a joint representation. After this feature matching, the global information fusion and parameter regression were achieved by two fully connected layers with a Euclidean loss function. Motivated by this work, the subsequent works made a further step into more accurate camera-LiDAR calibration in terms of the geometric constraint [86, 131], temporal correlation [131], loss design [130], feature extraction [182], feature matching [155, 135], feature fusion [182], and calibration representation [156, 179].

Figure 10: Network architecture of CalibNet. The figure is from  [86]. It takes an RGB image from a calibrated camera and a raw LiDAR point cloud as inputs, and regresses a 6-DoF transformation by an SE(3) layer.

For example, as shown in Figure 10, CalibNet [86] designed a network to predict calibration parameters that maximize the geometric and photometric consistency of images and point clouds, solving the underlying physical problem by 3D Spatial Transformers [216]. To refine the calibration model, CalibRCNN [131] presented a synthetic view and an epipolar geometry constraint to measure the photometric and geometric inaccuracies between consecutive frames; it was the first to exploit temporal information, learned by an LSTM network, in learning-based camera-LiDAR calibration. Since the output space of LiDAR-camera calibration is the 3D Special Euclidean Group ($SE(3)$) rather than the normal Euclidean space, RGGNet [130] considered Riemannian geometry constraints in the loss function, namely, it used an $SE(3)$ geodesic distance equipped with left-invariant Riemannian metrics to optimize the calibration network. LCCNet [155] exploited the cost volume layer to learn the correlation between the image and the depth map obtained from the point cloud. Because the depth map ignores the 3D geometric structure of the point cloud, FusionNet [182] leveraged PointNet++ [217] to directly learn the features from the 3D point cloud. Subsequently, a feature fusion with Ball Query [217] and an attention strategy was proposed to effectively fuse the features of images and point clouds.
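As a rough illustration of what a left-invariant distance on $SE(3)$ looks like, the sketch below combines the rotation angle between two poses with their translation difference. This is one common left-invariant choice and is not claimed to be the exact loss of RGGNet; the weights `w_rot` and `w_trans` and the function name `se3_distance` are assumptions for this example.

```python
import numpy as np


def se3_distance(T1, T2, w_rot=1.0, w_trans=1.0):
    """Weighted distance between two 4x4 rigid transforms.

    The rotation term is the geodesic angle on SO(3); the translation term is
    the Euclidean distance. Both are unchanged when T1 and T2 are multiplied
    on the left by the same rigid transform. Weights are assumed values.
    """
    R1, t1 = T1[:3, :3], T1[:3, 3]
    R2, t2 = T2[:3, :3], T2[:3, 3]
    # Angle of the relative rotation R1^T R2, recovered from its trace.
    cos_theta = (np.trace(R1.T @ R2) - 1.0) / 2.0
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return np.sqrt(w_rot * theta**2 + w_trans * np.sum((t1 - t2) ** 2))
```

Unlike a plain Euclidean loss on a stacked parameter vector, a distance of this kind measures rotation error as an angle, independent of how the rotation is parameterized.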

CFNet [156] first proposed the calibration flow for camera-LiDAR calibration, which represents the deviation between the positions of initial projected 2D points and ground truth. Compared to directly predicting extrinsic parameters, learning the calibration flow helped the network to understand the underlying geometric constraint. To build precise 2D-3D correspondences, CFNet [156] corrected the originally projected points using the estimated calibration flow. Then the efficient Perspective-n-Point (EPnP) algorithm was applied to calculate the final extrinsic parameters by RANSAC. Because RANSAC is nondifferentiable, DXQ-Net [179] further presented a probabilistic model for LiDAR-camera calibration flow, which estimates the uncertainty to measure the quality of LiDAR-camera data association. Then, the differentiable pose estimation module was designed for solving extrinsic parameters, back-propagating the extrinsic error to the flow prediction network.
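The notion of calibration flow can be made concrete with a small sketch: it is the per-point 2D offset between the projection under the initial (miscalibrated) extrinsics and the projection under the ground-truth extrinsics. The code below assumes NumPy, a pinhole intrinsic matrix `K`, and points that lie in front of the camera; the function names `project` and `calibration_flow` are hypothetical, and the subsequent EPnP and RANSAC steps are not shown.

```python
import numpy as np


def project(points, T, K):
    """Project (N, 3) LiDAR points to pixels using 4x4 extrinsics T and intrinsics K.

    Assumes every transformed point has positive depth.
    """
    cam = points @ T[:3, :3].T + T[:3, 3]
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]


def calibration_flow(points, T_init, T_gt, K):
    """Per-point 2D displacement from the initial projection to the true one."""
    return project(points, T_gt, K) - project(points, T_init, K)
```

Adding a predicted flow to the initially projected pixels yields corrected 2D locations that, paired with the original 3D points, form the 2D-3D correspondences passed to a PnP solver.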

6.2 Semantics-level Solution

Semantic features can be well learned and represented by deep neural networks. A perfect calibration accurately aligns the same instance across different sensors. To this end, some works [134, 132, 160, 45] explored guiding camera-LiDAR calibration with semantic information. SOIC [134] used semantic information to transform the initialization issue into a PnP problem on semantic centroids. Since the 3D semantic centroids of the point cloud and the 2D semantic centroids of the image cannot match precisely, a matching constraint cost function based on the semantic components was presented. SSI-Calib [132] reformulated the calibration as an optimization problem with a novel calibration quality metric based on semantic features. Then, a non-monotonic subgradient ascent algorithm was proposed to calculate the calibration parameters. Other works utilized off-the-shelf segmentation networks for point clouds and images, and optimized the calibration parameters by minimizing a semantic alignment loss in a single direction [160] or in both directions [45].

6.3 Object/Keypoint-level Solution

ATOP [181] designed an attention-based object-level matching network, i.e., Cross-Modal Matching Network to explore the overlapped FoV between camera and LiDAR, which facilitated generating the 2D-3D object-level correspondences. 2D and 3D object proposals were detected by YOLOv4 [218] and PointPillar [219]. Then, two cascaded PSO-based algorithms [220] were devised to estimate the calibration extrinsic parameters in the optimization stage. Using the deep declarative network (DDN) [221], RGKCNet [183] combined the standard neural layer and a PnP solver in the same network, formulating the 2D–3D data association and pose estimation as a bilevel optimization problem. Therefore, both the feature extraction capability of the convolutional layer and the conventional geometric solver can be employed. Microsoft’s human keypoint extraction network [222] was applied to detect the 2D–3D matching keypoints. Additionally, RGKCNet [183] presented a learnable weight layer that determines the keypoints involved in the solver, enabling the whole pipeline to be trained end-to-end.

6.4 Discussion

6.4.1 Technique Summary

Current methods can be broadly classified by the principle used to build 2D-3D matching, namely, the calibration target. In summary, most pixel-level solutions utilized an end-to-end framework to address this task. While these solutions delivered satisfactory performance on specific datasets, their generalization abilities are limited. Semantics-level and object/keypoint-level methods, derived from traditional calibration, offered both acceptable performance and generalization abilities. However, they relied heavily on the quality of front-end feature extraction.

6.4.2 Research Trend

(1) Network architecture is becoming more complex with the use of different structures for feature extraction, matching, and fusion. Current methods employ strategies like multi-scale feature extraction, cross-modal interaction, cost-volume establishment, and confidence-guided fusion.

(2) Directly regressing 6-DoF parameters yields weak generalization ability. To overcome this, intermediate representations like calibration flow have been introduced. Additionally, calibration flow can handle non-rigid transformations that are common in real-world applications.

(3) Traditional methods require specific environments but have well-designed strategies. To balance accuracy and generalization, a combination of geometric solving algorithms and learning methods has been investigated.

6.4.3 Future Effort

(1) Camera-LiDAR calibration methods typically rely on datasets like KITTI, which provide only initial extrinsic parameters. To create a decalibration dataset, researchers add noise transformations to the initial extrinsics, but this approach assumes a fixed position camera-LiDAR system with miscalibration. In real-world applications, the camera-LiDAR relative pose varies, making it challenging to collect large-scale real data with ground truth extrinsics. To address this challenge, generating synthetic camera-LiDAR data using simulation systems could be a valuable solution.

(2) To optimize the combination of networks and traditional solutions, a more compact approach is needed. Current methods mainly use networks as feature extractors, resulting in non-end-to-end pipelines with inadequate feature extraction adjustments for calibration. A deep declarative network (DDN) is a promising framework for making the entire pipeline differentiable. The aggregation of learning and traditional methods can be optimized using DDN.

(3) The most important aspect of camera-LiDAR calibration is 2D-3D matching. To achieve this, the point cloud is commonly transformed into a depth image. However, large deviations in extrinsic simulation can result in detail loss. With the great development of Transformer and cross-modal techniques, we believe leveraging Transformer to directly learn the features of image and point cloud in the same pipeline could facilitate better 2D-3D matching.

7 Benchmark

Figure 11: Overview of our collected benchmark, which covers all models reviewed in this paper. In this dataset, the image and video derive from diverse cameras under different environments. The accurate ground truth and label are provided for each sample.

As there is no public and unified benchmark in learning-based camera calibration, we contribute a dataset that can serve as a platform for generalization evaluations. In this dataset, the images and videos are captured by different cameras under diverse scenes, including simulation environments and real-world settings. Additionally, we provide the calibration ground truth, parameter label, and visual clues in this dataset based on different conditions. Figure 11 shows some samples of our collected dataset. Please refer to the evaluation of representative calibration methods on this benchmark in supplementary material.

Standard Model. We collected 300 high-resolution images from the Internet, captured by popular digital cameras from brands such as Canon, Fujifilm, Nikon, Olympus, Sigma, and Sony. For each image, we provide the specific focal length of its lens. The images cover a diverse range of subjects, including landscapes, portraits, wildlife, and architecture. The focal lengths range from 4.5 mm to 600 mm.

Distortion Model. We created a comprehensive dataset for the distortion camera model, with a focus on wide-angle cameras. The dataset comprises three subcategories. The first is a synthetic dataset, which was generated using the widely used 4th order polynomial model. It contains both circular and rectangular structures, with 1,000 distortion-rectification image pairs. The second subcategory consists of data captured under real-world settings, derived from the raw calibration data for around 40 types of wide-angle cameras. For each set of calibration data, the intrinsics, extrinsics, and distortion coefficients are available. Finally, we use a car equipped with different cameras to capture video sequences. The scenes cover both indoor and outdoor environments, including daytime and nighttime footage.
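
As an illustration of how a polynomial radial model moves points, the sketch below scales each point by 1 + k1 r^2 + k2 r^4 + ... about a distortion center. The exact form and coefficient count used to build the synthetic subset are not spelled out in this section, so the even-power parameterization, the direction of the mapping, and the function name `radial_polynomial` are assumptions made for illustration only.

```python
import numpy as np

def radial_polynomial(points, coeffs, center=(0.0, 0.0)):
    """Scale points radially by 1 + k1*r^2 + k2*r^4 + ... about `center`."""
    center = np.asarray(center, dtype=np.float64)
    pts = np.asarray(points, dtype=np.float64) - center
    # Squared radius of each point with respect to the distortion center.
    r2 = np.sum(pts ** 2, axis=1, keepdims=True)
    scale = np.ones_like(r2)
    for i, k in enumerate(coeffs, start=1):
        scale += k * r2 ** i  # adds the term k_i * r^(2i)
    return pts * scale + center
```

Passing four coefficients gives a polynomial with terms up to r^8; whether the mapping is read as distortion or rectification depends on the convention adopted.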

Cross-View Model. We selected 500 testing samples at random from each of four representative datasets (MS-COCO [28], GoogleEarth [146], GoogleMap [146], CAHomo [53]) to create a dataset for the cross-view model. It covers a range of scenarios: MS-COCO provides natural synthetic data, GoogleEarth contains aerial synthetic data, and GoogleMap offers multi-modal synthetic data. Parallax is not a factor in these three datasets, while CAHomo provides real-world data with non-planar scenes. To standardize the dataset, we converted all images to a unified format and recorded the matched points between two views. In MS-COCO, GoogleEarth, and GoogleMap, we used the four vertices of the images as the matched points. In CAHomo, we identified six matched key points within the same plane.
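
Four matched points, such as the image vertices recorded above, are enough to determine a homography. The following sketch recovers it with the standard direct linear transform (DLT); it is a generic illustration rather than the evaluation code of the benchmark, and the function name is hypothetical. It also accepts more than four pairs, for example six coplanar key points.

```python
import numpy as np

def homography_from_points(src, dst):
    """Estimate a 3x3 homography H with dst ~ H @ src from N >= 4 point pairs."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on H.
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.array(rows)
    # H is the null vector of A: the right singular vector associated
    # with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]
```

For noisy real-world matches, coordinate normalization before the SVD is usually advisable; it is omitted here to keep the sketch short.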

Cross-Sensor Model. We collected RGB and point cloud data from Apollo [223], DAIR-V2X [224], KITTI [77], KUCL [225], NuScenes [226], and ONCE [227]. Around 300 data pairs with calibration parameters are included in each category. The datasets are captured in different countries to provide enough variety. Each dataset has a different sensor setup, yielding camera-LiDAR data with varying image resolution, LiDAR scan pattern, and camera-LiDAR relative location. The image resolution ranges from 2448×2048 to 1242×375, while the LiDAR sensors are from Velodyne and Hesai, with 16, 32, 40, 64, and 128 beams. The data include not only normal surrounding multi-view images but also small-baseline multi-view data. Additionally, we added random disturbances of around 20 degrees in rotation and 1.5 meters in translation, based on classical settings [30], to simulate vibration and collision.
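
A random disturbance of this kind can be sketched as below. The sampling scheme is an assumption: the text only gives the approximate magnitudes, so this example draws a rotation about a random axis with an angle in [-20°, 20°] and a per-axis translation in [-1.5 m, 1.5 m], then left-multiplies the ground-truth extrinsic. The function name and default arguments are hypothetical.

```python
import numpy as np

def random_decalibration(T_gt, max_rot_deg=20.0, max_trans_m=1.5, rng=None):
    """Return (T_init, T_delta), where T_init = T_delta @ T_gt is a perturbed extrinsic."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample a random unit rotation axis and a bounded rotation angle.
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    # Rodrigues' formula turns the axis-angle pair into a rotation matrix.
    Kx = np.array([[0.0, -axis[2], axis[1]],
                   [axis[2], 0.0, -axis[0]],
                   [-axis[1], axis[0], 0.0]])
    R = np.eye(3) + np.sin(angle) * Kx + (1.0 - np.cos(angle)) * (Kx @ Kx)
    # Sample a bounded translation offset per axis (assumed uniform).
    t = rng.uniform(-max_trans_m, max_trans_m, size=3)
    T_delta = np.eye(4)
    T_delta[:3, :3] = R
    T_delta[:3, 3] = t
    return T_delta @ T_gt, T_delta
```

A calibration network would then be asked to predict the inverse of `T_delta` from the image and the point cloud projected with `T_init`.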

8 Future Research Directions

Camera calibration is a fundamental and challenging research topic. From the above technical reviews and limitation analysis, we can conclude there is still room for improvement with deep learning. From Section 3 to Section 6, specific future efforts are discussed for each model. In this section, we suggest more general future research directions.

8.1 Sequences

Bundle adjustment is a well-established technique central to Multi-View Stereo (MVS) and Simultaneous Localization and Mapping (SLAM) using multi-view constraints. Traditional bundle adjustment focuses on pose estimation, often under the assumption of pre-calibrated cameras, thus sidelining the nuances of camera parameter fine-tuning. While learning-based camera calibration has made significant strides, most methods are tailored for a single image. We have highlighted intrinsics calibration to underscore how sequence constraints bolster prediction accuracy. Notably, there is a burgeoning interest in integrating bundle adjustment into end-to-end deep learning pipelines. By transitioning from conventional keypoint extraction and matching to learning-based methods, recent works [228, 229, 230, 231, 232] propose differentiable bundle adjustment layers to refine pose, depth, and camera parameters together. Consequently, there is immense potential in further harnessing sequence constraints for accurate calibration. Current methods combine front-end matching with a back-end solver, which can be inefficient and unreliable in cases like fast motion. We suggest separating front-end and back-end refinements, using large models for features and introducing more trainable parameters in optimization.
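
To make the role of camera parameters in bundle adjustment concrete, the sketch below computes the reprojection residuals that such an optimization minimizes for one view. It assumes a distortion-free pinhole model; the function name and argument layout are hypothetical and not taken from any of the cited works.

```python
import numpy as np

def reprojection_residuals(K, R, t, points_3d, observations):
    """Residuals between projected 3D points (N, 3) and observed pixels (N, 2)."""
    # Move the points into the camera frame using the pose (R, t).
    cam = points_3d @ R.T + t
    # Apply the intrinsics and the perspective division.
    proj = cam @ K.T
    uv = proj[:, :2] / proj[:, 2:3]
    # Flatten to a 1-D residual vector, as expected by least-squares solvers.
    return (uv - observations).ravel()
```

Because the intrinsics `K` enter the same residual as the pose and the 3D points, they can in principle be refined jointly with them rather than being held fixed.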

8.2 Learning Target

Due to their implicit relationship to image features, conventional calibration objectives can be challenging for neural networks to learn. To this end, some works have developed novel learning targets that replace conventional calibration objectives, providing learning-friendly representations for neural networks. Additionally, intermediate geometric representations have been presented to bridge the gap between image features and calibration objectives, such as reflective amplitude coefficient maps [128], rectification flow [37], surface geometry [89], and normal flow [170]. We believe there is still great potential for designing more explicit and reasonable learning targets for calibration objectives.

8.3 Pre-training

Pre-training on ImageNet [69] has become a widely used strategy in deep learning. However, recent studies [96] have shown that this approach provides less benefit for specific camera calibration tasks, such as wide-angle camera calibration. This is due to two main reasons: the data gap and the task gap. The ImageNet dataset only contains perspective images without distortions, making the initialized weights of neural networks irrelevant to distortion models. Furthermore, He et al. [233] demonstrated that the task of ImageNet pre-training has limited benefits when the final task is more sensitive to localization. As a result, the performance of extrinsics estimation may be impacted by this task gap. Moreover, pre-training beyond a single image and a single modality, to our knowledge, has not been thoroughly investigated in the related field. We suggest that designing a customized pre-training strategy for learning-based camera calibration is an interesting area of research.

8.4 Implicit Unified Model

Deep learning-based camera calibration methods use traditional parametric camera models, which lack the flexibility to fit complex situations. Non-parametric camera models relate each pixel to its corresponding 3D observation ray, overcoming the limitations of parametric models. However, they require strict calibration targets and are more complex for undistortion, projection, and unprojection. Deep learning methods show potential for calibration tasks, making non-parametric models worth revisiting; they could potentially replace parametric models in the future. Moreover, they allow for implicit and unified calibration, fitting all camera types through pixel-level regression and avoiding explicit feature extraction and geometry solving. Researchers have combined the advantages of implicit and unified representation with the Neural Radiance Field (NeRF) for reconstructing 3D structures and synthesizing novel views. Self-calibration NeRF [191] has been proposed for generic cameras with arbitrary non-linear distortions, and end-to-end pipelines have been explored to learn depth and ego-motion without calibration targets. We believe the implicit and unified camera models could be used to optimize learning-based algorithms or be integrated into downstream 3D vision tasks.

9 Conclusion

In this paper, we present a comprehensive survey of recent efforts in the area of deep learning-based camera calibration. Our survey covers conventional camera models, classified learning paradigms and learning strategies, detailed reviews of state-of-the-art approaches, a public benchmark, and future research directions. To show the development process and the connections between existing works, we provide a fine-grained taxonomy that categorizes the literature by jointly considering camera models and applications. Moreover, the relationships, strengths, distinctions, and limitations are thoroughly discussed in each category. An open-source repository will be updated regularly with new works and datasets. We hope that this survey will promote future research in this field.

Acknowledgment

We thank Leidong Qin and Shangrong Yang at Beijing Jiaotong University for the partial dataset collection. We thank Jinlong Fan at the University of Sydney for the insightful discussion.

References

  • [1] D. C. Brown, “Close-range camera calibration,” Photogramm. Eng, vol. 37, no. 8, pp. 855–866, 1971.
  • [2] S. J. Maybank and O. D. Faugeras, “A theory of self-calibration of a moving camera,” International Journal of Computer Vision, vol. 8, no. 2, pp. 123–151, 1992.
  • [3] J. Weng, P. Cohen, M. Herniou et al., “Camera calibration with distortion models and accuracy evaluation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 10, pp. 965–980, 1992.
  • [4] Z. Zhang, “A flexible new technique for camera calibration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, pp. 1330–1334, 2000.
  • [5] D. C. Brown, “Decentering distortion of lenses,” Photogrammetric Engineering and Remote Sensing, 1966.
  • [6] Z. Zhang, “Flexible camera calibration by viewing a plane from unknown orientations,” in International Conference on Computer Vision, vol. 1, 1999, pp. 666–673.
  • [7] S. Gasparini, P. Sturm, and J. P. Barreto, “Plane-based calibration of central catadioptric cameras,” in International Conference on Computer Vision, 2009, pp. 1195–1202.
  • [8] S. Shah and J. Aggarwal, “A simple calibration procedure for fish-eye (high distortion) lens camera,” in Proceedings of IEEE International Conference on Robotics and Automation, 1994, pp. 3422–3427.
  • [9] J. P. Barreto and H. Araujo, “Geometric properties of central catadioptric line images and their application in calibration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1327–1333, 2005.
  • [10] R. Carroll, M. Agrawal, and A. Agarwala, “Optimizing content-preserving projections for wide-angle images,” in ACM Transactions on Graphics, vol. 28, no. 3, 2009, p. 43.
  • [11] F. Bukhari and M. N. Dailey, “Automatic radial distortion estimation from a single image,” Journal of Mathematical Imaging and Vision, vol. 45, no. 1, pp. 31–45, 2013.
  • [12] M. Alemán-Flores, L. Alvarez, L. Gomez, and D. Santana-Cedrés, “Automatic lens distortion correction using one-parameter division models,” Image Processing On Line, vol. 4, pp. 327–343, 2014.
  • [13] O. D. Faugeras, Q.-T. Luong, and S. J. Maybank, “Camera self-calibration: Theory and experiments,” in European Conference on Computer Vision, 1992, pp. 321–334.
  • [14] C. S. Fraser, “Digital camera self-calibration,” ISPRS Journal of Photogrammetry and Remote sensing, vol. 52, no. 4, pp. 149–159, 1997.
  • [15] R. I. Hartley, “Self-calibration from multiple views with a rotating camera,” in European Conference on Computer Vision, 1994, pp. 471–478.
  • [16] F. Camposeco, T. Sattler, and M. Pollefeys, “Non-parametric structure-based calibration of radially symmetric cameras,” in International Conference on Computer Vision, 2015, pp. 2192–2200.
  • [17] T. Schops, V. Larsson, M. Pollefeys, and T. Sattler, “Why having 10,000 parameters in your camera model is better than twelve,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2535–2544.
  • [18] L. Pan, M. Pollefeys, and V. Larsson, “Camera pose estimation using implicit distortion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12 819–12 828.
  • [19] [Online]. Available: https://docs.opencv.org/4.x/dc/dbb/tutorial_py_calibration.html
  • [20] [Online]. Available: https://www.mathworks.com/help/vision/camera-calibration.html
  • [21] J. Salvi, X. Armangué, and J. Batlle, “A comparative review of camera calibrating methods with accuracy evaluation,” Pattern Recognition, vol. 35, no. 7, pp. 1617–1635, 2002.
  • [22] C. Hughes, M. Glavin, E. Jones, and P. Denny, “Review of geometric distortion compensation in fish-eye cameras,” 2008.
  • [23] J. Fan, J. Zhang, S. J. Maybank, and D. Tao, “Wide-angle image rectification: a survey,” International Journal of Computer Vision, vol. 130, no. 3, pp. 747–776, 2022.
  • [24] S. Workman, C. Greenwell, M. Zhai, R. Baltenberger, and N. Jacobs, “Deepfocal: A method for direct focal length estimation,” in IEEE International Conference on Image Processing, 2015, pp. 1369–1373.
  • [25] A. Kendall, M. Grimes, and R. Cipolla, “Posenet: A convolutional network for real-time 6-dof camera relocalization,” in International Conference on Computer Vision, 2015.
  • [26] J. Rong, S. Huang, Z. Shang, and X. Ying, “Radial lens distortion correction using convolutional neural networks trained with synthesized images,” in Asian Conference on Computer Vision, 2016, pp. 35–49.
  • [27] V. Rengarajan, Y. Balaji, and A. Rajagopalan, “Unrolling the shutter: Cnn to correct motion distortions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2291–2299.
  • [28] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Deep image homography estimation,” arXiv preprint arXiv:1606.03798, 2016.
  • [29] Y. Hold-Geoffroy, K. Sunkavalli, J. Eisenmann, M. Fisher, E. Gambaretto, S. Hadap, and J.-F. Lalonde, “A perceptual measure for deep single image camera calibration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [30] N. Schneider, F. Piewak, C. Stiller, and U. Franke, “Regnet: Multimodal sensor registration using deep neural networks,” in IEEE intelligent vehicles symposium, 2017, pp. 1803–1810.
  • [31] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
  • [32] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
  • [33] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” Advances in Neural Information Processing Systems, vol. 27, 2014.
  • [34] K. Liao, C. Lin, Y. Zhao, and M. Gabbouj, “Dr-gan: Automatic radial distortion rectification using conditional gan in real-time,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 3, pp. 725–733, 2020.
  • [35] K. Liao, C. Lin, Y. Zhao, and M. Xu, “Model-free distortion rectification framework bridged by distortion distribution map,” IEEE Transactions on Image Processing, vol. 29, pp. 3707–3718, 2020.
  • [36] K. Liao, C. Lin, L. Liao, Y. Zhao, and W. Lin, “Multi-level curriculum for training a distortion-aware barrel distortion rectification model,” in International Conference on Computer Vision, 2021, pp. 4389–4398.
  • [37] X. Li, B. Zhang, P. V. Sander, and J. Liao, “Blind geometric distortion correction on images through deep learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [38] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
  • [39] M. Zhai, S. Workman, and N. Jacobs, “Detecting vanishing points using global image context in a non-manhattan world,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [40] O. Bogdan, V. Eckstein, F. Rameau, and J.-C. Bazin, “Deepcalib: a deep learning approach for automatic intrinsic calibration of wide field-of-view cameras,” in Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production, 2018.
  • [41] Y. Lin, R. Wiersma, S. L. Pintea, K. Hildebrandt, E. Eisemann, and J. C. van Gemert, “Deep vanishing point detection: Geometric priors make dataset variations vanish,” arXiv preprint arXiv:2203.08586, 2022.
  • [42] X. Zhou, P. Duan, Y. Ma, and B. Shi, “Evunroll: Neuromorphic events based rolling shutter image correction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 775–17 784.
  • [43] Y. Shangrong, L. Chunyu, L. Kang, and Z. Yao, “Fishformer: Annulus slicing-based transformer for fisheye rectification with efficacy domain exploration,” arXiv preprint arXiv:2207.01925, 2022.
  • [44] L. Nie, C. Lin, K. Liao, S. Liu, and Y. Zhao, “Depth-aware multi-grid deep homography estimation with contextual correlation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 7, pp. 4460–4472, 2021.
  • [45] K. Akio, Z. Yiyang, Z. Pengwei, Z. Wei, and T. Masayoshi, “Sst-calib: Simultaneous spatial-temporal parameter calibration between lidar and camera,” arXiv preprint arXiv:2207.03704, 2022.
  • [46] Y. Zhao, Z. Huang, T. Li, W. Chen, C. LeGendre, X. Ren, A. Shapiro, and H. Li, “Learning perspective undistortion of portraits,” in International Conference on Computer Vision, 2019.
  • [47] J. Tan, S. Zhao, P. Xiong, J. Liu, H. Fan, and S. Liu, “Practical wide-angle portraits correction with deep structured models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3498–3506.
  • [48] M. Kocabas, C.-H. P. Huang, J. Tesch, L. Müller, O. Hilliges, and M. J. Black, “Spec: Seeing people in the wild with an estimated camera,” in International Conference on Computer Vision, 2021, pp. 11 035–11 045.
  • [49] P. Liu, Z. Cui, V. Larsson, and M. Pollefeys, “Deep shutter unrolling network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5941–5949.
  • [50] F. Zhu, S. Zhao, P. Wang, H. Wang, H. Yan, and S. Liu, “Semi-supervised wide-angle portraits correction by multi-scale transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19 689–19 698.
  • [51] R. Zhu, X. Yang, Y. Hold-Geoffroy, F. Perazzi, J. Eisenmann, K. Sunkavalli, and M. Chandraker, “Single view metrology in the wild,” in European Conference on Computer Vision, 2020, pp. 316–333.
  • [52] T. Nguyen, S. W. Chen, S. S. Shivakumar, C. J. Taylor, and V. Kumar, “Unsupervised deep homography: A fast and robust homography estimation model,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 2346–2353, 2018.
  • [53] J. Zhang, C. Wang, S. Liu, L. Jia, N. Ye, J. Wang, J. Zhou, and J. Sun, “Content-aware unsupervised deep homography estimation,” in European Conference on Computer Vision, 2020, pp. 653–669.
  • [54] N. Ye, C. Wang, H. Fan, and S. Liu, “Motion basis learning for unsupervised deep homography estimation with subspace projection,” in International Conference on Computer Vision, 2021, pp. 13 117–13 125.
  • [55] M. Hong, Y. Lu, N. Ye, C. Lin, Q. Zhao, and S. Liu, “Unsupervised homography estimation with coplanarity-aware gan,” arXiv preprint arXiv:2205.03821, 2022.
  • [56] S. Liu, N. Ye, C. Wang, J. Zhang, L. Jia, K. Luo, J. Wang, and J. Sun, “Content-aware unsupervised deep homography estimation and its extensions,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 2849–2863, 2022.
  • [57] S. Yang, C. Lin, K. Liao, Y. Zhao, and M. Liu, “Unsupervised fisheye image correction through bidirectional loss with geometric prior,” Journal of Visual Communication and Image Representation, vol. 66, p. 102692, 2020.
  • [58] X. Wang, C. Wang, B. Liu, X. Zhou, L. Zhang, J. Zheng, and X. Bai, “Multi-view stereo in the deep learning era: A comprehensive review,” Displays, vol. 70, p. 102102, 2021.
  • [59] J. Fan, J. Zhang, and D. Tao, “Sir: Self-supervised image rectification via seeing the same scene from multiple different lenses,” IEEE Transactions on Image Processing, 2022.
  • [60] J. Fang, I. Vasiljevic, V. Guizilini, R. Ambrus, G. Shakhnarovich, A. Gaidon, and M. R. Walter, “Self-supervised camera self-calibration from video,” arXiv preprint arXiv:2112.03325, 2021.
  • [61] J. Zhao, S. Wei, L. Liao, and Y. Zhao, “Dqn-based gradual fisheye image rectification,” Pattern Recognition Letters, vol. 152, pp. 129–134, 2021.
  • [62] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [63] K. Wilson and N. Snavely, “Robust global translations with 1dsfm,” in European Conference on Computer Vision, 2014, pp. 61–75.
  • [64] [Online]. Available: https://www.repository.cam.ac.uk/handle/1810/251342;jsessionid=90AB1617B8707CD387CBF67437683F77
  • [65] S. Workman, M. Zhai, and N. Jacobs, “Horizon lines in the wild,” arXiv preprint arXiv:1604.02129, 2016.
  • [66] [Online]. Available: https://mvrl.cse.wustl.edu/datasets/hlw/
  • [67] P. Denis, J. H. Elder, and F. J. Estrada, “Efficient edge-based methods for estimating manhattan frames in urban imagery,” in European Conference on Computer Vision, 2008, pp. 197–210.
  • [68] O. Barinova, V. Lempitsky, E. Tretiak, and P. Kohli, “Geometric image parsing in man-made environments,” in European Conference on Computer Vision, 2010, pp. 57–70.
  • [69] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  • [70] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision, 2014, pp. 740–755.
  • [71] C.-H. Chang, C.-N. Chou, and E. Y. Chang, “Clkn: Cascaded lucas-kanade networks for image alignment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [72] F. Erlik Nowruzi, R. Laganiere, and N. Japkowicz, “Homography estimation from image pairs with hierarchical convolutional networks,” in International Conference on Computer Vision Workshops, 2017.
  • [73] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 3485–3492.
  • [74] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
  • [75] H. Shao, T. Svoboda, and L. Van Gool, “Zubud-zurich buildings database for image based recognition,” Computer Vision Lab, Swiss Federal Institute of Technology, vol. 260, no. 20, p. 6, 2003.
  • [76] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” in Workshop on Faces in ‘Real-Life’ Images: Detection, Alignment, and Recognition, 2008.
  • [77] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361.
  • [78] J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba, “Recognizing scene viewpoint using panoramic place representation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2695–2702.
  • [79] X. Yin, X. Wang, J. Yu, M. Zhang, P. Fua, and D. Tao, “Fisheyerecnet: A multi-context collaborative deep network for fisheye image rectification,” in European Conference on Computer Vision, 2018.
  • [80] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the IEEE Computer Vision and Pattern Recognition, 2017, pp. 633–641.
  • [81] Y. Shi, D. Zhang, J. Wen, X. Tong, X. Ying, and H. Zha, “Radial lens distortion correction by adding a weight layer with inverted foveal models to convolutional neural networks,” in International Conference on Pattern Recognition, 2018.
  • [82] R. Ranftl and V. Koltun, “Deep fundamental matrix estimation,” in European Conference on Computer Vision, 2018.
  • [83] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, “Tanks and temples: Benchmarking large-scale scene reconstruction,” ACM Transactions on Graphics, vol. 36, no. 4, pp. 1–13, 2017.
  • [84] O. Poursaeed, G. Yang, A. Prakash, Q. Fang, H. Jiang, B. Hariharan, and S. Belongie, “Deep fundamental matrix estimation without correspondences,” in European Conference on Computer Vision Workshops, 2018.
  • [85] R. Zeng, S. Denman, S. Sridharan, and C. Fookes, “Rethinking planar homography estimation using perspective fields,” in Asian Conference on Computer Vision, 2018, pp. 571–586.
  • [86] G. Iyer, R. K. Ram, J. K. Murthy, and K. M. Krishna, “Calibnet: Geometrically supervised extrinsic calibration using 3d spatial transformer networks,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2018, pp. 1110–1117.
  • [87] C.-K. Chang, J. Zhao, and L. Itti, “Deepvp: Deep learning for vanishing point detection on 1 million street view images,” in IEEE International Conference on Robotics and Automation, 2018, pp. 4496–4503.
  • [88] M. Lopez, R. Mari, P. Gargallo, Y. Kuang, J. Gonzalez-Jimenez, and G. Haro, “Deep single image camera calibration with radial distortion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [89] W. Xian, Z. Li, M. Fisher, J. Eisenmann, E. Shechtman, and N. Snavely, “Uprightnet: Geometry-aware camera orientation estimation from single images,” in International Conference on Computer Vision, 2019.
  • [90] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang, and S. Leutenegger, “Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset,” arXiv preprint arXiv:1809.00716, 2018.
  • [91] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5828–5839.
  • [92] B. Zhuang, Q.-H. Tran, G. H. Lee, L. F. Cheong, and M. Chandraker, “Degeneracy in self-calibration revisited and a deep learning solution for uncalibrated slam,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2019, pp. 3766–3773.
  • [93] S. Ammar Abbas and A. Zisserman, “A geometric approach to obtain a bird’s eye view from an image,” in International Conference on Computer Vision Workshops, 2019.
  • [94] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla: An open urban driving simulator,” in Conference on Robot Learning, 2017, pp. 1–16.
  • [95] O. Barinova, V. Lempitsky, E. Tretiak, and P. Kohli, “Geometric image parsing in man-made environments,” in European Conference on Computer Vision, 2010, pp. 57–70.
  • [96] K. Liao, C. Lin, Y. Zhao, and M. Gabbouj, “Distortion rectification from static to dynamic: A distortion sequence construction perspective,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 11, pp. 3870–3882, 2020.
  • [97] R. Jung, A. S. J. Lee, A. Ashtari, and J.-C. Bazin, “Deep360up: A deep learning-based approach for automatic vr image upright adjustment,” in IEEE Conference on Virtual Reality and 3D User Interfaces, 2019.
  • [98] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab, “Robust optimization for deep regression,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2830–2838.
  • [99] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–1464, 2018.
  • [100] B. Zhuang, Q.-H. Tran, P. Ji, L.-F. Cheong, and M. Chandraker, “Learning structure-and-motion-aware rolling shutter correction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [101] Z. Xue, N. Xue, G.-S. Xia, and W. Shen, “Learning to calibrate straight lines for fisheye image rectification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [102] K. Huang, Y. Wang, Z. Zhou, T. Ding, S. Gao, and Y. Ma, “Learning to parse wireframes in images of man-made environments,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 626–635.
  • [103] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, “Semantic scene completion from a single depth image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1746–1754.
  • [104] L. Yin, X. Sun, T. Worm, and M. Reale, “A high-resolution 3d dynamic facial expression database,” in IEEE International Conference on Automatic Face and Gesture Recognition, Amsterdam, The Netherlands, vol. 126, 2008.
  • [105] Y. Zhou, H. Qi, J. Huang, and Y. Ma, “Neurvps: Neural vanishing point scanning via conic convolution,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [106] Y. Zhou, H. Qi, Y. Zhai, Q. Sun, Z. Chen, L.-Y. Wei, and Y. Ma, “Learning to reconstruct 3d manhattan wireframes from a single image,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7698–7707.
  • [107] L. Sha, J. Hobbs, P. Felsen, X. Wei, P. Lucey, and S. Ganguly, “End-to-end camera calibration for broadcast videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • [108] N. Homayounfar, S. Fidler, and R. Urtasun, “Sports field localization via deep structured models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5212–5220.
  • [109] J. Lee, M. Sung, H. Lee, and J. Kim, “Neural geometric parser for single image camera calibration,” in European Conference on Computer Vision, 2020, pp. 541–557.
  • [110] [Online]. Available: https://developers.google.com/maps/
  • [111] A. Cramariuc, A. Petrov, R. Suri, M. Mittal, R. Siegwart, and C. Cadena, “Learning camera miscalibration detection,” in IEEE International Conference on Robotics and Automation, 2020, pp. 4997–5003.
  • [112] C. Zhang, F. Rameau, J. Kim, D. M. Argaw, J.-C. Bazin, and I. S. Kweon, “Deepptz: Deep self-calibration for ptz cameras,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020.
  • [113] H. Le, F. Liu, S. Zhang, and A. Agarwala, “Deep homography estimation for dynamic scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • [114] B. Davidson, M. S. Alvi, and J. F. Henriques, “360° camera alignment via segmentation,” in European Conference on Computer Vision, 2020, pp. 579–595.
  • [115] Y.-Y. Jau, R. Zhu, H. Su, and M. Chandraker, “Deep keypoint-based camera pose estimation with geometric constraints,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 4950–4957.
  • [116] X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, and R. Yang, “The apolloscape open dataset for autonomous driving and its application,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 10, pp. 2702–2719, 2019.
  • [117] Y.-H. Li, I.-C. Lo, and H. H. Chen, “Deep face rectification for 360° dual-fisheye cameras,” IEEE Transactions on Image Processing, vol. 30, pp. 264–276, 2021.
  • [118] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset and benchmark for large-scale face recognition,” in European Conference on Computer Vision, 2016, pp. 87–102.
  • [119] Y. Shi, X. Tong, J. Wen, H. Zhao, X. Ying, and H. Zha, “Position-aware and symmetry enhanced gan for radial distortion correction,” in International Conference on Pattern Recognition, 2021, pp. 1701–1708.
  • [120] H. Zhao, Y. Shi, X. Tong, X. Ying, and H. Zha, “A simple yet effective pipeline for radial distortion correction,” in IEEE International Conference on Image Processing, 2020, pp. 878–882.
  • [121] C.-H. Chao, P.-L. Hsu, H.-Y. Lee, and Y.-C. F. Wang, “Self-supervised deep learning for fisheye image rectification,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 2248–2252.
  • [122] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, “Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv preprint arXiv:1506.03365, 2015.
  • [123] H. Zhao, X. Ying, Y. Shi, X. Tong, J. Wen, and H. Zha, “Rdcface: Radial distortion correction for face recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • [124] F. Wang, L. Chen, C. Li, S. Huang, Y. Chen, C. Qian, and C. C. Loy, “The devil of face recognition is in the noise,” in European Conference on Computer Vision, 2018, pp. 765–780.
  • [125] Z.-C. Xue, N. Xue, and G.-S. Xia, “Fisheye distortion rectification from deep straight lines,” arXiv preprint arXiv:2003.11386, 2020.
  • [126] M. Baradad and A. Torralba, “Height and uprightness invariance for 3d prediction from a single view,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • [127] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European Conference on Computer Vision, 2012, pp. 746–760.
  • [128] Q. Zheng, J. Chen, Z. Lu, B. Shi, X. Jiang, K.-H. Yap, L.-Y. Duan, and A. C. Kot, “What does plate glass reveal about camera calibration?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • [129] [Online]. Available: https://figshare.com/articles/dataset/FocaLens/3399169/2
  • [130] K. Yuan, Z. Guo, and Z. J. Wang, “Rggnet: Tolerance aware lidar-camera online calibration with geometric deep learning and generative model,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6956–6963, 2020.
  • [131] J. Shi, Z. Zhu, J. Zhang, R. Liu, Z. Wang, S. Chen, and H. Liu, “Calibrcnn: Calibrating camera and lidar by recurrent convolutional neural network and geometric constraints,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2020, pp. 10 197–10 202.
  • [132] Y. Zhu, C. Li, and Y. Zhang, “Online camera-lidar calibration with sensor semantic information,” in IEEE International Conference on Robotics and Automation, 2020, pp. 4970–4976.
  • [133] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results,” http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
  • [134] W. Wang, S. Nobuhara, R. Nakamura, and K. Sakurada, “Soic: Semantic online initialization and calibration for lidar and camera,” arXiv preprint arXiv:2003.04260, 2020.
  • [135] S. Wu, A. Hadachi, D. Vivet, and Y. Prabhakar, “Netcalib: A novel approach for lidar-camera auto-calibration based on deep learning,” in International Conference on Pattern Recognition, 2021, pp. 6648–6655.
  • [136] Y. Li, W. Pei, and Z. He, “Srhen: stepwise-refining homography estimation network via parsing geometric correspondences in deep latent space,” in Proceedings of the ACM International Conference on Multimedia, 2020, pp. 3063–3071.
  • [137] Y. Gil, S. Elmalem, H. Haim, E. Marom, and R. Giryes, “Online training of stereo self-calibration using monocular depth estimation,” IEEE Transactions on Computational Imaging, vol. 7, pp. 812–823, 2021.
  • [138] [Online]. Available: http://www.cs.toronto.edu/~harel/TAUAgent/download.html
  • [139] J. Lee, H. Go, H. Lee, S. Cho, M. Sung, and J. Kim, “Ctrl-c: Camera calibration transformer with line-classification,” in International Conference on Computer Vision, 2021, pp. 16 228–16 237.
  • [140] N. Wakai and T. Yamashita, “Deep single fisheye image camera calibration for over 180-degree projection of field of view,” in International Conference on Computer Vision Workshops, 2021, pp. 1174–1183.
  • [141] P. Mirowski, A. Banki-Horvath, K. Anderson, D. Teplyashin, K. M. Hermann, M. Malinowski, M. K. Grimes, K. Simonyan, K. Kavukcuoglu, A. Zisserman et al., “The streetlearn environment and dataset,” arXiv preprint arXiv:1903.01292, 2019.
  • [142] K. Liao, C. Lin, and Y. Zhao, “A deep ordinal distortion estimation approach for distortion rectification,” IEEE Transactions on Image Processing, vol. 30, pp. 3362–3375, 2021.
  • [143] K. Zhao, C. Lin, K. Liao, S. Yang, and Y. Zhao, “Revisiting radial distortion rectification in polar-coordinates: A new and efficient learning perspective,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 6, pp. 3552–3560, 2021.
  • [144] A. Eichenseer and A. Kaup, “A data set providing synthetic and real-world fisheye video sequences,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 1541–1545.
  • [145] S. Yang, C. Lin, K. Liao, C. Zhang, and Y. Zhao, “Progressively complementary network for fisheye image rectification using appearance flow,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6348–6357.
  • [146] Y. Zhao, X. Huang, and Z. Zhang, “Deep lucas-kanade homography for multimodal image alignment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 950–15 959.
  • [147] R. Shao, G. Wu, Y. Zhou, Y. Fu, L. Fang, and Y. Liu, “Localtrans: A multiscale local transformer network for cross-resolution homography estimation,” in International Conference on Computer Vision, 2021, pp. 14 890–14 899.
  • [148] Y. Chen, G. Wang, P. An, Z. You, and X. Huang, “Fast and accurate homography estimation using extendable compression network,” in International Conference on Image Processing, 2021, pp. 1024–1028.
  • [149] L. Nie, C. Lin, K. Liao, S. Liu, and Y. Zhao, “Unsupervised deep image stitching: Reconstructing stitched features to images,” IEEE Transactions on Image Processing, vol. 30, pp. 6184–6197, 2021.
  • [150] S. Garg, D. P. Mohanty, S. P. Thota, and S. Moharana, “A simple approach to image tilt correction with self-attention mobilenet for smartphones,” arXiv preprint arXiv:2111.00398, 2021.
  • [151] K. Chen, N. Snavely, and A. Makadia, “Wide-baseline relative camera pose estimation with directional learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3258–3268.
  • [152] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,” arXiv preprint arXiv:1709.06158, 2017.
  • [153] Z. Zhong, Y. Zheng, and I. Sato, “Towards rolling shutter correction and deblurring in dynamic scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9219–9228.
  • [154] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang, “Fast and accurate image super-resolution with deep laplacian pyramid networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 11, pp. 2599–2613, 2018.
  • [155] X. Lv, B. Wang, Z. Dou, D. Ye, and S. Wang, “Lccnet: Lidar and camera self-calibration using cost volume network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2894–2901.
  • [156] X. Lv, S. Wang, and D. Ye, “Cfnet: Lidar-camera registration using calibration flow network,” Sensors, vol. 21, no. 23, p. 8112, 2021.
  • [157] Y. Liao, J. Xie, and A. Geiger, “Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [158] B. Fan and Y. Dai, “Inverting a rolling shutter camera: bring rolling shutter images to high framerate global shutter video,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 4228–4237.
  • [159] B. Fan, Y. Dai, and M. He, “Sunet: symmetric undistortion network for rolling shutter correction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 4541–4550.
  • [160] Z. Liu, H. Tang, S. Zhu, and S. Han, “Semalign: Annotation-free camera-lidar calibration with semantic alignment loss,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021, pp. 8845–8851.
  • [161] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The euroc micro aerial vehicle datasets,” The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.
  • [162] M. Schönbein, T. Strauß, and A. Geiger, “Calibrating and centering quasi-central catadioptric cameras,” in IEEE International Conference on Robotics and Automation, 2014, pp. 4443–4450.
  • [163] T. H. Butt and M. Taj, “Camera calibration through camera projection loss,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2022, pp. 2649–2653.
  • [164] X. Li, F. Flohr, Y. Yang, H. Xiong, M. Braun, S. Pan, K. Li, and D. M. Gavrila, “A new benchmark for vision-based cyclist detection,” in IEEE Intelligent Vehicles Symposium, 2016, pp. 1028–1033.
  • [165] S.-Y. Cao, J. Hu, Z. Sheng, and H.-L. Shen, “Iterative deep homography estimation,” arXiv preprint arXiv:2203.15982, 2022.
  • [166] M. Cao, Z. Zhong, J. Wang, Y. Zheng, and Y. Yang, “Learning adaptive warping for real-world rolling shutter correction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 785–17 793.
  • [167] T. Do, O. Miksik, J. DeGol, H. S. Park, and S. N. Sinha, “Learning to detect scene landmarks for camera localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 132–11 142.
  • [168] T. Do, K. Vuong, S. I. Roumeliotis, and H. S. Park, “Surface normal estimation of tilted images via spatial rectifier,” in European Conference on Computer Vision, 2020, pp. 265–280.
  • [169] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2930–2937.
  • [170] C. M. Parameshwara, G. Hari, C. Fermüller, N. J. Sanket, and Y. Aloimonos, “Diffposenet: Direct differentiable camera pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 6845–6854.
  • [171] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, “Tartanair: A dataset to push the limits of visual slam,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2020, pp. 4909–4916.
  • [172] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 573–580.
  • [173] B. J. Pijnacker Hordijk, K. Y. Scheper, and G. C. De Croon, “Vertical landing for micro air vehicles using event-based optical flow,” Journal of Field Robotics, vol. 35, no. 1, pp. 69–90, 2018.
  • [174] L. Yang, R. Shrestha, W. Li, S. Liu, G. Zhang, Z. Cui, and P. Tan, “Scenesqueezer: Learning to compress scene for camera relocalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8259–8268.
  • [175] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 year, 1000 km: The oxford robotcar dataset,” The International Journal of Robotics Research, vol. 36, no. 1, pp. 3–15, 2017.
  • [176] G. Ponimatkin, Y. Labbé, B. Russell, M. Aubry, and J. Sivic, “Focal length and object pose estimation via render and compare,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3825–3834.
  • [177] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman, “Pix3d: Dataset and methods for single-image 3d shape modeling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2974–2983.
  • [178] Y. Wang, X. Tan, Y. Yang, X. Liu, E. Ding, F. Zhou, and L. S. Davis, “3d pose estimation for fine-grained object categories,” in European Conference on Computer Vision Workshops, 2018.
  • [179] X. Jing, X. Ding, R. Xiong, H. Deng, and Y. Wang, “Dxq-net: Differentiable lidar-camera extrinsic calibration using quality-aware flow,” arXiv preprint arXiv:2203.09385, 2022.
  • [180] Y. Zhang, X. Zhao, and D. Qian, “Learning-based framework for camera calibration with distortion correction and high precision feature detection,” arXiv preprint arXiv:2202.00158, 2022.
  • [181] Y. Sun, J. Li, Y. Wang, X. Xu, X. Yang, and Z. Sun, “Atop: An attention-to-optimization approach for automatic lidar-camera calibration via cross-modal object matching,” IEEE Transactions on Intelligent Vehicles, 2022.
  • [182] G. Wang, J. Qiu, Y. Guo, and H. Wang, “Fusionnet: Coarse-to-fine extrinsic calibration network of lidar and camera with hierarchical point-pixel fusion,” in International Conference on Robotics and Automation, 2022, pp. 8964–8970.
  • [183] C. Ye, H. Pan, and H. Gao, “Keypoint-based lidar-camera online calibration with robust geometric network,” IEEE Transactions on Instrumentation and Measurement, vol. 71, pp. 1–11, 2021.
  • [184] N. Wakai, S. Sato, Y. Ishii, and T. Yamashita, “Rethinking generic camera models for deep single image camera calibration to recover rotation and fisheye distortion,” in European Conference on Computer Vision, vol. 13678, 2022, pp. 679–698.
  • [185] S.-H. Chang, C.-Y. Chiu, C.-S. Chang, K.-W. Chen, C.-Y. Yao, R.-R. Lee, and H.-K. Chu, “Generating 360 outdoor panorama dataset with reliable sun position estimation,” in SIGGRAPH Asia, 2018, pp. 1–2.
  • [186] C. Wu, “Towards linear-time incremental structure from motion,” in International Conference on 3D Vision, 2013, pp. 127–134.
  • [187] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan et al., “Searching for mobilenetv3,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1314–1324.
  • [188] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660.
  • [189] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “Pointcnn: Convolution on x-transformed points,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [190] Z. Wang, S. Wu, W. Xie, M. Chen, and V. A. Prisacariu, “Nerf--: Neural radiance fields without known camera parameters,” arXiv preprint arXiv:2102.07064, 2021.
  • [191] Y. Jeong, S. Ahn, C. Choy, A. Anandkumar, M. Cho, and J. Park, “Self-calibrating neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5846–5854.
  • [192] P. Truong, M.-J. Rakotosaona, F. Manhardt, and F. Tombari, “Sparf: Neural radiance fields from sparse and noisy poses,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4190–4200.
  • [193] K. Wang, Z. Yan, H. Tian, Z. Zhang, X. Li, J. Li, and J. Yang, “Altnerf: Learning robust neural radiance field via alternating depth-pose optimization,” arXiv preprint arXiv:2308.10001, 2023.
  • [194] W. Bian, Z. Wang, K. Li, J.-W. Bian, and V. A. Prisacariu, “Nope-nerf: Optimising neural radiance field with no pose prior,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4160–4169.
  • [195] Q. Meng, A. Chen, H. Luo, M. Wu, H. Su, L. Xu, X. He, and J. Yu, “Gnerf: Gan-based neural radiance field without posed camera,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6351–6361.
  • [196] S.-F. Chng, S. Ramasinghe, J. Sherrah, and S. Lucey, “Gaussian activated neural radiance fields for high fidelity reconstruction and pose estimation,” in European Conference on Computer Vision, 2022, pp. 264–280.
  • [197] Y. Xia, H. Tang, R. Timofte, and L. Van Gool, “Sinerf: Sinusoidal neural radiance fields for joint pose estimation and scene reconstruction,” arXiv preprint arXiv:2210.04553, 2022.
  • [198] C.-H. Lin, W.-C. Ma, A. Torralba, and S. Lucey, “Barf: Bundle-adjusting neural radiance fields,” in International Conference on Computer Vision, 2021, pp. 5741–5751.
  • [199] L. Yen-Chen, P. Florence, J. T. Barron, A. Rodriguez, P. Isola, and T.-Y. Lin, “iNeRF: Inverting neural radiance fields for pose estimation,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021.
  • [200] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Transactions on Graphics, vol. 41, no. 4, pp. 1–15, 2022.
  • [201] M. Tancik, E. Weber, E. Ng, R. Li, B. Yi, T. Wang, A. Kristoffersen, J. Austin, K. Salahi, A. Ahuja et al., “Nerfstudio: A modular framework for neural radiance field development,” in ACM SIGGRAPH Conference Proceedings, 2023, pp. 1–12.
  • [202] W. Xian, A. Božič, N. Snavely, and C. Lassner, “Neural lens modeling,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8435–8445.
  • [203] S. Zhu, A. Kumar, M. Hu, and X. Liu, “Tame a wild camera: In-the-wild monocular camera calibration,” arXiv preprint arXiv:2306.10988, 2023.
  • [204] L. Jin, J. Zhang, Y. Hold-Geoffroy, O. Wang, K. Blackburn-Matzen, M. Sticha, and D. F. Fouhey, “Perspective fields for single image camera calibration,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 307–17 316.
  • [205] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [206] R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge University Press, 2003.
  • [207] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in Proceedings of the International Joint Conference on Artificial Intelligence, Vancouver, 1981, pp. 674–679.
  • [208] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for efficient cnn architecture design,” in European Conference on Computer Vision, 2018, pp. 116–131.
  • [209] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
  • [210] L. Nie, C. Lin, K. Liao, and Y. Zhao, “Learning edge-preserved image stitching from multi-scale deep homography,” Neurocomputing, vol. 491, pp. 533–543, 2022.
  • [211] S. Baker and I. Matthews, “Lucas-kanade 20 years on: A unifying framework,” International Journal of Computer Vision, vol. 56, no. 3, pp. 221–255, 2004.
  • [212] J. Nocedal and S. J. Wright, Numerical optimization. Springer, 1999.
  • [213] Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms for optical flow,” in European Conference on Computer Vision, 2020, pp. 402–419.
  • [214] Y. Li, W. Pei, and Z. He, “Ssorn: Self-supervised outlier removal network for robust homography estimation,” arXiv preprint arXiv:2208.14093, 2022.
  • [215] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [216] A. Handa, M. Bloesch, V. Pătrăucean, S. Stent, J. McCormac, and A. Davison, “gvnn: Neural network library for geometric computer vision,” in European Conference on Computer Vision. Springer, 2016, pp. 67–82.
  • [217] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [218] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020.
  • [219] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 697–12 705.
  • [220] R. Poli, J. Kennedy, and T. Blackwell, “Particle swarm optimization,” Swarm Intelligence, vol. 1, no. 1, pp. 33–57, 2007.
  • [221] S. Gould, R. Hartley, and D. Campbell, “Deep declarative networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 8, pp. 3988–4004, 2021.
  • [222] B. Xiao, H. Wu, and Y. Wei, “Simple baselines for human pose estimation and tracking,” in European Conference on Computer Vision, 2018, pp. 466–481.
  • [223] X. Huang, P. Wang, X. Cheng, D. Zhou, Q. Geng, and R. Yang, “The apolloscape open dataset for autonomous driving and its application,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 10, pp. 2702–2719, 2019.
  • [224] H. Yu, Y. Luo, M. Shu, Y. Huo, Z. Yang, Y. Shi, Z. Guo, H. Li, X. Hu, J. Yuan et al., “Dair-v2x: A large-scale dataset for vehicle-infrastructure cooperative 3d object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21 361–21 370.
  • [225] J. Kang and N. L. Doh, “Automatic targetless camera–LIDAR calibration by aligning edge with Gaussian mixture model,” Journal of Field Robotics, vol. 37, no. 1, pp. 158–179, 2020.
  • [226] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 621–11 631.
  • [227] J. Mao, M. Niu, C. Jiang, X. Liang, Y. Li, C. Ye, W. Zhang, Z. Li, J. Yu, C. Xu et al., “One million scenes for autonomous driving: Once dataset,” 2021.
  • [228] C. Tang and P. Tan, “Ba-net: Dense bundle adjustment network,” arXiv preprint arXiv:1806.04807, 2018.
  • [229] Z. Teed and J. Deng, “Deepv2d: Video to depth with differentiable structure from motion,” arXiv preprint arXiv:1812.04605, 2018.
  • [230] X. Wei, Y. Zhang, Z. Li, Y. Fu, and X. Xue, “Deepsfm: Structure from motion via deep bundle adjustment,” in European Conference on Computer Vision, 2020, pp. 230–247.
  • [231] X. Gu, W. Yuan, Z. Dai, S. Zhu, C. Tang, Z. Dong, and P. Tan, “Dro: Deep recurrent optimizer for video to depth,” IEEE Robotics and Automation Letters, vol. 8, no. 5, pp. 2844–2851, 2023.
  • [232] A. Hagemann, M. Knorr, and C. Stiller, “Deep geometry-aware camera self-calibration from video,” in International Conference on Computer Vision, 2023, pp. 3438–3448.
  • [233] K. He, R. Girshick, and P. Dollár, “Rethinking imagenet pre-training,” in International Conference on Computer Vision, 2019, pp. 4918–4927.
