PnLCalib: Sports Field Registration via
Points and Lines Optimization

Marc Gutiérrez-Pérez and Antonio Agudo
Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Spain

Abstract

Camera calibration in broadcast sports videos presents numerous challenges for accurate sports field registration due to multiple camera angles, varying camera parameters, and frequent occlusions of the field. Traditional search-based methods depend on initial camera pose estimates, which can struggle in non-standard positions and dynamic environments. In response, we propose an optimization-based calibration pipeline that leverages a 3D soccer field model and a predefined set of keypoints to overcome these limitations. Our method also introduces a novel refinement module that improves initial calibration by using detected field lines in a non-linear optimization process. This approach outperforms existing techniques in both multi-view and single-view 3D camera calibration tasks, while maintaining competitive performance in homography estimation. Extensive experimentation on real-world soccer datasets, including SoccerNet-Calibration, WorldCup 2014, and TS-WorldCup, highlights the robustness and accuracy of our method across diverse broadcast scenarios. Our approach offers significant improvements in camera calibration precision and reliability.¹

¹ https://github.com/mguti97/PnLCalib

Index Terms:

Camera Calibration, Homography Estimation, Sports Analytics, SoccerNet, World Cup.

I Introduction

Sports analytics has become an increasingly vital component in modern sports, transforming the way teams, coaches, and fans understand and optimize athletic performance. The proliferation of advanced tracking technologies, such as player and ball tracking systems, has enabled the generation of rich, high-resolution data that provides unprecedented insights into the dynamics of sports competitions. This wealth of tracking data has revolutionized the way sports are analyzed, allowing for more informed decision-making, enhanced player development, and the identification of strategic advantages. Leveraging these data-driven insights has become a key competitive edge, as teams strive to gain a deeper understanding of player movements, team tactics, and in-game patterns. The ability to accurately capture and analyze tracking data has become a crucial aspect of sports analytics, fueling innovations in areas like player performance optimization, injury prevention [4], and the development of advanced coaching strategies [51].

While the proliferation of wearable tracking devices has been instrumental in generating sports performance data, the use of computer vision techniques has emerged as a compelling alternative approach. By leveraging advanced computer vision algorithms, researchers and sports organizations can now extract valuable tracking data directly from video footage, without the need for intrusive wearable sensors [35]. This camera-based tracking approach offers several advantages, including the ability to capture data from multiple athletes simultaneously, the elimination of potential interference or disconnection issues associated with wearables, and the potential for retroactive analysis of historical game footage. Computer vision-based tracking leverages techniques such as object detection, object tracking, and pose estimation to accurately identify and monitor the movements of players, balls, and other key elements within a sports environment. This data-driven, non-invasive approach to tracking has become increasingly sophisticated, enabling the generation of rich, high-fidelity datasets that can provide deeper insights into athletic performance and team dynamics.

The applications of computer vision in sports extend far beyond just tracking player and ball motions. Innovative computer vision tools have been leveraged to enhance various aspects of the sports experience. One prominent example is the use of semi-automatic offside detection systems, which leverage computer vision algorithms to quickly and accurately determine offside positions during live matches, providing crucial support to referees and improving the fairness and pace of the game. Additionally, computer vision techniques are being employed to generate real-time graphics and overlays for sports broadcasts, seamlessly integrating information such as player statistics, team formations, and tactical visualizations. This enhanced visual experience not only informs and engages the audience but also creates new opportunities for data-driven storytelling and fan engagement. Looking ahead, the continued advancements in computer vision are poised to revolutionize various facets of sports, from automated refereeing and in-depth performance analytics to immersive fan experiences and the integration of augmented reality into the viewing experience.

While the advancements in sports analytics and computer vision have enabled unprecedented insights and experiences, one crucial aspect that underpins these capabilities is the accurate calibration of cameras used to capture sports footage. Camera calibration refers to the process of determining both intrinsic and extrinsic parameters of a camera system, which is essential for transforming 2D image data into meaningful 3D representations of the sports environment. While sports fields, with their well-defined dimensions [42], serve as calibration objects, achieving accurate camera calibration in the broadcast setting poses challenges due to multiple camera views, focal length variability and partial occlusion of the court, hindering the matching process between 2D and 3D correspondences.

Figure 1: Overview of our proposed framework. Top: Training data generation pipeline. Beginning with SoccerNet [8] annotations, we utilize field line extraction and ellipse fitting to establish a hierarchical structure for computing each set of keypoints. Bottom: Inference stage pipeline. The encoder-decoder networks produce heatmaps for keypoints and extremities of soccer field lines to extract their positions in the image space. The obtained keypoint set is augmented with intersections of lines generated by the second model to ensure a sufficient number of points. After initial calibration, our PnL refinement module is applied to further refine the calibration estimate by jointly using detected points and lines information.

Traditional sports field registration relied on feature-based methods [37], such as detecting and matching local features like SIFT [27] and MSER [31] to estimate pairwise correspondences and compute a homography matrix by using RANSAC [15]. The recent surge in deep learning has led to several data-driven approaches leveraging Convolutional Neural Networks (CNNs) for feature extraction, showing promising results in sports field registration. These methods include field-specific feature prediction [9, 33, 7, 34, 23, 14, 19] and direct homography matrix regression [26, 45]. Other researchers have investigated camera calibration as a search problem [39, 5, 38, 52, 53, 32], generating camera pose databases and refining estimates to improve calibration accuracy. Moreover, some approaches [2, 18, 16, 9, 33, 10] leverage temporal calibration consistency between video frames, intending to better align with the nature of sports video broadcasts. Focusing on the soccer domain, despite the potential of estimating camera parameters for reconstructing non-planar points and enabling applications such as automatic camera control, offside detection, or 3D ball tracking, previous studies [5, 38, 39, 26, 9, 52, 53, 33, 40, 7, 34] have predominantly treated the task as homography estimation rather than full calibration [46, 32].

Motivated by the limitations of existing approaches, we propose a novel calibration pipeline (see Fig. 1) for 3D sports field registration. An early version of this work was presented in [19], where we introduced a method capable of addressing the challenges posed by the multiple-view nature of broadcasts. This approach involves defining a hierarchical pipeline to extract a pre-defined keypoint grid from the court’s geometric properties and leveraging an encoder-decoder network to estimate keypoint positions. Moreover, the soccer field’s lines are defined following the SoccerNet [8] notation, and line extremities are also extracted. In particular, we adopt HRNetV2 [49] as the backbone model for keypoint and line extremity prediction. The estimated keypoints are used to compute an initial estimate of the projection matrix using the RANSAC [15] and Direct Linear Transformation (DLT) [22] algorithms. In this paper, we extend our contribution by incorporating a novel refinement module that jointly uses the detected keypoints and lines to further optimize the initial estimate, formulated as a non-linear least-squares problem [47]. We extensively evaluate our approach on three real-world soccer broadcast datasets, SoccerNet-Calibration [8], WorldCup 2014 [23], and TS-WorldCup [7], and compare it with state-of-the-art methods in both 2D and 3D sports field registration. The experiments demonstrate that our model achieves superior performance on 3D camera calibration while maintaining comparable results on homography estimation with respect to competing approaches. In summary, this paper makes the following contributions:

• A novel geometry-based keypoints grid and a robust pipeline for their retrieval.

• A calibration pipeline capable of integrating non-planar points for 3D camera calibration and extending to multiple views from the broadcast.

• A refinement module able to optimize the calibration estimate by jointly using the detected keypoints and lines.

II Related work

Sports field registration is a critical component of most sports applications in computer vision, whose common approaches intend to estimate homography matrices in team sports. Traditionally, homography estimation has relied on identifying corresponding features or keypoints between images and the court field model. These features, typically obtained by exploiting geometric primitives such as lines and/or circles, are subsequently used to estimate the mapping between the images. This is often done using the RANSAC algorithm [15] in conjunction with DLT [22] or non-linear optimization techniques [47] that minimize a particular loss function. More recent approaches have diverged from this traditional method. Some directly predict an initial homography matrix, while others seek the optimal matching homography within a reference database containing synthetic images with known homography matrices. Furthermore, the latest approaches have shifted towards directly retrieving camera parameters instead of the homography matrix. This is achieved through various means, including direct prediction, optimization techniques, leveraging databases of image-pose pairs, or decomposing the homography matrix.

II-A Search-based Methods

A prevalent approach in the field has been generating synthetic data to populate databases with homographies or camera poses paired with corresponding image features. These features are often derived from edge maps or semantically segmented images, which represent key elements of the sports field such as lines, circles, and other distinctive markings. Sharma et al. [39] created a synthetic database of edge map-homography pairs. One of its main drawbacks is the use of a normal distribution for camera pose sampling, which often leads to non-realistic poses. Addressing this issue, Chen and Little [5] created a features-pose database from statistics derived from the WorldCup 2014 dataset [23]. Those statistics, related to the camera parameters, were used to sample 90,000 poses, which are encoded through a Siamese Network [20] to distinguish different edge maps. During inference, the network extracts the encoding from edge maps and searches the database for the nearest-neighbour candidate pose. In contrast with previous approaches, Sha et al. [38] use area-based semantic segmentation of the soccer field instead of edge images, and also generate an artificial camera pose database based on the possible ranges for pan and tilt angles and focal length parameters. Treating segmentation as an image-to-image translation problem, Zhang et al. [52] improve the semantic segmentation with a conditional generative adversarial network [24]. Overall, search-based methods for camera calibration face inherent trade-offs between database size, processing speed, and estimation accuracy. Smaller databases offer faster searches but may compromise initial estimation quality, while larger databases provide better accuracy at the cost of increased computational time. Additionally, these methods often struggle with uncommon camera poses frequently encountered in broadcast videos, such as close-ups or oblique angles, which are typically underrepresented in predefined databases.

II-B Optimization-based Methods

Alternatively, common approaches also use edge maps, semantically segmented images or information extracted from the field’s visual landmarks, like intersection points or lines, to obtain the homography matrix or camera parameters from optimization methods. Homayounfar et al. [23] classify field lines using a modified VGG network [41] and extract their vanishing points, reducing the effective number of degrees of freedom (DoF) of the homography from 8 to 4. The field localization problem is formulated as a Markov random field, and it requires at least one pair of both vertical and horizontal lines to estimate the vanishing points. Citraro et al. [9] make use of a U-Net architecture [9] to jointly detect semantic keypoints corresponding to the field’s line intersections and player positions. In addition to obtaining the homography matrix from the 2D-3D correspondences, intrinsic and extrinsic camera parameters are subsequently obtained through homography decomposition [22]. In order to alleviate the problem of visual landmark sparsity, Nie et al. [33] propose an encoder-decoder network that jointly outputs a grid of keypoints distributed uniformly on the field template and dense template features in order to further refine the initial keypoint-based homography estimate. Similarly, Maglo et al. [30] use an encoder-decoder network to predict a perspective-aware keypoint grid. That is, the points nearest to the camera are more spread out than the points farthest from the camera, in order to compensate for the large distance variations in the image generated by the perspective effect. More recently, Chu et al. [7] also use a grid of 91 uniformly distributed keypoints and formulate its detection problem as an instance segmentation with dynamic filter learning. That is, the convolution filters are generated dynamically, conditioned on the field image and associated keypoint identity. Oo et al. [34] use a Residual EfficientNet-Attention UNet architecture to estimate the initial homography matrix using pre-defined keypoints to register sports fields. The encoder uses EfficientNetV2 [44] as the backbone network, and the decoder consists of deconvolution layers with residual blocks, skip connections and attention gates. Theiner et al. [46] introduce a differentiable objective function that is able to learn the camera pose and focal length from segment correspondences. Instance segmentation for each visible line or circle segment is achieved with a ResNet [6] backbone, and then the segment reprojection error induced by the estimated camera parameters is iteratively minimized with a gradient-based method [1]. Finally, leveraging the temporal consistency of sports videos, Claasen et al. [10] propose a Bayesian framework. Inspired by recent developments in tracking-by-detection methods, this work proposes a dynamics model that explicitly relates image keypoint positions from one frame to the next through two stages: the first stage consists of a linear Kalman filter, which considers the image keypoints the only part of its state vector, and the second stage incorporates the initial homography estimate into an Extended Kalman Filter [25], with the assumption that the relative homography between frames is small. Falaleev et al. [14] significantly increase the number of usable points for calibration by exploiting line-line and line-conic intersections, points on the conics, and other geometric features, followed by a DLT optimization. Although the computational cost of the enumerated approaches is lower, optimization-based methods are, overall, not as robust and accurate as search-based methods. Moreover, optimization-based methods are heavily dependent on the landmark detection accuracy.

II-C Prediction-based Methods

Prediction-based approaches use Deep Neural Network-based (DNN) models to calibrate the moving camera in sports. Jiang et al. [26] train a DNN that directly regresses a homography parameterization given an input frame. Subsequently, a sports field template is warped according to the initial estimate. A concatenation of the input frame and the warped template is fed into a second DNN that estimates the error of the current warping, and the estimated homography is optimized accordingly until convergence. Tarashima et al. [45] propose SFLNet, a CNN single-shot regressor that jointly predicts several outputs: a court metric model defined as an 8-dimensional parameter set, which corresponds to the homography’s DoF; a court semantic segmentation, which divides a frame into court, person, and background regions; and a label adjacency, which comprises adjacencies of label pairs in addition to their presence in an input frame, regularizing the model training by exploiting contextual information. Towards the design of an end-to-end approach, Shi et al. [40] propose a self-supervised learning method for homography estimation. This work employs a self-supervised data mining method to train the registration network with an image and its edge map by using an iterative estimation process controlled by a score regression network to measure the registration error. This method is able to obtain results competitive with previous approaches without the need for any labelled data. Recently, Zhang et al. [53] proposed a four-point calibration method. A cGAN is used to generate semantically segmented frames, eliminating foreground objects. Subsequently, a regression network estimates four points from the frames, which are used to calculate a homography using the DLT algorithm, keeping the computational cost low. Lastly, extending to the multiple-view nature of sports broadcast videos, Mavrogiannis et al. [32] propose a camera calibration method based on synthetic poses. They build a training data generation pipeline with separate flows for the main-camera frames and the rest of the camera locations. For the first case, an EfficientNet [6] model is trained to take synthetic edge images as input and regress the camera location. For the second case, a perspective transformation is calculated from four annotated pairs of corresponding points, and the camera location is obtained through homography decomposition. Moreover, a second model is trained on edge maps produced from the poses for each of the possible camera locations to estimate rotation angles and focal length of the camera. Although yielding competitive results on camera calibration and extending it to multiple views, the location-dependent auxiliary model limits the generalization capability and versatility of the method.

II-D Homography and Calibration Refinement

Homography refinement is a crucial step in camera calibration, aiming to achieve an even more accurate homography estimation and camera calibration, when necessary. Previous works use one or a combination of the following methods to refine the initial estimate. Puwein et al. [37] refine the initial homography by bundle adjustment [47]. Chen et al. [5] use the Lucas-Kanade algorithm [3] to refine the initial homography matrix, reducing the distance from every pixel in the testing image to its closest edge pixels from a retrieved edge image. Sha et al. [38] introduce the spatial transformer network (STN) to handle large non-affine transformations. The method stacks the input semantic image and the selected template to feed the STN, which outputs a relative homography that maps the semantic image to the template. Citraro et al. [9] use the detected players’ positions on the court’s plane to refine the initial homography estimate. After homography decomposition is performed to extract intrinsic and extrinsic camera parameters, these are further refined with the Levenberg–Marquardt [22] algorithm. A common approach [52, 53, 32] is to use the enhanced correlation coefficient method [13], which performs image alignment between its inputs and returns the refined homography matrix. Alternatively, Oo et al. [34] train a homography refinement network by applying random perturbations to input images. The network outputs the transformation matrix to be applied to the initial estimate. Lastly, other approaches exploit the temporal consistency of information between subsequent video frames, such as dense feature maps [33], homography matrices [33, 10] or intrinsic and/or extrinsic camera parameters [9].

III Methodology

Sports TV broadcasts consist of video sequences featuring a fraction of the sports field from different uncalibrated moving camera perspectives, defining a multiple-view setting. Our approach focuses on retrieving both extrinsic and intrinsic camera parameters from each individual frame, without any prior information about the camera’s position or orientation, except for a partial view of the soccer field. The proposed method comprises five processing components: soccer field modelling and keypoints generation, keypoints and line extremities detection, DLT algorithm and camera parameters retrieval, as well as calibration estimate refinement. Next, these components are introduced.

III-A Modelling the Soccer Field

A soccer field is composed of lines, circles and semi-circles, representing all field markings, goal posts, and crossbars. Our approach, following keypoint-based methods [45, 9, 33, 7, 30, 10, 34, 14], relies on the lines painted on the ground, their intersections, the corners they define and some extra geometric properties, since their positions in the world coordinate system are known. We follow the segment definitions of Cioppa et al. [8] and set them as starting points to hierarchically compute our pre-defined keypoint grid using the court’s geometric properties.

III-A1 Keypoint Generation

The full set of sampled keypoints is organized into subgroups based on the specific geometric features they represent (see Fig. 2). The hierarchical nature of the keypoint generation pipeline ensures that information from initially identified keypoints is exploited for computing the subsequent ones (some instances in Fig. 1-Top). Next, we define the keypoints sets:

• Line-Line intersections. Following [45, 11, 9, 10, 34, 14], this set of points ( $\mathcal{K}p$ ) includes the intersections of boundary lines, and the penalty or goal area markings. Considering the 23 lines depicted in [8], including goal posts and crossbars, up to 30 points can be included.

• Extended Line-Line intersections. Following [11], this set ( $\mathcal{K}p_{e}$ ) addresses the intersections of extended lines that represent non-adjacent segments of the soccer field. To obtain this set, the field lines are extended beyond their original boundaries by exploiting their line equations. It is worth noting that not all non-adjacent line intersections are added; if the lines have to be largely extended, small errors in the line equation would lead to huge deviations in the intersection position.

• Line-Ellipse intersections. Following [45, 11, 9, 10, 34, 14], this set ( $\mathcal{K}p_{1}$ ) considers the intersections between the field lines and the circles or semi-circles present on the court. Given the distortions introduced by the camera perspective, conics on the field are treated as ellipses when computing their equations. The parameters of these ellipses are fitted using the least squares method [21]. Line-ellipse intersection points are analytically derived using ellipse and line formulas (see Fig. 2-Bottom for a visual example).

• Ellipse tangent points. Following [18, 14], the set of available points is augmented with the tangent points of the tangent lines drawn from a specified external point to the previously defined ellipses. These tangent points (denoted by $\mathcal{K}p_{2}$ ) are analytically determined from the ellipse equation and the known coordinates of the external point (see Fig. 2-Bottom for a visual example).

• Additional points. Following [7, 10, 30, 33, 34, 14], once the previous keypoint sets and the corresponding homography are inferred, for the sake of grid completeness, an additional set ( $\mathcal{K}p_{3}$ ) of nine points is integrated along the central pitch axis, encompassing the pitch center and penalty points. Additionally, four points are placed to designate quarter turns along the central circle. Furthermore, the homography facilitates the inclusion of other points that are initially missing, addressing situations such as unannotated or wrongly annotated lines.
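The analytic constructions behind the first four keypoint sets can be illustrated with homogeneous coordinates. The following is a minimal sketch and not the authors' implementation: it assumes each fitted ellipse is already available as a symmetric $3\times 3$ conic matrix `C` (so that $\mathbf{x}^{\top}\mathbf{C}\mathbf{x}=0$), and all function names are hypothetical. The tangent points are obtained here through the polar line of the external point, which is one standard way to derive them.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two 2D points (cross product of the points)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def intersect_lines(l1, l2):
    """Intersection of two homogeneous lines; None if they are (near-)parallel."""
    x = np.cross(l1, l2)
    if abs(x[2]) < 1e-12:
        return None
    return x[:2] / x[2]

def intersect_line_conic(l, C):
    """Real intersections of the line l with the conic x^T C x = 0 (C symmetric)."""
    a, b, c = l
    n2 = a * a + b * b
    p0 = np.array([-a * c / n2, -b * c / n2, 1.0])  # point on the line
    d = np.array([b, -a, 0.0])                      # direction of the line
    # Substituting p0 + t*d into the conic gives a quadratic in t.
    qa = d @ C @ d
    qb = 2.0 * (p0 @ C @ d)
    qc = p0 @ C @ p0
    disc = qb * qb - 4.0 * qa * qc
    if abs(qa) < 1e-12 or disc < 0:
        return []
    ts = [(-qb + s * np.sqrt(disc)) / (2.0 * qa) for s in (1.0, -1.0)]
    return [(p0 + t * d)[:2] for t in ts]

def tangent_points(x_ext, C):
    """Contact points of the two tangents drawn from an external point.

    The polar line C @ x_ext meets the conic exactly at these contact points.
    """
    polar = C @ np.array([x_ext[0], x_ext[1], 1.0])
    return intersect_line_conic(polar, C)

# Toy example with a unit circle standing in for a fitted ellipse.
C_unit = np.diag([1.0, 1.0, -1.0])
corner = intersect_lines(line_through((1, 0), (1, 1)), line_through((0, 2), (1, 2)))
contacts = tangent_points((2.0, 0.0), C_unit)
```

Extending a segment beyond its endpoints requires no extra step in this representation, since a homogeneous line is unbounded; this also makes clear why a small error in the line coefficients is amplified when the intersection lies far from the observed segment.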

III-A2 Keypoint Disambiguation

Due to the multi-view nature of the SoccerNet dataset [12] and, for instance, considering one of the soccer field’s semi-circles, ambiguity appears in its respective $\mathcal{K}p_{1}$ and $\mathcal{K}p_{2}$ keypoint candidates, as shown in Fig. 2-Bottom. That is, while the expected locations of keypoint pairs in the image are known, the challenge lies in uniquely identifying and matching each individual keypoint. To solve this, we define two different strategies to handle the disambiguation, depending on the total number of keypoints generated in the previous sets. When there are sufficient points in the $\mathcal{K}p\cup\mathcal{K}p_{e}$ set to infer a homography ( $\mathbf{H}$ ), i.e., at least four points, $\mathcal{K}p_{1}$ is computed first by choosing the candidate combination that minimizes the reprojection error. Then, we include $\mathcal{K}p_{1}$ to infer a new homography estimate ( $\mathbf{H_{1}}$ ) and repeat the same strategy on the $\mathcal{K}p_{2}$ set (obtaining $\mathbf{H_{2}}$ ), as shown in Fig. 1. Otherwise, we perform a grid-search involving both $\mathcal{K}p_{1}$ and $\mathcal{K}p_{2}$ candidates when $|\mathcal{K}p\cup\mathcal{K}p_{e}\cup\mathcal{K}p_{1}^{*}\cup\mathcal{K}p_{2}^{*}|\geq 4$ , where $*$ denotes a possible candidate combination. The grid-search iterates over all keypoint candidates in a set-wise manner to avoid unfeasible combinations and keeps the one with minimum reprojection error. In this case, the $\mathbf{H_{1}}$ estimation is bypassed, and $\mathbf{H_{2}}$ is computed only after resolving both $\mathcal{K}p_{1}$ and $\mathcal{K}p_{2}$ . It is worth pointing out that no keypoint sets beyond $\mathcal{K}p\cup\mathcal{K}p_{e}$ are computed if neither of the above conditions is met. Moreover, once we compute all the pre-defined keypoint sets, two additional geometrical constraints are applied in case the homography estimation or ellipse fitting is not accurate enough.
First, we manually establish a reprojection error threshold to validate keypoints. Subsequently, through iteration over combinations of keypoints, we construct vectors and ensure that the cross-products maintain consistent signs in both world and image coordinates. This final step is essential in cases where two distinct combinations yield valid top- and bottom-view perspectives of the field while exhibiting identical reprojection errors. Utilizing cross-products enables us to differentiate and retain the keypoint combination corresponding to the field’s top-view. The full keypoint generation process is depicted in Fig. 1-Top.
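The two ingredients of the disambiguation, selecting the candidate assignment with minimum reprojection error and checking cross-product signs, can be sketched as follows. This is an illustrative example under stated assumptions rather than the authors' code: the homography is fitted with a plain (unnormalized) DLT, the function names are hypothetical, and the orientation test assumes that the world and image frames are expressed with the same handedness.

```python
import itertools
import numpy as np

def fit_homography(world, image):
    """DLT homography mapping >= 4 world (x, y) points to image (u, v) points."""
    rows = []
    for (x, y), (u, v) in zip(world, image):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 3)

def project(H, pts):
    """Apply a homography to an (N, 2) set of points."""
    p = np.c_[np.asarray(pts, dtype=float), np.ones(len(pts))] @ H.T
    return p[:, :2] / p[:, 2:3]

def reprojection_error(H, world, image):
    """Mean Euclidean distance between projected world points and image points."""
    diff = project(H, world) - np.asarray(image, dtype=float)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

def resolve_candidates(H, world_pts, image_candidates):
    """Order unlabeled image candidates so that they best match world_pts under H."""
    best = min(itertools.permutations(image_candidates),
               key=lambda perm: reprojection_error(H, world_pts, perm))
    return list(best)

def same_orientation(world, image):
    """True if no point triplet flips the sign of its 2D cross-product."""
    def cross_z(a, b, c):
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    for i, j, k in itertools.combinations(range(len(world)), 3):
        if cross_z(world[i], world[j], world[k]) * cross_z(image[i], image[j], image[k]) < 0:
            return False
    return True
```

In this sketch, `resolve_candidates` would be called with the homography fitted from the already identified keypoints; a mirrored (bottom-view) solution reproduces the same reprojection error but fails `same_orientation`, which is the role of the cross-product check described above.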

III-A3 Left-Right Disambiguation

In sequences where the camera angle aligns with the longitudinal axis of the court, an ambiguity arises regarding the distinction between the right and left halves of the field. Hence, a critical step to ensure consistency and robustness across keypoints and lines detection processes involves differentiating between the two sides. Taking into account the camera calibration evaluation protocol in [8], which considers both of the ambiguous options and keeps the highest score, this is accomplished by implementing a remap to the ground-truth (GT) values, ensuring that the goal area closest to the camera consistently represents the left side. The process of checking whether or not the mapping should be applied is defined in a heuristic fashion. We compute angles of horizontal and vertical soccer field lines, respectively; and then set a threshold taking into account angle distribution and visual inspection. This approach facilitates an effective model training process by deferring the disambiguation task until after the inference stage as an extra step if needed.

III-B Keypoints and Lines Detection

Our approach builds upon the solution of Falaleev et al. [14], which makes use of two encoder-decoder convolutional neural networks to estimate the positions of the pre-defined keypoints and of the soccer field lines depicted in [8], excluding conics. While in [14] the line model is given an auxiliary role to enhance keypoint completeness, in our approach it is the key component of our refinement module in the last calibration stage. During inference, the former network produces heatmaps for each pre-defined keypoint, with a single Gaussian peak with a sigma of $2$ pixels centered at the keypoint location, accompanied by an additional background channel. This additional channel reflects the inverse of the maximum value among the other target feature maps, ensuring that the resultant target tensor behaves as a probability distribution function at every spatial point. Meanwhile, the latter network produces heatmaps for each visible soccer field line within the frame, assigning two Gaussian peaks at the locations of the line extremities. Additionally, we introduce an extra channel, known as the boundary channel, to our heatmap following the approach outlined in [50]. This augmentation aims to better capture global information regarding the soccer field and to improve extremity detection, particularly near image borders. We extract the positions of keypoints and line extremities from the generated heatmaps by employing a max pooling operation, which calculates the maximum value for patches of a feature map, drawing inspiration from the methodology proposed in [55]. This process is summarized in Fig. 1-Bottom.
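To make the heatmap encoding and decoding concrete, the sketch below builds Gaussian targets with a background channel and recovers peak locations through a max-pooling comparison. It is a simplified NumPy illustration, not the training code: the background channel is assumed here to be the complement `1 - max` of the keypoint channels, and the function names, the $3\times 3$ pooling window and the score threshold are hypothetical choices.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def keypoint_targets(points, shape, sigma=2.0):
    """One Gaussian channel per keypoint (x, y) plus a background channel."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.stack([
        np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        for x, y in points
    ])
    # Assumption: "inverse of the maximum" is read as the complement 1 - max.
    background = 1.0 - maps.max(axis=0)
    return np.concatenate([maps, background[None]], axis=0)

def extract_peaks(heatmap, kernel=3, threshold=0.5):
    """Return (x, y, score) for local maxima that survive max pooling."""
    pad = kernel // 2
    padded = np.pad(heatmap, pad, mode="constant", constant_values=-np.inf)
    # Stride-1 max pooling: a pixel is a peak if it equals its window maximum.
    pooled = sliding_window_view(padded, (kernel, kernel)).max(axis=(2, 3))
    ys, xs = np.nonzero((heatmap == pooled) & (heatmap > threshold))
    return [(int(x), int(y), float(heatmap[y, x])) for y, x in zip(ys, xs)]

targets = keypoint_targets([(10, 5), (20, 12)], shape=(32, 48))
keypoint_peaks = extract_peaks(targets[0])
# A line channel carries two peaks, one per extremity.
line_channel = targets[0] + targets[1]
extremities = extract_peaks(line_channel)
```

The same `extract_peaks` routine serves both networks in this sketch: a keypoint channel yields at most one detection, whereas a line channel yields the two extremities.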

III-B1 Architecture

Following [14], keypoint and line extremity detection use a modified HRNetV2-w48 [49] network as the encoder backbone. HRNetV2 [49] is a family of convolutional networks that maintains high-resolution representations throughout the whole process, resulting in semantically richer and spatially more precise representations. To improve the spatial resolution of the predicted heatmaps, we incorporate $2\times$ upsampling and concatenate skip-connection features from the corresponding resolution of the convolution stem to fuse features at different scales. The resulting feature maps are produced at half the spatial resolution of the input image, striking a balance between computational efficiency and spatial detail retention. Additionally, a more lightweight version of the backbone is obtained by reducing the sizes of the dense representation layers in the network’s last stages. Softmax and sigmoid are employed as the final activation functions for keypoints and lines, respectively.

III-B2 Keypoints Mask

When the homography is unavailable due to a limited number of points, the heatmaps associated with points belonging to $\mathcal{K}p_{1}$, $\mathcal{K}p_{2}$, and $\mathcal{K}p_{3}$ (which would have been derived from the homography itself) are masked out from the loss function as long as the line to which they belong is included in the GT annotation. Additionally, when the external point required to compute the ellipse tangent points in $\mathcal{K}p_{2}$ is not present in $\mathcal{K}p$, the pair of tangent point candidates is also masked out.

III-C Camera Projection Model

We employ a standard full-perspective camera model as:

\mathbf{P}=\mathbf{KR}[\mathbf{I}\,|\,\mathbf{-t}]\in\mathbb{R}^{3\times 4},

(1)

where $\mathbf{R}\in\mathbb{R}^{3\times 3}$ and $\mathbf{t}\in\mathbb{R}^{3}$ denote the extrinsic parameters (rotation and translation, respectively) that map scene coordinates to camera coordinates, and $\mathbf{K}\in\mathbb{R}^{3\times 3}$ contains the intrinsic parameters that map camera coordinates to image coordinates. The latter has the form:

\mathbf{K}=\begin{bmatrix}\alpha_{x}&s&x_{0}\\ &\alpha_{y}&y_{0}\\ &&1\end{bmatrix},

(2)

where $\{\alpha_{x},\alpha_{y}\}$ denote the focal lengths along the two image axes, $\{x_{0},y_{0}\}$ the principal point, and $s$ the skew value. Following [22], we assume zero skew and a known pixel aspect ratio. Additionally, for simplicity, we assume that the principal point ($x_{0},y_{0}$) coincides with the center of the image, and we neglect astigmatism and lens distortions.
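As an illustration of Eqs. (1) and (2), the sketch below builds $\mathbf{K}$ under the stated assumptions (zero skew, principal point at the image center) and projects a world point with $\mathbf{P}=\mathbf{KR}[\mathbf{I}\,|\,\mathbf{-t}]$. Square pixels ($\alpha_{x}=\alpha_{y}$) are an additional assumption made here for brevity, and all function names are hypothetical.

```python
def intrinsics(focal, width, height):
    """Build K of Eq. (2) with zero skew, square pixels and a centred principal point."""
    return [[focal, 0.0, width / 2.0],
            [0.0, focal, height / 2.0],
            [0.0, 0.0, 1.0]]


def matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def project(K, R, t, X):
    """Project world point X with P = K R [I | -t].

    Returns the pixel (u, v) and the depth of the point in camera
    coordinates. The caller should check that the depth is positive;
    a zero depth would raise a ZeroDivisionError.
    """
    x_cam = matvec(R, [X[i] - t[i] for i in range(3)])  # R (X - t)
    u, v, w = matvec(K, x_cam)
    return (u / w, v / w), x_cam[2]
```

With this convention, $\mathbf{t}$ is the camera position expressed in world coordinates, which is consistent with the $[\mathbf{I}\,|\,\mathbf{-t}]$ factor in Eq. (1).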

III-C1 Camera Parameters Estimation

Extrinsic and intrinsic parameters in Eq. (1) are inferred from the coordinates of 3D object points and their corresponding 2D projections, using the soccer field model as a calibration rig. We follow [54], which consists of a closed-form solution followed by a non-linear refinement based on the maximum likelihood criterion. To calibrate the 3D soccer field model rig, we consider two additional vertical planes containing the goal polygons, which include non-planar points such as keypoints belonging to the goal posts and crossbars. Additionally, we extend this strategy to enhance completeness: when insufficient points are available on the ground plane, we calibrate over the vertical planes and subsequently transform the camera parameters to the ground-plane coordinate system. When a sufficient number of points, i.e., six keypoints, is available, an initial estimate of the camera’s intrinsic matrix $\mathbf{K}$ is inferred. Camera calibration is then performed using this estimate, resulting in a more robust and stable calibration. Otherwise, calibration is done by jointly optimizing all three unknowns: $\mathbf{K}$, $\mathbf{R}$, and $\mathbf{t}$. To account for keypoint misdetections and other complexities in camera parameter retrieval, such as frames with only one non-planar keypoint visible, the calibration process is repeated on several subsets of keypoints. Similar to [14], these subsets are selected based on various heuristics: full-keypoints, including all keypoint sets $\mathcal{K}p$, $\mathcal{K}p_{e}$, $\mathcal{K}p_{1}$, $\mathcal{K}p_{2}$ and $\mathcal{K}p_{3}$; main-keypoints, comprising only line-line intersections from the original SoccerNet annotations [8]; and ground-plane-keypoints, which excludes non-planar keypoints. Furthermore, we apply a grid of RANSAC [15] reprojection error thresholds to each subset. The final camera calibration values are determined through a heuristic voting process that prioritizes the estimate yielding a lower reprojection error, with emphasis on the full-keypoints subset.
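The exact voting rule is not detailed here, so the following sketch only illustrates one plausible way of choosing among the candidate calibrations obtained over keypoint subsets and RANSAC thresholds. The candidate dictionary layout, the `tolerance` parameter, and the tie-breaking rule in favour of the full-keypoints subset are all assumptions made for illustration.

```python
def select_calibration(candidates, preferred="full-keypoints", tolerance=0.5):
    """Pick one calibration among candidates from different subsets/thresholds.

    Each candidate is a dict with keys "subset", "threshold",
    "reproj_error" (pixels) and "params" (None when calibration failed).
    The lowest reprojection error wins, except that the best candidate of
    the `preferred` subset is chosen when its error is within `tolerance`
    pixels of the overall best. This rule is an assumed heuristic.
    """
    valid = [c for c in candidates if c["params"] is not None]
    if not valid:
        return None
    best = min(valid, key=lambda c: c["reproj_error"])
    preferred_valid = [c for c in valid if c["subset"] == preferred]
    if preferred_valid:
        best_preferred = min(preferred_valid, key=lambda c: c["reproj_error"])
        if best_preferred["reproj_error"] <= best["reproj_error"] + tolerance:
            return best_preferred
    return best
```

Returning `None` when every candidate fails corresponds to a frame for which no camera parameters are provided, which is what the completeness rate in Section IV-B accounts for.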

III-C2 Homography Estimation

Assuming the world coordinate system such that $z=0$ corresponds to the ground plane, the ground-to-image homography $\mathbf{H}$ can be obtained from the first, second, and fourth columns of the camera projection matrix $\mathbf{P}$ as:

\mathbf{H}\cong\mathbf{KR}\begin{bmatrix}1&0&-t_{x}\\ 0&1&-t_{y}\\ 0&0&-t_{z}\end{bmatrix}\in\mathbb{R}^{3\times 3},

(3)

where $\mathbf{t}=[t_{x},t_{y},t_{z}]^{\top}$. Nevertheless, inaccurate estimations for keypoints associated with the non-planar rig, such as those belonging to the goalposts and crossbars, may result in a flawed homography estimation. To address this issue, we employ classical homography estimation with DLT [22] and RANSAC [15] on the ground-plane-keypoints subset. We define a maximum allowable reprojection error for a point pair to be considered an inlier, and subsequently refine the initial homography estimate using the Levenberg-Marquardt method [22] on the 2D-3D point correspondences. As in camera parameter estimation, we apply a grid search over the RANSAC reprojection error threshold and employ the heuristic voting method, in this instance restricted to the ground-plane-keypoints subset.
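The relation in Eq. (3) can be checked numerically with a short sketch: the homography is obtained by dropping the third column of $\mathbf{P}$, and applying it to a ground point $(x,y)$ gives the same pixel as projecting $(x,y,0)$ with the full camera. The helper names and the example camera below are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def projection_matrix(K, R, t):
    """Build the 3x4 matrix P = K R [I | -t] of Eq. (1)."""
    KR = matmul(K, R)
    last_column = [-sum(KR[i][j] * t[j] for j in range(3)) for i in range(3)]
    return [KR[i] + [last_column[i]] for i in range(3)]


def ground_homography(P):
    """Keep the first, second and fourth columns of P (points with z = 0), as in Eq. (3)."""
    return [[row[0], row[1], row[3]] for row in P]


def apply_homography(H, x, y):
    """Map a ground-plane point (x, y) to pixel coordinates."""
    u, v, w = (H[i][0] * x + H[i][1] * y + H[i][2] for i in range(3))
    return u / w, v / w


# Example camera (assumed values): 10 m above the origin, looking straight down.
K = [[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
H = ground_homography(projection_matrix(K, R, [0.0, 0.0, 10.0]))
```

In practice the homography is only defined up to scale, so two estimates should be compared after normalization or through the points they map.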

III-D Point and Line Refinement Module

To address the sparsity of keypoints on the grid, particularly in regions distant from the penalty boxes, where the majority of keypoints are concentrated, and to mitigate issues arising from keypoint misdetections overlapping with detected lines, we have developed a refinement module that makes use of the detected line information. This module, termed the Point and Line (PnL) refinement module, enhances both homography and calibration estimates by jointly leveraging the information from detected keypoints and lines. The PnL module integrates line information to complement the keypoint data, providing a more robust and comprehensive basis for accurate field registration and camera calibration, especially in areas where keypoint information alone may be insufficient or unreliable. Keeping the intrinsic calibration matrix $\mathbf{K}$ fixed, the PnL refinement module optimizes the calibration estimate over the camera pose space $\bm{\Theta}=\{\mathbf{R,t}\}$, which comprises the rotation and translation parameters, respectively.

Inspired by previous approaches [36, 48, 17], we next describe the line parameterization, the error function, and its integration within the PnL refinement module. Let $\mathbf{p},\mathbf{q}\in\mathbb{R}^{3}$ represent the real extremities of a known 3D soccer field line based on a field model, and let $\mathbf{p_{d}},\mathbf{q_{d}}\in\mathbb{R}^{3}$ denote the visible extremities of the same 3D soccer field line detected in the image frame, with image coordinates $\bar{\mathbf{p}}_{d},\bar{\mathbf{q}}_{d}\in\mathbb{R}^{2}$ defining the detected soccer field line $\mathbf{l}_{d}=\overrightarrow{\bar{\mathbf{p}}_{d}\bar{\mathbf{q}}_{d}}$ . Note that if the real line extremities are visible in the camera frame, they are equivalent to the detected extremities, i.e., $\mathbf{p}\equiv\mathbf{p_{d}}$ . Otherwise, the 3D positions of $\mathbf{p_{d}}$ and $\mathbf{q_{d}}$ are unknown.

By projecting $\mathbf{p}$ and $\mathbf{q}$ into the image plane, given an initial calibration estimate $\bm{\Theta}$ (or a homography matrix $\mathbf{H}$ in the case of planar estimation), we define the projected soccer field line $\mathbf{l}$ and compute both $\bar{\mathbf{p}}$ and $\bar{\mathbf{q}}$ as its intersections with the image boundaries (see Fig. 3-Top). Hence, analogously to the detected lines, $\mathbf{l}=\overrightarrow{\bar{\mathbf{p}}\bar{\mathbf{q}}}$ .
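As a sketch of how $\bar{\mathbf{p}}$ and $\bar{\mathbf{q}}$ could be obtained, the function below intersects the infinite line through two projected points with the image rectangle. The function name, the numerical tolerances, and the choice to return `None` when the line misses the image are assumptions made for illustration.

```python
def clip_line_to_image(p, q, width, height, eps=1e-9):
    """Intersect the infinite 2D line through p and q with the image borders.

    Returns the two border points, or None if the line does not cross
    the image rectangle [0, width] x [0, height].
    """
    (px, py), (qx, qy) = p, q
    dx, dy = qx - px, qy - py
    hits = []
    if abs(dx) > eps:  # intersections with the left and right borders
        for x in (0.0, float(width)):
            y = py + (x - px) / dx * dy
            if -eps <= y <= height + eps:
                hits.append((x, y))
    if abs(dy) > eps:  # intersections with the top and bottom borders
        for y in (0.0, float(height)):
            x = px + (y - py) / dy * dx
            if -eps <= x <= width + eps:
                hits.append((x, y))
    # A line through an image corner hits two borders at the same point.
    unique = []
    for h in hits:
        if all(abs(h[0] - u[0]) > 1e-6 or abs(h[1] - u[1]) > 1e-6 for u in unique):
            unique.append(h)
    if len(unique) < 2:
        return None
    return unique[0], unique[1]
```

Representing the projected line by its border intersections keeps it comparable to the detected line, which is also described by two image points.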

However, it is worth noting that 3D line extremities $\mathbf{p},\mathbf{q}$ lying behind the field’s camera plane $\mathbf{\Pi_{c}}$, i.e., points with $z_{c}<0$ in camera coordinates, cannot be projected. To address this issue, we construct the field’s camera plane using the estimated camera position $\mathbf{t}$ and a normal vector $\mathbf{n_{t}}$ to the camera plane (see Fig. 3-Bottom). The latter is built based on the camera position and the projection of the principal point onto the planar field. In the case of planar estimation, the camera position and rotation are obtained directly from the homography matrix [22]. Therefore, to correctly project $\mathbf{l}$ onto the image, $\mathbf{p}$ and $\mathbf{q}$ must be checked and corrected to satisfy the following planar condition:

\mathbf{q_{c}}=\begin{dcases}\mathbf{q}+\frac{\mathbf{n_{t}}\cdot(\mathbf{t}-\mathbf{q})}{\mathbf{n_{t}}\cdot(\mathbf{p}-\mathbf{q})}+\epsilon(\mathbf{n_{t}})&\text{if $(\mathbf{q}-\mathbf{t})\cdot\mathbf{n_{t}}<0$}\\ \mathbf{q}&\text{otherwise}\end{dcases},

(4)

with $\epsilon(\mathbf{n_{t}})$ being a threshold added in the direction of the normal vector to ensure that $\mathbf{q_{c}}$ lies ahead of the camera plane $\mathbf{\Pi_{c}}$ . “ $\cdot$ ” denotes a dot product. Once the projected extremities are corrected, the line $\mathbf{l}$ , and consequently $\bar{\mathbf{p}}$ and $\bar{\mathbf{q}}$ , can be computed. Given the image observations $\bar{\mathbf{p}}_{d}$ and $\bar{\mathbf{q}}_{d}$ , we then define the point-to-line distances, following [43], to the projected line $\mathbf{l}$ as follows:

\begin{aligned} d(\mathbf{l},\mathbf{l}_{d})&=d(\mathbf{l},\bar{\mathbf{p}}_{d})+d(\mathbf{l},\bar{\mathbf{q}}_{d})\\ &=\frac{|\Delta_{y}\bar{\mathbf{p}}_{d,x}-\Delta_{x}\bar{\mathbf{p}}_{d,y}+\bar{\mathbf{q}}_{x}\bar{\mathbf{p}}_{y}-\bar{\mathbf{q}}_{y}\bar{\mathbf{p}}_{x}|}{\sqrt{\Delta_{y}^{2}+\Delta_{x}^{2}}}\\ &+\frac{|\Delta_{y}\bar{\mathbf{q}}_{d,x}-\Delta_{x}\bar{\mathbf{q}}_{d,y}+\bar{\mathbf{q}}_{x}\bar{\mathbf{p}}_{y}-\bar{\mathbf{q}}_{y}\bar{\mathbf{p}}_{x}|}{\sqrt{\Delta_{y}^{2}+\Delta_{x}^{2}}}\end{aligned},

(5)

where $\Delta_{y}=\bar{\mathbf{q}}_{y}-\bar{\mathbf{p}}_{y}$ and $\Delta_{x}=\bar{\mathbf{q}}_{x}-\bar{\mathbf{p}}_{x}$ . It is important to note that by representing lines using their endpoints, we obtain comparable error representations for both points and lines. Given an image detection $\bar{\mathbf{x}}_{d}$ of a predefined keypoint, we define a reprojection error as:

d(\bar{\mathbf{x}}_{d},\mathbf{x})=\bar{\mathbf{x}}_{d}-\bm{\pi}(\bm{\Theta},\mathbf{x}),

(6)

where $\bm{\pi}(\bm{\Theta},\mathbf{x})$ represents the projection of the keypoint world coordinate $\mathbf{x}\in\mathbb{R}^{3}$ onto the image plane, given a calibration estimate $\bm{\Theta}$ . We can thus construct a unified cost function that integrates each of the error terms as follows:

C=\alpha\sum_{i\in\mathcal{L}}d(\mathbf{l}_{i},\mathbf{l}_{d,i})+(1-\alpha)\sum_{j\in\mathcal{K}p_{c}}d(\bar{\mathbf{x}}_{d,j},\mathbf{x}_{j}),

(7)

where $\mathcal{L}$ represents the set of detected lines and $\mathcal{K}p_{c}=\mathcal{K}p\cup\mathcal{K}p_{e}\cup\mathcal{K}p_{1}\cup\mathcal{K}p_{2}\cup\mathcal{K}p_{3}$ denotes the union of all available keypoint sets. The weighting parameter $\alpha$ is set empirically to balance the influence of points and lines in the cost function. Following [36], a recursive approach over the line and point reprojection errors is applied to optimize the pose parameters $\bm{\Theta}$ as a nonlinear least-squares problem. To mitigate line misdetections, each detected line is compared to its projected counterpart using the initial estimate $\bm{\Theta}$: a projection error threshold is applied to both extremities of the detected line before it is added to the set $\mathcal{L}$.
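The building blocks of Eqs. (4), (5) and (7) can be sketched as follows. Several points are assumptions rather than statements about the original implementation: the scalar ratio in Eq. (4) is taken to scale the direction $\mathbf{p}-\mathbf{q}$, as in a standard line-plane intersection; $\epsilon(\mathbf{n_{t}})$ is modelled as a small step `eps` along $\mathbf{n_{t}}$; the vector residual of Eq. (6) is reduced to its Euclidean norm inside the cost; and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def correct_endpoint(q, p, t, n, eps=1e-3):
    """Eq. (4) sketch: move q just in front of the camera plane if it lies behind it.

    Assumes p lies in front of the plane, so that n . (p - q) is non-zero.
    """
    if dot([qi - ti for qi, ti in zip(q, t)], n) >= 0:
        return list(q)
    direction = [pi - qi for pi, qi in zip(p, q)]
    s = dot(n, [ti - qi for ti, qi in zip(t, q)]) / dot(n, direction)
    return [qi + s * di + eps * ni for qi, di, ni in zip(q, direction, n)]


def point_to_line(p, q, x):
    """Distance from image point x to the line through image points p and q."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    return abs(dy * x[0] - dx * x[1] + q[0] * p[1] - q[1] * p[0]) / math.hypot(dx, dy)


def line_distance(p, q, p_d, q_d):
    """Eq. (5): summed distances of the detected extremities to the projected line."""
    return point_to_line(p, q, p_d) + point_to_line(p, q, q_d)


def pnl_cost(lines, points, alpha=0.5):
    """Eq. (7) sketch.

    lines:  iterable of (p, q, p_d, q_d) image points per detected line.
    points: iterable of (detected, projected) image points per keypoint.
    """
    line_term = sum(line_distance(p, q, p_d, q_d) for p, q, p_d, q_d in lines)
    point_term = sum(math.dist(x_d, x) for x_d, x in points)
    return alpha * line_term + (1.0 - alpha) * point_term
```

In a full refinement loop, a cost of this kind would be re-evaluated for every candidate pose $\bm{\Theta}$ by a nonlinear least-squares solver; the default `alpha=0.5` matches the value reported in Section IV-D1.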

IV Experiments

This section provides an overview of the datasets we use, the evaluation metrics employed to assess our approach, and the implementation details. Subsequently, we present both qualitative and quantitative results, comparing our method with state-of-the-art approaches and performing an ablation study of the effect of the keypoint sets on the method’s performance.

IV-A Datasets

To evaluate our method, we compare our results with state-of-the-art methods on the SoccerNet-Calibration [8], the WorldCup [23] and the TS-WorldCup [7] soccer datasets.

SN-Calib Dataset: The SoccerNetV3-Calibration (SN23) dataset [8] comprises 22,816 images extracted from SoccerNet [12] videos and has a broadcast-based multi-view nature, offering a broader range of camera perspectives beyond the main broadcast camera. Cioppa et al. [8] provide annotations for all segments of the soccer field, encompassing lines, conics and goal posts. For each visible segment on the pitch, at least two annotated positions are provided, representing the segment in a polyline format. For the conics drawn on the pitch, the annotations consist of a list of points that roughly gives the circle shape when connected. Additionally, Theiner et al. [46] provided manual camera view annotations for the SoccerNetv3-Calibration 2022 (SN22) test set version, allowing the creation of data subsets that take into account the camera view distribution.

WC14 Dataset: The WorldCup 2014 dataset (WC14) [23] stands as the reference benchmark for sports field registration. It consists of 209 images from ten games for training and 186 images from ten other games for testing, together with the corresponding manually annotated homography matrices, all from the FIFA WorldCup 2014. Additionally, Theiner et al. [46] provide segment annotations in the SN-Calib [8] format.

TS-WC Dataset: The TS-WorldCup dataset (TSWC) [7] contains detailed field markings on 3,812 field images from 43 videos of Soccer WorldCup 2014 and 2018 in a time-sequence fashion, which is ten times larger than the WorldCup 2014 dataset.

IV-B Evaluation Metrics

The quality of estimated camera parameters or homography matrices can be evaluated both in 2D and 3D spaces.

Jaccard Index ($\text{JaC}_{\gamma}$): Following the benchmarking protocol of Magera et al. [28], the evaluation relies on calculating the reprojection error between each annotated point and the line to which it belongs. Adopting a binary classification approach, each pitch segment is treated as a single entity. To be considered correctly detected, all points within the segment must have a reprojection error smaller than a threshold. The projection of pitch elements from densely sampled points of the soccer field 3D model yields a polyline for each segment. Therefore, a polyline representing a soccer field segment $s$ is classified as a true positive (TP) if $\forall p\in s:\min\left(d(p,\hat{s})\right)<\gamma$, where $\hat{s}$ is the corresponding annotated segment and $\gamma$ the distance threshold in pixels. Otherwise, this segment is counted as a false positive (FP). Segments only present in the annotations are counted as false negatives (FN). Hence, the Jaccard Index for camera calibration, $\text{JaC}_{\gamma}$, at a threshold of $\gamma$ pixels is defined as:

\text{JaC}_{\gamma}=\frac{\text{TP}_{\gamma}}{\text{TP}_{\gamma}+\text{FN}+\text{FP}},

(8)

which serves as a measure of calibration accuracy. We also measure the completeness rate (CR) as the number of camera parameters provided divided by the number of images with more than four semantic line annotations in the dataset. The final score (FS) used as an evaluation criterion is calculated as the product of CR and $\text{JaC}_{5}$.
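The metrics above reduce to a few lines of arithmetic, sketched below. The function names are illustrative, and returning zero for an empty set of segments is an assumption, not part of the benchmark definition.

```python
def segment_is_true_positive(distances, gamma):
    """A projected segment is a TP when all of its points lie within gamma pixels."""
    return bool(distances) and all(d < gamma for d in distances)


def jaccard_index(tp, fp, fn):
    """Eq. (8): JaC = TP / (TP + FN + FP); defined as 0 here when there are no segments."""
    total = tp + fp + fn
    return tp / total if total else 0.0


def final_score(completeness_rate, jac5):
    """FS = CR x JaC_5, with CR given as a fraction in [0, 1]."""
    return completeness_rate * jac5
```

For instance, with $\text{CR}=98.1\%$ and $\text{JaC}_{5}=75.8\%$, `final_score(0.981, 75.8)` gives approximately $74.4$, which matches the FS reported for our SV model on SN22-test-center in Table I.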

Intersection over Union: The intersection over union (IoU) includes two components. First, $\text{IoU}_{part}$ quantifies the visible area of the video frame by warping that frame using both the estimated homography and the GT one, projecting them onto the template, and then calculating the IoU. Second, $\text{IoU}_{whole}$ evaluates the entire sports field by warping the template with the refined homography, projecting it onto the original template, and calculating the IoU.

Projection Error: The projection error was quantified as the average distance, in meters, between the projected points using the estimated homography and the corresponding GT. To achieve that, we uniformly sampled 2,500 pixels from the visible field area of the camera image and projected them onto the field to compute the distance. The standard dimensions of a soccer field are 105 $\times$ 68 meters.
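A minimal sketch of this metric is given below. It assumes that both homographies are supplied in the image-to-field direction, with field coordinates in meters, and that the sampled pixels are provided by the caller; the function names are hypothetical.

```python
import math


def apply_homography(H, point):
    """Map a 2D point with a 3x3 homography given as nested lists."""
    x, y = point
    u, v, w = (H[i][0] * x + H[i][1] * y + H[i][2] for i in range(3))
    return u / w, v / w


def mean_projection_error(H_est, H_gt, pixels):
    """Average field-plane distance between points mapped by the estimated and GT homographies."""
    errors = [math.dist(apply_homography(H_est, p), apply_homography(H_gt, p))
              for p in pixels]
    return sum(errors) / len(errors)
```

The reprojection error described next follows the same pattern, with the distance measured in the video frame instead of on the field.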

Reprojection Error: The reprojection error was calculated by averaging the distance between the reprojected points in the video frame, utilizing both the estimated and the GT homography.

IV-C Implementation Details

Due to the absence of publicly available results on the multiple-view SN23 [8] distribution, we trained two models from scratch: Multi-view (MV) and Single-view (SV). The latter is trained on a data subset composed almost entirely of non-replay frames, ensuring a high percentage of central camera shots. We train separate networks for the keypoint and line extremity detection tasks on the SN23-train dataset [8]. For the MV model, we train for $200$ epochs, using an initial learning rate of $10^{-2}$ and a batch size of $2$. For the SV model, we train for $200$ epochs, using an initial learning rate of $10^{-5}$ and a batch size of $1$. We use the Adam optimizer with default parameters $\beta_{1}=0.9$ and $\beta_{2}=0.999$. An ${l}_{2}$-norm loss is used for heatmap regression in both neural networks. Data augmentations such as random horizontal flip, color jitter, and Gaussian noise are applied to enhance model robustness and generalization. Furthermore, we fine-tune both SV networks, for keypoint and line extremity detection, on the WC14 [23] and TSWC [7] datasets. GT homographies are transformed into our proposed keypoint sets and line extremities by projecting their respective world coordinates to the field’s ground plane. Note that non-planar points and lines are excluded from the transformation process, and their corresponding output layers are masked in the loss function. In fine-tuning, input images are resized to match the SoccerNet dataset image size. The experiments are conducted on a single NVIDIA GeForce RTX 2080 Ti GPU with $11$ GB of memory, and the implementation is carried out in the PyTorch framework. Calibration and further optimization tasks are conducted on an Intel Xeon Silver 4214 processor, and the implementations are carried out with the opencv-python and scipy libraries, respectively.

IV-D Results and comparisons

We now present the results of an extensive evaluation, divided into camera calibration and homography estimation. The camera calibration evaluation assesses the accuracy of individual camera parameters using the $\text{JaC}_{\gamma}$ metric, while the homography estimation is evaluated using the IoU one, projection error, and reprojection error. As previously noted, methods with the subscript MV and SV correspond to multi-view and single-view models, respectively, and the PnL designation indicates the use of point and line calibration refinement module.

IV-D1 Camera Calibration

In team sports such as soccer, the action takes place on a nearly planar field. Consequently, most methods rely on homography estimation to map all elements positioned on this plane, but cannot project non-planar points such as those belonging to goal posts or crossbars. Conversely, [46, 32, 14] make use of a 3D model of the soccer field to extract camera pose and intrinsic parameters directly. In homography-based approaches, parameter retrieval is accomplished through homography decomposition (HDecomp). We conduct a quantitative comparison of our proposed method with state-of-the-art approaches [5, 26, 46, 32] on the SN22-test-center dataset, comprising only images where the main camera center is visible (1,454 images). Furthermore, using the SoccerNet annotation format for the WC14-test dataset provided by Theiner et al. [46], we compare the performance of our proposed method in camera parameter estimation on the WorldCup 2014 dataset distribution.

We report the statistics from [46] for the results of state-of-the-art approaches [5, 26, 46], along with the results provided by [32]. As shown in Tables I and II for the SN22-test-center and WC14 datasets, respectively, our SV method outperforms state-of-the-art approaches across several metrics in both datasets. Additionally, the inclusion of the PnL module demonstrates its effectiveness by increasing the FS metric by $4.3\%$ on the SN22-test-center dataset (from 74.4 to 78.7) and by $7.6\%$ on the WC14-test one (from 77.6 to 85.2). Minor variations in CR, compared to previous approaches, stem from our method’s requirement for a minimum number of visible keypoints for calibration and from differences in the maximum allowable reprojection error for a parameter estimate to be considered valid. Moreover, variations in CR between our SV models are also due to the differing maximum allowable reprojection errors. The reprojection error is computed from keypoints only, so line-based refinement can increase the keypoint reprojection error while still improving the overall calibration. Qualitative results demonstrating the effect of the PnL refinement module are presented in Fig. 4.

		$\text{JaC}_{\gamma}$ [%]
Dataset	Approach	5	10	20	CR	FS
SN22-test-center	[5] + HDecomp	34.4	64.6	81.3	66.6	22.9
	TVCalib ( $\tau$ ) [46]	57.6	81.7	93.2	93.7	53.9
	TVCalib [46]	54.8	78.5	90.4	100.0	54.8
	[32]	63.9	80.7	86.3	100.0	63.9
	Ours ${}_{\text{SV}}$	75.8	89.7	91.9	98.1	74.4
	Ours ${}_{\text{SV}}$ + PnL	80.6	89.9	92.4	97.7	78.7
SN23-test	[14]	76.6	-	-	73.6	56.3
	Ours ${}_{\text{MV}}$	72.2	84.9	88.4	80.4	58.1
	Ours ${}_{\text{MV}}$ + PnL	76.7	87.2	90.1	79.5	60.9

TABLE I: Evaluating camera calibration on SoccerNet [12] distributions. The table reports our results for the Single-view model on the SN22-test-center dataset as well as for the Multi-view model on the full SN23-test dataset, including in both cases comparisons with competing approaches. For computing the $\text{JaC}_{\gamma}$ metric, we consider $\gamma=\{5,10,20\}$.

	$\text{JaC}_{\gamma}$ [%]
Approach	5	10	20	CR	FS
[5] + HDecomp	32.7	67.3	87.3	81.7	26.7
[26] + HDecomp	36.9	66.4	83.9	84.9	31.3
TVCalib ( $\tau$ ) [46]	41.3	73.6	91.4	95.7	39.5
TVCalib [46]	39.9	71.9	90.5	100.0	39.9
[32]	59.4	82.3	90.9	100.0	59.4
Ours ${}_{\text{SV}}$	77.6	89.8	93.7	100.0	77.6
Ours ${}_{\text{SV}}$ + PnL	85.2	94.0	96.1	100.0	85.2

TABLE II: Quantitative comparison of our Single-view model on camera calibration conducted on the WC14-test dataset. For computing the $\text{JaC}_{\gamma}$ metric, we consider $\gamma=\{5,10,20\}$.

We also evaluate our method on the entire SN23-test dataset, reporting statistics from [14], which represents the only comparable approach capable of extending calibration assessments to full multi-view scenarios. Although the calibration approach proposed by Mavrogiannis et al. [32] extends beyond the main camera, it is limited to calibrating the main, offside, and behind-the-goalpost cameras. Moreover, their method requires training a separate model for each camera location, whereas our approach generalizes keypoint and line extremity detection across the entire SoccerNet distribution using a single model. As shown in Table I for the SN23-test dataset, our MV model, once refined with the PnL module, outperforms the existing approach on all reported metrics. In particular, the PnL refinement module yields a $2.8\%$ increase in the FS metric (from 58.1 to 60.9). The cost function’s weighting parameter is empirically set to $\alpha=0.5$ to maximize the FS metric. Qualitative calibration results of our base pipeline on different camera angles are presented in Fig. 5. The significant decrease in performance compared to the SN22-test-center dataset is attributed to the challenges of calibrating images with camera poses that deviate substantially from the main camera, such as close-up shots, where few or no landmarks are visible, or fisheye shots from inside the goals, where extreme lens distortion makes calibration infeasible. Examples of these challenging camera shots are shown in Fig. 6.

In terms of inference speed, the plain MV model achieves 7 Hz by using the proposed image size and the software and hardware configurations described above. When incorporating the PnL refinement module, the inference speed decreases to an average of 4 Hz during evaluation on the SN23-test dataset. However, the speed for specific frames varies significantly, depending on the number of detected lines and the initialization quality.

IV-D2 Homography Estimation

The proposed method is compared with state-of-the-art approaches [5, 26, 9, 52, 53, 33, 40, 7, 34, 46, 29] on the WC14-test dataset, and with state-of-the-art approaches [5, 33, 7, 29, 34] on the TSWC-test dataset. For computing IoU-based metrics, projection error, and reprojection error, we adopt the approach outlined in [7]. The dimensions of the soccer field template are set to $115\times 74$ yards for a fair comparison with previous approaches.

In the evaluation on the WC14-test dataset, we report the performance metrics published in the respective works, as shown in Table III. For a fair comparison with [9], we use the results of their ours-w/o-players approach, as reported in their paper. Our fine-tuned model produces results comparable to previous approaches across all IoU-based, projection error, and reprojection error metrics. Moreover, with the addition of the PnL refinement module, our method achieves state-of-the-art results across all metrics, further demonstrating the effectiveness of the proposed refinement module in the homography estimation task as well.

For the evaluation on the TSWC-test dataset, we report the results for [33, 7, 29, 34] in Table III, observing how our fine-tuned model with the PnL refinement module outperforms those methods across all metrics.

Dataset	Approach	$\text{IoU}_{\text{part}}$ $\uparrow$ (%)		$\text{IoU}_{\text{whole}}$ $\uparrow$ (%)		Proj. $\downarrow$ (m)		Reproj. $\downarrow$
		Mean	Median	Mean	Median	Mean	Median	Mean	Median
WC14-test	Jiang et al. [26]	95.1	96.7	89.8	92.9	-	-	-	-
	Citraro et al. [9]	-	-	90.5	91.8	-	-	0.018	0.012
	Zhang et al. [52]	95.9	97.5	91.4	94.2	-	-	-	-
	Nie et al. [33]	95.9	97.1	91.6	93.4	0.84	0.65	0.019	0.014
	Shi et al. [40]	96.6	97.8	93.1	94.8	-	-	-	-
	Chu et al. [7]	96.0	97.0	91.2	93.1	0.81	0.63	0.019	0.014
	Zhang et al. [53]	95.9	97.3	91.4	94.1	-	-	-	-
	Maglo et al. [29]	96.3	97.4	92.0	94.1	0.74	0.55	0.018	0.014
	Oo et al. [34]	96.9	97.9	92.9	94.6	0.65	0.46	0.016	0.012
	Ours ${}_{\text{SV}}^{\ast\dagger}$	96.4	97.9	92.4	94.8	0.65	0.44	0.015	0.011
	Ours ${}_{\text{SV}}^{\ast\dagger}$ + PnL	97.0	98.2	93.4	95.5	0.60	0.42	0.014	0.010
TSWC-test	Nie et al. [33] ${}^{\ddagger}$	97.4	97.8	92.5	94.2	0.43	0.38	0.011	0.010
	Chu et al. [7] ${}^{\ddagger}$	98.1	98.2	94.8	95.4	0.36	0.33	0.009	0.008
	Maglo et al. [29] ${}^{\ddagger}$	98.3	98.5	95.7	96.2	0.26	0.23	0.008	0.006
	Oo et al. [34] ${}^{\ddagger}$	98.5	98.7	95.8	96.7	0.26	0.21	0.007	0.006
	Ours ${}_{\text{SV}}^{\ast\ddagger}$	98.2	98.4	94.6	95.8	0.28	0.24	0.007	0.006
	Ours ${}_{\text{SV}}^{\ast\ddagger}$ + PnL	98.6	98.9	96.3	96.8	0.23	0.20	0.005	0.005

TABLE III: Evaluating the homography estimation on WC14-test and TSWC-test. $\ast$ denotes the methods trained on the SoccerNet distribution, $\dagger$ denotes the methods fine-tuned on the WC14 dataset, and $\ddagger$ denotes the methods fine-tuned on the TSWC one.

IV-D3 Ablation Study on Keypoint Sets Contribution

The contribution of each keypoint set, namely $\mathcal{K}p_{e}$, $\mathcal{K}p_{1}$, $\mathcal{K}p_{2}$, and $\mathcal{K}p_{3}$, is analyzed in Table IV. Overall, the integration of each keypoint set results in improvements in the CR and FS metrics, with each set contributing to different geometric elements of the field. The inclusion of $\mathcal{K}p_{e}$ increases CR by providing visible landmarks along the main field’s straight lines, particularly when line-line intersections are scarce in the image. However, despite these improvements in CR and FS, the $\text{JaC}_{\gamma}$ accuracy metrics decrease slightly, as the method is able to calibrate more lines but with some lower-quality calibrations. A similar pattern is observed with the integration of the $\mathcal{K}p_{1}$ set, which boosts CR and FS by adding more visible landmarks to the field’s straight lines and enabling the inclusion of field circles, though calibration quality in these areas remains suboptimal. A significant increase in CR, in the $\text{JaC}_{\gamma}$ metrics, and consequently in FS, is achieved through the integration of the $\mathcal{K}p_{2}$ set, particularly in frames where the field’s center circle is partially visible but midfield line intersections are absent from the image. Finally, the integration of the $\mathcal{K}p_{3}$ set leads to further improvements in most metrics ($\text{JaC}_{20}$ being the exception), enhancing the robustness of the method and setting a new state of the art in sports field registration benchmarks.

				$\text{JaC}_{\gamma}$ [%]
$\mathcal{K}p_{e}$	$\mathcal{K}p_{1}$	$\mathcal{K}p_{2}$	$\mathcal{K}p_{3}$	5	10	20	CR	FS
✗	✗	✗	✗	74.2	89.4	93.7	86.1	63.9
✓	✗	✗	✗	74.0	89.0	93.5	88.8	65.7
✓	✓	✗	✗	74.0	88.3	91.9	94.4	69.9
✓	✓	✓	✗	75.1	89.2	92.6	97.8	73.5
✓	✓	✓	✓	75.8	89.7	91.9	98.1	74.4

TABLE IV: Ablation study of our keypoint sets. The table shows the effect of every keypoint set on the SN22-test-center dataset.

IV-D4 Ablation Study on Refinement Contribution

The contribution of each field geometrical object to the refinement of the initial estimate, namely points, lines, and the full PnL module, is analyzed in Table V. Similar behaviors are observed across the SoccerNet distributions: after obtaining the initial calibration estimate using the proposed keypoint sets, point-based refinement does not significantly alter the results. While point refinement slightly reduces the reprojection error, leading to a minor increase in the CR metric on SN23-test, it simultaneously lowers the accuracy metrics. In contrast, line-based refinement increases the keypoint reprojection error, which lowers CR, but generally improves the accuracy metrics, since the reprojection error is calculated using points and the maximum allowable reprojection error in the method determines which estimates are accepted. Thus, line refinement generally produces more accurate calibration results. Finally, we confirm the previously noted effectiveness of the PnL module, which balances the contributions of both geometrical objects to deliver state-of-the-art results in the SoccerNet benchmarks.

For the WC14 dataset, a different behavior is observed: point-based refinement has no impact, and line-based refinement lowers $\text{JaC}_{5}$, CR, and FS while slightly raising $\text{JaC}_{10}$ and $\text{JaC}_{20}$. Nevertheless, the proposed PnL refinement still achieves state-of-the-art results.

			$\text{JaC}_{\gamma}$ [%]
Dataset	P	L	5	10	20	CR	FS
SN23-test	✗	✗	72.2	84.9	88.4	80.4	58.1
	✓	✗	71.6	84.2	87.7	81.3	58.2
	✗	✓	74.3	86.4	90.8	74.9	55.7
	✓	✓	76.7	87.2	90.1	79.5	60.9
SN22-test-center	✗	✗	75.8	89.7	91.9	98.1	74.4
	✓	✗	75.8	89.8	93.7	98.1	74.4
	✗	✓	76.3	89.4	94.3	96.5	73.7
	✓	✓	80.6	89.9	92.4	97.7	78.7
WC14-test	✗	✗	77.6	89.8	93.7	100.0	77.6
	✓	✗	77.6	89.8	93.7	100.0	77.6
	✗	✓	76.5	90.8	94.5	99.4	76.1
	✓	✓	85.2	94.0	96.1	100.0	85.2

TABLE V: Ablation study of points and lines contribution in the refinement module. The table presents the impact of points and lines, both individually and jointly, on the calibration estimate refinement across the SN23-test, SN22-test-center, and WC14-test datasets.

V Conclusion

In this paper, we introduce a novel framework for 3D sports field registration. Our proposed pipeline adopts a minimalist approach by solely utilizing the geometric properties of the soccer field. We demonstrate superior performance in 3D camera calibration on the SoccerNet and WorldCup 2014 datasets compared to state-of-the-art methods, while also achieving comparable results in homography estimation on the WorldCup 2014 and TS-WorldCup datasets. Additionally, we introduce a novel calibration refinement module that leverages field landmarks, such as keypoints and lines, and which, when integrated with our calibration pipeline, further enhances performance in 3D camera calibration benchmarks and achieves state-of-the-art results in homography estimation. We also extend our pipeline to multi-view camera calibration, establishing the state of the art for multi-view broadcast-based camera calibration in soccer. Our method exhibits promising results, highlighting the effectiveness of utilizing a robust field template without the need for further refinements. Furthermore, the refinement module improves calibration results, particularly when visible field landmarks are scarce. As long as the video frame distortions are not too harsh (e.g., fisheye camera shots) and a minimum of four keypoints are visible, the proposed sports field registration approach is shown to be effective. In future work, we plan to enhance our approach by incorporating temporal consistency between subsequent video frames, aligning better with the nature of sports video broadcasts.

Acknowledgment. This work has been supported by the project GRAVATAR PID2023-151184OB-I00, funded by MCIU/AEI/10.13039/501100011033 and by ERDF, UE, and by the Government of Catalonia under the Joan Oró FI 2024 grant.

References

[1] Acuna, R., Willert, V.: Insights into the robustness of control point configurations for homography and planar pose estimation. arXiv preprint arXiv:1803.03025 (2018)
[2] Agudo, A.: Total estimation from RGB video: On-line camera self-calibration, non-rigid shape and motion. In: ICPR (2020)
[3] Baker, S., Matthews, I.: Lucas-Kanade 20 years on: A unifying framework. IJCV 56, 221–255 (2004)
[4] Blanchard, N., Skinner, K., Kemp, A., Scheirer, W., Flynn, P.: "Keep me in, coach!": A computer vision perspective on assessing ACL injury risk in female athletes. In: WACV (2019)
[5] Chen, J., Little, J.J.: Sports camera calibration via synthetic data. In: CVPRW (2019)
[6] Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
[7] Chu, Y.J., Su, J.W., Hsiao, K.W., Lien, C.Y., Fan, S.H., Hu, M.C., Lee, R.R., Yao, C.Y., Chu, H.K.: Sports field registration via keypoints-aware label condition. In: CVPR (2022)
[8] Cioppa, A., Deliege, A., Giancola, S., Ghanem, B., Van Droogenbroeck, M.: Scaling up SoccerNet with multi-view spatial localization and re-identification. Scientific Data 9(1), 355 (2022)
[9] Citraro, L., Márquez-Neila, P., Savare, S., Jayaram, V., Dubout, C., Renaut, F., Hasfura, A., Ben Shitrit, H., Fua, P.: Real-time camera pose estimation for sports fields. MVA 31, 1–13 (2020)
[10] Claasen, P.J., de Villiers, J.P.: Video-based sequential Bayesian homography estimation for soccer field registration. Expert Systems with Applications 252, 124156 (2024)
[11] Cuevas, C., Quilon, D., García, N.: Automatic soccer field of play registration. Pattern Recognition 103, 107278 (2020)
[12] Deliege, A., Cioppa, A., Giancola, S., Seikavandi, M.J., Dueholm, J.V., Nasrollahi, K., Ghanem, B., Moeslund, T.B., Van Droogenbroeck, M.: SoccerNet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos. In: CVPR (2021)
[13] Evangelidis, G.D., Psarakis, E.Z.: Parametric image alignment using enhanced correlation coefficient maximization. TPAMI 30(10), 1858–1865 (2008)
[14] Falaleev, N.S., Chen, R.: Enhancing soccer camera calibration through keypoint exploitation. In: ACMMMW (2024)
[15] Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6), 381–395 (1981)
[16] Ghanem, B., Zhang, T., Ahuja, N.: Robust video registration applied to field-sports video analysis. In: ICASSP (2012)
[17] Gomez-Ojeda, R., Moreno, F.A., Zuniga-Noël, D., Scaramuzza, D., Gonzalez-Jimenez, J.: PL-SLAM: A stereo SLAM system through the combination of points and line segments. TRO 35(3), 734–746 (2019)
[18] Gupta, A., Little, J.J., Woodham, R.J.: Using line and ellipse features for rectification of broadcast hockey video. In: CRV (2011)
[19] Gutiérrez-Pérez, M., Agudo, A.: No bells just whistles: Sports field registration by leveraging geometric properties. In: CVPRW (2024)
[20] Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: CVPR (2006)
[21] Halır, R., Flusser, J.: Numerically stable direct least squares fitting of ellipses. In: WSCG. vol. 98, pp. 125–132 (1998)
[22] Hartley, R., Zisserman, A.: Multiple view geometry in computer vision. Cambridge University Press (2003)
[23] Homayounfar, N., Fidler, S., Urtasun, R.: Sports field localization via deep structured models. In: CVPR (2017)
[24] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR. pp. 1125–1134 (2017)
[25] Jazwinski, A.H.: Stochastic processes and filtering theory. Courier Corporation (2007)
[26] Jiang, W., Higuera, J.C.G., Angles, B., Sun, W., Javan, M., Yi, K.M.: Optimizing through learned errors for accurate sports field registration. In: WACV (2020)
[27] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60, 91–110 (2004)
[28] Magera, F., Hoyoux, T., Barnich, O., Van Droogenbroeck, M.: A universal protocol to benchmark camera calibration for sports. In: CVPR (2024)
[29] Maglo, A., Orcesi, A., Denize, J., Pham, Q.C.: Individual locating of soccer players from a single moving view. Sensors 23(18), 7938 (2023)
[30] Maglo, A., Orcesi, A., Pham, Q.C.: KaliCalib: A framework for basketball court registration. In: MMW (2022)
[31] Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide-baseline stereo from maximally stable extremal regions. IMAVIS 22(10), 761–767 (2004)
[32] Mavrogiannis, P., Maglogiannis, I.: Using synthetic camera poses for camera calibration in soccer videos. Multimedia Tools and Applications pp. 1–25 (2024)
[33] Nie, X., Chen, S., Hamid, R.: A robust and efficient framework for sports-field registration. In: WACV (2021)
[34] Oo, Y.M., Jamsrandorj, A., Chao, V., Mun, K.R., Kim, J.: A residual attention-based EfficientNet homography estimation model for sports field registration. In: IECON (2023)
[35] Perez-Yus, A., Agudo, A.: Matching and recovering 3D people from multiple views. In: WACV (2022)
[36] Pumarola, A., Vakhitov, A., Agudo, A., Sanfeliu, A., Moreno-Noguer, F.: PL-SLAM: Real-time monocular visual SLAM with points and lines. In: ICRA (2017)
[37] Puwein, J., Ziegler, R., Vogel, J., Pollefeys, M.: Robust multi-view camera calibration for wide-baseline camera networks. In: WACV (2011)
[38] Sha, L., Hobbs, J., Felsen, P., Wei, X., Lucey, P., Ganguly, S.: End-to-end camera calibration for broadcast videos. In: CVPR (2020)
[39] Sharma, R.A., Bhat, B., Gandhi, V., Jawahar, C.: Automated top view registration of broadcast football videos. In: WACV (2018)
[40] Shi, F., Marchwica, P., Higuera, J.C.G., Jamieson, M., Javan, M., Siva, P.: Self-supervised shape alignment for sports field registration. In: WACV (2022)
[41] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
[42] Smith, C., Schirgi, H., Stiegler, C., Heyral, K., Smith, G., Filochowski, K., Ferguson, A., Hasselsjo, E., Hodge, E., Kistenyov, D.: FIFA football stadiums guidelines. https://publications.fifa.com/en/football-stadiums-guidelines/ (Accessed 2024-02-26)
[43] Spain, B.: Analytical conics. Courier Corporation (2007)
[44] Tan, M., Le, Q.: EfficientNetV2: Smaller models and faster training. In: ICML (2021)
[45] Tarashima, S.: SFLNet: Direct sports field localization via CNN-based regression. In: ACPR (2020)
[46] Theiner, J., Ewerth, R.: TVCalib: Camera calibration for sports field registration in soccer. In: WACV (2023)
[47] Triggs, B., McLauchlan, P.F., Hartley, R.I., Fitzgibbon, A.W.: Bundle adjustment—a modern synthesis. In: Vision Algorithms: Theory and Practice: International Workshop on Vision Algorithms Corfu, Greece, September 21–22, 1999 Proceedings. pp. 298–372. Springer (2000)
[48] Vakhitov, A., Funke, J., Moreno-Noguer, F.: Accurate and linear time pose estimation from points and lines. In: ECCV (2016)
[49] Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., et al.: Deep high-resolution representation learning for visual recognition. TPAMI 43(10), 3349–3364 (2020)
[50] Wang, X., Bo, L., Fuxin, L.: Adaptive wing loss for robust face alignment via heatmap regression. In: ICCV (2019)
[51] Wang, Z., Veličković, P., Hennes, D., Tomašev, N., Prince, L., Kaisers, M., Bachrach, Y., Elie, R., Wenliang, L.K., Piccinini, F., et al.: TacticAI: An AI assistant for football tactics. Nature Communications 15(1), 1906 (2024)
[52] Zhang, N., Izquierdo, E.: A high accuracy camera calibration method for sport videos. In: VCIP (2021)
[53] Zhang, N., Izquierdo, E.: A four-point camera calibration method for sport videos. TCSVT (2023)
[54] Zhang, Z.: A flexible new technique for camera calibration. TPAMI 22(11), 1330–1334 (2000)
[55] Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)
