1 Introduction

The Closest Pair of Points or Closest Pair problem is one of the fundamental problems in Computational Geometry: given a set P of points in \(\mathbb {R}^d\), find a closest pair of P, i.e. two points \(p_0 \in P\) and \(p_1 \in P\) (\(p_0 \ne p_1\)) such that the distance between \(p_0\) and \(p_1\) is less than or equal to the distance between any other pair of distinct points of P.

Shamos and Hoey [25] are among the first to consider this problem and introduce an algorithm based on Voronoi diagrams for the planar case, improving the running time of the best known algorithms at the time from \(\mathcal {O}(n^2)\) to \(\mathcal {O}(n \log n)\). They also prove that this running time is optimal for a deterministic computation model. One year later, in 1976, Bentley and Shamos [2] publish a, likewise optimal, divide-and-conquer algorithm to solve the Closest Pair problem that can be non-trivially extended to work in arbitrary dimensions. Since then the problem has been the focus of extensive research and a multitude of optimal algorithms have been published. Smid [24] provides a comprehensive overview of the available algorithms, including randomized approaches which improve the expected running time even further to \(\mathcal {O}(n)\).

The main contribution of this paper is the first verification of two related functional implementations of the divide-and-conquer algorithm solving the Closest Pair problem for the two-dimensional Euclidean plane with the optimal running time of \(\mathcal {O}(n \log n)\). We use the interactive theorem prover Isabelle/HOL [17, 18] to prove functional correctness as well as the running time of the algorithms. In contrast to many publications and implementations we do not assume all points of P to have unique x-coordinates, which causes some tricky complications. Empirical testing shows that our verified algorithms are competitive with handwritten reference implementations. Our formalizations are available online [23] in the Archive of Formal Proofs.

This paper is structured as follows: Sect. 2 familiarizes the reader with the algorithm by presenting a high-level description that covers both implementations. Section 3 presents the first implementation and its functional correctness proof. Section 4 proves the running time of \(\mathcal {O}(n \log n)\) of the implementation of the previous section. Section 5 describes our second implementation and illustrates how the proofs of Sects. 3 and 4 need to be adjusted. We also give an overview of further implementation approaches. Section 6 describes final adjustments to obtain executable versions of the algorithms in target languages such as OCaml and SML and evaluates them against handwritten imperative and functional implementations. Section 7 concludes.

1.1 Related Verification Work

Computational geometry is a vast area but only a few algorithms and theorems seem to have been verified formally. We are aware of a number of verifications of convex hull algorithms [4, 14, 20] (and a similar algorithm for the intersection of zonotopes [12]) and algorithms for triangulation [3, 7]. Geometric models based on maps and hypermaps [6, 22] are frequently used.

Work on theorem proving in geometry (see [15] for an overview) is also related but considers fixed geometric constructions rather than algorithms.

1.2 Isabelle/HOL and Notation

The notation \(t :: \tau\) means that term t has type \(\tau\). Basic types include bool, int, and real; type variables are written 'a, 'b, etc.; the function space arrow is \(\Rightarrow\). Functions fst and snd return the first and second component of a pair.

We suppress numeric conversion functions, e.g. the coercion from int to real, except where that would result in ambiguities for the reader.

Most type constructors are written postfix, e.g. 'a list and 'a set. Sets follow standard mathematical notation. Lists are constructed from the empty list [] via the infix cons-operator #. Functions hd and tl return head and tail, function set converts a list into a set.

2 Closest Pair Algorithm

In this section we provide a high-level overview of the Closest Pair algorithm and give the reader a first intuition without delving into implementation details, functional correctness or running time proofs.

Let P denote a set of \(n \ge 2\) points. If \(n \le 3\) we solve the problem naively using the brute force approach of examining all \(\binom{n}{2}\) possible closest pair combinations. Otherwise we apply the divide-and-conquer tactic.
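For reference, the brute-force base case can be sketched as follows (a Python sketch for illustration only; the verified algorithms are written in Isabelle/HOL and the names here are ours):

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python >= 3.8)

def closest_pair_bf(points):
    """Examine all binom(n, 2) pairs and keep a closest one; O(n^2),
    fine for the n <= 3 base case of the recursion."""
    assert len(points) >= 2
    return min(combinations(points, 2), key=lambda pq: dist(*pq))
```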

We divide P into two sets \(P_L\) and \(P_R\) along a vertical line l such that the sizes of \(P_L\) and \(P_R\) differ by at most 1 and the x-coordinate of all points \(p_L \in P_L\,(p_R \in P_R)\) is \(\le l\) (\(\ge l\)).

We then conquer the left and right subproblems by applying the algorithm recursively, obtaining \((p_{L0},\;p_{L1})\) and \((p_{R0},\;p_{R1})\), the respective closest pairs of \(P_L\) and \(P_R\). Let \(\delta _L\) and \(\delta _R\) denote the distance of the left and right closest pairs and let \(\delta = min \;\delta _L\;\delta _R\). At this point the closest pair of P is either \((p_{L0},\; p_{L1})\), \((p_{R0},\,p_{R1})\) or a pair of points \(p_0 \in P_L\) and \(p_1 \in P_R\) with a distance strictly less than \(\delta \). In the first two cases we have already found our closest pair.

Now we assume the third case and have reached the most interesting part of divide-and-conquer algorithms, the combine step. It is not hard to see that both points of the closest pair must be within a \(2\delta \) wide vertical strip centered around l. Let \( ps \) be a list of all points in P that are contained within this \(2\delta \) wide strip, sorted in ascending order by y-coordinate. We can find our closest pair by iterating over \( ps \) and computing for each point its closest neighbor. But in the worst case \( ps \) contains all points of P, and we might think our only option is to again check all \(\binom{n}{2}\) point combinations. This is not the case. Let p denote an arbitrary point of \( ps \), depicted as the square point in Fig. 1. If p is one of the points of the closest pair, then the distance to its closest neighbor is strictly less than \(\delta \) and we only have to check all points \(q \in set\ ps \) that are contained within the \(2\delta \) wide horizontal strip centered around the y-coordinate of p.

Fig. 1.
figure 1

The combine step

In Sect. 4 we prove that, for each \(p \in set\ ps \), it suffices to check only a constant number of closest point candidates. This fact allows for an implementation of the step that runs in linear time and ultimately lets us achieve the familiar recurrence of \(T(n) = T(\lceil n/2 \rceil ) + T(\lfloor n/2 \rfloor ) + \mathcal {O}(n)\), which results in the running time of \(\mathcal {O}(n \log n)\).

We glossed over some implementation details to achieve this time complexity. In Sect. 3 we refine the algorithm and in Sect. 4 we prove the \(\mathcal {O}(n \log n)\) running time.

3 Implementation and Functional Correctness Proof

We present the implementation of the divide-and-conquer algorithm and the corresponding correctness proofs using a bottom-up approach, starting with the combine step. The basis for both implementation and proof is the version presented by Cormen et al. [5]. But first we need to define the Closest Pair problem formally.

A point in the two-dimensional Euclidean plane is represented as a pair of (unbounded) integersFootnote 1. The library HOL-Analysis provides a generic distance function applicable to our point definition. The definition of this specific instance corresponds to the familiar Euclidean distance measure.

The closest pair problem can be stated formally as follows: A set of points P is \(\mathbf {\delta }\)-sparse iff \(\delta \) is a lower bound for the distance of all distinct pairs of points of P.

figure aj

We can now state easily when two points \(p_0\) and \(p_1\) are a closest pair of P: \(p_0 \in P\), \(p_1 \in P\), \(p_0 \ne p_1\), and (most importantly) P is (dist \(p_0\) \(p_1\))-sparse.

In the following we focus on outlining the proof of the sparsity property of our implementation, without going into the details. The additional set membership and distinctness properties of a closest pair can be proved relatively straightforwardly by adhering to a similar proof structure.

3.1 The Combine Step

The essence of the combine step deals with the following scenario: We are given an initial pair of points with a distance of \(\delta \) and a list \( ps \) of points, sorted in ascending order by y-coordinate, that are contained in the \(2\delta \) wide vertical strip centered around l (see Fig. 1). Our task is to efficiently compute a pair of points with a distance \(\delta ' \le \delta \) such that \( ps \) is \(\delta '\)-sparse. The recursive function find_closest_pair achieves this by iterating over \( ps \), computing for each point its closest neighbor by calling the recursive function find_closest that considers only the points within the shaded square of Fig. 1, and updating the current pair of closest points if the newly found pair is closer together. We omit the implementation of the trivial base cases.

figure as
figure at

There are several noteworthy aspects of this implementation. The recursive search for the closest neighbor of a given point p of \( ps \) starts at the first point spatially above p, continues upwards and is stopped early at the first point whose vertical distance to p is equal to or exceeds the given \(\delta \). Thus find_closest considers, in contrast to Fig. 1, only the upper half of the shaded square during this search. This is sufficient for the computation of a closest pair because for each possible point q preceding p in \( ps \) we already considered the pair \((q,\,p)\), if needed, and do not have to re-check \((p,\,q)\) due to the commutative property of our closest pair definition. Note also that \(\delta \) is updated, if possible, during the computation and consequently the search space for each point is limited to a \(2\delta \times \delta '\) rectangle with \(\delta ' \le \delta \). Lastly we intentionally do not minimize the number of distance computations. In Sect. 6 we make this optimization for the final executable code.

The following lemma captures the desired sparsity property of our implementation of the combine step so far. It is proved by induction on the computation of find_closest_pair.

Lemma 1

where the sortedness premise means that \( ps \) is sorted in ascending order by y-coordinate.

We wrap up the combine step by limiting our search for the closest pair to only the points contained within the \(2\delta \) wide vertical strip and choosing, as the initial pair of points of find_closest_pair, the closest pair with the smaller distance \(\delta \) of the two recursive invocations of our divide-and-conquer algorithm.

figure ba

Lemma 2 shows that if there exists a pair of distinct points with a distance less than \(\delta \), then both its points are contained in the mentioned vertical strip; otherwise we have already found our closest pair, and the pair returned by the combine step is its initial argument.

Lemma 2

We can then prove, additionally using Lemma 1, that the combine step indeed computes a pair of points \((p_0,\,p_1)\) such that our given list of points is (dist \(p_0\) \(p_1\))-sparse.

Lemma 3

One can also show that \(p_0\) and \(p_1\) are in \( ps \) and distinct (and thus a closest pair of \( ps \)), provided the points of both argument pairs are in \( ps \) and distinct.

3.2 The Divide & Conquer Algorithm

In Sect. 2 we glossed over some implementation details which are necessary to achieve the running time of \(\mathcal {O}(n \log n)\). In particular we need to partition the given listFootnote 2 of points \( ps \) along a vertical line l into two lists of nearly equal length during the divide step and obtain a list \( ys \) of the same points, sorted in ascending order by y-coordinate, for the combine step in linear time at each level of recursion.

Cormen et al. propose the following top-down approach: Their algorithm takes three arguments: the point set P and lists \( xs \) and \( ys \) which contain the same points as P, sorted by x- and y-coordinate, respectively. The algorithm first splits \( xs \) at its midpoint into two still sorted lists \( xs_L \) and \( xs_R \) and chooses l as either the x-coordinate of the last element of \( xs_L \) or the x-coordinate of the first element of \( xs_R \). It then constructs the sets \(P_L\) and \(P_R\), consisting of the points of \( xs_L \) and \( xs_R \), respectively. For the recursive invocations it needs to obtain in addition lists \( ys_L \) and \( ys_R \) that are still sorted by y-coordinate and again refer to the same points as \( xs_L \) and \( xs_R \), respectively. It achieves this by iterating once through \( ys \) and checking for each point if it is contained in \(P_L\) or not, constructing \( ys_L \) and \( ys_R \) along the way.

But this approach requires an implementation of sets. In fact, if we want to achieve the overall worst case running time of \(\mathcal {O}(n \log n)\) it requires an implementation of sets with linear time construction and constant time membership test, which is nontrivial, in particular in a functional setting. To avoid sets many publications and implementations either assume all points to have unique x-coordinates or preprocess the points, for example by applying a rotation, such that the input fulfills this condition. For distinct x-coordinates one can then compute \( ys_L \) and \( ys_R \) by simply filtering \( ys \) depending on the x-coordinate of the points relative to l and eliminate the usage of sets entirely.

But there exists a third option which we have found only in Cormen et al. where it is merely hinted at in an exercise left to the reader. The approach is the following. Looking at the overall structure of the closest pair algorithm we recognize that it closely resembles the structure of a standard mergesort implementation and that we only need \( ys \) for the combine step after the two recursive invocations of the algorithm. Thus we can obtain \( ys \) by merging 'along the way' using a bottom-up approach. This is the actual code:

figure di
figure dj

Function closest_pair_rec expects a list \( xs \) of points sorted by x-coordinate. The construction of \( xs_L \), \( xs_R \) and l is analogous to Cormen et al. In the base case we then sort \( xs \) by y-coordinate and compute the closest pair using the brute-force approach. The recursive call of the algorithm on \( xs_L \) returns, in addition to the closest pair of \( xs_L \), the list \( ys_L \), containing all points of \( xs_L \) but now sorted by y-coordinate; analogously for \( xs_R \) and \( ys_R \). Furthermore, we reuse the merging function of our mergesort implementation, which we also utilize to presort the points by x-coordinate, to obtain \( ys \) from \( ys_L \) and \( ys_R \) in linear time at each level of recursion.

Splitting of \( xs \) is performed by an auxiliary function via a simple linear pass over \( xs \). Our mergesort implementation sorts a list of points depending on a given projection function, fst for 'by x-coordinate' and snd for 'by y-coordinate'.

Using Lemma 3, the functional correctness proofs of our mergesort implementation and several auxiliary lemmas proving that closest_pair_rec also sorts the points by y-coordinate, we arrive at the correctness proof of the desired sparsity property of the algorithm.

Theorem 1

Corollary 1 together with Theorems 2 and 3 then shows that the computed pair is indeed a closest pair of the input points.

Corollary 1

Theorem 2

Theorem 3

4 Time Complexity Proof

To formally verify the running time we follow the approach in [16]. For each function f we define a function t_f that takes the same arguments as f but computes the number of function calls the computation of f needs, the 'time'. Function t_f follows the same recursion structure as f and can be seen as an abstract interpretation of f. To ensure the absence of errors we derive f and t_f from a monadic function that computes both the value and the time, but for simplicity of presentation we present only f and t_f. We also simplify matters a bit: we count only expensive operations where the running time increases with the size of the input; in particular we assume constant time arithmetic and ignore small additive constants. For reasons of space we only show one example of such a 'timing' function, t_find_closest, which is crucial to our time complexity proof.

figure fg

We set the time to execute distance computations to zero since a distance computation is a combination of cheap operations. For the base cases of recursive functions we fix the computation time to be equivalent to the size of the input. This choice of constants is arbitrary and has no impact on the overall running time analysis but in general leads to 'cleaner' arithmetic bounds.

4.1 Time Analysis of the Combine Step

In Sect. 2 we claimed that the running time of the algorithm is captured by the recurrence \(T(n) = T(\lceil n/2 \rceil ) + T(\lfloor n/2 \rfloor ) + \mathcal {O}(n)\), where n is the length of the given list of points. This claim implies an at most linear overhead at each level of recursion. Splitting of the list, merging \( ys_L \) and \( ys_R \) and the filtering operation of the combine step run in linear time. But it is non-trivial that the function find_closest_pair, central to the combine step, also exhibits a linear time complexity. It is applied to an argument list of, in the worst case, length n, iterates once through the list and calls find_closest for each element. Consequently our proof obligation is the constant running time of find_closest or, considering our timing function, that the result of t_find_closest is bounded by some constant c in the context of the combine step.

Looking at the definition of find_closest we see that the function terminates as soon as it encounters the first point within the given list \( ps \) whose vertical distance to p is at least \(\delta \), the point p being an argument to find_closest, or if \( ps \) is a list of length \(\le 1\). The corresponding timing function t_find_closest computes the number of recursive function calls, which is, in this case, synonymous with the number of examined points. For our time complexity proof it suffices to show the following bound on the result of t_find_closest. The proof is by induction on the computation of find_closest. The function count f is an abbreviation for length \(\circ\) filter f.
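Abstractly, the pattern looks as follows (a Python rendering of the idea with our own simplified definitions; the Isabelle functions differ in detail):

```python
from math import dist

def find_closest(p, delta, ps):
    """Closest neighbor of p among a prefix of the y-sorted, nonempty
    list ps, stopping early once delta is exceeded vertically."""
    c, *cs = ps
    if not cs or delta <= c[1] - p[1]:
        return c
    c2 = find_closest(p, min(delta, dist(p, c)), cs)
    return c if dist(p, c) <= dist(p, c2) else c2

def t_find_closest(p, delta, ps):
    """Abstract interpretation of find_closest: same recursion
    structure, but computes the number of calls, the 'time'."""
    c, *cs = ps
    if not cs or delta <= c[1] - p[1]:
        return 1
    return 1 + t_find_closest(p, min(delta, dist(p, c)), cs)
```

The timing function discards the computed point and keeps only the call count, which is exactly the number of examined points.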

Lemma 4

Therefore we need to prove that the count term does not depend on the length of \( ps \). Looking back at Fig. 1, with the square point representing p, we can assume that the list \( ps \) is distinct and sorted in ascending order by y-coordinate. From the precomputing effort of the combine step we know that its points are contained within the \(2\delta \) wide vertical strip centered around l and can be split into two sets \(P_L\) (\(P_R\)) consisting of all points which lie to the left (right) of or on the line l. Due to the two recursive invocations of the algorithm during the conquer step we can additionally assume that both \(P_L\) and \(P_R\) are \(\delta \)-sparse, suggesting the following lemma which implies the desired constant bound and thus the constant running time of find_closest.

Lemma 5

Proof

The library HOL-Analysis defines some basic geometric building blocks needed for the proof. A closed box describes points contained within rectangular shapes in Euclidean space. For our purposes the planar definition is sufficient.

figure hb

The box is ‘closed’ since it includes points located on the border of the box. We then introduce some useful abbreviations:

  • The rectangle R is the upper half of the shaded square of Fig. 1:

  • The set \( ps_R \) consists of all points of \( ps \) that are encompassed by R:

  • The squares \( S_L \) and \( S_R \) denote the left and right halves of R:

  • The set \( P_L' \) holds all points of \(P_L\) that are contained within the square \( S_L \). The definition of \( P_R' \) is analogous:

First we prove that the count term of Lemma 4 is bounded by the number of points of \( ps \) contained in R: Let q denote an arbitrary point of \( ps \) that satisfies the filter predicate. We know \( snd\ p \le snd\ q \) because the list \( ps \), and therefore the filtered list, is sorted in ascending order by y-coordinate, and \( snd\ q - snd\ p < \delta \) due to the filter predicate. Using the additional facts that all points are contained within the \(2\delta \) wide strip (derived from our assumptions) and the definition of R we conclude that q is contained in R. Since filtering maintains the distinctness property of \( ps \), our intermediate goal immediately follows because the set of points of \( ps \) within R is finite.

But how many points can there be in R? Let us first try to determine an upper bound for the number of points in the left half of R. All these points are contained within the square \( S_L \) whose side length is \(\delta \). Moreover, since \(P_L\) is \(\delta \)-sparse and these points form a subset of \(P_L\), they are also \(\delta \)-sparse; in other words, the distance between each distinct pair of these points is at least \(\delta \). Therefore their number is bounded by the number of points we can maximally fit into \( S_L \), maintaining a pairwise minimum distance of \(\delta \). As the left-hand side of Fig. 2 depicts, we can arrange at most four points adhering to these restrictions. An analogous argument holds for the number of points in the right half of R. Consequently R contains a constant number of points of \( ps \) and our original statement follows.    \(\square \)

Fig. 2.
figure 2

Core argument.

Note that the intermediate proof for the bound on the number of points per square relies on basic human geometric intuition. Indeed Cormen et al. [5] and most of the proofs in the literature do so as well. But for a formal proof we have to be rigorous. First we show two auxiliary lemmas: the maximum distance between two points in a square with side length \(\delta \) is less than or equal to \(\sqrt{2}\delta \).

Lemma 6

The proof is straightforward. Both points are contained within the square, the difference between their x- and y-coordinates is hence bounded by \(\delta \), and we finish the proof using the definition of the Euclidean distance. Below we employ the following variation of the pigeonhole principle:

Lemma 7

Finally we replace human intuition with formal proof:

Lemma 8

Proof

Let S denote the square with a side length of \(\delta \) and suppose, for the sake of contradiction, that S contains more than four points of P. Then S can be split into four congruent squares with side length \(\delta /2\) along the common center point, as depicted by the right-hand side of Fig. 2. Since all relevant points of P are contained within S and S is covered by the four sub-squares, we can apply Lemma 7 and, using our assumption, know there exists a sub-square containing a pair of distinct points p and q of P. Lemma 6 and the fact that all four sub-squares have the same side length \(\delta /2\) show that the distance between p and q must be less than or equal to \(\frac{\sqrt{2}}{2}\delta \) and hence strictly less than \(\delta \). But we also know that \(\delta \) is a lower bound for this distance because p and q are distinct points of P and our premise states that P is \(\delta \)-sparse, a contradiction.   \(\square \)
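The pigeonhole argument can be played through concretely (a Python illustration with names of our choosing, not part of the formalization):

```python
from math import dist, sqrt

def same_quadrant_pair(points, x0, y0, delta):
    """Given more than four points inside the closed square
    [x0, x0 + delta] x [y0, y0 + delta], two of them must share one
    of the four congruent sub-squares of side delta / 2 (the
    pigeonhole step); return such a pair."""
    assert len(points) > 4
    seen = {}
    for p in points:
        quadrant = (p[0] >= x0 + delta / 2, p[1] >= y0 + delta / 2)
        if quadrant in seen:
            return seen[quadrant], p
        seen[quadrant] = p
    raise AssertionError("unreachable: more points than quadrants")

# By the distance bound for squares, any such pair is at most
# sqrt(2)/2 * delta apart, hence strictly closer than delta --
# contradicting delta-sparsity.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
p, q = same_quadrant_pair(pts, 0.0, 0.0, 1.0)
assert dist(p, q) <= sqrt(2) / 2 + 1e-12
```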

4.2 Time Analysis of the Divide & Conquer Algorithm

In summary, the time to evaluate find_closest is constant in the context of the combine step, and thus evaluating find_closest_pair as well as the whole combine step takes time linear in the length of \( ps \).

Next we turn our attention to the timing of closest_pair_rec and derive (but do not show) the corresponding timing function t_closest_pair_rec. At this point we could prove a concrete bound on t_closest_pair_rec directly. But since we are dealing with a divide-and-conquer algorithm we should, in theory, be able to determine its running time using the 'master theorem' [5]. This is, in practice, also possible in Isabelle/HOL. Eberl [8] has formalized the Akra-Bazzi theorem [1], a generalization of the master theorem. Using this formalization we can derive the running time of our divide-and-conquer algorithm without a direct proof for t_closest_pair_rec. First we capture the essence of t_closest_pair_rec as a recurrence on natural numbers representing the length of the list argument of closest_pair_rec:

figure lf

The time complexity of this recurrence is proved completely automatically:

Lemma 9
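The recurrence itself is easy to play with numerically (an informal sanity check, not part of the formal proof; the constant 6 below is our own, generous choice):

```python
from functools import lru_cache
from math import ceil, floor, log2

@lru_cache(maxsize=None)
def t(n):
    """T(n) = T(ceil(n/2)) + T(floor(n/2)) + n, with T(n) = n for
    n <= 3, mirroring the base-case convention described above."""
    if n <= 3:
        return n
    return t(ceil(n / 2)) + t(floor(n / 2)) + n
```

For example, t(4) = t(2) + t(2) + 4 = 8, and sampling confirms that t(n) stays well below a constant multiple of n * log2(n).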

Next we need to connect this bound with our timing function. Lemma 10 below expresses a procedure for deriving complexity properties of the form

figure lh

where t is a timing function on the data domain, in our case lists. The function m is a measure on that data domain, r is a recurrence or any other function of type \( nat \Rightarrow real \), and A is the set of valid inputs. The term 'eventually' should be read as 'if the measured size of valid inputs is sufficiently large' and utilizes Eberl's formalization of Landau notation [9] and the "filter" machinery of asymptotics in Isabelle/HOL [11]. For readability we omit stating the filter and the measure explicitly in the following and just state the conditions required of the input. The measure always corresponds to the length function.

Lemma 10

Lemma 11

Using Lemmas 9, 10 and 11 we arrive at Theorem 4, expressing our main claim: the running time of the divide-and-conquer algorithm.

Theorem 4

Since the function closest_pair only presorts the given list of points using our mergesort implementation and then calls closest_pair_rec, we obtain Corollary 2 and finish the time complexity proof.

Corollary 2

For inputs:

5 Alternative Implementations

In the literature there exist various other algorithmic approaches to solve the Closest Pair problem. Most of them are closely related to our implementation of Sect. 3, deviating primarily in two aspects: the exact implementation of the combine step and the approach to sorting the points by y-coordinate which we already discussed in Subsect. 3.2. We present a short overview, concentrating on the combine step and the second implementation we verified.

5.1 A Second Verified Implementation

Although the algorithm described by Cormen et al. is the basis for our implementation of Sect. 3, we took the liberty to optimize it. During execution our algorithm searches for the closest neighbor of p within the rectangle R, the upper half of the shaded square of Fig. 1, and terminates the search if it examines points on or beyond the upper border of R. Cormen et al. originally follow a slightly different approach. They search for a closest neighbor of p by examining a constant number of the points following p, the first 7 to be exact. This is valid because there are at most 7 points within R, not counting p, and the 8th point would again lie on or beyond the upper border of R. This slightly easier implementation comes at the cost of being less efficient in practice. Cormen et al. always assume the worst case by checking all 7 points following p. But it is unlikely that the algorithm needs to examine even close to 7 points, except for specifically constructed inputs. They furthermore state that the bound of 7 is an over-approximation and dare the reader to lower it as an exercise. We refrain from doing so since a bound of 7 suffices for the time complexity proof of our, inherently faster, implementation. At this point we should also mention that the specific optimization of Sect. 3 is not our idea but rather an algorithmic detail which is unfortunately rarely mentioned in the literature.

Nonetheless we can adapt the implementation of Sect. 3 and the proofs of Sect. 4 to verify the original implementation of Cormen et al. as follows: We replace each call of find_closest by a call to a brute-force variant which iterates through the constant number of points following p to find its closest neighbor. To verify this implementation we then reuse most of the elementary lemmas and proof structure of Sects. 3 and 4; only a slightly adapted version of Lemma 5 is necessary. Note that this lemma was previously solely required for the time complexity proof of the algorithm. Now it is already necessary during the functional correctness proof since we need to argue that examining only a constant number of points is sufficient. The time analysis is overall greatly simplified: a call of the brute-force variant runs in constant time and we are again able to reuse the remaining time analysis proof structure of Sect. 4. For the exact differences between both formalizations we encourage the reader to consult our entry in the Archive of Formal Proofs [23].
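The difference between the two combine variants can be sketched as follows (Python, our own names; 7 is the candidate bound of Cormen et al., and ps is the y-sorted list of points following p in the strip):

```python
from math import dist

def find_closest_bf(p, ps):
    """Cormen et al.: always examine a fixed number of candidates."""
    return min(ps[:7], key=lambda q: dist(p, q))

def find_closest(p, delta, ps):
    """Our variant: stop at the first point lying delta or more above
    p; in practice this usually examines far fewer than 7 points."""
    c = ps[0]
    for q in ps[1:]:
        if q[1] - p[1] >= delta:
            break
        if dist(p, q) < dist(p, c):
            c = q
    return c
```

Both return the same neighbor on valid strips; the early-stopping variant merely avoids needless distance computations.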

5.2 Related Work

Over the years a considerable amount of effort has been made to further optimize the combine step. Central to these improvements is the 'complexity of computing distances', abbreviated CCP in the following, a term introduced by Zhou et al. [26] which measures the number of Euclidean distances computed by a closest pair algorithm. The core idea is that, since computing the Euclidean distance is more expensive than other primitive operations, it might be possible to improve the overall running time by reducing this complexity measure. In the same paper they introduce an optimized version of the closest pair algorithm with a CCP of \(2n \log n\), in contrast to \(7n \log n\), which will be the worst case CCP of the algorithm of Sect. 3 after we minimize the number of distance computations in Sect. 6. They improve upon the algorithm presented by Preparata and Shamos [21] which achieves a CCP of \(3n \log n\). Ge et al. [10] base their, quite sophisticated, algorithm on the version of Zhou et al. and prove an even lower CCP of \(\frac{3}{2}n \log n\) for their implementation. The race for the lowest number of distance computations culminates so far with the work of Jiang and Gillespie [13] who present their algorithms 'Basic-2'Footnote 3 and '2-Pass' with a respective CCP of \(2n \log n\) and (for the first time linear) \(\frac{7}{2}n\).

6 Executable Code

Before we explore how our algorithm stacks up against Basic-2 (which is surprisingly the fastest of the CCP minimizing algorithms according to Jiang and Gillespie) we have to make some final adjustments to generate executable code from our formalization.

In Sect. 3 we fixed the data representation of a point to be a pair of mathematical ints rather than mathematical reals. During code export Isabelle then maps, correctly and automatically, its abstract data type int to a suitable concrete implementation of (arbitrary-sized) integers; for our target language OCaml we use the library 'zarith'. For the data type real this is not possible since we cannot implement mathematical reals. We would instead have to resort to an approximation (e.g. floats), losing all proved guarantees in the process. But currently our algorithm still uses the standard Euclidean distance and hence mathematical reals due to the square root in the distance function. For the executable code we have to replace this distance measure by the squared Euclidean distance. To prove that we preserve the correctness of our implementation, several small variations of the following lemma suffice:

figure ng

We apply two further code transformations. To minimize the number of distance computations we introduce auxiliary variables which capture and then replace repeated computations. For all of the shown functions that return a point or a pair of points this entails returning the corresponding computed distance as well. Furthermore we replace recursive auxiliary functions by corresponding tail-recursive implementations to allow the OCaml compiler to optimize the generated code and to prevent stack overflows. To make sure these transformations are correct we prove lemmas expressing the equivalence of old and new implementations for each function. Isabelle's code export machinery can then apply these transformations automatically.

Now it is time to evaluate the performance of our verified code. Figure 3 depicts the running time ratios of several implementations of the algorithm of Sect. 3 and of Basic-7 (the original approach of Cormen et al.) over Basic-2. The algorithm of Sect. 3 is tested in three variations: the exported (purely functional) Isabelle code and equivalent handwritten functional and imperative implementations, to gauge the overhead of the machine generated code. All algorithms are implemented in OCaml, use our bottom-up approach to sorting of Subsect. 3.2 (imperative implementations sort in place) and for each input of uniformly distributed points 50 independent executions were performed. Remarkably the exported code is only about 2.28Footnote 4 times slower than Basic-2, and furthermore most of the difference is caused by the inefficiencies inherent to machine generated code, since the equivalent handwritten functional implementation is only 11% slower than Basic-2. Basic-7 is 2.26 times slower than our handwritten imperative implementation, which demonstrates the huge impact the small optimization of Sect. 3 can have in practice.

Fig. 3.
figure 3

Benchmarks.

7 Conclusion

We have presented the first verification (both functional correctness and running time) of two related closest pair of points algorithms in the plane, without assuming the coordinates of all points to be distinct. The executable code generated from the formalization is competitive with existing reference implementations. A challenging and rewarding next step would be to formalize and verify a closest pair of points algorithm in arbitrary dimensions. This case is treated rather sketchily in the literature.