Skip to main content
Oxford University Press logoLink to Oxford University Press
. 2019 Dec 10;9(3):677–719. doi: 10.1093/imaiai/iaz018

Two-sample statistics based on anisotropic kernelsInline graphic

Xiuyuan Cheng 1, Alexander Cloninger 2,, Ronald R Coifman 3
PMCID: PMC7478116  PMID: 32929389

Abstract

The paper introduces a new kernel-based Maximum Mean Discrepancy (MMD) statistic for measuring the distance between two distributions given finitely many multivariate samples. When the distributions are locally low-dimensional, the proposed test can be made more powerful to distinguish certain alternatives by incorporating local covariance matrices and constructing an anisotropic kernel. The kernel matrix is asymmetric; it computes the affinity between Inline graphic data points and a set of Inline graphic reference points, where Inline graphic can be drastically smaller than Inline graphic. While the proposed statistic can be viewed as a special class of Reproducing Kernel Hilbert Space MMD, the consistency of the test is proved, under mild assumptions of the kernel, as long as Inline graphic, and a finite-sample lower bound of the testing power is obtained. Applications to flow cytometry and diffusion MRI datasets are demonstrated, which motivate the proposed approach to compare distributions.

Keywords: two-sample statistics, maximum mean discrepancy, anisotropic kernel

1. Introduction

We address the problem of comparing two probability distributions from finite samples in Inline graphic, where both distributions are assumed to be continuous (with respect to Lebesgue measure), compactly supported, and the two densities are Inline graphic and Inline graphic. We consider the case where each distribution is observed from i.i.d. samples, called Inline graphic (Inline graphic) and Inline graphic (Inline graphic), respectively, and the two datasets Inline graphic and Inline graphic are independent. The methodology has applications in a variety of fields, particularly in bio-informatics. It can be used, for example, to test genetic similarities between subtypes of cancers, to compare patient groups to determine potential treatment propensity and to detect small anomalies in medical images that are symptomatic of a certain disease. We will cover applications to flow cytometry and diffusion MRI datasets in this paper.

We are interested in the general alternative hypothesis test, namely Inline graphic against Inline graphic. We also focus on the medium dimensional setting, in which Inline graphic, where Inline graphic (Inline graphic) is the number of samples in Inline graphic (Inline graphic). In the large sample limit we consider the case where Inline graphic, the ratio of Inline graphic converges to a positive constant between 0 and 1, and the dimension Inline graphic is assumed to be fixed. (For the scenario where Inline graphic, the convergence of kernel matrices to the limiting integral operator needs to be treated quite differently, and a new analysis is needed.) We are particularly interested in the situation where data are sampled from distributions which are locally low-rank, which means that the local covariance matrices are generally of rank much smaller than Inline graphic. As will be clear in the analysis, the approach of constructing anisotropic kernels is most useful when data exhibit such characteristics in a high-dimensional ambient space.

We are similarly concerned with an Inline graphic-sample problem, in which the question is to determine the differentiation between the Inline graphic distributions, each of which has a finite set of i.i.d. samples. This can be done by measuring a pairwise distance between any two distributions and combining these pairwise measurements to build a weighted graph between the Inline graphic distributions. Thus, we focus on the two-sample test as the primary problem.

In the two-sample problem, the two data sets do not have any point-to-point correspondence, which motivates the comparison of their unknown probability densities. One way to do this is to construct ‘bins’ at many places in the domain, compute the histograms of the two datasets at these bins and then compare the histograms. This turns out to be a good description of our basic strategy, and the techniques beyond this include using ‘anisotropic Gaussian’ bins at every local point and ‘smoothing’ the histograms when needed. Notice that there is a trade-off in constructing these bins: if a bin is too small, which may leave no points in it most of the time, the histogram will have a large variance compared to its mean. When a bin is too large, it will lose the power to distinguish Inline graphic and Inline graphic when they only differ slightly within the bin. In more than one dimension, one may hope to construct anisotropic bins which are large in some directions and small in others, so as to maximize the power to differentiate Inline graphic and Inline graphic while maintaining small variance. It turns out to be possible when the deviation Inline graphic has certain preferred (local) directions. We illustrate this idea in a toy example below, and the analysis of testing consistency, including the comparison of different ‘binning’ strategies, is carried out by analysing the spectrum of the associated kernels in Section 3.

The idea of using bins and comparing histograms dates back to measuring distribution distances based on kernel density estimation, and is also closely related to two-sample statistics by Reproducing Kernel Hilbert Space (RKHS) Maximum-mean discrepancy (MMD) [13]. Other non-parametric tests have been proposed for the two-sample problem, including the energy distance test [26, 30], kernel Fisher discriminant analysis [10], the mean embedding (ME) test and the smooth characteristic function test [6, 18]. We discuss these connections in more detail in Section 1.2. While anisotropic kernels have been previously studied in manifold learning and image processing, kernel-based two-sample statistics with anisotropic kernels have not been examined in the situation where the data are locally low-rank, which is the theme of the current paper. At the same time, we are not just interested in whether Inline graphic and Inline graphic deviate, but how and where they deviate. The difference of the histograms of Inline graphic and Inline graphic measured at multiple bins, introduced as above, can surely be used as an indication of where Inline graphic and Inline graphic differ. The formal concept is known as the witness function in literature [13], which we describe in Section 2.3.

This paper yields several contributions to the two-sample problem. Methodologically, we introduce a kernel-based MMD statistic that increases the testing power against certain deviated distributions by adopting an anisotropic kernel, and both computation and memory requirements are reduced by using a reference set of locations on which the difference of the two distributions are computed. The proposed method can be combined with spectral smoothing of the histograms in order to reduce variability and possibly optimize the importance of certain regions of data, so that the power of the test may be further improved. Theoretically, asymptotic consistency is proved whenever the system is beyond the critical regime, i.e. when the magnitude of the density deviation Inline graphic is proportional to Inline graphic, under generic assumptions. The subscript of Inline graphic in Inline graphic denotes the possible dependence of the deviated density on the number of samples Inline graphic. Both asymptotic and non-asymptotic results are given. Experimentally, we provide two novel applications of two-sample and Inline graphic-sample problems for biological datasets.

The rest of the paper is organized as follows: we begin with a sketch of the main idea and motivating example in the remainder of this section, together with a review of previous studies. Section 2 formalizes the definition of the MMD statistics being proposed, and the theoretical analysis of test power is given in Section 3. The algorithm and other implementation matters, including computation complexity, are discussed in Section 4. Section 5 covers numerical experiments on synthetic and real-world datasets.

1.1 Main idea

Let Inline graphic and Inline graphic be two distributions supported on a compact set Inline graphic. Suppose a reference set Inline graphic is given which consists of Inline graphic points, and for each point Inline graphic there is a (non-degenerate) covariance matrix Inline graphic. We define the asymmetric affinity kernel to be

graphic file with name M54.gif (1.1)

In practice, Inline graphic is either given or can be computed from data, e.g. by local PCA on the combined dataset. Consider the two independent datasets Inline graphic and Inline graphic, where Inline graphic has Inline graphic i.i.d. samples and Inline graphic has Inline graphic i.i.d. samples. The empirical histograms of Inline graphic and Inline graphic at the reference point Inline graphic are defined as

graphic file with name M65.gif (1.2)

for which the population quantities are

graphic file with name M66.gif (1.3)

The empirical histograms are nothing else but the Gaussian binning of Inline graphic and Inline graphic at point Inline graphic with the anisotropic bins corresponding to the covariance matrix Inline graphic. We then compute the quantity

graphic file with name M71.gif (1.4)

as a measurement of the (squared) distance between the two datasets.

We use the following example to illustrate the difference between (1) using anisotropic kernel where Inline graphic is aligned with the tangent space of the manifold data, and (2) using the isotropic ones where Inline graphic is a multiple of the identity matrix.

The data are alike in Fig. 1(A1–A3), where Inline graphic and Inline graphic are supported on two arcs in Inline graphic separated by a gap of size Inline graphic at various regions. We begin by specifying a reference set Inline graphic and the covariance field Inline graphic. For simplicity, we do this by uniformly sampling Inline graphic reference points from Inline graphic (see Fig. 1). At each reference point Inline graphic, we take Inline graphic neighbours Inline graphic to estimate the local covariance matrix by Inline graphic. The empirical histograms Inline graphic and Inline graphic are computed as in Equation (1.2) at every point Inline graphic (see Fig. 1), as well as the quantity Inline graphic as in Equation (1.4). We also compute Inline graphic under a permutation of the data points in Inline graphic and Inline graphic so as to mimic the null hypothesis Inline graphic, and we call the two values Inline graphic and Inline graphic, respectively. The experiment is repeated 100 times, where Inline graphic, and the distribution of Inline graphic and Inline graphic are shown as red and blue bars in Fig. 1. The simulation is done across three datasets where Inline graphic takes value as Inline graphic, and we compare isotropic and anisotropic kernels. When Inline graphic, the distributions of Inline graphic and Inline graphic overlay each other, as expected. When Inline graphic, greater separation between distributions of Inline graphic and Inline graphic implies greater power for the test. The advantage of the anisotropic kernel is clearly demonstrated, particularly when Inline graphic (the middle row).

Fig. 1.

Fig. 1.

A1–A3: (Left) Inline graphic and Inline graphic sampled from curves with separation delta and Inline graphic, (middle) location of Inline graphic reference points and (right) Inline graphic and Inline graphic for anisotropic kernel. B1–B6: (Left) Representative isotropic and anisotropic bins, (middle) histograms Inline graphic (blue) and Inline graphic (red) with isotropic kernel and (right) histograms with anisotropic kernel, (top) Inline graphic, (bottom) Inline graphic. C1–C6: (Left) Isotropic kernel, (right) anisotropic kernel, (top) Heatmap of Inline graphic as singular functions of kernel, (middle) Heatmap of Inline graphic and (bottom) witness function. Inline graphic, and Inline graphic agree up to Inline graphic.

The analysis of the testing power of Inline graphic hinges on the singular value decomposition of Inline graphic, formally written as Inline graphic and will be defined in Section 3. The histogram Inline graphic thus becomes

graphic file with name M127.gif

and similarly for Inline graphic, which means that the ability of Inline graphic to distinguish Inline graphic and Inline graphic is determined by the amount that Inline graphic and Inline graphic differ when projected onto the singular functions Inline graphic. For the example above, the first few singular functions are visualized in Fig. 1(C1–C4), where the Inline graphics of the anisotropic kernel project along the directions where Inline graphic and Inline graphic deviate at a lower index of Inline graphic, thus contributing more significantly to the quantity Inline graphic. Figure 1(C5 and C6) also shows the ‘witness function’ ([13], cf. Section 2.3) of kernels, which indicates the regions of deviation between Inline graphic and Inline graphic.

Throughout the paper, we refer to the use of a local Mahalanobis distance with Inline graphic as an anisotropic kernel and Inline graphic as an isotropic kernel. Similarly, we refer to a kernel measuring affinity from all points to a reference set (i.e. Inline graphic) as an asymmetric kernel. The analysis in Section 3 is for the symmetrized version of the kernel Inline graphic, while in practice one never computes the Inline graphic-by-Inline graphic kernel, but only the Inline graphic-by-Inline graphic asymmetric kernel Inline graphic which is equivalent and way more efficient. We discuss more about computation in Section 4.4.

1.2 Related work

The question of two-sample testing is a central problem in statistics. In one dimension, one classical approach to two sample testing is the Kolmogorov–Smirnov distance, which compares the Inline graphic distance between the two empirical cumulative distribution functions [19, 29]. While there exist generalizations of these infinitely supported bins in high dimensions [4, 12, 15, 25], these require a large computational cost for either computing a minimum spanning tree or running a large number of randomized tests. This warranted binning functions that are defined more locally and in a data-adaptive fashion. Another high-dimensional extension of Kolmogorov–Smirnov is to randomly project the data into a low-dimensional space and compute the test in each dimension.

The one-dimensional Kolmogorov–Smirnov statistic can be seen as a special case of the MMD, the latter being generally defined as

graphic file with name M152.gif

where Inline graphic is certain family of integrable functions [13]. When Inline graphic equals the set of all indicator functions of intervals Inline graphic in Inline graphic, the MMD discrepancy gives the Kolmogorov–Smirnov distance. Gretton et al. [13] studied kernel-based MMD where the function class Inline graphic consists of all functions s.t. Inline graphic, Inline graphic indicating the norm of the Hilbert space associated with the reproducing kernel. Specifically, suppose the positive semi-definite (PSD) kernel is Inline graphic, the (squared) RKHS MMD can be written as

graphic file with name M161.gif (1.5)

and can be estimated by (here we refer to the biased estimator in [13] which includes the diagonal terms)

graphic file with name M162.gif

The methodology and theory apply to any dimensional data as long as the kernel can be evaluated.

We consider a kernel Inline graphic of the form Inline graphic and its variants, where Inline graphic is the certain measure of the reference points. This kernel can be seen as a special case of RKHS MMD [13], which considers a general PSD kernel Inline graphic. When Inline graphic and is isotropic, and Inline graphic is the Lebesgue measure, Inline graphic is reduced to a Gaussian kernel, but generally not. In practice, Inline graphic corresponds to the discrete measure on a set of reference points, and it reduces the computation to be linear in the number of sample size. A similar construction has been taken in the ME test [6, 18], where the kernel is isotropic Gaussian. The use of anisotropic kernel is key to our construction as the non-identity Inline graphic, and allows one to incorporate the local dimensionality reduction in (1.1). This may lead to a more powerful test for distributions that are concentrated near locally low-dimensional data structures, and turns out to be critical for the empirical success of the proposed method on real-world datasets. Theoretically, our analysis gives a comparison of the testing power of different kernels, which is determined by the spectral decomposition of the kernels. This result augments that in [13]: [24] pointed out that the test power may degenerate when applied to high dimensional data, and the proposed method addresses this issue when the data exhibit a locally low-dimensional structure, e.g. the manifold data. Our result also adds to that in [6, 18] by proving the convergence of the test power at a sharper rate, both in the asymptotic and the non-asymptotic result, with a more general construction of the MMD statistic (i.e. a general distribution of reference Inline graphic and anisotropic kernels).

While the empirical spectral expansion of translation-invariant kernels has been previously used for two-sample tests, e.g. in [11] and more recently in [32], and the idea dates back to earlier statistical works (see, e.g. [9]), our setting is different due to the use of anisotropic kernel. We also generalize the construction to a larger family of kernels by adding a ‘spectral filtering’, which truncates the spectral decomposition of the anisotropic kernels and modifies the eigenvalues, cf. Section 2.2. The modified kernel may lead to improved testing power for certain departures.

The problem of optimizing kernel parameters has been considered in several places, from the selection of kernel bandwidth in kernel MMD test [13] to the optimization of reference locations in the ME test [18]. We discuss the parameters of our algorithm in Section 2.4. The optimization over a search space of kernels has been considered by [14], where one constructs a convex combination of a finite number of kernels drawn from a given family, typically isotropic Gaussian kernels with various bandwidths. The kernels we consider are anisotropic, and thus lie outside the family of kernels considered in [14] and, in particular, have different eigenvalues and eigenfunctions. Also, building spectral filters directly on the eigenfunctions yields a different MMD statistic than by convex combination of kernels.

Our approach is also closely related to the previous study of the distribution distance based on kernel density estimation [1]. We generalize the results in [1] by considering non-translation-invariant kernels, which greatly increases the separation between the expectation of Inline graphic under the null hypothesis and the expectation of Inline graphic under an alternative hypothesis. Moreover, it is well known that kernel density estimation, which [1] is based on, converges poorly in high dimension. In the manifold setting, the problem was remedied by normalizing the (isometric) kernel in a modified way, and the estimation accuracy was shown to only depend on the intrinsic dimension [23]. Our proposed approach takes extra advantage of the locally low-dimensional structure, and obtains improved distinguishing power compared to the one using isotropic kernels when possible.

At last, the proposed approach can be viewed as related to two-sample testing via nearest neighbours [16]. In [16], one computes the nearest neighbours of a reference point Inline graphic to the data Inline graphic, and derives a statistical test based on the amount the empirical ratio Inline graphic, where Inline graphic is the number of neighbours from Inline graphic (similarly Inline graphic), deviates from the expected ratio under the null hypothesis, namely Inline graphic. Because the nearest neighbour algorithm is based on Euclidean distance, it is equivalent to a kernel-based MMD with a hard-thresholded isotropic kernel. The approach can be similarly combined with a local Mahalanobis distance as we do, which has not been explored.

2. MMD test statistics and witness functions

We are given two independent datasets Inline graphic and Inline graphic, where Inline graphic has Inline graphic i.i.d. samples drawn from distribution Inline graphic and Inline graphic has Inline graphic i.i.d. samples drawn from Inline graphic. We also use Inline graphic and Inline graphic to denote the densities and write integration w.r.t. Inline graphic as Inline graphic and similarly for Inline graphic.

In the method proposed in Section 1.1, we assume that the reference set Inline graphic and the covariance field Inline graphic are given. In this section, we consider a slightly general form of MMD test statistics, which involves a distribution Inline graphic of the reference points and a covariance field Inline graphic, both assumed to be pre-defined. They can be viewed as parameters of the algorithm (cf. Section 2.4), and the choice can be crucial for the test power. Section 3 explains such effects by theoretically analysing the test power, which is entirely characterized by the spectral decomposition of the anisotropic kernel—and the spectrum in turn depends on the choice of Inline graphic and Inline graphic. The effects are also empirically demonstrated in Section 3.6. The algorithmic choice of Inline graphic and Inline graphic in practice is explained in Section 4.

2.1 Kernel MMD statistics

Using the kernel Inline graphic defined in (1.1), we consider the following empirical statistic:

graphic file with name M204.gif (2.1)

where Inline graphic and Inline graphic are defined in (1.2). Note that (2.1) assumes the measure Inline graphic along with the covariance field Inline graphic as explained above. For now we leave Inline graphic to be general, and in the proposed algorithm Inline graphic for some reference set Inline graphic, cf. Section 4.

The population statistic corresponding to (2.1) is

graphic file with name M212.gif (2.2)

Inline graphic can be viewed as a special form of RKHS MMD: by (1.5), (2.1) is the (squared) RKHS MMD with the kernel

graphic file with name M214.gif (2.3)

The kernel (2.3) is clearly PSD, however, not necessarily ‘universal’, meaning that the population MMD as in (2.2) may not give a distance between Inline graphic and Inline graphic. The test is thus restricted to the departures within the Hilbert space (Assumption 2). Recall that with finite samples, what determines the power of a test is the relative magnitude of the bias of the test statistic under Inline graphic and Inline graphic, in comparison with its probabilistic deviation. Maximizing the bias while controlling the variance is practically more important than the population statistic being mathematically non-zero under every conceivable deviation of distribution. In this view, constructing a pseudo-metric instead of a metric trades the ability to detect certain types of departures, which may be hard to detect with finite samples, for improved discriminative ability against other ones which the proposed anisotropic kernel is more sensitive to detect.

We introduce a spectral decomposition of the kernel Inline graphic based upon that of the asymmetric kernel Inline graphic, which sheds light on the analysis: let Inline graphic be a distribution of data point Inline graphic (which is a mixture of Inline graphic and Inline graphic to be specified later). Since Inline graphic is bounded by 1, so is the integral Inline graphic, and thus the asymetric kernel is Hilbert–Schmidt and the integral operator are compact. The singular value decomposition of Inline graphic with respect to Inline graphic and Inline graphic can be written as

graphic file with name M230.gif (2.4)

where Inline graphic, Inline graphic and Inline graphic are ortho-normal sets w.r.t. Inline graphic and Inline graphic respectively. Then (2.3) can be written as

graphic file with name M236.gif (2.5)

This formula suggests that the ability of the kernel MMD to distinguish Inline graphic and Inline graphic is determined by (i) how discriminative the eigenfunctions Inline graphic are (viewed as ‘feature extractors’), and (ii) how the spectrum Inline graphic decay (viewed as ‘weights’ to combine the Inline graphic differences extracted per mode). It also naturally leads to generalizing the definition by modify the weights Inline graphic, which is next.

2.2 Generalized kernel and spectral filtering

We consider the situation where the distributions Inline graphic and Inline graphic lie around certain lower-dimensional manifolds in the ambient space, and both densities are smooth with respect to the manifold and decay off-manifold, which is typically encountered in the applications of interest (cf. Section 5). Meanwhile, since the reference set is sampled near the data, Inline graphic is also centered around the manifold. Thus, one would expect the population histograms Inline graphic and Inline graphic to be smooth on the manifold as well. This suggests building a ‘low-pass-filter’ for the empirical histograms before computing the Inline graphic distance between them, namely the MMD statistic.

We thus introduce a general form of kernel as

graphic file with name M249.gif (2.6)

where Inline graphic is a sequence of sufficiently decaying positive numbers, the requirement to be detailed in Section 3. Our analysis will be based on kernels in form of (2.6), which includes Inline graphic as a special case when Inline graphic. While the eigenfunctions Inline graphic are generally not analytically available, to compute the MMD statistics one only needs to evaluate Inline graphics on the data points in Inline graphic, which can be approximated by the empirical singular vectors of the Inline graphic-by-Inline graphic kernel matrix Inline graphic and computed efficiently for MMD tests (cf. Section 4). Note that the approximation by empirical spectral decomposition may degrade as Inline graphic increases; however, for the purpose of smoothing histograms, one typically assigns small values of Inline graphic for large Inline graphic so as to suppress the ‘high-frequency components’.

The construction (2.6) gives a large family of kernels and is versatile: first, setting Inline graphic for some positive integer Inline graphic is equivalent to using the kernel as Inline graphic where Inline graphic. This can be interpreted as redefining the affinity of points Inline graphic and Inline graphic by allowing Inline graphic-steps of ‘intermediate diffusion’ on the reference set, and thus ‘on the data’ [7]. When Inline graphic is uniform over the whole ambient space, raising Inline graphic is equivalent to enlarging the bandwidth of the Gaussian kernel. However, when Inline graphic is chosen to adapt to the densities Inline graphic and Inline graphic, then the kernel becomes a data-distribution-adapted object. As Inline graphic increases, Inline graphic decays rapidly, which ‘filters out’ high-frequency components in the histograms when computing the MMD statistic, because Inline graphic. Generally, setting Inline graphic to be a decaying sequence has the effect of spectral filtering. Secondly, in the case that prior knowledge about the magnitude of the projection Inline graphic is available, one may also choose Inline graphic accordingly to select the ‘important modes’. Furthermore, one may view the kernel MMD with (2.6) as a weighted squared-distance statistic after projecting to the spectral coordinates by Inline graphic, where the coordinates are uncorrelated thanks to the orthogonality of Inline graphic. We further discuss the possible generalizations in the last section. The paper is mainly concerned with the kernel Inline graphic, including Inline graphic, with anisotropic Inline graphic, while the above extension of MMD may be of interest even when Inline graphic is isotropic.

2.3 Witness functions

Following the convection of RKHS MMD [13], the ‘witness function’ Inline graphic is defined as

graphic file with name M287.gif

where Inline graphic and Inline graphic are the ME of Inline graphic and Inline graphic in Inline graphic, respectively. By Riez representation, Inline graphic equals Inline graphic multiplied by a constant. We will thus consider Inline graphic as the witness function. By definition, Inline graphic, and similarly for Inline graphic, thus Inline graphic. We then have that for the statistic Inline graphic,

graphic file with name M300.gif (2.7)

and generally for Inline graphic,

graphic file with name M302.gif (2.8)

The computation of empirical witness functions will be discussed in Section 4.2.

Although the witness function is not an estimator of the difference Inline graphic, it gives a measurement of how Inline graphic and Inline graphic deviate at local places. The witness function with isotropic Gaussian kernel has been previously used for model criticism of generative models of image data [22]. This can be useful for user interpretation of the two-sample test, as shown in Section 5. The witness function augments the MMD statistic, which is a global quantity.

2.4 Kernel parameters

If the reference distribution Inline graphic and the covariance field Inline graphic are given, there is no tuning parameter to compute Inline graphic-MMD (cf. Algorithm 2). To compute Inline graphic-MMD with general Inline graphic, which is truncated to be of finite rank Inline graphic, the number Inline graphic and the values Inline graphic are tunable parameters (cf. Algorithm 3).

If the covariance field Inline graphic is provided up to a global scaling constant by Inline graphic, e.g. when estimated from local PCA, which means that one uses Inline graphic for some Inline graphic in (1.1), then this Inline graphic is a parameter which needs to be determined in practice.

Generally speaking, one may view the reference set distribution Inline graphic and the covariance field Inline graphic as ‘parameters’ of the proposed MMD test. The optimization of these parameters surely have an impact on the power of the MMD statistics, e.g. the optimization of the reference set Inline graphic and a global bandwidth of the isotropic Gaussian kernel via a separated training set has been considered in [18]. A full analysis of this goes beyond the scope of the current paper. At the same time, there are important application scenarios where pre-defined Inline graphic and Inline graphic are available, e.g. the diffusion MRI imaging data (Section 5.3). We thus proceed with the simplified setting by assuming pre-defined Inline graphic and Inline graphic, and focusing on the effects of anisotropic kernel and re-weighted spectrum.

3. Analysis of testing power

We consider the population MMD statistic Inline graphic of the following form:

graphic file with name M327.gif (3.1)

where Inline graphic as in (2.6), and particularly Inline graphic as in (2.5). The empirical version is

graphic file with name M330.gif (3.2)

where Inline graphic and Inline graphic, and Inline graphic. We consider the limit where both Inline graphic and Inline graphic go to infinity and proportional to each other, in other words, Inline graphic and Inline graphic, and Inline graphic.

We will show that, under generic assumptions on the kernel and the departure Inline graphic, the test based on Inline graphic is asymptotically consistent, which means that the test power Inline graphic as Inline graphic (with controlled false-positive rate). The asymptotic consistency holds when Inline graphic is allowed to depend on Inline graphic as long as the magnitude of the density departure Inline graphic decays to 0 slower than Inline graphic. We also provide a lower bound of the power based upon Chebyshev which applies to the critical regime when the magnitude of Inline graphic is proportional to Inline graphic. The analysis also provides a quantitative comparison of the testing power of different kernels.

3.1 Assumptions on the kernel

Because Inline graphic is uniformly bounded and Hilbert–Schmidt, Inline graphic is positive semi-definite and compact, and thus can be expanded as in (2.5) where Inline graphic are a set of ortho-normal functions under Inline graphic. By that Inline graphic, we also have that Inline graphic satisfies that

graphic file with name M355.gif (3.3)

and Inline graphic for any Inline graphic. Meanwhile, Inline graphic is continuous (by the continuity and uniformly boundedness of Inline graphic), which means that the series in (2.5) converges uniformly and absolutely, and the eigenfunctions Inline graphic are continuous (Mercer’s theorem). Finally, (3.3) implies that the integral operator associated with the kernel Inline graphic is in the trace class, and specifically Inline graphic.

These properties imply that when replacing Inline graphic by Inline graphic as in (2.6), one can preserve the continuity and boundedness of the kernel. We analyse kernels of the form as (2.6) and assume the following properties.

Assumption 1

The Inline graphic in (2.6) satisfy that Inline graphic, Inline graphic, and that the kernel Inline graphic is PSD, continuous and Inline graphic for all Inline graphic.

As a result, Mercer’s theorem applies to guarantee that the spectral expansion (2.6) converges uniformly and absolutely, and the operator is in the trace class. Note that Assumption 1 holds for a large class of kernels, including all the important cases considered in this paper. The previous argument shows that Inline graphic is covered. Another important case is the finite-rank kernel: that is, Inline graphic when Inline graphic for some positive integer Inline graphic, and Inline graphic can be any positive numbers when Inline graphic such that Inline graphic and Inline graphic for all Inline graphic.

For the MMD test to distinguish a deviation Inline graphic from Inline graphic, it needs to have Inline graphic for such Inline graphic. We consider the family of alternatives Inline graphic which satisfies the following condition.

Assumption 2

When Inline graphic, there exists Inline graphic s.t. Inline graphic and Inline graphic. In particular, if Inline graphic are strictly positive for all Inline graphic, then Inline graphic satisfies that Inline graphic does not vanish w.r.t. Inline graphic, i.e. Inline graphic.

The following proposition, proved in the Appendix, shows that Inline graphic for such deviated Inline graphic.

Proposition 3.1

Notations as above, for a fixed Inline graphic, the following are equivalent:

  • (i) Inline graphic.

  • (ii) For some Inline graphic, Inline graphic and Inline graphic.

If Inline graphic for all Inline graphic, then (i) is also equivalent to

  • (iii) Inline graphic.

Note that Inline graphic satisfies Inline graphic for all Inline graphic; thus, (iii) applies. The proposition says that Inline graphic distinguishes an alternative Inline graphic when Inline graphic lies in the subspace spanned by Inline graphic, and for general Inline graphic, Inline graphic needs to lie in the subspace spanned by Inline graphic. These bases are usually not complete, e.g. when the reference set has Inline graphic points then Inline graphic is of rank at most Inline graphic. However, when the measure Inline graphic is continuous and smooth, the singular value decomposition (2.4) has a sufficiently decaying spectrum, and the low-frequency Inline graphics can be efficiently approximated with a drastic down sampling of Inline graphic [3]. This means that, under certain conditions of Inline graphic (sufficiently overlapping with Inline graphic and regularity), after replacing the continuous Inline graphic by a discrete sampling in constructing the kernel, an alternative Inline graphic violates Assumption 2 only when the departure Inline graphic lies entirely in the high-frequency modes w.r.t. the original Inline graphic. In the applications considered in this paper, these very high-frequency departures are rarely of interest to detect, not to mention that estimating Inline graphic for large Inline graphic lacks robustness with finite samples. Thus, Assumption 2 poses a mild constraint for all practical purposes considered in this paper, even with the spectral filtering kernel which utilizes a possibly truncated sequence of Inline graphic.

We note that, by viewing Inline graphic as general Fourier modes, Assumption 2 servers as the counterpart of the classical condition of the kernel ‘having non-vanishing Fourier transforms on any interval’, which is needed for the MMD distance with translation-invariant kernel to be a distance [1, Section 2.4].

3.2 The centered kernel

We introduce the centered kernel Inline graphic under Inline graphic, defined as

graphic file with name M433.gif (3.4)

where Inline graphic, and Inline graphic. The spectral decomposition of Inline graphic is the key quantity used in later analysis.

The following lemma shows that the MMD statistic, both the population and the empirical version, remains the same if Inline graphic is replaced by Inline graphic. The proof is by definition and details are omitted.

Lemma 3.1

Notations as above,

Lemma 3.1 (3.5)
Lemma 3.1 (3.6)

In particular, under Assumption 2 that Inline graphic, then so is the r.h.s. of (3.5).

The kernel Inline graphic also inherits the following properties from Inline graphic, proved in Appendix A.

Proposition 3.2

The kernel Inline graphic, as an integral operator on the space with the measure Inline graphic, is PSD. Under Assumption 1,

  • (1) Inline graphic for any Inline graphic, and Inline graphic for any Inline graphic,

  • (2) Inline graphic is continuous,

  • (3) Inline graphic is Hilbert–Schmidt as an operator on Inline graphic and is in the trace class. It has the spectral expansion
    graphic file with name M453.gif (3.7)
    where Inline graphic, Inline graphic. Inline graphic are continuous on Inline graphic, Inline graphic are both summable and square summable, and the series in (3.7) converges uniformly and absolutely.
  • (4) The eigenfunctions Inline graphic are square integrable, and thus integrable, w.r.t. Inline graphic. Furthermore, Inline graphic.

To proceed, we define

graphic file with name M462.gif (3.8)

then by (3.7), (3.5) and (3.6), shortening the notation Inline graphic as Inline graphic, Inline graphic as Inline graphic, we have that

graphic file with name M467.gif (3.9)
graphic file with name M468.gif (3.10)

3.3 Limiting distribution of Tn

Consider the alternative Inline graphic for some fixed Inline graphic, Inline graphic. Inline graphic remains a probability density for any Inline graphic. We define the constants

graphic file with name M474.gif (3.11)

where Inline graphic are finite by the integrability of Inline graphic under Inline graphic ((4) of Proposition 3.2), and Inline graphic.

The theorem below identifies the limiting distribution of Inline graphic under various order of Inline graphic, which may decrease to 0 as Inline graphic. The techniques very much follow Chapter 6 of [27] (see also [13, Theorems 12 and 13]), where the key step is to replace the two independent sets of summations in (3.10) by normal random variables via multi-variate CLT (Lemma A.1), using a truncation argument based on the decaying of Inline graphic. The proof is left to Appendix A.

Theorem 3.3

Let Inline graphic may depend on Inline graphic, Inline graphic, and notations Inline graphic, Inline graphic, Inline graphic and others like above. Under Assumption 1, as Inline graphic with Inline graphic, Inline graphic,

(1) If Inline graphic, Inline graphic (including the case when Inline graphic), then

Theorem 3.3

Due to the summability of Inline graphic the random variable on the right-hand side is well defined.

(2) If Inline graphic, where Inline graphic, then

Theorem 3.3

where Inline graphic.

(3) If Inline graphic, then

Theorem 3.3

where Inline graphic, with Inline graphic.

Remark 3.1

As a direct result of Theorem 3.3 (1), under Inline graphic,

Remark 3.1

When Inline graphic, the limiting density is Inline graphic where Inline graphic are i.i.d. Inline graphic random variables. Theorem 3.3 (3) shows the asymptotic normality of Inline graphic under Inline graphic. One can verify that when Inline graphic, Inline graphic, where Inline graphic. These limiting densities under Inline graphic recover the classical result of V-statistics [27, Theorems 6.4.1.B and 6.4.3].

The numerical experiments in Section 3.6 show that the theoretical limits identified in Theorem 3.3(1) approximate the empirical distributions of Inline graphic quite well when Inline graphic equals a few hundreds (Fig. 2). In the next two subsections, we address the asymptotic consistency of the MMD test and the finite-sample bound of the test power.

Fig. 2.

Fig. 2.

Empirical distribution of the statistic Inline graphic under Inline graphic (blue) and Inline graphic (red) shown in comparison with the large Inline graphic limiting distribution (broken lines) as in Theorem 3.3, for Gaussian kernel (left), Inline graphic (middle) and Inline graphic (right).mmd. The numbers are the fraction of Inline graphic lying to the right of the 0.95-quantile of Inline graphic, for empirical and theoretical curves, respectively. Data sets Inline graphic and Inline graphic drawn from Inline graphic and Inline graphic in the example in Section 3.6, a scatter plot is shown in Fig. 4, Inline graphic.

3.4 Asymptotic consistency of the test

A test based on the MMD statistic Inline graphic rejects Inline graphic whenever Inline graphic exceeds certain threshold Inline graphic. For a target ‘level’ Inline graphic (typically Inline graphic), a test (with Inline graphic samples) achieves level Inline graphic if Inline graphic, and the ‘power’ against an alternative Inline graphic is defined as Inline graphic. We define the test power at level Inline graphic as

graphic file with name M544.gif (3.12)

The threshold Inline graphic needs to be sufficiently large to guarantee that the false-positive rate is at most Inline graphic, and in practice it is a parameter to determine (cf. Section 4.1). Below, we omit the dependence on Inline graphic in the notation and write Inline graphic as Inline graphic. The test is called asymptotically consistent if Inline graphic as Inline graphic. For the one-parameter family of Inline graphic, the following theorem shows that, if Inline graphic satisfies Assumption 2, the MMD test has a non-trivial power when Inline graphic converges to a positive constant and is asymptotically consistent if Inline graphic.

Theorem 3.4

Under Assumption 1, and suppose that the departure Inline graphic makes Inline graphic satisfy Assumption 2 for Inline graphic, where Inline graphic may depend on Inline graphic. As Inline graphic with Inline graphic,

  • (1) If Inline graphic, Inline graphic, then Inline graphic, where Inline graphic is a monotonic function of Inline graphic.

  • (2) If Inline graphic, then Inline graphic.

This is qualitatively the same result as for RKHS MMD [13, Theorem 13]. The claim (1) directly follows from Theorem 3.3 (1), and the proof for (2) uses similar techniques. The proof is left to Appendix A.

3.5 Non-asymptotic bound of the testing power

In this section, we derive a non-asymptotic (lower) bound of the testing power Inline graphic for finite Inline graphic, which shows that the speed of the convergence Inline graphic is at least as fast as Inline graphic, as Inline graphic increases whenever Inline graphic is greater than certain threshold value.

Theorem 3.5

Notations Inline graphic, Inline graphic, Inline graphic as above. Define Inline graphic, Inline graphic, Inline graphic and Inline graphic, and let Inline graphic be the target level. Under Assumptions 1 and 2, define Inline graphic. If Inline graphic and

Theorem 3.5 (3.13)

then

Theorem 3.5 (3.14)

where

Theorem 3.5

In particular, assuming that Inline graphic is uniformly bounded to be between Inline graphic for some Inline graphic, then the constants Inline graphic, Inline graphic, Inline graphic, Inline graphic are all uniformly bounded with respect to Inline graphic by constants which only depend on Inline graphic and Inline graphic.

Remark 3.2

The above bound applies when Inline graphic is proportional to Inline graphic, which is the critical regime. The lower bound of Inline graphic in (3.14) increases with Inline graphic and approaches 1 when Inline graphic, showing the same behaviour as Theorem 3.4. It can be used as another way to prove Theorem 3.4 (2). In particular, as Inline graphic increases (or Inline graphic increases, since Inline graphic stays bounded) and assuming uniformly bounded Inline graphic, the r.h.s. of (3.14) is dominated by

Remark 3.2

which leads to an Inline graphic bound of Inline graphic for fixed Inline graphic. The theorem only uses Chebyshev, so the bound may be improved, e.g. by investigating the concentration of Inline graphic which is a quadratic function of independent sums, cf. (3.10).

Remark 3.3

We notice two other possible ways of deriving non-asymptotic bound of the power: (1) Berry–Esseen rate of the convergence to asymptotic normality has been proved for U-statistics in the case of Inline graphic and can be extended to the V-statistics (for the case of Inline graphic), cf. [27, Chapter 6]. This can lead to a control of Inline graphic, where the constant depends on the kernel and the departure magnitude Inline graphic. However, when Inline graphic is proportional to Inline graphic, the Berry–Essen rate loses sharpness and cannot give a non-trivial bound. (2) Large deviation bounds of the empirical MMD Inline graphic have been obtained by McDiarmid inequality and the boundedness of the Rademacher complexity of the unit ball in the RKHS, where Inline graphic is proved to converge to the population value Inline graphic at the rate Inline graphic with exponential tail [13, Theorem 7]. This gives a stronger concentration than Chebyshev when Inline graphic for some constant Inline graphic and leads to an exponential decay of Inline graphic with increasing Inline graphic, but it does not apply to give a control when Inline graphic is proportional to Inline graphic, i.e. the critical regime.

3.6 Comparison of kernels

From the above analysis (Theorems 3.33.5), one can see that the mean and variance of the statistic under Inline graphic and Inline graphic are very indicative of the power of the MMD test. These quantities are

graphic file with name M631.gif

Since the distribution of the test statistic is approximately (properly rescaled) chi-square or normal, the larger the bias Inline graphic, and the smaller the standard deviation Inline graphic and Inline graphic, the more powerful the test is to detect the deviation. By Theorem 3.3 (1), at the critical regime where Inline graphic is proportional to Inline graphic, these quantities can be approximated by the following (recall that Inline graphic)

graphic file with name M638.gif

where Inline graphic i.i.d. Notice that the gap Inline graphic, which is the population Inline graphic.

In this subsection, we compare isotropic and anisotropic kernels by numerically computing the above quantities under a specific choice of data distribution. We will consider three kernels: (1) the Gaussian kernel Inline graphic (2) the one induced by the anisotropic kernel Inline graphic where Inline graphic is set to be the Lebesgue measure, and Inline graphic is constructed so that the tangent direction is the first principle direction with variance Inline graphic, and the normal direction is the second principle direction with variance Inline graphic, and (3) the spectral filtered kernel as in Equation (2.6), where Inline graphic are designed to be 1 for Inline graphic and smoothly decays to zero at Inline graphic. All kernels are multiplied by a constant to make the largest eigenvalue Inline graphic so as to be comparable. We also adopt the ratio (similar to the object of kernel optimization considered in [14])

graphic file with name M652.gif

to illustrate the testing power, where the larger the Inline graphic the more powerful the test is likely to be.

For the distribution Inline graphic and Inline graphic, we use the two-dimensional example in the first section of the paper. Specifically, Inline graphic is the uniform distribution on the curve Inline graphic convolved with Inline graphic, Inline graphic, and Inline graphic is the distribution of the shifted curve Inline graphic convolved with Inline graphic, where Inline graphic. One realization of 200 points in Inline graphic and Inline graphic is shown in Fig. 4.1 Let Inline graphic, we consider the one-parameter family of alternative Inline graphic and will simulate in the critical regime where the test power is less than 1. We assume that Inline graphic and denote by Inline graphic in this subsection.

Fig. 4.

Fig. 4.

Example 1 in Section 5.1, two curves in Inline graphic. A1: Two samples Inline graphic and Inline graphic in Inline graphic in green and blue, respectively. A2: Testing power of MMD using (blue) Gaussian kernel, (red) anisotropic kernel Inline graphic, (pink) spectral-filtered anisotropic-kernel Inline graphic and (green) summed KS distance from 20 random projections to one dimension. Witness functions for two-sample test of (A3) Gaussian kernel, (A4) Inline graphic and (A5) Inline graphic, respectively.

In all experiments, the quantities Inline graphic, Inline graphic, Inline graphic and Inline graphic are computed by Monte Carlo simulation over 10000 runs. Their asymptotic version (with bar) are computed by 50000 draws of the limiting distribution, truncating the summation over Inline graphic to the first 500 terms; the values of Inline graphic and Inline graphic are approximated by the empirical eigenvalues and mean of eigenvectors over 10000 sample points drawn from Inline graphic and Inline graphic (Inline graphic is computed by Nystrom extension). The values of Inline graphic and Inline graphic are shown in Fig. 3, and those of Inline graphic etc. in Table 1.

Fig. 3.

Fig. 3.

Eigenvalues Inline graphic and the values of Inline graphic of the three kernels: Gaussian, Inline graphic and Inline graphic, where Inline graphic is the bias for mode Inline graphic; cf. Section 3.6. The Gaussian kernel has the discriminative modes lying in the high-index eigenfunctions (right, blue bars), to which the MMD statistic assigns relatively smaller weights via relatively smaller eigenvalues (left, blue curve). The Inline graphic kernel has more discriminative modes in the low-lying end of Inline graphic (right, red bars), and Inline graphic further enhances these modes by assigning higher weights to them (left, black curve and right, black bars).

Table 1.

The values of Inline graphic, Inline graphic, Inline graphic, Inline graphic and Inline graphic of three kernels: Gaussian, Inline graphic and Inline graphic, in the example in Section 3.6. In first row are average over Monte Carlo simulations, and in second row are the approximated values computed according to the limiting distribution identified by Theorem 3.3. Larger value of Inline graphic suggests that the test is more powerful in distinguishing Inline graphic from Inline graphic

Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
Inline graphic
Gaussian 0.4771 0.5444 0.0677 0.0704 0.4874
0.4754 0.5439 0.0676 0.0736 0.4848
Inline graphic 0.0489 0.0958 0.0214 0.0306 0.9000
0.0488 0.0939 0.0214 0.0312 0.8573
Inline graphic 0.0985 0.2046 0.0348 0.0587 1.1351
0.0983 0.2013 0.0374 0.0620 1.0354
Inline graphic
Gaussian 0.2381 0.2720 0.0334 0.0359 0.4885
0.2379 0.2722 0.0339 0.0368 0.4850
Inline graphic 0.0243 0.0477 0.0107 0.0153 0.8972
0.0244 0.0471 0.0106 0.0157 0.8616
Inline graphic 0.0490 0.1036 0.0177 0.0305 1.1343
0.0490 0.1003 0.0188 0.0310 1.0290

4. Practical considerations

Algorithm 1 gives the pseudo code for two-sample test based on the proposed MMD statistics, where two external subroutines, akMMD and akWitness, will be given in Algorithms 2 and 3 for Inline graphic and Inline graphic statistics, respectively. In Algorithm 3, an extra input parameter Inline graphic, which is a positive vector of length Inline graphic (Inline graphic), is needed. Inline graphic is the target spectrum of the kernel Inline graphic in Equation (2.6). Both algorithms compute the Inline graphic quantile of test statistic under Inline graphic by bootstrapping, which is used as the threshold Inline graphic in the test. It is also assumed that the reference set Inline graphic with Inline graphic are pre-defined.

In the rest of the section, we explain the bootstrapping approach, the empirical estimator of the witness function and the construction of reference set in detail. We close the section by commenting on the computation complexity.

  —

graphic file with name iaz018fx2.jpg

  —

graphic file with name iaz018fx3.jpg

  —

graphic file with name iaz018fx4.jpg

4.1 Permutation test and choice of the threshold tα

The MMD statistics Inline graphic introduced in Section 3 are non-parametric, which means that the threshold Inline graphic does not have a closed-form expression for finite Inline graphic. In practice, one may use a classical method known as the permutation test [17], which is a bootstrapping strategy to empirically estimate Inline graphic, and previously used in RKHS MMD [13]. The idea is to model the null hypothesis by pooling the two datasets, resampling from that pool and computing one instance of the MMD under that null hypothesis. Repetitive resampling thus corresponds to permuting the joint dataset multiple times, and it generates a sequence values of Inline graphic, the Inline graphic-quantile of which is used as Inline graphic (see both Algorithms 2 and 3). Strictly speaking, the power of the test is slightly degraded due to the empirical estimate of the null hypothesis. However, as we see in Section 5, the test still yields very strong results in a number of examples.

Meanwhile, the limiting distribution of the statistics Inline graphic appears to well approximate the empirical one under Inline graphic, as shown in Section 3.6. This suggests the possibility of determining Inline graphic based on the empirical spectral decomposition of the kernel. In the current work we focus on the bootstrapping approach (permutation test) for simplicity.

4.2 Empirical witness functions

The population witness functions have been introduced in Section 2.3 and, by definition, can be evaluated at any data point Inline graphic. For the kernel Inline graphic, the empirical version of Equation (2.7) is

graphic file with name M737.gif (4.1)

which can be computed straightforwardly from the empirical histograms Inline graphic, Inline graphic and the pre-defined kernel Inline graphic; see Algorithm 2.

For the kernel Inline graphic, Equation (2.8) is approximated by

graphic file with name M742.gif (4.2)

where Inline graphic, Inline graphic are computed by SVD of the assymetric kernel matrix, and the out-of-sample extension to Inline graphic by Nystrom method; see Algorithm 3.

4.3 Sampling of the reference set

In bio-informatics applications, e.g. flow cytometry, datasets are usually large in volume so that one can construct the reference set Inline graphic and covariance field Inline graphic from a pool of data points, which combines multiple samples, before the differential analysis in which MMD is involved. In some application scenario (like diffusion MRI), Inline graphic and Inline graphic are provided by external settings. Thus, we treat the procedure of constructing Inline graphic and Inline graphic separately from the two-sample analysis, which is also assumed in the theory.

In experiments in this work, we sample the reference points randomly in Lebesgue measure using the following heuristic: given a pool of data points, e.g. drawn from Inline graphic and Inline graphic or subsampled from larger datasets, the procedure loops over batches until Inline graphic points have been generated. In each loop, a batch of points are loaded from the pool, and candidate points are sampled from the batch according to Inline graphic which is the KDE of each point in the batch. The candidate points are giggled, and points which have too few neighbours in the pool dataset are excluded. The code can be found in the software github repository https://github.com/AClon42/two-sample-anisotropic.

The covariance field is computed by local PCA, when needed. This is related to the issue of ‘Inline graphic-selection’ in Gaussian MMD [13], where a ‘median heuristic’ was proposed to determine the Inline graphic in the Gaussian kernel Inline graphic. In our setting, the extension of ‘Inline graphic’ is the local covariance matrix Inline graphic, which allows different Inline graphic at different points, apart from different Inline graphic along different local directions. In estimating Inline graphic, a controlling parameter is then the size of the local neighbourhood, i.e. the Inline graphic-nearest neighbours from which local PCA is computed. In the manifold setting, there are strategies to choose Inline graphic so as to most efficiently estimate local covariance matrix, see, e.g. [21], and it is best done by using different Inline graphic at different point. For simplicity, in all the experiments we set Inline graphic to be a fraction of the total number of samples, which may be sub-optimal. We also introduce a parameter Inline graphic and set Inline graphic, where Inline graphic is computed via local PCA, so as to make the ‘size’ of Inline graphic tunable. The parameter Inline graphic is sampled on a binary grid (Inline graphic for Inline graphic).

4.4 Computational complexity

We will only discuss the cost of computing the MMD statistics, namely that of MMD-L2 (Algorithm 2) and MMD-Spec (Algorithm 3). The extra cost for computing the witness function is negligible, as can be seen from the code (some computation, e.g. that of Inline graphic and Inline graphic in Witness-L2, and the SVD of Inline graphic in Witness-spec are repetitive for illustrative purpose).

We first discuss MMD-L2: the cost for computing one empirical MMD statistics Inline graphic is of Inline graphic, where Inline graphic is the size of the two samples. The main cost is the one-time construction of the asymmetric kernel matrix Inline graphic, which also dominates the memory requirements. While the choice of the reference set, and hence the number of reference points Inline graphic, is related to the problem being addressed, any amount of structure in the samples leads to a choice of Inline graphic that is Inline graphic. This yields major computational and memory benefits as the test complexity is much smaller than the Inline graphic, which is the order for computing the U-statistics via a symmetric Gaussian kernel without extra techniques. In all applications we have considered and synthetic examples we have worked with, Inline graphic is in the tens or hundreds and can remain relatively constant as Inline graphic grows while still yielding a test with large power. This means the test with an anisotropic kernel becomes linear in the number of points being tested.

In MMD-spec, the extra computation is for the rank-Inline graphic SVD of the Inline graphic-by-Inline graphic matrix Inline graphic. The computation can be done in Inline graphic time, Inline graphic, via the classical pivoted QR decomposition, and may be accelerated by randomized algorithms, e.g. [31] which takes Inline graphic time to achieve an accuracy proportional to the magnitude of the Inline graphic)th singular value. The computational cost may be further reduced by making use of the possible sparse/low-rank structure of the kernel matrix Inline graphic. When the kernel Inline graphic almost vanishes outside a local neighbourhood of Inline graphic which has at most Inline graphic points, the rows of Inline graphic are Inline graphic-sparse, and then fast nearest neighbour search methods (by Kd-tree or randomized algorithm) can be applied under the local Mahalanobis distance and the storage is reduced to Inline graphic. When Inline graphic has a small numerical rank, e.g. only Inline graphic singular values are significantly non-zero, Inline graphic can be stored in its rank-Inline graphic factorized form Inline graphic which can be computed in linear time of Inline graphic. This will save computation in bootstrapping as both Inline graphic and Inline graphic can be computed from Inline graphic and (row-permuted) Inline graphic only.

5. Applications

5.1 Synthetic examples

5.1.1 Example 1: Curve in two dimension

The density Inline graphic is the uniform distribution on a quarter circle with radius 1 convolved with Inline graphic, Inline graphic on a quarter circle with radius Inline graphic convolved with Inline graphic, which is the same example in Section 3.6 (top left in Fig. 2), Inline graphic. The departure is parametrized by Inline graphic, taking values from 0 to 0.02. The rejection rate is computed from the average over Inline graphic runs according to Algorithm 1, as shown in the left of Fig. 4, where the error bars indicate the standard deviation. For the Gaussian kernel, the curve corresponds to the Inline graphic which has the best performance over a grid, and it is obtained by an intermediate value on the grid. The witness functions of the three kernels when Inline graphic are computed according to Section 4.2 and shown in Fig. 4.

5.1.2 Example 2: Gaussian mixture in 3 dimension

Two Gaussian mixtures in Inline graphic, Inline graphic, Inline graphic. The density Inline graphic consists of three equal-size components centering at Inline graphic, Inline graphic and Inline graphic respectively, with diagonal covariance matrix Inline graphic, Inline graphic and Inline graphic, Inline graphic; density Inline graphic differs from Inline graphic by centering the three means at Inline graphic, Inline graphic and Inline graphic, while letting the covariance matrices remain the same. The constant Inline graphic takes value from 0 to 0.02. Results are shown in Fig. 5, similar to Fig. 4.

Fig. 5.

Fig. 5.

Example 2 in Section 5.1: two Gaussian mixtures in Inline graphic. (Left) Two samples, (middle) testing power using Inline graphic determined by local covariance estimation and (right) testing power using Inline graphic by the true covariance matrix of the belonging cluster, where line markers are the same as in Fig. 4. The performance with estimated PCA is undermined by the loss of accuracy in estimating Inline graphic from local nearest neighbours.

5.2 Flow cytometry data analysis

Flow cytometry is a laser-based technology used for cell counting. It is routinely used in diagnosing blood diseases by measuring various physical and chemical characteristics of the cells in a sample. This leads to each sample being represented by a multi-dimensional point cloud of tens of thousands of cells that were measured.

Two natural questions arise in comparing various flow cytometry samples. The first is a supervised learning question: can a statistical test detect the systemic differences between a set of healthy people and a set of unhealthy people? The second question is an unsupervised learning question: can the distance measure be used to determine the pairwise distance between any two samples, and will that distance matrix yield meaningful clusters?

We address both of these questions for two different flow cytometry datasets. We compare isotropic Gaussian MMD to anisotropic MMD and demonstrate, in all cases, our anisotropic MMD test has more power and yields more accurate unsupervised clusters. We only show the pairwise distance embedding for the reference set asymmetric kernel Inline graphic due to the untenable Inline graphic cost of computing each of the Inline graphic pairs.

5.2.1 Acute myeloid leukemia dataset

Acute myeloid leukemia (AML) is a cancer of the blood that is characterized by a rapid growth of abnormal white blood cells. While being a relatively rare disease, it is incredibly deadly with a 5-year survival rate of Inline graphic [8]. AML is diagnosed through a flow cytometry analysis of the bone marrow cells, and while there are features that distinguish AML in certain cases, other cases can be more difficult to detect. State of the art supervised methods include SPADE, Citrus, etc. See [5] and the references therein.

The data are from the public competition Flow Cytometry: Critical Assessment of Population Identification Methods II (FlowCAP-II). This dataset consists of 316 healthy patients and 43 patients with AML, where each person has approximately Inline graphic cells measured across seven physical and chemical features.

We use an anisotropic kernel with only Inline graphic reference points. We create an unsupervised clustering by computing the pairwise distances Inline graphic between person Inline graphic and person Inline graphic. In Fig. 6, we display the network of these people constructed by weighting each edge as the exponential of the negative distance Inline graphic properly scaled. When supervising the process and running a two-sample test between the pool of healthy cells and the pool of unhealthy cells, we see that the anisotropic kernel yields significantly better separation and lower variance than the isotropic Gaussian kernel.

Fig. 6.

Fig. 6.

A1: Unsupervised histograms for AML patients. A2: Network clustering of pairwise distances between patients (green for AML, red for healthy). A3: Permutation test with isotropic kernel. A4: Permutation test wtih anisotropic kernel. B1 and B2: Different two-dimensional slices of point cloud coloured by witness function. Blue are cells more likely to be indicitive of AML.

We also examine the witness function in Fig. 6 that is generated by the anisotropic kernel, as introduced in Section 2.3. This yields a tool for visualizing the separation between the two samples in the original data space. This gives a way to communicate the decision boundary to the medical community, which uses visualization of the two-dimensional slices as the diagnostic tool for determining whether the patient has AML.

5.2.2 Myelodysplastic syndromes dataset

Myelodysplastic syndromes (MDS) are a group of cancers of the blood. There is more variability in how MDS presents itself in the cells and flow cytometry measurements. The data we work with came from anonymized patients that were treated at Yale New Haven Hospital. After choosing to examine surface markers CD16, CD13 and CD11B, along with several physical characteristics of the cells, we are left with 72 patients that were initially diagnosed with some form of MDS and 87 patients that were not. These patients are represented by about Inline graphic cells in eight dimensions.

While MDS is more difficult to detect than AML, the unsupervised pairwise graph, created the same way as in Fig. 6, still yields a fairly strong unsupervised clustering, as we see in Fig. 7. When supervising the process and running a two-sample test between the pool of healthy cells and the pool of unhealthy cells, we see that the anisotropic kernel yields strongly significant separation between the two classes, unlike the isotropic Gaussian kernel.

Fig. 7.

Fig. 7.

A1: Unsupervised histograms for MDS patients. A2: Network clustering of pairwise distances between patients (green for MDS, red for healthy). A3: Permutation test with isotropic kernel. A4: Permutation test with anisotropic kernel.

5.3 Diffusion MRI imaging analysis

Diffusion-weighted MRI is an imaging modality that creates contrast images via the diffusion of water molecules. Various regions of the brain diffuse in different ways, and the patters can reveal details about the tissue architecture. At a low level, each pixel in a three-dimensional brain image generates a three-dimensional diffusion tensor (i.e. covariance matrix) that describes the local flow of water molecules.

An important question in diffusion MRI analysis is to identify regions of the brain that systemically differ between groups of healthy and sick individuals. We attack this problem by comparing the distributions of the diffusion tensors in various regions of the brain, thus framing it as a multiple sample problem. Every brain is co-registered so that the pixels overlap.

Real images are around Inline graphic, so the amount of memory needed to build a square symmetric kernel, even a sparse one, is completely prohibitive. For this reason, we instead consider a set of reference pixels Inline graphic that subsamples the image and construct an anisotropic kernel Inline graphic, where Inline graphic is the location of pixel Inline graphic and Inline graphic is the Inline graphic diffusion tensor at pixel Inline graphic. This kernel must enforce both locality in the pixel space and the local behaviour of the diffusion tensors. The latter requires measuring covariance matrices over the space of positive semi-definite matrices.

Fortunately, the study of covariance matrices on the space of positive semi-definite matrices is a well-studied phenomenon [2]. The main takeaway is that there is an isomorphism from a Inline graphic diffusion tensor Inline graphic and its vectorized representation

graphic file with name M873.gif

Thus, we can define a Inline graphic covariance matrix Inline graphic on Inline graphic and define the anisotropic kernel

graphic file with name M877.gif (5.1)

We examine MMD using a synthetic data set generated from the common brain phantom image. We treat this as a two-dimensional slice of the three-dimensional image, and simply have all tensor variation in the z-direction constant and uncorrelated with the xy-direction of this slice. We also assume that the brains have been co-registered. The process of co-registration is an independent preprocessing issue which can be incorporated into our proposed methodology when working with real-world data sets, but is outside the scope of this paper.

In Fig. 8, we show both a ‘healthy’ brain and an ‘unhealthy’ brain in which a small region has been removed. The intensity of the image in this case will correspond to the eccentricity of the diffusion tensor at that point; the magnitude of the tensor will decrease from left to right in the same way for both images and the angle shift uniformly from left to right. The eccentricity, magnitude and angle all have i.i.d. Gaussian noise added to them with Inline graphic.

Fig. 8.

Fig. 8.

A1–A6: (Top) Synthetic brains with grayscale of diffusion tensor eccentricity, (middle) diffusion tensors with artificial colouring and (bottom) zoom in on area of difference. (Left) Group 1, (right) Group 2. B1–B9: (Left) Witness function, (middle) permutation test and (right) pairwise graph from MMD distance. (Top) Anisotropic kernel under alternative hypothesis, (middle) isotropic kernel under alternative hypothesis and (bottom) anisotropic kernel under null hypothesis.

We down-sample the brain by a factor of 5 for the reference points and consider the MEs Inline graphic for Inline graphic realizations of a healthy brain (null hypothesis Inline graphic), and for Inline graphic realizations of a healthy brain and Inline graphic realizations of an unhealthy brain (alternative hypothesis Inline graphic). Figure 8 shows the supervised witness function of regions of difference between the two groups, as well as a permutation test in which we permute group labels while maintaining the individual brain structure. We also show the leading eigenvectors of the pairwise network generated by measuring the MMD between any two brains.

By using reference points, each brain is represented by a kernel that is Inline graphic, and the data adaptive MMD computation can be run on 10 brains in about Inline graphic min on a standard laptop.

We also compare the anisotropic kernel to one with an isotropic kernel of constant bandwidth. We cannot compare it to kernel MMD with a square symmetric kernel due to computational limits, so instead we compare it to the modified asymmetric kernel as in (5.1), but without the covariance matrix. Instead, we replace it by a constant bandwidth, which is chosen to be Inline graphic.

6. Discussion and remarks

The paper studies kernel-based MMD statistics with kernels of the form Inline graphic, and more generally, Inline graphic where Inline graphic is a ‘filtering’ kernel over Inline graphic. The statistics can be computed with a pre-defined reference set in time Inline graphic, where Inline graphic is the number of samples and Inline graphic is the cardinal number of the reference set. The power of the test against alternative distributions is analysed using the spectral decomposition of the kernel with respect to the data distribution, and the consistency is proved under generic assumptions. The difference in the testing power of the kernels is determined by their spectral properties. We apply the proposed methodology to flow cytometry and diffusion MRI data analysis, where the goal of analysis is formulated as comparing the distribution of multiple samples.

We close the section by a few remarks about the proposed approach as well as possible extensions:

The spectral coordinates. The kernel-MMD distance being studied can be viewed as certain Inline graphic distance, weighted or unweighted, in the space of spectral embedding. This is reflected in the construction of the kernel Inline graphic, as well as in the power analysis. Theoretically, the space of spectral embedding is infinitely dimensional; however, in practice only finite-dimensional coordinates may contribute to the RKHS MMD statistic—in the sense of providing statistically significant departures. Thus, the leading Inline graphic coordinates of spectral embedding (depends on the kernel) gives a mapping from Inline graphic to Inline graphic, where Inline graphic may be proportional or larger than Inline graphic, and one may consider the two-sample test in the new coordinates. The general alternative then becomes a mean-shift alternative in the new coordinates. This suggests other possible tests for mean-shift alternatives than the weighted Inline graphic distance test being studied.

Weighting of bins. While spectral filtering introduces weighting in the generalized Fourier domain, another important variation is to introduce weighting in the ‘real domain’, namely weighting each bin centered at Inline graphic by a weight Inline graphic. Such weights may be computed from data, e.g. by certain local Inline graphic-value (see below), where the dependence among Inline graphic’s needs to be handled. Another interesting question is how to introduce multi-resolution systems in the context of the current paper: mapping points into spectral coordinates has certain advantage as analysed in the paper; however, spectral basis is global and may not be sensitive enough to local departure. On the other hand, histograms on local bins may have large variance and one needs to jointly analyse multiple bins. The shortcomings of both approaches may be overcome by considering a multi-scale basis on the reference-set graph.

Beyond Inline graphic distance. RKHS MMD considers the Inline graphic distance by construction, while other metrics have been studied in the literature, particularly the Wasserstein metric which has the interpretation of optimal flow given the underlying geometry. Such ‘geometric’ distances are certainly useful in various application scenarios. Using the reference set, a modification of Inline graphic will be

graphic file with name M910.gif

where Inline graphic, Inline graphic are the population histograms, and Inline graphic is certain metric on reference set. It may also be possible to construct a metric which is equivalent to the Wasserstein metric by measuring the difference at reference points across multiple scales of covariance matrices, as is done with with Haar wavelet [28] and with diffusion kernels [20]. Efficient estimation scheme of the Wasserstein metric needs to be developed as well as the consistency analysis with Inline graphic samples.

Local Inline graphic-value. The current approach gives a global test and computes the Inline graphic-value for the hypothesis of the distribution globally. In certain applications, especially differential analysis of flow cytometry data and other single-cell data, a more important problem is to find the local region where the two samples differ corresponding to different biological conditions, or in other words, to derive a ‘local’ Inline graphic-value of the test. While the witness function introduced in our current approach can provide indication where Inline graphic, a more systematical study of testing the hypothesis locally and controlling false discovery rate across bins is needed.

Funding

We would like to thank our joint funding NSF DMS-1818945. A.C. is also partially supported by the Russel Sage Foundation, X.C. is also partially supported by NSF DMS- 1820827, NIH, and the Alfred P. Sloan Foundation.

Footnotes

1

Strictly speaking Inline graphic and Inline graphic are longer compacted supported due to convolving with the two-dimensional Gaussian distribution; however, the normal density exponentially decays and Inline graphic is small, so that there is no any practical difference. The analysis can extend by a truncation argument.

Contributor Information

Xiuyuan Cheng, Email: xiuyuan.cheng@duke.edu.

Alexander Cloninger, Email: acloninger@ucsd.edu.

Ronald R Coifman, Email: ronald.coifman@yale.edu.

A. Proofs in Section 3

A.1 Proofs of Propositions 3.1 and 3.1

Proof of Proposition 3.1.

By (3.1), (2.6),

Proof of Proposition 3.1.

and thus (i) Inline graphic (ii).

Recall that Inline graphic has the singular value decomposition (2.4), and thus

Proof of Proposition 3.1.

with Inline graphic all strictly positive. This means that (iii) is equivalent to Inline graphic for some Inline graphic. When Inline graphic is all strictly positive, it is equivalent to (ii).

Proof of Proposition 3.2.

We first verify the positive semi-definiteness of Inline graphic: for any Inline graphic so that Inline graphic, by definition,

Proof of Proposition 3.2.

where Inline graphic. The quantity is non-negative by that Inline graphic is PSD.

To prove (1): under Assumption 1, Inline graphic and Inline graphic for any Inline graphic. This implies the boundedness of Inline graphic and Inline graphicin (3.4) and leads to (1).

(2) Follows from the continuity of Inline graphic.

The square integrability of Inline graphic then follows from boundedness of Inline graphic, which makes the operator Hilbert–Schmidt with

Proof of Proposition 3.2. (A.1)

by (1). Meanwhile, by (1) again,

Proof of Proposition 3.2.

which proves that the operator is in trace class. Mercer’s theorem applies to give the spectral expansion and the relevant properties, and Inline graphic is by the centering so that Inline graphic for all Inline graphic. This proves (3).

Finally, by the uniform convergence of Equation (3.7), and (1),

Proof of Proposition 3.2.

which proves (4).

A.2 Proof of Theorems 3.3 and 3.4

Lemma A.1 (Replacement lemma). —

Let Inline graphic be a sequence of positive number so that Inline graphic. Let Inline graphic be an array of random variables, Inline graphic, Inline graphic, s.t. for each Inline graphic,

Lemma A.1

where Inline graphic. Furthermore, as Inline graphic,

  • (i) Inline graphic elementwise, and for any finite Inline graphic, Inline graphic where Inline graphic.

  • (ii) There exists Inline graphic s.t.
    graphic file with name M964.gif
    and the convergence is uniform in Inline graphic.

Meanwhile, let Inline graphic, Inline graphic, Inline graphic be three double arrays satisfying that

  • (iii) Inline graphic, Inline graphic; as Inline graphic,
    graphic file with name M972.gif
  • (iv) There exist Inline graphic s.t.
    graphic file with name M974.gif
    for all Inline graphic, and the convergence is uniform in Inline graphic.

Then, as Inline graphic, the random variable

Lemma A.1 (A.2)

converges in distribution to Inline graphic defined as

Lemma A.1 (A.3)

and Inline graphic has finite mean and variance.

Proof.

First, we verify that Inline graphic is well defined and has finite variance; notice that (ii) implies that Inline graphic, (iii) implies that Inline graphic, Inline graphic, and (iv) implies that Inline graphic and Inline graphic. For finite Inline graphic, we define the truncated Inline graphic as taking the summation from Inline graphic to Inline graphic in Equation (A.3). Then

Proof.

and thus

Proof. (A.4)

due to the summability of Inline graphic, Inline graphic and Inline graphic (the second term is bounded by Cauchy–Schwarz). Thus, Inline graphic with probability one, and Inline graphic. Further,

Proof.

by that Inline graphic is finite. This verifies that Inline graphic has finite mean and variance. Actually, by martingale convergence theorem one can show that Inline graphic a.s.

Secondly, using a similar argument, defining Inline graphic to be the truncated Inline graphic in Equation (A.2), one can show that

Proof. (A.5)

by that Inline graphic, Inline graphic and Inline graphic all converges to zero uniformly in Inline graphic, which is assumed in conditions (ii) and (iv).

Now we come to prove Inline graphic. By Levy’s continuity theorem, it suffices to show the pointwise convergence of the characteristic function, namely

Proof.

Using the truncation of Inline graphic up to Inline graphic, we have that

Proof. (A.6)

By Equations (A.4) and (A.5), for any Inline graphic, and any Inline graphic, we can choose sufficiently large Inline graphic s.t. the first and the third term are both less than Inline graphic for any Inline graphic. To show that the second term can be made small, we introduce

Proof.

and by that Inline graphic (condition (i)), Inline graphic. Meanwhile,

Proof.

Since Inline graphic is finite, and Inline graphic is uniformly bounded as Inline graphic increases for each Inline graphic, we have that Inline graphic as Inline graphic, by the convergence of Inline graphic, Inline graphic and Inline graphic. Thus, the second term can be bounded by

Proof. (A.7)

where (I) can be made smaller than Inline graphic for large Inline graphic (Inline graphic is fixed and Inline graphic), and (II) can be made smaller than Inline graphic as a result of Inline graphic which implies convergence of characteristic function. Putting together, the l.h.s. of Equation (A.6) can be made smaller than Inline graphic for large Inline graphic, which proves the claim.

Proof of Theorem 3.3.

We introduce

Proof of Theorem 3.3.

where, by definition,

Proof of Theorem 3.3.

and

Proof of Theorem 3.3.

Using the above notations, we rewrite Equation (3.10) as

Proof of Theorem 3.3. (A.8)

where Inline graphic and Inline graphic. The random variables Inline graphic are independent from Inline graphic, and both are asymptotically normal for finite many Inline graphics. We will use the replacement Lemma A.1 to substitute Inline graphic and Inline graphic by their normal counterparts and discuss scenarios (1)–(3), respectively.

To apply Lemma A.1, we set Inline graphic, and the summability follows (3) of Proposition 3.2; we set

Proof of Theorem 3.3. (A.9)

and then (A.8) becomes

Proof of Theorem 3.3. (A.10)

We have that

Proof of Theorem 3.3. (A.11)
Proof of Theorem 3.3. (A.12)

where, recalling that Inline graphic, Inline graphic and Inline graphic,

Proof of Theorem 3.3. (A.13)

By that Inline graphic and Inline graphic, both Inline graphic and Inline graphic are uniformly bounded, so Inline graphic are bounded for each Inline graphic. We now verify conditions (i) and (ii) in Lemma A.1.

Condition (i): In cases (1) and (2), Inline graphic; thus, Inline graphic and then Inline graphic. In case (3) Inline graphic; thus, Inline graphic and then Inline graphic. As for the limiting distribution of Inline graphic for any finite Inline graphic, we know that Inline graphic, and Inline graphic by Lindeberg–Levy CLT ([27, Theorem 1.9.1 B], extended to the case where the covariance matrix converges to a non-degenerate limit by Slutsky theorem). By definition of Inline graphic and that Inline graphic and Inline graphic are independent, Inline graphic where Inline graphic (Inline graphic) is replaced by Inline graphic (Inline graphic) by Slutsky theorem. The argument applies to all the three cases.

Condition (ii): Inline graphic for some absolute positive constant Inline graphic and Inline graphic. Meanwhile, by Equation (A.13), Inline graphic; thus,

Proof of Theorem 3.3.

thanks to that Inline graphic and that Inline graphic ((4) of Proposition 3.2), and the convergence is uniform in Inline graphic.

We now consider the three scenarios respectively:

(1) Let Inline graphic, by Equation (A.10) we have

Proof of Theorem 3.3.

thus, (iii) holds. Condition (iv) can be verified by that Inline graphic (upper bounded by Inline graphic). As analysed above, Inline graphic, and conditions (i) and (ii) hold; thus, Lemma A.1 applies to give that

Proof of Theorem 3.3.

as claimed in the theorem.

(2) Let Inline graphic be the l.h.s. of the statement, then

Proof of Theorem 3.3.

and conditions (iii) and (iv) hold. Same as in (1), Inline graphic and (i) and (ii) hold; thus, Lemma A.1 gives that

Proof of Theorem 3.3.

By the summability of Inline graphic, Inline graphic is in same distribution as Inline graphic as defined in the theorem.

(3) Similar to (2), let Inline graphic be the l.h.s. of the statement, then Inline graphic, Inline graphic, Inline graphic are same as in (2) where Inline graphic, so they have the same limit, and (iii) and (iv) hold. As analysed above, Inline graphic, and (i) and (ii) hold. Thus, Lemma A.1 gives that Inline graphic where Inline graphic has covariance Inline graphic. By the summability of Inline graphic, Inline graphic is in same distribution as Inline graphic, where Inline graphic which equals the formula claimed in the theorem.

Proof of Theorem 3.4

We only prove (2), as (1) directly follows from Theorem 3.3 (1) and the form of the limiting density of Inline graphic in this case.

We first consider the Inline graphic case, i.e. Inline graphic: due to that Inline graphic satisfies Assumption 2 and the equivalent forms of Inline graphic as in (3.5) and (3.9), we have that

Proof of Theorem 3.4

Notations as in the proof of Theorem 3.3, by (A.10),

Proof of Theorem 3.4

We set Inline graphic to be the l.h.s. and apply Lemma A.1: Inline graphic and are summable as before. Conditions (i) and (ii) are satisfied by Inline graphic (defined in (A.9)), as has been verified in the proof of Theorem 3.3. Since

Proof of Theorem 3.4

and Inline graphic, condition (iii) holds with the limits as

Proof of Theorem 3.4

Condition (iv) is also satisfied due to the summability of Inline graphic, same as in the proof of Theorem 3.3. Thus, Lemma A.1 gives that

Proof of Theorem 3.4 (A.14)

which is a single-point distribution at the positive constant Inline graphic.

We then consider the Inline graphic case, i.e. Inline graphic: by Theorem 3.3 (1), Inline graphic which is a continuous non-negative random variable with finite mean and variance. Thus, for any chosen level Inline graphic, there exists Inline graphic s.t. Inline graphic, and then when Inline graphic is large enough,

Proof of Theorem 3.4

This means that Inline graphic is a valid threshold in (3.12), and as a result, (shortening ‘Inline graphic’ as Inline graphic)

Proof of Theorem 3.4

Putting together with (A.14), we then have that for sufficiently large Inline graphic,

Proof of Theorem 3.4

which converges to 0 since Inline graphic is an absolute constant and thus Inline graphic, and meanwhile Inline graphic. This proves that Inline graphic.

A.3 Proof of Theorem 3.5

Proof of Theorem 3.5.

We use Chebyshev to control the deviation of the random variable Inline graphic from its mean, under Inline graphic and Inline graphic, respectively.

Under Inline graphic, Inline graphic, by (A.10),

Proof of Theorem 3.5.

By (A.12), Inline graphic,

Proof of Theorem 3.5. (A.15)

We will prove later that

Proof of Theorem 3.5. (A.16)

and then by Chebyshev we have that for any Inline graphic,

Proof of Theorem 3.5.

Setting the r.h.s. to be Inline graphic gives Inline graphic, and thus

Proof of Theorem 3.5. (A.17)

Under Inline graphic, by (A.10) and the definition that Inline graphic,

Proof of Theorem 3.5. (A.18)
Proof of Theorem 3.5. (A.19)

Since Inline graphic (cf. (A.11)),

Proof of Theorem 3.5.

we have that

Proof of Theorem 3.5. (A.20)

We will prove later that

Proof of Theorem 3.5. (A.21)

and then, using Chebyshev again, for any Inline graphic,

Proof of Theorem 3.5.

Combined with (A.17) which shows that Inline graphic is a valid threshold to achieve level Inline graphic, by setting Inline graphic (which is strictly positive under (3.13)), this gives the bound (3.14).

It remains to prove (A.16) and (A.21) to finish the proof.

Proof of (A.16):

We will prove that

Proof of (A.16): (A.22)

where the first term Inline graphic since Inline graphic (cf. (A.1)), and the second term ¡0.1 under the condition that Inline graphic. This gives (A.16).

Recall that Inline graphic, and Inline graphic. By the definition of Inline graphic (A.9),

Proof of (A.16): (A.23)

Note that

Proof of (A.16):

and then

Proof of (A.16): (A.24)

Recall that Inline graphic for any Inline graphic; thus, Inline graphic does not vanish only when the indices Inline graphic all equal or fall into two pairs. Then

Proof of (A.16): (A.25)

Recall that

Proof of (A.16): (A.26)
Proof of (A.16): (A.27)

and Inline graphic (cf. Proposition 3.2 (1)), the above line continues to give that

Proof of (A.16): (A.28)

Similarly, the fourth term can be bounded by

Proof of (A.16): (A.29)

Back to (A.23), we have that

Proof of (A.16):

which is (A.22).

Proof of (A.21): Recall that Inline graphic, Inline graphic being constant; thus, Inline graphic. By (A.19),

Proof of Theorem 3.5.

We will prove that

Proof of Theorem 3.5. (A.30)
Proof of Theorem 3.5. (A.31)
Proof of Theorem 3.5. (A.32)

which, putting together, gives that

Proof of Theorem 3.5.

which proves (A.21).

Proof of (A.30): Recall that (cf. (A.12)), and Inline graphic,

Proof of Theorem 3.5.

We define

Proof of Theorem 3.5. (A.33)

and then

Proof of Theorem 3.5.

thus

Proof of Theorem 3.5. (A.34)

Meanwhile, by Proposition 3.2 (1), Inline graphic uniformly, this gives that Inline graphic for any Inline graphic. Thus,

Proof of Theorem 3.5.

As a result,

Proof of Theorem 3.5.

which proves (A.30).

Proof of (A.31): By (A.9), the independence of Inline graphic from Inline graphic, and that Inline graphic, Inline graphic for all Inline graphic,

Proof of Theorem 3.5.

We then have that

Proof of Theorem 3.5.

where Inline graphic is defined as

Proof of Theorem 3.5. (A.35)
Proof of Theorem 3.5. (A.36)
Proof of Theorem 3.5. (A.37)

We compute Inline graphic and Inline graphic, respectively: note that Inline graphic; thus,

Proof of Theorem 3.5.

due to that Inline graphic vanishes unless the three indices coincide. By that Inline graphic and Inline graphic, we have that

Proof of Theorem 3.5. (A.38)

As for Inline graphic, by that Inline graphic for any Inline graphic, a similar argument gives that

Proof of Theorem 3.5.

To proceed, by Cauchy–Schwarz,

Proof of Theorem 3.5.

where, same as in (A.34),

Proof of Theorem 3.5.

and, by the uniform bound that Inline graphic, we also have that

Proof of Theorem 3.5. (A.39)

Together, they give that

Proof of Theorem 3.5.

This means that

Proof of Theorem 3.5. (A.40)

Back to (A.35), we have that

Proof of Theorem 3.5.

namely (A.31).

Proof of (A.32): Similar to (A.23),

Proof of Theorem 3.5.

We have that

Proof of Theorem 3.5. (A.41)

and by (A.28),

Proof of Theorem 3.5. (A.42)

Using a similar argument to derive (A.25), one can verify that

Proof of Theorem 3.5.

Meanwhile,

Proof of Theorem 3.5.

and Inline graphic (cf. (A.39)). This gives that

Proof of Theorem 3.5. (A.43)

We also have that

Proof of Theorem 3.5.

together with (A.41)–(A.43), this gives that

Proof of Theorem 3.5.

Note that Inline graphic, and Inline graphic; thus,

Proof of Theorem 3.5.

this gives that

Proof of Theorem 3.5.

The first term equals Inline graphic, and the second term ¡0.1 under the condition of the theorem. This proves (A.32).

References

  • 1. Anderson N. H., Hall P. & Titterington D. M. (1994) Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. J. Multivariate Anal., 50, 41–54. [Google Scholar]
  • 2. Basser P. J. & Pajevic S. (2007) Spectral decomposition of a 4th-order covariance tensor: applications to diffusion tensor MRI. Signal Process., 87, 220–236. [Google Scholar]
  • 3. Bermanis A., Averbuch A. & Coifman R. R. (2013) Multiscale data sampling and function extension. Appl. Comput. Harmon. Anal., 34, 15–29. [Google Scholar]
  • 4. Bickel P. J. (1969) A distribution free version of the Smirnov two sample test in the p-variate case. Ann. Math. Stat., 40, 1–23. [Google Scholar]
  • 5. Bruggner R. V., Bodenmiller B., Dill D. L., Tibshirani R. J. & Nolan G. P. (2014) Automated identification of stratifying signatures in cellular subpopulations. Proc. Natl. Acad. Sci., 111, E2770–E2777. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Chwialkowski K. P., Ramdas A., Sejdinovic D. & Gretton A. (2015) Fast two-sample testing with analytic representations of probability measures. Proceedings of the 28th International Conference on Neural Information Processing Systems, 2, 1981–1989. [Google Scholar]
  • 7. Coifman R. R., Lafon S., Lee A. B., Maggioni M., Nadler B., Warner F. & Zucker S. W. (2005) Geometric diffusions as a tool for harmonic analysis and structure definition of data: multiscale methods. Proc. Natl. Acad. Sci., 102, 7432–7437. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Deschler B. & Lübbert M. (2006) Acute myeloid leukemia: epidemiology and etiology. Cancer, 107, 2099–2107. [DOI] [PubMed] [Google Scholar]
  • 9. Epps T. & Singleton K. J. (1986) An omnibus test for the two-sample problem using the empirical characteristic function. J. Statist. Comput. Simulation, 26, 177–203. [Google Scholar]
  • 10. Eric M., Bach F. R. & Harchaoui Z. (2008) Testing for homogeneity with kernel fisher discriminant analysis. Adv. Neur. Inform. Proc. Syst., 609–616. [Google Scholar]
  • 11. Fernández V. A., Gamero M. J. & García J. M. (2008) A test for the two-sample problem based on empirical characteristic functions. Comput. Statist. Data Anal., 52, 3730–3748. [Google Scholar]
  • 12. Friedman J. H. & Rafsky L. C. (1979) Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. Ann. Statist., 697–717. [Google Scholar]
  • 13. Gretton A., Borgwardt K. M., Rasch M. J., Schölkopf B. & Smola A. (2012) A kernel two-sample test. J. Mach. Learn. Res., 13, 723–773. [Google Scholar]
  • 14. Gretton A., Sejdinovic D., Strathmann H., Balakrishnan S., Pontil M., Fukumizu K. & Sriperumbudur B. K. (2012) Optimal kernel choice for large-scale two-sample tests. Proceedings of the 25th International Conference on Neural Information Processing Systems, 1, 1205–1213. [Google Scholar]
  • 15. Hall P. & Tajvidi N. (2002) Permutation tests for equality of distributions in high-dimensional settings. Biometrika, 89, 359–374. [Google Scholar]
  • 16. Henze N. (1988) A multivariate two-sample test based on the number of nearest neighbor type coincidences. Ann. Statist., 16, 772–783. [Google Scholar]
  • 17. Higgins J. J. (2003) Introduction to Modern Nonparametric Statistics, 1st edn (May 5, 2003) Duxbury Press. [Google Scholar]
  • 18. Jitkrittum W., Szabó Z., Chwialkowski K. P. & Gretton A. (2016) Interpretable distribution features with maximum testing power. Proceedings of the 30th International Conference on Neural Information Processing Systems, 1, 181–189. [Google Scholar]
  • 19. Kolmogorov A. (1933) Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital. Attuari, 4, 83–91. [Google Scholar]
  • 20. Leeb W. & Coifman R. (2016) Hölder–Lipschitz norms and their duals on spaces with semigroups, with applications to earth mover’s distance. J. Fourier Anal. Appl., 22, 910–953. [Google Scholar]
  • 21. Little A. V., Maggioni M. & Rosasco L. (2016) Multiscale geometric methods for data sets I: multiscale SVD, noise and curvature. Appl. Comput. Harmon. Anal. http://hdl.handle.net/1721.1/72597. [Google Scholar]
  • 22. Lloyd J. R. & Ghahramani Z. (2015) Statistical model criticism using kernel two sample tests. Proceedings of the 28th International Conference on Neural Information Processing Systems, 1, 829–837. [Google Scholar]
  • 23. Ozakin A. & Gray A. G. (2009) Submanifold density estimation. Adv. Neur. Inform. Proc. Syst., 1, 1375–1382. [Google Scholar]
  • 24. Ramdas A., Reddi S. J., Póczos B., Singh A. & Wasserman L. (2015) On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence . 3571–3577
  • 25. Rosenbaum P. R. (2005) An exact distribution-free test comparing two multivariate distributions based on adjacency. J. R. Stat. Soc. Series B Stat. Methodology, 67, 515–530. [Google Scholar]
  • 26. Sejdinovic D., Sriperumbudur B., Gretton A. & Fukumizu K. (2013) Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Statist., 41, 2263–2291. [Google Scholar]
  • 27. Serfling R. J. (1981) Approximation Theorems of Mathematical Statistics, 1st edn (December 21, 2001) Wiley-Interscience. [Google Scholar]
  • 28. Shirdhonkar S. & Jacobs D. W. (2008) Approximate earth mover’s distance in linear time. Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, pp. 1–8. [Google Scholar]
  • 29. Smirnov N. (1948) Table for estimating the goodness of fit of empirical distributions. Ann. Math. Stat., 19, 279–281. [Google Scholar]
  • 30. Székely G. J. & Rizzo M. L. (2004) Testing for equal distributions in high dimension. InterStat, 5, 1249–1272. [Google Scholar]
  • 31. Woolfe F., Liberty E., Rokhlin V. & Tygert M. (2008) A fast randomized algorithm for the approximation of matrices. Appl. Comput. Harmon. Anal., 25, 335–366. [Google Scholar]
  • 32. Zhao J. & Deyu M. (2015) Fastmmd: Ensemble of circular discrepancy for efficient two-sample test. Neural Comput., 27, 1345–1372. [DOI] [PubMed] [Google Scholar]

Articles from Information and Inference are provided here courtesy of Oxford University Press

RESOURCES