Abstract
The paper introduces a new kernel-based Maximum Mean Discrepancy (MMD) statistic for measuring the distance between two distributions given finitely many multivariate samples. When the distributions are locally low-dimensional, the proposed test can be made more powerful to distinguish certain alternatives by incorporating local covariance matrices and constructing an anisotropic kernel. The kernel matrix is asymmetric; it computes the affinity between n data points and a set of n_R reference points, where n_R can be drastically smaller than n. While the proposed statistic can be viewed as a special class of Reproducing Kernel Hilbert Space MMD, the consistency of the test is proved, under mild assumptions on the kernel, as long as the density deviation between the two distributions decays to zero more slowly than n^{-1/2}, and a finite-sample lower bound of the testing power is obtained. Applications to flow cytometry and diffusion MRI datasets, which motivate the proposed approach to comparing distributions, are demonstrated.
Keywords: two-sample statistics, maximum mean discrepancy, anisotropic kernel
1. Introduction
We address the problem of comparing two probability distributions from finite samples in ℝ^d, where both distributions are assumed to be continuous (with respect to the Lebesgue measure) and compactly supported, with densities p and q. We consider the case where each distribution is observed through i.i.d. samples, called X (of size n_X) and Y (of size n_Y), respectively, and the two datasets X and Y are independent. The methodology has applications in a variety of fields, particularly in bioinformatics. It can be used, for example, to test genetic similarities between subtypes of cancers, to compare patient groups to determine potential treatment propensity, and to detect small anomalies in medical images that are symptomatic of a certain disease. We will cover applications to flow cytometry and diffusion MRI datasets in this paper.
We are interested in the general alternative hypothesis test, namely H0: p = q against H1: p ≠ q. We also focus on the medium-dimensional setting, in which the dimension d is large but smaller than the sample sizes n_X and n_Y, where n_X (n_Y) is the number of samples in X (Y). In the large-sample limit we consider the case where n_X, n_Y → ∞, the ratio n_X/(n_X + n_Y) converges to a positive constant between 0 and 1, and the dimension d is assumed to be fixed. (For the scenario where d grows with the sample size, the convergence of kernel matrices to the limiting integral operator needs to be treated quite differently, and a new analysis is needed.) We are particularly interested in the situation where data are sampled from distributions which are locally low-rank, meaning that the local covariance matrices are generally of rank much smaller than d. As will be clear in the analysis, the approach of constructing anisotropic kernels is most useful when data exhibit such characteristics in a high-dimensional ambient space.
We are similarly concerned with an N-sample problem, in which the question is to determine the mutual differences among N distributions, each of which has a finite set of i.i.d. samples. This can be done by measuring a pairwise distance between any two distributions and combining these pairwise measurements to build a weighted graph between the distributions. Since this reduces to pairwise comparisons, we focus on the two-sample test as the primary problem.
In the two-sample problem, the two data sets do not have any point-to-point correspondence, which motivates the comparison of their unknown probability densities. One way to do this is to construct ‘bins’ at many places in the domain, compute the histograms of the two datasets over these bins and then compare the histograms. This turns out to be a good description of our basic strategy, and the techniques beyond this include using ‘anisotropic Gaussian’ bins at every local point and ‘smoothing’ the histograms when needed. Notice that there is a trade-off in constructing these bins: if a bin is too small, so that it contains no points most of the time, the histogram will have a large variance compared with its mean. When a bin is too large, it loses the power to distinguish p and q when they differ only slightly within the bin. In more than one dimension, one may hope to construct anisotropic bins which are large in some directions and small in others, so as to maximize the power to differentiate p and q while maintaining small variance. This turns out to be possible when the deviation has certain preferred (local) directions. We illustrate this idea in a toy example below, and the analysis of testing consistency, including the comparison of different ‘binning’ strategies, is carried out by analysing the spectrum of the associated kernels in Section 3.
The idea of using bins and comparing histograms dates back to measuring distribution distances based on kernel density estimation, and is also closely related to two-sample statistics based on the Reproducing Kernel Hilbert Space (RKHS) Maximum Mean Discrepancy (MMD) [13]. Other non-parametric tests have been proposed for the two-sample problem, including the energy distance test [26, 30], kernel Fisher discriminant analysis [10], the mean embedding (ME) test and the smooth characteristic function test [6, 18]. We discuss these connections in more detail in Section 1.2. While anisotropic kernels have been previously studied in manifold learning and image processing, kernel-based two-sample statistics with anisotropic kernels have not been examined in the situation where the data are locally low-rank, which is the theme of the current paper. At the same time, we are not only interested in whether p and q deviate, but in how and where they deviate. The difference of the histograms of X and Y measured at multiple bins, introduced above, can surely be used as an indication of where p and q differ. The formal concept is known as the witness function in the literature [13], which we describe in Section 2.3.
This paper makes several contributions to the two-sample problem. Methodologically, we introduce a kernel-based MMD statistic that increases the testing power against certain deviated distributions by adopting an anisotropic kernel; both computation and memory requirements are reduced by using a reference set of locations on which the difference of the two distributions is computed. The proposed method can be combined with spectral smoothing of the histograms in order to reduce variability and possibly optimize the importance of certain regions of data, so that the power of the test may be further improved. Theoretically, asymptotic consistency is proved whenever the system is beyond the critical regime, i.e. when the magnitude of the density deviation q_n - p decays to zero more slowly than n^{-1/2}, under generic assumptions. The subscript n in q_n denotes the possible dependence of the deviated density on the number of samples n. Both asymptotic and non-asymptotic results are given. Experimentally, we provide two novel applications of the two-sample and N-sample problems to biological datasets.
The rest of the paper is organized as follows: we begin with a sketch of the main idea and motivating example in the remainder of this section, together with a review of previous studies. Section 2 formalizes the definition of the MMD statistics being proposed, and the theoretical analysis of test power is given in Section 3. The algorithm and other implementation matters, including computation complexity, are discussed in Section 4. Section 5 covers numerical experiments on synthetic and real-world datasets.
1.1 Main idea
Let p and q be two distributions supported on a compact set Ω in ℝ^d. Suppose a reference set R = {r_k}_{k=1}^{n_R} is given which consists of n_R points, and for each reference point r there is a (non-degenerate) covariance matrix Σ_r. We define the asymmetric affinity kernel to be
(1.1)  K(x, r) = exp( -(1/2)(x - r)^T Σ_r^{-1} (x - r) ),  x ∈ Ω, r ∈ R.
In practice, {Σ_r} is either given or can be computed from data, e.g. by local PCA on the combined dataset. Consider the two independent datasets X = {x_i}_{i=1}^{n_X} and Y = {y_j}_{j=1}^{n_Y}, where X has n_X i.i.d. samples drawn from p and Y has n_Y i.i.d. samples drawn from q. The empirical histograms of X and Y at the reference point r are defined as
(1.2)  p̂(r) = (1/n_X) Σ_{i=1}^{n_X} K(x_i, r),  q̂(r) = (1/n_Y) Σ_{j=1}^{n_Y} K(y_j, r),
for which the population quantities are
(1.3)  p̄(r) = E_{x∼p} K(x, r) = ∫ K(x, r) p(x) dx,  q̄(r) = E_{y∼q} K(y, r).
The empirical histograms are nothing but the Gaussian binning of X and Y at the point r, with the anisotropic bins corresponding to the covariance matrix Σ_r. We then compute the quantity
(1.4)  T = (1/n_R) Σ_{k=1}^{n_R} ( p̂(r_k) - q̂(r_k) )²
as a measurement of the (squared) distance between the two datasets.
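As a concrete illustration, the construction (1.1)-(1.4) can be sketched in a few lines of NumPy; the function names and the unit-bandwidth convention are our assumptions, not from any released code:

```python
import numpy as np

def anisotropic_kernel(X, R, Sigmas):
    """K[i, k] = exp(-(x_i - r_k)^T Sigma_k^{-1} (x_i - r_k) / 2), cf. (1.1).

    X: (n, d) data; R: (n_R, d) reference points;
    Sigmas: (n_R, d, d), one covariance matrix per reference point.
    """
    n, n_R = X.shape[0], R.shape[0]
    K = np.empty((n, n_R))
    for k in range(n_R):
        diff = X - R[k]                              # (n, d) differences x - r_k
        sol = np.linalg.solve(Sigmas[k], diff.T).T   # Sigma_k^{-1} (x - r_k)
        K[:, k] = np.exp(-0.5 * np.sum(diff * sol, axis=1))
    return K

def mmd_stat(X, Y, R, Sigmas):
    """T = (1/n_R) sum_k (p_hat(r_k) - q_hat(r_k))^2, cf. (1.2) and (1.4)."""
    p_hat = anisotropic_kernel(X, R, Sigmas).mean(axis=0)   # Gaussian binning of X
    q_hat = anisotropic_kernel(Y, R, Sigmas).mean(axis=0)   # Gaussian binning of Y
    return float(np.mean((p_hat - q_hat) ** 2))
```

Note that only the n-by-n_R kernel matrix is ever formed, so the cost is linear in the sample size for fixed n_R.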
We use the following example to illustrate the difference between (1) using an anisotropic kernel, where Σ_r is aligned with the tangent space of the manifold data, and (2) using an isotropic one, where Σ_r is a multiple of the identity matrix.
The data are as in Fig. 1(A1–A3), where p and q are supported on two arcs in ℝ² separated by a gap whose size varies across regions. We begin by specifying a reference set and the covariance field {Σ_r}. For simplicity, we do this by uniformly sampling n_R reference points from the data (see Fig. 1). At each reference point r, we take the nearest neighbours to estimate the local covariance matrix Σ_r. The empirical histograms p̂ and q̂ are computed as in Equation (1.2) at every reference point (see Fig. 1), as well as the quantity T as in Equation (1.4). We also compute T under a permutation of the data points in X and Y so as to mimic the null hypothesis p = q, and we call the two values T and T_perm, respectively. The experiment is repeated 100 times, and the distributions of T and T_perm are shown as red and blue bars in Fig. 1. The simulation is done across three datasets in which the gap size takes three different values, and we compare isotropic and anisotropic kernels. When p = q, the distributions of T and T_perm overlay each other, as expected. When p ≠ q, greater separation between the distributions of T and T_perm implies greater power for the test. The advantage of the anisotropic kernel is clearly demonstrated, particularly for the intermediate gap size (the middle row).
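The two ingredients of the toy experiment, local covariance estimation from nearest neighbours and a permutation-based null, might be sketched as follows; the neighbourhood size k, the small ridge term and all names are illustrative assumptions:

```python
import numpy as np

def local_covariances(data, R, k=20, reg=1e-3):
    """Estimate Sigma_r at each reference point from its k nearest
    neighbours in `data` (a local-PCA-style covariance).  The ridge `reg`
    keeps Sigma_r non-degenerate; both parameters are illustrative."""
    n_R, d = R.shape
    Sigmas = np.empty((n_R, d, d))
    for i in range(n_R):
        idx = np.argsort(np.sum((data - R[i]) ** 2, axis=1))[:k]
        nbrs = data[idx] - R[i]
        Sigmas[i] = nbrs.T @ nbrs / k + reg * np.eye(d)
    return Sigmas

def permutation_null(X, Y, R, Sigmas, stat, rng):
    """One draw of the statistic under a random re-split of the pooled
    data, mimicking the null hypothesis p = q."""
    Z = np.vstack([X, Y])
    perm = rng.permutation(len(Z))
    return stat(Z[perm[:len(X)]], Z[perm[len(X):]], R, Sigmas)
```

Repeating `permutation_null` many times gives the blue null histogram against which the observed T is compared.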
The analysis of the testing power of T hinges on the singular value decomposition of the kernel K(x, r), formally written as K(x, r) = Σ_l σ_l φ_l(x) ψ_l(r) and defined in Section 3. The histogram thus becomes
p̂(r) = Σ_l σ_l ψ_l(r) · (1/n_X) Σ_{i=1}^{n_X} φ_l(x_i),
and similarly for q̂, which means that the ability of T to distinguish p and q is determined by the amount that p and q differ when projected onto the singular functions φ_l. For the example above, the first few singular functions are visualized in Fig. 1(C1–C4), where the φ_l's of the anisotropic kernel capture the directions in which p and q deviate at a lower index l, thus contributing more significantly to the quantity T. Figure 1(C5 and C6) also shows the ‘witness function’ ([13], cf. Section 2.3) of the two kernels, which indicates the regions of deviation between p and q.
Throughout the paper, we refer to the kernel (1.1) with a general local Mahalanobis distance, i.e. a non-trivial Σ_r, as an anisotropic kernel, and to the kernel with Σ_r a multiple of the identity as an isotropic kernel. Similarly, we refer to a kernel measuring affinity from all points to a reference set (i.e. with n_R possibly much smaller than n) as an asymmetric kernel. The analysis in Section 3 is for the symmetrized version of the kernel, while in practice one never computes the n-by-n symmetrized kernel, but only the n-by-n_R asymmetric kernel, which is equivalent and far more efficient. We discuss computation further in Section 4.4.
1.2 Related work
The question of two-sample testing is a central problem in statistics. In one dimension, one classical approach is the Kolmogorov–Smirnov distance, which computes the distance between the two empirical cumulative distribution functions [19, 29]. While there exist generalizations of these infinitely supported bins to high dimensions [4, 12, 15, 25], they require a large computational cost for either computing a minimum spanning tree or running a large number of randomized tests. This motivates binning functions that are defined more locally and in a data-adaptive fashion. Another high-dimensional extension of Kolmogorov–Smirnov is to randomly project the data into a low-dimensional space and compute the test in each dimension.
The one-dimensional Kolmogorov–Smirnov statistic can be seen as a special case of the MMD, the latter being generally defined as
MMD(F; p, q) = sup_{f ∈ F} ( E_{x∼p} f(x) - E_{y∼q} f(y) ),
where F is a certain family of integrable functions [13]. When F equals the set of all indicator functions of intervals in ℝ, the MMD gives the Kolmogorov–Smirnov distance. Gretton et al. [13] studied kernel-based MMD where the function class F consists of all functions f s.t. ‖f‖_H ≤ 1, where ‖·‖_H denotes the norm of the Hilbert space H associated with the reproducing kernel. Specifically, suppose the positive semi-definite (PSD) kernel is h(x, y); the (squared) RKHS MMD can be written as
(1.5)  MMD²(p, q) = E_{x,x'∼p} h(x, x') + E_{y,y'∼q} h(y, y') - 2 E_{x∼p, y∼q} h(x, y),
and can be estimated by (here we refer to the biased estimator in [13], which includes the diagonal terms)
\widehat{MMD}² = (1/n_X²) Σ_{i,i'} h(x_i, x_{i'}) + (1/n_Y²) Σ_{j,j'} h(y_j, y_{j'}) - (2/(n_X n_Y)) Σ_{i,j} h(x_i, y_j).
The methodology and theory apply to data of any dimension as long as the kernel can be evaluated.
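For reference, the biased estimator of (1.5) with an isotropic Gaussian kernel can be sketched in a few lines; the bandwidth convention and names are ours:

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Isotropic Gaussian kernel matrix h(a, b) = exp(-|a - b|^2 / (2 sigma^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def rkhs_mmd2_biased(X, Y, sigma=1.0):
    """Biased (V-statistic) estimator of the squared RKHS MMD in (1.5),
    diagonal terms included, as in [13]."""
    return (gaussian_gram(X, X, sigma).mean()
            + gaussian_gram(Y, Y, sigma).mean()
            - 2.0 * gaussian_gram(X, Y, sigma).mean())
```

Being the squared norm of a mean-embedding difference, the biased estimate is always non-negative.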
We consider a kernel of the form h(x, y) = ∫ K(x, r) K(y, r) dν(r) and its variants, where ν is a certain measure of the reference points. This kernel can be seen as a special case of RKHS MMD [13], which allows a general PSD kernel. When Σ_r is isotropic and constant in r and ν is the Lebesgue measure, h reduces to a Gaussian kernel, but generally it does not. In practice, ν corresponds to the discrete measure on a set of reference points, which reduces the computation to be linear in the sample size. A similar construction has been taken in the ME test [6, 18], where the kernel is an isotropic Gaussian. The use of an anisotropic kernel, i.e. a non-identity Σ_r, is key to our construction, and allows one to incorporate local dimensionality reduction in (1.1). This may lead to a more powerful test for distributions that are concentrated near locally low-dimensional structures, and turns out to be critical for the empirical success of the proposed method on real-world datasets. Theoretically, our analysis gives a comparison of the testing power of different kernels, which is determined by the spectral decomposition of the kernels. This result augments those in [13]: it was pointed out in [24] that the test power may degenerate when applied to high-dimensional data, and the proposed method addresses this issue when the data exhibit a locally low-dimensional structure, e.g. manifold data. Our result also adds to those in [6, 18] by proving the convergence of the test power at a sharper rate, both in the asymptotic and the non-asymptotic results, with a more general construction of the MMD statistic (i.e. a general distribution of reference points and anisotropic kernels).
While the empirical spectral expansion of translation-invariant kernels has been previously used for two-sample tests, e.g. in [11] and more recently in [32], and the idea dates back to earlier statistical works (see, e.g. [9]), our setting is different due to the use of an anisotropic kernel. We also generalize the construction to a larger family of kernels by adding a ‘spectral filtering’, which truncates the spectral decomposition of the anisotropic kernels and modifies the eigenvalues, cf. Section 2.2. The modified kernel may lead to improved testing power for certain departures.
The problem of optimizing kernel parameters has been considered in several places, from the selection of kernel bandwidth in kernel MMD test [13] to the optimization of reference locations in the ME test [18]. We discuss the parameters of our algorithm in Section 2.4. The optimization over a search space of kernels has been considered by [14], where one constructs a convex combination of a finite number of kernels drawn from a given family, typically isotropic Gaussian kernels with various bandwidths. The kernels we consider are anisotropic, and thus lie outside the family of kernels considered in [14] and, in particular, have different eigenvalues and eigenfunctions. Also, building spectral filters directly on the eigenfunctions yields a different MMD statistic than by convex combination of kernels.
Our approach is also closely related to the previous study of distribution distances based on kernel density estimation [1]. We generalize the results in [1] by considering non-translation-invariant kernels, which greatly increases the separation between the expectations of the statistic under the null and alternative hypotheses. Moreover, it is well known that kernel density estimation, on which [1] is based, converges poorly in high dimension. In the manifold setting, the problem was remedied by normalizing the (isotropic) kernel in a modified way, and the estimation accuracy was shown to depend only on the intrinsic dimension [23]. Our proposed approach takes further advantage of the locally low-dimensional structure, and obtains improved distinguishing power compared with the one using isotropic kernels when possible.
Finally, the proposed approach can be viewed as related to two-sample testing via nearest neighbours [16]. In [16], one computes the k nearest neighbours of a reference point among the pooled data X ∪ Y, and derives a statistical test based on the amount by which the empirical ratio k_X/k, where k_X is the number of those neighbours coming from X (similarly k_Y), deviates from the expected ratio under the null hypothesis, namely n_X/(n_X + n_Y). Because the nearest-neighbour algorithm is based on the Euclidean distance, it is equivalent to a kernel-based MMD with a hard-thresholded isotropic kernel. The approach can similarly be combined with a local Mahalanobis distance as we do, which has not been explored.
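A minimal sketch of the nearest-neighbour ratio underlying the test of [16] (hard-thresholded isotropic version; names are ours):

```python
import numpy as np

def nn_ratio(r, X, Y, k):
    """Fraction of the k nearest pooled neighbours of r coming from X.
    Under the null p = q, this should concentrate around n_X / (n_X + n_Y)."""
    Z = np.vstack([X, Y])
    labels = np.concatenate([np.ones(len(X)), np.zeros(len(Y))])
    idx = np.argsort(np.sum((Z - r) ** 2, axis=1))[:k]  # hard k-NN thresholding
    return float(labels[idx].mean())
```

Replacing the Euclidean distance inside `np.sum((Z - r) ** 2, axis=1)` by a local Mahalanobis distance would give the (unexplored) anisotropic variant mentioned above.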
2. MMD test statistics and witness functions
We are given two independent datasets X = {x_i}_{i=1}^{n_X} and Y = {y_j}_{j=1}^{n_Y}, where X has n_X i.i.d. samples drawn from distribution p and Y has n_Y i.i.d. samples drawn from q. We also use p and q to denote the densities, and write integration with respect to the measure p(x)dx simply as integration w.r.t. p, and similarly for q.
In the method proposed in Section 1.1, we assume that the reference set R and the covariance field {Σ_r} are given. In this section, we consider a slightly more general form of MMD test statistics, which involves a distribution ν of the reference points and a covariance field {Σ_r}, both assumed to be pre-defined. They can be viewed as parameters of the algorithm (cf. Section 2.4), and the choice can be crucial for the test power. Section 3 explains such effects by theoretically analysing the test power, which is entirely characterized by the spectral decomposition of the anisotropic kernel, and the spectrum in turn depends on the choice of ν and {Σ_r}. The effects are also empirically demonstrated in Section 3.6. The algorithmic choice of R and {Σ_r} in practice is explained in Section 4.
2.1 Kernel MMD statistics
Using the kernel defined in (1.1), we consider the following empirical statistic:
(2.1)  T = ∫ ( p̂(r) - q̂(r) )² dν(r),
where p̂ and q̂ are defined in (1.2). Note that (2.1) assumes the measure ν along with the covariance field {Σ_r}, as explained above. For now we leave ν general; in the proposed algorithm, ν is the empirical measure of some reference set R, cf. Section 4.
The population statistic corresponding to (2.1) is
(2.2)  D(p, q) = ∫ ( ∫ K(x, r)(p(x) - q(x)) dx )² dν(r).
This statistic can be viewed as a special form of the RKHS MMD: by (1.5), (2.1) is the (squared) RKHS MMD with the kernel
(2.3)  h(x, y) = ∫ K(x, r) K(y, r) dν(r).
The kernel (2.3) is clearly PSD but not necessarily ‘universal’, meaning that the population MMD as in (2.2) may not give a distance between p and q. The test is thus restricted to departures within the associated Hilbert space (Assumption 2). Recall that with finite samples, what determines the power of a test is the relative magnitude of the bias of the test statistic under H0 and H1, in comparison with its probabilistic deviation. Maximizing the bias while controlling the variance is practically more important than the population statistic being mathematically non-zero under every conceivable deviation of distribution. In this view, constructing a pseudo-metric instead of a metric trades the ability to detect certain types of departures, which may be hard to detect with finite samples anyway, for improved discriminative ability against others to which the proposed anisotropic kernel is more sensitive.
We introduce a spectral decomposition of the kernel based upon that of the asymmetric kernel K(x, r), which sheds light on the analysis: let μ be a distribution of data points (which is a mixture of p and q to be specified later). Since K is bounded by 1, so is the integral ∫∫ K(x, r)² dμ(x) dν(r), and thus the asymmetric kernel is Hilbert–Schmidt and the associated integral operator is compact. The singular value decomposition of K with respect to μ and ν can be written as
(2.4)  K(x, r) = Σ_l σ_l φ_l(x) ψ_l(r),  σ_1 ≥ σ_2 ≥ ⋯ ≥ 0,
where σ_l ≥ 0, and {φ_l} and {ψ_l} are orthonormal sets w.r.t. μ and ν, respectively. Then (2.3) can be written as
(2.5)  h(x, y) = Σ_l λ_l φ_l(x) φ_l(y),  λ_l := σ_l².
This formula suggests that the ability of the kernel MMD to distinguish p and q is determined by (i) how discriminative the eigenfunctions φ_l are (viewed as ‘feature extractors’), and (ii) how the spectrum {λ_l} decays (viewed as ‘weights’ combining the differences extracted per mode). It also naturally leads to generalizing the definition by modifying the weights λ_l, which we do next.
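Empirically, the singular functions and the spectrum can be approximated by an SVD of the n-by-n_R kernel matrix; the normalization below (a sqrt(n) scaling of the left singular vectors, and squared singular values over n as eigenvalues) is one common convention, assumed here rather than taken from the paper:

```python
import numpy as np

def kernel_svd(K):
    """SVD of the empirical n x n_R asymmetric kernel matrix K.

    phi[:, l] approximates phi_l evaluated at the n data points (scaled by
    sqrt(n) so that columns are orthonormal in the empirical L2(mu) inner
    product), and lam[l] = s_l^2 / n approximates the weight lambda_l of
    the symmetrized kernel (2.5), up to normalization conventions."""
    n = K.shape[0]
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    return np.sqrt(n) * U, s**2 / n, Vt
```

Note the decomposition costs O(n n_R²) rather than the O(n³) needed for the n-by-n symmetrized kernel.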
2.2 Generalized kernel and spectral filtering
We consider the situation where the distributions p and q lie around certain lower-dimensional manifolds in the ambient space, and both densities are smooth with respect to the manifold and decay off-manifold, which is typically encountered in the applications of interest (cf. Section 5). Meanwhile, since the reference set is sampled near the data, ν is also centered around the manifold. Thus, one would expect the population histograms in (1.3) to be smooth on the manifold as well. This suggests applying a ‘low-pass filter’ to the empirical histograms before computing the distance between them, namely the MMD statistic.
We thus introduce a general form of kernel as
(2.6)  h_Γ(x, y) = Σ_l γ_l φ_l(x) φ_l(y),
where Γ = {γ_l} is a sequence of sufficiently decaying positive numbers, the requirements to be detailed in Section 3. Our analysis will be based on kernels of the form (2.6), which includes (2.5) as a special case when γ_l = λ_l. While the eigenfunctions are generally not analytically available, to compute the MMD statistic one only needs to evaluate the φ_l's on the data points in X ∪ Y, which can be approximated by the empirical singular vectors of the n-by-n_R kernel matrix and computed efficiently for MMD tests (cf. Section 4). Note that the approximation by the empirical spectral decomposition may degrade as l increases; however, for the purpose of smoothing histograms, one typically assigns small values of γ_l for large l so as to suppress the ‘high-frequency components’.
The construction (2.6) gives a large and versatile family of kernels: first, setting γ_l = λ_l^t for some positive integer t is equivalent to using the t-th power of the kernel, i.e. h^{(t)}(x, y) = Σ_l λ_l^t φ_l(x) φ_l(y). This can be interpreted as redefining the affinity of points x and y by allowing t steps of ‘intermediate diffusion’ on the reference set, and thus ‘on the data’ [7]. When ν is uniform over the whole ambient space, raising t is equivalent to enlarging the bandwidth of the Gaussian kernel. However, when ν is chosen to adapt to the densities p and q, the kernel becomes a data-distribution-adapted object. As t increases, λ_l^t decays rapidly in l, which ‘filters out’ high-frequency components in the histograms when computing the MMD statistic, because the contribution of mode l is weighted by γ_l. Generally, setting {γ_l} to be a decaying sequence has the effect of spectral filtering. Secondly, when prior knowledge about the magnitude of the projection of p - q onto each φ_l is available, one may also choose γ_l accordingly to select the ‘important modes’. Furthermore, one may view the kernel MMD with (2.6) as a weighted squared-distance statistic after projecting onto the spectral coordinates given by the φ_l's, where the coordinates are uncorrelated thanks to the orthogonality of {φ_l}. We further discuss possible generalizations in the last section. The paper is mainly concerned with kernels of the form (2.6), including (2.3), with anisotropic Σ_r, while the above extension of the MMD may be of interest even when the kernel is isotropic.
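A minimal sketch of the spectrally filtered statistic: project the two samples onto empirical singular functions of the pooled kernel matrix, and combine the squared mean differences with the filter weights γ_l. Pooling the two samples to estimate the singular functions, and truncating at rank len(gammas), are our assumptions:

```python
import numpy as np

def filtered_mmd(K_X, K_Y, gammas):
    """Spectrally filtered MMD: weighted squared distance between the two
    samples in the spectral coordinates of the pooled kernel matrix.

    K_X (n_X x n_R) and K_Y (n_Y x n_R) are asymmetric kernel matrices
    computed against a shared reference set; `gammas` are the filter
    weights, truncating the expansion to rank len(gammas)."""
    K_all = np.vstack([K_X, K_Y])
    n, n_X = K_all.shape[0], K_X.shape[0]
    U, s, _ = np.linalg.svd(K_all, full_matrices=False)
    L = min(len(gammas), len(s))
    phi = np.sqrt(n) * U[:, :L]                 # empirical phi_l on pooled points
    diff = phi[:n_X].mean(axis=0) - phi[n_X:].mean(axis=0)
    return float(np.sum(np.asarray(gammas[:L]) * diff**2))
```

Setting all weights equal recovers an unfiltered statistic, while a decaying sequence suppresses the high-frequency modes as described above.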
2.3 Witness functions
Following the convention of the RKHS MMD [13], the ‘witness function’ is defined via the maximization
sup_{‖f‖_H ≤ 1} ( E_{x∼p} f(x) - E_{y∼q} f(y) ),
where H is the RKHS of the kernel h. By the Riesz representation, the maximizer equals μ_p - μ_q multiplied by a constant, where μ_p and μ_q are the mean embeddings (ME) of p and q in H, respectively. We will thus consider w := μ_p - μ_q as the witness function. By definition, μ_p(x) = E_{x'∼p} h(x', x), and similarly for μ_q; thus w(x) = E_{x'∼p} h(x', x) - E_{y∼q} h(y, x). We then have that for the statistic with kernel (2.3),
(2.7)  w(x) = ∫ K(x, r) ( E_{x'∼p} K(x', r) - E_{y∼q} K(y, r) ) dν(r),
and generally for the kernel (2.6),
(2.8)  w(x) = Σ_l γ_l φ_l(x) ( E_{x'∼p} φ_l(x') - E_{y∼q} φ_l(y) ).
The computation of empirical witness functions will be discussed in Section 4.2.
Although the witness function is not an estimator of the difference p - q, it gives a measurement of how p and q deviate locally. The witness function with an isotropic Gaussian kernel has been previously used for model criticism of generative models of image data [22]. This can be useful for user interpretation of the two-sample test, as shown in Section 5. The witness function augments the MMD statistic, which is a global quantity.
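Given the kernel matrix between evaluation points and the reference set, the empirical witness function can be evaluated as a discretization of (2.7), with the empirical histograms (1.2) in place of the population ones; the uniform weighting over reference points is an assumption:

```python
import numpy as np

def witness(K_eval, p_hat, q_hat):
    """Empirical witness function at m evaluation points:
    w(x) = (1/n_R) * sum_k K(x, r_k) * (p_hat(r_k) - q_hat(r_k)),
    where K_eval is the m x n_R kernel matrix between the evaluation
    points and the reference set, and p_hat, q_hat are the empirical
    histograms of (1.2)."""
    return K_eval @ (p_hat - q_hat) / K_eval.shape[1]
```

Plotting w over a grid of evaluation points indicates where the two samples deviate, as in Fig. 1(C5 and C6).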
2.4 Kernel parameters
If the reference distribution ν and the covariance field {Σ_r} are given, there is no tuning parameter to compute the MMD with the kernel (2.3) (cf. Algorithm 2). To compute the MMD with a general spectrally filtered kernel (2.6), which is truncated to be of finite rank L, the number L and the values γ_1, …, γ_L are tunable parameters (cf. Algorithm 3).
If the covariance field is provided only up to a global scaling constant, e.g. when estimated from local PCA, which means that one uses cΣ_r for some c > 0 in (1.1), then c is a parameter which needs to be determined in practice.
Generally speaking, one may view the reference distribution ν and the covariance field {Σ_r} as ‘parameters’ of the proposed MMD test. The optimization of these parameters surely has an impact on the power of the MMD statistics; e.g. the optimization of the reference set and a global bandwidth of the isotropic Gaussian kernel via a separate training set has been considered in [18]. A full analysis of this goes beyond the scope of the current paper. At the same time, there are important application scenarios where pre-defined R and {Σ_r} are available, e.g. the diffusion MRI imaging data (Section 5.3). We thus proceed in the simplified setting by assuming pre-defined ν and {Σ_r}, and focus on the effects of the anisotropic kernel and the re-weighted spectrum.
3. Analysis of testing power
We consider the population MMD statistic of the following form:
(3.1)  D(p, q) = Σ_l γ_l ( ∫ φ_l(x)(p(x) - q(x)) dx )²,
where the weights γ_l are as in (2.6), and in particular γ_l = λ_l recovers (2.5). The empirical version is
(3.2)  T_n = (1/n_X²) Σ_{i,i'} h_Γ(x_i, x_{i'}) + (1/n_Y²) Σ_{j,j'} h_Γ(y_j, y_{j'}) - (2/(n_X n_Y)) Σ_{i,j} h_Γ(x_i, y_j),
where x_i ∈ X, y_j ∈ Y and n = n_X + n_Y. We consider the limit where both n_X and n_Y go to infinity proportionally to each other; in other words, n → ∞ with n_X/n → ρ_X and n_Y/n → ρ_Y, where ρ_X, ρ_Y ∈ (0, 1) and ρ_X + ρ_Y = 1.
We will show that, under generic assumptions on the kernel and the departure q_n - p, the test based on T_n is asymptotically consistent, which means that the test power converges to 1 as n → ∞ (with controlled false-positive rate). The asymptotic consistency holds when q_n is allowed to depend on n, as long as the magnitude of the density departure decays to 0 more slowly than n^{-1/2}. We also provide a lower bound of the power based upon Chebyshev's inequality, which applies to the critical regime where the magnitude of q_n - p is proportional to n^{-1/2}. The analysis also provides a quantitative comparison of the testing power of different kernels.
3.1 Assumptions on the kernel
Because K is uniformly bounded and Hilbert–Schmidt, the kernel h in (2.3) is positive semi-definite and compact, and thus can be expanded as in (2.5), where {φ_l} is a set of orthonormal functions under μ. Since 0 < K(x, r) ≤ 1, we also have that h satisfies
(3.3)  0 < h(x, x) = Σ_l λ_l φ_l(x)² ≤ 1
and 0 ≤ h(x, y) ≤ 1 for any x, y. Meanwhile, h is continuous (by the continuity and uniform boundedness of K), which means that the series in (2.5) converges uniformly and absolutely, and the eigenfunctions φ_l are continuous (Mercer's theorem). Finally, (3.3) implies that the integral operator associated with the kernel is in the trace class, and specifically Σ_l λ_l = ∫ h(x, x) dμ(x) ≤ 1.
These properties imply that, when replacing λ_l by γ_l as in (2.6), one can preserve the continuity and boundedness of the kernel. We analyse kernels of the form (2.6) and assume the following properties.
Assumption 1
The γ_l in (2.6) satisfy γ_l ≥ 0 and Σ_l γ_l < ∞, and the kernel (2.6) is PSD, continuous and has diagonal h_Γ(x, x) ≤ 1 for all x.
As a result, Mercer's theorem applies and guarantees that the spectral expansion (2.6) converges uniformly and absolutely, and that the operator is in the trace class. Note that Assumption 1 holds for a large class of kernels, including all the important cases considered in this paper. The previous argument shows that the kernel (2.3) is covered. Another important case is the finite-rank kernel: that is, when γ_l = 0 for all l > L for some positive integer L, while γ_1, …, γ_L can be any positive numbers such that the conditions in Assumption 1 hold.
For the MMD test to distinguish a deviation q from p, it needs to have a positive population statistic (3.1) for such q. We consider the family of alternatives which satisfies the following condition.
Assumption 2
When q ≠ p, there exists l s.t. γ_l > 0 and ∫ φ_l (p - q) ≠ 0. In particular, if the γ_l are strictly positive for all l, then it suffices that the departure q - p does not vanish w.r.t. {φ_l}, i.e. ∫ φ_l (p - q) ≠ 0 for some l.
The following proposition, proved in the Appendix, shows that the population statistic (3.1) is positive for such deviated q.
Proposition 3.1
Notations as above, for a fixed q, the following are equivalent:
(i) The population statistic (3.1) is positive.
(ii) For some l, γ_l > 0 and ∫ φ_l (p - q) ≠ 0.
If γ_l > 0 for all l, then (i) is also equivalent to
(iii) The projection of p - q onto span{φ_l, l ≥ 1} is non-zero.
Note that the kernel (2.5) satisfies λ_l > 0 for all l in its expansion; thus, (iii) applies. The proposition says that the MMD distinguishes an alternative q when p - q lies in the subspace spanned by the φ_l's with positive weights, and for general γ_l, p - q needs to lie in the subspace spanned by {φ_l : γ_l > 0}. These bases are usually not complete; e.g. when the reference set has n_R points, the kernel is of rank at most n_R. However, when the measure ν is continuous and smooth, the singular value decomposition (2.4) has a sufficiently decaying spectrum, and the low-frequency φ_l's can be efficiently approximated with a drastic down-sampling of ν [3]. This means that, under certain conditions on ν (sufficient overlap with the data and regularity), after replacing the continuous ν by a discrete sampling in constructing the kernel, an alternative violates Assumption 2 only when the departure lies entirely in the high-frequency modes w.r.t. the original ν. In the applications considered in this paper, these very high-frequency departures are rarely of interest to detect, not to mention that estimating φ_l for large l lacks robustness with finite samples. Thus, Assumption 2 poses a mild constraint for all practical purposes considered in this paper, even with the spectral filtering kernel, which utilizes a possibly truncated sequence of γ_l.
We note that, by viewing the φ_l's as general Fourier modes, Assumption 2 serves as the counterpart of the classical condition of the kernel ‘having non-vanishing Fourier transforms on any interval’, which is needed for the MMD with a translation-invariant kernel to be a distance [1, Section 2.4].
3.2 The centered kernel
We introduce the centered kernel under μ, defined as
(3.4)  h̄(x, y) = h(x, y) - ∫ h(x, y') dμ(y') - ∫ h(x', y) dμ(x') + ∫∫ h(x', y') dμ(x') dμ(y'),
where μ = ρ_X p + ρ_Y q, with ρ_X and ρ_Y the limiting sample proportions. The spectral decomposition of h̄ is the key quantity used in the later analysis.
The following lemma shows that the MMD statistic, both the population and the empirical version, remains the same if the kernel is replaced by the centered kernel (3.4). The proof is by definition and the details are omitted.
Lemma 3.1
Notations as above,
(3.5)  ∫∫ h(x, y)(p - q)(x)(p - q)(y) dx dy = ∫∫ h̄(x, y)(p - q)(x)(p - q)(y) dx dy,
(3.6)  and the empirical statistic (3.2) is likewise unchanged when h is replaced by h̄.
In particular, under Assumption 2 the l.h.s. of (3.5) is positive, and then so is the r.h.s. of (3.5).
The kernel h̄ also inherits the following properties from the original kernel, proved in Appendix A.
Proposition 3.2
The kernel h̄, as an integral operator on the space L²(μ) with the measure μ, is PSD. Under Assumption 1,
(1) 0 ≤ h̄(x, x) ≤ 1 for any x, and |h̄(x, y)| ≤ 1 for any x, y,
(2) h̄ is continuous,
(3) h̄ is Hilbert–Schmidt as an operator on L²(μ) and is in the trace class. It has the spectral expansion
(3.7)  h̄(x, y) = Σ_k ξ_k u_k(x) u_k(y),
where ξ_1 ≥ ξ_2 ≥ ⋯ ≥ 0 and {u_k} are orthonormal w.r.t. μ. The u_k are continuous on the support of μ, the ξ_k are both summable and square summable, and the series in (3.7) converges uniformly and absolutely.
(4) The eigenfunctions u_k are square integrable, and thus integrable, w.r.t. both p and q. Furthermore, ∫ u_k dμ = 0.
To proceed, we define
(3.8)  a_k = ∫ u_k(x) (p(x) - q(x)) dx,  k = 1, 2, …;
then by (3.7), (3.5) and (3.6), shortening the notation Σ_{i=1}^{n_X} as Σ_i and Σ_{j=1}^{n_Y} as Σ_j, we have that
(3.9)  ∫∫ h̄(x, y)(p - q)(x)(p - q)(y) dx dy = Σ_k ξ_k a_k²,
(3.10)  T_n = Σ_k ξ_k ( (1/n_X) Σ_i u_k(x_i) - (1/n_Y) Σ_j u_k(y_j) )².
3.3 Limiting distribution of T_n
Consider the alternative q_n = p + ε_n δ for some fixed departure δ with ∫ δ = 0, where ε_n ≥ 0 and q_n remains a probability density for every n. We define the constants
(3.11)  d_k = ∫ u_k(x) δ(x) dx,  k = 1, 2, …,  and  D = Σ_k ξ_k d_k²,
where the d_k are finite by the integrability of u_k under p and q ((4) of Proposition 3.2), and D > 0 under Assumption 2.
The theorem below identifies the limiting distribution of T_n under various orders of ε_n, which may decrease to 0 as n → ∞. The techniques closely follow Chapter 6 of [27] (see also [13, Theorems 12 and 13]), where the key step is to replace the two independent sets of summations in (3.10) by normal random variables via the multivariate CLT (Lemma A.1), using a truncation argument based on the decay of the ξ_k. The proof is left to Appendix A.
Theorem 3.3
Let q_n = p + ε_n δ, where ε_n may depend on n, with notations ξ_k, u_k, d_k and D as above. Under Assumption 1, as n → ∞ with n_X/n → ρ_X, n_Y/n → ρ_Y,
(1) If √n ε_n → 0 (including the case when ε_n = 0, i.e. H0), then
n T_n → (ρ_X ρ_Y)^{-1} Σ_k ξ_k z_k² in distribution, with z_k i.i.d. N(0, 1).
Due to the summability of {ξ_k}, the random variable on the right-hand side is well defined.
(2) If ε_n = c n^{-1/2}, where c > 0, then
n T_n → Σ_k ξ_k ( (ρ_X ρ_Y)^{-1/2} z_k - c d_k )² in distribution,
where the z_k are i.i.d. N(0, 1).
(3) If √n ε_n → ∞, then
(√n / ε_n) (T_n - E T_n) → N(0, s²) in distribution,
where s² = 4 (ρ_X ρ_Y)^{-1} Σ_k ξ_k² d_k², with d_k and D as in (3.11).
Remark 3.1
As a direct result of Theorem 3.3 (1), under H0,
n T_n → (ρ_X ρ_Y)^{-1} Σ_k ξ_k z_k² in distribution.
When n_X = n_Y, the limiting density is that of 4 Σ_k ξ_k z_k², where the z_k are i.i.d. N(0, 1) random variables. Theorem 3.3 (3) shows the asymptotic normality of T_n under H1. One can verify that when √n ε_n → ∞, E T_n = ε_n² D (1 + o(1)), where D = Σ_k ξ_k d_k². These limiting densities under H0 and H1 recover the classical results for V-statistics [27, Theorems 6.4.1.B and 6.4.3].
The numerical experiments in Section 3.6 show that the theoretical limits identified in Theorem 3.3 (1) approximate the empirical distributions of T_n quite well when n equals a few hundred (Fig. 2). In the next two subsections, we address the asymptotic consistency of the MMD test and the finite-sample bound of the test power.
3.4 Asymptotic consistency of the test
A test based on the MMD statistic rejects whenever exceeds a certain threshold . For a target 'level' (typically ), a test (with samples) achieves level if , and the 'power' against an alternative is defined as . We define the test power at level as
(3.12) |
The threshold needs to be sufficiently large to guarantee that the false-positive rate is at most , and in practice it is a parameter to be determined (cf. Section 4.1). Below, we omit the dependence on in the notation and write as . The test is called asymptotically consistent if as . For the one-parameter family of , the following theorem shows that, if satisfies Assumption 2, the MMD test has non-trivial power when converges to a positive constant, and is asymptotically consistent if .
Theorem 3.4
Under Assumption 1, and suppose that the departure makes satisfy Assumption 2 for , where may depend on . As with ,
(1) If , , then , where is a monotonic function of .
(2) If , then .
This is qualitatively the same result as for RKHS MMD [13, Theorem 13]. The claim (1) directly follows from Theorem 3.3 (1), and the proof for (2) uses similar techniques. The proof is left to Appendix A.
3.5 Non-asymptotic bound of the testing power
In this section, we derive a non-asymptotic (lower) bound of the testing power for finite , which shows that the convergence is at least as fast as as increases, whenever exceeds a certain threshold value.
Theorem 3.5
Notations , , as above. Define , , and , and let be the target level. Under Assumptions 1 and 2, define . If and
(3.13) then
(3.14) where
In particular, assuming that is uniformly bounded between for some , the constants , , , are all uniformly bounded with respect to by constants which depend only on and .
Remark 3.2
The above bound applies when is proportional to , which is the critical regime. The lower bound of in (3.14) increases with and approaches 1 when , showing the same behaviour as Theorem 3.4. It can be used as another way to prove Theorem 3.4 (2). In particular, as increases (or increases, since stays bounded) and assuming uniformly bounded , the r.h.s. of (3.14) is dominated by
which leads to an bound of for fixed . The proof of the theorem uses only Chebyshev's inequality, so the bound may be improvable, e.g. by investigating the concentration of , which is a quadratic function of independent sums, cf. (3.10).
Remark 3.3
We note two other possible ways of deriving non-asymptotic bounds of the power: (1) A Berry–Esseen rate for the convergence to asymptotic normality has been proved for U-statistics in the case of and can be extended to V-statistics (for the case of ), cf. [27, Chapter 6]. This can lead to a control of , where the constant depends on the kernel and the departure magnitude . However, when is proportional to , the Berry–Esseen rate loses sharpness and cannot give a non-trivial bound. (2) Large deviation bounds of the empirical MMD have been obtained via the McDiarmid inequality and the boundedness of the Rademacher complexity of the unit ball in the RKHS, where is proved to converge to the population value at the rate with an exponential tail [13, Theorem 7]. This gives a stronger concentration than Chebyshev when for some constant , and leads to an exponential decay of with increasing , but it does not give a control when is proportional to , i.e. in the critical regime.
3.6 Comparison of kernels
From the above analysis (Theorems 3.3– 3.5), one can see that the mean and variance of the statistic under and are very indicative of the power of the MMD test. These quantities are
Since the distribution of the test statistic is approximately (properly rescaled) chi-square or normal, the larger the bias , and the smaller the standard deviation and , the more powerful the test is to detect the deviation. By Theorem 3.3 (1), at the critical regime where is proportional to , these quantities can be approximated by the following (recall that )
where i.i.d. Notice that the gap , which is the population .
In this subsection, we compare isotropic and anisotropic kernels by numerically computing the above quantities under a specific choice of data distribution. We consider three kernels: (1) the Gaussian kernel; (2) the kernel induced by the anisotropic kernel , where is set to be the Lebesgue measure and is constructed so that the tangent direction is the first principal direction with variance and the normal direction is the second principal direction with variance ; and (3) the spectral filtered kernel as in Equation (2.6), where are designed to be 1 for and to decay smoothly to zero at . All kernels are multiplied by a constant to make the largest eigenvalue , so as to be comparable. We also adopt the ratio (similar to the objective of kernel optimization considered in [14])
to illustrate the testing power: the larger the , the more powerful the test is likely to be.
For the distributions and , we use the two-dimensional example from the first section of the paper. Specifically, is the uniform distribution on the curve convolved with , , and is the distribution of the shifted curve convolved with , where . One realization of 200 points from and is shown in Fig. 4.1 Letting , we consider the one-parameter family of alternatives and simulate in the critical regime where the test power is less than 1. We assume that and denote by in this subsection.
In all experiments, the quantities , , and are computed by Monte Carlo simulation over 10000 runs. Their asymptotic versions (with bars) are computed from 50000 draws of the limiting distribution, truncating the summation over to the first 500 terms; the values of and are approximated by the empirical eigenvalues and means of eigenvectors over 10000 sample points drawn from and ( is computed by Nyström extension). The values of and are shown in Fig. 3, and those of etc. in Table 1.
Table 1.
Gaussian | 0.4771 | 0.5444 | 0.0677 | 0.0704 | 0.4874 |
0.4754 | 0.5439 | 0.0676 | 0.0736 | 0.4848 | |
0.0489 | 0.0958 | 0.0214 | 0.0306 | 0.9000 | |
0.0488 | 0.0939 | 0.0214 | 0.0312 | 0.8573 | |
0.0985 | 0.2046 | 0.0348 | 0.0587 | 1.1351 | |
0.0983 | 0.2013 | 0.0374 | 0.0620 | 1.0354 | |
Gaussian | 0.2381 | 0.2720 | 0.0334 | 0.0359 | 0.4885 |
0.2379 | 0.2722 | 0.0339 | 0.0368 | 0.4850 | |
0.0243 | 0.0477 | 0.0107 | 0.0153 | 0.8972 | |
0.0244 | 0.0471 | 0.0106 | 0.0157 | 0.8616 | |
0.0490 | 0.1036 | 0.0177 | 0.0305 | 1.1343 | |
0.0490 | 0.1003 | 0.0188 | 0.0310 | 1.0290 |
4. Practical considerations
Algorithm 1 gives the pseudocode for the two-sample test based on the proposed MMD statistics, where two external subroutines, akMMD and akWitness, are given in Algorithms 2 and 3 for the and statistics, respectively. In Algorithm 3, an extra input parameter , a positive vector of length (), is needed: is the target spectrum of the kernel in Equation (2.6). Both algorithms compute the quantile of the test statistic under by bootstrapping, which is used as the threshold in the test. It is also assumed that the reference set with is pre-defined.
In the rest of the section, we explain the bootstrapping approach, the empirical estimator of the witness function and the construction of the reference set in detail. We close the section by commenting on the computational complexity.
4.1 Permutation test and choice of the threshold tα
The MMD statistics introduced in Section 3 are non-parametric, so the threshold does not have a closed-form expression for finite . In practice, one may use a classical method known as the permutation test [17], a bootstrapping strategy to empirically estimate that was previously used for RKHS MMD [13]. The idea is to model the null hypothesis by pooling the two datasets, resampling from that pool and computing one instance of the MMD under that null hypothesis. Repeated resampling thus corresponds to permuting the joint dataset multiple times, generating a sequence of values of , whose -quantile is used as (see Algorithms 2 and 3). Strictly speaking, the power of the test is slightly degraded due to the empirical estimate of the null hypothesis. However, as we see in Section 5, the test still yields very strong results in a number of examples.
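The permutation procedure described above can be sketched as follows. The function name and defaults are illustrative, and `stat` stands in for either akMMD statistic of Algorithms 2 and 3:

```python
import numpy as np

def permutation_threshold(X, Y, stat, n_perm=200, alpha=0.05, seed=0):
    """Estimate the level-alpha threshold t_alpha by permuting pooled samples.

    stat(X, Y) is any two-sample statistic (e.g. an akMMD). Under H0 the pooled
    labels are exchangeable, so each random split of the pool gives one draw
    from the statistic's null distribution.
    """
    rng = np.random.default_rng(seed)
    pool = np.concatenate([X, Y], axis=0)
    n1 = len(X)
    null_stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pool))          # one random relabelling
        null_stats.append(stat(pool[idx[:n1]], pool[idx[n1:]]))
    # the (1 - alpha)-quantile of the permuted statistics serves as t_alpha
    return float(np.quantile(null_stats, 1.0 - alpha))
```

The test then rejects whenever the statistic on the original (unpermuted) split exceeds the returned threshold.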
Meanwhile, the limiting distribution of the statistic appears to approximate the empirical one under well, as shown in Section 3.6. This suggests the possibility of determining from the empirical spectral decomposition of the kernel. In the current work we focus on the bootstrapping approach (permutation test) for simplicity.
4.2 Empirical witness functions
The population witness functions have been introduced in Section 2.3 and, by definition, can be evaluated at any data point . For the kernel , the empirical version of Equation (2.7) is
(4.1) |
which can be computed straightforwardly from the empirical histograms , and the pre-defined kernel ; see Algorithm 2.
For the kernel , Equation (2.8) is approximated by
(4.2) |
where , are computed by SVD of the asymmetric kernel matrix, and the out-of-sample extension to by the Nyström method; see Algorithm 3.
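The SVD-based construction of (4.2) can be sketched roughly as follows. This is a hypothetical simplification of Algorithm 3, not the released code: the function name `witness_spec`, its argument names and the Nyström-style extension via the right singular vectors are our assumptions.

```python
import numpy as np

def witness_spec(A_x, A_y, spectrum):
    """Sketch of a spectral-filtered witness function.

    A_x : (n1, M) asymmetric kernel matrix between sample X and the reference set
    A_y : (n2, M) asymmetric kernel matrix between sample Y and the reference set
    spectrum : (r,) target singular values of the filtered kernel, cf. eq. (2.6)

    The SVD of the pooled matrix provides empirical singular pairs; the witness
    weights the difference of the sample means of the left singular vectors by
    the target spectrum, then extends back to the reference points.
    """
    A = np.concatenate([A_x, A_y], axis=0)          # pooled (n1 + n2, M)
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    r = len(spectrum)
    phi_x = U[:len(A_x), :r]                        # modes evaluated on X
    phi_y = U[len(A_x):, :r]                        # modes evaluated on Y
    coeff = spectrum * (phi_x.mean(axis=0) - phi_y.mean(axis=0))
    # Nystrom-style extension of the witness to the M reference points
    return Vt[:r].T @ coeff                         # (M,)
```

When the two samples coincide the witness vanishes, as it should.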
4.3 Sampling of the reference set
In bio-informatics applications, e.g. flow cytometry, datasets are usually large in volume, so one can construct the reference set and covariance field from a pool of data points, combining multiple samples, before the differential analysis in which MMD is involved. In some application scenarios (such as diffusion MRI), and are provided by external settings. Thus, we treat the procedure of constructing and separately from the two-sample analysis, which is also assumed in the theory.
In the experiments in this work, we sample the reference points randomly in Lebesgue measure using the following heuristic: given a pool of data points, e.g. drawn from and or subsampled from larger datasets, the procedure loops over batches until points have been generated. In each loop, a batch of points is loaded from the pool, and candidate points are sampled from the batch according to , the KDE at each point in the batch. The candidate points are jittered, and points which have too few neighbours in the pool dataset are excluded. The code can be found in the software GitHub repository https://github.com/AClon42/two-sample-anisotropic.
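A minimal sketch of this sampling heuristic is given below. All names, defaults and the inverse-KDE weighting (to approximate sampling in Lebesgue measure rather than in the data density) are illustrative assumptions; the released code in the repository above is the authoritative version.

```python
import numpy as np

def sample_reference_set(pool, M, bandwidth=1.0, jitter=0.1,
                         min_neighbors=3, seed=0):
    """Heuristic reference-point sampler (illustrative sketch of Sec. 4.3).

    Loops over batches from the pool; candidates are drawn with probability
    inversely proportional to a Gaussian KDE (favouring coverage over density),
    jittered, and rejected if they have too few pool neighbours.
    """
    rng = np.random.default_rng(seed)
    refs = []
    while len(refs) < M:
        batch = pool[rng.choice(len(pool), size=min(256, len(pool)),
                                replace=False)]
        # KDE value of each batch point against the whole pool
        d2 = ((batch[:, None, :] - pool[None, :, :]) ** 2).sum(-1)
        kde = np.exp(-d2 / (2 * bandwidth ** 2)).mean(axis=1)
        prob = 1.0 / kde
        prob /= prob.sum()
        cand = batch[rng.choice(len(batch), size=M, p=prob)]
        cand = cand + jitter * rng.normal(size=cand.shape)   # jitter candidates
        for c in cand:
            n_nb = (((pool - c) ** 2).sum(-1) < bandwidth ** 2).sum()
            if n_nb >= min_neighbors:                        # reject isolated points
                refs.append(c)
            if len(refs) == M:
                break
    return np.asarray(refs)
```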
The covariance field is computed by local PCA, when needed. This is related to the issue of '-selection' in Gaussian MMD [13], where a 'median heuristic' was proposed to determine the in the Gaussian kernel . In our setting, the extension of '' is the local covariance matrix , which allows a different at different points, as well as a different along different local directions. In estimating , a controlling parameter is the size of the local neighbourhood, i.e. the -nearest neighbours from which the local PCA is computed. In the manifold setting, there are strategies to choose so as to most efficiently estimate the local covariance matrix, see, e.g. [21], and it is best done using a different at different points. For simplicity, in all the experiments we set to be a fraction of the total number of samples, which may be sub-optimal. We also introduce a parameter and set , where is computed via local PCA, so as to make the 'size' of tunable. The parameter is sampled on a binary grid ( for ).
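A sketch of the local-PCA covariance estimation, under the simple fixed-neighbourhood choice described above (the function name and defaults are illustrative assumptions):

```python
import numpy as np

def local_covariances(pool, refs, k=50, delta=1.0):
    """Estimate the covariance field by local PCA: at each reference point,
    take its k nearest neighbours in the pool and scale their sample
    covariance by the tunable parameter delta (sampled on a binary grid in
    the experiments). Using a fixed k for all points is the simple, possibly
    sub-optimal choice described in Sec. 4.3."""
    C = []
    for r in refs:
        d2 = ((pool - r) ** 2).sum(axis=1)     # squared distances to r
        nb = pool[np.argsort(d2)[:k]]          # k-nearest neighbours
        C.append(delta * np.cov(nb, rowvar=False))
    return np.asarray(C)                       # (M, d, d) covariance field
```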
4.4 Computational complexity
We only discuss the cost of computing the MMD statistics, namely that of MMD-L2 (Algorithm 2) and MMD-Spec (Algorithm 3). The extra cost of computing the witness function is negligible, as can be seen from the code (some computation, e.g. that of and in Witness-L2, and the SVD of in Witness-spec, is repeated for illustrative purposes).
We first discuss MMD-L2: the cost of computing one empirical MMD statistic is , where is the size of the two samples. The main cost is the one-time construction of the asymmetric kernel matrix , which also dominates the memory requirement. While the choice of the reference set, and hence the number of reference points , depends on the problem being addressed, any amount of structure in the samples leads to a choice of that is . This yields major computational and memory benefits, as the test complexity is much smaller than the cost of computing the U-statistic via a symmetric Gaussian kernel without extra techniques. In all the applications and synthetic examples we have worked with, is in the tens or hundreds and can remain essentially constant as grows while still yielding a test with large power. This means the test with an anisotropic kernel becomes linear in the number of points being tested.
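A minimal sketch of the MMD-L2 computation with an asymmetric anisotropic kernel. The function and variable names are illustrative, and the Mahalanobis-Gaussian form of the kernel is one natural choice consistent with Section 2, not necessarily the paper's exact normalization; the cost visible in the single pass over the M reference points is the O(nM) discussed above.

```python
import numpy as np

def ak_mmd_l2(X, Y, R, C):
    """L2-type MMD with an asymmetric anisotropic kernel (illustrative sketch).

    X : (n1, d) sample from p;  Y : (n2, d) sample from q
    R : (M, d) reference points; C : (M, d, d) local covariances at each r_k
    Returns the squared L2 distance between the two empirical histograms
    h_p(k) = mean_i K(x_i, r_k) and h_q(k) = mean_j K(y_j, r_k).
    """
    def kernel_to_refs(Z):
        A = np.empty((Z.shape[0], R.shape[0]))
        for k in range(R.shape[0]):
            diff = Z - R[k]                    # (n, d) displacements to r_k
            Cinv = np.linalg.inv(C[k])
            # Mahalanobis-Gaussian affinity exp(-diff^T C_k^{-1} diff / 2)
            A[:, k] = np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, Cinv, diff))
        return A
    hx = kernel_to_refs(X).mean(axis=0)        # empirical histogram under p
    hy = kernel_to_refs(Y).mean(axis=0)        # empirical histogram under q
    return float(np.sum((hx - hy) ** 2))
```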
In MMD-spec, the extra computation is the rank- SVD of the -by- matrix . The computation can be done in time, , via the classical pivoted QR decomposition, and may be accelerated by randomized algorithms, e.g. [31], which takes time to achieve an accuracy proportional to the magnitude of the th singular value. The computational cost may be further reduced by exploiting possible sparse/low-rank structure of the kernel matrix . When the kernel almost vanishes outside a local neighbourhood of which contains at most points, the rows of are -sparse; fast nearest neighbour search methods (by Kd-tree or randomized algorithms) can then be applied under the local Mahalanobis distance, and the storage is reduced to . When has small numerical rank, e.g. only singular values are significantly non-zero, can be stored in its rank- factorized form, which can be computed in time linear in . This saves computation in bootstrapping, as both and can be computed from and (row-permuted) only.
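The randomized acceleration mentioned above can be sketched with a standard range-finder, in the spirit of [31] (this is a generic textbook version, not the cited algorithm verbatim):

```python
import numpy as np

def randomized_svd(A, r, oversample=10, seed=0):
    """Randomized rank-r SVD of an n-by-M matrix A in O(nMr) time.

    A Gaussian sketch captures the dominant range of A; projecting onto an
    orthonormal basis of that range reduces the SVD to a small matrix."""
    rng = np.random.default_rng(seed)
    Omega = rng.normal(size=(A.shape[1], r + oversample))   # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ Omega)                          # orthonormal range basis
    Ub, S, Vt = np.linalg.svd(Q.T @ A, full_matrices=False) # SVD of small matrix
    return (Q @ Ub)[:, :r], S[:r], Vt[:r]
```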
5. Applications
5.1 Synthetic examples
5.1.1 Example 1: Curve in two dimensions
The density is the uniform distribution on a quarter circle with radius 1 convolved with , on a quarter circle with radius convolved with , which is the same example as in Section 3.6 (top left in Fig. 2), . The departure is parametrized by , taking values from 0 to 0.02. The rejection rate is computed as the average over runs according to Algorithm 1, shown on the left of Fig. 4, where the error bars indicate the standard deviation. For the Gaussian kernel, the curve corresponds to the with the best performance over a grid, attained at an intermediate value on the grid. The witness functions of the three kernels when are computed according to Section 4.2 and shown in Fig. 4.
5.1.2 Example 2: Gaussian mixture in three dimensions
Two Gaussian mixtures in , , . The density consists of three equal-size components centered at , and , respectively, with diagonal covariance matrices , and , ; the density differs from by shifting the three means to , and , while keeping the covariance matrices the same. The constant takes values from 0 to 0.02. Results are shown in Fig. 5, analogous to Fig. 4.
5.2 Flow cytometry data analysis
Flow cytometry is a laser-based technology used for cell counting. It is routinely used in diagnosing blood diseases by measuring various physical and chemical characteristics of the cells in a sample. This leads to each sample being represented by a multi-dimensional point cloud of tens of thousands of cells that were measured.
Two natural questions arise in comparing various flow cytometry samples. The first is a supervised learning question: can a statistical test detect the systemic differences between a set of healthy people and a set of unhealthy people? The second question is an unsupervised learning question: can the distance measure be used to determine the pairwise distance between any two samples, and will that distance matrix yield meaningful clusters?
We address both of these questions for two different flow cytometry datasets. We compare isotropic Gaussian MMD to anisotropic MMD and demonstrate that, in all cases, our anisotropic MMD test has more power and yields more accurate unsupervised clusters. We only show the pairwise distance embedding for the reference-set asymmetric kernel, due to the untenable cost of computing each of the pairs.
5.2.1 Acute myeloid leukemia dataset
Acute myeloid leukemia (AML) is a cancer of the blood that is characterized by a rapid growth of abnormal white blood cells. While relatively rare, AML is extremely deadly, with a 5-year survival rate of [8]. AML is diagnosed through a flow cytometry analysis of the bone marrow cells, and while there are features that distinguish AML in certain cases, other cases can be more difficult to detect. State-of-the-art supervised methods include SPADE, Citrus, etc.; see [5] and the references therein.
The data are from the public competition Flow Cytometry: Critical Assessment of Population Identification Methods II (FlowCAP-II). This dataset consists of 316 healthy patients and 43 patients with AML, where each person has approximately cells measured across seven physical and chemical features.
We use an anisotropic kernel with only reference points. We create an unsupervised clustering by computing the pairwise distances between person and person . In Fig. 6, we display the network of these people, constructed by weighting each edge by the exponential of the negative, properly scaled distance. When supervising the process and running a two-sample test between the pool of healthy cells and the pool of unhealthy cells, we see that the anisotropic kernel yields significantly better separation and lower variance than the isotropic Gaussian kernel.
We also examine the witness function in Fig. 6 that is generated by the anisotropic kernel, as introduced in Section 2.3. This yields a tool for visualizing the separation between the two samples in the original data space. This gives a way to communicate the decision boundary to the medical community, which uses visualization of the two-dimensional slices as the diagnostic tool for determining whether the patient has AML.
5.2.2 Myelodysplastic syndromes dataset
Myelodysplastic syndromes (MDS) are a group of cancers of the blood. There is more variability in how MDS presents itself in the cells and in the flow cytometry measurements. The data we work with come from anonymized patients treated at Yale New Haven Hospital. After choosing to examine the surface markers CD16, CD13 and CD11B, along with several physical characteristics of the cells, we are left with 72 patients initially diagnosed with some form of MDS and 87 patients who were not. These patients are represented by about cells in eight dimensions.
While MDS is more difficult to detect than AML, the unsupervised pairwise graph, created the same way as in Fig. 6, still yields a fairly strong unsupervised clustering, as we see in Fig. 7. When supervising the process and running a two-sample test between the pool of healthy cells and the pool of unhealthy cells, we see that the anisotropic kernel yields strongly significant separation between the two classes, unlike the isotropic Gaussian kernel.
5.3 Diffusion MRI imaging analysis
Diffusion-weighted MRI is an imaging modality that creates contrast images via the diffusion of water molecules. Various regions of the brain diffuse in different ways, and the patterns can reveal details about the tissue architecture. At a low level, each pixel in a three-dimensional brain image generates a three-dimensional diffusion tensor (i.e. covariance matrix) that describes the local flow of water molecules.
An important question in diffusion MRI analysis is to identify regions of the brain that systemically differ between groups of healthy and sick individuals. We attack this problem by comparing the distributions of the diffusion tensors in various regions of the brain, thus framing it as a multiple sample problem. Every brain is co-registered so that the pixels overlap.
Real images are around , so the amount of memory needed to build a square symmetric kernel, even a sparse one, is completely prohibitive. For this reason, we instead consider a set of reference pixels that subsamples the image and construct an anisotropic kernel , where is the location of pixel and is the diffusion tensor at pixel . This kernel must enforce both locality in the pixel space and the local behaviour of the diffusion tensors. The latter requires measuring covariance matrices over the space of positive semi-definite matrices.
Fortunately, covariance structure on the space of positive semi-definite matrices is well studied [2]. The main takeaway is that there is an isomorphism between a diffusion tensor and its vectorized representation
Thus, we can define a covariance matrix on and define the anisotropic kernel
(5.1) |
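A sketch of the vectorization and a kernel of the (5.1) type between pixel-tensor pairs. The sqrt(2) scaling of the off-diagonal entries is one standard norm-preserving convention for the isomorphism (the paper's exact convention may differ), and the product-of-Gaussians form with a pixel-space bandwidth `sigma_pix` is our illustrative reading of (5.1); all names are assumptions.

```python
import numpy as np

def vectorize_tensor(D):
    """Isometric vectorization of a symmetric 3x3 diffusion tensor into R^6:
    off-diagonals are scaled by sqrt(2) so the Euclidean norm of the vector
    equals the Frobenius norm of the tensor (one standard convention)."""
    s = np.sqrt(2.0)
    return np.array([D[0, 0], D[1, 1], D[2, 2],
                     s * D[0, 1], s * D[0, 2], s * D[1, 2]])

def ak_pixel_tensor(x, Dx, r, Dr, Cr, sigma_pix):
    """Illustrative anisotropic affinity between pixel (x, Dx) and reference
    pixel (r, Dr): a Gaussian in pixel space times a Mahalanobis Gaussian on
    the vectorized tensors, with Cr a 6x6 local covariance at the reference."""
    dpix = ((x - r) ** 2).sum() / (2 * sigma_pix ** 2)   # pixel-space locality
    dv = vectorize_tensor(Dx) - vectorize_tensor(Dr)     # tensor difference in R^6
    dten = 0.5 * dv @ np.linalg.inv(Cr) @ dv             # local Mahalanobis term
    return float(np.exp(-dpix - dten))
```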
We examine MMD using a synthetic dataset generated from the common brain phantom image. We treat this as a two-dimensional slice of the three-dimensional image, and simply make all tensor variation in the z-direction constant and uncorrelated with the xy-directions of this slice. We also assume that the brains have been co-registered. The process of co-registration is an independent preprocessing issue which can be incorporated into our proposed methodology when working with real-world datasets, but is outside the scope of this paper.
In Fig. 8, we show both a 'healthy' brain and an 'unhealthy' brain in which a small region has been removed. The intensity of the image in this case corresponds to the eccentricity of the diffusion tensor at that point; the magnitude of the tensor decreases from left to right in the same way for both images, and the angle shifts uniformly from left to right. The eccentricity, magnitude and angle all have i.i.d. Gaussian noise added to them with .
We down-sample the brain by a factor of 5 for the reference points and consider the MMDs for realizations of a healthy brain (null hypothesis ), and for realizations of a healthy brain and realizations of an unhealthy brain (alternative hypothesis ). Figure 8 shows the supervised witness function of regions of difference between the two groups, as well as a permutation test in which we permute group labels while maintaining the individual brain structure. We also show the leading eigenvectors of the pairwise network generated by measuring the MMD between any two brains.
By using reference points, each brain is represented by a kernel that is , and the data adaptive MMD computation can be run on 10 brains in about min on a standard laptop.
We also compare the anisotropic kernel to an isotropic kernel of constant bandwidth. We cannot compare to kernel MMD with a square symmetric kernel due to computational limits; instead, we compare to the modified asymmetric kernel as in (5.1), but with the covariance matrix replaced by a constant bandwidth, chosen to be .
6. Discussion and remarks
The paper studies kernel-based MMD statistics with kernels of the form , and more generally, where is a 'filtering' kernel over . The statistics can be computed with a pre-defined reference set in time , where is the number of samples and is the cardinality of the reference set. The power of the test against alternative distributions is analysed using the spectral decomposition of the kernel with respect to the data distribution, and the consistency is proved under generic assumptions. The difference in the testing power of the kernels is determined by their spectral properties. We apply the proposed methodology to flow cytometry and diffusion MRI data analysis, where the goal of the analysis is formulated as comparing the distributions of multiple samples.
We close the section by a few remarks about the proposed approach as well as possible extensions:
The spectral coordinates. The kernel-MMD distance being studied can be viewed as a certain distance, weighted or unweighted, in the space of the spectral embedding. This is reflected in the construction of the kernel , as well as in the power analysis. Theoretically, the space of the spectral embedding is infinite-dimensional; in practice, however, only finitely many coordinates may contribute to the RKHS MMD statistic, in the sense of providing statistically significant departures. Thus, the leading coordinates of the spectral embedding (depending on the kernel) give a mapping from to , where may be proportional to or larger than , and one may consider the two-sample test in the new coordinates. The general alternative then becomes a mean-shift alternative in the new coordinates. This suggests other possible tests for mean-shift alternatives than the weighted distance test being studied.
Weighting of bins. While spectral filtering introduces weighting in the generalized Fourier domain, another important variation is to introduce weighting in the 'real domain', namely weighting each bin centered at by a weight . Such weights may be computed from data, e.g. by a certain local -value (see below), where the dependence among the 's needs to be handled. Another interesting question is how to introduce multi-resolution systems in the context of the current paper: mapping points into spectral coordinates has certain advantages, as analysed in the paper; however, the spectral basis is global and may not be sensitive enough to local departures. On the other hand, histograms on local bins may have large variance, and one needs to jointly analyse multiple bins. The shortcomings of both approaches may be overcome by considering a multi-scale basis on the reference-set graph.
Beyond distance. RKHS MMD considers the distance by construction, while other metrics have been studied in the literature, particularly the Wasserstein metric which has the interpretation of optimal flow given the underlying geometry. Such ‘geometric’ distances are certainly useful in various application scenarios. Using the reference set, a modification of will be
where , are the population histograms, and is a certain metric on the reference set. It may also be possible to construct a metric which is equivalent to the Wasserstein metric by measuring the difference at reference points across multiple scales of covariance matrices, as is done with Haar wavelets [28] and with diffusion kernels [20]. Efficient estimation schemes for the Wasserstein metric need to be developed, as well as the consistency analysis with samples.
Local -value. The current approach gives a global test and computes the -value for the hypothesis of the distribution globally. In certain applications, especially differential analysis of flow cytometry data and other single-cell data, a more important problem is to find the local region where the two samples differ, corresponding to different biological conditions, or in other words, to derive a 'local' -value of the test. While the witness function introduced in our current approach can provide an indication of where , a more systematic study of testing the hypothesis locally and controlling the false discovery rate across bins is needed.
Funding
This work was supported by the joint funding NSF DMS-1818945. A.C. is also partially supported by the Russell Sage Foundation; X.C. is also partially supported by NSF DMS-1820827, NIH and the Alfred P. Sloan Foundation.
Footnotes
Strictly speaking, and are no longer compactly supported after convolving with the two-dimensional Gaussian distribution; however, the normal density decays exponentially and is small, so that there is no practical difference. The analysis can be extended by a truncation argument.
Contributor Information
Xiuyuan Cheng, Email: xiuyuan.cheng@duke.edu.
Alexander Cloninger, Email: acloninger@ucsd.edu.
Ronald R Coifman, Email: ronald.coifman@yale.edu.
A. Proofs in Section 3
A.1 Proofs of Propositions 3.1 and 3.2
Proof of Proposition 3.1.
and thus (i) (ii).
Recall that has the singular value decomposition (2.4), and thus
with all strictly positive. This means that (iii) is equivalent to for some . When is all strictly positive, it is equivalent to (ii).
Proof of Proposition 3.2.
We first verify the positive semi-definiteness of : for any so that , by definition,
where . The quantity is non-negative since is PSD.
To prove (1): under Assumption 1, and for any . This implies the boundedness of and in (3.4) and leads to (1).
(2) Follows from the continuity of .
The square integrability of then follows from boundedness of , which makes the operator Hilbert–Schmidt with
(A.1) by (1). Meanwhile, by (1) again,
which proves that the operator is in trace class. Mercer’s theorem applies to give the spectral expansion and the relevant properties, and is by the centering so that for all . This proves (3).
Finally, by the uniform convergence of Equation (3.7), and (1),
which proves (4).
A.2 Proof of Theorems 3.3 and 3.4
Lemma A.1 (Replacement lemma). —
Let be a sequence of positive numbers such that . Let be an array of random variables, , , s.t. for each ,
where . Furthermore, as ,
(i) elementwise, and for any finite , where .
(ii) There exists s.t.and the convergence is uniform in .
Meanwhile, let , , be three double arrays satisfying that
(iii) , ; as ,
(iv) There exist s.t.for all , and the convergence is uniform in .
Then, as , the random variable
(A.2) converges in distribution to defined as
(A.3) and has finite mean and variance.
Proof.
First, we verify that is well defined and has finite variance; notice that (ii) implies that , (iii) implies that , , and (iv) implies that and . For finite , we define the truncated as taking the summation from to in Equation (A.3). Then
and thus
(A.4) due to the summability of , and (the second term is bounded by Cauchy–Schwarz). Thus, with probability one, and . Further,
since is finite. This verifies that has finite mean and variance. In fact, by the martingale convergence theorem one can show that a.s.
Secondly, using a similar argument, defining to be the truncated in Equation (A.2), one can show that
(A.5) since , and all converge to zero uniformly in , as assumed in conditions (ii) and (iv).
Now we prove . By Lévy's continuity theorem, it suffices to show the pointwise convergence of the characteristic function, namely
Using the truncation of up to , we have that
(A.6) By Equations (A.4) and (A.5), for any and any , we can choose sufficiently large s.t. the first and the third terms are both less than for any . To show that the second term can be made small, we introduce
and, by condition (i), . Meanwhile,
Since is finite, and is uniformly bounded as increases for each , we have that as , by the convergence of , and . Thus, the second term can be bounded by
(A.7) where (I) can be made smaller than for large ( is fixed and ), and (II) can be made smaller than as a result of which implies convergence of characteristic function. Putting together, the l.h.s. of Equation (A.6) can be made smaller than for large , which proves the claim.
Proof of Theorem 3.3.
We introduce
where, by definition,
and
Using the above notations, we rewrite Equation (3.10) as
(A.8) where and . The random variables are independent of , and both are asymptotically normal for finitely many s. We will use the replacement Lemma A.1 to substitute and by their normal counterparts, and discuss scenarios (1)–(3), respectively.
To apply Lemma A.1, we set , and the summability follows (3) of Proposition 3.2; we set
(A.9) and then (A.8) becomes
(A.10) We have that
(A.11)
(A.12) where, recalling that , and ,
(A.13) Since and , both and are uniformly bounded, so are bounded for each . We now verify conditions (i) and (ii) in Lemma A.1.
Condition (i): In cases (1) and (2), ; thus and then . In case (3), ; thus and then . As for the limiting distribution of for any finite , we know that , and by the Lindeberg–Lévy CLT ([27, Theorem 1.9.1 B], extended to the case where the covariance matrix converges to a non-degenerate limit by Slutsky's theorem). By the definition of and the independence of and , , where () is replaced by () by Slutsky's theorem. The argument applies to all three cases.
Condition (ii): for some absolute positive constant and . Meanwhile, by Equation (A.13), ; thus,
thanks to the facts that and that ((4) of Proposition 3.2), and the convergence is uniform in .
We now consider the three scenarios respectively:
(1) Let ; by Equation (A.10) we have
thus, (iii) holds. Condition (iv) can be verified by noting that (upper bounded by ). As analysed above, , and conditions (i) and (ii) hold; thus, Lemma A.1 applies to give that
as claimed in the theorem.
(2) Let be the l.h.s. of the statement, then
and conditions (iii) and (iv) hold. As in (1), , and conditions (i) and (ii) hold; thus, Lemma A.1 gives that
By the summability of , has the same distribution as as defined in the theorem.
(3) Similar to (2), let be the l.h.s. of the statement; then , , are the same as in (2) where , so they have the same limit, and (iii) and (iv) hold. As analysed above, , and (i) and (ii) hold. Thus, Lemma A.1 gives that where has covariance . By the summability of , has the same distribution as , where , which equals the formula claimed in the theorem.
Proof of Theorem 3.4.
We only prove (2), as (1) directly follows from Theorem 3.3 (1) and the form of the limiting density of in this case.
We first consider the case, i.e. : since satisfies Assumption 2, and by the equivalent forms of as in (3.5) and (3.9), we have that
With notation as in the proof of Theorem 3.3, by (A.10),
We set to be the l.h.s. and apply Lemma A.1: and are summable as before. Conditions (i) and (ii) are satisfied by (defined in (A.9)), as has been verified in the proof of Theorem 3.3. Since
and , condition (iii) holds with the limits as
Condition (iv) is also satisfied due to the summability of , same as in the proof of Theorem 3.3. Thus, Lemma A.1 gives that
(A.14) which is a single-point distribution at the positive constant .
We then consider the case, i.e. : by Theorem 3.3 (1), which is a continuous non-negative random variable with finite mean and variance. Thus, for any chosen level , there exists s.t. , and then when is large enough,
This means that is a valid threshold in (3.12), and as a result, (shortening ‘’ as )
Putting together with (A.14), we then have that for sufficiently large ,
which converges to 0 since is an absolute constant and thus , and meanwhile . This proves that .
A.3 Proof of Theorem 3.5
Proof of Theorem 3.5.
We use Chebyshev's inequality to control the deviation of the random variable from its mean, under and , respectively.
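The generic form of the Chebyshev step used repeatedly below, stated with placeholder symbols: for a statistic $T$ with mean $\mu$ and variance $\sigma^2$,

```latex
\Pr\bigl(\,|T - \mu| \ge t\,\bigr) \;\le\; \frac{\sigma^2}{t^2},
\qquad t > 0;
% setting the right-hand side equal to a target level \alpha and
% solving for t yields the deviation threshold
t \;=\; \frac{\sigma}{\sqrt{\alpha}} .
```

This is the pattern behind "setting the r.h.s. to be … gives …" in the derivations of (A.17) and the bound under the alternative.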
Under , , by (A.10),
By (A.12), ,
(A.15) We will prove later that
(A.16) and then by Chebyshev's inequality we have that for any ,
Setting the r.h.s. to be gives , and thus
(A.17) Under , by (A.10) and the definition that ,
(A.18)
(A.19) Since (cf. (A.11)),
we have that
(A.20) We will prove later that
(A.21) and then, using Chebyshev's inequality again, for any ,
Combined with (A.17) which shows that is a valid threshold to achieve level , by setting (which is strictly positive under (3.13)), this gives the bound (3.14).
It remains to prove (A.16) and (A.21) to finish the proof.
Proof of (A.16):
We will prove that
(A.22) where the first term since (cf. (A.1)), and the second term is less than 0.1 under the condition that . This gives (A.16).
Recall that , and . By the definition of (A.9),
(A.23) Note that
and then
(A.24) Recall that for any ; thus, does not vanish only when the indices are all equal or fall into two pairs. Then
(A.25) Recall that
(A.26)
(A.27) and (cf. Proposition 3.2 (1)), the above line continues to give that
(A.28) Similarly, the fourth term can be bounded by
(A.29) Back to (A.23), we have that
which is (A.22).
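The index-counting step behind (A.25) is the standard fourth-moment computation for independent mean-zero variables; in generic notation (placeholders, not the paper's symbols): for independent $a_1, \dots, a_n$ with $\mathbb{E}\,a_i = 0$,

```latex
\mathbb{E}\bigl[a_i a_j a_k a_l\bigr] \;=\;
\begin{cases}
\mathbb{E}\,a_i^4, & i = j = k = l,\\[2pt]
\mathbb{E}\,a_i^2 \;\mathbb{E}\,a_k^2, & \text{indices fall into two distinct pairs},\\[2pt]
0, & \text{otherwise (some index appears exactly once)},
\end{cases}
```

since any factor appearing exactly once separates from the product by independence and has mean zero.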
Proof of (A.21): Recall that , which is constant; thus, . By (A.19),
We will prove that
(A.30)
(A.31)
(A.32) which, putting together, gives that
which proves (A.21).
Proof of (A.30): Recall that (cf. (A.12)), and ,
We define
(A.33) and then
thus
(A.34) Meanwhile, by Proposition 3.2 (1), uniformly, which gives that for any . Thus,
As a result,
which proves (A.30).
Proof of (A.31): By (A.9), the independence of from , and that , for all ,
We then have that
where is defined as
(A.35)
(A.36)
(A.37) We compute and , respectively: note that ; thus,
due to the fact that vanishes unless the three indices coincide. Since and , we have that
(A.38) As for , since for any , a similar argument gives that
To proceed, by Cauchy–Schwarz,
where, same as in (A.34),
and, by the uniform bound that , we also have that
(A.39) Together, they give that
This means that
(A.40) Back to (A.35), we have that
namely (A.31).
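The Cauchy–Schwarz step invoked above is the standard moment inequality, recorded here generically for random variables $X$, $Y$ with finite second moments:

```latex
\bigl|\mathbb{E}[XY]\bigr| \;\le\; \mathbb{E}\,|XY|
\;\le\; \sqrt{\mathbb{E}\,X^2}\;\sqrt{\mathbb{E}\,Y^2} ,
```

which is what allows the cross term to be split into the two factors bounded separately via (A.34) and the uniform bound leading to (A.39).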
Proof of (A.32): Similar to (A.23),
We have that
(A.41) and by (A.28),
(A.42) Using a similar argument to derive (A.25), one can verify that
Meanwhile,
and (cf. (A.39)). This gives that
(A.43) We also have that
together with (A.41)–(A.43), this gives that
Note that , and ; thus,
this gives that
The first term equals , and the second term is less than 0.1 under the condition of the theorem. This proves (A.32).
References
- 1. Anderson N. H., Hall P. & Titterington D. M. (1994) Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. J. Multivariate Anal., 50, 41–54.
- 2. Basser P. J. & Pajevic S. (2007) Spectral decomposition of a 4th-order covariance tensor: applications to diffusion tensor MRI. Signal Process., 87, 220–236.
- 3. Bermanis A., Averbuch A. & Coifman R. R. (2013) Multiscale data sampling and function extension. Appl. Comput. Harmon. Anal., 34, 15–29.
- 4. Bickel P. J. (1969) A distribution free version of the Smirnov two sample test in the p-variate case. Ann. Math. Stat., 40, 1–23.
- 5. Bruggner R. V., Bodenmiller B., Dill D. L., Tibshirani R. J. & Nolan G. P. (2014) Automated identification of stratifying signatures in cellular subpopulations. Proc. Natl. Acad. Sci., 111, E2770–E2777.
- 6. Chwialkowski K. P., Ramdas A., Sejdinovic D. & Gretton A. (2015) Fast two-sample testing with analytic representations of probability measures. Proceedings of the 28th International Conference on Neural Information Processing Systems, 2, 1981–1989.
- 7. Coifman R. R., Lafon S., Lee A. B., Maggioni M., Nadler B., Warner F. & Zucker S. W. (2005) Geometric diffusions as a tool for harmonic analysis and structure definition of data: multiscale methods. Proc. Natl. Acad. Sci., 102, 7432–7437.
- 8. Deschler B. & Lübbert M. (2006) Acute myeloid leukemia: epidemiology and etiology. Cancer, 107, 2099–2107.
- 9. Epps T. & Singleton K. J. (1986) An omnibus test for the two-sample problem using the empirical characteristic function. J. Statist. Comput. Simulation, 26, 177–203.
- 10. Eric M., Bach F. R. & Harchaoui Z. (2008) Testing for homogeneity with kernel Fisher discriminant analysis. Adv. Neur. Inform. Proc. Syst., 609–616.
- 11. Fernández V. A., Gamero M. J. & García J. M. (2008) A test for the two-sample problem based on empirical characteristic functions. Comput. Statist. Data Anal., 52, 3730–3748.
- 12. Friedman J. H. & Rafsky L. C. (1979) Multivariate generalizations of the Wald–Wolfowitz and Smirnov two-sample tests. Ann. Statist., 697–717.
- 13. Gretton A., Borgwardt K. M., Rasch M. J., Schölkopf B. & Smola A. (2012) A kernel two-sample test. J. Mach. Learn. Res., 13, 723–773.
- 14. Gretton A., Sejdinovic D., Strathmann H., Balakrishnan S., Pontil M., Fukumizu K. & Sriperumbudur B. K. (2012) Optimal kernel choice for large-scale two-sample tests. Proceedings of the 25th International Conference on Neural Information Processing Systems, 1, 1205–1213.
- 15. Hall P. & Tajvidi N. (2002) Permutation tests for equality of distributions in high-dimensional settings. Biometrika, 89, 359–374.
- 16. Henze N. (1988) A multivariate two-sample test based on the number of nearest neighbor type coincidences. Ann. Statist., 16, 772–783.
- 17. Higgins J. J. (2003) Introduction to Modern Nonparametric Statistics, 1st edn. Duxbury Press.
- 18. Jitkrittum W., Szabó Z., Chwialkowski K. P. & Gretton A. (2016) Interpretable distribution features with maximum testing power. Proceedings of the 30th International Conference on Neural Information Processing Systems, 1, 181–189.
- 19. Kolmogorov A. (1933) Sulla determinazione empirica di una legge di distribuzione. G. Ist. Ital. Attuari, 4, 83–91.
- 20. Leeb W. & Coifman R. (2016) Hölder–Lipschitz norms and their duals on spaces with semigroups, with applications to earth mover’s distance. J. Fourier Anal. Appl., 22, 910–953.
- 21. Little A. V., Maggioni M. & Rosasco L. (2016) Multiscale geometric methods for data sets I: multiscale SVD, noise and curvature. Appl. Comput. Harmon. Anal. http://hdl.handle.net/1721.1/72597.
- 22. Lloyd J. R. & Ghahramani Z. (2015) Statistical model criticism using kernel two sample tests. Proceedings of the 28th International Conference on Neural Information Processing Systems, 1, 829–837.
- 23. Ozakin A. & Gray A. G. (2009) Submanifold density estimation. Adv. Neur. Inform. Proc. Syst., 1, 1375–1382.
- 24. Ramdas A., Reddi S. J., Póczos B., Singh A. & Wasserman L. (2015) On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 3571–3577.
- 25. Rosenbaum P. R. (2005) An exact distribution-free test comparing two multivariate distributions based on adjacency. J. R. Stat. Soc. Series B Stat. Methodology, 67, 515–530.
- 26. Sejdinovic D., Sriperumbudur B., Gretton A. & Fukumizu K. (2013) Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Statist., 41, 2263–2291.
- 27. Serfling R. J. (1981) Approximation Theorems of Mathematical Statistics, 1st edn. Wiley-Interscience.
- 28. Shirdhonkar S. & Jacobs D. W. (2008) Approximate earth mover’s distance in linear time. Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, pp. 1–8.
- 29. Smirnov N. (1948) Table for estimating the goodness of fit of empirical distributions. Ann. Math. Stat., 19, 279–281.
- 30. Székely G. J. & Rizzo M. L. (2004) Testing for equal distributions in high dimension. InterStat, 5, 1249–1272.
- 31. Woolfe F., Liberty E., Rokhlin V. & Tygert M. (2008) A fast randomized algorithm for the approximation of matrices. Appl. Comput. Harmon. Anal., 25, 335–366.
- 32. Zhao J. & Meng D. (2015) FastMMD: ensemble of circular discrepancy for efficient two-sample test. Neural Comput., 27, 1345–1372.