Abstract

We describe the use of the Fréchet mean and variance in the Billera–Holmes–Vogtmann (BHV) treespace to summarize and explore the diversity of a set of phylogenetic trees. We show that the Fréchet mean is comparable to other summary methods, and, despite its stickiness property, is more likely to be binary than the majority-rule consensus tree. We show that the Fréchet variance is faster and more precise than commonly used variance measures. The Fréchet mean and variance are more theoretically justified, and more robust, than previous estimates of this type and can be estimated reasonably efficiently, providing a foundation for building more advanced statistical methods and leading to applications such as mean hypothesis testing and outlier detection.

Sets of related phylogenetic trees are commonly encountered in evolutionary biology. For example, one might encounter such a set as the output of an inference program like MrBayes (Ronquist et al. 2012), or as a set of gene trees on some given set of species. For this article, we think of any such set as being a sample from an underlying distribution on the set of all phylogenetic trees with a fixed leaf set, where a tree consists of both a topology and branch lengths. Here, we consider a mathematically-founded basis for describing the mean of such a distribution, using the representation of trees as elements of a continuous, geometric space and looking for the Fréchet mean: the tree that minimizes the sum of the squared distance between the mean and the elements of a sample from the distribution. This formulation also allows the identification of the Fréchet variance, which is the actual sum of the squared distances between the sample elements and the Fréchet mean.

This mean tree has been used in previous work, primarily as a step in the proposal of a more sophisticated statistical approach to analyzing phylogenetic trees (Nye 2014; Zairis et al. 2016; Nye et al. 2017; Willis 2019). The variance, or a heuristic computation of it, has been used to compare levels of incongruence under different evolutionary models (Williams et al. 2012) for hypothesis testing and to validate the application of data cloning to Bayesian phylogenetic inference (Ponciano et al. 2012). However, despite this usage in the literature, there has been no large-scale analysis of the performance and properties of the Fréchet mean and variance.

Throughout the article, we use an iterative algorithm to compute approximations of the Fréchet mean and variance, as there is no known polynomial algorithm for computing the mean. This algorithm is fairly efficient, and we show that the mean is close to other established summary measures like the maximum likelihood (ML) tree, maximum a posteriori (MAP) tree, and majority-rule consensus tree. The mean tree is known to be “sticky”, a phenomenon in which perturbing a sample may not perturb its mean, and which is caused by negative curvature in the treespace. In a negatively curved space, such as hyperbolic space, triangles are skinnier than they would be in Euclidean space. This negative curvature can also lead to the mean tree being unresolved, although we show the mean is more resolved than the majority-rule consensus tree, and to the mean having a shorter overall length than the mean input tree length. We will refer to the overall property of the mean being pulled towards the star tree as stickiness and show that even in a best case biological scenario, its effect is still detectable, if minimal, on the mean.

The Fréchet variance of a set of trees quantifies how spread out that set is from their mean. We will show that in our experiments, as sequence length increases and there is more information about the tree to be reconstructed, the variance of samples of trees from the bootstrap and posterior distributions decreases. We also show that the Fréchet variance is a significantly more precise measure of statistical uncertainty than simpler measures (like the number of topologies found in a set of trees) and is faster to compute than the sum of all pairwise distances between trees, for a large set of trees. Finally, we show that the variance can depend on the length of trees in the sample, so this needs to be accounted for in any comparisons of variance.

Our results show that the treespace-based measure of phylogenetic distance that originated with Billera et al. (2001) can in fact be used in practical applications. The findings that the Fréchet mean and variance make sense biologically, as well as statistically, justify its usage in more sophisticated statistical schemes, such as computing confidence sets (Willis 2019), and encourage future applications, such as outlier detection.

Background

We begin by describing the treespace in which we are working, and some properties of that space and its distance measure. We also consider other ways of summarizing a collection of trees besides the Fréchet mean and variance, and some of their properties.

Treespace

The Billera–Holmes–Vogtmann (BHV) treespace, |$\mathcal{T}_n$|⁠, (Billera et al. 2001) contains all unrooted phylogenetic trees with edge lengths and a given set of |$n+1$| labeled leaves. For this article, we fix the set of leaf labels to be |$\{0, 1, ..., n\}$|⁠. Any of these trees can be thought of as rooted by fixing leaf 0 as the root. In this article, we will define the BHV treespace as a subspace of |$\mathbb{R}^N$|⁠, where |$N = 2^n - 1$| is the number of possible splits on |$n+1$| leaves, or equivalently, the number of possible partitions of the set of leaves into two sets, each of size at least 1. Each coordinate of |$\mathbb{R}^N$| corresponds to a different split, where the order of the splits does not matter but is fixed. Note that by allowing partitions of size 1, we include splits corresponding to edges ending in leaves. The original definition (Billera et al. 2001) ignores these edges, but notes that they can be included, as we have done here.

Given a tree |$T$| with edge lengths and |$n + 1$| leaves, it corresponds to the following vector in |$\mathbb{R}^N$|⁠: for every edge |$e$| in |$T$| with length |$|e|_T$|⁠, let the coordinate corresponding to the split induced by |$e$| be |$|e|_T$|⁠. Let the coordinates corresponding to splits not induced by edges in |$T$| be 0. Let |$\mathcal{T}_n$| be the set of vectors in |$\mathbb{R}^N$| that correspond to trees, as just described. Not all nonnegative vectors in |$\mathbb{R}^N$| correspond to trees due to split incompatibility. Two splits are incompatible if they cannot be induced by edges existing in the same tree. For example, a cherry is a pair of adjacent leaves in a tree, and the corresponding split separates the two adjacent leaves from all others. No tree with |$n \ge 4$| can have both |$\{1,2\}$| and |$\{1,3\}$| as cherries, so the corresponding splits |$\{1,2\}\{0,3, 4, ..., n\}$| and |$\{1,3\}\{0,2, 4, 5 ..., n\}$| are incompatible, and no vectors in |$\mathcal{T}_n$| have positive values in both these coordinates. The topology of a tree is the set of all splits induced by the edges of that tree. A binary tree, in which all interior nodes have degree 3, contains |$2n-1$| splits, while unresolved (or degenerate or non-binary) trees will contain fewer than |$2n-1$| splits.

To visualize BHV treespace, consider all trees in |$\mathcal{T}_n$| with the same topology. Because these trees all correspond to the same set of splits, their vectors have exactly the same set of nonzero coordinates, which can take on any positive values. Thus, if the number of nonzero coordinates is |$d$|⁠, this set of trees corresponds to a |$d$|-dimensional Euclidean orthant, which is the nonnegative part of |$\mathbb{R}^d$|⁠. When the tree topology is binary, |$d = 2n-1$|⁠. There are |$(2n-3)!! = (2n-3)\times (2n-5)\times (2n-7) \times \cdot \cdot \cdot \times 1$| binary tree topologies on |$n+1$| leaves (Schröder 1870), and thus |$(2n-3)!!$| top-level orthants. Maximal orthants have dimension |$2n-1$| and two such orthants share a boundary of dimension |$2n-2$| if and only if their corresponding topologies differ by a single nearest neighbor interchange (NNI) move (see Fig. 1). For more details on the combinatorics and geometry of BHV treespace, see Billera et al. (2001).

(a) Five and (b) three of the 15 quadrants in the BHV treespace $\mathcal{T}_4$, with each subfigure representing a different non-Euclidean feature of the BHV treespace. For ease of visualization, the five dimensions corresponding to the leaf edges are not included. Thus, each binary tree topology is represented by a quadrant (two-dimensional orthant), with the two quadrant axes corresponding to the lengths of the interior edge splits. The axis between two quadrants corresponds to an unresolved tree topology. The geodesic (shortest path) between two trees is shown by a dashed line and may pass through different orthants depending on the branch lengths of the endpoint trees, as shown by the geodesics between $T_1$ and $T_2$ and between $T_1'$ and $T_2'$.
Figure 1.

(a) Five and (b) three of the 15 quadrants in the BHV treespace |$\mathcal{T}_4$|⁠, with each subfigure representing a different non-Euclidean feature of the BHV treespace. For ease of visualization, the five dimensions corresponding to the leaf edges are not included. Thus, each binary tree topology is represented by a quadrant (two-dimensional orthant), with the two quadrant axes corresponding to the lengths of the interior edge splits. The axis between two quadrants corresponds to an unresolved tree topology. The geodesic (shortest path) between two trees is shown by a dashed line and may pass through different orthants depending on the branch lengths of the endpoint trees, as shown by the geodesics between |$T_1$| and |$T_2$| and between |$T_1'$| and |$T_2'$|⁠.

There is a distance metric associated with the BHV treespace called the BHV distance or the geodesic distance, which was also defined in Billera et al. (2001). For any two trees with the same topology, the BHV distance is the Euclidean distance between their corresponding vectors in their shared orthant. For two trees with different topologies, the BHV distance is the length of the shortest path between them that remains in the treespace. The length of any path can be computed by calculating the Euclidean distance of the path restricted to each orthant that it passes though, and summing these lengths. The shortest path is called a geodesic and will pass from one orthant to the next orthant through lower-dimensional boundaries corresponding to trees with fewer splits. See Figure 1 for an example of two geodesics in |$\mathcal{T}_4$|⁠.

The BHV treespace is connected, because there is a path between any two trees that consists of the line segment from the first tree to the origin, followed by the line segment from the origin to the second tree. This path through the origin may or may not be the geodesic. Billera et al. (2001) showed that the BHV treespace is globally nonpositively curved (Bridson and Haefliger 1999), which implies that geodesics are unique. Owen and Provan (2011) gave a polynomial time algorithm for computing the geodesic distance between two trees that runs in |$O(nm^3)$| time, where |$m$| is the number of leaves in the largest subtree formed by decomposing the trees along common splits.

Mean and Variance

In Euclidean space, the Fréchet mean, or center of mass, is the point minimizing the sum of the squared distances to the sample points and is equivalent to the coordinate-wise average of the sample points. The Fréchet mean was similarly defined for treespace by Miller et al. (2015) and Bačák (2014), independently. For a set of sample trees |$\{T_1, T_2, ..., T_r\}$|⁠, the Fréchet mean, or simply, mean, is the tree |$t$| which minimizes |$\sum_{i=1}^r d(t,T_i)^2,$| where |$d(\cdot,\cdot)$| is the BHV distance between the two trees. The Fréchet variance, or simply variance, is that minimum sum of squared distances, and the standard deviation is the square root of the variance. This mean is unique because treespace is nonpositively curved. Both Miller et al. (2015) and Bačák (2014) gave an iterative algorithm for approximating the mean and variance based on the Law of Large Numbers for nonpositively curved space derived by Sturm (2003).

We briefly mention some interesting properties of the mean. See Miller et al. (2015, Section 5) for details and proofs. Tree |$T$| is a refinement of tree |$T'$| if tree |$T$| contains all of the splits of tree |$T'$|⁠, as well as at least one other split. The mean tree is not necessarily a refinement of the majority-rule consensus tree (which contains all splits found in a majority of the trees |$T_i$|⁠). Any split that appears in the mean tree appears in at least one of the sample trees, and any split that appears in all the sample trees appears in the mean tree. Finally, the mean tree is “sticky” (Hotz et al. 2013), in that perturbing one of the sample trees does not always change the mean tree. This “stickiness” only happens when the mean is on a lower-dimensional orthant in treespace, which corresponds to an unresolved tree topology. Thus, the mean will be unresolved more often than one might expect or wish, similar to the majority-rule consensus tree. Stickiness is caused by the non-Euclidean nature of BHV treespace, and also causes binary mean trees to be pulled towards the lower-dimensional orthants. For example, suppose we wish to find the mean of a set of trees sampled uniformly from a ball of radius 1 around a center tree, as in Figure 2a. By folding the orthants |$B$| and |$C$| on top of each other, we see that in the limit as the sample size increases, computing this mean is equivalent to computing the weighted Euclidean mean, or center of gravity, of a disk in which a section corresponding to the part of the ball in orthants |$B$| and |$C$| has twice the density of the rest of the disk (see Fig. 2b). The mean will be closer to the shared axis than the original center tree. This behavior is also a component of the stickiness of the mean.

(a) Three neighboring quadrants in BHV treespace, labeled $A$, $B$, and $C$. We uniformly sample from a ball of radius 1 around the center tree (marked by a dot) in quadrant $A$, as shown in (a). Flattening quadrants $B$ and $C$ gives (b), which shows that computing the mean of this distribution (marked by a star) is equivalent to computing the weighted Euclidean mean of a disk with twice the density in the region where the radius 1 ball intersects orthants $B$ and $C$.
Figure 2.

(a) Three neighboring quadrants in BHV treespace, labeled |$A$|⁠, |$B$|⁠, and |$C$|⁠. We uniformly sample from a ball of radius 1 around the center tree (marked by a dot) in quadrant |$A$|⁠, as shown in (a). Flattening quadrants |$B$| and |$C$| gives (b), which shows that computing the mean of this distribution (marked by a star) is equivalent to computing the weighted Euclidean mean of a disk with twice the density in the region where the radius 1 ball intersects orthants |$B$| and |$C$|⁠.

Sturm (2003) showed that a Law of Large Numbers holds, meaning that as the sample size increases, the sample means of a distribution over |$\mathcal{T}_n$| converge to the true mean. Barden and Le (2018) proved a Central Limit Theorem on the BHV treespace, showing that the distribution of the sample means converges to a certain Gaussian distribution. This Central Limit Theorem was used by Willis (2019) to construct confidence sets in BHV treespace.

Other Measures of Centrality and Variance

We compare the Fréchet mean to three other commonly used measures of center or consensus in phylogenetics, of which some are only the topology and others are both the topology and edge lengths. The first is the majority-rule topology, which is the topology containing exactly those splits appearing in a majority of the input trees (Margush and McMorris 1981). The majority-rule consensus topology and the Fréchet mean are not, in general, refinements of each other (Miller et al. 2015).

The other two measures of center come from tree search algorithms, namely ML and Bayesian inference. Both tree search procedures produce a distribution of trees along with a most likely or most probable tree, which acts like a center. Specifically, maximum likelihood search produces a distribution of bootstrap trees (Felsenstein 1985) and a maximum likelihood (ML) tree, while Bayesian inference produces a posterior distribution and a MAP topology (Rannala and Yang 1996).

The variance or amount of variability in a set of trees is less established as a measure than the summary tree. Often, a visual representation, such as SplitsTree (Huson and Bryant 2006) or DensiTree (Bouckaert 2010), is used to represent the diversity of a set of trees. Unfortunately, these methods cannot be assessed or compared quantitatively. Instead, we will compare the Fréchet variance with several quantitative measures of variance, namely the number of different tree topologies in the set, the number of different splits in the set, and the sum of the squared geodesic distances between each pair of trees in the set. The latter measure was proposed in Chakerian and Holmes (2012). A recent paper by Lewis et al. (2016) introduced a Bayesian approach for measuring phylogenetic information content using entropy that could also be used as a measure of variability. However, this method begins to run into limitations at 12 taxa, due to the combinatorial explosion in tree topologies, and thus we do not consider it here.

Relationship between the BHV and Robinson–Foulds Distances

The BHV distance and weighted version of the Robinson–Foulds (RF) distance (Robinson and Foulds 1981) are closely related, leading to relationships between measures of center involving these distances. The RF distance between trees |$T_1$| and |$T_2$| is defined as the number of splits in |$T_1$| but not in |$T_2$|⁠, plus the number of splits in |$T_2$| but not in |$T_1$|⁠. That is, it is the size of the symmetric difference between the split sets of |$T_1$| and |$T_2$|⁠. Note that this definition only uses the tree topologies and not the edge lengths of the trees. The weighted Robinson–Foulds (WAF) distance (Robinson and Foulds 1979) is the sum of the lengths of the splits in only one of the trees, plus the difference in the lengths of the common splits. If the |$L_1$| metric, which is also known as the Manhattan or taxicab metric, is used in each orthant of BHV treespace instead of the |$L_2$| or Euclidean metric, then the length of a (no longer unique) geodesic between two trees is the same as the weighted RF distance between them (see St. John (2016) for a good summary). Because the WAF and BHV distances only differ by whether the |$L_1$| or |$L_2$| metric is used in each orthant, pairs of trees that are close under one are close under the other. Thus, results validating the WAF distance for biological application (Kuhner and Yamato 2014) are likely to carry over to the BHV distance. While the BHV distance is more expensive to compute, BHV treespace has mathematical properties, like the uniqueness of geodesics and means, that make it more appropriate for developing further statistical tools for analyzing trees.

Recall that the median tree is the tree minimizing the sum of distances, instead of the sum of squared distances as in the mean tree, to the sample trees. The majority-rule consensus tree is the median tree under the Robinson–Foulds distance (Barthélemy and McMorris 1986), and Pattengale (2005) further showed that the weighted majority-rule consensus tree is the median under the weighted Robinson–Foulds distance.

Related Work

The most closely related work to ours is that of Benner et al. (2014). They investigated the behavior of both the mean and the median (the tree minimizing the sum of distances, instead of the sum of squared distances, to the sample trees) in summarizing posterior distributions returned by Bayesian tree reconstruction methods. The authors simulated sequences of lengths 50, 100, 250, and 500 using a 14-taxa tree of plants and the F81 evolutionary model (Felsenstein 1981). They show that the mean and median estimates are comparable to the majority-rule consensus estimate, and in some instances perform better, and also investigate how the Fréchet variance changes with sequence length.

Our work was conceived independently, and while the overall aim of the work is the same, the scope of our experiments are much broader, considering tree distributions generated from both maximum likelihood and Bayesian methods; the general GTR evolutionary model; trees with more taxa; stickiness of the mean in practice; and a more comprehensive look at the Fréchet variance, including its relationship with tree length.

This article focuses on the Fréchet mean and variance of a set of trees, with the mean being a point summary of the data. Other work has looked at one-dimensional summaries, or best fit lines, in BHV treespace (Nye 2011, 2014; Feragen et al. 2013), and extensions to multidimensional summaries or a generalization of principal components analysis (Nye et al. 2017).

The Fréchet mean is not the only mathematically justified summary statistic for a set of trees. A Bayes estimator is a estimator that minimizes the expected value of a loss function, taken over the posterior distribution. Holder et al. (2008) showed that the majority-rule consensus tree is the optimal tree to report as a summary of a Bayesian posterior when using a loss function that is linear in the number of incorrect clades estimated and number of true clades missing from the estimate, and that treats an incorrect split as a more serious error than an omitted split. Huggins et al. (2011) continued this line of investigation by showing that using the path difference distance (Steel and Penny 1993) between trees as the loss function in the Bayes Estimator applied to a posterior distribution sample improved the accuracy of the reconstruction tree. When the loss function in a Bayes Estimator is a distance, computing the Bayes Estimator becomes equivalent to computing a median.

Methods

Simulated Data

For the first data set, MAMMAL, we simulated 10 sets of DNA sequences for a variety of lengths (500, 1000, 1500, 2000, 2500, 3000, 3500, 4000 base pairs), using the tree of 44 mammals from Murphy et al. (2001) as the reference tree. The sequences were simulated with Seq-Gen version 1.3.3 (Rambaut and Grass 1997) using the GTR + I model with the following parameter settings: the proportion of invariable sites is 0.18; the equilibrium frequencies for A, C, G, T are 0.21, 0.31, 0.3, and 0.18, respectively; and the GTR rate matrix is |$(1.5,4.91,1.34,0.83,5.8,1)$|⁠, where the entries are the transition rate from A to C, A to G, A to T, C to G, C to T, and G to T. These parameter values were estimated in Hillis et al. (2005) for this reference tree.

The second data set, SATE, was the sets of 100-taxon sequences simulated for, and used in Liu et al. (2009). These sequences were produced by generating reference trees using r8s (Sanderson 2003), which were then adjusted to not be ultrametric and each edge length was scaled by a “tree height" value (15 for models 100L and 100M1; 10 for model 100S1; 7 for model 100M2; 4 for model 100M3; and 2 for models 100L2 and 100S2). Sequences of 1000 base pairs were simulated via ROSE (Stoye et al. 1998) using the GTR + Gamma model and either short (models 100S1 and 100S2), medium (models 100M1, 100M2, and 100M3), or long (models 100L1 and 100L2) gaps. There were 20 replicates, each with their own reference tree, for each of the seven different model conditions. See the Supplementary material of Liu et al. (2009) for complete model details.

For each set of sequences in each data set, we ran RAxML Version 8 (Stamatakis 2014) to compute the ML tree and 1000 bootstrap sample trees. For the RAxML settings, we used the GTR + I + Gamma evolutionary model. We conducted a full analysis (option -f a) using rapid bootstrapping (option -x), which was recommended in the RAxML user manual when computing a large number of bootstrap replicates, since every 5th bootstrap tree is used as a starting point for the ML tree search. For each set of sequences, we also ran MrBayes 3.2 (Ronquist et al. 2012) to compute the MAP tree and sample 1000 trees from the posterior distribution. For the MrBayes settings, we used the GTR+I+Gamma evolutionary model, and ran for 5,000,000 iterations, sampling every 1000 generations. We used the last 1000 sampled trees as the posterior distribution. We verified convergence of each run by comparing the log-likelihood values between the two chains.

For each sample distribution of 1000 trees, we estimated the Fréchet mean and variance, using SturmMean (Miller et al. 2015) with the settings explained in the following subsection. We also computed the majority-rule consensus tree using Dendropy (Sukumaran and Holder 2010), and three other measures of variation for each sample: the number of different tree topologies in the sample, the number of different splits in the sample, and the sum of the squared BHV distance between each pair of trees in the sample. We compared the mean tree to the corresponding reference, majority-rule consensus, ML, and MAP trees using the RF distance and the BHV distance if both trees had meaningful branch lengths. The mean trees, ML trees, and reference trees all had meaningful branch lengths, while the majority-rule consensus tree and MAP trees did not. To visualize what the trees from one replicate of the MAMMAL data set look like, for the first replicate of the 4000 base pair sequence length experiment, we computed the BHV distance between 100 trees from each of the bootstrap and posterior samples, the reference tree, the ML tree and the two mean trees. We use classic multidimensional scaling (MDS) (Kruskal 1964) to reduce this to two dimensions.

For MAMMAL, we compared the number of unresolved mean and majority-rule trees for each sequence length, as well as the lengths of the sampled trees and their means and ML trees. Here, we define the length of a tree |$T$| with edge set |$E$| to be |$\sqrt{\sum_{e \in E} |e|_T^2}$| which is also the BHV distance of tree |$T$| to the BHV treespace origin. To investigate stickiness of the mean, we compare the length of the mean tree of a sample with the average length of trees in that sample, to see if the mean is much shorter than its component trees. We also compare the lengths of the mean and the average lengths of trees between the posterior and bootstrap distributions generated from the same sequence set. In all cases, we use the Wilcoxon ranked signed test (Wilcoxon 1945) to test the hypothesis that the two distributions of lengths are the same.

Finally, we conducted a mean hypothesis test on the bootstrap and posterior samples for the first replicate of 4000 base pairs in MAMMAL. Recall that a mean hypothesis test is a type of two-sample test that tests if the means of two samples are the same. Rejecting this hypothesis implies that the samples are from different distributions. We computed the BHV distance between the means of these bootstrap and posterior samples, and compared it to the BHV distance between the means of a random partition of the two samples. This random partition was made by combining the bootstrap and posterior trees into one set, and randomly partitioning this new set into two equal parts. Since a full permutation test is not feasible (each sample contains 1000 trees), we estimated the distribution of the difference between means using 500 randomly chosen permutations. Note we are following the method for performing two-sample hypothesis testing on trees that was suggested in Feragen et al. (2013) for lung-airway tree data.

Ratsnake Data

Chen et al. (2017b) studied the biogeographic origins and regional diversity of ratsnakes using a phylogenomic scale data set of 304 loci. This data set (Chen et al. 2017a) included 1000 trees sampled from the posterior distribution for each of the 304 loci, which we refer to as SNAKE. For each sample of 1000 trees in SNAKE, we compute the Fréchet mean and variance, as well as the average length of the trees in the sample. For each loci, we subsampled 100 trees from the 1000 tree posterior distribution sample, and for each subsample, computed the BHV distance between all trees and embedded the trees in two dimensions using MDS.

Computation of the Mean and Variance

We compute an approximation of the mean tree using the iterative implementation described in Miller et al. (2015). In this implementation, a new approximation of the mean tree is returned each iteration, and these approximations converge to the true mean tree as the number of iterations grows. To decide when to stop the iterative algorithm, we use a program option to check for convergence using a Cauchy sequence of length 10 with an epsilon of |$10^{-6}$| for MAMMAL and |$10^{-4}$| for SATE and SNAKE. In other words, we stop the iterative algorithm when the pairwise BHV distances between the last 10 mean approximations were all less than or equal to this epsilon. We also use the random permutation heuristic, which selects the tree used at each iteration randomly without replacement instead of with replacement. This heuristic improves the convergence time of the algorithm in practice. In all of our experiments, the means converged within 285,000 iterations, and computing a single mean and variance took 10–35 min for a MAMMAL replicate, 5–35 min for a SATE replicate, and 1–10 min for a SNAKE sample on a 3.5 GHz 6-Core Intel Xeon E5 Processor. The epsilon values were chosen to yield manageable runtimes. However, to understand the effect of epsilon on the mean, we chose one replicate for each sequence length in MAMMAL, and computed the (approximate) mean of the corresponding posterior distribution sample 10 times using the chosen parameters. For an epsilon of |$10^{-6}$|⁠, the average BHV distance between pairs of approximate means of the same sample was on the order of |$10^{-5}$|⁠, which we considered acceptable. The nature of the iterative approximation algorithm causes the estimated mean trees to be binary. We declared an estimated mean tree to be unresolved if one or more of its internal edges were less than epsilon in length. The program and source code are available at http://comet.lehman.cuny.edu/owen/code/SturmMean.tar.gz and on Dryad (doi:10.5061/dryad.h5f4117).

Results

The MDS visualization of the first replicate of the 4000 base pair sequence length is shown in Figure 3. The two means and the ML tree are in the middle of their respective samples. The two samples are separated in space and the reference tree is closer to the posterior sample but not near its center. The reference tree, two means, ML tree, and 62 of the trees from the two samples have the same topology. The other common topologies appear in both the bootstrap and posterior samples, suggesting that the difference between the two clusters in Figure 3 is primarily due to branch lengths, which the BHV distance takes into account, instead of topology.

A sample of trees from the first replicate of the 4000 base pair sequence length from the MAMMAL data set embedded in two dimensions using classic multidimensional scaling. The trees from the bootstrap and posterior samples from two clusters, most likely due to differences in branch lengths instead of topology.
Figure 3.

A sample of trees from the first replicate of the 4000 base pair sequence length from the MAMMAL data set embedded in two dimensions using classic multidimensional scaling. The trees from the bootstrap and posterior samples from two clusters, most likely due to differences in branch lengths instead of topology.

Comparison of the Mean Tree with Other Measures of Center

First we compare the mean trees to the reference trees and the other summary trees, showing that they are close. The four plots in Figures 4 and 5 use the RF and BHV distances to compare the Fréchet mean of each bootstrap and posterior sample to its corresponding majority-rule consensus tree, the reference tree, and either ML or MAP tree, as appropriate. In both MAMMAL and SATE the mean tree is closer to the ML or MAP and majority-rule consensus tree than to the reference tree. In the MAMMAL data set, the distance between the mean and the other summary trees decreases as the sequence length increases, with those three trees almost always having the same topology for sequences of length 3000 base pairs and longer. In the SATE data set, the difference in the scale of the edge lengths of the reference trees results in large differences in the BHV distance between trees.

For each sample generated from the MAMMAL data set, the RF and BHV distances were calculated from its mean tree to its majority-rule tree, ML or MAP tree, and the reference tree. The distribution of these distances is shown using the box plots. The mean tree approaches the ML or MAP, and consensus trees as the sequence length increases, with all three trees usually sharing the same topology by sequence lengths of 3000 base pairs.
Figure 4.

For each sample generated from the MAMMAL data set, the RF and BHV distances were calculated from its mean tree to its majority-rule tree, ML or MAP tree, and the reference tree. The distribution of these distances is shown using the box plots. The mean tree approaches the ML or MAP, and consensus trees as the sequence length increases, with all three trees usually sharing the same topology by sequence lengths of 3000 base pairs.

For each sample generated from the SATE data set, the RF and BHV distances were calculated from its mean tree to its majority-rule tree, ML or MAP tree, and reference tree. The distribution of these distances is shown using the box plots. The mean tree is much closer to the ML or MAP tree and consensus tree than to the reference tree. The variation in BHV distance between models is due to the difference in the scale of the edge lengths of the reference trees, which translates into longer edge lengths in the higher tree models.
Figure 5.

For each sample generated from the SATE data set, the RF and BHV distances were calculated from its mean tree to its majority-rule tree, ML or MAP tree, and reference tree. The distribution of these distances is shown using the box plots. The mean tree is much closer to the ML or MAP tree and consensus tree than to the reference tree. The variation in BHV distance between models is due to the difference in the scale of the edge lengths of the reference trees, which translates into longer edge lengths in the higher tree models.

Next we consider the relation of the reference trees to the reconstructed trees. The plots in Figures 6 and 7 use the RF and BHV distance to compare the reference tree to the mean tree, majority-rule consensus tree, and ML or MAP tree of each sample. These plots combined with the previous Figures 4 and 5 show that the reconstructed trees are closer to each other than to the reference tree. In MAMMAL, as expected, all trees become closer to the reference tree as the sequence length increases, since this gives more information about the reference tree, improving its reconstruction. Interestingly, under the BHV distance and for the bootstrap distribution, the mean tree is closer to the reference tree than the ML tree (P-value |$< 2.2 \times 10^{-16}$| under the Wilcoxon signed rank test) for MAMMAL, but the reverse was true (P-value |$< 2.2 \times 10^{-16}$| under the Wilcoxon signed rank test) for SATE. Such strong P-values showing opposite results may indicate this behavior is data-dependent.

For each sample generated from the MAMMAL data set, the RF and BHV distances were calculated from the reference tree to the sample’s mean tree, majority-rule tree, and ML or MAP tree. The distribution of these distances is shown using the box plots. All three reconstructed trees approach the reference tree as the sequence length increases, with the mean tree being slightly closer on average than the ML tree under the BHV distance measure. The BHV distance between the reference and mean trees for the posterior distribution is shown in the lower-right graph in Figure 4.
Figure 6.

For each sample generated from the MAMMAL data set, the RF and BHV distances were calculated from the reference tree to the sample’s mean tree, majority-rule tree, and ML or MAP tree. The distribution of these distances is shown using the box plots. All three reconstructed trees approach the reference tree as the sequence length increases, with the mean tree being slightly closer on average than the ML tree under the BHV distance measure. The BHV distance between the reference and mean trees for the posterior distribution is shown in the lower-right graph in Figure 4.

For each sample generated from the SATE data set, the RF and BHV distances were calculated from its reference tree to its mean tree, majority-rule tree, and ML or MAP tree. The distribution of these distances is shown using the box plots. All three reconstructed trees are similar distances to the reference trees under both distances, with the variation in BHV distances due to the model variation in tree height. The BHV distance between the reference and mean trees for the posterior distribution is shown in the lower-right graph in Figure 5.
Figure 7.

For each sample generated from the SATE data set, the RF and BHV distances were calculated from its reference tree to its mean tree, majority-rule tree, and ML or MAP tree. The distribution of these distances is shown using the box plots. All three reconstructed trees are similar distances to the reference trees under both distances, with the variation in BHV distances due to the model variation in tree height. The BHV distance between the reference and mean trees for the posterior distribution is shown in the lower-right graph in Figure 5.

Variance

We compare the different measures of variance in a set a trees, namely the number of different topologies, the number of different splits, the Fréchet variance, and the sum of the squared BHV distance between all pairs of input trees, in Figures 8 and 9. In the MAMMAL data set, as the sequence length increases, there is more information about the underlying tree, and so we expect RAxML and MrBayes to both do a better job at inferring this tree and be more certain about it. This increased certainty should be reflected by a decrease in variance in the bootstrap and posterior distributions, and thus samples, as the sequence length increases, which is what we see in Figure 8. In the SATE data set, the variance decreased as the amount the edge lengths are scaled by decreased. Only model pairs 100L1 and 100M1, and 100L2 and 100S2 have different gap lengths, but had edge lengths scaled by the same amount. However, this change in gap length does not appear to have affected the variance. For both data sets, for all measures, except possibly the number of different topologies for SATE, the posterior samples have a lower variance than the bootstrap samples. This lower variance matches previous observations that the posterior probabilities are higher than bootstrap frequencies for well-supported clades (Douady et al. 2003; Erixon et al. 2003; Huelsenbeck and Rannala 2004), since trees in a lower variance sample are not as spread out, and thus have fewer different splits, than trees in a higher variance sample. In general, we find that there is more variation among the replicates in the number of different topologies and splits than in the Fréchet variance.

The variance of the MAMMAL data set samples of the bootstrap and posterior distributions are measured using the number of different topologies, the number of different splits, the sum of the squared BHV distance between all pairs, and the Fréchet variance. For all measures, the posterior samples have lower variance on average than the bootstrap samples. The sum of the squared BHV distance between all pairs and the Fréchet variance are very similar, up to scaling. Note that the y axis for the number of different splits, sum of squares, and Fréchet variance are in log scale.
Figure 8.

The variance of the MAMMAL data set samples of the bootstrap and posterior distributions are measured using the number of different topologies, the number of different splits, the sum of the squared BHV distance between all pairs, and the Fréchet variance. For all measures, the posterior samples have lower variance on average than the bootstrap samples. The sum of the squared BHV distance between all pairs and the Fréchet variance are very similar, up to scaling. Note that the y axis for the number of different splits, sum of squares, and Fréchet variance are in log scale.

The variance of the SATE data set samples of the bootstrap and posterior distributions are measured using the number of different topologies, the number of different splits, the sum of the squared BHV distance between all pairs, and the Fréchet variance. For all measures, except possibly the number of different topologies, the posterior samples have lower variance on average than the bootstrap samples. The sum of the squared BHV distance between all pairs and the Fréchet variance are very similar, up to scaling. Note that the y axis for the number of different splits, sum of squares, and Fréchet variance are in log scale.
Figure 9.

The variance of the SATE data set samples of the bootstrap and posterior distributions are measured using the number of different topologies, the number of different splits, the sum of the squared BHV distance between all pairs, and the Fréchet variance. For all measures, except possibly the number of different topologies, the posterior samples have lower variance on average than the bootstrap samples. The sum of the squared BHV distance between all pairs and the Fréchet variance are very similar, up to scaling. Note that the y axis for the number of different splits, sum of squares, and Fréchet variance are in log scale.

The measurements of the sum of the squared BHV distance between all pairs of sample trees and the Fréchet variance are almost identical, up to scaling. This similar behavior is expected as these two measures are equivalent in Euclidean space, which is why the sum of squared BHV distance between all pairs was first suggested as a measure of variance. For large input sets, it is faster to compute the Fréchet variance than the BHV distance between all pairs. For more details, see the Discussion section below.

Unresolved Trees and Tree Length Comparison

We also looked at the average tree lengths and how many mean and majority-rule trees were unresolved to determine the effect of mean stickiness. Figure 10 shows the number of unresolved mean and majority-rule trees for the bootstrap and posterior samples for all sequence lengths in MAMMAL. The number of unresolved mean and majority-rule trees decreases as the sequence lengths increase, with all mean trees for the 4000 base pair sequence length being binary. There are always the same or fewer unresolved mean trees than unresolved majority-rule trees for each sequence length. However, even when mean trees are fully resolved, the stickiness property could still cause them to be closer to the origin of the sample than expected. In terms of measurable effects, being closer to the origin translates into the mean trees being shorter (in terms of total edge length) on average than the trees from which they were computed. For each of the 80 (10 replicates per sequence length) bootstrap and posterior samples, the length of the mean tree was strictly less than the average length of the corresponding sample trees. While the mean trees had a shorter length than might be expected, this does not noticeably affect the quality of the summary, as averaged over all sequence lengths, the bootstrap and posterior mean trees were only 99.48% and 99.50%, respectively, of the average lengths of their respective samples.

In the MAMMAL data set, there were the same number or fewer unresolved mean tree than majority-rule trees for the bootstrap and posterior samples for all sequence lengths. The number of unresolved mean trees decreases as the sequence length increases, reaching 0 for 4000 base pairs.
Figure 10.

In the MAMMAL data set, there were the same number or fewer unresolved mean tree than majority-rule trees for the bootstrap and posterior samples for all sequence lengths. The number of unresolved mean trees decreases as the sequence length increases, reaching 0 for 4000 base pairs.

We also compared the lengths of the ML trees of the bootstrap samples with their corresponding mean trees for the MAMMAL data set. In all 80 samples the length of the ML tree was greater than the length of the corresponding mean tree. With |$P$|-value 0.005861, the length of the ML tree is greater than the average length of the trees in the corresponding bootstrap sample.

Interestingly, in the MAMMAL data set, the average length of the trees in each posterior sample was always less than the average length of the trees in the corresponding bootstrap sample generated from the same set of sequences. This difference suggests the bootstrap samples are more spread out, which could also reflect posterior probabilities being higher for well-support clades than bootstrap probabilities. Unsurprisingly, the mean tree of the posterior distribution sample always had length less than the mean tree of the corresponding bootstrap sample.

Mean Hypothesis Test

For the first replicate of the 4000 base pair sequences, we performed a two-sample hypothesis test to compare the samples of the bootstrap and posterior distributions using the means. The null hypothesis stated that the two means of bootstrap and posterior distributions were the same. We tested this hypothesis using an approximate permutation test. In all cases, the distance between the means of a random partition of the two samples was strictly less than the distance between the means of the two samples themselves. Thus we reject the null hypothesis with an estimated P-value of 0.002, with a 95% confidence interval of |$[0,0.006]$|⁠. Therefore, we can assume the two distributions do not have the same mean, and thus are not the same. This result was expected due to previous work showing the bootstrap and posterior probabilities are different (Douady et al. 2003; Erixon et al. 2003; Huelsenbeck and Rannala 2004).

Biological Applications to Ratsnake Data

Figure 11a shows that for SNAKE there is a linear relationship between the standard deviation and the average length of trees in the sample. The variation in the standard deviation increases as the average length increases, particularly after an average length of 1.5. However, there are two outliers with average lengths of less than 1.5 (marked with a plus sign in Figure 11a), that correspond to loci 136 and 267. The first 500 trees in both of these samples have significantly smaller lengths (0.059–0.145) than the lengths of the last 500 trees in the samples (1.45–3.14).

(left) The standard deviation depends linearly on the average length of the trees in SNAKE, with the variation in the standard deviation increasing as the length increases. Two outliers are marked with a plus sign. (right) The MDS plot of a subsample of 100 trees from one of these outliers (loci 136). The small cluster corresponds to trees with much smaller lengths than the other trees in the subsample.
Figure 11.

(left) The standard deviation depends linearly on the average length of the trees in SNAKE, with the variation in the standard deviation increasing as the length increases. Two outliers are marked with a plus sign. (right) The MDS plot of a subsample of 100 trees from one of these outliers (loci 136). The small cluster corresponds to trees with much smaller lengths than the other trees in the subsample.

Discussion

Our experiments demonstrate that both the Fréchet mean and variance behave similarly to other summary and variance methods on biological data. We have shown that the mean of samples of bootstrap and posterior distributions is comparable in accuracy to the ML and MAP trees, respectively, as well as the majority-rule consensus topology. Furthermore, the mean tree is more likely to be binary than the majority-rule consensus topologies. In the MAMMAL data set, the original reference tree is closer to the mean than to the ML tree, however, the reverse is true in the SATE data set. This leads to the open question of whether we can characterize when the Fréchet mean performs better than the ML tree as a summary for bootstrap samples. While this possible gain may not be enough to warrant the cost of computing the mean, these results conclusively demonstrate that the mean is a valid summary method for a sample of trees. We believe the value of the mean comes from its sound mathematical backing, which enables more sophisticated statistical tests, like mean hypothesis testing.

From the results of the variance experiments, it is clear that the Fréchet variance is a faster and more precise measure of variance than existing alternatives. The variability in the number of topologies in the tree samples with the same sequence length is very high, in comparison to the other variance measures. Although the variability decreases when measuring the number of splits in the tree sample, it is still higher than that of the sum of squared pairwise distances and the Fréchet variance. From the SATE data set variances, we see that tree length can be a key, and possibly overwhelming, factor in the value of the variance. This observation is a direct consequence of the definition of the Fréchet variance as a sum of squared BHV distances, since if the branch lengths of two trees are all scaled by |$c$|⁠, then the BHV distance between the two trees scales by |$2c$|⁠. If the branch lengths of all trees in a set are scaled by |$c$|⁠, then the Fréchet variance of the set scales by |$4c^2$|⁠. However, the MAMMAL data set experiments show that when tree samples are comparable in length, the variance provides a means of differentiating between sample variability. Thus, the variance will be most useful in comparing samples of similar average tree length. For example, when comparing the bootstrap or posterior samples for a number of genes on the same taxa, high variance relative to average tree length of a sample could indicate an outlier. In the SNAKE data set, two such outliers had posterior samples from MrBayes with an abrupt change in the sample tree lengths at the midpoint of the samples, which may indicate some problem with the MrBayes run. We do not have access to the original MrBayes output, but Chen et al. (2017b) had assessed convergence, and confirmed the effective sample size was |$>$|200 for all parameters. None of the other analyses performed by Chen et al. (2017b) identified these loci as outliers. Thus, the Fréchet variance can be used to detect anomalies that are missed by existing methods.

The sum of squared pairwise distances and the Fréchet variance have very similar profiles, up to scaling, justifying the use of the sum of squared pairwise distances as a measure of variance in the literature. However for large sample sizes, the Fréchet variance is faster to compute. To compute the sum of squared geodesic distances for |$r$| trees, one must compute |$r(r-1)/2$| geodesic distances. In contrast, computing the Fréchet variance involves calculating the geodesic distance once per iteration of the algorithm to compute the mean, and |$r$| times to compute the variance from the mean. The number of iterations required depends on the desired precision of the mean. However, we have often obtained good results with 10,000–15,000 iterations, suggesting that when we are computing variances of more than approximately 200 trees, the Fréchet variance calculation will be faster than the sum-of-squared-distance calculation.

Our experiments on the tree lengths show that while the mean trees are shorter than the average length of the sample trees, it is by less than 1% on average. This demonstrates that stickiness leads to shorter mean trees, but that the difference in lengths is not likely to be significant and can be ignored. More importantly, we showed that the mean tree is less sticky than the majority-rule topology, because it has the same or fewer unresolved edges. The Euclidean median is sticky, while the Euclidean mean is not, so since the majority-rule topology is the median under the RF distance, it is likely more sticky than the mean tree. Thus, using the mean, instead of majority-rule topology, to summarize a set of trees gives more resolution, albeit at higher computation cost. Note that the iterative algorithm for approximating the mean works in such a way that it almost always produces binary trees, so any edges of length less than the approximation amount should be trimmed.

One might think that we could use the mean to estimate the species tree for a set of gene trees (Miller et al., 2015, Example 5.5). Even if we just restrict ourselves to the coalescent model for explaining gene tree diversity, this estimation is still problematic in two regards. Under the coalescent model, the pendant edges of the gene trees must be at least as long as than the pendant edges of the species tree but will usually be longer. Since each pendant edge of the mean tree is computed by averaging the length of that edge in all input trees, the pendant edges of the mean of the gene trees will be at least as long, and usually longer, than the pendant edges of the true species tree. While it is possible that the topology of the mean tree might still match that of the species tree, as conjectured in Miller et al. (2015, Example 5.5), our preliminary experiments suggest that stickiness greatly limits the amount of information in the mean tree, compared with other methods. In preliminary experiments, the mean tree of a set of simulated gene trees became essentially unresolved, while the gene trees were still far away from the anomalous zone (Degnan and Rosenberg 2006). That is, the most common gene tree topology was still the same as the species tree topology. By essentially unresolved, we mean that the approximated mean tree was within a small distance of some unresolved tree. It is possible that an exact algorithm for computing the mean would give a binary mean tree topology even in these cases, but with an iterative algorithm, we can not be certain the mean is not unresolved, nor even that it would be informative.

Conclusion

We have shown that the Fréchet mean and variance behave in an expected way on biological data. We have further shown that the mean is more likely to be resolved than the majority-rule tree, and that the variance is a stable and reliable measure of the amount of variability in a sample of trees. This validation opens the door to new applications of these quantities.

One possible application of the variance is to determine when MrBayes or other Markov Chain Monte Carlo (MCMC) algorithms converge. Preliminary experiments on the MrBayes runs for the data in this article show that comparing the variance of sliding windows of the sampled trees can identify the burn-in, but that the variance of the trees in the sliding window remains roughly the same after this period. However, there may be other data sets where the variance will continue to decrease after the burn-in, indicating a lack of convergence. The iterative algorithm for computing the mean is easily adapted into an online algorithm: at each iteration, instead of choosing a tree at random from the input set of trees, use the next tree generated by the MCMC chain. Unfortunately, it is not clear how to take advantage of this online algorithm in computing the variance, which requires computing the BHV distance from the current mean approximation to all sample trees seen so far. However, algorithms for computing geodesics dynamically (Skwerer and Provan 2015) might help.

Funding

M.O. acknowledges the support of the Fields Institute. This work was partially supported by a grant from the Simons Foundation (#355824 to M.O.). The work of D.G.B. is supported by a Natural Sciences and Engineering Research Council (NSERC) Discovery Grant.

Acknowledgments

We would like to thank the anonymous reviewers and Jeremy Brown for their comments and suggestions that greatly improved the quality of the article. We also thank Frank Burbrink, Aasa Feragen, Marcelo Gehara, Tom Nye, Lior Pachter, Katherine St. John, Tandy Warnow, and Derrick Zwickl for helpful discussions.

Supplementary Material

Data available from the Dryad Digital Repository: http://dx.doi.org/10.5061/dryad.h5f4117.

References

Bačák
M.
2014
.
Computing medians and means in Hadamard spaces
.
SIAM J. Optim.
24
:
1542
1566
.

Barden
D.
,
Le
H.
2018
.
The logarithm map, its limits and Fréchet means in orthant spaces
.
Proceedings of the London Mathematical Society,
117
:
751
789
.

Barthélemy
J.-P.
,
McMorris
F.
1986
.
The median procedure for n-trees
.
J. Classif.
3
:
329
334
.

Benner
P.
,
Bačák
M.
,
Bourguignon
P.-Y.
2014
.
Point estimates in phylogenetic reconstructions
.
Bioinformatics.
30
:
i534
i540
.

Billera
L.
,
Holmes
S.
,
Vogtmann
K.
2001
.
Geometry of the space of phylogenetic trees
.
Adv. Appl. Math.
27
:
733
767
.

Bouckaert
R.R.
2010
.
DensiTree: making sense of sets of phylogenetic trees
.
Bioinformatics.
26
:
1372
1373
.

Bridson
M.
,
Haefliger
A.
1999
.
Metric spaces of non-positive curvature
.
Berlin Heidelberg
:
Springer-Verlag
.

Chakerian
J.
,
Holmes
S.
2012
.
Computational tools for evaluating phylogenetic and hierarchical clustering trees
.
J. Comput. Graph. Stat.
21
:
581
599
.

Chen
X.
,
Lemmon
A.
,
Lemmon
E.
,
Pyron
R.
,
Burbrink
F.
2017a
.
Data from: using phylogenomics to understand the link between biogeographic origins and regional diversification in ratsnakes. Dryad Digital Repository
. https://doi.org/10.5061/dryad.r43s0.

Chen
X.
,
Lemmon
A.R.
,
Lemmon
E.M.
,
Pyron
R.A.
,
Burbrink
F.T.
2017b
.
Using phylogenomics to understand the link between biogeographic origins and regional diversification in ratsnakes
.
Mol. Phylogenet. Evol.
111
:
206
218
.

Degnan
J.H.
,
Rosenberg
N.A.
2006
.
Discordance of species trees with their most likely gene trees
.
PLoS Genet.
2
:
e68
.

Douady
C.J.
,
Delsuc
F.
,
Boucher
Y.
,
Doolittle
W.F.
,
Douzery
E.J.
2003
.
Comparison of Bayesian and maximum likelihood bootstrap measures of phylogenetic reliability
.
Mol. Biol. Evol.
20
:
248
254
.

Erixon
P.
,
Svennblad
B.
,
Britton
T.
,
Oxelman
B.
2003
.
Reliability of Bayesian posterior probabilities and bootstrap frequencies in phylogenetics
.
Syst. Biol.
52
:
665
673
.

Felsenstein
J.
1981
.
Evolutionary trees from DNA sequences: a maximum likelihood approach
.
J. Mol. Evol.
17
:
368
376
.

Felsenstein
J.
1985
.
Confidence limits on phylogenies: an approach using the bootstrap
.
Evolution.
39
:
783
791
.

Feragen
A.
,
Owen
M.
,
Petersen
J.
,
Wille
M.M.
,
Thomsen
L.H.
,
Dirksen
A.
,
de Bruijne
M.
2013
.
Tree-space statistics and approximations for large-scale analysis of anatomical trees
. In:
Gee,
James C.
Joshi,
Sarang
Pohl,
Kilian M.
Wells,
William M.
Zöllei,
Lilla
editors.
International Conference on Information Processing in Medical Imaging
.
Berlin, Heidelberg
:
Springer
, Vol.
2013
. p.
74
85
.

Hillis
D.
,
Heath
T.
,
St. John
K.
2005
.
Analysis and visualization of tree space
.
Syst. Biol.
54
:
471
482
.

Holder
M.T.
,
Sukumaran
J.
,
Lewis
P.O.
2008
.
A justification for reporting the majority-rule consensus tree in Bayesian phylogenetics
.
Syst. Biol.
57
:
814
821
.

Hotz
T.
,
Huckemann
S.
,
Le
H.
,
Marron
J.S.
,
Mattingly
J.C.
,
Miller
E.
,
Nolen
J.
,
Owen
M.
,
Patrangenaru
V.
,
Skwerer
S.
2013
.
Sticky central limit theorems on open books
.
Ann. Appl. Probab.
23
:
2238
2258
.

Huelsenbeck
J.P.
,
Rannala
B.
2004
.
Frequentist properties of Bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models
.
Syst. Biol.
53
:
904
913
.

Huggins
P.M.
,
Li
W.
,
Haws
D.
,
Friedrich
T.
,
Liu
J.
,
Yoshida
Y.
2011
.
Bayes estimators for phylogenetic reconstruction
.
Syst. Biol.
60
:
528
540
.

Huson
D.H.
,
Bryant
D.
2006
.
Application of phylogenetic networks in evolutionary studies
.
Mol. Biol. Evol.
23
:
254
267
.

Kruskal
J.B.
1964
.
Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis
.
Psychometrika.
29
:
1
27
.

Kuhner
M.K.
,
Yamato
J.
2014
.
Practical performance of tree comparison metrics
.
Syst. Biol.
64
:
205
214
.

Lewis
P.O.
,
Chen
M.-H.
,
Kuo
L.
,
Lewis
L.A.
,
Fučíková
K.
,
Neupane
S.
,
Wang
Y.-B.
,
Shi
D.
2016
.
Estimating Bayesian phylogenetic information content
.
Syst. Biol.
65
:
1009
1023
.

Liu
K.
,
Raghavan
S.
,
Nelesen
S.
,
Linder
C.R.
,
Warnow
T.
2009
.
Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees
.
Science.
324
:
1561
1564
.

Margush
T.
,
McMorris
F.R.
1981
.
Consensus n-trees
.
Bull. Math. Biol.
43
:
239
244
.

Miller
E.
,
Owen
M.
,
Provan
J.S.
2015
.
Polyhedral computational geometry for averaging metric phylogenetic trees
.
Adv. Appl. Math.
68
:
51
91
.

Murphy
W.J.
,
Eizirik
E.
,
O’Brien
S.J.
,
Madsen
O.
,
Scally
M.
,
Douady
C.J.
,
Teeling
E.
,
Ryder
O.A.
,
Stanhope
M.J.
,
de Jong
W.W.
,
Springer,
M.S.
2001
.
Resolution of the early placental mammal radiation using Bayesian phylogenetics
.
Science.
294
:
2348
2351
.

Nye
T.M.
2011
.
Principal components analysis in the space of phylogenetic trees
.
Ann. Stat.
39
:
2716
2739
.

Nye
T.M.
2014
.
An algorithm for constructing principal geodesics in phylogenetic treespace
.
IEEE/ACM Trans. Comput. Biol. Bioinform.
11
:
304
315
.

Nye
T.M.
,
Tang
X.
,
Weyenberg
G.
,
Yoshida
R.
2017
.
Principal component analysis and the locus of the fréchet mean in the space of phylogenetic trees
.
Biometrika.
104
:
901
922
.

Owen
M.
,
Provan
J.S.
2011
.
A fast algorithm for computing geodesic distances in tree space
.
IEEE/ACM Trans. Comput. Biol. Bioinform.
8
:
2
13
.

Pattengale
N.D.
2005
.
Tools for phylogenetic postprocessing [PhD thesis]
.
The University of New Mexico
.

Ponciano
J.M.
,
Burleigh
J.G.
,
Braun
E.L.
,
Taper
M.L.
2012
.
Assessing parameter identifiability in phylogenetic models using data cloning
.
Syst. Biol.
61
:
955
972
.

Rambaut
A.
,
Grass
N.C.
1997
.
Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees
.
Comp. Appl. Biosci. CABIOS
13
:
235
238
.

Rannala
B.
,
Yang
Z.
1996
.
Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference
.
J. Mol. Evol.
43
:
304
311
.

Robinson
D.F.
,
Foulds
L.R.
1979
.
Comparison of weighted labelled trees
. In:
Horadam
A. F.
,
Wallis,
W. D.
, editors.
Combinatorial mathematics VI.
Berlin, HeidelbergIn
:
Springer
, p.
119
126
.

Robinson
D.F.
,
Foulds
L.R.
1981
.
Comparison of phylogenetic trees
.
Math. Biosci.
53
:
131
147
.

Ronquist
F.
,
Teslenko
M.
,
van der Mark
P.
,
Ayres
D.L.
,
Darling
A.
,
Höhna
S.
,
Larget
B.
,
Liu
L.
,
Suchard
M.A.
,
Huelsenbeck
J.P.
2012
.
MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space
.
Syst. Biol.
61
:
539
542
.

Sanderson
M.J.
2003
.
r8s: inferring absolute rates of molecular evolution and divergence times in the absence of a molecular clock
.
Bioinformatics.
19
:
301
302
.

Schröder
E.
1870
.
Vier combinatorische probleme
.
Z. Math. Phys.
15
:
361
376
.

Skwerer
S.
,
Provan
S.
2015
.
Dynamic geodesics in treespace via parametric maximum flow
.
arXiv preprint arXiv:1512.03115
.

St. John
K.
2016
.
Review paper: The shape of phylogenetic treespace
.
Syst. Biol.
66
:
e83
e94
.

Stamatakis
A.
2014
.
RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies
.
Bioinformatics.
30
:
1312
1313
.

Steel
M.
,
Penny
D.
1993
.
Distributions of tree comparison metrics- some new results
.
Syst. Biol.
42
:
126
141
.

Stoye
J.
,
Evers
D.
,
Meyer
F.
1998
.
Rose: generating sequence families
.
Bioinformatics (Oxford, England).
14
:
157
163
.

Sturm
K.-T.
2003
.
Probability measures on metric spaces of nonpositive
. In:
Auscher
p.
,
Coulhon
T.
,
Grigor’yan
A.
, editors.
Heat Kernels and Analysis on Manifolds, Graphs, and Metric Spaces: Lecture Notes from a Quarter Program on Heat Kernels, Random Walks, and Analysis on Manifolds and Graphs: April 16-July 13, 2002, Emile Borel Centre of the Henri Poincaré Institute, Paris, France, Providence
,
Rhode Island, USA
:
The American Mathematical Society
, Vol.
338
. p.
357
.

Sukumaran
J.
,
Holder
M.T.
2010
.
Dendropy: a python library for phylogenetic computing
.
Bioinformatics.
26
:
1569
1571
.

Wilcoxon
F.
1945
.
Individual comparisons by ranking methods
.
Biometr. Bull.
1
:
80
83
.

Williams
T.A.
,
Foster
P.G.
,
Nye
T.M.
,
Cox
C.J.
,
Embley
T.M.
2012
.
A congruent phylogenomic signal places eukaryotes within the archaea
.
Proc. Biol. Sci.
279
:
4870
4879
.

Willis
A.
2019
.
Confidence sets for phylogenetic trees
.
J. Am. Stat. Assoc.
114
:
235
244
.

Zairis
S.
,
Khiabanian
H.
,
Blumberg
A.J.
,
Rabadan
R.
2016
.
Genomic data analysis in tree spaces
.
arXiv preprint arXiv:1607.07503
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Jeremy Brown
Jeremy Brown
Associate Editor
Search for other works by this author on: