# The Probabilities of Trees and Cladograms under Ford's [alpha]-Model.

1. IntroductionThe study of random growth models of rooted phylogenetic trees and the statistical properties of the shapes of the phylogenetic trees they produce was initiated almost one century ago by Yule [1] and it has gained momentum in the last 20 years: see, for instance, [2-8]. The final goal of this line of research is to understand the relationship between the forces that drive evolution and the topological properties of "real-life" phylogenetic trees [3, 9]; see also [10, Chapter 33]. One of the most popular such models is Ford's [alpha]-model for rooted bifurcating phylogenetic trees or cladograms [4]. It consists of a parametric model that generalizes both the uniform model (where new leaves are added equiprobably to any arc, giving rise to the uniform probability distribution on the sets of cladograms with a fixed set of taxa) and Yule's model (where new leaves are added equiprobably only to pendant arcs, i.e., to arcs ending in leaves) by allocating a possibly different probability (that depends on a parameter a and hence its name, "[alpha]-model") to the addition of the new leaves to pendant arcs or to internal arcs.

When models like Ford's model are used to contrast topological properties of phylogenetic trees contained in databases like TreeBase (https://treebase.org), only their general properties (moments, asymptotic behavior) are employed. But, in the course of a research where we have needed to compute the probabilities of several specific cladograms under this model [11], we have noticed that the explicit formulas that Ford gives in [4, [section]3.5] for the probabilities of cladograms and of tree shapes (unlabeled rooted bifurcating trees) are wrong, failing for some trees with n^ 8 leaves; see Propositions 29 and 32 in [4], with the definition of q given in page 30 therein, for Ford's formulas.

So, to help the future user of Ford's model, in this paper we give the correct explicit formulas for these probabilities. This paper is accompanied by the GitHub page https://github.com/biocom-uib/prob-alpha where the interested reader can find a SageMath [12] module to compute these probabilities and their explicit values on the sets [T.sub.n] of cladograms with n leaves labeled 1, ..., n, for every n from 2 to 8.

2. Preliminaries

2.1. Definitions, Notations, and Conventions. Throughout this paper, by a tree T, we mean a rooted bifurcating tree. As it is customary, we understand T as a directed graph, with its arcs pointing away from the root, which we shall denote by [r.sub.T]. Then, all nodes in T have out-degree either 0 (its leaves, which form the set L(T)) or 2 (its internal nodes, which form the set [V.sub.jnt](T)). The children of an internal node v are those nodes u such that (v, u) is an arc in T, and they form the set child(v). A node x is a descendant of a node v when there exists a directed path from v to x in T. For every node v, the subtree [T.sub.v] of T rooted at v is the subgraph of T induced on the set of descendants of v.

A tree T is ordered when it is endowed with an ordering [[??].sub.v] on every set child(v). A cladogram (resp., an ordered cladogram) on a set of taxa [SIGMA] is a tree (resp., an ordered tree) with its leaves bijectively labeled in [SIGMA]. Whenever we want to stress the fact that a tree is not a cladogram, that is, it is an unlabeled tree, we shall use the term tree shape.

It is important to point out that although ordered trees have no practical interest from the phylogenetic point of view, because the orderings on the sets of children of internal nodes do not carry any phylogenetic information, they are useful from the mathematical point of view, because the existence of the orderings allows one to easily prove certain extra properties that can later be translated to the unordered setting (cf. Proposition 1).

An isomorphism of ordered trees is an isomorphism of rooted trees that moreover preserves these orderings. An isomorphism of cladograms (resp., of ordered cladograms) is an isomorphism of trees (resp., of ordered trees) that preserves the leaves' labels. We shall always identify a tree shape, an ordered tree shape, a cladogram, or an ordered cladogram, with its isomorphism class, and in particular we shall make henceforth the abuse of language of saying that two of these objects, T, T', are the same, in symbols T = T', when they are (only) isomorphic. We shall denote by [T.sup.*.sub.n] and O[T.sup.*.sub.n], respectively, the sets of tree shapes and of ordered tree shapes with n leaves. Given any finite set of taxa [SIGMA], we shall denote by [T.sub.[SIGMA]] and O[T.sub.[SIGMA]], respectively, the sets of cladograms and of ordered cladograms on [SIGMA]. When the specific set [SIGMA] is unrelevant and only its cardinal matters, we shall write [T.sub.n] and O[T.sub.n] (with n = [absolute value of (X)]) instead of [T.sub.[SIGMA]] and O[T.sub.[SIGMA]], and then we shall understand that [SIGMA] is [n] = {1, 2, ..., n}.

There exist natural isomorphism-preserving forgetful mappings

[mathematical expression not reproducible]

that "forget" the orderings or the labels of the trees. In particular, we shall call the image of a cladogram under [pi] its shape. Figure 1 depicts an example of images under these forgetful mappings.

Let us introduce some more notations. For every node v in a tree T, [[kappa].sub.T](v) is its number of descendant leaves. For every internal node v in an ordered tree T, with children [v.sub.1] [[??].sub.v] [v.sub.2], its numerical split is the ordered pair N[S.sub.T](v) = ([[kappa].sub.T]([v.sub.1]), [[kappa].sub.T]([v.sub.2])). If, instead, T is unordered and if child(v) = {[v.sub.1], [v.sub.2]} with [[kappa].sub.T]([v.sub.1]) [less than or equal to] [[kappa].sub.T]([v.sub.2]), then N[S.sub.T](v) = ([[kappa].sub.T]([v.sub.1]), [[kappa].sub.T]([v.sub.2])). In both cases, the multiset of numerical splits of T is NS(T) = {N[S.sub.T](v) | v [member of] [V.sub.int](T)}. For instance, if T is the cladogram depicted in Figure 2, then

NS (T) = {(1, 1), (1, 1), (1, 1), (2, 2), (1,4), (2, 5)}. (1)

A symmetric branch point in a tree T is an internal node v such that if [v.sub.1] and [v.sub.2] are its children, then the subtrees [mathematical expression not reproducible] and [mathematical expression not reproducible] of T rooted at them have the same shape. For instance, the symmetric branch points in the cladogram depicted in Figure 2 are those filled in black.

Given two cladograms T and T' on [SIGMA] and [SIGMA]', respectively, with [SIGMA] [intersection] [SIGMA]' = 0, their root join is the cladogram T * T' on [SIGMA] [union] [SIGMA]' obtained by connecting the roots of T and T' to a (new) common root r; see Figure 3. If T, T' are ordered cladograms, T * T' is ordered by inheriting the orderings on T and T' and ordering the children of the new root r as [r.sub.T] [[??].sub.r] [r.sub.T']. If T and T' are tree shapes, a similar construction yields a tree shape T * T'; if they are moreover ordered, then T * T' becomes an ordered tree shape as explained above.

2.2. The [alpha]-Model. Ford's [alpha]-model [4] defines, for every n [greater than or equal to] 1, a family of probability density functions [P.sup.(*).sub.[alpha],n] on [T.sup.*.sub.n] that depends on one parameter [alpha] [member of] [0, 1], and then it translates this family into three other families of probability density functions [P.sub.[alpha],n] on [T.sub.n], [P.sup.(o,*).sub.[alpha],n] on O[T.sup.*.sub.n], and [P.sup.(o).sub.[alpha],n] on O[T.sub.n], by imposing that the probability of a tree shape is equally distributed among its preimages under [pi], [[pi].sub.o], and [pi] [??] [[pi].sub.o] = [[pi].sub.o,*] [??] [[pi].sub.*], respectively.

It is well known [13] that every T [member of] [T.sub.n] can be obtained in a unique way by adding recurrently to a single node labeled 1 new leaves labeled 2, ..., n to arcs (i.e., splitting an arc (u, v) into two arcs (u, w) and (w, v) and then adding a new arc from the inserted node w to a new leaf) or to a new root (i.e., adding a new root w and new arcs from w to the old root and to a new leaf). The value of [P.sup.(*).sub.[alpha],n] ([T.sup.*]) for [T.sup.*] [member of] [T.sup.*.sub.n] is determined through all possible ways of constructing cladograms with shape [T.sup.*] in this way. More specifically,

(1) if [T.sub.1] and [T.sub.2] denote, respectively, the only cladograms in [T.sub.1] and [T.sub.2], let [P'.sub.[alpha],1] ([T.sub.1]) = [P'.sub.[alpha],2] ([T.sub.2]) = 1;

(2) for every m = 3, ..., n, let [T.sub.m] [member of] [T.sub.m] be obtained by adding a new leaf labeled m to [T.sub.m-1]. Then

[mathematical expression not reproducible] (2)

(3) When the desired number n of leaves is reached, the probability of every tree shape [T.sup.*.sub.n] [member of] [T.sup.*.sub.n] is defined as

[mathematical expression not reproducible] (3)

For instance, Figure 4 shows the construction of two cladograms in [T.sub.5] with the same shape and how their probability [P'.sub.[alpha],5] is built using the recursion in Step (2). If we generate all cladograms in [T.sub.5] with this shape, we compute their probabilities [P'.sub.[alpha],5], and then we add up all these probabilities, we obtain the probability [P.sup.(*).sub.[alpha],5] of this shape, which turns out to be 2(1 - [alpha])/(4 - [alpha]); cf. [4, Figure 23].

Once [P.sup.(*).sub.[alpha],n] is defined on [T.sup.*.sub.n],it is transported to [T.sub.n], O[T.sup.*.sub.n], and O[T.sub.n] by defining the probability of an object in one of these sets as the probability of its image in T* divided by the number of preimages of this image:

(i) For every T [member of] [T.sub.n], if [pi](T) = [T.sup.*] [member of] [T.sup.*.sub.n] and it has k symmetric branch points, then

[P.sub.lapha],n] (T) = [2.sup.k]/n! x [P.sup.(*).sub.[alpha],n]([T.sup.*]), (4)

because [absolute value of ([[pi].sup.-1] ([T.sup.*]))] = n!/[2.sup.k] (see, e.g., [4, Lemma 31]).

(ii) For every [T.sub.o] [member of] O[T.sub.n], if [[pi].sub.o]([T.sub.o]) = T [member of] [T.sub.n], then

[P.sup.(o).sub.[alpha],n]([T.sub.o]) = 1/[2.sup.n-1] x [P.sub.[alpha],n] (T), (5)

because [absolute value of [[pi].sup.-1.sub.o](T))] = [2.sup.n-1] (T has [2.up.n-1] different preimages under [[pi].sub.o], obtained by taking all possible different combinations of orderings on the n-1 sets child(v), v [member of] [V.sub.int]([T.sup.*])).

(iii) For every [T.sup.*.sub.o] [member of] O[T.sup.*.sub.n], if [[pi].sub.o,*]([T.sup.*.sub.o]) = [T.sup.*] [member of] [T.sup.*.sub.n] and it has k symmetric branch points, then

[P.sup.(o,*).sub.[alpha],n] ([T.sup.*.sub.o]) = 1/[2.sup.n-k-1] x [P.sup.(*).sub.[alpha],n] ([T.sup.*]), (6)

because [absolute value of ([[pi].sup.-1.sub.o,*]([T.sup.*]))] = [2.sup.n-1-k] (from the [2.sup.n-1] possible preimages of [T.sup.*] under [[pi].sub.o,*], defined by all possible different combinations of orderings on the n - 1 sets child(v), v [member of] [V.sub.int]([T.sup.*]), those differing only on the orderings on the children of the k symmetric branch points are actually the same ordered tree shape).

The family [([P.sup.(o,*).sub.[alpha],n]).sub.n] of probabilities of ordered tree shapes satisfies the useful Markov branching recurrence (in the sense of [2, [section]4]) given by the following proposition. In it and in the sequel, let, for every a, b [member of] [Z.sup.+],

[q.sub.[alpha]] (a, b) = [[GAMMA].sub.[alpha]](a) [[GAMMA].sub.[alpha]](b)/[[GAMMA].sub.[alpha]] (a + b) x [[phi].sub.[alpha]] (a, b), (7)

where

[mathematical expression not reproducible] (8)

and [[GAMMA].sub.[alpha]] : [Z.sup.+] [right arrow] R is the mapping defined by [[GAMMA].sub.[alpha]] (1) = 1 and, for every n [greater than or equal to] 2, [[GAMMA].sub.[alpha]](n) = (n - 1 - [alpha]) x [[GAMMA].sub.[alpha]] (n - 1).

Proposition 1. For every 0 < m < n and for every [T.sup.*.sub.m] [member of] O[T.sup.*.sub.m] and [T.sup.*.sub.n-m] [member of] O[T.sup.*.sub.n-m],

[mathematical expression not reproducible] (9)

This recurrence, together with the fact that [P.sup.(o,*).sub.[alpha],1] of a single node is 1, implies that, for every [T.sup.*.sub.o] [member of] O[T.sup.*.sub.n],

[mathematical expression not reproducible] (10)

For proofs of Proposition 1 and (10), see Lemma 27 and Proposition 28 in [4], respectively.

3. Main Results

Our first result is an explicit formula for [P.sub.[alpha],n](T), for every n [greater than or equal to] 1 and T [member of] [T.sub.n].

Proposition 2. For every T [member of] [T.sub.n], its probability under the [alpha]-model is

[mathematical expression not reproducible] (11)

Proof. Given T [member of] [T.sub.n], let [T.sub.o] be any ordered cladogram such that [[pi].sub.o]([T.sub.o]) = T, and let [T.sup.*.sub.o] = [[pi].sub.*]([T.sub.o] [member of] O[T.sup.*.sub.n] and [T.sup.*] = [pi](T) = [[pi].sub.o,*]([T.sup.*.sub.o]). If [T.sup.*] has k symmetric branch points, then, by (4), (6), and (10),

[mathematical expression not reproducible] (12)

Now, on the one hand, it is easy to check that

NS (T) = {(min {a, b}, max {a, b}) | (a, b) [member of] NS ([T.sup.*.sub.o])}, (13)

and therefore, since [q.sub.[alpha]] is symmetric,

[mathematical expression not reproducible] (14)

It remains to simplify this product. If, for every v [member of] [V.sub.int](T), we denote its children by [v.sub.1] and [v.sub.2], then

[mathematical expression not reproducible] (15)

For every v [member of] [V.sub.int](T) \ {[r.sub.T]}, the term [[GAMMA].sub.[alpha]]([[kappa].sub.T](v)) appears twice in this product: in the denominator of the factor corresponding to v itself and in the numerator of the factor corresponding to its parent. Therefore, all terms [[GAMMA].sub.[alpha]]([[kappa].sub.T](v)) in this product vanish except [[GAMMA].sub.[alpha]]([[kappa].sub.T]([r.sub.T])) = [[GAMMA].sub.[alpha]](n) (that appears in the denominator of its factor) and every [[GAMMA].sub.[alpha]]([[kappa].sub.T](v)) = [[GAMMA].sub.[alpha]](1) = 1 with v, a leaf. Thus,

[mathematical expression not reproducible] (16)

as we claimed.

Remark 3. Ford states (see [4, Proposition 32 and page 30]) that if T [member of] [T.sub.n], then

[mathematical expression not reproducible] (17)

where k is the number of symmetric branching points in T and

[mathematical expression not reproducible] (18)

If we simplify [[PI].sub.(a,b)[member of]NS(T)] [[??].sub.[alpha]](a, b) as in the proof of Proposition 2, this formula for [P.sub.[alpha],n] (T) becomes

[mathematical expression not reproducible] (19)

where m is the number of internal nodes whose children have different numbers of descendant leaves. This formula does not agree with the one given in Proposition 2 above, because

[mathematical expression not reproducible] (20)

and, hence, it may happen that k + m < n - 1. The first example of a cladogram with this property (and the only one, up to relabeling, with at most 8 leaves) is the cladogram [??] [member of] [T.sub.8] depicted in Figure 5. For it, our formula gives (see (8.22) in the document ProblsAlpha.pdf in https://github.com/biocom-uib/prob-alpha)

[P.sub.[alpha],8] ([??]) = [(1 - [alpha]).sup.2] (2 - [alpha])/126 (7 - [alpha]) (6 - [alpha]) (5 - [alpha]) (3 - [alpha]) (21)

while expression (19) assigns to [??] a probability of half this value:

[(1 - [alpha]).sup.2] (2 - [alpha])/252 (7 - [alpha]) (6 - [alpha]) (5 - [alpha]) (3 - [alpha]) (22)

This last value cannot be right, for several reasons. Firstly, by [4, [section]3.12], when [alpha] = 1/2, Ford's model is equivalent to the uniform model, where every cladogram in [T.sub.n] has the same probability

1/[absolute value of (B[T.sub.n])] = 1(2n - 3)!! (23)

and when [alpha] = 0, Ford's model gives rise to the Yule model [1, 14], where the probability of every T [member of] [T.sub.n] is

[mathematical expression not reproducible] (24)

In particular, [P.sub.1/2,8]([??]) should be equal to 1/135135 and [P.sub.0,8]([??]) should be equal to 1/19845. Both values are consistent with our formula, while expression (22) yields half these values.

As a second reason, which can be checked using a symbolic computation program, let us mention that if we take expression (22) as the probability of [??] and hence of all other cladograms with its shape, and we assign to all other cladograms in [T.sub.8] the probabilities computed with Proposition 2, which agree on them with the values given by (19) (they are also provided in the aforementioned document ProblsAlpha.pdf), these probabilities do not add up 1.

Combining Proposition 2 and (4) we obtain the following result.

Corollary 4. For every [T.sup.*] [member of] [T.sup.*.sub.n] with k symmetric branch points,

[mathematical expression not reproducible] (25)

This formula does not agree, either, with the one given in [4, Proposition 29]: the difference lies again in the same factor of 2 to the power of the number of internal nodes that are not symmetric branch points but whose children have the same number of descendant leaves.

The family of density mappings [([P.sub.[alpha],n]).sub.n] satisfies the following Markov branching recurrence.

Corollary 5. For every 0 < m < n and for every [T.sub.m] [member of] [T.sub.m] and [T.sub.n-m] [member of] [T.sub.n-m],

[mathematical expression not reproducible] (26)

Proof. If [T.sub.m] [member of] [T.sub.m] and [T.sub.n-m] [member of] [T.sub.n-m], then

[mathematical expression not reproducible] (27)

as we claimed.

Remark 6. Against what is stated in [4], [([P.sup.(*).sub.[alpha],n]).sub.n] does not satisfy any Markov branching recurrence; that is, there does not exist any symmetric mapping Q : [Z.sup.+] x [Z.sup.+] [right arrow] R such that, for every k, l [greater than or equal to] 1 and for every [T.sub.k] [member of] [T.sup.*.sub.k] and [T.sub.l] [mmeber of] [T.sup.*.sub.l],

[P.sup.(*).sub.a,k+l] ([T.sub.k] * [T.sub.k]) = Q(k, l) x [P.sup.(*).sub.[alpha],k] ([T.sub.k]) x [P.sup.(*).sub.[alpha],l] ([T.sub.l]). (28)

Indeed, let [T.sup.*.sub.m], [[??].sup.*.sub.m] [member of] [T.sup.*.sub.m] be any two different tree shapes, both with m leaves and k symmetric branch points, for instance, the tree shapes in [T.sup.*.sub.6] depicted in Figure 6. Then,

[mathematical expression not reproducible] (29)

In this case, [T.sup.*.sub.m] * [T.sup.*.sub.m] [member of] [T.sup.*.sub.2m] has 2k + 1 symmetric branch points and therefore

[mathematical expression not reproducible] (30)

while [T.sup.*.sub.m] * [[??].sup.*.sub.m] [member of] [T.sup.*.sub.2m] has 2k symmetric branch points and therefore

[mathematical expression not reproducible] (31)

and [q.sub.[alpha]](m, m) [not equal to] [2q.sub.[alpha]](m, m). This shows that there does not exist any well-defined, single real number Q(m, m) such that

[mathematical expression not reproducible] (32)

for every [T.sup.*.sub.1,m], [T.sup.*.sub.2,m] [member of] [T.sup.*.sub.m].

Data Availability

The data used to support the findings of this study are available at the Github page that accompanies this paper.

https://doi.org/10.1155/2018/1916094

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by Spanish Ministry of Economy and Competitiveness and European Regional Development Fund Project DPI2015-67082-P (MINECO/FEDER). The authors thank G. Cardona and G. Riera for several comments on the SageMath module that accompanies this paper.

References

[1] G. U. Yule, "A Mathematical Theory of Evolution, Based on the Conclusions of Dr. J. C. Willis, F.R.S.," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 213, no. 402-410, pp. 21-87, 1925.

[2] D. Aldous, "Probability distributions on cladograms," in Random discrete structures, D. Aldous and R. Pemantle, Eds., vol. 76, pp. 1-18, Springer, New York, NY, USA, 1996.

[3] M. G. B. Blum and O. Francois, "Which random processes describe the tree of life? A large-scale study of phylogenetic tree imbalance," Systematic Biology, vol. 55, no. 4, pp. 685-691, 2006.

[4] D. J. Ford, "Probabilities on cladograms: Introduction to the alpha model," PhD Thesis (Stanford University), https://arxiv .org/abs/math/0511246.

[5] S. Keller-Schmidt, M. Tugrul, V. M. Eguiluz, E. Hernandez-Garcia, and K. Klemm, "Anomalous scaling in an age-dependent branching model," Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 91, no. 2, Article ID 022803, 2015.

[6] M. Kirkpatrick and M. Slatkin, "Searching for Evolutionary Patterns in the Shape of a Phylogenetic Tree," Evolution, vol. 47, no. 4, pp. 1171-1181, 1993.

[7] L. Popovic and M. Rivas, "Topology and inference for Yule trees with multiple states," Journal of Mathematical Biology, vol. 73, no. 5, pp. 1251-1291, 2016.

[8] R. Sainudiin and A. Veber, "A Beta-splitting model for evolutionary trees," Royal Society Open Science, vol. 3, no. 5, Article ID 160016, 2016.

[9] A. O. Mooers and S. B. Heard, "Inferring evolutionary process from phylogenetic tree shape," The Quarterly Review of Biology, vol. 72, no. 1, pp. 31-54, 1997.

[10] J. Felsenstein, Inferring Phylogenies, Sinauer Associates Inc., 2004.

[11] T. M. Coronado, A. Mir, F. Rossello, and G. Valiente, "A balance index for phylogenetic trees based on quartets," https://arxiv .org/abs/1801.05411.

[12] SageMath, the Sage Mathematics Software System (Version 7.6), The Sage Developers (2017) http://www.sagemath.org.

[13] L. L. Cavalli-Sforza and A. W. Edwards, "Phylogenetic Analysis: Models and Estimation Procedures," Evolution, vol. 21, no. 3, pp. 550-570, 1967.

[14] E. F. Harding, "The probabilities of rooted tree-shapes generated by random bifurcation," Advances in Applied Probability, vol. 3, pp. 44-77, 1971.

Tomas M. Coronado (iD), Arnau Mir, and Francesc Rossello (iD)

Balearic Islands Health Research Institute (IdISBa) and Department of Mathematics and Computer Science, University of the Balearic Islands, 07122 Palma, Spain

Correspondence should be addressed to Francesc Rossello; cesc.rossello@uib.es

Received 1 February 2018; Accepted 8 March 2018; Published 18 April 2018

Academic Editor: Bola Tothmerosz

Caption: Figure 1: An example of images under the forgetful mappings between (ordered and unordered) cladograms and tree shapes. In the ordered objects, the ordering is represented by the nodes' colors: gray [??] white.

Caption: Figure 2: A cladogram in [T.sub.7]. The black nodes are its symmetric branch points.

Caption: Figure 3: The root join T * T'.

Caption: Figure 4: Two examples of computations of the probability [P'.sub.[alpha],n] of a cladogram through its construction in Step (2) of the definition of the [alpha]-model.

Caption: Figure 5: The cladogram [??] [member of] [T.sub.8] used in Remark 3.

Caption: Figure 6: The tree shapes in [T.sup.*.sub.6] mentioned in Remark 6.

Printer friendly Cite/link Email Feedback | |

Title Annotation: | Research Article |
---|---|

Author: | Coronado, Tomas M.; Mir, Arnau; Rossello, Francesc |

Publication: | The Scientific World Journal |

Date: | Jan 1, 2018 |

Words: | 4060 |

Previous Article: | Relationship between Risk Behavior for Eating Disorders and Dental Caries and Dental Erosion. |

Next Article: | Detection of Vertical Root Fractures Using Cone-Beam Computed Tomography in the Presence and Absence of Gutta-Percha. |

Topics: |