-
Inconsistency of parsimony under the multispecies coalescent
Authors:
Daniel Rickert,
Wai-Tong Louis Fan,
Matthew Hahn
Abstract:
While it is known that parsimony can be statistically inconsistent under certain models of evolution due to high levels of homoplasy, the consistency of parsimony under the multispecies coalescent (MSC) is less well studied. Previous studies have shown the consistency of concatenated parsimony (parsimony applied to concatenated alignments) under the MSC for the rooted 4-taxa case under an infinite-sites model of mutation; on the other hand, other work has also established the inconsistency of concatenated parsimony for the unrooted 6-taxa case. These seemingly contradictory results suggest that concatenated parsimony may fail to be consistent for trees with more than 5 taxa, for all unrooted trees, or for some combination of the two. Here, we present a technique for computing the expected internal branch lengths of gene trees under the MSC. This technique allows us to determine the regions of the parameter space of the species tree under which concatenated parsimony fails for different numbers of taxa, for rooted or unrooted trees. We use our new approach to demonstrate that there are always regions of statistical inconsistency for concatenated parsimony for the 5- and 6-taxa cases, regardless of rooting. Our results therefore suggest that parsimony is not generally dependable under the MSC.
Submitted 4 July, 2024; v1 submitted 2 July, 2024;
originally announced July 2024.
-
Correlation of coalescence times in a diploid Wright-Fisher model with recombination and selfing
Authors:
David Kogan,
Dimitrios Diamantidis,
John Wakeley,
Wai-Tong Louis Fan
Abstract:
The correlation among the gene genealogies at different loci is crucial in biology, yet challenging to understand because such correlation depends on many factors including genetic linkage, recombination, natural selection and population structure. Based on a diploid Wright-Fisher model with a single mating type and partial selfing for a constant large population with size $N$, we quantify the combined effect of genetic drift and two competing factors, recombination and selfing, on the correlation of coalescence times at two linked loci for samples of size two. Recombination decouples the genealogies at different loci and decreases the correlation while selfing increases the correlation. We obtain explicit asymptotic formulas for the correlation for four scaling scenarios that depend on whether the selfing probability and the recombination probability are of order $O(1/N)$ or $O(1)$ as $N$ tends to infinity. Our analytical results confirm that the asymptotic lower bound in [King, Wakeley, Carmi (Theor. Popul. Biol. 2018)] is sharp when the loci are unlinked and when there is no selfing, and provide a number of new formulas for other scaling scenarios that have not been considered before. We present asymptotic results for the variance of Tajima's estimator of the population mutation rate for infinitely many loci as $N$ tends to infinity. When the selfing probability is of order $O(1)$, equal to a positive constant $s$ for all $N$, and the samples at both loci are in the same individual, the variance of Tajima's estimator tends to $s/2$ (hence remains positive) even when the recombination rate, the number of loci and the population size all tend to infinity.
Submitted 18 October, 2023;
originally announced October 2023.
-
Quasi-stationary behavior of the stochastic FKPP equation on the circle
Authors:
Wai-Tong Louis Fan,
Oliver Tough
Abstract:
We consider the stochastic Fisher-Kolmogorov-Petrovsky-Piscunov (FKPP) equation on the circle $\mathbb{S}$, \begin{equation*}
\partial_t u(t,x) = \frac{\alpha}{2}\Delta u + \beta\,u(1-u) + \sqrt{\gamma\,u(1-u)}\,\dot{W}, \qquad (t,x)\in(0,\infty)\times \mathbb{S}, \end{equation*} where $\dot{W}$ is space-time white noise. While any solution will eventually be absorbed at one of two states, the constant 1 and the constant 0 on the circle, essentially nothing had been established about the absorption time (also called the fixation time in population genetics), or about the long-time behavior prior to absorption. We establish the existence and uniqueness of the quasi-stationary distribution (QSD) for the solution of the stochastic FKPP. Moreover, we show that the solution conditioned on not being absorbed at time $t$ converges to this unique QSD as $t\to\infty$, for any initial distribution, and characterize the leading-order asymptotics for the tail distribution of the fixation time. We obtain explicit calculations in the neutral case ($\beta=0$), quantifying the effect of spatial diffusion on fixation time. We explicitly express the fixation rate in terms of the migration rate $\alpha$ for all $\alpha\in (0,\infty)$, finding in particular that the fixation rate is given by $\gamma[1-\frac{\gamma}{12\alpha}+\mathcal{O}(\frac{\gamma^2}{\alpha^2})]$ for fast migration and $\pi^2\alpha[1-\frac{8\alpha}{\gamma}+\mathcal{O}(\frac{\alpha^2}{\gamma^2})]$ for slow migration. Our proof relies on the observation that the absorbed (or killed) stochastic FKPP is dual to a system of $2$-type branching-coalescing Brownian motions killed when one type dies off, and on leveraging the relationship between these two killed processes.
Submitted 9 January, 2024; v1 submitted 19 September, 2023;
originally announced September 2023.
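The two neutral-case expansions quoted in the abstract above can be evaluated numerically. The following is a minimal sketch that simply drops the error terms of those expansions; the function names are hypothetical, and `alpha` and `gamma` stand for the migration rate and the noise coefficient in the equation. It is an illustration of the stated formulas, not the authors' code.

```python
import math


def fixation_rate_fast_migration(alpha: float, gamma: float) -> float:
    """Truncated expansion gamma * (1 - gamma / (12 * alpha)).

    Only meaningful when gamma / alpha is small (fast migration);
    the O(gamma^2 / alpha^2) correction is dropped.
    """
    return gamma * (1.0 - gamma / (12.0 * alpha))


def fixation_rate_slow_migration(alpha: float, gamma: float) -> float:
    """Truncated expansion pi^2 * alpha * (1 - 8 * alpha / gamma).

    Only meaningful when alpha / gamma is small (slow migration);
    the O(alpha^2 / gamma^2) correction is dropped.
    """
    return math.pi ** 2 * alpha * (1.0 - 8.0 * alpha / gamma)
```

To leading order the rate is close to `gamma` when migration is fast and close to `pi^2 * alpha` when migration is slow, so in the slow regime the truncated formula is governed mainly by `alpha`.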
-
Latent mutations in the ancestries of alleles under selection
Authors:
Wai-Tong Louis Fan,
John Wakeley
Abstract:
We consider a single genetic locus with two alleles $A_1$ and $A_2$ in a large haploid population. The locus is subject to selection and two-way, or recurrent, mutation. Assuming the allele frequencies follow a Wright-Fisher diffusion and have reached stationarity, we describe the asymptotic behaviors of the conditional gene genealogy and the latent mutations of a sample with known allele counts, when the count $n_1$ of allele $A_1$ is fixed, and when either or both the sample size $n$ and the selection strength $\lvert\alpha\rvert$ tend to infinity. Our study extends previous work under neutrality to the case of non-neutral rare alleles, asserting that when selection is not too strong relative to the sample size, even if it is strongly positive or strongly negative in the usual sense ($\alpha\to -\infty$ or $\alpha\to +\infty$), the number of latent mutations of the $n_1$ copies of allele $A_1$ follows the same distribution as the number of alleles in the Ewens sampling formula. On the other hand, very strong positive selection relative to the sample size leads to neutral gene genealogies with a single ancient latent mutation. We also demonstrate robustness of our asymptotic results against changing population sizes, when one of $\lvert\alpha\rvert$ or $n$ is large.
Submitted 26 April, 2024; v1 submitted 13 June, 2023;
originally announced June 2023.
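The abstract above compares the number of latent mutations with the number of alleles in the Ewens sampling formula. As a rough illustration of that reference distribution only, the sketch below computes the probability mass function of the number of distinct alleles in a sample of size `n1`, using the standard sequential construction in which the $(i+1)$-th sampled gene founds a new allele with probability $\theta/(\theta+i)$. The function name and the parameter `theta` are assumptions made here; the abstract does not say which parameter value applies.

```python
from typing import List


def ewens_allele_count_pmf(n1: int, theta: float) -> List[float]:
    """Return pmf where pmf[k] = P(K = k) for the number K of distinct
    alleles in a sample of size n1 under the Ewens sampling formula
    with parameter theta (hypothetical parameter, see lead-in)."""
    if n1 < 1 or theta <= 0:
        raise ValueError("need n1 >= 1 and theta > 0")
    pmf = [0.0] * (n1 + 1)
    pmf[1] = 1.0  # the first sampled gene always founds one allele
    for i in range(1, n1):
        p_new = theta / (theta + i)  # chance that gene i+1 is a new allele
        nxt = [0.0] * (n1 + 1)
        for k in range(1, i + 1):
            nxt[k] += pmf[k] * (1.0 - p_new)
            nxt[k + 1] += pmf[k] * p_new
        pmf = nxt
    return pmf
```

The mean of this distribution is the sum of $\theta/(\theta+i)$ over $i = 0, \dots, n_1 - 1$, which gives a quick sanity check on the output.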
-
Patch formation driven by stochastic effects of interaction between viruses and defective interfering particles
Authors:
Qiantong Liang,
Johnny Yang,
Wai-Tong Louis Fan,
Wing-Cheong Lo
Abstract:
Defective interfering particles (DIPs) are virus-like particles that occur naturally during virus infections. These particles are defective, lacking essential genetic material for replication, but they can interact with the wild-type virus and potentially be used as therapeutic agents. However, the effect of DIPs on infection spread is still unclear due to complicated stochastic effects and nonlinear spatial dynamics. In this work, we develop a model with a new hybrid method to study the spatial-temporal dynamics of virus and DIP co-infections within hosts. We present two different scenarios of virus production and compare the results from deterministic and stochastic models to demonstrate how the stochastic effect is involved in the spatial dynamics of virus transmission. We quantitatively study the spread features of the virus, including the formation and the speed of virus spread and the emergence of stochastic patchy patterns of virus distribution. Our simulations simultaneously capture observed spatial spread features in the experimental data, including the spread rate of the virus and its patchiness. The results demonstrate that DIPs can slow down the growth of virus particles and make the spread of the virus more patchy.
Submitted 24 March, 2023;
originally announced March 2023.
-
Impossibility of phylogeny reconstruction from $k$-mer counts
Authors:
Wai-Tong Louis Fan,
Brandon Legried,
Sebastien Roch
Abstract:
We consider phylogeny estimation under a two-state model of sequence evolution by site substitution on a tree. In the asymptotic regime where the sequence lengths tend to infinity, we show that for any fixed $k$ no statistically consistent phylogeny estimation is possible from $k$-mer counts over the full leaf sequences alone. Formally, we establish that the joint distributions of $k$-mer counts over the entire leaf sequences on two distinct trees have total variation distance bounded away from $1$ as the sequence length tends to infinity. Our impossibility result implies that statistical consistency requires more sophisticated use of $k$-mer count information, such as block techniques developed in previous theoretical work.
Submitted 1 March, 2022; v1 submitted 27 October, 2020;
originally announced October 2020.
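For readers unfamiliar with the statistic discussed in the abstract above, the following minimal sketch counts $k$-mers over a full two-state sequence. The function name `kmer_counts` and the encoding of sequences as strings over `0` and `1` are assumptions made for illustration; the paper's exact conventions may differ.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every length-k window (k-mer) of the full sequence.

    Windows overlap and slide by one site, so a sequence of length L
    contributes L - k + 1 windows (none if L < k).
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
```

The counts record how often each pattern occurs but not where it occurs, which is the kind of summary the impossibility result concerns.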
-
Impossibility of consistent distance estimation from sequence lengths under the TKF91 model
Authors:
Wai-Tong Louis Fan,
Brandon Legried,
Sebastien Roch
Abstract:
We consider the problem of distance estimation under the TKF91 model of sequence evolution by insertions, deletions and substitutions on a phylogeny. In an asymptotic regime where the expected sequence lengths tend to infinity, we show that no consistent distance estimation is possible from sequence lengths alone. More formally, we establish that the distributions of pairs of sequence lengths at different distances cannot be distinguished with probability going to one.
Submitted 23 May, 2020;
originally announced May 2020.
-
Stochastic PDEs on graphs as scaling limits of discrete interacting systems
Authors:
Wai-Tong Louis Fan
Abstract:
Stochastic partial differential equations (SPDE) on graphs were introduced by Cerrai and Freidlin [Ann. Inst. Henri Poincaré Probab. Stat. 53 (2017) 865-899]. This class of stochastic equations in infinite dimensions provides a minimal framework for the study of the effective dynamics of much more complex systems. However, how they emerge from microscopic individual-based models is still poorly understood, partly due to complications near vertex singularities. In this work, motivated by the study of the dynamics and the genealogies of expanding populations in spatially structured environments, we obtain a new class of SPDE on graphs of Wright-Fisher type which have nontrivial boundary conditions on the vertex set. We show that these SPDE arise as scaling limits of suitably defined biased voter models (BVM), which extends the scaling limits of Durrett and Fan [Ann. Appl. Probab. 26 (2016) 456-490]. We further obtain a convergent simulation scheme for each of these SPDE in terms of a system of Itô SDEs, which is useful when the size of the BVM is too large for stochastic simulations. These give the first rigorous connection between SPDE on graphs and more discrete models, specifically, interacting particle systems and interacting SDEs. Uniform heat kernel estimates for symmetric random walks approximating diffusions on graphs are the keys to our proofs.
Submitted 17 November, 2020; v1 submitted 5 August, 2017;
originally announced August 2017.
-
Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling
Authors:
Wai-Tong Louis Fan,
Sebastien Roch
Abstract:
In evolutionary biology, the speciation history of living organisms is represented graphically by a phylogeny, that is, a rooted tree whose leaves correspond to current species and whose branchings indicate past speciation events. Phylogenies are commonly estimated from molecular sequences, such as DNA sequences, collected from the species of interest. At a high level, the idea behind this inference is simple: the further apart two species are in the Tree of Life, the greater the number of mutations that have accumulated in their genomes since their most recent common ancestor. In order to obtain accurate estimates in phylogenetic analyses, it is standard practice to employ statistical approaches based on stochastic models of sequence evolution on a tree. For tractability, such models necessarily make simplifying assumptions about the evolutionary mechanisms involved. In particular, insertions and deletions of nucleotides, also known as indels, are commonly omitted.
Properly accounting for indels in statistical phylogenetic analyses remains a major challenge in computational evolutionary biology. Here we consider the problem of reconstructing ancestral sequences on a known phylogeny in a model of sequence evolution incorporating nucleotide substitutions, insertions and deletions, specifically the classical TKF91 process. We focus on the case of dense phylogenies of bounded height, which we refer to as the taxon-rich setting, where statistical consistency is achievable. We give the first polynomial-time ancestral reconstruction algorithm with provable guarantees under constant rates of mutation. Our algorithm succeeds when the phylogeny satisfies the "big bang" condition, a necessary and sufficient condition for statistical consistency in this context.
Submitted 31 July, 2019; v1 submitted 18 July, 2017;
originally announced July 2017.
-
Necessary and sufficient conditions for consistent root reconstruction in Markov models on trees
Authors:
Wai-Tong Louis Fan,
Sebastien Roch
Abstract:
We establish necessary and sufficient conditions for consistent root reconstruction in continuous-time Markov models with countable state space on bounded-height trees. Here a root state estimator is said to be consistent if the probability that it returns the true root state converges to 1 as the number of leaves tends to infinity. We also derive quantitative bounds on the error of reconstruction. Our results answer a question of Gascuel and Steel and have implications for ancestral sequence reconstruction in a classical evolutionary model of nucleotide insertion and deletion.
Submitted 1 August, 2019; v1 submitted 18 July, 2017;
originally announced July 2017.
-
Genealogies in Expanding Populations
Authors:
Rick Durrett,
Wai-Tong Louis Fan
Abstract:
The goal of this paper is to prove rigorous results for the behavior of genealogies in a one-dimensional long range biased voter model introduced by Hallatschek and Nelson [25]. The first step, which is easily accomplished using results of Mueller and Tribe [38], is to show that when space and time are rescaled correctly, our biased voter model converges to a Wright-Fisher SPDE. A simple extension of a result of Durrett and Restrepo [18] then shows that the dual branching coalescing random walk converges to a branching Brownian motion in which particles coalesce after an exponentially distributed amount of intersection local time. Brunet et al. [8] have conjectured that genealogies in models of this type are described by the Bolthausen-Sznitman coalescent, see [39]. However, in the model we study there are no simultaneous coalescences. Our third and most significant result concerns "tracer dynamics" in which some of the initial particles in the biased voter model are labeled. We show that the joint distribution of the labeled and unlabeled particles converges to the solution of a system of stochastic partial differential equations. A new duality equation that generalizes the one Shiga [44] developed for the Wright-Fisher SPDE is the key to the proof of that result.
Submitted 13 January, 2016; v1 submitted 3 July, 2015;
originally announced July 2015.