-
Inconsistency of parsimony under the multispecies coalescent
Authors:
Daniel Rickert,
Wai-Tong Louis Fan,
Matthew Hahn
Abstract:
While it is known that parsimony can be statistically inconsistent under certain models of evolution due to high levels of homoplasy, the consistency of parsimony under the multispecies coalescent (MSC) is less well studied. Previous studies have shown the consistency of concatenated parsimony (parsimony applied to concatenated alignments) under the MSC for the rooted 4-taxa case under an infinite-sites model of mutation; on the other hand, other work has also established the inconsistency of concatenated parsimony for the unrooted 6-taxa case. These seemingly contradictory results suggest that concatenated parsimony may fail to be consistent for trees with more than 5 taxa, for all unrooted trees, or for some combination of the two. Here, we present a technique for computing the expected internal branch lengths of gene trees under the MSC. This technique allows us to determine the regions of the parameter space of the species tree under which concatenated parsimony fails for different numbers of taxa, for rooted or unrooted trees. We use our new approach to demonstrate that there are always regions of statistical inconsistency for concatenated parsimony for the 5- and 6-taxa cases, regardless of rooting. Our results therefore suggest that parsimony is not generally dependable under the MSC.
Submitted 4 July, 2024; v1 submitted 2 July, 2024;
originally announced July 2024.
-
MolecularGPT: Open Large Language Model (LLM) for Few-Shot Molecular Property Prediction
Authors:
Yuyan Liu,
Sirui Ding,
Sheng Zhou,
Wenqi Fan,
Qiaoyu Tan
Abstract:
Molecular property prediction (MPP) is a fundamental and crucial task in drug discovery. However, prior methods are limited by the requirement for a large number of labeled molecules and their restricted ability to generalize to unseen and new tasks, both of which are essential for real-world applications. To address these challenges, we present MolecularGPT for few-shot MPP. From the perspective of instruction tuning, we fine-tune large language models (LLMs) based on curated molecular instructions spanning over 1000 property prediction tasks. This enables building a versatile and specialized LLM that can be adapted to novel MPP tasks without any fine-tuning through zero- and few-shot in-context learning (ICL). MolecularGPT exhibits competitive in-context reasoning capabilities across 10 downstream evaluation datasets, setting new benchmarks for few-shot molecular prediction tasks. More importantly, with just two-shot examples, MolecularGPT can outperform standard supervised graph neural network methods on 4 out of 7 datasets. It also outperforms state-of-the-art LLM baselines by up to a 16.6% increase in classification accuracy and a decrease of 199.17 in regression metrics (e.g., RMSE) under zero-shot settings. This study demonstrates the potential of LLMs as effective few-shot molecular property predictors. The code is available at https://github.com/NYUSHCS/MolecularGPT.
Submitted 18 June, 2024;
originally announced June 2024.
-
Sequential Model for Predicting Patient Adherence in Subcutaneous Immunotherapy for Allergic Rhinitis
Authors:
Yin Li,
Yu Xiong,
Wenxin Fan,
Kai Wang,
Qingqing Yu,
Liping Si,
Patrick van der Smagt,
Jun Tang,
Nutan Chen
Abstract:
Objective: Subcutaneous Immunotherapy (SCIT) is the long-lasting causal treatment of allergic rhinitis (AR). How to enhance the adherence of patients to maximize the benefit of allergen immunotherapy (AIT) plays a crucial role in the management of AIT. This study aims to leverage novel machine learning models to precisely predict the risk of non-adherence of AR patients and related local symptom scores over three years of SCIT.
Methods: The research develops and analyzes two models, the sequential latent-variable model (SLVM) of Stochastic Latent Actor-Critic (SLAC) and Long Short-Term Memory (LSTM), evaluating them based on scoring and adherence prediction capabilities.
Results: Excluding the biased samples at the first time step, the predictive adherence accuracy of the SLAC models ranges from 60% to 72%, and that of the LSTM models from 66% to 84%, varying according to the time steps. The Root Mean Square Error (RMSE) for SLAC models ranges between 0.93 and 2.22, while for LSTM models it ranges between 1.09 and 1.77. Notably, these RMSEs are significantly lower than the random prediction error of 4.55.
Conclusion: We creatively apply sequential models in the long-term management of SCIT with promising accuracy in the prediction of SCIT nonadherence in AR patients. While LSTM outperforms SLAC in adherence prediction, SLAC excels in score prediction for patients undergoing SCIT for AR. The state-action-based SLAC adds flexibility, presenting a novel and effective approach for managing long-term AIT.
Submitted 19 July, 2024; v1 submitted 21 January, 2024;
originally announced January 2024.
-
Correlation of coalescence times in a diploid Wright-Fisher model with recombination and selfing
Authors:
David Kogan,
Dimitrios Diamantidis,
John Wakeley,
Wai-Tong Louis Fan
Abstract:
The correlation among the gene genealogies at different loci is crucial in biology, yet challenging to understand because such correlation depends on many factors including genetic linkage, recombination, natural selection and population structure. Based on a diploid Wright-Fisher model with a single mating type and partial selfing for a constant large population with size $N$, we quantify the combined effect of genetic drift and two competing factors, recombination and selfing, on the correlation of coalescence times at two linked loci for samples of size two. Recombination decouples the genealogies at different loci and decreases the correlation while selfing increases the correlation. We obtain explicit asymptotic formulas for the correlation for four scaling scenarios that depend on whether the selfing probability and the recombination probability are of order $O(1/N)$ or $O(1)$ as $N$ tends to infinity. Our analytical results confirm that the asymptotic lower bound in [King, Wakeley, Carmi (Theor. Popul. Biol. 2018)] is sharp when the loci are unlinked and when there is no selfing, and provide a number of new formulas for other scaling scenarios that have not been considered before. We present asymptotic results for the variance of Tajima's estimator of the population mutation rate for infinitely many loci as $N$ tends to infinity. When the selfing probability is of order $O(1)$ and is equal to a positive constant $s$ for all $N$ and if the samples at both loci are in the same individual, then the variance of the Tajima's estimator tends to $s/2$ (hence remains positive) even when the recombination rate, the number of loci and the population size all tend to infinity.
Submitted 18 October, 2023;
originally announced October 2023.
-
Quasi-stationary behavior of the stochastic FKPP equation on the circle
Authors:
Wai-Tong Louis Fan,
Oliver Tough
Abstract:
We consider the stochastic Fisher-Kolmogorov-Petrovsky-Piscunov (FKPP) equation on the circle $\mathbb{S}$, \begin{equation*}
\partial_t u(t,x) \,= \frac{α}{2}Δu +β\,u(1-u) + \sqrt{γ\,u(1-u)}\,\dot{W}, \qquad (t,x)\in(0,\infty)\times \mathbb{S}, \end{equation*} where $\dot{W}$ is space-time white noise. While any solution will eventually be absorbed at one of two states, the constant 1 and the constant 0 on the circle, essentially nothing had been established about the absorption time (also called the fixation time in population genetics), or about the long-time behavior prior to absorption. We establish the existence and uniqueness of the quasi-stationary distribution (QSD) for the solution of the stochastic FKPP. Moreover, we show that the solution conditioned on not being absorbed at time $t$ converges to this unique QSD as $t\to\infty$, for any initial distribution, and characterize the leading-order asymptotics for the tail distribution of the fixation time. We obtain explicit calculations in the neutral case ($β=0$), quantifying the effect of spatial diffusion on fixation time. We explicitly express the fixation rate in terms of the migration rate $α$ for all $α\in (0,\infty)$, finding in particular that the fixation rate is given by $γ[1-\frac{γ}{12α}+\mathcal{O}(\frac{γ^2}{α^2})]$ for fast migration and $π^2α[1-\frac{8α}{γ}+\mathcal{O}(\frac{α^2}{γ^2})]$ for slow migration. Our proof relies on the observation that the absorbed (or killed) stochastic FKPP is dual to a system of $2$-type branching-coalescing Brownian motions killed when one type dies off, and on leveraging the relationship between these two killed processes.
Submitted 9 January, 2024; v1 submitted 19 September, 2023;
originally announced September 2023.
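As a rough numerical illustration of the two neutral-case expansions quoted in the abstract above, the following Python sketch evaluates them with the $\mathcal{O}(\cdot)$ remainder terms simply dropped. The function names are hypothetical and not taken from the paper, and the truncated values are only meaningful when $γ/α$ (fast migration) or $α/γ$ (slow migration) is small.

```python
import math


def fixation_rate_fast_migration(alpha: float, gamma: float) -> float:
    """Truncated fast-migration expansion: gamma * (1 - gamma / (12 * alpha)).

    Assumption: the O(gamma^2 / alpha^2) remainder is ignored, so this is
    only a sketch for gamma / alpha << 1.
    """
    return gamma * (1.0 - gamma / (12.0 * alpha))


def fixation_rate_slow_migration(alpha: float, gamma: float) -> float:
    """Truncated slow-migration expansion: pi^2 * alpha * (1 - 8 * alpha / gamma).

    Assumption: the O(alpha^2 / gamma^2) remainder is ignored, so this is
    only a sketch for alpha / gamma << 1.
    """
    return math.pi ** 2 * alpha * (1.0 - 8.0 * alpha / gamma)


if __name__ == "__main__":
    # Fast migration: alpha much larger than gamma.
    print(fixation_rate_fast_migration(alpha=100.0, gamma=1.0))
    # Slow migration: alpha much smaller than gamma.
    print(fixation_rate_slow_migration(alpha=0.01, gamma=1.0))
```

In the fast-migration regime the truncated rate stays close to $γ$, while in the slow-migration regime it scales with $α$, matching the leading terms stated in the abstract.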
-
Latent mutations in the ancestries of alleles under selection
Authors:
Wai-Tong Louis Fan,
John Wakeley
Abstract:
We consider a single genetic locus with two alleles $A_1$ and $A_2$ in a large haploid population. The locus is subject to selection and two-way, or recurrent, mutation. Assuming the allele frequencies follow a Wright-Fisher diffusion and have reached stationarity, we describe the asymptotic behaviors of the conditional gene genealogy and the latent mutations of a sample with known allele counts, when the count $n_1$ of allele $A_1$ is fixed, and when either or both the sample size $n$ and the selection strength $\lvertα\rvert$ tend to infinity. Our study extends previous work under neutrality to the case of non-neutral rare alleles, asserting that when selection is not too strong relative to the sample size, even if it is strongly positive or strongly negative in the usual sense ($α\to -\infty$ or $α\to +\infty$), the number of latent mutations of the $n_1$ copies of allele $A_1$ follows the same distribution as the number of alleles in the Ewens sampling formula. On the other hand, very strong positive selection relative to the sample size leads to neutral gene genealogies with a single ancient latent mutation. We also demonstrate robustness of our asymptotic results against changing population sizes, when one of $\lvertα\rvert$ or $n$ is large.
Submitted 26 April, 2024; v1 submitted 13 June, 2023;
originally announced June 2023.
-
Patch formation driven by stochastic effects of interaction between viruses and defective interfering particles
Authors:
Qiantong Liang,
Johnny Yang,
Wai-Tong Louis Fan,
Wing-Cheong Lo
Abstract:
Defective interfering particles (DIPs) are virus-like particles that occur naturally during virus infections. These particles are defective, lacking essential genetic materials for replication, but they can interact with the wild-type virus and potentially be used as therapeutic agents. However, the effect of DIPs on infection spread is still unclear due to complicated stochastic effects and nonlinear spatial dynamics. In this work, we develop a model with a new hybrid method to study the spatial-temporal dynamics of viruses and DIPs co-infections within hosts. We present two different scenarios of virus production and compare the results from deterministic and stochastic models to demonstrate how the stochastic effect is involved in the spatial dynamics of virus transmission. We quantitatively study the spread features of the virus, including the formation and the speed of virus spread and the emergence of stochastic patchy patterns of virus distribution. Our simulations simultaneously capture observed spatial spread features in the experimental data, including the spread rate of the virus and its patchiness. The results demonstrate that DIPs can slow down the growth of virus particles and make the spread of the virus more patchy.
Submitted 24 March, 2023;
originally announced March 2023.
-
EndHiC: assemble large contigs into chromosomal-level scaffolds using the Hi-C links from contig ends
Authors:
Sen Wang,
Hengchao Wang,
Fan Jiang,
Anqi Wang,
Hangwei Liu,
Hanbo Zhao,
Boyuan Yang,
Dong Xu,
Yan Zhang,
Wei Fan
Abstract:
Motivation: The application of PacBio HiFi and ultra-long ONT reads has achieved huge progress in contig-level assembly, but it is still challenging to assemble large contigs into chromosomes with available Hi-C scaffolding software, all of which compute the contact value between contigs using the Hi-C links from the whole contig regions. As the Hi-C links of two adjacent contigs concentrate only at the neighboring ends of the contigs, larger contig size reduces the power to differentiate adjacent (signal) from non-adjacent (noise) contig linkages, leading to a higher rate of mis-assembly.
Results: We present a software package, EndHiC, which is suitable for assembling large contigs (> 1 Mb) into chromosomal-level scaffolds, using Hi-C links from only the contig end regions instead of the whole contig regions. Benefiting from the increased signal-to-noise ratio, EndHiC achieves much higher scaffolding accuracy than the existing software LACHESIS, ALLHiC, and 3D-DNA. Moreover, EndHiC has few parameters, runs 10-1000 times faster than existing software, needs trivial memory, provides robustness evaluation, and allows graphic viewing of the scaffold results. The high scaffolding accuracy and user-friendly interface of EndHiC liberate users from labor-intensive manual checks and revision work.
Availability and implementation: EndHiC is written in Perl, and is freely available at https://github.com/fanagislab/EndHiC. Contact: fanwei@caas.cn and milrazhang@163.com Supplementary information: Supplementary data are available at Bioinformatics online.
Submitted 30 November, 2021;
originally announced November 2021.
-
DGL-LifeSci: An Open-Source Toolkit for Deep Learning on Graphs in Life Science
Authors:
Mufei Li,
Jinjing Zhou,
Jiajing Hu,
Wenxuan Fan,
Yangkang Zhang,
Yaxin Gu,
George Karypis
Abstract:
Graph neural networks (GNNs) constitute a class of deep learning methods for graph data. They have wide applications in chemistry and biology, such as molecular property prediction, reaction prediction and drug-target interaction prediction. Despite the interest, GNN-based modeling is challenging as it requires graph data pre-processing and modeling in addition to programming and deep learning. Here we present DGL-LifeSci, an open-source package for deep learning on graphs in life science. DGL-LifeSci is a python toolkit based on RDKit, PyTorch and Deep Graph Library (DGL). DGL-LifeSci allows GNN-based modeling on custom datasets for molecular property prediction, reaction prediction and molecule generation. With its command-line interfaces, users can perform modeling without any background in programming and deep learning. We test the command-line interfaces using standard benchmarks MoleculeNet, USPTO, and ZINC. Compared with previous implementations, DGL-LifeSci achieves a speed up by up to 6x. For modeling flexibility, DGL-LifeSci provides well-optimized modules for various stages of the modeling pipeline. In addition, DGL-LifeSci provides pre-trained models for reproducing the test experiment results and applying models without training. The code is distributed under an Apache-2.0 License and is freely accessible at https://github.com/awslabs/dgl-lifesci.
Submitted 27 June, 2021;
originally announced June 2021.
-
Modelling brain based on canonical ensemble with functional MRI: A thermodynamic exploration on neural system
Authors:
Chenxi Zhou,
Bin Yang,
Wenliang Fan,
Wei Li
Abstract:
Objective. Modelling is an important way to study the working mechanism of the brain, yet the characterization and understanding of the brain are still inadequate. This study builds a model of the brain from the perspective of thermodynamics at the system level, offering a new way of thinking about brain modelling.
Approach. Regarding brain regions as systems, voxels as particles, and intensity of signals as energy of particles, the thermodynamic model of brain was built based on canonical ensemble theory. Two pairs of activated regions and two pairs of inactivated brain regions were selected for comparison in this study, and the analysis on thermodynamic properties based on the model proposed were performed. In addition, the thermodynamic properties were also extracted as input features for the detection of Alzheimer's disease.
Main results. The experiment results verified the assumption that the brain also follows the thermodynamic laws. It demonstrated the feasibility and rationality of brain thermodynamic modelling method proposed, indicating that thermodynamic parameters could be applied to describe the state of neural system. Meanwhile, the brain thermodynamic model achieved much better accuracy in detection of Alzheimer's disease, suggesting the potential application of thermodynamic model in auxiliary diagnosis.
Significance. (1) Instead of applying some thermodynamic parameters to analyze the neural system, a brain model at the system level was proposed from the perspective of thermodynamics for the first time in this study. (2) The study discovered that the neural system also follows the laws of thermodynamics, which leads to increased internal energy, increased free energy and decreased entropy when the system is activated. (3) The detection of neural disease was shown to benefit from the thermodynamic model, implying the immense potential of thermodynamics in auxiliary diagnosis.
Submitted 27 March, 2021; v1 submitted 26 February, 2021;
originally announced March 2021.
-
Impossibility of phylogeny reconstruction from $k$-mer counts
Authors:
Wai-Tong Louis Fan,
Brandon Legried,
Sebastien Roch
Abstract:
We consider phylogeny estimation under a two-state model of sequence evolution by site substitution on a tree. In the asymptotic regime where the sequence lengths tend to infinity, we show that for any fixed $k$ no statistically consistent phylogeny estimation is possible from $k$-mer counts over the full leaf sequences alone. Formally, we establish that the joint distribution of $k$-mer counts over the entire leaf sequences on two distinct trees have total variation distance bounded away from $1$ as the sequence length tends to infinity. Our impossibility result implies that statistical consistency requires more sophisticated use of $k$-mer count information, such as block techniques developed in previous theoretical work.
Submitted 1 March, 2022; v1 submitted 27 October, 2020;
originally announced October 2020.
-
Impossibility of consistent distance estimation from sequence lengths under the TKF91 model
Authors:
Wai-Tong Louis Fan,
Brandon Legried,
Sebastien Roch
Abstract:
We consider the problem of distance estimation under the TKF91 model of sequence evolution by insertions, deletions and substitutions on a phylogeny. In an asymptotic regime where the expected sequence lengths tend to infinity, we show that no consistent distance estimation is possible from sequence lengths alone. More formally, we establish that the distributions of pairs of sequence lengths at different distances cannot be distinguished with probability going to one.
Submitted 23 May, 2020;
originally announced May 2020.
-
Estimation of genome size using k-mer frequencies from corrected long reads
Authors:
Hengchao Wang,
Bo Liu,
Yan Zhang,
Fan Jiang,
Yuwei Ren,
Lijuan Yin,
Hangwei Liu,
Sen Wang,
Wei Fan
Abstract:
The third-generation long reads sequencing technologies, such as PacBio and Nanopore, have great advantages over second-generation Illumina sequencing in de novo assembly studies. However, due to the inherent low base accuracy, third-generation sequencing data cannot be used for k-mer counting and estimating genomic profile based on k-mer frequencies. Thus, in current genome projects, second-generation data is also necessary for accurately determining genome size and other genomic characteristics. We show that corrected third-generation data can be used to count k-mer frequencies and estimate genome size reliably, in replacement of using second-generation data. Therefore, future genome projects can depend on only one sequencing technology to finish both assembly and k-mer analysis, which will largely decrease sequencing cost in both time and money. Moreover, we present a fast light-weight tool kmerfreq and use it to perform all the k-mer counting tasks in this work. We have demonstrated that corrected third-generation sequencing data can be used to estimate genome size and developed a new open-source C/C++ k-mer counting tool, kmerfreq, which is freely available at https://github.com/fanagislab/kmerfreq.
Submitted 26 March, 2020;
originally announced March 2020.
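To make the idea of estimating genome size from k-mer frequencies concrete, here is a minimal Python sketch. It is not the kmerfreq implementation described in the abstract above; it assumes the common textbook heuristic that genome size is roughly the total number of k-mer occurrences divided by the depth at the main peak of the k-mer frequency histogram. The function names and the `min_depth` cutoff are hypothetical, and the sketch ignores reverse complements, heterozygosity, and repeats.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across a collection of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(kmer_counts, min_depth=2):
    """Estimate genome size as total k-mer occurrences / peak depth.

    Assumption: k-mers seen fewer than `min_depth` times are treated as
    sequencing errors and excluded (a simplification for illustration).
    """
    # Histogram: depth -> number of distinct k-mers observed at that depth.
    histogram = Counter(kmer_counts.values())
    kept = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    # Peak depth: the depth shared by the most distinct k-mers
    # (ties broken in favour of the smaller depth).
    peak_depth = max(kept, key=lambda depth: (kept[depth], -depth))
    total_occurrences = sum(depth * n for depth, n in kept.items())
    return total_occurrences / peak_depth


if __name__ == "__main__":
    # Toy example: five error-free copies of a 10-base "genome".
    toy_genome = "ACGTAGGCTA"
    reads = [toy_genome] * 5
    counts = count_kmers(reads, k=4)
    print(estimate_genome_size(counts))
```

On the toy input the estimate equals the number of k-mer positions in the genome (length minus k plus one), which approximates the genome length when the genome is much longer than k.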
-
Stochastic PDEs on graphs as scaling limits of discrete interacting systems
Authors:
Wai-Tong Louis Fan
Abstract:
Stochastic partial differential equations (SPDE) on graphs were introduced by Cerrai and Freidlin [Ann. Inst. Henri Poincaré Probab. Stat. 53 (2017) 865-899]. This class of stochastic equations in infinite dimensions provides a minimal framework for the study of the effective dynamics of much more complex systems. However, how they emerge from microscopic individual-based models is still poorly understood, partly due to complications near vertex singularities. In this work, motivated by the study of the dynamics and the genealogies of expanding populations in spatially structured environments, we obtain a new class of SPDE on graphs of Wright-Fisher type which have nontrivial boundary conditions on the vertex set. We show that these SPDE arise as scaling limits of suitably defined biased voter models (BVM), which extends the scaling limits of Durrett and Fan [Ann. Appl. Probab. 26 (2016) 456-490]. We further obtain a convergent simulation scheme for each of these SPDE in terms of a system of Itô SDEs, which is useful when the size of the BVM is too large for stochastic simulations. These give the first rigorous connection between SPDE on graphs and more discrete models, specifically, interacting particle systems and interacting SDEs. Uniform heat kernel estimates for symmetric random walks approximating diffusions on graphs are the keys to our proofs.
Submitted 17 November, 2020; v1 submitted 5 August, 2017;
originally announced August 2017.
-
Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling
Authors:
Wai-Tong Louis Fan,
Sebastien Roch
Abstract:
In evolutionary biology, the speciation history of living organisms is represented graphically by a phylogeny, that is, a rooted tree whose leaves correspond to current species and branchings indicate past speciation events. Phylogenies are commonly estimated from molecular sequences, such as DNA sequences, collected from the species of interest. At a high level, the idea behind this inference is simple: the further apart in the Tree of Life are two species, the greater is the number of mutations to have accumulated in their genomes since their most recent common ancestor. In order to obtain accurate estimates in phylogenetic analyses, it is standard practice to employ statistical approaches based on stochastic models of sequence evolution on a tree. For tractability, such models necessarily make simplifying assumptions about the evolutionary mechanisms involved. In particular, commonly omitted are insertions and deletions of nucleotides -- also known as indels.
Properly accounting for indels in statistical phylogenetic analyses remains a major challenge in computational evolutionary biology. Here we consider the problem of reconstructing ancestral sequences on a known phylogeny in a model of sequence evolution incorporating nucleotide substitutions, insertions and deletions, specifically the classical TKF91 process. We focus on the case of dense phylogenies of bounded height, which we refer to as the taxon-rich setting, where statistical consistency is achievable. We give the first polynomial-time ancestral reconstruction algorithm with provable guarantees under constant rates of mutation. Our algorithm succeeds when the phylogeny satisfies the "big bang" condition, a necessary and sufficient condition for statistical consistency in this context.
Submitted 31 July, 2019; v1 submitted 18 July, 2017;
originally announced July 2017.
-
Necessary and sufficient conditions for consistent root reconstruction in Markov models on trees
Authors:
Wai-Tong Louis Fan,
Sebastien Roch
Abstract:
We establish necessary and sufficient conditions for consistent root reconstruction in continuous-time Markov models with countable state space on bounded-height trees. Here a root state estimator is said to be consistent if the probability that it returns the true root state converges to 1 as the number of leaves tends to infinity. We also derive quantitative bounds on the error of reconstruction. Our results answer a question of Gascuel and Steel and have implications for ancestral sequence reconstruction in a classical evolutionary model of nucleotide insertion and deletion.
Submitted 1 August, 2019; v1 submitted 18 July, 2017;
originally announced July 2017.
-
Genealogies in Expanding Populations
Authors:
Rick Durrett,
Wai-Tong Louis Fan
Abstract:
The goal of this paper is to prove rigorous results for the behavior of genealogies in a one-dimensional long range biased voter model introduced by Hallatschek and Nelson [25]. The first step, which is easily accomplished using results of Mueller and Tribe [38], is to show that when space and time are rescaled correctly, our biased voter model converges to a Wright-Fisher SPDE. A simple extension of a result of Durrett and Restrepo [18] then shows that the dual branching coalescing random walk converges to a branching Brownian motion in which particles coalesce after an exponentially distributed amount of intersection local time. Brunet et al. [8] have conjectured that genealogies in models of this type are described by the Bolthausen-Sznitman coalescent, see [39]. However, in the model we study there are no simultaneous coalescences. Our third and most significant result concerns "tracer dynamics" in which some of the initial particles in the biased voter model are labeled. We show that the joint distribution of the labeled and unlabeled particles converges to the solution of a system of stochastic partial differential equations. A new duality equation that generalizes the one Shiga [44] developed for the Wright-Fisher SPDE is the key to the proof of that result.
Submitted 13 January, 2016; v1 submitted 3 July, 2015;
originally announced July 2015.
-
Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects
Authors:
Binghang Liu,
Yujian Shi,
Jianying Yuan,
Xuesong Hu,
Hao Zhang,
Nan Li,
Zhenyu Li,
Yanxiang Chen,
Desheng Mu,
Wei Fan
Abstract:
Background: With the fast development of next-generation sequencing technologies, increasing numbers of genomes are being de novo sequenced and assembled. However, most are in fragmented and incomplete draft status, and thus it is often difficult to know the accurate genome size and repeat content. Furthermore, many genomes are highly repetitive or heterozygous, posing problems for current assemblers that use short reads. Therefore, it is necessary to develop efficient assembly-independent methods for accurate estimation of these genomic characteristics. Results: Here we present a framework for modeling the distribution of k-mer frequency from sequencing data and estimating genomic characteristics such as genome size, repeat structure and heterozygosity rate. By introducing novel techniques of k-mer individuals, float precision estimation, and proper treatment of sequencing error and coverage bias, the estimation accuracy of our method is significantly improved over existing methods. We also studied how the various genomic and sequencing characteristics affect the estimation accuracy using simulated sequencing data, and discussed the limitations of applying our method to real sequencing data. Conclusion: Based on this research, we show that k-mer frequency analysis can be used as a general and assembly-independent method for estimating genomic characteristics, which can improve our understanding of a species' genome, help design the sequencing strategy of genome projects, and guide the development of assembly algorithms. The programs developed in this research are written in C/C++ and are freely accessible on GitHub (https://github.com/fanagislab/GCE) or the BGI FTP server (ftp://ftp.genomics.org.cn/pub/gce).
Submitted 26 February, 2020; v1 submitted 8 August, 2013;
originally announced August 2013.
-
Genomes: at the edge of chaos with maximum information capacity
Authors:
Sing-Guan Kong,
Hong-Da Chen,
Wen-Lang Fan,
Jan Wigger,
Andrew Torda,
HC Lee
Abstract:
We propose an order index, phi, which quantifies the notion of "life at the edge of chaos" when applied to genome sequences. It maps genomes to a number from 0 (random and of infinite length) to 1 (fully ordered) and applies regardless of sequence length. The 786 complete genomic sequences in GenBank were found to have phi values in a very narrow range, 0.037 ± 0.027. We show this implies that genomes are halfway towards being completely random, namely, at the edge of chaos. We argue that this narrow range represents the neighborhood of a fixed point in the space of sequences, and genomes are driven there by the dynamics of a robust, predominantly neutral evolution process.
Submitted 12 August, 2007;
originally announced August 2007.