[1993] Proceedings of the Twenty-sixth Hawaii International Conference on System Sciences
Abstract: The machine-learning technique of decision graphs, a generalization of decision trees, is applied to the prediction of protein secondary structure to infer a theory for this problem. The resulting decision graph provides both a prediction method and an ...
Copyright information: Taken from "Comparative analysis of long DNA sequences by per element information content using different contexts", http://www.biomedcentral.com/1471-2105/8/S2/S10. BMC Bioinformatics 2007;8(Suppl 2):S10-S10. Published online 3 May 2007. PMCID: PMC1892068.
This paper considers the importance of end-fragment constraints in the construction of contig restriction maps. A representation for such maps is used in conjunction with an objective function based on minimum message length (MML) principles and two stochastic optimization methods. Results from the optimization of real and simulated data sets with and without end-fragment constraints are given, and it is shown that better scores can be obtained if end-fragment constraints are violated. The effectiveness of the MML objective function is illustrated by its ability to balance a number of conflicting constraints.
Given the exponential growth in the amount of genetic data being produced, it is more important than ever for researchers to have effective tools to help them manage this data. This paper describes a system that enables users, generally biologists, to construct components to answer specific questions in their field. The system allows the creation of modules and submodules via top-down decomposition. Concepts and terms can be defined through conversation. These are then used when composing base-level functions to produce code for modules and for interfacing modules.
Proceedings. International Conference on Intelligent Systems for Molecular Biology, 1994
Restriction mapping generally requires the application of information from various digestions by restriction enzymes to find solution sets. We use both the predicate calculus and constraint-solving capabilities of CLP(R) to develop an engine for restriction mapping. Many of the techniques employed by biologists to manually find solutions are supported by the engine in a consistent manner. We provide generalized pipeline and cross-multiply operators for combining sub-maps. Our approach encourages the building of maps iteratively. We show how other techniques can be readily incorporated.
The exponential growth in the quantity of publicly available genetic data and the proliferation of bioinformatic databases mean that scientists need computerized tools more than ever. Existing approaches to the problem all suffer from one or more basic problems. This paper describes Polyome, the core of a system for the integration and querying of data sources, designed to overcome
Restriction site mapping programs construct maps by generating permutations of fragments and checking for consistency. Unfortunately, many consistent maps are often obtained within the experimental error bounds, even though there is only one actual map. A particularly efficient algorithm is presented that aims to minimize error bounds between restriction sites. The method is generalized for linear and circular maps. The time complexity is derived and execution times are given for multiple enzymes and a range of error bounds.
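The permute-and-check approach that this abstract attributes to existing mapping programs can be sketched as follows. This is a minimal illustration for a linear map with two enzymes, not the efficient algorithm the paper presents. The function names, the `tol` parameter, and the exact-position handling of cut sites are assumptions made for the example.

```python
from itertools import accumulate, permutations


def cut_sites(order):
    """Interior cut positions implied by a left-to-right fragment order."""
    return list(accumulate(order))[:-1]


def consistent_maps(frags_a, frags_b, frags_ab, tol=0.0):
    """Enumerate orderings of two single digests that reproduce the double digest.

    frags_a, frags_b: fragment lengths from each single-enzyme digest.
    frags_ab: fragment lengths observed in the double digest.
    tol: hypothetical per-fragment error bound (same units as the lengths).
    """
    total = sum(frags_a)
    observed = sorted(frags_ab)
    found = []
    # set() drops duplicate orderings when fragment lengths repeat.
    for order_a in sorted(set(permutations(frags_a))):
        for order_b in sorted(set(permutations(frags_b))):
            # Assumption: sites of the two enzymes merge only if exactly equal.
            sites = sorted(set(cut_sites(order_a)) | set(cut_sites(order_b)))
            bounds = [0] + sites + [total]
            predicted = sorted(hi - lo for lo, hi in zip(bounds, bounds[1:]))
            if len(predicted) == len(observed) and all(
                abs(p - o) <= tol for p, o in zip(predicted, observed)
            ):
                found.append((order_a, order_b))
    return found


maps = consistent_maps([3, 7], [4, 6], [1, 3, 6])
```

Even this tiny input yields two consistent maps, one the mirror image of the other, which hints at why exhaustive enumeration tends to return many candidates once error bounds are loosened.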
Methods for measuring genetic distances in phylogenetics are known to be sensitive to the evolutionary model assumed. However, there is a lack of established methodology to accommodate the trade-off between incorporating sufficient biological reality and avoiding model overfitting. In addition, as traditional methods measure distances based on the observed number of substitutions, they tend to underestimate distances between diverged sequences due to backward and parallel substitutions. Various techniques were proposed to correct this, but they lack robustness against sequences that are distantly related and of unequal base frequencies. In this article, we present a novel genetic distance estimate based on information theory that overcomes the above two hurdles. Instead of examining the observed number of substitutions, this method estimates genetic distances using Shannon's mutual information. This naturally provides an effective framework for balancing model complexity and goodness of fit. Our distance estimate is shown to be approximately linear to elapsed time and hence is less sensitive to the divergence of sequence data and compositionally biased sequences. Using extensive simulation data, we show that our method 1) consistently reconstructs more accurate phylogeny topologies than existing methods, 2) is robust in extreme conditions such as diverged phylogenies, unequal base frequencies data, and heterogeneous mutation patterns, and 3) scales well with large phylogenies.
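For readers unfamiliar with the quantity named above, Shannon's mutual information between two aligned sequences can be illustrated with a simple plug-in estimate from column-pair frequencies. This is only a sketch of the textbook definition, assuming gap-free sequences of equal length. It is not the estimator described in the article, and the function name is hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(seq_x, seq_y):
    """Plug-in estimate of I(X;Y) in bits from two aligned sequences.

    Assumes equal-length, gap-free alignments; frequencies are raw counts
    with no smoothing or model-complexity correction.
    """
    if len(seq_x) != len(seq_y) or not seq_x:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(seq_x)
    joint = Counter(zip(seq_x, seq_y))  # counts of aligned symbol pairs
    count_x = Counter(seq_x)            # marginal counts for the first sequence
    count_y = Counter(seq_y)            # marginal counts for the second sequence
    # I(X;Y) = sum p(a,b) * log2( p(a,b) / (p(a) p(b)) )
    return sum(
        (c / n) * log2(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in joint.items()
    )


identical = mutual_information("ACGT", "ACGT")
unrelated = mutual_information("AACC", "ACAC")
```

In this toy case identical sequences share the full 2 bits per site of a uniform four-letter alphabet, while the second pair, whose columns are statistically independent, shares none.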
Papers by Trevor I Dix