
    Bruce Parrello

    In this study, we built machine learning classifiers for predicting the presence or absence of the variable genes occurring in 10-90% of all publicly available high-quality Escherichia coli genomes. The BV-BRC genus-specific protein families were used to define orthologs across the set of genomes, and a single binary classifier was built for predicting the presence or absence of each family in each genome. Each model was built using the nucleotide k-mers from a set of 100 conserved genes as features. The resulting set of 3,259 XGBoost classifiers had a per-genome average macro F1 score of 0.944 [0.943-0.945, 95% CI]. We show that the F1 scores are stable across MLSTs, and that the trend can be recapitulated through sampling with a smaller number of core genes or diverse input genomes. Surprisingly, the presence or absence of poorly annotated proteins, including “hypothetical proteins”, was easily predicted (F1 = 0.902 [0.898-0.906, 95% CI]). Models for proteins with horizontal gene t...
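The feature construction and scoring mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the k-mer length, the handling of ambiguous bases, and the two-class reading of "macro F1" are assumptions made for illustration, and the function names `kmer_counts`, `feature_vector`, and `macro_f1` are hypothetical. Training the XGBoost classifiers themselves is not shown.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Count overlapping nucleotide k-mers, skipping any with non-ACGT characters."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def feature_vector(sequence, k):
    """Fixed-order vector of k-mer counts over all 4**k possible k-mers."""
    counts = kmer_counts(sequence, k)
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def macro_f1(y_true, y_pred):
    """Unweighted mean of the F1 scores of the absent (0) and present (1) classes.

    Assumption: binary labels, with 1 meaning the protein family is present.
    """
    scores = []
    for label in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

In this sketch, one feature vector would be built per genome from its conserved-gene sequences, and one binary label vector per protein family would serve as the prediction target.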
    genes linked to subsystems (our mechanism of manual curation) is shown to provide an initial assessment of the performance of RAST. Copyright information: Taken from "The RAST Server: Rapid Annotations using Subsystems Technology", http://www.biomedcentral.com/1471-2164/9/75. BMC Genomics 2008;9:75. Published online 8 Feb 2008. PMCID: PMC2265698.
    Y – not started, blue – queued for computation, yellow – in progress, red – requires user input, brown – failed with an error, green – successfully completed. Copyright information: Taken from "The RAST Server: Rapid Annotations using Subsystems Technology", http://www.biomedcentral.com/1471-2164/9/75. BMC Genomics 2008;9:75. Published online 8 Feb 2008. PMCID: PMC2265698.
    By scenario basis. The scenarios are organized on the left by subsystems, which are themselves organized by categories of metabolic function. If a path through a scenario was found in a given subsystem, the subsystem name is highlighted in blue. In this case, one path was found through the Uroporphyrinogen III generation scenario in the Porphyrin, Heme and Siroheme Biosynthesis subsystem. The table to the right shows the input and output compounds for the scenario, including their stoichiometry, and the reactions that make up the path through the scenario. Copyright information: Taken from "The RAST Server: Rapid Annotations using Subsystems Technology", http://www.biomedcentral.com/1471-2164/9/75. BMC Genomics 2008;9:75. Published online 8 Feb 2008. PMCID: PMC2265698.
    Ed genome of Manfredo was compared to the metabolic reconstruction for MGAS315, which is part of the comparative environment of the SEED. All three columns of subsystem categories are expandable. In cases where RAST was conservative in the assertion of a subsystem, a manual attempt to retrieve the missing function/s can be made by clicking the find button. Copyright information: Taken from "The RAST Server: Rapid Annotations using Subsystems Technology", http://www.biomedcentral.com/1471-2164/9/75. BMC Genomics 2008;9:75. Published online 8 Feb 2008. PMCID: PMC2265698.
    H includes comparative genomics views and the connections to a subsystem if one was asserted. Copyright information: Taken from "The RAST Server: Rapid Annotations using Subsystems Technology", http://www.biomedcentral.com/1471-2164/9/75. BMC Genomics 2008;9:75. Published online 8 Feb 2008. PMCID: PMC2265698.
    Additional file 7 Pasolli et al. quality report. This is a report of EvalCon and EvalG scores for high- and medium-quality genomes assembled in the Pasolli et al. study. Columns, in order, are: Pasolli et al. genome name, CheckM completeness score, CheckM contamination score, PATRIC genome ID, scientific name of organism, EvalG completeness score, EvalG contamination score, EvalCon coarse consistency score, EvalCon fine consistency score, and "good?" Genomes marked "good" in this table meet the following criteria: (1) contamination ≤10%, (2) fine consistency ≥87%, (3) completeness ≥80%, and (4) a single copy of pheS of appropriate length.
    Additional file 6 PATRIC quality report. This is a report of EvalCon and EvalG scores for all public genomes in PATRIC. Columns, in order, are: PATRIC genome ID, genome name, EvalCon fine consistency score, EvalG completeness score, EvalG contamination score, "Good," and "Good Seed." Genomes marked "good" meet the following criteria: (1) contamination ≤10%, (2) fine consistency ≥87%, and (3) completeness ≥80%. Genomes marked "good seed" have a single copy of the phenylalanine tRNA synthetase, alpha subunit (pheS) gene of appropriate length (209–405 amino acid residues for bacteria, 293-652 for archaea).
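The "good" and "good seed" criteria listed above can be written as a short Python check. This is a sketch under stated assumptions rather than the PATRIC implementation: scores are taken to be percentages on a 0-100 scale, the pheS length bounds are treated as inclusive, and the names `is_good`, `is_good_seed`, and `PHES_LENGTH_RANGES` are hypothetical.

```python
# pheS length ranges in amino acid residues, as quoted in the file description.
# Assumption: both ends of each range are inclusive.
PHES_LENGTH_RANGES = {"bacteria": (209, 405), "archaea": (293, 652)}


def is_good(contamination, fine_consistency, completeness):
    """Apply the three 'good' thresholds; all inputs are percentages (assumed 0-100)."""
    return (
        contamination <= 10.0
        and fine_consistency >= 87.0
        and completeness >= 80.0
    )


def is_good_seed(phes_lengths, domain):
    """Require exactly one pheS copy whose length fits the range for the domain.

    phes_lengths: lengths (in residues) of every pheS copy found in the genome.
    domain: "bacteria" or "archaea".
    """
    if len(phes_lengths) != 1:
        return False
    low, high = PHES_LENGTH_RANGES[domain]
    return low <= phes_lengths[0] <= high
```

For example, `is_good(5.0, 90.0, 85.0)` passes all three thresholds, while a genome with two pheS copies fails `is_good_seed` regardless of their lengths.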
    Additional file 5 EvalCon converged multiplicity matrix. This is the multiplicity matrix with its columns pared down to the set of reliably predictable roles. Each line of this file represents a single genome, with tab-separated multiplicities for each role. Its rows are labeled in genome_names.txt and its columns are labeled in reliable_roles.txt.
    Additional file 4 EvalCon reliable role names. This file contains (column index, role name) tuples for the converged matrix. All indices start at zero. This is the subset of roles in all_roles.txt found to be reliably predictable under the random forest predictor.
    Additional file 3 EvalCon training multiplicity matrix. This is the multiplicity matrix used to train the machine learning predictors for EvalCon. Each line of this file represents a single genome, with tab-separated multiplicities for each role. Its rows are labeled in genome_names.txt and its columns are labeled in all_roles.txt.
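A multiplicity matrix of the kind described here (one genome per line, tab-separated multiplicities per role) could be read along the following lines. The sketch assumes that multiplicities are integers and that the label files hold one tab-separated `index<TAB>name` pair per line; neither detail is confirmed by the file descriptions, and the helper names are hypothetical.

```python
def parse_matrix(lines):
    """Parse tab-separated integer multiplicities, one genome (row) per line."""
    return [
        [int(value) for value in line.rstrip("\n").split("\t")]
        for line in lines
        if line.strip()
    ]


def parse_index_names(lines):
    """Parse 'index<TAB>name' lines into a dict keyed by zero-based index.

    Assumption: this is the layout of the role-name and genome-name files.
    """
    names = {}
    for line in lines:
        if not line.strip():
            continue
        index, name = line.rstrip("\n").split("\t", 1)
        names[int(index)] = name
    return names


def role_multiplicities(matrix, role_names, row):
    """Map each role name to its multiplicity in the genome at the given row."""
    return {role_names[col]: count for col, count in enumerate(matrix[row])}
```

Each function accepts any iterable of lines, so an open file handle can be passed directly.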
    Additional file 2 EvalCon training role names. This file contains (column index, role name) tuples for the training data matrix.
    Rmat. For each peg the location on the contig, the functional role assignment, its EC number (if present) and GO category, the connection to a subsystem and a KEGG reaction (if appropriate) are given. Copyright information: Taken from "The RAST Server: Rapid Annotations using Subsystems Technology", http://www.biomedcentral.com/1471-2164/9/75. BMC Genomics 2008;9:75. Published online 8 Feb 2008. PMCID: PMC2265698.
    The remarkable advance in sequencing technology and the rising interest in medical and environmental microbiology, biotechnology, and synthetic biology resulted in a deluge of published microbial genomes. Yet, genome annotation, comparison, and modeling remain a major bottleneck to the translation of sequence information into biological knowledge, hence computational analysis tools are continuously being developed for rapid genome annotation and interpretation. Among the earliest, most comprehensive resources for prokaryotic genome analysis, the SEED project, initiated in 2003 as an integration of genomic data and analysis tools, now contains >5,000 complete genomes, a constantly updated set of curated annotations embodied in a large and growing collection of encoded subsystems, a derived set of protein families, and hundreds of genome-scale metabolic models. Until recently, however, maintaining current copies of the SEED code and data at remote locations has been a pressing issue. T...
    The National Microbial Pathogen Data Resource (NMPDR)
    Resource (NMPDR): a genomics platform based
    Large amounts of metagenomically-derived data are submitted to PATRIC for analysis. In the future, we expect even more jobs submitted to PATRIC will use metagenomic data. One in-demand use case is the extraction of near-complete draft genomes from assembled contigs of metagenomic origin. The PATRIC metagenome binning service utilizes the PATRIC database to furnish a large, diverse set of reference genomes. We provide a new service for supervised extraction and annotation of high-quality, near-complete genomes from metagenomically-derived contigs. Reference genomes are assigned to putative draft genome bins based on the presence of single-copy universal marker roles in the sample, and contigs are sorted into these bins by their similarity to reference genomes in PATRIC. Each set of binned contigs represents a draft genome that will be annotated by RASTtk in PATRIC. A structured-language binning report is provided containing quality measurements and taxonomic information about the con...
    The PathoSystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center funded by the National Institute of Allergy and Infectious Diseases (https://www.patricbrc.org). PATRIC supports bioinformatic analyses of all bacteria with a special emphasis on pathogens, offering a rich comparative analysis environment that provides users with access to over 250 000 uniformly annotated and publicly available genomes with curated metadata. PATRIC offers web-based visualization and comparative analysis tools, a private workspace in which users can analyze their own data in the context of the public collections, services that streamline complex bioinformatic workflows and command-line tools for bulk data analysis. Over the past several years, as genomic and other omics-related experiments have become more cost-effective and widespread, we have observed considerable growth in the usage of and demand for easy-to-use, publicly available bioinformatic tools and services...
    Background: Large volumes of metagenomic samples are being processed and submitted to PATRIC for analysis as reads or assembled contigs. Effective analysis of these samples requires solutions to a number of problems, including the binning of assembled, mixed, metagenomically-derived contigs into taxonomic units. Description: The PATRIC metagenome binning service utilizes the PATRIC database to furnish a large, diverse set of reference genomes. Reference genomes are assigned based on the presence of single-copy universal marker proteins in the sample, and contigs are assigned to the bin corresponding to the most similar reference genome. Each set of binned contigs represents a draft genome that will be annotated by RASTtk in PATRIC. A structured-language binning report is provided containing quality measurements and taxonomic information about the contig bins. Conclusion: We provide a new service for rapid and interpretable metagenomic contig binning and annotation in PATRIC.
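The step of assigning each contig to the bin of its most similar reference genome can be sketched in Python as follows. The abstract does not state how similarity is measured, so this example substitutes Jaccard similarity over nucleotide k-mer sets as a stand-in; the choice of `k`, the `min_similarity` cutoff, and all function names are assumptions for illustration and do not describe the PATRIC service itself.

```python
def kmer_set(sequence, k):
    """Set of all overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bin_contigs(contigs, references, k=4, min_similarity=0.0):
    """Place each contig in the bin of its most similar reference genome.

    contigs, references: dicts mapping an ID to a nucleotide sequence.
    Contigs whose best similarity does not exceed min_similarity stay unbinned.
    """
    ref_kmers = {name: kmer_set(seq, k) for name, seq in references.items()}
    bins = {name: [] for name in references}
    for contig_id, seq in contigs.items():
        contig_kmers = kmer_set(seq, k)
        scores = {name: jaccard(contig_kmers, kmers) for name, kmers in ref_kmers.items()}
        best = max(scores, key=scores.get)
        if scores[best] > min_similarity:
            bins[best].append(contig_id)
    return bins
```

Each resulting bin then corresponds to one draft genome; in the service described above, the reference genomes would first be chosen from marker proteins found in the sample, a step this sketch takes as given.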
    Background: Recent advances in high-volume sequencing technology and mining of genomes from metagenomic samples call for rapid and reliable genome quality evaluation. The current release of the PATRIC database contains over 220,000 genomes, and current metagenomic technology supports assemblies of many draft-quality genomes from a single sample, most of which will be novel. Description: We have added two quality assessment tools to the PATRIC annotation pipeline. EvalCon uses supervised machine learning to calculate an annotation consistency score. EvalG implements a variant of the CheckM algorithm to estimate contamination and completeness of an annotated genome. We report on the performance of these tools and the potential utility of the consistency score. Additionally, we provide contamination, completeness, and consistency measures for all genomes in PATRIC and in a recent set of metagenomic assemblies. Conclusion: EvalG and EvalCon facilitate the rapid quality control and explorati...
    The Pathosystems Resource Integration Center (PATRIC, www.patricbrc.org) is designed to provide researchers with the tools and services that they need to perform genomic and other ‘omic’ data analyses. In response to mounting concern over antimicrobial resistance (AMR), the PATRIC team has been developing new tools that help researchers understand AMR and its genetic determinants. To support comparative analyses, we have added AMR phenotype data to over 15 000 genomes in the PATRIC database, often assembling genomes from reads in public archives and collecting their associated AMR panel data from the literature to augment the collection. We have also been using this collection of AMR metadata to build machine learning-based classifiers that can predict the AMR phenotypes and the genomic regions associated with resistance for genomes being submitted to the annotation service. Likewise, we have undertaken a large AMR protein annotation effort by manually curating data from the literat...
    Understanding gene function and regulation is essential for the interpretation, prediction, and ultimate design of cell responses to changes in the environment. An important step toward meeting the challenge of understanding gene function and regulation is the identification of sets of genes that are always co-expressed. These gene sets, Atomic Regulons (ARs), represent fundamental units of function within a cell and could be used to associate genes of unknown function with cellular processes and to enable rational genetic engineering of cellular systems. Here, we describe an approach for inferring ARs that leverages large-scale expression data sets, gene context, and functional relationships among genes. We computed ARs for Escherichia coli based on 907 gene expression experiments and compared our results with gene clusters produced by two prevalent data-driven methods: Hierarchical clustering and k-means clustering. We compared ARs and purely data-driven gene clusters to the curat...
    The Pathosystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center (https://www.patricbrc.org). Recent changes to PATRIC include a redesign of the web interface and some new services that provide users with a platform that takes them from raw reads to an integrated analysis experience. The redesigned interface allows researchers direct access to tools and data, and the emphasis has changed to user-created genome-groups, with detailed summaries and views of the data that researchers have selected. Perhaps the biggest change has been the enhanced capability for researchers to analyze their private data and compare it to the available public data. Researchers can assemble their raw sequence reads and annotate the contigs using RASTtk. PATRIC also provides services for RNA-Seq, variation, model reconstruction and differential expression analysis, all delivered through an updated private workspace. Private data can be compared by 'virtual integratio...
    ABSTRACT The purpose of this paper is to present a systematic approach to the conceptual schema design. The entity-relationship model is used as the conceptual schema model. The entity-relationship schema, a formal description of the model, is defined to explicitly ...
    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.
    ABSTRACT General ledger systems, though not often discussed in the database literature, are fundamental to the operation of any business and illustrate many of the subtleties of database design. We present here an application of the entity-relationship model design methodology to general ledger systems. Introductions to the terminology of both entity-relationship models and general ledger accounting systems are given, in order to make the paper relatively self-contained.