Lymphoid tissues are an important HIV reservoir site that persists in the face of antiretroviral therapy and natural immunity. Targeting these reservoirs by harnessing the antiviral activity of local tissue resident memory (TRM) CD8+ T-cells is of great interest, but limited data exist on TRMs within lymph nodes of people living with HIV (PLWH). Here, we studied tonsil CD8+ T-cells obtained from PLWH and uninfected controls from South Africa. We show that these cells are preferentially located outside the germinal centers (GCs), the main reservoir site for HIV, and display a low cytolytic and transcriptionally TRM-like profile that is distinct from blood. In PLWH, CD8+ TRM-like cells are highly expanded and adopt a more cytolytic, activated and exhausted phenotype characterized by increased expression of CD69, PD-1 and perforin, but reduced CD127. This phenotype was enhanced in HIV-specific CD8+ T-cells from tonsils compared to matched blood. Single-cell profiling of these cells re...
Summary: Chromosome conformation capture technologies such as Hi-C have revealed a rich hierarchical structure of chromatin, with topologically associating domains (TADs) as a key organizational unit, but experimentally reported TAD architectures, currently determined separately for each cell type, are lacking for many cell/tissue types. A solution to address this issue is to integrate existing epigenetic data across cells and tissue types to develop a species-level consensus map relating genes to TADs. Here, we introduce the TAD Map, a bag-of-genes representation that we use to infer, or “impute,” TAD architectures for those cells/tissues with limited Hi-C experimental data. The TAD Map enables a systematic analysis of gene coexpression induced by chromatin structure. By overlaying transcriptional data from hundreds of bulk and single-cell assays onto the TAD Map, we assess gene coexpression in TADs and find that expressed genes cluster into fewer TADs than would be expected by chanc...
Machine learning that generates biological hypotheses has transformative potential, but most learning algorithms are susceptible to pathological failure when exploring regimes beyond the training data distribution. A solution is to quantify prediction uncertainty so that algorithms can gracefully handle novel phenomena that confound standard methods. Here, we demonstrate the broad utility of robust uncertainty prediction in biological discovery. By leveraging Gaussian process-based uncertainty prediction on modern pretrained features, we train a model on just 72 compounds to make predictions over a 10,833-compound library, identifying and experimentally validating compounds with nanomolar affinity for diverse kinases and whole-cell growth inhibition of Mycobacterium tuberculosis. We show how uncertainty facilitates a tight iterative loop between computation and experimentation, improves the generative design of novel biochemical structures, and generalizes across disparate biological d...
Viral mutation that escapes from human immunity remains a major obstacle to antiviral and vaccine development. While anticipating escape could aid rational therapeutic design, the complex rules governing viral escape are challenging to model. Here, we demonstrate an unprecedented ability to predict viral escape by using machine learning algorithms originally developed to model the complexity of human natural language. Our key conceptual advance is that predicting escape requires identifying mutations that preserve viral fitness, or “grammaticality,” and also induce high antigenic change, or “semantic change.” We develop viral language models for influenza hemagglutinin, HIV Env, and SARS-CoV-2 Spike that we use to construct antigenically meaningful semantic landscapes, perform completely unsupervised prediction of escape mutants, and learn structural escape patterns from sequence alone. More profoundly, we lay a promising conceptual bridge between natural language and viral evolutio...
Nonlinear data-visualization methods, such as t-SNE and UMAP, have become staple tools for summarizing the complex transcriptomic landscape of single cells in 2D or 3D. However, existing approaches neglect the local density of data points in the original space, often resulting in misleading visualizations where densely populated subpopulations of cells are given more visual space even if they account for only a small fraction of transcriptional diversity within the dataset. We present den-SNE and densMAP, our density-preserving visualization tools based on t-SNE and UMAP, respectively, and demonstrate their ability to facilitate more accurate visual interpretation of single-cell RNA-seq data. On recently published datasets, our methods newly reveal significant changes in transcriptomic variability within a range of biological processes, including cancer, immune cell specialization in human, and the developmental trajectory of C. elegans. Our methods are readily applicable to visualiz...
Cryo-electron microscopy (cryoEM) is becoming the preferred method for resolving protein structures. Low signal-to-noise ratio (SNR) in cryoEM images reduces the confidence and throughput of structure determination during several steps of data processing, resulting in impediments such as missing particle orientations. Denoising cryoEM images can not only improve downstream analysis but also accelerate the time-consuming data collection process by allowing lower electron dose micrographs to be used for analysis. Here, we present Topaz-Denoise, a deep learning method for reliably and rapidly increasing the SNR of cryoEM images and cryoET tomograms. By training on a dataset composed of thousands of micrographs collected across a wide range of imaging conditions, we are able to learn models capturing the complexity of the cryoEM image formation process. The general model we present is able to denoise new datasets without additional training. Denoising with this model improves micrograph inter...
Motivation: Unbiased clustering methods are needed to analyze growing numbers of complex datasets. Currently available clustering methods often depend on parameters that are set by the user, lack stability, and are not applicable to small datasets. To overcome these shortcomings we used topological data analysis, an emerging field of mathematics that discerns additional features and discovers hidden insights on datasets and has a wide application range. Results: We have developed a topology-based clustering method called Two-Tier Mapper (TTMap) for enhanced analysis of global gene expression datasets. First, TTMap discerns divergent features in the control group, adjusts for them, and identifies outliers. Second, the deviation of each test sample from the control group in a high-dimensional space is computed, and the test samples are clustered using a new Mapper-based topological algorithm at two levels: a global tier and local tiers. All parameters are either carefully chosen or ...
The International Society for Computational Biology (ISCB) each year recognizes the achievements of an early to mid-career scientist with the Overton Prize. This prize honors the untimely death of Dr. G. Christian Overton, an admired computational biologist and founding ISCB Board member. Winners of the Overton Prize are independent investigators who are in the early to middle phases of their careers and are selected because of their significant contributions to computational biology through research, teaching, and service. ISCB is pleased to recognize Dr. Christoph Bock, Principal Investigator at the CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences in Vienna, Austria, as the 2017 winner of the Overton Prize. Bock will give a keynote presentation at the 2017 International Conference on Intelligent Systems for Molecular Biology/European Conference on Computational Biology (ISMB/ECCB) in Prague, Czech Republic, held July 21-25, 2017.
Researchers are generating single-cell RNA sequencing (scRNA-seq) profiles of diverse biological systems [1–4] and every cell type in the human body [5]. Leveraging this data to gain unprecedented insight into biology and disease will require assembling heterogeneous cell populations across multiple experiments, laboratories, and technologies. Although methods for scRNA-seq data integration exist [6,7], they often naively merge data sets together even when the data sets have no cell types in common, leading to results that do not correspond to real biological patterns. Here we present Scanorama, inspired by algorithms for panorama stitching, that overcomes the limitations of existing methods to enable accurate, heterogeneous scRNA-seq data set integration. Our strategy identifies and merges the shared cell types among all pairs of data sets and is orders of magnitude faster than existing techniques. We use Scanorama to combine 105,476 cells from 26 diverse scRNA-seq experiments across 9 diff...
Sequencing technologies are capturing longer-range genomic information at lower error rates, enabling alignment to genomic regions that are inaccessible with short reads. However, many methods are unable to align reads to much of the genome, recognized as important in disease, and thus report erroneous results in downstream analyses. We introduce EMA, a novel two-tiered statistical binning model for barcoded read alignment, that first probabilistically maps reads to potentially multiple "read clouds" and then within clouds by newly exploiting the non-uniform read densities characteristic of barcoded read sequencing. EMA substantially improves downstream accuracy over existing methods, including phasing and genotyping on 10x data, with fewer false variant calls in nearly half the time. EMA effectively resolves particularly challenging alignments in genomic regions that contain nearby homologous elements, uncovering variants in the pharmacogenomically important CYP2D region,...
Mathematical models of cellular processes can systematically predict the phenotypes of novel combinations of multi-gene mutations. Searching for informative predictions and prioritizing them for experimental validation is challenging since the number of possible combinations grows exponentially in the number of mutations. Moreover, keeping track of the crosses needed to make new mutants and planning sequences of experiments is unmanageable when the experimenter is deluged by hundreds of potentially informative predictions to test. We present CrossPlan, a novel methodology for systematically planning genetic crosses to make a set of target mutants from a set of source mutants. We base our approach on a generic experimental workflow used in performing genetic crosses in budding yeast. We prove that the CrossPlan problem is NP-complete. We develop an integer-linear-program (ILP) to maximize the number of target mutants that we can make under certain experimental constraints. We apply our...
Bacterial microbiomes of incredible complexity are found throughout the world, from exotic marine locations to the soil in our yards to within our very guts. With recent advances in Next-Generation Sequencing (NGS) technologies, we have vastly greater quantities of microbial genome data, but the nature of environmental samples is such that DNA from different species is mixed together. Here, we present Opal for metagenomic binning, the task of identifying the origin species of DNA sequencing reads. Our Opal method introduces low-density, even-coverage hashing to bioinformatics applications, enabling quick and accurate metagenomic binning. Our tool is up to two orders of magnitude faster than leading alignment-based methods at similar or improved accuracy, allowing computational tractability on large metagenomic datasets. Moreover, on public benchmarks, Opal is substantially more accurate than both alignment-based and alignment-free methods (e.g. on SimHC20.500, Opal achieves 95% F1-...
We report a newly identified bias in CLIP data that results from cleaving enzyme specificity. This bias is inadvertently incorporated into standard peak calling methods [1], which identify the most likely locations where proteins bind RNA. We further show how, in downstream analysis, this bias is incorporated into models inferred by the state-of-the-art GraphProt method to predict protein RNA-binding. We call for both experimental controls to measure enzyme specificities and algorithms to identify unbiased CLIP binding sites.
Interactome mapping and functional genomics in Drosophila reveal common and specific components of a mitogen-activated protein kinase pathway.
Papers by Bonnie Berger