Microarrays are one of the latest breakthroughs in experimental molecular biology, which allow monitoring of gene expression for tens of thousands of genes in parallel and are already producing huge amounts of valuable data. Analysis and handling of such data is becoming one of the major bottlenecks in the utilization of the technology. The raw microarray data are images, which have to be transformed into gene expression matrices: tables where rows represent genes, columns represent various samples such as tissues or experimental conditions, and numbers in each cell characterize the expression level of the particular gene in the particular sample. These matrices have to be analyzed further if any knowledge about the underlying biological processes is to be extracted. In this paper we concentrate on discussing bioinformatics methods used for such analysis. We briefly discuss supervised and unsupervised data analysis and its applications, such as predicting gene function classes and cancer classification. Then we discuss how the gene expression matrix can be used to predict putative regulatory signals in the genome sequences. In conclusion we discuss some possible future directions.
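As an illustration of the gene expression matrix described above, the following minimal sketch stores a toy matrix (rows are genes, columns are samples) and compares gene profiles by Pearson correlation, a common starting point for unsupervised analysis. The gene names, sample names, values, and the helper functions `pearson` and `most_similar` are all hypothetical and are not taken from the paper.

```python
from math import sqrt

# Hypothetical toy expression matrix: each key is a gene (row),
# each list position corresponds to a sample (column).
samples = ["tissue_A", "tissue_B", "tissue_C", "tissue_D"]
expression = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.0, 4.0, 6.0, 8.0],
    "gene3": [4.0, 3.0, 2.0, 1.0],
}


def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def most_similar(gene, matrix):
    """Return the other gene whose profile correlates best with `gene`."""
    others = [g for g in matrix if g != gene]
    return max(others, key=lambda g: pearson(matrix[gene], matrix[g]))
```

In this toy matrix `gene1` and `gene2` rise together across the samples while `gene3` falls, so a correlation-based clustering would group the first two and separate the third.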
Expression Profiler (EP, http://ep.ebi.ac.uk/) is a set of tools for the analysis and interpretation of gene expression and other functional genomics data. These tools perform expression data clustering, visualization, and analysis, integration of expression data with protein interaction data and functional annotations, such as Gene Ontology, and the analysis of promoter sequences for predicting transcription factor binding sites. Several clustering analysis method implementations and tools for sequence pattern discovery provide a rich data mining environment for various types of biological data. All the tools are Web-based, with minimal browser requirements. Analysis results are cross-linked to other databases and tools available on the Internet. This enables further integration of the tools and databases, for instance with public microarray gene expression databases such as ArrayExpress.
We have examined methods and developed a general software tool for finding and analyzing combinations of transcription factor binding sites that occur relatively often in gene upstream regions (putative promoter regions) in the yeast genome. Such frequently occurring combinations may be essential parts of possible promoter classes. The regions upstream of all genes were first isolated from the yeast genome database MIPS using the information in the annotation files of the database. The ones that do not overlap with coding regions were chosen for further studies. Next, all occurrences of the yeast transcription factor binding sites, as given in the IMD database, were located in the genome and in the selected regions in particular. Finally, by using general-purpose data mining software in combination with our own software, which parametrizes the search, we can find the combinations of binding sites that occur in the upstream regions more frequently than would be expected on the basis of the frequency of individual sites. The procedure also finds so-called association rules present in such combinations. The developed tool is available for use through the WWW.
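The idea of a combination occurring "more frequently than would be expected on the basis of the frequency of individual sites" can be sketched as a ratio of observed to expected co-occurrence, where the expected value assumes the sites occur independently. This is only an illustrative measure, not the paper's actual data mining software; the region names, site names, and the function `pair_enrichment` are hypothetical.

```python
from itertools import combinations

# Hypothetical data: upstream region -> set of binding-site names found in it.
regions = {
    "up1": {"siteA", "siteB"},
    "up2": {"siteA", "siteB"},
    "up3": {"siteA", "siteB", "siteC"},
    "up4": {"siteC"},
    "up5": {"siteC"},
    "up6": set(),
    "up7": set(),
    "up8": {"siteC"},
}


def pair_enrichment(regions):
    """For each pair of sites, return observed / expected co-occurrence.

    Expected co-occurrence is the product of the individual site
    frequencies, i.e. what independence between the sites would predict.
    """
    n = len(regions)
    sites = sorted(set().union(*regions.values()))
    freq = {s: sum(s in found for found in regions.values()) / n for s in sites}
    result = {}
    for a, b in combinations(sites, 2):
        observed = sum(a in f and b in f for f in regions.values()) / n
        expected = freq[a] * freq[b]
        result[(a, b)] = observed / expected if expected > 0 else 0.0
    return result


enrichment = pair_enrichment(regions)
```

A ratio well above 1, as for the pair (`siteA`, `siteB`) in this toy data, marks a combination that co-occurs more often than independence predicts; a ratio below 1 marks one that co-occurs less often.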
We have developed a set of methods and tools for automatic discovery of putative regulatory signals in genome sequences. The analysis pipeline consists of gene expression data clustering, sequence pattern discovery from upstream sequences of genes, a control experiment for pattern significance threshold limit detection, selection of interesting patterns, grouping of these patterns, representing the pattern groups in a concise form and evaluating the discovered putative signals against existing databases of regulatory signals. The pattern discovery is computationally the most expensive and crucial step. Our tool performs a rapid exhaustive search for a priori unknown statistically significant sequence patterns of unrestricted length. The statistical significance is determined for a set of sequences in each cluster with respect to a set of background sequences, allowing the detection of subtle regulatory signals specific for each cluster. The potentially large number of significant patterns is reduced to a small number of groups by clustering them by mutual similarity. Automatically derived consensus patterns of these groups represent the results in a comprehensive way for a human investigator. We have performed a systematic analysis for the yeast Saccharomyces cerevisiae. We created a large number of independent clusterings of expression data, simultaneously assessing the "goodness" of each cluster. For each of the over 52,000 clusters acquired in this way we discovered significant patterns in the upstream sequences of respective genes. We selected nearly 1,500 significant patterns by formal criteria and matched them against the experimentally mapped transcription factor binding sites in the SCPD database. We clustered the 1,500 patterns into 62 groups, for which we automatically derived alignments and consensus patterns. Of these 62 groups, 48 had patterns that have matching sites in the SCPD database.
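One way to score a pattern in a cluster against background sequences is a hypergeometric tail probability. The abstract does not say which statistic the tool uses, so the test below is an assumption chosen for illustration; the sequences, the pattern, and the functions `count_matches` and `hypergeom_tail` are hypothetical.

```python
from math import comb


def count_matches(pattern, sequences):
    """Number of sequences that contain the pattern at least once."""
    return sum(pattern in s for s in sequences)


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n sequences are drawn without replacement from N,
    of which K contain the pattern."""
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


# Hypothetical upstream sequences: a small cluster plus the remaining genes.
cluster = ["TTACGTAA", "GGACGTCC", "ACGTACGT"]
rest = ["TTTTTTTT", "GGGGCCCC", "ATATATAT", "CCCCGGGG",
        "TATATATA", "GCGCGCGC", "AAAATTTT"]
background = cluster + rest  # assumption: the background includes the cluster

pattern = "ACGT"
k = count_matches(pattern, cluster)      # cluster sequences with the pattern
K = count_matches(pattern, background)   # all sequences with the pattern
p_value = hypergeom_tail(k, len(cluster), K, len(background))
```

A small `p_value` means the pattern is concentrated in the cluster far more than a random draw of the same size from the background would suggest.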
One of the underlying principles of scientific publication in peer-reviewed journals has been the requirement that the authors make available the data and materials necessary for a reader to reproduce the experiment or analysis and to determine whether the data support ...
Management and analysis of the huge amounts of data produced by microarray experiments is becoming one of the major bottlenecks in the utilization of this high-throughput technology. We describe the basic design of a microarray gene expression database to help microarray users and their informatics teams to set up their information services. We describe two data models, a simpler one called ArrayExpressB and the complete model ArrayExpressC, and discuss some implementation issues. For latest developments see http://www.ebi.ac.uk/arrayexpress.
Papers by Alvis Brazma