Genome Informatics 11: 106–117 (2000)

Expression Profiles and Biological Function

Juan Carlos Oliveros¹,³ (oliveros@cnb.uam.es), Christian Blaschke¹,³ (blaschke@cnb.uam.es), Javier Herrero¹,² (jherrero@cnio.es), Joaquín Dopazo² (jdopazo@cnio.es), Alfonso Valencia¹,⁴ (valencia@cnb.uam.es)

¹ Protein Design Group, Centro Nacional de Biotecnología (CNB-CSIC), Campus de Cantoblanco, 28049 Madrid, Spain
² Bioinformatics Unit, Centro Nacional de Investigaciones Oncológicas (CNIO), Ctra Majadahonda-Pozuelo Km 2, 28220 Majadahonda, Madrid, Spain
³ These authors have contributed equally to this work
⁴ To whom correspondence should be addressed

Abstract

Expression arrays facilitate the monitoring of changes in the expression patterns of large collections of genes. It is generally expected that genes with similar expression patterns correspond to proteins of common biological function. We assess this common assumption by comparing the level of similarity of expression patterns with the statistical significance of the biological terms that describe the corresponding protein functions. Terms are obtained automatically by mining large collections of Medline abstracts. We propose that the combined use of tools for expression profile clustering and automatic function retrieval can be useful for detecting biologically relevant associations between genes in complex gene expression experiments. The results obtained using publicly available experimental data show that, in general, an increase in the similarity of the expression patterns is accompanied by an increase in the amount of specific functional information or, in other words, that the selected terms become more specific as the expression patterns become more specific. Particularly interesting are the discrepancies from this general trend, i.e. groups of genes with similar expression patterns but very little in common at the functional level.
In these cases the similarity of their expression profiles becomes the first link between previously unrelated genes.

Keywords: text analysis, protein function, expression arrays, yeast, medline

1 Introduction

In the past few years the development of expression array technology has introduced an important technical novelty, facilitating the analysis of the expression levels of the genes of entire systems, e.g. the genomes of E. coli [14] and Saccharomyces cerevisiae [6, 15], or different human tissues [1, 12]; for a review see [4]. Among the many possibilities opened by the new technology, one of the most interesting is that of establishing links between genes with similar expression patterns, which are likely to share a similar mechanism of gene expression control. Interestingly, the relation between genes established by the similarity of their expression profiles is different from other possible connections created by similarities at the level of biochemical or cellular function. Even if it is commonly assumed that both types of connection go together in many cases, that is, that genes with similar expression patterns have similar functions, to our knowledge there have been no detailed studies of the correspondence between expression patterns and protein function. Indeed, as interesting as the possible general relation between expression patterns and functions are the possible exceptional cases in which genes with similar expression patterns have different functions.

We address here the general scientific question of how closely expression patterns and functions are related. Can the biological relevance of the gene expression clusters be detected by monitoring the significance of the associated functional information?
The data for analysing this question are gathered by combining two technologies, one for clustering expression profiles and a second for automatically deriving functional information. These technologies allow us, first, to cluster the expression patterns by their similarity in a progressive manner, including different levels of relation, and second, to automatically extract functional information common to groups of genes. The first task is carried out with a recently developed clustering method and the second with our tools for information extraction from the literature. The quantitative evaluation is carried out by following the “goodness of fit” of the groups of genes with similar expression patterns and the significance of the biological terms specific to the different groups of genes.

1.1 Classifying Groups of Genes with Similar Expression Profiles

Different approaches have been applied to the comparison of large numbers of expression patterns, including hierarchical clustering, multivariate analysis and neural networks [8, 16, 17]. We have recently proposed a method capable of producing hierarchical tree structures that facilitate the representation of higher order relationships between groups of profiles, without losing the advantageous properties of the direct classifications produced by unsupervised learning methods (SOTA¹; [10]). The underlying algorithm [7] is able to function with expression array data that include considerable amounts of noise, and can be used to estimate the reliability of the different branches of the final tree structure.
The SOTA approach has additional advantages: the binary tree representation is adequate for the visualisation of the data; the profile values associated with the nodes are equivalent to a weighted average of the corresponding profiles [13] and can be used directly as representative profiles of the associated genes; and the similarity of the expression patterns associated with each node can be used directly to estimate the quality of the different gene clusters.

1.2 Extracting Information about Biological Function

DNA arrays usually contain genes for which very different levels of biological information are available, including many genes of unknown function. We have previously developed a system (GEISHA²) for automatically recovering functional information specific to the different clusters by direct extraction from textual sources [5], including abstracts stored in Medline [18], functional information associated with sequence databases such as SwissProt [3], or specialised information in repositories such as YPD [11]. The GEISHA system recovers significant information in the form of terms associated with the different clusters. Together with these terms, it provides a quantification of their statistical significance and a selection of the best sentences and abstracts in which the terms were identified. The analysis of the publicly available experiments for yeast [8] reveals that the automatically extracted information contains terms that clearly identify the function of the different gene expression clusters, in good agreement with the original annotations derived by human experts.

¹ Self-Organising Tree Algorithm
² Gene Expression Information System for Human Analysis

2 Methods

2.1 Tree-Clustering of Expression Profiles (SOTA)

The clustering of the gene expression patterns was carried out with SOTA [7], an algorithm based both on self organising maps [13] and growing cell structures [9].
It maps the complex input space to a simpler binary tree topology. This structure grows from the root of the tree, where all expression profiles are mapped to one node, toward the leaves which contain only one profile. The final structure can be asymmetrical, including branches with different number of nodes and can be stopped at the desired level to adjust the homogeneity of the resulting clusters to the particular needs. Expression array data are typically arranged in tables where rows represent genes and columns expression values in the form of intensities (see for example, [8]). In many cases, they are given as ratios between the expression values and a reference condition. Since raw experimental data often display highly asymmetrical distributions, a posterior logarithmic transformation compresses the scale and produces symmetrical values around zero. Distances are obtained from the pair-wise comparison of gene expression patterns as a common Euclidean distance, Pearson correlation coefficients or correlation coefficients with an offset of 0, a choice measurement when the data are serial measurements with respect to an initial state of reference with value zero, i.e. time series [8]. In the SOTA implementation each node is a profile vector equivalent to the vectors of gene expression profiles. In the beginning, one root-node is constructed as a mean of all the expression profiles and divided during the training in two nodes that contain more similar profiles. This process is continued to generate groups of genes with highly similar profiles linked to other groups by the generated tree structure, giving information about the relative distances between the groups. The growth of the network is directed by the dispersion value [7, 9, 13], defined as the mean value of the distances between a node and the expression profiles associated with it. 
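The preprocessing and distance steps described above can be illustrated with a minimal sketch. It is not the SOTA implementation: the function names are hypothetical, the node vector is taken as a plain mean of its profiles, and the conversion from correlation to distance (1 minus the correlation) is an assumption, since the text does not state it.

```python
import math

def log2_ratios(ratios):
    """Log2-transform expression ratios so that induction and repression
    of equal magnitude are symmetrical around zero."""
    return [math.log2(r) for r in ratios]

def offset0_correlation(x, y):
    """Correlation coefficient with an offset of 0: the profile means are
    NOT subtracted, so zero is treated as the reference state.
    Undefined (division by zero) for an all-zero profile."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def distance(x, y):
    # Assumption: distance = 1 - correlation (0 for identical shapes).
    return 1.0 - offset0_correlation(x, y)

def mean_profile(profiles):
    """Node vector built as the mean of its profiles, as for the root node."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def dispersion(node, profiles):
    """Mean distance between a node vector and the profiles assigned to it."""
    return sum(distance(node, p) for p in profiles) / len(profiles)

# Toy data: three genes measured at three time points, as ratios to a reference.
raw = [[2.0, 4.0, 0.5], [2.0, 8.0, 0.25], [0.5, 0.25, 4.0]]
profiles = [log2_ratios(r) for r in raw]
root = mean_profile(profiles)
print("root dispersion:", dispersion(root, profiles))
```

In this toy set the third gene is anti-correlated with the first two, so it dominates the root dispersion; splitting it into its own node would lower the dispersion of both resulting nodes, which is the behaviour that drives the growth of the tree.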
The criterion used for monitoring the convergence of the network is the total error, defined as the sum of the dispersion values of all the terminal nodes. The algorithm proceeds by expanding the network from the node having the most heterogeneous population of associated input gene expression profiles. The growth of the network ends when the maximum dispersion value among all the terminal nodes falls below a given threshold. The maximum distance between pairs of gene expression profiles of a node (variability) can also be used as a threshold. Depending on the value chosen, the resulting hierarchical tree structure can be built to the desired level. In the current examples we used data from [15] with a base 2 logarithmic transformation and a correlation coefficient with an offset of 0.

2.2 Extraction of Functional Information (GEISHA)

The application of GEISHA requires, first, the selection of a body of text, for example, all the Medline abstracts that contain the word “yeast” in the text or “Saccharomyces cerevisiae” in the MeSH terms. An abstract is associated with a gene if it contains the gene name or any of its known synonyms. Based on this selection, the abstracts are associated with the corresponding clusters. If genes from different clusters appear in one abstract, the abstract is associated with all of those clusters. By comparing the frequency of a term in one cluster (number of abstracts in which the term occurs / total number of abstracts in that cluster) to its frequencies in the other clusters, the significance of a term for a cluster can be computed (simply put, a significant term is one that appears in one or a few clusters with a frequency significantly higher than in the rest).
This is estimated by the Z-Score [2],

$$\text{Z-Score} = \frac{f_i^a - \bar{f}^a}{SD^a} \qquad (1)$$

(with $f_i^a$ the frequency of term a in cluster i, $\bar{f}^a$ the mean frequency of term a and $SD^a$ the standard deviation of the distribution of this term), which has to reach a minimum value of 2.0 for a term to be selected.

[Figure 1 diagram: a DNA array, gene clusters at tree Levels 1, 2 and 3 (gene1 to gene15), the representative expression profile and the associated literature for each cluster, and the resulting significant terms for each node. Expression profiles and functional information get more specific from Level 1 to Level 3.]

Figure 1: Schema of the process of clustering the gene expression profiles and analysing the biological terms contained in the associated text. The general process starts with the application of SOTA, which organises the profiles in a binary tree-like structure. Groups of genes (nodes) can be compared at different levels, three of which are represented in the figure. For each node, a representative expression profile is obtained, representing the general behaviour of the associated genes. In a second step, functional information is added to each node. Medline abstracts are associated with a node when the names of its genes appear in the corresponding entries. The statistical significance of the association between gene clusters and functional terms is obtained by comparing the abundance of Medline entries containing significant terms with that in the other nodes of the same tree level.

The minimum amount of textual information considered necessary to calculate the frequency $f_i^a$ is 25 abstracts. Groups with fewer abstracts are not taken into account in the analysis.
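As an illustration of the term frequency, the Z-Score of Eq. (1) and the two cutoffs (Z-Score of at least 2.0, at least 25 abstracts), the following sketch scores one term across the nodes of a tree level. It is not the GEISHA code: the function names and the toy data are hypothetical, abstracts are represented simply as sets of terms, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
import statistics

MIN_ABSTRACTS = 25   # minimum number of abstracts per cluster (from the text)
MIN_Z_SCORE = 2.0    # minimum Z-Score for a term to be selected (from the text)

def term_frequency(term, abstracts):
    """Fraction of a cluster's abstracts (each a set of terms) containing the term."""
    return sum(term in abstract for abstract in abstracts) / len(abstracts)

def z_scores(term, clusters):
    """Z-Score of `term` in every cluster that has enough abstracts.

    clusters: dict mapping cluster name -> list of abstracts (sets of terms).
    """
    usable = {name: abstracts for name, abstracts in clusters.items()
              if len(abstracts) >= MIN_ABSTRACTS}
    freqs = {name: term_frequency(term, abstracts)
             for name, abstracts in usable.items()}
    values = list(freqs.values())
    if len(values) < 2:
        return {}
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        return {name: 0.0 for name in freqs}
    return {name: (f - mean) / sd for name, f in freqs.items()}

def make_abstracts(n_with_term, n_total, term="histone"):
    """Toy cluster: n_with_term abstracts mention the term, the rest do not."""
    return [{term, "yeast"}] * n_with_term + [{"yeast"}] * (n_total - n_with_term)

clusters = {"A": make_abstracts(20, 25)}
for name in "BCDEF":
    clusters[name] = make_abstracts(1, 25)
clusters["tiny"] = make_abstracts(10, 10)  # below the 25-abstract minimum

scores = z_scores("histone", clusters)
significant = [name for name, z in scores.items() if z >= MIN_Z_SCORE]
print(significant)
```

One consequence of this choice of estimator is that, with k usable clusters, a Z-Score can never exceed the square root of k - 1, so a term cannot reach the 2.0 cutoff at a tree level with fewer than five usable nodes. Whether the original system behaves this way depends on its estimator, which is not specified here.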
Before the frequencies are computed, the words are reduced to their roots (for singular and plural forms like “kinase”, “kinases” and different verb forms like “phosphorylate”, “phosphorylates”, “phosphorylated”) by simple rules that do not take into account spelling differences or irregular verb forms. The text is then searched for compound words (e.g. “DNA analysis”, “cell cycle”) by comparing the frequency of a word pair to the value expected from the frequencies of the individual words and selecting the pairs with significantly higher co-occurrence. “Term” is used here to refer to both single words and word pairs.

2.3 The Analysis

The experimental data analysed here were obtained from [15], where the yeast cell cycle was explored in 76 time points corresponding to 6 different experiments. A body of text composed of 5472 Medline abstracts was collected from the NLM data server [18]; 792 genes that presented substantial variations in their expression patterns were analysed by SOTA; 442 of these genes appeared in Medline abstracts and were analysed by GEISHA.

[Figure 2 plot: mean variability and mean dispersion (y-axis, 0 to 1.1) against tree level (x-axis, 1 to 8).]

Figure 2: Progress of the clustering process. The mean values of variability, defined as the maximum distance among gene expression profiles in a cluster, and dispersion, calculated as the mean distance between the node vector and each gene expression profile, are represented. Note that both values decrease as the resolution of the tree increases. In this analysis the tree has eight levels, from the root at level 1 to the smallest nodes at level 8. Error bars correspond to the standard deviation of the variability and dispersion distributions at each level of the tree.

The results of the analysis are provided in a tree-like form, where each node contains information about the expression patterns and the biochemical function description of the associated genes (Fig. 1). Two values reflect the quality of a node.
First, the dispersion value, corresponding to the similarity between the associated expression patterns, and second, the Z-Score of the extracted terms, estimating their significance in relation to the other nodes. The complete set of results discussed here and additional information about the different methods are accessible in the form of a web page [19].

3 Results

The clustering process starts at the root of the tree and proceeds by splitting those nodes that correspond to the most variable profiles. As this process proceeds, the nodes become better defined and the profiles assigned to them become closer to the average node profile. We used two parameters to measure the “goodness of fit” of the profiles in the corresponding node: variability and dispersion, representing, respectively, the maximum distance between the profiles associated with a node and the mean distance between those profiles and the node vector (see Methods). Both of these decrease progressively along the tree (Fig. 2). When the values of the individual branches of the tree are followed, it is possible to distinguish cases in which the clustering proceeds at different speeds (Fig. 3). Rapid decreases in the cluster variability correspond to the creation of very homogeneous clusters, whereas relatively slow decreases indicate that the resulting nodes still contain genes of very diverse expression patterns. When the nodes have reached a variability lower than the threshold, they are not further divided (long horizontal lines in the figure).

Figure 3: Variability at different branches of the tree. The variability (maximum distance among gene expression profiles in a cluster, see Methods) for each node is represented. Nodes of the same branch of the tree are connected with a line. In some cases the variability is 0, corresponding to clusters of a single gene. Lines parallel to the x-axis correspond to nodes that have reached a variability under the threshold and are not further divided into smaller clusters.

3.1 Functional Information at Different Levels of the Tree

The specificity of the functional information associated with the different nodes is quantified by the average Z-Score of the corresponding terms (for the calculation see Methods). The Z-Scores for the individual terms are averaged for each node and show a clear increase from the general to the specific clusters (root to leaves, see Fig. 4), indicating the accumulation of specific functional information in the different clusters. This increase is clearly significant, as can be seen by comparison with a random assignment of Medline abstracts to genes (Fig. 4).

The analysis of the behaviour of different tree branches provides additional information about the properties of the functional information. The average Z-Scores of the different terms, represented in Fig. 5, point to the existence of interesting trends:

a) The most frequent observations are those cases in which the information associated with the clusters becomes more relevant along the tree, with a corresponding increase in Z-Scores. Often regions of abrupt increase of the Z-Score are observed, corresponding to gene clusters which at that point achieve their full functional meaning.

b) In some cases, the continuous increase of the Z-Score is interrupted by episodic decreases, corresponding to points at which a node splits, producing groups that contain genes of more heterogeneous functions than the parental node and, consequently, smaller average Z-Scores.

c) An interesting group of branches maintains relatively low Z-Score values, indicating that, in general, the associated functions are quite heterogeneous and/or there is little functional information available in the literature analysed.

An example of the information contained in a parental node and in its two split nodes can help to understand the general tendencies described above. Cluster 10.10.10.53 groups 52 genes, with 312 associated abstracts (Fig. 6).
The expression patterns are quite similar, with a variability of 0.7 and a dispersion value of 0.2. At the functional level, the cluster includes five well defined biological functions, as revealed by the detailed manual analysis of the available information in different databases (YPD, SwissProt and Medline). The corresponding terms include relevant information for the function of the cluster, for example: “segregation”, for chromatid segregation; “microtubule”, related to the functions of chromatid segregation and transport; “polymerase”, corresponding to the main actor in DNA replication and repair; and “spindle” and “chromosome”, the main structures involved in the segregation of the chromatids. Two other terms have a less general meaning: “temperature” appears because many of the experimental studies in this field have been carried out in temperature sensitive strains, and “pol1” is a specific term for a polymerase, very often quoted in the literature.

The two derived nodes contain 34 and 18 genes that have substantially similar expression patterns (variability values of 0.57 and 0.65). In one of them, cluster 10.10.10.53.81, the genes show a clear specificity in the associated biological function, with more than 65% of them corresponding to four well defined functions and more than half of them being related to the processes of DNA replication and chromatid segregation. Interestingly, the sister node also contains genes with very similar profiles but is less well defined with regard to their biological functions. For example, only 20% of the genes were clearly associated with the five main functional groups. At the level of the automatically extracted terms, the first of these two nodes contains well defined terms, such as “segregation”, “polymerase” and “chromosome”, that were already important for the parental node, while in the second node many different terms appear, such as “cytoplasmic”, “centromere” or “suppressor”, that were not present in the parental cluster. This adds further evidence of the functional heterogeneity of genes that are associated by the similarity of their expression patterns.

Figure 4: Mean values of the term Z-Scores for the nodes at the different levels of the tree. The tree based on the experimental results is compared with a randomisation of the data where the abstracts for each gene were substituted by other abstracts randomly chosen from the same set, keeping the same number of abstracts for each node. The randomly generated data show a small increase in the Z-Scores together with an increase in the dispersion of the values. This increase corresponds to the increasing probability that random observations become significant as the node sizes decrease.

Figure 5: Progress of the mean values of the Z-Scores for the different branches of the tree. The average value of the Z-Scores for the terms of each cluster is represented. Nodes of the same branch of the tree are connected with a line, using a representation equivalent to the one in Fig. 3.

[Figure 6 content, reconstructed from the three boxes of the original figure:]

Cluster           Abstracts  Genes  Mean Z-Score of terms  Variability  Dispersion
10.10.10.53       312        52     3.05                   0.716607     0.190920
10.10.10.53.81    264        34     3.61                   0.566927     0.170563
10.10.10.53.82    51         18     3.02                   0.645772     0.172224

Function: number of genes (%)   10.10.10.53   10.10.10.53.81   10.10.10.53.82
DNA replication/repair          12 (23%)      10 (29%)         2 (4%)
Chromatid segregation           14 (27%)      10 (29%)         4 (8%)
Heteropolysaccharides rel.      3 (6%)        2 (6%)           1 (2%)
Budding and morphogenesis       2 (4%)        1 (3%)           1 (2%)
Transport across vesicle        2 (4%)        not listed       2 (4%)

Terms:
10.10.10.53: segregation, temperatures, microtubule, polymerase, spindles, chromosome, pol1...
10.10.10.53.81: segregation, polymerase, chromosome, pol1...
10.10.10.53.82: cytoplasmic, detect, centromere, suppressors...

Figure 6: Example of biological function associated with a gene node. The clusters are labelled by a serial number that indicates their position in the tree (accessible at [19]). The number of genes corresponds to those genes with expression patterns similar to the node vector. The number of abstracts corresponds to the set of Medline entries that contain the name of at least one of the genes of the node. The indicated Z-Score values correspond to the average value for all of the terms. Some of these terms are included in the figure. Genes without available bibliographic information were used for the clustering procedure but not for the functional analysis.

[Figure 7 plot: variability (y-axis, 0 to 1.2) against mean Z-Score (x-axis, 2 to 6), with a different symbol for the nodes of each tree level (1 to 8) and for the centre of mass of each level.]

Figure 7: Relation between the average Z-Score and the variability for all the nodes of the tree. Only nodes with at least one significant term and more than 25 associated Medline abstracts are represented. The representation uses different symbols to indicate the nodes of different tree levels, from 1 to 8. The centre of mass is calculated for the nodes of each level.
3.2 Relation Between Expression Patterns and Functional Information

Two parallel processes are seen here: while clustering proceeds toward higher similarities of the expression profiles, the associated functional information becomes more specific (Fig. 7). The upper levels of the clustering correspond to less similar expression profiles and less well defined common functions, while the lower levels are composed of nodes with more similar expression profiles and genes that have a greater similarity in their known functions. It is interesting to note that the value of the Z-Score does not increase on average in the lowest levels (levels 7 and 8). This tendency may indicate that once a given information level has been reached, further subdivision of clusters, based on minor differences in expression patterns, does not lead to an increase in the amount of functional information. It is also interesting to observe the difference between the dispersion of the values for expression profiles and for extracted keywords, probably reflecting the quality of the underlying information, which is obviously better for the experimentally determined expression profiles. Unfortunately, the small size of the clusters at this level does not allow a detailed analysis, which would require a larger collection of expression profiles.

3.3 Behaviour of Significant Terms

The above analysis indicates that terms of clear biological significance, even if they are not very abundant in a parental node, can have a considerable significance as indicated by their Z-Scores. Once the terms become specifically associated with the nodes, their frequency and Z-Score increase. Interestingly, there are cases in which the terms become less represented, since a function present in a parental node does not continue to be associated with some of the derived nodes. In an effort to substantiate these general observations, we have analysed a number of examples in detail, and present two of them in Fig. 8.

In the first example (Fig. 8b), there is a clear correspondence between the quality of the expression clusters and the corresponding biological information. Cluster 10.10.10.46.54.54 includes 23 genes, 9 of them corresponding to histones and the others to diverse functions such as cell wall maintenance and transport of polysaccharides. The associated terms in most cases clearly represent the functions (“histone”, “glucan”, “wall”...). The split of the cluster creates a new cluster that contains all the histone genes, with a clear enhancement of the similarity of the corresponding expression patterns and of the quality of the corresponding terms (the average score goes from 2.85 to 4.12, with new terms appearing such as “chromatin” and “nucleosome”), as expected from the presence of a very homogeneous set of genes. The sister cluster (10.10.10.46.54.54.227) contains genes related to cell wall maintenance and to the synthesis and transport of polysaccharides, which do not correspond to a unique biological function. Consequently, the Z-Score of the terms increases only slightly, from 2.85 to 3.02. The third cluster at the same level of resolution does not contain enough genes and abstracts to be evaluated.

The second example illustrates a more complex reality (Fig. 8a). The cluster of proteins associated with different functions can be better described by the term “chromosome segregation”, which is significant for the description of some of the functions associated with the parental node (Z-Score of 2.94). However, its real specificity appears when the parental node is further divided and one of the derived nodes, with 17 genes, contains this term with a Z-Score of 6.27, while it does not appear in the sister node. A similar observation is made for other terms like “DNA glycosylase”, “DNA polymerase” or “DNA damage”.

[Figure 8a content, reconstructed from the original figure; each term is followed by its frequency/Z-Score:]

Parental node: polymerase 0.17/2.82; dna polymerase 0.13/3.37; dna glycosylase 0.06/6.50; chromosome segregation 0.10/2.94; dna damage 0.08/2.18; pol1 0.14/6.27
First split, node 1: chromosome segregation 0.31/6.27
First split, node 2: polymerase 0.34/4.19; dna polymerase 0.29/4.50; dna glycosylase 0.15/7.75; dna damage 0.14/2.94; pol1 0.34/7.70
Second split (of node 2), node 1: polymerase 0.48/5.49; dna polymerase 0.45/5.98; pol1 0.56/8.19
Second split (of node 2), node 2: polymerase 0.26/2.32; dna glycosylase 0.44/8.29; dna damage 0.22/3.99

[Figure 8b content, reconstructed from the original figure:]

Cluster                  Abstracts  Genes  Mean Z-Score of terms  Variability  Dispersion
10.10.10.46.54.54        173        23     2.85                   0.520030     0.150731
10.10.10.46.54.54.211    51         10     4.12                   0.309913     0.061938
10.10.10.46.54.54.227    119        8      3.02                   0.645772     0.172224

Function: number of genes (%)
10.10.10.46.54.54: Histones 9 (39%); Heteropolysaccharides rel. 6 (26%); Cell wall maintenance 3 (13%)
10.10.10.46.54.54.211: Histones 9 (90%)
10.10.10.46.54.54.227: Heteropolysaccharides rel. 2 (4%); Cell wall maintenance 4 (8%)

Terms:
10.10.10.46.54.54: histone, glucan, glycoproteins, anchoring, wall...
10.10.10.46.54.54.211: histone, nucleosome, h2a, h2b...
10.10.10.46.54.54.227: glucan, glycoproteins, wall, membranes...

Figure 8: Examples of the clustering of expression profiles and biological terms. a: Cluster of DNA associated proteins and, b: Cluster of expression patterns containing histone genes. The figure shows two snapshots of the clustering process, to illustrate the general tendency toward a reduction of the number of genes and Medline abstracts, accompanied by an increase in the frequency and specificity of the biological terms. It also includes a case (“polymerase” in the last subdivision of Fig. 8a) in which the reliability of the terms (as measured by the Z-Score) decreases, indicating the simultaneous occurrence of different functions in a group of genes with similar expression patterns.
An interesting case is the term “polymerase”, whose Z-Score rises from 2.82 to 4.19 until the last node is divided; then in one of the resulting nodes its significance increases to 5.49, while in the sister cluster its value decreases to 2.32. This Z-Score is smaller than in the parental group, indicating that the information related to “polymerase” is still present in both groups, but that in only one of them is it dominant, while the second group leans toward functions related to DNA glycosylation.

4 Conclusions

The approach presented here is based on the simultaneous comparison of patterns of gene expression and their corresponding functional annotations. For the task of comparing expression profiles, we have used a new clustering algorithm based on self organising maps, and for the analysis of functional annotations, a process that directly extracts key terms from the scientific literature. We believe that the combination of these two approaches could become a powerful tool for analysing expression array data and will open new insights into the interpretation of the experimental results.

With these tools, we have directly assessed the relation between gene regulation (similarity in expression patterns) and biochemical function (significant terms). We propose a quantitative study of the similarity of the gene clusters and the amount of information contained in the associated literature. The analysis of the expression array data on yeast cell division shows a clear tendency for groups of genes with similar expression patterns to have a common function described by terms statistically associated with them.

Especially interesting are those cases that differ from the general trend, such as gene clusters that are further subdivided into clusters of increasingly similar expression patterns that surprisingly do not correspond to more specific functional information, revealing new relations between genes that are regulated by a common mechanism and for which, so far, no functional relations have been described. These cases are likely candidates for new biological discoveries.

5 Acknowledgements

C. Blaschke developed the language analysis software, and J. C. Oliveros the software for the analysis of expression arrays. C. B. and J. C. O. prepared the examples discussed in the text. J. Herrero and J. Dopazo developed the clustering software. J. H. contributed to the combined analysis of expression arrays and significant terms. A. Valencia originated the initial idea and organised the work and the manuscript. We are indebted to Keith Harsman for critical reading of the manuscript.

References

[1] Alizadeh, A.A., Eisen, M.B., et al., Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature, 403:503–511, 2000.
[2] Andrade, M.A. and Valencia, A., Automatic extraction of keywords from scientific text: Application to the knowledge domain of protein families, Bioinformatics, 14:600–607, 1998.
[3] Bairoch, A. and Apweiler, R., The SWISS-PROT protein sequence data bank and its supplement TREMBL, Nucl. Acids Res., 28:46–48, 2000.
[4] Berns, A., Gene expression in diagnosis, Nature, 403:491–492, 2000.
[5] Blaschke, C., Oliveros, J.C., and Valencia, A., Mining functional information associated to expression arrays, Functional and Integrative Genomics, (in press), 2000.
[6] DeRisi, J.L., Iyer, V.R., and Brown, P.O., Exploring the metabolic and genetic control of gene expression on a genomic scale, Science, 278:680–686, 1997.
[7] Dopazo, J. and Carazo, J.M., Phylogenetic reconstruction using a growing neural network that adopts the topology of a phylogenetic tree, J. Mol. Evol., 44:226–233, 1997.
[8] Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D., Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci.
USA, 95:14863–14868, 1998.
[9] Fritzke, B., Growing cell structures. A self-organising network for unsupervised and supervised learning, Neural Networks, 7:1141–1160, 1994.
[10] Herrero, J., Valencia, A., and Dopazo, J., A hierarchical unsupervised growing neural network for clustering gene expression patterns, Bioinformatics, (in press), 2000.
[11] Hodges, P.E., McKee, A.H.Z., Davis, B.P., Payne, W.E., and Garrels, J.I., Yeast Proteome Database (YPD): a model for the organization and presentation of genome-wide functional data, Nucl. Acids Res., 27:69–73, 1999.
[12] Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T., Lee, J.C.F., Trent, J.M., Staudt, L.M., Hudson, J., Boguski, M.S., Lashkari, D., Shalon, D., Botstein, D., and Brown, P.O., The transcriptional program in response of human fibroblasts to serum, Science, 283:83–87, 1999.
[13] Kohonen, T., The self-organising map, Proc. IEEE, 78:1464–1480, 1990.
[14] Richmond, C.S., Glasner, J.D., Mau, R., Jin, H., and Blattner, F.R., Genome-wide expression profiling in Escherichia coli K-12, Nucl. Acids Res., 27:3821–3835, 1999.
[15] Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., and Futcher, B., Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization, Mol. Biol. Cell, 9:3273–3297, 1998.
[16] Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S., and Golub, T.R., Interpreting patterns of gene expression with self-organising maps: methods and application to hematopoietic differentiation, Proc. Natl. Acad. Sci. USA, 96:2907–2912, 1999.
[17] Törönen, P., Kolehmainen, M., Wong, G., and Castrén, E., Analysis of gene expression data using self-organising maps, FEBS Letters, 451:142–146, 1999.
[18] http://www.ncbi.nlm.nih.gov/pubmed/
[19] http://montblanc.cnb.uam.es/SOTAandGEISHA/index.html