CHAPTER 4

Robust Methods for Microarray Analysis

GEORGE S. DAVIDSON, SHAWN MARTIN, KEVIN W. BOYACK, BRIAN N. WYLIE, JUANITA MARTINEZ, ANTHONY ARAGON, MARGARET WERNER-WASHBURNE, MÓNICA MOSQUERA-CARO, and CHERYL WILLMAN

4.1. INTRODUCTION

The analysis of a complex system within an environment that is only subject to incomplete control is nearly impossible without some way to measure a large fraction of the system's internal state information. As a result, it is only with the recent advent of high-throughput measurement technologies (able to simultaneously measure tens of thousands of molecular concentrations) that systems biology is really a possibility. As an example of the scope of this problem, consider that eukaryotic cells typically have on the order of tens of thousands of genes, each of which is likely to have several alternative splicing variants coding for the protein building blocks of the cell. These proteins undergo posttranslational modifications and have multiple phosphorylations such that there are likely to be hundreds of thousands, possibly millions, of variants. Hence the future of systems biology relies critically on high-throughput instruments, such as microarrays and dual mass spectrometers. The research reported here addresses the need for improved informatics to deal with the large volume of information from these techniques. At present, microarray data are notoriously noisy. Hopefully this technology will improve in the future, but for the immediate present, it is important that the analysis tools and informatics systems developed for microarray analysis admit and adjust to significant levels of noise. In particular, such methods should be stable in the presence of noise and should assign measures of confidence to their own results. In this chapter we describe our efforts to implement and assess reliable methods for microarray analysis.
We begin with the structure of the typical microarray experiment, normalization of the resulting data, and ways to find relationships. We then proceed to discuss the tools themselves as well as methods to assess the output of such tools. Much of the work in this chapter is based on the VxInsight visualization software [1]. However, we have tried to minimize overlap with existing published results and have emphasized new work. Throughout each section we will present results and examples from our research to motivate the specific approaches, algorithms, and analysis methods we have developed. The data sets we have analyzed will not be emphasized, although they were of course critical to the work. Most of the data sets we have used have been published, and these will be cited in the course of the chapter. The data sets so far unpublished have been provided by the University of New Mexico Cancer Research and Treatment Center. These data sets include an infant leukemia data set, a precursor-B childhood leukemia data set, and an adult leukemia data set, all of which have been presented at the American Society of Hematology conferences in 2002 and/or 2003. Publications concerning the biological implications of these data sets are forthcoming. Here we discuss methods only. Finally, we emphasize that this chapter is by no means a survey of techniques for microarray analysis. Indeed, the literature on microarray analysis is vast, and new papers appear frequently. We will mention, however, some of the seminal papers, including [2, 3], which describe the original technology, as well as [4], which gives an overview of the earlier work on microarray technology.

(Genomics and Proteomics Engineering in Medicine and Biology. Edited by Metin Akay. Copyright © 2007 the Institute of Electrical and Electronics Engineers, Inc.)
Of course, microarrays would not be terribly useful without computational analysis, and some of the original work on microarray analysis includes hierarchical clustering of the yeast cell cycle [5, 6], an analysis of leukemia microarray data using an original method of gene selection [7], and an application of the singular-value decomposition to microarray data [8]. In addition to innumerable additional papers, there are a variety of books available which emphasize different aspects of microarrays and microarray analysis. A few of the more recent books include [9–12].

4.2. MICROARRAY EXPERIMENTS AND ANALYSIS METHODS

4.2.1. Microarray Experiments

Microarray experiments and their analyses seek to detect effects in gene expression under different treatments or natural conditions with the goal of clarifying the cellular mechanisms involved in the cells' differential responses. Uncertainty is the rule rather than the exception in these analyses. First, the underlying systems (cells and/or tissues) are incredibly complex whether viewed from the dynamic process perspective or their physical realizations in space and time. Second, there is abundant variability between cells experiencing exactly the same conditions as a result of genetic polymorphisms as well as the stochastic nature of these chemical systems. Third, the collection and initial preservation of these cells are not a precisely controlled process. For example, when a surgeon removes a cancer, the tissue may not be frozen or otherwise processed for minutes to hours, all the while the cells continue to respond to these unnatural conditions. Further, the tissues, or partially processed extracts, are often sent to another laboratory several hours or even days away from the original collection site, all of which offers opportunities for chemical changes.
Fourth, these measurements are not easy to make; they involve many processing steps with a wide variety of chemicals, and at every step variability arises. The processing is often (necessarily) divided across several days and among several technicians, with inherently different skills and training. Further, the chemicals are never exactly the same; they are created in different batches and age at different rates. All of these affect the laboratory yields and the quality of the processing. Finally, the arrays themselves are technological objects subject to all sorts of variability in their creation, storage, and final use. In effect, the simple measurement of mRNA concentrations that we would like to make is confounded by huge uncertainties. To be able to make good measurements, it is essential that all of the mentioned steps be subject to careful statistical process control, monitoring, and systematic improvement. Further, the actual experiments should be designed to avoid, randomize, or otherwise balance the confounding effects for the most important experimental measurements [13, 14]. These are best practices. Unfortunately, they are not often followed in the laboratory. By the time the data are ready for analysis, they are typically presented in a numeric table recording a measurement for each gene across several microarrays. For example, one might be analyzing 400 arrays each with 20,000 gene measurements, which would be presented in a table with 20,000 rows and 400 columns. Often the table will have missing values resulting from scratched arrays, poor hybridizations, or scanner misalignments, to name just a few (from among a host) of the possible problems. The analysis methods should be able to gracefully deal with these incomplete data sets, while the analyst should approach these data with great skepticism and humility considering the complexity of the cellular processes and the error-prone nature of our microarray technology. 
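As a minimal sketch of how an analysis pipeline might screen such an incomplete table before any further processing (assuming missing values are encoded as NaN; the function name and the 30% cutoff are illustrative choices, not values prescribed by the chapter):

```python
import numpy as np

def drop_sparse_genes(expr, max_missing_frac=0.3):
    """Drop genes (rows) whose fraction of missing measurements,
    encoded as NaN, exceeds the analyst's chosen threshold.

    expr: genes x arrays table, e.g. 20,000 rows by 400 columns.
    Returns the filtered table and the indices of the genes kept.
    """
    missing_frac = np.isnan(expr).mean(axis=1)
    keep = np.where(missing_frac <= max_missing_frac)[0]
    return expr[keep], keep
```

The threshold is deliberately left under the analyst's control, in keeping with the skepticism the text recommends: how much missingness is tolerable depends on the experiment, not on a fixed rule.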
Despite all of these issues, statisticians and informaticians, unlike mathematicians, are expected to say something about the structure and meaning of these data. As Thompson has said, "[statisticians] should be concerned with a reality beyond data-free formalism. That is why statistics is not simply a subset of mathematics" [15, p. xv]. Here, we attempt to follow Thompson in discussing implications of, as well as our approaches to, analyzing these experiments, including considerable detail about the algorithms and the way the data are handled.

4.2.2. Preprocessing and Normalization

As discussed in the previous section, microarray data typically have a large number of missing values, or values otherwise deemed to be nonpresent. We typically drop genes with too many missing values, assigning a threshold for this purpose that is under the control of the analyst. The raw values are then scaled to help with the processing. The distributions of microarray measurement values typically have extremely long tails. There are a few genes with very large expressions, while most of the others are quite small. Tukey and Mosteller suggested a number of transforms to make data from such distributions less extreme and more like the normal Gaussian distribution [16]. In particular, taking logarithms of the raw data is a common practice to make microarray data more symmetric and to shorten the extreme tail (see Fig. 4.1a). However, we frequently use another transform to compress the extreme values, which is due to [17]. This rank-order-based score is an increasing function of expression level, for which the smaller values are compressed to be very nearly the same. This is particularly useful with array data, as many of these smaller values are due purely to noise.
The Savage score is computed as follows: if the absolute expression levels are rank ordered from smallest to largest, X_(1) ≤ X_(2) ≤ ... ≤ X_(n), then the score for X_(k) is given by

SS(X_(k)) = Σ_{j=1}^{k} 1 / (n + 1 − j)

Figure 4.1b shows how this score affects the resulting distribution. In particular, Savage scoring compresses the extreme tail and emphasizes the intermediate and large counts. In this case, about 60% of the Savage scores are below 1. The use of this scoring has two advantages over correlations using raw counts. First, because it is based on rank ordering, data from arrays processed with very different scales can still be compared. Second, because the noisiest fraction of the measurements is aggressively forced toward zero, the effect of the noise is suppressed without completely ignoring the information (it has been taken into account during the sorting phase). Finally, large differences in rank order will still be strong enough to be detected.

FIGURE 4.1. (a) Distribution of log-transformed data from typical Affymetrix U94A microarray. (b) Distribution of the same data after Savage scoring.

The normalization of array data is controversial, although some form of centering and variance normalization is often applied [18, 19]. However, it has been argued that for many experiments there is no intrinsic reason to expect the underlying mRNA concentrations to have the same mean (or median) values, and hence variance adjustment is suspect. Nevertheless, the analyst has the option to do such normalizations, if desired. In general, we avoid this issue by working with order statistics and Savage scores.

4.2.3. Overview of Basic Analysis Method

After preprocessing the measurements with thresholding, rescaling, and various other transforms, we often perform our analysis as follows. First, we compare the genes or arrays under investigation by computing pairwise similarities with several techniques.
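Of the preprocessing steps just summarized, the Savage scoring of Section 4.2.2 is the least standard, so a minimal sketch may help (NumPy-based; ties in the raw values are broken arbitrarily, which the formula above does not specify):

```python
import numpy as np

def savage_scores(x):
    """Savage scores for a vector of absolute expression levels.

    The k-th smallest value receives sum_{j=1}^{k} 1/(n + 1 - j),
    i.e. the expected k-th order statistic of n standard exponentials.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    # cumulative sums of 1/n, 1/(n-1), ..., 1/1 give the score for each rank
    scores_by_rank = np.cumsum(1.0 / np.arange(n, 0, -1))
    ranks = np.argsort(np.argsort(x))   # 0 = smallest value
    return scores_by_rank[ranks]
```

Because the n scores sum to exactly n, they always average to 1, so values below 1 mark the smaller, noisier measurements, consistent with the roughly 60% figure quoted above.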
These similarities are used to cluster (or assign spatial coordinates to) the genes in ways that bring similar genes closer together. These clusters are then visualized with VxInsight [1, 20]. Although we typically use genes in our analyses, we often use arrays as well. In fact, the analysis is the same, as all our techniques use either the gene matrix initially described, or the transpose of the gene matrix. Hence, throughout the remainder of this chapter, we will use the terms genes and arrays interchangeably. Next, the clusters are tested with statistical methods to identify genes and groups of genes that are differentially expressed in the identified clusters or genes otherwise identified with respect to experimental questions and hypotheses. The expression values for the identified genes are plotted and tested for stability. Those genes which seem particularly diagnostic or differentiating are studied in detail by reading the available information in online databases and in the original literature. Each of these analysis steps will be presented in an order approximately following the analysis order we use in practice. This type of analysis is quite typical for microarrays and is usually divided into two categories. The method first described for clustering genes is known as cluster analysis, or unsupervised learning. The term unsupervised is used because we are attempting to divine groups within the data set while avoiding the use of a priori knowledge. In contrast, the second method described for the identification of gene lists is usually called supervised learning. The term supervised refers to the fact that we are using prior information in our analysis, in this case attempting to discover which genes are differentially expressed between known groups. An unsupervised method asks, in essence: Are there groups in the data set and, if so, what are they? 
A supervised method asks: Given groups, can we predict group membership and can we learn what is most important in distinguishing the groups?

4.3. UNSUPERVISED METHODS

4.3.1. Overview of Clustering for Microarray Data

Organizing large groups of data into clusters is a standard and powerful practice in exploratory data analysis. It has also become a standard practice in microarray analysis. Typically, the first step after the initial data transformations involves the pairwise comparison of the data elements to determine their relative similarity or dissimilarity. A single comparison usually results in a single number. Thus when comparing the expressions of n genes across multiple experimental conditions, one might compute n(n − 1)/2 correlations using the similarity measure between each possible pair of genes. After the data pairs have been assigned similarities, various grouping algorithms can be used. In the case of microarray analysis, there are a variety of methods that have been applied and/or developed. These include hierarchical clustering [5], the singular value decomposition/principal-component analysis [8], and self-organizing maps [21]. Our method (discussed next) belongs to the class of graph-theoretic clustering methods. Other such methods include those presented in [22–24]. Finally, a general overview of clustering methods can be found in [25]. In this chapter we focus on a tool known as VxOrd [20], which uses a force-directed graph layout algorithm. Our method requires, as do most clustering algorithms, a single similarity value for each of the data pairs. In the next section we show one way to compute the similarities and then consider the actual clustering algorithm.

4.3.2. Clustering Using VxOrd

4.3.2.1.
Choosing a Similarity Measure

One obvious candidate for measuring similarities is the simple correlation coefficient R due to Pearson [26],

R_xy = Σ_{i=1}^{d} (x_i − x̄)(y_i − ȳ) / √[ Σ_{i=1}^{d} (x_i − x̄)² · Σ_{i=1}^{d} (y_i − ȳ)² ]

where, of course, −1 ≤ R_xy ≤ 1. Pearson's R is just a dot product of two mean-centered and normalized d-dimensional vectors. Thus, one can think of R as a measure of the extent to which the two vectors point in the same direction in the d-dimensional space. Of course, the vectors might lie along the same line, but point in opposite directions, which is the meaning of R_xy = −1. If the vectors are completely orthogonal, the correlation will be zero. In fact, Pearson's correlation is the measure of similarity that we, and the rest of the microarray community, use most often. It is, however, not the only possibility and in fact has some features that do not recommend it under certain situations. For instance, Pearson's correlation is known to be sensitive to outliers, especially when n is small. Technically, R has a breakdown point of 1/n, meaning that as few as one extreme outlier in the n points can make the statistic completely different from the true measure of correlation for two random but correlated variables [26]. In practice, we have not found this to be a real problem, since we typically use hundreds of arrays. However, early in the development of microarray technology, many data sets were published with on the order of tens of arrays. In these cases, it was deemed valuable to apply more computationally expensive but more robust measures of correlation. Details about robust measures of correlation, including the percentage-bend correlation coefficient, can be found in [27]. We have also found occasion to use very different similarity measures.
For example, we clustered genes based on the textual similarity of their annotations in the Online Mendelian Inheritance in Man (OMIM) [28] (http://www.ncbi.nlm.nih.gov/omim). In this case, the text was processed with the RetrievalWare search and retrieval package from Convera. RetrievalWare computes the similarity between two text documents with a proprietary algorithm. For each gene annotation the strongest 19 other annotations were accumulated to create a similarity file and then processed to produce a clustering. The more typical processing for microarrays is discussed in the following section.

4.3.2.2. Postprocessing for Similarity Measures

While Pearson's correlation has a breakdown point of 1/n (a single outlier can distort the statistic from one extreme to the other [26]), it is easy to compute and has been widely accepted in the microarray community. Because Savage-scored expression values are bounded, the influence of outliers is less important. As a result, the correlation coefficient is usually the basis of our similarity computations. When too few arrays are available to have confidence in R, the percentage-bend coefficient [27] can be used instead. It is common to cluster directly with these coefficients. However, doing so ignores much of the available information because R is such a nonlinear function. For example, there is only a slight change in significance when comparing two pairs of genes that have R = 0.5 and R = 0.55, respectively, but the difference in significance between R = 0.90 and R = 0.95 can be quite large. Consequently, it is better to transform these correlations with a measure of their relative significance. This transform can be done by converting to the t-statistic for the observed correlation R between the pairs of values [26]:

t = R √(d − 2) / √(1 − R²)

Both R and t have computational issues that should be addressed.
In particular, R is undefined when the variance of either x or y vanishes, and hence a minimum, acceptable variance must be determined. We typically require that

Σ_{i=1}^{d} (x_i − x̄)² · Σ_{i=1}^{d} (y_i − ȳ)² > 0.0001

Otherwise, no correlation is computed for the pair of expression vectors. A related issue occurs with t when R approaches ±1.0. In this case, the t-statistic becomes arbitrarily large. Because clustering will be comparing similarities, the strength of an extreme outlier will distort the clustering. Hence t should be clipped to avoid such extremes (we typically truncate values greater than 300, though even this value may be extreme). Missing data also present concerns in computing R. Certainly, if too many values are missing, any computed similarity would be suspect. Recourse to the analyst's experience and judgment is the best way to choose how many values can be missing before the comparison is not attempted. For large collections of arrays, requiring that at least 70 measurements be simultaneously present for both expression vectors has been acceptable in our experience. Computing all of these correlations produces a huge file of similarity comparisons. For example, the computation for an experiment [29] around Caenorhabditis elegans, which has about 20,000 genes, required the computation of about 2 × 10^8 correlations. Using all of the correlations for clustering is neither necessary nor desirable. Most of the correlations will be modest, and including them slows the clustering computation and introduces a great deal of resistance to the process that separates the mass of genes into clusters. If a particular gene has strong correlations with a few tens of other genes, they should eventually be clustered together. However, if there are hundreds or thousands of correlations weakly linking that particular gene to others, then the net sum of these weak correlations may overwhelm the few strong correlations.
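These guards can be collected into one routine. The 0.0001 variance product, the clip at 300, and the 70-measurement overlap rule follow the text; the function name and the NaN encoding of missing values are illustrative assumptions:

```python
import numpy as np

def similarity_t(x, y, min_var_prod=1e-4, t_clip=300.0, min_overlap=70):
    """Pearson R on jointly present entries, transformed to a t-statistic.

    Returns None when no trustworthy similarity can be computed:
    too few jointly present measurements, or vanishing variance.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    ok = ~(np.isnan(x) | np.isnan(y))       # jointly present measurements
    if ok.sum() < min_overlap:
        return None
    xs, ys = x[ok] - x[ok].mean(), y[ok] - y[ok].mean()
    ssx, ssy = (xs**2).sum(), (ys**2).sum()
    if ssx * ssy <= min_var_prod:           # variance guard
        return None
    r = (xs * ys).sum() / np.sqrt(ssx * ssy)
    d = ok.sum()
    t = r * np.sqrt(d - 2) / np.sqrt(max(1.0 - r**2, 1e-12))
    return float(np.clip(t, -t_clip, t_clip))
```

The small floor inside the square root (1e-12) is an extra numerical safeguard for R at exactly ±1; the subsequent clip dominates the result in that case anyway.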
If only a few of the correlations will be used for clustering, some method of choice is required. The analyst can use all correlations above some threshold or just the strongest few correlations for each gene. We have found the latter approach to be sufficient; using the 20 strongest correlations is often a good starting point. However, even fewer correlations may suffice, especially with the methods discussed next for finding the most central ordination from a distribution of stochastic clustering results.

4.3.2.3. Clustering by Graph Layout

Once the similarities have been computed, the VxOrd algorithm is used to cluster the data. VxOrd considers the data set as an abstract graph. In the case of microarrays, we usually think of the genes as nodes in a graph, and the edges as similarities between genes. The relationship between the data elements and their similarity values can be visualized as an abstract, weighted graph G(V_i, E_{i,j}, W_{i,j}) consisting of a set of vertices V (the genes) and a set of edges E with weights W (the similarities between the genes). This graph is only topologically defined, as the vertices have not been assigned spatial locations. Spatial coordinates, called ordinations, are computed from the weighted graph using VxOrd, which places vertices into clusters on a two-dimensional plane, as shown in Figure 4.2. The ordinations are computed such that the sum of two opposing forces is minimized. One of these forces is repulsive and pushes pairs of vertices away from each other as a function of the density of vertices in the local area. The other force pulls pairs of similar vertices together based on their degree of similarity. The clustering algorithm stops when these forces are in equilibrium. Although the algorithm has been discussed in detail in a previous paper [20], we provide here a brief overview for convenience. VxOrd is based on the approach in [30].
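As a toy illustration of the two steps just described, keeping only each gene's strongest similarities as graph edges and then balancing attraction against repulsion, consider the following sketch. It is not VxOrd: it uses plain pairwise repulsion in place of VxOrd's grid-based density term and none of its phased annealing, barrier jumping, or edge cutting, and all names and constants are illustrative:

```python
import numpy as np

def top_k_edges(sim, k=20):
    """Keep only each gene's k strongest similarities as undirected edges."""
    edges = set()
    for i in range(sim.shape[0]):
        row = sim[i].copy()
        row[i] = -np.inf                      # never link a gene to itself
        for j in np.argsort(row)[::-1][:k]:   # the k strongest neighbors of i
            edges.add((min(i, int(j)), max(i, int(j))))
    return edges

def layout(edges, n, iters=200, step=0.05, repel=0.01, seed=0):
    """Toy force-directed layout: attraction follows a w * l^2 energy term
    between connected vertices; simple pairwise repulsion stands in for a
    density-based term."""
    rng = np.random.default_rng(seed)
    pos = rng.normal(size=(n, 2))
    for _ in range(iters):
        force = np.zeros_like(pos)
        for i, j, w in edges:                 # gradient of the attraction term
            delta = pos[j] - pos[i]
            force[i] += 2.0 * w * delta
            force[j] -= 2.0 * w * delta
        for i in range(n):                    # repulsion from all other vertices
            delta = pos[i] - pos
            dist2 = (delta ** 2).sum(axis=1) + 1e-9
            force[i] += (repel * delta / dist2[:, None]).sum(axis=0)
        pos += step * force
    return pos
```

Running `layout` on the edge set from `top_k_edges` pulls similar genes into nearby positions while unconnected genes drift apart, which is the qualitative behavior the text ascribes to the full algorithm.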
Fruchterman and Reingold compute a force term for both attraction and repulsion. These terms are then used to generate new positions for the graph vertices. The algorithm combines the attraction and repulsion terms into one potential energy equation, shown below. The first term, in brackets, is due to the attraction between connected vertices; the second term is a repulsion term. The equation is given by

K_i(x, y) = [ Σ_{j=1}^{n_i} (w_{i,j} · l²_{i,j}) ] + D_{x,y}

where K_i(x, y) is the energy of a vertex at a specific (x, y) location, n_i is the number of edges connected to vertex i, w_{i,j} is the edge weight between vertex i and the vertex connected by edge j, l²_{i,j} is the squared distance between vertex i and the vertex at the other end of edge j, and D_{x,y} is a force term proportional to the density of vertices near (x, y).

FIGURE 4.2. Data elements are nodes and similarities are edge values, which are clustered and assigned (x, y) coordinates by VxOrd. These coordinates are then used to visualize clusters, as shown on the far right. (From [20].)

In our ordinations, the energy equation is minimized iteratively in three phases. The first phase reduces the free energy in the system by expanding vertices toward the general area where they will ultimately belong. The next phase is similar to the quenching step that occurs in simulated annealing algorithms [31], where the nodes take smaller and smaller random jumps to minimize their energy equations. The last phase slowly allows detailed local corrections while avoiding any large, global adjustments. VxOrd also includes additional improvements. These improvements include barrier jumping, which keeps the algorithm from getting trapped in local minima; a grid-based method for computing D_{x,y}, which reduces the computation of the repulsion term from O(|V|²) to O(|V|); and edge cutting, which encourages exposure of clusters in the final stages of the optimization.

4.3.2.4.
Clustering Parameters

The analyst has two important controls over the VxOrd algorithm:

1. The number of similarities used for the clustering
2. The degree of edge cutting permitted, where edge cutting refers to removing key edges in order to expose cliques in the graph

The first control concerns how many similarities are passed to the clustering algorithm. Every gene has some correlation with every other gene; however, most of these are not strong correlations and may only reflect random fluctuations. By using only the top few genes most similar to a particular gene, we obtain two benefits: The algorithm runs much faster and, as the number of similar genes is reduced, the average influence of the other, mostly uncorrelated genes diminishes. This change allows the formation of clusters even when the signals are quite weak. However, when too few genes are used in the process, the clusters break up into tiny random islands containing only two or three very similar genes, so selecting this parameter is an iterative process. One trades off confidence in the reliability of the cluster against refinement into subclusters that may suggest biologically important hypotheses. These clusters are only interpreted as suggestions and require further laboratory and literature work before we assign them any biological importance. However, without accepting this trade-off, it may be impossible to uncover any suggestive structure in the collected data. The second control tells the algorithm how many edges may be removed so that cliques in the graph, or clusters in the ordination, may be exposed. This parameter must also be balanced for reliability of the clique against actual exposure of the clique. As an example of the impact of these parameters, consider Figure 4.3. Here we are clustering a set of 126 arrays, each with about 12,000 genes.

FIGURE 4.3. Effects of using different numbers of similarity links and different parameters for edge cutting. (a) Using too many similarity links, in this case 30, and only a single undifferentiated group is formed. (b) Using 15 similarity links, and the data are no longer completely undifferentiated; some stronger similarities are beginning to force the emergence of structure. (c) Using 30 links but with maximum edge cutting enabled, and clusters are still not apparent. (d) Using 15 similarity links and aggressive cutting, and the data can be separated.

First consider the effect of using too many similarities. Figure 4.3a shows the result when 30 similarities per array are used. However, when only the top 15 strongest similarities are used, as in Figure 4.3b, three groups begin to be apparent. When a set of elements has a relatively uniform set of similarities, it can be very difficult to separate them into subclusters. However, there may be subsets of stronger similarities that would divide the data into clusters if allowed to express their influence in the absence of the other, mostly homogeneous similarities. In other words, we can reveal small cliques of vertices by cutting similarity relationships that have been constraining the vertices to remain an undifferentiated agglomeration. Figure 4.3c shows that no cliques are apparent when using 30 similarities per vertex, even with extremely aggressive edge cutting. On the other hand, the suggestive clusters seen in Figure 4.3b readily break into more detailed cliques when only 15 similarities per vertex are used and when aggressive edge cutting is enabled, as shown in Figure 4.3d.

4.3.2.5. Evaluating Utility and Significance of Clustering

Clustering algorithms are designed to find clusters. However, one's initial stance should be that there is no reason to suppose that the clusters found are more than artifacts. We have expended significant effort devising methods for evaluating the clusters produced by VxOrd.
These efforts are described in detail in our previous publications, but for completeness we provide a short overview here. The first, most intuitive approach is to check that gene expressions are correlated within clusters and to investigate the biological meaning of the clusters. One of our first efforts in this direction was an analysis of the Spellman yeast data [32]. In this paper we compared the typical expression histories of the genes in each cluster to assure ourselves that genes in the cluster had, generally, uniform expression patterns and that these patterns were different in the various clusters. Figure 4.4 shows Spellman's yeast cell-cycle data clustered with VxOrd, overlaid with expression traces for typical genes in the various clusters. Not only do these traces seem homogeneous within the clusters and different between clusters, but they also have biological significance as the cells move through their replication cycle. Surprisingly, the various states in the cell cycle correspond to a clockwise progression around the roughly circular layout of gene clusters in this figure. This visual inspection was also recast in the same paper in a statistically more rigorous manner. Although we do not provide the details, we concluded that the VxOrd clusters were not artifacts. A statistical test used in [32] showed that a subset of genes associated with cell cycle phase G1 were collocated with p < 0.001 and further that CLB6, RNR1, CLN2, TOS4, and SVS1 collocated with p < 0.0001 for cells exiting from long-term stationary phase. Another test falling into the category of intuitive verification was performed in [33]. This work tested VxOrd clusters of C. elegans genes for clusters enriched in genes known to be involved in various biological processes. Stuart et al. [33] found significant statistical enrichments. These enrichments suggested that other genes in the same clusters could be expected to be involved in the indicated processes.
This hypothesis was confirmed with laboratory experiments.

FIGURE 4.4. Cell cycle data with typical expression traces from each cluster. Interestingly, the clusters lay out in a circle corresponding to the temporal cell cycle phases.

Another evaluation method to investigate the clustering algorithm uses exactly the same processing parameters but with randomly permuted versions of the measurements. If the clustering algorithm finds clusters or structures in this randomized data, then the results with the original data should be suspect. The processing methods discussed above have been tested in this way, and randomized data do not exhibit any organized structure; see, for example, the analysis in [29], where the randomized data collapsed into a single, symmetric, and otherwise unorganized group of genes, in contrast to the clear structure revealed in the actual data. If randomized data show no structure, then the structures in the actual data become more interesting and may possibly be useful. These methods have been useful in showing that the clusterings are not chance occurrences and have led to scientific insights. However, these approaches have not addressed two other important issues related to clustering: (1) how stable these clusters are with respect to variations in the measurements and (2) how stable they are with respect to different random initializations of the VxOrd clustering algorithm, which has an inherently stochastic nature. We investigated these two issues in [20]. To test the stability of the algorithm with respect to random starting points, we ran 100 reordinations of the Spellman cell cycle data [6], which had about 6000 genes. Each reordination was started with a different seed for the random-number generator. We then visually marked the elements of a cluster in one ordination and looked to see if they were still visually clustered together in the other ordinations.
The results of this analysis were generally positive and are shown in Figure 4.5a.

FIGURE 4.5. (a) Ordinations with different random starting conditions. (b) Effect of increasing edge noise on cluster stability. (From [20].)

We also performed a more rigorous neighborhood analysis (see discussion below), with good results. To determine whether small changes or noise in the similarities would give correspondingly small changes in the ordination results, we ran 80 reordinations in which we added noise drawn from a Gaussian distribution with mean zero and standard deviation 0.001, 0.010, 0.050, or 0.100 and recomputed the ordinations. These ordinations were also compared visually and statistically. The results showed that our clusterings held together remarkably well, even when a large amount of noise was added. The visual results are shown in Figure 4.5b.

4.3.2.6. Finding Most Representative Clustering

Each randomly restarted ordination by VxOrd represents a sample from a distribution of possible ordinations arising from a particular similarity file. From this perspective, one might want to identify the best ordination, which is difficult both because it is an extreme value and because the concept of a best cluster or best ordination is not well defined. However, the idea of a most representative, or most central, ordination (MCO) can be defined with respect to the sample of observed randomly restarted ordinations. In this case, two ordinations are compared by neighborhood analysis to create a single measure of overall similarity between them. With this method for comparing two ordinations, one can make all possible comparisons of the available randomly restarted ordinations and then select the ordination that is, on average, most like all the remaining reordinations.
This idea of centrality of the distribution of ordinations might be further extended to the second moment to compute some measure of dispersion, which perhaps could eventually allow some sort of hypothesis testing about these ordinations. However, we have only investigated the centrality issue. As an example, we used massively parallel computers to calculate hundreds, or in some cases thousands, of reclustered ordinations with different seeds for the random-number generator. We compared pairs of ordinations by counting, for every gene, the number of common neighbors found in each ordination. Typically, we looked in a region containing the 20 nearest neighbors around each gene, in which case a gene could have a minimum of 0 and a maximum of 20 common neighbors in the two ordinations. Summing these counts across all genes gives an overall measure of the similarity of the two ordinations. We computed all pairwise comparisons between the randomly restarted ordinations and used those comparisons to find the ordination that had the largest count of similar neighbors. Note that this corresponds to finding the ordination whose comparison with all the others has minimal entropy; in a general sense it represents the MCO of the entire set. Although not shown here, plots of the entropies, intersection counts, and cross plots of entropy showed that there were central ordinations, even for a data set that we found very difficult to break into stable clusters [34]. It is also possible to use these comparison counts (or entropies) as a derived similarity measure in order to compute another round of ordinations. For example, given that 200 random reordinations have been computed, one can compute the total number of times gene gj turns up in the neighborhood of gene gk in the available 200 ordinations.
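The bookkeeping behind these comparisons can be sketched as follows. This is an illustrative reconstruction rather than the production code: the layouts are toy 2-D coordinate dictionaries, neighbor search is brute force, and the neighborhood size k is set far below the 20 used in the text so the example stays tiny.

```python
import math
from itertools import combinations

def knn_sets(layout, k):
    """For each gene, the set of its k nearest neighbors in one 2-D ordination."""
    out = {}
    for g in layout:
        others = sorted((h for h in layout if h != g),
                        key=lambda h: math.dist(layout[g], layout[h]))
        out[g] = set(others[:k])
    return out

def ordination_similarity(nA, nB):
    """Sum over genes of shared nearest-neighbor counts between two ordinations."""
    return sum(len(nA[g] & nB[g]) for g in nA)

def colocation_counts(layouts, k):
    """Derived similarity: how often each gene pair are neighbors across layouts."""
    counts = {}
    for layout in layouts:
        nn = knn_sets(layout, k)
        for g, h in combinations(layout, 2):
            if h in nn[g] or g in nn[h]:
                counts[g, h] = counts.get((g, h), 0) + 1
    return counts

# Three toy "reordinations" of four genes: a/b and c/d always land together.
layouts = [
    {"a": (0, 0), "b": (0.1, 0), "c": (5, 5), "d": (5.1, 5)},
    {"a": (3, 1), "b": (3, 0.9), "c": (-2, 0), "d": (-2.1, 0)},
    {"a": (7, 7), "b": (7, 7.1), "c": (0, 1), "d": (0, 0.9)},
]
counts = colocation_counts(layouts, k=1)
sim12 = ordination_similarity(knn_sets(layouts[0], 1), knn_sets(layouts[1], 1))
```

With 200 reordinations and k = 20, counts[gj, gk] is exactly the tally just described.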
This count, or the average number of times the two genes are near each other, will be high when the two genes are generally collocated (which should be a reflection of similar expression profiles for gj and gk). The clusters from this recursive use of the ordination algorithm are smaller, much tighter, and generally more stable with respect to random starting conditions than any single ordination. In fact, we found that most single ordinations were more similar to the MCO when using the derived similarity than when using Pearson's R. We typically use all of these methods (computing the MCO from among about 100 to 200 random reordinations and computing neighborhood set sizes ranging from 10 to 30 by steps of 5) during exploratory data analysis in order to develop intuition about the data. As an interesting aside, the process of comparing pairs of ordinations results in similarity values between every pair of ordinations. These similarities can be used to create clusters of the clusterings! Figure 4.6 shows an example where the random reclusterings seemed to fall into two different attractor basins, which may be interpreted as a sign that there were two different but perhaps equally valuable ways to look at the data and that no single clustering was able to represent both of these views.

4.3.3. Enhancing VxOrd

In addition to developing the MCO algorithm, which is layered on top of the VxOrd algorithm, we put some effort into enhancing VxOrd itself. For clarity, we will denote the enhanced version by VxOrd 2.0 and the original version by VxOrd 1.5. The original motivation for developing VxOrd 2.0 was to cluster larger data sets, although we also found the new algorithm to be more stable and more accurate on certain data sets.

FIGURE 4.6. The process of comparing individual clusterings results in similarity values between each pair of ordinations. These similarities can be used to create a cluster of clusters.
In this case, there seem to be two attractors, suggesting the data may have two useful visualizations.

4.3.3.1. Adding Graph Coarsening to VxOrd

VxOrd 2.0 is based on graph coarsening, which has previously been applied to other graph layout algorithms in order to draw larger graphs more efficiently [35, 36]. Graph coarsening, also known as multilevel mesh refinement, works by replacing the original graph with a smaller but still representative graph. In the context of graph layout, we draw the smaller graph and use this initial layout as a starting point for drawing the original graph. In fact, we use an iterative approach, which typically provides a succession of smaller graphs and hence a succession of graph layouts. Suppose the initial graph is denoted by G0. A coarsened version G1 of G0 is obtained by randomly visiting and subsequently merging nodes as follows:

1. Pick a node vi at random.
2. Choose a neighbor vj of vi such that the edge eij has maximal weight. If no such vj exists or vj has been previously visited/merged, go to step 4.
3. Merge vj into vi by adding edge weights of common neighbors or creating new edges if there are no common neighbors.
4. Repeat until all nodes have been visited/merged.

This algorithm is fast and has been observed [37] to produce much smaller but still representative versions of the original graph. Once the initial graph G0 has been coarsened, we can draw the smaller version G1 using VxOrd 1.5 and then reverse our coarsening to obtain a layout of G0. In fact, we repeatedly apply the coarsening algorithm to obtain ever smaller graphs. The algorithm that we use is as follows:

1. Apply the coarsening algorithm until we obtain a suitably small graph (say 50 nodes) or until the algorithm fails to make any more progress. We obtain a sequence of graphs G0, G1, . . . , Gm from this process.
2. Draw Gm using VxOrd.
3. Refine Gm to obtain Gm-1.
Place the additional nodes obtained by this refinement in the same positions as their counterparts (the merged nodes in Gm) and adjust with VxOrd.

4. Repeat step 3 using Gm-2, . . . , G0.

This algorithm requires additional adjustments in terms of the grid-based density calculations and the various parameters involved in the simulated annealing. With proper adjustments, however, we obtain better accuracy and stability with this algorithm (VxOrd 2.0) than with the previous version (VxOrd 1.5), as will be demonstrated in the following section.

4.3.3.2. Benchmarking VxOrd 2.0

We first benchmarked VxOrd 2.0 on the so-called swiss roll data set. Although this is not microarray data, it provides a useful example of the essential difference between VxOrd 2.0 and VxOrd 1.5. This data set was used in [38, 39] to test two different nonlinear dimensionality reduction algorithms. The data set is provided as points in three dimensions, which give a spiral embedding of a two-dimensional ribbon (see Fig. 4.7a). To test VxOrd, we considered each point to be a node in an abstract graph, and we connected each node to its 20 nearest neighbors. In principle, VxOrd should draw the graph as a two-dimensional ribbon.

FIGURE 4.7. The swiss roll is a two-dimensional manifold embedded nonlinearly in three dimensions: (a) actual data set; layouts of the associated graph using (b) VxOrd 1.5 and (c) VxOrd 2.0.

The swiss roll is a useful benchmark because it is very easy to visualize but hard enough to confound (at a minimum) any linear algorithm (such as principal-component analysis). It also confounded the original VxOrd 1.5, as can be seen in Figure 4.7b. Looking closely at Figure 4.7b, we can see that VxOrd 1.5 did well on a local scale (i.e., the colors are together and in the correct order) but failed on a global scale (the ribbon is twisted).
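The coarsening step above can be sketched on a small weighted graph. This is a simplified illustration under our own conventions (a symmetric dict-of-dicts of edge weights), not the VxOrd 2.0 implementation; the density-grid and annealing details are omitted.

```python
import random

def coarsen(graph, rng):
    """One coarsening level: merge each unvisited node with its heaviest-edge
    unvisited neighbor, summing edge weights that collapse onto the same pair."""
    merged = {}          # surviving node -> set of absorbed original nodes
    owner = {}           # original node -> surviving node
    nodes = list(graph)
    rng.shuffle(nodes)   # step 1: visit nodes in random order
    visited = set()
    for v in nodes:
        if v in visited:
            continue
        visited.add(v)
        owner[v] = v
        merged[v] = {v}
        # step 2: heaviest-edge neighbor not yet visited (ties broken by name)
        cand = [(w, u) for u, w in graph[v].items() if u not in visited]
        if not cand:
            continue
        _, u = max(cand)
        # step 3: merge u into v
        visited.add(u)
        owner[u] = v
        merged[v].add(u)
    # Rebuild edges between surviving nodes, summing the collapsed weights.
    coarse = {v: {} for v in merged}
    for a in graph:
        for b, w in graph[a].items():
            ca, cb = owner[a], owner[b]
            if ca != cb:
                coarse[ca][cb] = coarse[ca].get(cb, 0) + w
    return coarse, owner

# Two triangles of heavy edges joined by one weak edge: coarsening should
# collapse within the triangles rather than across the weak link.
g = {"a": {"b": 5, "c": 5}, "b": {"a": 5, "c": 5}, "c": {"a": 5, "b": 5, "d": 1},
     "d": {"c": 1, "e": 5, "f": 5}, "e": {"d": 5, "f": 5}, "f": {"d": 5, "e": 5}}
coarse, owner = coarsen(g, random.Random(1))
```

Applying `coarsen` repeatedly until the graph is small enough yields the sequence G0, G1, . . . , Gm used above; the `owner` map is what lets each refinement step place the recovered nodes at their merged counterparts' positions.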
Once graph coarsening was added to VxOrd, however, the global structure was also ascertained correctly, as shown in Figure 4.7c. In our subsequent analyses we found a similar story: VxOrd 2.0 does well on large data sets on a global scale but otherwise does not improve on VxOrd 1.5. For our next benchmarks we used the swiss roll data set again, an adult leukemia data set provided by the University of New Mexico Cancer Research and Treatment Center, and the yeast microarray data set [6] used previously. In order to test the stability of our modification to VxOrd, we performed multiple runs of both VxOrd 1.5 and VxOrd 2.0 with different random starting conditions. We then compared the ordinations using a similarity metric obtained as a modification of a metric discussed in [40]. Our similarity metric is a function se(U, V), where U and V are two VxOrd layouts of the same n-node data set x1, . . . , xn. The metric is computed by first constructing neighborhood incidence matrices NU,e and NV,e, where each is an n x n matrix (nij) with

nij = 1 if ||xi - xj|| < e, and nij = 0 otherwise.

Now

se(U, V) = (NU,e · NV,e) / (||NU,e|| ||NV,e||),

where NU,e · NV,e is the dot product of NU,e and NV,e when both matrices are considered to be vectors of length n^2. Finally, we note that in order to make sure we can form reasonable e-neighborhoods of a given node xi, we first scale both U and V to lie in the square [-1, 1] x [-1, 1].

To see how we can use this metric to assess the stability of VxOrd 2.0, we first revisit the swiss roll data set. In the top row of Figure 4.8 we show an all-versus-all similarity matrix for 10 different random runs of both VxOrd 1.5 and VxOrd 2.0. This figure confirms the results from Figure 4.7 and shows that the metric is valid. In particular, we see that VxOrd 2.0 arrives at a consistent solution (indicated by higher similarities), while VxOrd 1.5 is less consistent.
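Our reading of this metric can be reconstructed in a few lines of Python. This is a sketch under the stated definitions (brute-force O(n^2) incidence matrices, suitable only for small layouts), not code from the chapter:

```python
import math

def scale_layout(pts):
    """Scale a layout so both coordinates lie in [-1, 1]."""
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    def s(v, lo, hi):
        return 0.0 if hi == lo else 2 * (v - lo) / (hi - lo) - 1
    return [(s(x, min(xs), max(xs)), s(y, min(ys), max(ys))) for x, y in pts]

def incidence(pts, eps):
    """Flattened n x n epsilon-neighborhood incidence matrix."""
    return [1.0 if math.dist(p, q) < eps else 0.0 for p in pts for q in pts]

def similarity(U, V, eps=0.3):
    """s_e(U, V): cosine of the two incidence matrices viewed as long vectors."""
    nu = incidence(scale_layout(U), eps)
    nv = incidence(scale_layout(V), eps)
    dot = sum(a * b for a, b in zip(nu, nv))
    return dot / (math.sqrt(sum(a * a for a in nu)) *
                  math.sqrt(sum(b * b for b in nv)))

# Toy layout: nodes 0/1 and 2/3 are close pairs; V moves node 1 and node 5.
U = [(0, 0), (0.05, 0), (1, 1), (1, 0.95), (0, 1), (2, 2)]
V = [U[0], U[5], U[2], U[3], U[4], U[1]]
s_same = similarity(U, U)   # identical layouts: similarity 1
s_perm = similarity(U, V)   # disturbed layout: similarity drops
```

Because the incidence matrices depend only on interpoint distances after scaling, the metric is insensitive to the rotations and reflections that different random VxOrd runs produce.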
Next we computed the same similarity matrices for the adult leukemia data set and for the yeast time-series data, as shown in the second and third rows of Figure 4.8. We arrived at similar results in each case. We also computed the average similarity across the matrices (excluding the diagonal) for the different cases, shown in Table 4.1.

FIGURE 4.8. Comparison of the similarity matrices for random runs of VxOrd 1.5 and 2.0 using the swiss roll, adult leukemia, and yeast data sets. On the left we show the runs produced by VxOrd 1.5, and on the right the runs produced by VxOrd 2.0. From top to bottom we have the swiss roll with 10 runs, the adult leukemia data set with 100 runs (10 runs for each of 10 different edge cutting parameters; more aggressive cuts toward the upper left), and the yeast data set with 10 runs. The color scale is shown on the right and is the same for all images. Taken as a whole, the right-hand column has more red than the left-hand column, and hence VxOrd 2.0 (right column) is generally more stable than VxOrd 1.5 (left column).

TABLE 4.1. Average Values (Excluding the Diagonal) of the Similarity Matrices Shown in Figure 4.8

            Swiss Roll   AML    Yeast
VxOrd 1.5      0.43      0.33   0.62
VxOrd 2.0      0.87      0.80   0.69

In the case of the adult leukemia data set, we also experimented with the edge cutting feature of VxOrd. In particular, we computed 10 random runs for each of the 10 most aggressive edge cuts. We found that even though VxOrd 2.0 was more consistent overall, it still was not consistent for the most aggressive cut. The yeast data set was larger (6147 nodes compared to 170 nodes in the adult leukemia data set), and the results of both VxOrd 1.5 and 2.0 were fairly consistent. In particular, this suggests that VxOrd 1.5 still does well on larger data sets without an inherently gridlike structure (as in the swiss roll).

4.4. SUPERVISED METHODS

Clustering a microarray data set is typically only the first step in an analysis. While we have presented tools and techniques to help assure a reasonable and stable ordination, we have not yet discussed the most important part of the analysis: the steps necessary to determine the actual biological meaning of the proposed clusters. While we do not typically address the actual biology in a given analysis, we often provide the appropriate tools for such analysis. These tools must be both informative and accessible to the average biologist.

4.4.1. Using VxInsight to Analyze Microarray Data

As stated earlier, most of our work is built upon a database with a visual interface known as VxInsight [1]. VxInsight was originally developed for text mining but has been extended for the purpose of microarray analysis. Once an ordination has been obtained (using the methods described previously), it is imported into VxInsight, along with any annotation or clinical information. VxInsight uses a terrain metaphor for the data, which helps the analyst find and memorize many large-scale features in the data. The user can navigate through the terrain by zooming in and out, labeling peaks, displaying the underlying graph structure, and making queries into the annotation or clinical data. In short, VxInsight provides an intuitive visual interface that allows the user to quickly investigate and propose any number of hypotheses. Details about and applications of VxInsight can be found in [1, 20, 29, 32, 33].

4.4.1.1. Typical Steps in Analysis Using VxInsight

VxInsight is very useful for an initial sanity check of a data set. We will typically cluster the arrays to look for mistakes in the scanning or data processing which might have duplicated an array.
A duplication will often be apparent because the pair of duplicated arrays will cluster directly on top of each other and will typically be far from the other clusters. We have also discovered that many data sets cluster more by the day the samples were processed, or even by the technician processing the samples, than by biologically relevant factors. Further investigation is needed, for example, if almost 100% of a particular processing set clusters by itself. In one case we found a very stable ordination consisting of two groups. After much confusion we discovered that the groups were divided by experimental batch and that one of the groups consisted of patients whose samples contained only dead or dying cells (perhaps due to bad reagents or problems with the freezing process). When the experiments were redone, the original clusters dissolved into more biologically meaningful clusters. One can often see the effect of confounding experimental conditions using this same method. For example, if a set of arrays is processed by the date they were collected and the dates correspond to separate individual studies, then the processing set (date) will be confounded with the study number. Well-designed studies control such confounding by randomizing or carefully balancing the processing order. However, it is always wise to use these exploratory analysis methods to ensure that the main effect has not somehow been confounded. A more interesting phase of analysis begins after obviously bad data have been culled and the remaining data have been reclustered. The data may be clustered in either of two ways. In one approach, the genes are clustered in an effort to identify possible functions for unstudied genes; see, for example, [29, 32]. In the other approach, which is often seen in clinical studies, we cluster the arrays (the patients) by their overall expression patterns.
These clusters will hopefully correspond to some important differentiating characteristic, say, something in the clinical information. As the analysis proceeds, various hypotheses are created and tested. VxInsight has plotting features that are helpful here, including a browser page with various plots as well as links to external, Web-based information. Although useful information can be gleaned by simply labeling different peaks in VxInsight, a more systematic method is even more informative. At the highest level, one may wish to select two clusters of arrays and ask: Which genes have significantly differential expression between these two clusters? Given any method for identifying such genes, it is useful to display them within the context of the cluster-by-genes map. Sometimes the genes that most strongly differentiate the clusters of arrays may not have been previously studied. In this case, it can be very useful to find known genes that cluster around these unstudied genes using the cluster-by-genes map. This process is illustrated in Figure 4.9, which shows the original table of array data, clustered both by arrays and by genes. The lower map represents the result after clustering by arrays and shows two highlighted clusters (colored white and green, respectively). The genes with strongly differential expression between the two groups of arrays are shown to the right of this map. Note that the list is sorted by a statistical score and also contains links to the available Web-based annotations. A curved arrow in the figure suggests the path between the gene list and the cluster-by-genes image.

FIGURE 4.9. Array expression data for a large number of experiments, clustered by genes and by arrays. A list of genes with different expression between two groups of arrays is shown. This list includes a short annotation and links to more extensive, Web-based annotations.
That connection is implemented with sockets and forms the basis of a more general analysis tool, which allows an arbitrary gene list to be sent from the analysis of the arrays to the analysis of the genes.

4.4.1.2. Generating Gene Lists

There are many methods for generating gene lists, that is, for finding genes which are expressed differently in different groups. As stated in the introduction, this process is known as supervised learning, since we are using known groups to learn about (and make predictions about) our data set. Finding gene lists in particular is known as feature or variable selection, where the features in this case are genes. There is a wide variety of methods for feature selection, and we do not provide an extensive survey of this area here. We do, however, mention some of the methods developed specifically to select genes for microarray analysis. The method in [7] was one of the first gene selection methods proposed, the method in [41] applied feature selection for support vector machines to microarray data, and [42] discusses a variety of feature selection methods applied to microarray data.

For our purposes, we use a simple statistical method for gene selection. A gene-by-gene comparison between two groups (1 and 2) can be accomplished with a simple t-test. However, because we wanted eventually to support comparisons between more than two groups at a time, we actually used analysis of variance (ANOVA). This processing results in an F-statistic for each gene. The list of genes is sorted by decreasing F-score, and the top 0.01% of the entire list is reported in a Web page format, with links to the associated OMIM pages. The OMIM pages are then examined manually to hypothesize biological differences between the clusters.

4.4.1.3. Gene List Stability

An analysis using the gene list feature of VxInsight typically progresses as follows.
First, a question is posed within the VxInsight framework, and a statistical contrast is computed for that question. The gene list is initially examined to see if any genes are recognized by their short descriptions, which, if available, are included with the genes. The plots are examined, and the OMIM annotations are read. If a gene appears to be important, the literature links and other relevant National Center for Biotechnology Information (NCBI) resources are studied. This analysis step is very labor- and knowledge-intensive; it requires the bulk of the time needed for an analysis. As such, it is very important not to waste time following leads that are only weakly indicated. That is to say, before one invests a great deal of time studying the top genes on a list, it is important to know that those highly ranked genes would likely remain highly ranked if the experiment could be repeated or if slight variations or perturbations of the data had occurred. The critical issue about any ordered list of genes is whether we can have any confidence that the list reflects a nonrandom trend. To be very concrete, suppose that My Favorite Gene (MFG) is at the top of the list in our ANOVA calculations, that is, MFG had the largest observed F-statistic from the ANOVA. What can we conclude about the observed ranking for MFG? Certainly, a naive use of the F-statistic has no support, because we tested, say, 10,000 genes and took the very largest statistic from all of those tests. Thus, an F-value for p = 0.001 would likely be exceeded about 10 times in our process even if all the numbers were random. Hence, the reported F-statistic should only be considered an index for ordering the values. However, if we could repeat the experiment and if MFG was truly important, it should, on average, sort into order somewhere near the top of the gene list.
We cannot actually repeat the experiment, but we can treat the values collected for a gene as a representative empirical distribution. If we accept that this distribution is representative, then we can draw a new set of values for each of the two groups by resampling the corresponding empirical distributions repeatedly (with replacement), as shown in Figure 4.10. This process is due to Efron and is known as bootstrapping [43]. Now consider Figure 4.11, where we resample for every gene across all of the arrays in the two groups to create, say, 100 new experiments. These experiments are then processed exactly the same way as the original measurements: we compute an ANOVA for each gene and then sort the genes by their F-values. As we construct these bootstrapped experiments, we accumulate the distribution of the positions in the list where each gene is likely to appear. Using these bootstrap results one can determine, for each gene, its average order in the gene lists. Although the distributions for such order statistics are known, they are complex. On the other hand, the bootstrapped distributions are easily accumulated, and they are acceptable for our needs.

FIGURE 4.10. A bootstrap method uses the actual measured data as an estimate of the underlying distribution from which the data were drawn. One can then sample from that estimated underlying distribution by resampling (with replacement) from the actual measurements.

FIGURE 4.11. The actual data are processed to create the gene list, shown at the bottom left. The actual data are then resampled to create several bootstrapped data sets. These data sets are processed exactly the same way as the real data to produce a set of gene lists. The average order and the confidence bands for that order can be estimated from this ensemble of bootstrapped gene lists.
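The rank-stability procedure just described can be sketched as follows. This is a minimal illustration on invented data, not the authors' code: f_stat is a two-group one-way ANOVA F (the square of the usual t-statistic), and bootstrap_ranks resamples arrays with replacement within each group and records where each gene lands in the sorted list.

```python
import random

def f_stat(a, b):
    """Two-group one-way ANOVA F-statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    grand = (sum(a) + sum(b)) / (na + nb)
    ssb = na * (ma - grand) ** 2 + nb * (mb - grand) ** 2
    ssw = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    return float("inf") if ssw == 0 else ssb / (ssw / (na + nb - 2))

def bootstrap_ranks(expr, g1, g2, reps=100, rng=None):
    """Distribution of each gene's rank (0 = top) across resampled data sets."""
    rng = rng or random.Random(0)
    ranks = {g: [] for g in expr}
    for _ in range(reps):
        # Resample array indices with replacement, separately per group.
        s1 = [rng.choice(g1) for _ in g1]
        s2 = [rng.choice(g2) for _ in g2]
        scores = {g: f_stat([v[i] for i in s1], [v[i] for i in s2])
                  for g, v in expr.items()}
        for r, g in enumerate(sorted(scores, key=scores.get, reverse=True)):
            ranks[g].append(r)
    return ranks

def rank_summary(rank_list, lo=0.025, hi=0.975):
    """Mean rank and a central 95% band estimated from the bootstrap sample."""
    s = sorted(rank_list)
    n = len(s)
    return sum(s) / n, s[int(lo * n)], s[min(n - 1, int(hi * n))]

# Invented expression values: arrays 0-3 are group 1, arrays 4-7 are group 2.
expr = {
    "diff":  [1.0, 1.1, 0.9, 0.95, 5.0, 5.1, 4.9, 5.05],  # clearly differential
    "flat1": [2.0, 3.0, 2.5, 2.2, 2.4, 2.9, 2.1, 2.6],
    "flat2": [1.5, 2.8, 2.0, 2.6, 1.8, 2.7, 2.3, 2.4],
}
ranks = bootstrap_ranks(expr, [0, 1, 2, 3], [4, 5, 6, 7], reps=100)
mean_rank, lo95, hi95 = rank_summary(ranks["diff"])
```

A null distribution for the LLUCB test described next can be produced the same way by pooling the two groups before resampling.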
In addition to the average ranking, we compute the 95% confidence bands for each gene's ranking as estimated by the bootstraps. We report both the upper 95% confidence band and the centered 95% confidence interval for each of the genes. The lower limit of this upper 95% confidence band (LLUCB) is recorded for later use. (Note that 5% of the time we would observe a ranking below the LLUCB by random chance, even when our hypothesis is false, given the two empirical distributions.) Next, we investigate the p-values for the observed rankings of these genes under the null hypothesis, H0, that there is no difference in gene expression between the two groups (1 and 2). In this case (when H0 is in fact true), the best empirical distribution is the unordered combination of all the values without respect to their group labels. To test this hypothesis, we create, for example, 10,000 synthetic distributions by bootstrapping from this combined empirical distribution and process them exactly as we did the original data. We are interested in the fraction of the time that a particular gene ranks higher in the bootstrapped results than the appropriate critical value. There are several reasonable choices for this critical value. We could use the actual observed ranking or the average ranking from the bootstraps under the assumption that H0 was false. Instead, we take an even more conservative stance and choose a critical value using a power analysis to control our chance of a type II error. We set beta = 0.05, or 5%. If H0 were false (i.e., if the groups do have different means), then the earlier bootstrapping experiments suggest that one might randomly observe a ranking as low as the LLUCB about 5% of the time. Hence, we examine the later bootstrap experiments (with H0 assumed true and thus no group differences) and find the fraction of the time that we observe a ranking at or above the LLUCB. This value is reported, gene by gene, as the p-value for the actual rankings.
In essence, we are saying that if H0 is true, then by random chance we would have seen the gene ranking above the LLUCB with probability p. As the LLUCB is much lower than the actual ranking, this p-value is very conservative for the actual ranking. To investigate the meaning of the actual F-statistics used to index these gene lists, we computed another bootstrap experiment. We were interested in the effect of scaling the original expression values by their Savage-scored order statistics. As previously discussed, this scoring is felt to be more robust than taking logarithms. However, we were concerned that it might influence our p-values, so we developed a code to estimate the expected F-statistic for the mth-ranked gene in a gene list from two groups (1 and 2) having j and k arrays, respectively. This code computes a large bootstrap after randomizing the Savage scores within each of the j + k arrays. The code then computes the ANOVA for each gene and eventually sorts the resulting genes into decreasing order by F-statistic. The final result is a p-value (by bootstrap) for the two groups with the specific numbers of arrays. This computation is rather intensive and should either be fully tabulated or run only as needed for genes uncovered by the earlier methods. We have not run extensive simulations of this code against the p-values or the list order distributions, but our limited checks did suggest that genes which always ranked near the top of the differentiating gene lists do have rare F-statistics based on the Savage-scored orders relative to the expected random distributions (data not shown).

4.4.1.4. Comparing Gene Lists

As mentioned previously, the ANOVA-plus-bootstrap approach described above is only one way to find genes which may have important roles with respect to particular biological questions.
Our collaborators, for example, have used support vector machine recursive feature elimination (SVM RFE) [41], a Bayesian network approach using a feature selection method known as TNoM [44], and a technique based on fuzzy-set theory, as well as more classical techniques such as discriminant analysis. By using several of these methods, one might hope to find a consensus list of genes. Our experience has shown that this is possible. While the lists from different methods are usually not exactly the same, they often have large intersections. However, the simultaneous comparison of multiple lists has been a difficult problem. We have developed a number of methods which have helped us understand that the lists may differ in their details but still be very similar biologically. This makes sense considering that different methods might identify different but closely related elements of regulation or interaction networks. In that case, the methods suggest the importance of the network and of a particular region in that network, even though they do not identify exactly the same elements. This relatedness suggests something similar to the kind of guilt-by-association method that has been used to impute gene functions for unstudied genes that cluster near genes with known function, as in [29]. Indeed, something similar can be used to evaluate the similarity of multiple gene lists. Figure 4.12a shows a VxInsight screen for clusters of genes. Highlighted across the clusters are genes identified by different methods (shown in different colors). In this particular case, one can see that the various methods do identify genes that are generally collocated, which suggests that gene regulation and interacting networks probably do play a strong role with respect to the question under consideration. Here, for example, the question was: Which genes are differentially expressed in two types of cancer [acute lymphoblastic/acute myeloid leukemia (ALL/AML)]?
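A first-cut numerical check on such agreement is simply the overlap among the top-ranked genes from each method. The gene names and rankings below are invented for illustration:

```python
def topk_overlap(list_a, list_b, k):
    """Fraction of the top-k entries shared by two ranked gene lists."""
    return len(set(list_a[:k]) & set(list_b[:k])) / k

# Hypothetical ranked lists from three selection methods (illustrative only).
anova_list = ["g1", "g7", "g3", "g9", "g2", "g5"]
svm_list   = ["g7", "g1", "g9", "g4", "g3", "g8"]
bayes_list = ["g2", "g6", "g1", "g8", "g7", "g5"]

print(topk_overlap(anova_list, svm_list, 5))    # large intersection
print(topk_overlap(anova_list, bayes_list, 5))  # smaller intersection
```

As the text argues, raw overlap understates biological agreement when methods select different members of the same regulatory network, which is exactly why the map-based and PCA-based comparisons discussed here are more informative.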
However, multiple methods do not always produce such strong agreement, as shown in Figure 4.12b. In this case the question was: Which genes are predictive for patients who will ultimately have successful treatment outcomes (remission/failure)? Unfortunately, this question had no clear answer. Interestingly, the ANOVA-plus-bootstrap method suggests a very stable set of genes for the first question, while the list for the second question is not stable and has confidence bands spanning hundreds of rank-order positions (data not shown). Finally, we have discovered that when two methods produce similar gene lists, the coherence may be due more to the underlying similarity of the methods than to any true biological significance. We discovered this fact using a visualization of the gene lists based on principal-component analysis (PCA), a common technique for dimensionality reduction that involves projecting the original data onto the principal components. These components are ordered according to the amount of variance captured by each component: the first component is the most informative, the second is the next most informative, and so on. Further information on PCA can be found in [45, 46]. An example of PCA applied to microarray data can be found in [8].

FIGURE 4.12. (a) General collocation of genes identified by different algorithms (shown with different colors). This collocation suggests that the different methods are in reasonable agreement. (b) Genes selected by each method are widely separated and show no coherence, suggesting that there is a lack of consensus among the methods.

In our PCA-based visualization of multiple gene lists, each gene is considered to be a point in patient space, where each dimension corresponds to a different patient. Since, in this case, there were 12,000 genes and 126 patients, the spatial representation had 12,000 points (samples) in a 126-dimensional space.
Of the 12,000 genes we considered only the roughly 600 that occurred in the different gene lists, reducing our problem to 600 genes in 126 dimensions. Furthermore, because we were mainly interested in how the genes compared as discriminators, and not in how their actual expression levels compared, we projected the genes onto the 126-dimensional unit sphere in patient space, as suggested in Figure 4.13a. Geometrically, this corresponds to comparing the directions of the genes in the various gene lists as opposed to their magnitudes. To understand this visualization, it is useful to imagine a sphere with a plane passing through its origin. The sphere corresponds to the unit sphere (the sphere with radius 1 centered at the origin) in patient space, and the plane corresponds to the plane determined by the first two principal components. The first principal component points in the radial direction of the sphere, and the second principal component is tangential to the sphere at the sphere’s intersection with the first principal component. The vector representing a particular gene will intersect the unit sphere, and the intersection will lie near the equator of the sphere (the unit circle) if the gene lies in the plane of the first two principal components. To the extent that the gene lies above or below the plane of the first two principal components, the projection of the intersection back down onto the plane will lie farther inside the equator. The distribution of these projections onto the principal-component plane suggests how a given method of gene selection identifies important genes.

FIGURE 4.13. (a) A few genes from three different methods are shown intersecting the unit sphere, along with the projections of those intersections down onto the plane of the first two principal components.
Note that genes near that plane will have projections that fall close to the arc of the sphere, while those above or below the plane will have intersections that fall well within the equator of the sphere. (b) The ALL-vs-AML gene list comparison. The gene lists that characterize ALL vs. AML are shown, with a different color for each of the methods used to obtain them. In distinguishing ALL from AML we found that most of the genes in the lists were colocalized in this visualization. (c) Gene lists that characterize remission vs. failure are shown, with a different color for each of the methods used to obtain them. It can be seen in this figure that distinguishing remission from failure is a difficult task.

For instance, discriminant analysis and the ANOVA methods are much more similar to each other than to the Bayesian network approach. Using PCA (see [46] for an introduction), we see further that many methods are heavily influenced by differences in the first few principal components of the gene expression data. On the other hand, methods such as SVM RFE [41] are able to examine the simultaneous efficacy of groups of genes, some of which, individually, may not be discriminatory in the first or second principal component. One way to understand these differences is to consider where selected genes project onto the plane of the first two principal components: see Figure 4.13a, which schematically represents a few genes from three methods, identified by different colors. It is evident from Figure 4.13b that the gene lists selected for the ALL/AML problem are related. Unfortunately, it is equally obvious that the gene lists selected for the remission/failure problem are unrelated, as shown using the same analysis in Figure 4.13c. When distinguishing ALL from AML, we found that most of the lists were colocalized in our representative visualization (see Figs. 4.12a and 4.13b).
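The unit-sphere projection and principal-component plane described above can be sketched numerically. The following is a schematic illustration only, with random data standing in for real expression measurements (the actual analysis used roughly 600 list genes across 126 patients); it is not our production code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for expression data: 600 genes (rows) measured
# across 126 patients (columns), as in the text.
X = rng.normal(size=(600, 126))

# Project each gene onto the unit sphere in patient space, so that
# directions, not magnitudes, are compared.
U = X / np.linalg.norm(X, axis=1, keepdims=True)

# Principal components of the normalized genes, via SVD of the
# mean-centered matrix (rows of Vt are the components).
C = U - U.mean(axis=0)
_, _, Vt = np.linalg.svd(C, full_matrices=False)

# Coordinates of each gene in the plane of the first two components.
coords = C @ Vt[:2].T          # shape (600, 2)

# Genes lying near that plane project close to the unit circle;
# genes far above or below it fall well inside the "equator".
radii = np.linalg.norm(coords, axis=1)
```

Plotting `coords` for each method's gene list, with one color per method, reproduces the kind of display shown schematically in Figure 4.13a.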
When distinguishing remission from failure, on the other hand, we could not arrive at a satisfactory conclusion (see Figs. 4.12b and 4.13c), which is also consistent with the results from ANOVA plus bootstrapping (data not shown).

4.4.2. Unifying Gene Lists

Although it is useful to compare gene lists, the task of sifting through five or six such lists can be very time consuming for the biologist. For this reason, we also developed a quick-and-easy way to combine multiple lists into a single master list. In order to collate and compare the different gene lists we used a weighted voting scheme. In this scheme, we consider genes to be candidates and gene lists to be votes: each method suggests, in order of preference, which genes should be elected. Our method for combining the gene lists ranks the candidate genes according to the geometric mean of the voting order in each list, where each method is allowed only 30 votes (the length of our shortest list) and all other genes are given a vote of 31. An example is shown in Table 4.2.

TABLE 4.2 A Simple Combination of Many Gene Lists

Rank:     1     2     3     4     5      6      7     8      9      10
Geo Mean: 1     6.17  6.49  8.57  10.14  10.24  10.9  13.34  13.66  13.91

Note: The overall rank was obtained by using the geometric mean of the ranks assigned by the five methods (SVM, Stepwise, ROC, ANOVA, and TNoM); an empty cell in the original table indicates a value of 31.

4.5. CONCLUSION

At this point in the analysis it may seem that the biology has dissolved into a sea of numbers and statistical methods. However, these methods are our only guideposts when we begin reading the known information about the indicated genes. Without them we could easily waste the valuable time of researchers in the study of genes which are only weakly, if at all, related to the central questions of the research.
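The voting scheme of Section 4.4.2 can be sketched as follows. This is a minimal illustration under the conventions given there (geometric mean of rank positions, with genes absent from a list assigned the penalty rank 31); the gene names and lists are invented:

```python
from math import prod

def combine_rank_lists(lists, penalty=31):
    """Combine ranked gene lists into one master list, ordered by the
    geometric mean of each gene's rank position; genes missing from a
    list receive the rank `penalty` (one past the 30-vote limit)."""
    genes = set().union(*lists)

    def geo_mean(gene):
        ranks = [lst.index(gene) + 1 if gene in lst else penalty
                 for lst in lists]
        return prod(ranks) ** (1 / len(ranks))

    return sorted(genes, key=geo_mean)

# Hypothetical votes from three methods (most-preferred gene first).
votes = [["g1", "g2", "g3"],
         ["g1", "g3", "g2"],
         ["g2", "g1", "g4"]]
master = combine_rank_lists(votes)   # -> ["g1", "g2", "g3", "g4"]
```

Here "g1" wins because its ranks (1, 1, 2) have the smallest geometric mean, while "g4", which appears on only one list, is pushed to the bottom by two penalty ranks of 31.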
Guided by these methods, we can approach the literature with greater confidence and are much more likely to see the important biology reemerge in the gene annotations and the cited literature. However, even after these statistical filters, the literature is vast and is not organized to make our searching particularly easy. We have come to recognize that this step (where very knowledgeable scientists must read extensively) is the critical, rate-limiting step for our research. As a result, we (and many others) have begun work with the natural language processing (NLP) community to build tools that find, summarize, and reorder important parts of the available online literature to make that reading process simpler and more focused on our research needs. Although we do not discuss such work here, a demonstration of a preliminary automatic Gene List Exploration Environment (GLEE) can be found at http://aiaia.nmsu.edu/. See also [34].

Regardless, gene expression studies are providing new insights into molecular mechanisms and hold the promise of deeper biological understanding. However, the speed at which groups of genes generated by microarray analysis can be assembled into pathways is one of the limiting steps in translating these discoveries into applications. Mistakes and dead ends due to faulty microarray analysis tools are a particularly frustrating way to slow this analysis. The methods presented here are potentially useful for uncovering groups of genes that fingerprint biologically important subtypes, thereby aiding biological discovery, refining diagnosis, and improving the assessment of prognosis. To provide greater confidence in our tools, we have also benchmarked our methods extensively for reliability. Indeed, we believe that both of these factors (usefulness and reliability) are equally important, particularly for the analysis of microarray data.
ACKNOWLEDGMENTS

The authors of this chapter gratefully acknowledge the immense contributions of our collaborators, without which this work could not have been accomplished. Here we reported on the methods and techniques developed at Sandia National Laboratories. However, much of the work discussed sprang from our collaborators; we thank them for their generous time and patience in teaching us enough biology to be helpful and for including us in their laboratories. While developing these methods we have worked directly with, and have had fruitful discussions with, dozens of researchers from these laboratories and have benefited from many generous introductions to other researchers. We owe each of these people our thanks for their critiques of our work and their encouragement to continue. Our collaborations with the University of New Mexico have been very fruitful. The informatics work reported here stems directly from close interactions with Vickie Peck, who opened the world of genomics to us. Certainly, without the continuing encouragement and contributions of Maggie Werner-Washburne we would not have created useful tools or been able to explain them to life scientists. One of our most important collaborators was Stuart Kim from Stanford University, who was able to see through the primitive nature of our early efforts and recognize what they could become. Our collaboration with Stuart stretched and improved all of our tools and led to an important joint publication that continues to have impact. The microarray work with Cheryl Willman’s laboratory at the University of New Mexico Cancer Center drove the development of many of the statistical techniques presented here. Cheryl included us in her weekly laboratory meetings, which must surely have been made more tedious by our presence and the continuing need to teach us the rudiments of leukemia biology.
Each of these groups is large and we have learned something from every one of you, for which we say thank you. We would especially like to acknowledge the help and collaboration of Moni Kiraly, Jim Lund, Kyle Duke, Min Jiang, Joshua M. Stuart, and Andreas Eizinger from the Kim Laboratory and, of course, Edwina Fuge, Jose Weber, Juanita Martinez, Anthony Aragon, and Angela Rodriguez from the Werner-Washburne Laboratory. From the Willman Laboratory, we must certainly acknowledge Susan Atlas, Paul Helman, Robert Veroff, Erik Andries, Kerem Ar, Yuexian Xu, Huining Kang, Xuefei Wang, Fred Schultz, Maurice Murphy, and particularly Mónica Mosquera-Caro and Jeffrey Potter, who have been immensely helpful to our research. We would particularly like to thank Jon C. Helton for suggesting the use of Savage scoring as a means to normalize microarray data. If we have unfortunately omitted someone to whom we owe our thanks and gratitude, please accept our apologies and recognize that we do value your help. As always, any misrepresentation or error with respect to our collaborators’ work is purely our own fault. Finally, we would like to thank the W. M. Keck Foundation for funding the W. M. Keck Genomics Center at the University of New Mexico, which was particularly important in our work. This work was supported by grants from the National Science Foundation (NSF) (MCB-0092374) to M.W.W., an NSF minority postdoctoral fellowship to M.J.M., and a U.S. Department of Agriculture grant (99-38422-8034) to A.D.A. Development of the algorithms described here was funded in part by the W. M. Keck Foundation and by the Laboratory Directed Research and Development program at Sandia National Laboratories. In addition, a portion of this work was funded by the U.S. Department of Energy’s Genomics: GTL program (www.doegenomestolife.org) under the project “Carbon Sequestration in Synechococcus Sp.: From Molecular Machines to Hierarchical Modeling” (www.genomes-to-life.org).
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the U.S. Department of Energy under Contract DE-AC04-94AL85000.

REFERENCES

1. G. S. Davidson et al., “Knowledge mining with VxInsight: Discovery through interaction,” J. Intell. Inform. Sys., 11(3): 259–285, 1998.
2. M. Schena et al., “Quantitative monitoring of gene expression patterns with a complementary DNA microarray,” Science, 270(5235): 467–470, 1995.
3. J. L. DeRisi, V. R. Iyer, and P. O. Brown, “Exploring the metabolic and genetic control of gene expression on a genomic scale,” Science, 278(5338): 680–686, 1997.
4. P. O. Brown and D. Botstein, “Exploring the new world of the genome with DNA microarrays,” Nature Genet., 21: 33–37, 1999.
5. M. Eisen et al., “Cluster analysis and display of genome-wide expression patterns,” Proc. Natl. Acad. Sci., 95(25): 14863–14868, 1998.
6. P. Spellman et al., “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Mol. Biol. Cell, 9(12): 3273–3297, 1998.
7. T. Golub et al., “Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring,” Science, 286(5439): 531–537, 1999.
8. O. Alter, P. O. Brown, and D. Botstein, “Singular value decomposition for genome-wide expression data processing and modeling,” Proc. Natl. Acad. Sci., 97(18): 10101–10106, 2000.
9. H. C. Causton, J. Quackenbush, and A. Brazma, Microarray Gene Expression Data Analysis: A Beginner’s Guide, Blackwell Publishers, Malden, MA, 2003.
10. S. Draghici, Data Analysis Tools for DNA Microarrays, Chapman & Hall, Boca Raton, FL, 2003.
11. M.-L. T. Lee, Analysis of Microarray Gene Expression Data, Kluwer Academic, Boston, MA, 2004.
12. M. Schena, Microarray Analysis, Wiley, New York, 2002.
13. G. E. P. Box, J. S. Hunter, and W. G. Hunter, Statistics for Experimenters: Design, Innovation, and Discovery, Wiley, Hoboken, NJ, 2005.
14. J. F. Zolman, Biostatistics, Oxford University Press, New York, 1993.
15. J. R. Thompson, Simulation: A Modeler’s Approach, Wiley, New York, 2000.
16. F. Mosteller and J. W. Tukey, Data Analysis and Regression, Addison-Wesley Series in Behavioral Science: Quantitative Methods, Addison-Wesley, Reading, MA, 1977.
17. I. R. Savage, “Contributions to the theory of rank order statistics—the two-sample case,” Ann. Math. Stat., 27: 590–615, 1956.
18. M. Bilban et al., “Normalizing DNA microarray data,” Curr. Issues Mol. Biol., 4(2): 57–64, 2002.
19. J. Quackenbush, “Microarray data normalization and transformation,” Nature Genet., 32: 496–501, 2002.
20. G. Davidson, B. Wylie, and K. Boyack, “Cluster stability and the use of noise in interpretation of clustering,” in 7th IEEE Symposium on Information Visualization (InfoVis 2001), IEEE Computer Society, Washington, D.C., San Diego, CA, 2001.
21. P. Tamayo et al., “Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation,” Proc. Natl. Acad. Sci., 96(6): 2907–2912, 1999.
22. A. J. Enright and C. A. Ouzounis, “BioLayout—an automatic graph layout algorithm for similarity visualization,” Bioinformatics, 17(9): 853–854, 2001.
23. R. Shamir and R. Sharan, “CLICK: A clustering algorithm with applications to gene expression analysis,” in Proc. Int. Conf. Intell. Syst. Mol. Biol. (ISMB), 8: 307–316, AAAI Press, Menlo Park, CA, 2000.
24. Y. Xu, V. Olman, and D. Xu, “Clustering gene expression data using a graph-theoretic approach: An application of minimum spanning trees,” Bioinformatics, 18(2): 536–545, 2002.
25. A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,” ACM Comput. Surv., 31(3): 264–323, 1999.
26. R. R. Wilcox, Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy, Springer-Verlag, New York, 2001, p. 110.
27. R. R.
Wilcox, Introduction to Robust Estimation and Hypothesis Testing, in G. J. Lieberman and I. Olkin (Eds.), Statistical Modeling and Decision Science, Academic Press, San Diego, CA, 1997, p. 188.
28. Online Mendelian Inheritance in Man (OMIM), McKusick-Nathans Institute for Genetic Medicine, Johns Hopkins University (Baltimore, MD), and National Center for Biotechnology Information, National Library of Medicine (Bethesda, MD), 2000.
29. S. Kim et al., “A gene expression map for Caenorhabditis elegans,” Science, 293(5537): 2087–2092, 2001.
30. T. M. J. Fruchterman and E. M. Reingold, Graph Drawing by Force-Directed Placement, University of Illinois, Urbana-Champaign, IL, 1990.
31. S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by simulated annealing,” Science, 220(4598): 671–680, 1983.
32. M. Werner-Washburne et al., “Comparative analysis of multiple genome-scale data sets,” Genome Res., 12(10): 1564–1573, 2002.
33. J. M. Stuart et al., “A gene co-expression network for global discovery of conserved genetic modules,” Science, 302: 249–255, 2003.
34. G. S. Davidson et al., High Throughput Instruments, Methods, and Informatics for Systems Biology, Sandia National Laboratories, Albuquerque, NM, 2003.
35. Y. Koren, L. Carmel, and D. Harel, “Drawing huge graphs by algebraic multigrid optimization,” Multiscale Modeling and Simulation, 1(4): 645–673, 2003.
36. C. Walshaw, “A multilevel algorithm for force-directed graph drawing,” in Proceedings of the Eighth International Symposium on Graph Drawing (GD), Springer-Verlag, London, UK, 2000.
37. B. Hendrickson and R. Leland, “A multi-level algorithm for partitioning graphs,” in Supercomputing, ACM Press, New York, 1995.
38. S. Roweis and L. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, 290(5500): 2323–2326, 2000.
39. J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, 290(5500): 2319–2323, 2000.
40. A. Ben-Hur, A.
Elisseeff, and I. Guyon, “A stability based method for discovering structure in clustered data,” in R. B. Altman, A. K. Dunker, L. Hunter, and T. E. Klein (Eds.), Pacific Symposium on Biocomputing, World Scientific, Hackensack, NJ, 2002.
41. I. Guyon et al., “Gene selection for cancer classification using support vector machines,” Machine Learning, 46: 389–422, 2002.
42. J. Jaeger, R. Sengupta, and W. L. Ruzzo, “Improved gene selection for classification of microarrays,” in R. B. Altman, A. K. Dunker, L. Hunter, and T. E. Klein (Eds.), Pacific Symposium on Biocomputing, World Scientific, Hackensack, NJ, 2003.
43. B. Efron, “Bootstrap methods: Another look at the jackknife,” Ann. Stat., 7(1): 1–26, 1979.
44. P. Helman, R. Veroff, S. R. Atlas, and C. Willman, “A Bayesian network classification methodology for gene expression data,” J. Comput. Biol., 11(4): 581–615, 2004.
45. I. T. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, 2002.
46. M. Kirby, Geometric Data Analysis, Wiley, New York, 2001.