Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Normalisation of multicondition cDNA macroarray data

Comparative and functional genomics, 2007
Background. Normalisation is a critical step in obtaining meaningful information from the high-dimensional DNA array data. This is particularly important when complex biological hypotheses/questions, such a functional analysis and regulatory interactions within biological systems, are investigated. A nonparametric, intensity-dependent normalisation method based on global identification of self-consistent set (SCS) of genes is proposed here for such systems. Results. The SCS normalisation is introduced and its behaviour demonstrated for a range of user-defined parameters affecting its performance. It is compared to a standard global normalisation method in terms of noise reduction and signal retention. Conclusions. The SCS normalisation results using 16 macroarray data sets from a Bacillus subtilis experiment confirm that the method is capable of reducing undesirable experimental variation whilst retaining important biological information. The ease and speed of implementation mean th......Read more
Hindawi Publishing Corporation Comparative and Functional Genomics Volume 2007, Article ID 90578, 12 pages doi:10.1155/2007/90578 Research Article Normalisation of Multicondition cDNA Macroarray Data Nicola L. Dawes and Jarka Glassey School of Chemical Engineering and Advanced Materials, Merz Court, University of Newcastle upon Tyne, Newcastle upon Tyne, NE1 7RU, UK Received 17 July 2006; Revised 21 December 2006; Accepted 28 February 2007 Recommended by John Quackenbush Background. Normalisation is a critical step in obtaining meaningful information from the high-dimensional DNA array data. This is particularly important when complex biological hypotheses/questions, such a functional analysis and regulatory interac- tions within biological systems, are investigated. A nonparametric, intensity-dependent normalisation method based on global identification of self-consistent set (SCS) of genes is proposed here for such systems. Results. The SCS normalisation is introduced and its behaviour demonstrated for a range of user-defined parameters aecting its performance. It is compared to a standard global normalisation method in terms of noise reduction and signal retention. Conclusions. The SCS normalisation results using 16 macroarray data sets from a Bacillus subtilis experiment confirm that the method is capable of reducing undesirable experimen- tal variation whilst retaining important biological information. The ease and speed of implementation mean that this method can be easily adapted to other multicondition time/strain series single colour array data. Copyright © 2007 N. L. Dawes and J. Glassey. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. 1. BACKGROUND DNA array-based approaches have been widely applied for gene expression studies in many research areas from func- tional genomics to biomedical applications. As these stud- ies become more rigorous, they require the use of multiple DNA arrays and normalisation is a key issue in the analysis of the resulting data. Indeed, as Edwards [1] claims “Nor- malisation has profound eects on subsequent analysis, irre- spective of the methodology used. Failure to normalize ap- propriately will generally lead to misleading conclusions.” An eective normalisation technique is one that reduces ex- perimental variation or biases (noise) without aecting the measurement of the biological variation (signal). There are a number of well-documented normalisation techniques rang- ing from simple scaling methods to more complex statistical approaches [24]. “Global” scaling methods are suitable for data sets where relatively few genes are expected to change between conditions and global array statistics such as me- dian/mean expression levels can be used. Statistical models require a good level of replication of experiments in order to give acceptable results [5]. For some systems, normalisation is built into the experi- mental design using specific software for the DNA array sys- tem, such as the Aymetrix system. Some DNA array ex- periments, however, produce data that is not so easily nor- malised. This paper focuses on the normalisation of multi- condition time series gene expression data, generated using one-colour membrane macroarrays. For details of the exper- iment, see Materials and Methods. The data set consists of data from 16 hybridisation experiments, using four strains of the organism Bacillus subtilis grown in phosphate-limiting conditions as shown in Figure 1. The four strains are re- ferred to as “wildtype” (B. subtilis 168 WT), “sigB” (sigB-null mutant 168-ML6), “phoR” (phoR-null mutant 168-PR), and “double mutant” (sigB-null, phoR-null mutant 168-ML6PR). Samples were taken from each strain at 4 time points across the transition to phosphate limitation. In the resulting data set, the expression of many genes was aected by the phosphate-limiting conditions and modified in the mutant strains. It is theoretically possible to analyse this data set using some traditional methods of normalisa- tion in order to answer simplified biological hypotheses rep- resented in Figure 1 as Options 1, 2, and 3. However, in order to investigate the mechanisms of interaction between general and specific phosphate-stress response regulons, it is essen- tial to use all the data and normalise it appropriately to avoid loss of important biological variation (signal). Currently, the scientific literature mainly reports on more straightforward investigations of either single time points
2 Comparative and Functional Genomics Genes Option 1: expression over time in a given strain T1 T2 T3 T4 Strain 1 Gene 1 Option 3: expression of a given gene in time in dierent strains T1 T2 T3 T4 S4 S3 S2 S1 Genes Time point 1 S1 S2 S3 S4 Option 2: expression in dierent strains for a given time Time points Genes Postphosphate starvation Prephosphate starvation Strains Figure 1: Three-dimensional structure of the B. subtilis gene expression data set. Graphical representation of the experimental data available for analysis illustrating the complex relationships and the optional simplified biological hypothesis that can be investigated. from a variety of strains/conditions or time profile of gene expression from a single strain using traditional methods of normalisation. However, functional genomics in particular will require an alternative approach to both experimental de- sign and data analysis and hence novel normalisation meth- ods, such as the one proposed here, will become more appro- priate. Depending on the array construction, the application of statistical modelling methods, which rely more heavily on data replication [6], may be limited. Also, most other new normalisation methods have been developed using data from two-colour arrays [7, 8] and in some cases these methods are not suitable for one-colour array data. A number of these normalisation methods are based on the identification of a group of genes deemed to be invariant [9] often between the two-colour channels on a given array [4]. This can only be translated to the analysis of one-colour array data by com- paring all the arrays in the experiment to one array, taken as the baseline array. It is not possible to choose a suitable baseline array in the B. subtilis data set due to biological vari- ability between the strains as well as across the time trajectory within each strain as a result of growth and phosphate starva- tion. Therefore a new normalisation method, which does not require a selection of a baseline array, is proposed to identify a set of invariant genes globally, across all the arrays simulta- neously. This manuscript sets out in detail the proposed nor- malisation method and investigates the sensitivity of the al- gorithm to two critical parameters. Ideally, the benefits of a new normalisation method would be judged by the im- provements in clustering the data compared to nonnor- malised data (or data normalised using alternative normal- isation methods). However, this requires knowledge of ex- pected composition of clusters in order to assess whether correct clusters are identified. In this particular application, gaining such knowledge is indeed the overall aim of the ex- perimental work and hence it is not available at this stage. Thus a comparison in terms of a simple nonparametric scor- ing method [10] (referred to here as the “Park” score) is used to highlight the improvements in identifying dierentially expressed genes (known to be under the control of the pho regulon) from the normalised data. Given the problems in quantifying the benefits of the new normalisation procedure on the biological system studied, it would be desirable to compare this procedure with stan- dard normalisation methods used on data publicly available. However, cDNA array data currently publicly available does not have the same structure (gene expression measured for a
Hindawi Publishing Corporation Comparative and Functional Genomics Volume 2007, Article ID 90578, 12 pages doi:10.1155/2007/90578 Research Article Normalisation of Multicondition cDNA Macroarray Data Nicola L. Dawes and Jarka Glassey School of Chemical Engineering and Advanced Materials, Merz Court, University of Newcastle upon Tyne, Newcastle upon Tyne, NE1 7RU, UK Received 17 July 2006; Revised 21 December 2006; Accepted 28 February 2007 Recommended by John Quackenbush Background. Normalisation is a critical step in obtaining meaningful information from the high-dimensional DNA array data. This is particularly important when complex biological hypotheses/questions, such a functional analysis and regulatory interactions within biological systems, are investigated. A nonparametric, intensity-dependent normalisation method based on global identification of self-consistent set (SCS) of genes is proposed here for such systems. Results. The SCS normalisation is introduced and its behaviour demonstrated for a range of user-defined parameters affecting its performance. It is compared to a standard global normalisation method in terms of noise reduction and signal retention. Conclusions. The SCS normalisation results using 16 macroarray data sets from a Bacillus subtilis experiment confirm that the method is capable of reducing undesirable experimental variation whilst retaining important biological information. The ease and speed of implementation mean that this method can be easily adapted to other multicondition time/strain series single colour array data. Copyright © 2007 N. L. Dawes and J. Glassey. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. 1. BACKGROUND DNA array-based approaches have been widely applied for gene expression studies in many research areas from functional genomics to biomedical applications. As these studies become more rigorous, they require the use of multiple DNA arrays and normalisation is a key issue in the analysis of the resulting data. Indeed, as Edwards [1] claims “Normalisation has profound effects on subsequent analysis, irrespective of the methodology used. Failure to normalize appropriately will generally lead to misleading conclusions.” An effective normalisation technique is one that reduces experimental variation or biases (noise) without affecting the measurement of the biological variation (signal). There are a number of well-documented normalisation techniques ranging from simple scaling methods to more complex statistical approaches [2–4]. “Global” scaling methods are suitable for data sets where relatively few genes are expected to change between conditions and global array statistics such as median/mean expression levels can be used. Statistical models require a good level of replication of experiments in order to give acceptable results [5]. For some systems, normalisation is built into the experimental design using specific software for the DNA array system, such as the Affymetrix system. Some DNA array ex- periments, however, produce data that is not so easily normalised. This paper focuses on the normalisation of multicondition time series gene expression data, generated using one-colour membrane macroarrays. For details of the experiment, see Materials and Methods. The data set consists of data from 16 hybridisation experiments, using four strains of the organism Bacillus subtilis grown in phosphate-limiting conditions as shown in Figure 1. The four strains are referred to as “wildtype” (B. subtilis 168 WT), “sigB” (sigB-null mutant 168-ML6), “phoR” (phoR-null mutant 168-PR), and “double mutant” (sigB-null, phoR-null mutant 168-ML6PR). Samples were taken from each strain at 4 time points across the transition to phosphate limitation. In the resulting data set, the expression of many genes was affected by the phosphate-limiting conditions and modified in the mutant strains. It is theoretically possible to analyse this data set using some traditional methods of normalisation in order to answer simplified biological hypotheses represented in Figure 1 as Options 1, 2, and 3. However, in order to investigate the mechanisms of interaction between general and specific phosphate-stress response regulons, it is essential to use all the data and normalise it appropriately to avoid loss of important biological variation (signal). Currently, the scientific literature mainly reports on more straightforward investigations of either single time points 2 Comparative and Functional Genomics Prephosphate starvation Postphosphate starvation s ain Genes St r Time points Genes Genes Time point 1 S1 S2 S3 S4 Strain 1 T1T2 T3 T4 Gene 1 T1 T2 T3 T4 Option 2: expression in different strains for a given time S1 S2 S3 S4 Option 1: expression over time in a given strain Option 3: expression of a given gene in time in different strains Figure 1: Three-dimensional structure of the B. subtilis gene expression data set. Graphical representation of the experimental data available for analysis illustrating the complex relationships and the optional simplified biological hypothesis that can be investigated. from a variety of strains/conditions or time profile of gene expression from a single strain using traditional methods of normalisation. However, functional genomics in particular will require an alternative approach to both experimental design and data analysis and hence novel normalisation methods, such as the one proposed here, will become more appropriate. Depending on the array construction, the application of statistical modelling methods, which rely more heavily on data replication [6], may be limited. Also, most other new normalisation methods have been developed using data from two-colour arrays [7, 8] and in some cases these methods are not suitable for one-colour array data. A number of these normalisation methods are based on the identification of a group of genes deemed to be invariant [9] often between the two-colour channels on a given array [4]. This can only be translated to the analysis of one-colour array data by comparing all the arrays in the experiment to one array, taken as the baseline array. It is not possible to choose a suitable baseline array in the B. subtilis data set due to biological variability between the strains as well as across the time trajectory within each strain as a result of growth and phosphate starvation. Therefore a new normalisation method, which does not require a selection of a baseline array, is proposed to identify a set of invariant genes globally, across all the arrays simultaneously. This manuscript sets out in detail the proposed normalisation method and investigates the sensitivity of the algorithm to two critical parameters. Ideally, the benefits of a new normalisation method would be judged by the improvements in clustering the data compared to nonnormalised data (or data normalised using alternative normalisation methods). However, this requires knowledge of expected composition of clusters in order to assess whether correct clusters are identified. In this particular application, gaining such knowledge is indeed the overall aim of the experimental work and hence it is not available at this stage. Thus a comparison in terms of a simple nonparametric scoring method [10] (referred to here as the “Park” score) is used to highlight the improvements in identifying differentially expressed genes (known to be under the control of the pho regulon) from the normalised data. Given the problems in quantifying the benefits of the new normalisation procedure on the biological system studied, it would be desirable to compare this procedure with standard normalisation methods used on data publicly available. However, cDNA array data currently publicly available does not have the same structure (gene expression measured for a 3500 3000 2500 2000 1500 1000 500 0 Number of SCS genes as parameter a is increased 0 400 Figure 2: Number of genes identified as SCS with increasing value of a. The maximum number of possible SCS genes is 2958 as genes that are unreliable or have zero expression are excluded from the SCS during the implementation of the algorithm. Number of SCS genes as parameter x is varied 500 400 300 200 100 0 10 2. 800 1200 1600 2000 2400 2800 3200 3600 4000 Parameter a Number of SCS genes number of conditions over a number of time points following the same experimental protocol) and hence this normalisation procedure could not be applied to such data. The main problem with identifying the best (or most appropriate) normalisation method is the lack of a “gold standard”—a validation set that can reflect the complexity of real data [4]. Thus one of two approaches usually employed to compare different normalisation methods or to introduce a new one is the ability to reduce noise in the data and the ability to retain biological signal. A number of approaches were used to asses the noise reduction, such as MA diagnostic plots used for two-colour cDNA microarrays [11] or coefficient of variation [4]. Unfortunately, the data investigated in this work does not contain technical replicates other than the duplicate spots on the same array. These cannot be used for the purposes of noise reduction assessment although they were used in MA plots to establish the reliability of each of the gene expression values as well as to check whether the proposed normalisation introduces any undesirable artefacts into the data. Thus alternative means of demonstrating noise reduction and a more plausible biological explanation of the expression data is provided in this manuscript. The ability to retain biological signal is demonstrated through the changes in the detection of differentially expressed genes. 3 Number of SCS genes N. L. Dawes and J. Glassey 20 30 RESULTS 2.1. Self-consistent set normalisation The SCS normalisation process can be seen as a series of filters implemented in the SCS algorithm described in the Materials and Methods section. A gene must pass all the filters to be classed as “self-consistent.” The filters are based on the absolute difference in rank position of a gene’s contribution to the total expression on a single array between the arrays representing corresponding time points for each strain. For each time point, comparisons were made between wildtype and sigB; wildtype and phoR, wildtype and double mutant, sigB and phoR, sigB and double mutant, and phoR and double mutant. Any gene with an absolute rank difference below the threshold a for all six comparisons across all four time points is included in the self-consistent set. These stringent criteria allow only a small, conservative set of genes to be deemed self-consistent. The size of this set is largely determined by the parameter a, the maximum absolute rank difference allowed. Once the SCS set is identified, each gene on a given array is divided by the total expression of the SCS set on that array to normalise the data. Genes that show a consistently high or low expression may skew the SCS and introduce bias, therefore a number of genes were excluded before the filter stage by finding the average of the contributions for each gene on all arrays and excluding the top and bottom x%. 2.2. Sensitivity of SCS normalisation to parameter settings The number of SCS genes identified by the algorithm is largely dependant on the two user-defined parameters, a, the 40 50 60 70 Value of parameter x 80 90 100 a = 400 a = 800 a = 1000 Figure 3: Effect of parameter x on SCS size. Influence of parameter x (the percentage of genes excluded prior to identifying SCS genes) over the number of SCS genes identified for different values of parameter a. absolute rank difference limit, and x, the proportion of genes excluded. Figures 2 and 3 show how the number of genes identified as SCS changes depending on the values of a and x. It is clear that parameter a has a greater influence over the size of the SCS than x. As a increases so does the number of genes which pass the filters and end up in the final SCS whereas when x is increased, the stringency of the algorithm is increased as there are less potential SCS genes to start with. A wide range of a and x was investigated in this case to establish the sensitivity of the algorithm to these values, although it is clear that excessively large values of a result in an unrealistically large SCS gene sets. Several values of a, set to low proportion of the total number of genes spotted on the array, can be tested rapidly and the appropriate threshold selected (as described in this manuscript). Alternatively, a minimum ranking difference (MRD = min{rank(i) − rank( j)}) can be calculated for each gene and each condition. Figure 4 shows a histogram of MRD over all six comparisons performed in this study. The MRD distribution can then be used to estimate the threshold value of a. In this case, a comparison between sets obtained when parameter a values were set to 200, 400, and 600 (representing approximately 5%, 10%, and 4 Comparative and Functional Genomics 100 7 6 4 60 3 40 2 20 1 0 Cumulative (%) 80 5 40 00 M or e 00 00 00 30 20 16 0 0 0 00 00 12 10 80 60 20 0 0 40 Ave. frequency for 6 strain-wise comparisons ×103 MRD Figure 4: Minimum rank difference (MRD) histogram. MRD is calculated for each gene and each of the six strain-wise comparisons. A standard global scaling normalisation, such as normalised median-based trimmed mean (nMTM) density—an output of the array image analysis software ArrayVision, would simply result in a straight line of totals for all the arrays since it assumes linear relationship between the intensities on different arrays and forces the total intensity on each array to be equal. This is an assumption which only holds true if a small number of genes are expected to change between conditions. Biologically, we are expecting to see a reduction in overall gene expression over the time course in each strain as phosphate starvation is encountered. Clearly the variations between array totals are significantly reduced in the case of SCS normalised data as shown in Figure 8. In addition, between-array MA plots of normalised data (both time-point and strain-wise comparisons, not shown here) revealed no block effects or introduction of artefacts by the normalisation technique. 2.4. 15% of the total number of genes, resp.) was performed using the improvements in the number of differentially expressed genes (only data for genes scoring an extreme Park score of either 0 or 16 is shown here). The bar chart in Figure 5 shows that there is no detectable improvement in using a = 600 (resulting in 132 SCS genes). There is a more notable difference in the number of genes with an extreme Park scores when a = 200, however this only results in 11 SCS genes, which is a rather low proportion of the total number of genes spotted on the array. Thus the value of a = 400 (63 SCS genes) was chosen for the future analysis of this data set. Parameter x was varied from 0.5%–5% of the total number of genes in the data set, for three different values of a (Figure 3). The number of SCS genes decreases linearly as x increases. Parameter x was selected to exclude 40 genes (∼1% of the total number of genes spotted on the array) as shown in Figures 6 and 7. This results in a reduction in standard deviation of 62% on average across all the arrays. The application of the SCS algorithm, using a = 400 and x = 1%, to the full B. subtilis data set resulted in 63 SCS genes (2% of the total number of genes spotted on the array). These genes are listed in Table 1, grouped into functional categories as defined in SubtiList Web Server [12]. The raw expression values for these genes vary over the time course of the experiments, however the rank positions of the contributions of these genes are similar for any given time point in each experiment. The SCS set represents genes whose expression remains relatively unchanged under the experimental conditions investigated here. 2.3. Noise reduction Although the lack of technical replicates precludes the use of coefficient of variation as a measure of noise reduction, the analysis of totals of gene expression for each array can serve as a useful tool to assess the impact of the normalisation procedure. Figure 8 shows a comparison of the totals of the raw gene expression for each array (black diamonds) as well as the SCS normalised totals (red squares). Differential gene expression In the absence of technical replicates in the data set studied, it has proved difficult to apply the common methodologies employed to identify differentially expressed genes such as the t-statistic or Wilcoxon test. Instead, each gene’s Park score has been calculated for every strain-wise comparison for both the nMTM- and SCS-normalised data sets. There is variation in the Park scores between the two normalisation methods and the results indicate differentially expressed genes are more likely to be correctly identified when the data is normalised with the SCS method rather than the nMTM. To illustrate this, a subset of 33 genes known to be under the control of phoR is focussed on since the expression of the pho-regulated genes is expected to be notably lower in the phoR-null mutant compared to the wildtype or the sigB-null mutant. Therefore the Park scores for the wildtype/phoR and the sigB/phoR comparisons are shown (Figures 9 and 10). These 33 genes (with the exception a small number of genes in the pho regulon that are repressed by phoR) are expected to have a high Park score in the two comparisons, indicating that they are expressed to a greater degree in the wildtype or sigB strain compared to the phoR strain. The parity plots (Figures 9 and 10) show the Park score for the selected genes using the two normalisation techniques. With the SCS normalisation, the Park scores for different values of parameter a (200, 400, and 600) are also shown. If the two techniques were equal, all the symbols are expected to lie on the parity line. Of the 33 pho-regulated genes shown in Figure 9 (the sigB/phoR comparison) 58% have a higher Park score with the SCS normalisation (a = 400), so that the points lie below the parity line. A further 36% have equal Park scores when under both normalisation techniques and only 3 genes have a higher Park score when normalised using the global scaling method. No major improvement is seen when parameter a is set to 200 or 600. A similar result is seen in Figure 10 (the wildtype/phoR comparison). In this case 42% of the pho-regulated genes have equal Park scores in both normalisation techniques and the remaining 58% have a higher Park score with N. L. Dawes and J. Glassey 5 1600 Number of genes 1400 1200 1000 800 600 400 200 0 0 16 0 16 0 16 0 16 0 16 0 16 Score Score Score Score Score Score Score Score Score Score Score Score wt/sigB wt/phoR wt/dm sigB/phoR sigB/dm phoR/dm a = 200 a = 400 a = 600 Bottom 1% genes 2E − 05 1.5E − 05 1E − 05 5E − 06 0E + 00 Bottom 100 average gene contribution values stored in ascending order Gene expression averaged over all arrays Figure 6: Lower end of gene contribution values averaged over all arrays. The average contribution values of the bottom 100 genes when the average contributions data is sorted in ascending order. When parameter x is set to 1%, the 40 genes with the lowest average contributions are excluded from the SCS. 3.5E − 02 Top 1% genes 1.6E + 09 Comparison of the sum of the raw data and normalised data on each array 25 1.4E + 09 20 1.2E + 09 1E + 09 15 8E + 08 10 6E + 08 4E + 08 5 2E + 08 0E + 00 Wildtype sigB mutant phoR mutant 0 Double mutant Normalised data (whole array totals) 2.5E − 05 Background corrected MTM density (whole array totals) Gene expression averaged over all arrays Figure 5: Extreme Park scores for each strain-wise comparison. The number of genes scoring either 0 or 16 when the Park score is calculated for each strain-wise comparison for different values of parameter a. Array totals SCS normalised totals Figure 8: Raw gene expression and normalised data totals. Comparison of the raw (nonnormalised) background gene expression array total (black diamonds) and SCS normalised data (red squares) using values of a = 400, x = 1%. 3E − 02 2.5E − 02 2E − 02 the SCS normalisation (a = 400). This indicates that the SCS normalisation allows a clearer discrimination of the genes which are known to be differentially expressed in this experimental system. 1.5E − 02 1E − 02 5E − 03 0E + 00 Top 100 average gene contribution values stored in ascending order Figure 7: Upper end of gene contribution values averaged over all arrays. The average contribution values of the top 100 genes when the average contributions data is sorted in ascending order. When parameter x is set to 1%, the 40 genes with the highest average contributions are excluded from the SCS. 3. DISCUSSION The proposed normalisation technique results in a set of genes deemed to be self-consistent. These genes show a level of consistency throughout the whole data set based on ranked positions on each array. Using rank positions rather than actual expression or contribution data means 6 Comparative and Functional Genomics Table 1: Functions of the 63 SCS genes identified in the B. subtilis data. The functional descriptions and functional categories were obtained from the SubtiList Web Server [12]. The 63 SCS genes were identified with parameter a = 400. Gene name Function∗ 1. Cell envelope and cellular processes med msmX nark rocC pstS yfmO ytlD ytrB yurO yvfR yvfS ywoE yfiJ tatCY ftsZ phrA ykuD comFB 2. Intermediary Metabolism Positive regulator of comK Multiple sugar-binding transport ATP-binding protein Nitrite extrusion protein Amino acid permease Phosphate ABC transporter (binding protein) Similar to multidrug-efflux transporter Similar to ABC transporter (permease) Similar to ABC transporter (ATP-binding protein) Similar to multiple sugar-binding protein Similar to ABC transporter transmembrane subunit Similar to ABC transporter transmembrane subunit Similar to permease Similar to two-component sensor histidine kinase [YfiK] Component of the twin-arginine translocation pathway Cell-division initiation protein Inhibitor of the activity of phosphatase RapA Similar to hypothetical proteins Late competence gene ptsH pgk sdhA gcvPB argH trpD pucA purC yabR lipA moaE 3. Information Pathways Histidine-containing phosphocarrier protein of the phosphotransferase system (PTS) (HPr protein) Phosphoglycerate kinase Succinate dehydrogenase (flavoprotein subunit) Probable glycine decarboxylase (subunit 2) Argininosuccinate lyase Anthranilate phosphoribosyltransferase Xanthine dehydrogenase Phosphoribosylaminoimidazole succinocarboxamide synthetase Similar to polyribonucleotide nucleotidyltransferase Probable lipoic acid synthetase Molybdopterin converting factor (subunit 2) dnaE uvrA sigL glcR hpr lrpC spoVT yetL yisV rplD rplJ ykkC 4. Other Functions DNA polymerase III (alpha subunit) Excinuclease ABC (subunit A) RNA polymerase sigma factor Transcriptional repressor involved in the expression of the phosphotransferase system Transcriptional repressor of sporulation and extracellular proteases genes Transcriptional regulator (Lrp/AsnC family) Transcriptional regulator Similar to transcriptional regulator (MarR family) Similar to transcriptional regulator (GntR family)/aminotransferase (mocR-like) Ribosomal protein L4 Ribosomal protein L10 (BL5) Similar to chaperonin yvtA albC ppsB xtmA pcrB yurV Similar to htrA-like serine protease Antilisterial bacteriocin (subtilosin) production Peptide synthetase PBSX defective prophage terminase (small subunit) pcrB homolog Similar to NifU protein homolog N. L. Dawes and J. Glassey 7 Table 1: Continued. Function∗ ycgL ydiI yisX ykkA yrbG yshB yutH ywnB yazA ybbP yfkC yloN ytqA yveS 6. No Similarity Similar to unknown proteins Similar to unknown proteins Similar to unknown proteins Similar to unknown proteins Similar to hypothetical proteins from B. subtilis Similar to unknown proteins Similar to unknown proteins Similar to unknown proteins Similar to unknown proteins Similar to unknown proteins Similar to unknown proteins Similar to unknown proteins Similar to unknown proteins Similar to unknown proteins ybdL ydaS Unknown Unknown sigB/phoR park scores for phoR genes 16 Wild type/phoR park scores for phoR genes 16 14 nMTM density park scores nMTM density park scores Gene name 5. Similar to Unknown Proteins 12 10 8 6 4 2 14 12 10 8 6 4 2 0 0 0 2 4 6 8 10 12 14 16 SCS park scores Parity line SCS (a = 200) vs. nMTM SCS (a = 400) vs. nMTM SCS (a = 600) vs. nMTM 0 2 4 6 8 10 SCS park scores Parity line SCS (a = 200) vs. nMTM 12 14 16 SCS (a = 400) vs. nMTM SCS (a = 600) vs. nMTM Figure 9: Park scores for sigB/phoR comparison. Comparison of the Park scores for the 33 known genes under the pho regulon control in the sigB-null mutant versus phoR-null mutant using globally scaled normalisation (nMTM) and SCS-normalised data. Park scores are shown for the SCS normalisation when parameter a is set to 200, 400, and 600. Note that some of the symbols are placed in identical positions when the Park scores are identical for both data sets. Figure 10: Park scores for wild type/phoR comparison. Comparison of the Park scores for the 33 known genes under the pho regulon control in the wildtype strain versus phoR-null mutant using globally scaled normalisation (nMTM) and SCS-normalised data. Park scores are shown for the SCS normalisation when parameter a is set to 200, 400, and 600. Note that some of the symbols are placed in identical positions when the Park scores are identical for both data sets. that whole array effects, such as exposure length, may be disregarded as we can assume that they will not affect the rank positions of the gene expression contributions. The resulting SCS is a small conservative set of genes. Some of these genes will have a stable level of expression in all experimental conditions and may be classed as perhaps core genes or essential to the cell’s basic functions. When this SCS list is compared to the list of essential B. subtilis genes published by Kobayashi et al. [13], we find that a small number of the SCS genes are indeed essential for the organism’s survival. In particular dnaE, ftsZ, pgk, rplD, rplJ, and yurV appear on both lists. There are also a number of other genes on the SCS list 8 that may belong to the same operons or pathways as some of the essential B. subtilis genes. In total, about a third of our SCS genes appear to be essential or linked to essential genes, however the list of essential genes was generated by growing B. subtilis in optimum conditions. Therefore these genes may behave differently in the phosphate-limited conditions or in the mutant strains used in this experiment. For instance, tagA, B, D, and F are listed as essential genes, but are also under the control of phoR which is knocked out in the phoR-null mutant [14], likewise nadE and spoVC are linked with the sigB gene [15, 16], the genes knocked out in the sigBnull mutant. Results from synthetic data set analyses [17] confirm that the algorithm indeed identifies a suitable set of SCS genes on the basis of underlying biological signal rather than chance correlation in the expression data. 3.1. Parameter setting Whilst the authors believe that the SCS normalisation algorithm is generally applicable to any three-dimensional experimental data as shown in Figure 1, it is essential that the values of parameters a and x are selected appropriately for each data set. This can be performed in a straightforward manner by repeating the SCS normalisation procedure with a number of combinations of these parameters and the resulting SCS gene sets evaluated as shown here, with limited computational effort. Alternatively, statistical methods can be applied to investigate the probability behaviour of minimum rank difference (MRD) distribution and to estimate the threshold value of a. 3.2. Noise reduction: totals Figure 8 shows that the variability of the normalised data is much lower than that of the raw MTM density data. The change in total expression over time is biologically more plausible when the data is normalised using this algorithm. The three mutants initially have a lower total gene expression than the wildtype strain. The phoR strain and the double mutant show a similar pattern of total expression over time compared to the wildtype and the sigB strain. This can be expected as the organism does not have the mechanism to specifically cope with phosphate stress in the phoR-null mutant or the double mutant. The wildtype strain shows decrease in total gene expression over time when the data is normalised. It is expected that under phosphate-limited conditions the organism will eventually sporulate and so will downregulate a number of metabolic pathways, hence reducing the overall amount of mRNA in the cells. The sigB strain also shows a decrease in total gene expression, when the data is normalised, up to the last time point, where the total gene expression increases. A biological explanation for this could be related to the hyperinduction of the phoR operon or the onset of sporulation. Upon inspection it transpires that 124 genes are upregulated by at least 3 fold between the last two time points of the SCS-normalised data in the sigB-null mutant. Of these genes 32% are either related to sporulation Comparative and Functional Genomics or involved in reaction pathways that result in the release of phosphate. A further 37% of these up-regulated genes currently have an unknown function. The remaining 31% have varying functions but mainly belonging to functional categories 1 and 2 (see SubtiList [12] for functional category classification). In the three mutant strains, overall gene expression was lower at the outset (as indicated in Figure 8) and confirmed by independent assays [18]. This is clearly not the case with the nonnormalised data, where the total gene expression in the phoR strain and the double mutant is relatively high at the outset (shown by the black markers in Figure 8). 3.3. Differentially expressed genes Although the lack of technical replicates precluded the application of a number of standard statistical tests as reported in the literature (e.g., t-tests and Wilcoxon test), The Park score analysis of these genes clearly shows that SCS normalisation enables better discrimination of the differentially expressed genes in different mutant strains (Figures 9 and 10). 4. CONCLUSIONS A nonparametric normalisation method is proposed for multicondition time series gene expression data. This method is based on a series of comparisons of ranked gene expression contributions on the individual arrays. If the rank position of a gene contribution, to the array total, does not change within specified limits across all the arrays then that gene is included in the self-consistent set (SCS) of genes. The total expression of these genes on each of the arrays is then used to normalise the expression data of the rest of the genes. The algorithm depends upon two user-defined parameters, a, the absolute rank difference limit and, to a lesser degree, x, the proportion of genes excluded. The results of simulated studies using randomly generated synthetic data sets [17] confirmed that the SCS normalisation performs as expected. Current work concentrates on robustness studies of the SCS normalisation in order to assess the sensitivity of the algorithm to experimental data corrupted by known random and systematic noise. Also the application of this method to other gene expression data containing a number of technical replicates, which exhibits the same structure shown in this manuscript, is being investigated. We believe that the proposed normalisation method may be useful in other cases of single colour DNA array analysis with a combination of multiple strains, conditions, and/or time points. The method provides a way of normalising using all the data simultaneously without having to assign a baseline array or using complex statistics that require replicate data. Using this approach will allow us to apply further data analysis techniques with more confidence in the biological plausibility of the results. Therefore the time, money and effort that has been put into producing this data set in the first place will not be entirely lost due to oversight of the importance of technical and biological replication and therefore some useful knowledge may still be gained from the data. N. L. Dawes and J. Glassey 9 Part A-exclude genes with extreme expression Find each gene’s contribution to its array by dividing its expression value by the array total, for each array Calculate the average contribution for each gene based on all strains and time points Sort the genes from lowest to highest average and exclude the top and bottom “X ′′ percent Part B-identify self-consistent set from remaining genes Rank gene contribution values in ascending order on each array For the first time point compare the differences in ranks between all the strains. If the rank difference for any given gene is below parameter “a′′ in all comparisons, then the gene passes the first filter Repeat for all other time points Part C-iterate and normalise data Repeat from part A using data generated in part B until the self-consistent set does not change between two successive iterations Any gene that passes all the filters is placed in the self-consistent set Correct for genes with large contributions to the array by dividing the original data on any given array by the total of the current self-consistent set on that array Normalise data by dividing the original expression values by the total of this self-consistent set on each array Figure 11: Flow Diagram of SCS algorithm. SCS algorithm depicted as a flow diagram showing the three stages of the algorithm. 5. METHODS 5.1. Array dataset Data has been obtained from experiments where both the specific and nonspecific response to phosphate stress has been investigated in a set of isogenic Bacillus subtilis mu- tants over time [18]. The overall aim was to identify regulatory interactions between the σ B -dependent general stress and pho regulons in B. subtilis. Strains with null mutations in the key regulatory genes sigB and phoR were used to investigate the level of interaction between these two regulons. In total, four strains were used: a wildtype strain (strain 168), sigB-null mutant, phoR-null mutant, and a 10 Comparative and Functional Genomics Example of nonparametric scoring algorithm (Park et al. [10]): 2 strain, 3 genes and 4 time points (1) Tabulate data as in the table below, including an extra row for each gene to assign zeros to the expression values of one strain and ones the expression values of the other strain. Gene A Gene B Gene C 10.717 0 0.177 0 1.786 0 Strain 1 sigB-null mutant 10.94 10.593 0 0 0.251 0.246 0 0 0.768 1.057 0 0 6.701 0 0.536 0 0.606 0 3.29 1 17.47 1 1.44 1 Strain 2 phoR-null mutant 3.43 3.48 1 1 34.87 40.47 1 1 0.78 1.13 1 1 2.79 1 30.41 1 0.81 1 (2) Rank the data in ascending order for each gene moving the zeros and ones accordingly Gene A Gene B Gene C 2.79 1 0.177 0 0.606 0 Strain 1 sigB-null mutant 3.29 3.43 1 1 0.246 0.251 0 0 0.768 0.78 0 1 3.48 1 0.536 0 0.81 1 6.701 0 17.47 1 1.057 0 Strain 2 phoR-null mutant 10.593 10.717 0 0 30.41 34.87 1 1 1.13 1.44 1 1 10.94 0 40.47 1 1.786 0 (3) The Park scores are found by determining how many moves it takes to get all the zeros back on he right-hand side. This can be computed using the following equation: Score =   i∈N2 j ∈N1 ⎧ ⎪ ⎨0, Park scores are: Gene A = 16 Gene B = 0 Gene C = 6 h(x) = ⎪ ⎩1,  h x j − xi  if x ≤ 0 if x > 1 Figure 12: Example of Park score calculations. Park score calculations shown for the expression data of three genes in two different strains over four time points. Genes A and B score an extreme Park score as they are expressed more highly at every time point in one of the strains compared to the other strain, whereas Gene C is not and scores a midrange Park score. sigB-null, phoR-null (double) mutant. For a detailed description of the bacterial strains, plasmids, primers, and medium used, see Allenby et al. [19]. Each strain was cultured in phosphate-limiting conditions with typically four samples taken at specified times. These samples were processed and used in transcriptome analysis by hybridising to B. subtilis Panorama gene arrays (Sigma Genosys Biotechnologies Inc., The Woodlands, USA). The procedures of cell harvesting, RNA preparation, synthesis of radioactively labelled cDNA, and hybridisation to the arrays as described by [20] were followed. Arrays were exposed on a Fuji cassette for a predetermined time. After exposure, the cassette was scanned using a Storm phosphorimager to generate both .gel and .tiff image files. These digital images were imported into the software package ArrayVision to generate the data set. 5.2. Use of technical replicates The replicated spots on the arrays were used to identify any poor quality spots. The log2 transform of the data was used and for each array the difference of the two spots was plotted against the average of the two spots. The variability of the differences as a function of average intensity can be modelled using a locally smoothed estimate of the interquartile range. It has been shown that unreliable replicates can be identified by plotting lines representing ±3 × IQR on the MA plot [21]. Any replicate pair falling outside these lines can be deemed to have a replicate difference greater then expected based on their average value. Any gene identified as having poor replicates on any array was flagged and kept out of the self-consistent set during the normalisation algorithm. N. L. Dawes and J. Glassey 11 5.3. SCS algorithm 5.5. Below is a mathematical description of the SCS algorithm, Figure 11 describes the process as a flow diagram. For a (m × n) data set where m is the number of genes (rows) and n is the number of arrays (columns), each element of the data set is gi j , where i = 1 to m and j = 1 to n. For multistrain time series data n = s × t, where s = the number of strains from 1 to S and t = the number of time points from 1 to T. First, the contributions matrix C is generated by dividing each gene expression value by the column total gi ci = m . (1) i=1 gi j Before the SCS normalisation (see detailed description of the algorithm in Section 5.3) was applied to the B. subtilis data set, unreliable replicates were flagged and any genes with an expression value below that of the array background value were also flagged. The genes falling below the background values were taken as having an expression too low to accurately detect and were forced to have an expression value of zero. These flagged genes were prevented from being part of the SCS but were not excluded from the data set at this point. Different values of the absolute rank difference threshold, a, and the exclusion limit, x, were tested and the resulting SCS gene sets recorded and analysed. The average of each row of contributions is calculated and the top and bottom x% is disregarded ACKNOWLEDGMENTS R = mx < rank n j =1 c j n   < (m − mx) , i (2) R is a vector of row numbers left once the top and bottom x% have been excluded. These row numbers are used to generate a new contributions matrix C2 which is a subset of the matrix C. It is from this new (m − 2mx) × n matrix that the initial SCS genes will be identified C2 = (∀R)C2 ⊆ C. (3) For time point t, SCSt =       rank C2(s1) − rank C2(s2) ) < a& t       rank C2(s1) − rank C2(s3)  < a& t        rank C2(s2) − rank C2(s3)  < a . (4) t In (4), shown here there are three strains to consider, as the rank differences are calculated for each possible pairing of strains, the more strains there are, the more terms are needed in the equation. This is carried out for each time point to give SCS1 , SCS2 , SCS3 , . . ., SCST . Then any gene that appears in all the SCSt lists is deemed to be self-consistent across all strains and time points. These genes are then used to normalise the data by dividing each column of data by the sum of the SCS genes in that column gi Ni = (5) gSCS j . The process is then iterated k times by repeating each step from the calculation of the contributions until no change is seen between SCSk and SCSk−1 . 5.4. Differential expression of genes A nonparametric Park score test [10] was also used to assess the differential expression of the same genes in the same two strains using the globally scaled nMTM density data from ArrayVision and SCS-normalised data. Figure 12 shows the basic features of the method. For the Park score analysis, data from the sigB-null mutant was entered as “strain1” and data from phoR-null mutant as “strain2” data. Application to data sets This work is supported by the BBSRC (Studentship no. 01/B1/E/01350). We would also like to thank Dr. Nicholas Allenby, Professor Colin Harwood, and Dr. Zoltan Pragai for the B. subtilis macroarray data and their input to the project and also Professor Alan Ward and Dr. Anil Wipat for their help, advice, and continued support. REFERENCES [1] D. Edwards, “Non-linear normalization and background correction in one-channel cDNA microarray studies,” Bioinformatics, vol. 19, no. 7, pp. 825–833, 2003. [2] K. V. Ballman, D. E. Grill, A. L. Oberg, and T. M. Therneau, “Faster cyclic loess: normalizing RNA arrays via linear models,” Bioinformatics, vol. 20, no. 16, pp. 2778–2786, 2004. [3] J. Quackenbush, “Microarray data normalization and transformation,” Nature Genetics, vol. 32, no. 5, supplement, pp. 496–501, 2002. [4] W. Wu, N. Dave, G. C. Tseng, T. Richards, E. P. Xing, and N. Kaminski, “Camparison of normalization methods for Codelink Bioarray data,” BMC Bioinformatics, vol. 6, p. 309, 2005. [5] M. K. Kerr, M. Martin, and G. A. Churchill, “Analysis of variance for gene expression microarray data,” Journal of Computational Biology, vol. 7, no. 6, pp. 819–837, 2000. [6] Y. Barash, E. Dehan, M. Krupsky, et al., “Comparative analysis of algorithms for signal quantitation from oligonucleotide microarrays,” Bioinformatics, vol. 20, no. 6, pp. 839–846, 2004. [7] B. M. Bolstad, R. A. Irizarry, M. Åstrand, and T. P. Speed, “A comparison of normalization methods for high density oligonucleotide array data based on variance and bias,” Bioinformatics, vol. 19, no. 2, pp. 185–193, 2003. [8] T. Park, S.-G. Yi, S.-H. Kang, S. Lee, Y.-S. Lee, and R. Simon, “Evaluation of normalization methods for microarray data,” BMC Bioinformatics, vol. 4, no. 33, 2003. [9] T. B. Kepler, L. Crosby, and K. T. Morgan, “Normalization and analysis of DNA microarray data by self-consistency and local regression,” Genome Biology, vol. 3, no. 7, 2002, RESEARCH0037. [10] P. J. Park, M. Pagano, and M. Bonetti, “A nonparametric scoring algorithm for identifying informative genes from microarray data,” in Proceedings of the 6th Pacific Symposium on Biocomputing, pp. 52–63, Lihue, Hawaii, USA, January 2001. 12 [11] Y. H. Yang, S. Dudoit, P. Luu, and T. P. Speed, “Normalization for cDNA microarray data,” in Microarrays: Optical Technologies and Informatics, vol. 4266 of Proceedings of SPIE, pp. 141–152, San Jose, Calif, USA, January 2001. [12] SubtiList Web Server, http://genolist.pasteur.fr/SubtiList/. [13] K. Kobayashi, S. D. Ehrlich, A. Albertini, et al., “Essential Bacillus subtilis genes,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 8, pp. 4678–4683, 2003. [14] Z. Prágai, N. E. E. Allenby, N. O’Connor, et al., “Transcriptional regulation of the phoPR operon in Bacillus subtilis,” Journal of Bacteriology, vol. 186, no. 4, pp. 1182–1190, 2004. [15] M. B. Méndez, L. M. Orsaria, V. Philippe, M. E. Pedrido, and R. R. Grau, “Novel roles of the master transcription factors Spo0A and σ B for survival and sporulation of Bacillus subtilis at low growth temperature,” Journal of Bacteriology, vol. 186, no. 4, pp. 989–1000, 2004. [16] H. Antelmann, R. Schmid, and M. Hecker, “The NAD synthetase NadE (OutB) of Bacillus subtilis is a σ B -dependent general stress protein,” FEMS Microbiology Letters, vol. 153, no. 2, pp. 405–409, 1997. [17] N. L. Dawes, Normalisation and analysis of genome expression data, Ph.D. thesis, CEAM, Newcastle University, Newcastle, UK, 2006. [18] Z. Prágai and C. R. Harwood, “Regulatory interactions between the Pho and σ B -dependent general stress regulons of Bacillus subtilis,” Microbiology, vol. 148, no. 5, pp. 1593–1602, 2002. [19] N. E. E. Allenby, N. O’Connor, Z. Prágai, A. C. Ward, A. Wipat, and C. R. Harwood, “Genome-wide transcriptional analysis of the phosphate starvation stimulon of Bacillus subtilis,” Journal of Bacteriology, vol. 187, no. 23, pp. 8063–8080, 2005. [20] C. Eymann, G. Homuth, C. Scharf, and M. Hecker, “Bacillus subtilis functional genomics: global characterization of the stringent response by proteome and transcriptome analysis,” Journal of Bacteriology, vol. 184, no. 9, pp. 2500–2520, 2002. [21] K. R. Hess, W. Zhang, K. A. Baggerly, D. N. Stivers, and K. R. Coombes, “Microarrays: handling the deluge of data and extracting reliable information,” Trends in Biotechnology, vol. 19, no. 11, pp. 463–468, 2001. Comparative and Functional Genomics International Journal of Peptides BioMed Research International Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 Advances in Stem Cells International Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 Virolog y Hindawi Publishing Corporation http://www.hindawi.com International Journal of Genomics Volume 2014 Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 Journal of Nucleic Acids Zoology International Journal of Hindawi Publishing Corporation http://www.hindawi.com Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 Volume 2014 Submit your manuscripts at http://www.hindawi.com The Scientific World Journal Journal of Signal Transduction Hindawi Publishing Corporation http://www.hindawi.com Genetics Research International Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 Anatomy Research International Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 Enzyme Research Archaea Hindawi Publishing Corporation http://www.hindawi.com Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 Volume 2014 Hindawi Publishing Corporation http://www.hindawi.com Biochemistry Research International International Journal of Microbiology Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 International Journal of Evolutionary Biology Volume 2014 Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 Molecular Biology International Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 Advances in Bioinformatics Hindawi Publishing Corporation http://www.hindawi.com Volume 2014 Journal of Marine Biology Volume 2014 Hindawi Publishing Corporation http://www.hindawi.com Volume 2014
Keep reading this paper — and 50 million others — with a free Academia account
Used by leading Academics
Fezal Ozdemir
Ege University
Monica Ballarino
Università degli Studi "La Sapienza" di Roma
Hikmet Budak
University of Nebraska Lincoln
Grum Gebreyesus
Aarhus University