1471-2105-7-270.pdf

Yvonne Li

BMC Bioinformatics BioMed Central Open Access Methodology article Improving the specificity of high-throughput ortholog prediction Debra L Fulton†1,2, Yvonne Y Li†1,3, Matthew R Laird1, Benjamin GS Horsman1, Fiona M Roche1 and Fiona SL Brinkman*1 Address: 1Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, BC, Canada, 2Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada and 3Canada's Michael Smith Genome Sciences Centre, 570 W. 7th Avenue, Vancouver, BC, Canada Email: Debra L Fulton - debra@cmmt.ubc.ca; Yvonne Y Li - yli@bcgsc.ca; Matthew R Laird - lairdm@sfu.ca; Benjamin GS Horsman - bhorsman@sfu.ca; Fiona M Roche - fiona_roche@sfu.ca; Fiona SL Brinkman* - brinkman@sfu.ca * Corresponding author †Equal contributors Published: 28 May 2006 BMC Bioinformatics 2006, 7:270 doi:10.1186/1471-2105-7-270 Received: 03 October 2005 Accepted: 28 May 2006 This article is available from: http://www.biomedcentral.com/1471-2105/7/270 © 2006 Fulton et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background: Orthologs (genes that have diverged after a speciation event) tend to have similar function, and so their prediction has become an important component of comparative genomics and genome annotation. The gold standard phylogenetic analysis approach of comparing available organismal phylogeny to gene phylogeny is not easily automated for genome-wide analysis; therefore, ortholog prediction for large genome-scale datasets is typically performed using a reciprocal-best-BLAST-hits (RBH) approach. One problem with RBH is that it will incorrectly predict a paralog as an ortholog when incomplete genome sequences or gene loss is involved. In addition, there is an increasing interest in identifying orthologs most likely to have retained similar function. Results: To address these issues, we present here a high-throughput computational method named Ortholuge that further evaluates previously predicted orthologs (including those predicted using an RBH-based approach) – identifying which orthologs most closely reflect species divergence and may more likely have similar function. Ortholuge analyzes phylogenetic distance ratios involving two comparison species and an outgroup species, noting cases where relative gene divergence is atypical. It also identifies some cases of gene duplication after species divergence. Through simulations of incomplete genome data/gene loss, we show that the vast majority of genes falsely predicted as orthologs by an RBH-based method can be identified. Ortholuge was then used to estimate the number of false-positives (predominantly paralogs) in selected RBH-predicted ortholog datasets, identifying approximately 10% paralogs in a eukaryotic data set (mouse-rat comparison) and 5% in a bacterial data set (Pseudomonas putida – Pseudomonas syringae species comparison). Higher quality (more precise) datasets of orthologs, which we term "ssd-orthologs" (supporting-species-divergence-orthologs), were also constructed. These datasets, as well as Ortholuge software that may be used to characterize other species' datasets, are available at http:/ /www.pathogenomics.ca/ortholuge/ (software under GNU General Public License). Conclusion: The Ortholuge method reported here appears to significantly improve the specificity (precision) of high-throughput ortholog prediction for both bacterial and eukaryotic species. This method, and its associated software, will aid those performing various comparative genomics-based analyses, such as the prediction of conserved regulatory elements upstream of orthologous genes. Page 1 of 16 (page number not for citation purposes) BMC Bioinformatics 2006, 7:270 Figure An log example as an1ortholog of how RBH analysis may falsely identify a paraAn example of how RBH analysis may falsely identify a paralog as an ortholog. Illustrated is a hypothetical species tree and gene tree for the human, cattle, and mouse species, where human and cattle orthologs (unshaded genes) are being identified. If the true cattle ortholog has not yet been sequenced because of an incomplete bovine genome project, it will not be present in the gene dataset used for analysis (cattle gene crossed out with an X), and the best reciprocal BLAST hit for the human gene will be a cattle paralog (shaded gene). However, Ortholuge will detect this case as a potential paralog, because it examines the relative phylogenetic distance between genes and identifies how well their relative distances match expected species divergence. Background Ortholog prediction is an important facet of comparative genomics and is frequently used in genome annotation, gene function characterization, evolutionary genomics, and in the identification of conserved regulatory elements. As the number of genome sequences grow, comparative genomics has become increasingly relevant. Errors in ortholog prediction can greatly affect such studies and associated downstream analyses (including functional genomics and proteomics analyses), so there has been increasing interest in high quality ortholog prediction. Orthologs are commonly defined as genes that have diverged after a speciation event [1], whereas genes that have diverged after a gene duplication event, either before a speciation event (out-paralogs) or after a speciation event (in-paralogs), are collectively known as paralogs. It has been found that orthologs tend to have similar function and so their utility in comparative analyses is paramount. Classically, orthologous genes are identified by phylogenetic analysis. A phylogenetic tree for the genes is compared against a reference species tree, with the notion that the gene tree of orthologs should be similar to the species tree. However, sophisticated phylogenetic analysis is not easily automated, due in part to the complexity of both manual sequence alignment editing and choice of appropriate genes and species to be included in an analysis. http://www.biomedcentral.com/1471-2105/7/270 Whole-genome analyses indicate that many gene families (essentially paralogs) were formed before the divergence of most species commonly being compared in a comparative genomics analysis (out-paralogs). Therefore, orthologs – which diverged due to speciation – are typically more similar to each other than to other genes in the genome. This is why sequence similarity is often used to infer gene orthology between two or more species, and is also the premise behind the most common high-throughput ortholog prediction method used today: the reciprocal-best-BLAST-hits (RBH) analysis [2]. With the RBH method, genes from species A and species B are predicted to be orthologs if they are both the "best BLAST hit" of the other, when all genes from species A are compared to all genes from species B by BLAST analysis. There are numerous resources and methods that use a version of RBH as part of their ortholog prediction process, including the Clusters of Orthologous Groups (COG) database [3,4], The Institute for Genomic Research (TIGR)'s EGO database [5], and INPARANOID [6,7]. However, if a gene is not present in one organism's gene dataset, perhaps due to incomplete genome sequence data or gene loss in the organism, the RBH method will incorrectly predict a paralog as an ortholog (Fig. 1). Today, comparative genomics is often being performed using incomplete genomes, especially for large eukaryotic genome sequencing projects. Also, gene loss is a major driving force behind bacterial evolution [8]. It is therefore important to recognize that many of the current ortholog databases will likely contain false-positives due to the limitations of the RBH approach. For comparative analyses, it is also frequently desirable to identify orthologs that most likely have similar function. In some cases, an ortholog may diverge more rapidly in sequence (and function) in one organism/species versus another related organism/species. In addition, a gene duplication may occur in one species, but not a second species, after species divergence. In this case either one – or both – of the duplicated genes (in-paralogs) are more likely to have diverged in function [9]. We therefore propose to differentiate such orthologs (reflecting what has sometimes been referred to as "many-to-many" ortholog relationships) from those that appear to have diverged only due to a speciation event. We also wish to identify those orthologs that have diverged to a degree that is similar to that expected for its species, since those orthologs that have undergone unusually rapid divergence in one species, relative to another, may have also diverged more in function. We therefore propose the term ssd-orthologs (for "supporting-species-divergence" orthologs) to define orthologs that appear to have diverged only due to speciation – and have diverged to the same relative degree as their species. These ssd-orthologs are more likely to have retained similar function, and would better suit the pur- Page 2 of 16 (page number not for citation purposes) BMC Bioinformatics 2006, 7:270 http://www.biomedcentral.com/1471-2105/7/270 are likely ssd-orthologs and which are likely paralogs or other non-ssd-orthologs. The pipeline requires tentative ortholog predictions (and the associated gene/protein sequences) for large gene datasets from three species, two of which are the species to be compared, and one of which is an outgroup species. All phylogenetic distances between the genes/proteins in an ortholog group are computed for each group in the input list. Ratios of these distances are used to evaluate ortholog quality. We find that these ratios show certain consistencies over several sets of eukaryotic and bacterial orthologs, along with data sets introduced with true-negatives for comparison. This permitted the formulation of ratio cut-offs for retaining ssd-orthologs and removing probable paralogs, which resulted in a higher quality data set of orthologs. Overall, we demonstrate that the relative evolutionary relationships may be used to support the prediction of orthologs. In addition, noting those orthologs with prominent differences (such as recent gene duplications after species divergence) may help refine analyses to permit the identification of those orthologs that most likely retain the same function. Results An overview of the Ortholuge approach for increasing the specificity of ortholog predictions is outlined in Figure 2. Based on the analyses described below, the details of this approach were formulated and the approach validated using both prokaryotic and eukaryotic data sets. Ortholuge software is available [28] to assist with the analysis of data sets other than those reported here. Figure An overview 2 of the Ortholuge method An overview of the Ortholuge method. (A) Flow-chart outlining the main steps of the method. (B) The three ratios computed by Ortholuge. The phylogenetic distances in the numerator (dark line) and denominator (dashed line) for each ratio is shown, overlaid on the phylogenetic tree (gray line) that relates the ingroups and outgroup. Note that the three ratios are related such that Ratio2 = Ratio1 × Ratio3. Therefore, ratio data is presented both in terms of frequency histograms for all three ratios (see Fig. 4) and also as Ratio1 × Ratio2 plots (see Fig. 5) for just two of the three ratios – the latter is simply another way to conveniently visualize the data. poses of many comparative analyses. To avoid the confusion that may stem from the association of the term "many-to-many orthologs" with in-paralogs, we will use the term paralogs in this text to refer to out-paralogs and specify in-paralogs, when applicable. To address these issues, we have developed a method we call Ortholuge. Ortholuge is a high-throughput analysis pipeline that evaluates previously predicted orthologs (such as RBH-predicted orthologs on a genome-wide scale) and generates predictions regarding which of these Data sets exhibited little bias due the automated sequence alignment trimming approach We investigated the behaviour and utility of Ortholuge through analysis of diverse eukaryotic and bacterial RBHderived datasets. For the initial test eukaryotic data set, we chose predicted mouse-rat-human orthologs from the expressed sequence tag (EST) data in TIGR's Eukaryotic Gene Ortholog (EGO) database [5] (for a mouse-rat comparison, with human as the outgroup). The majority of our subsequent analyses utilized the higher quality MGDbased dataset (see Methods describing datasets) and the RefSeq-based RBH dataset composed of these same species, as indicated. For the bacterial data set, we chose three gamma-proteobacteria: Escherichia coli, Pseudomonas putida, and Pseudomonas syringae (a Pseudomonas species comparison, with E. coli as the outgroup). Orthologs between these three species (and other sets of species subsequently examined) were predicted using a transitive RBH approach, applied to the deduced proteins from complete genome sequences [10-12]. Accurate sequence alignment is critical for phylogenetic analysis; thus, we wished to improve the automated alignment and trimming components of the Ortholuge Page 3 of 16 (page number not for citation purposes) BMC Bioinformatics 2006, 7:270 characteristics Ratio Figure 1 (R1) 3 ratio distribution curves for selected alignment Ratio 1 (R1) ratio distribution curves for selected alignment characteristics. Higher quality mouse-rat-human ortholog sequence sets were analyzed to devise the gap-masking and sequence trimming approaches. These methods were evaluated for the introduction of ratio distribution biases for selected alignment characteristics such as identity and gap length. Ratio distribution curves were plotted for several characteristics. No obvious bias was observed through the introduction of our gap masking approach or alignment trimming. method. We therefore performed a comprehensive examination of biases in our automated alignment editing process (see Methods). A sample of RBH-predicted ortholog sequence sets was analyzed to devise the gapmasking and sequence trimming approaches. The sequence sets were examined to identify both gaps introduced by misalignments and gaps introduced through sequence insertions and deletions. Our observations suggested that some of the noise introduced through the misalignment may be alleviated through the removal of the gapped-segment flanking portions. We also noted that there was no appreciable effect on the sequence distances when the flanking sequences around the sequence-variation gapped regions were removed. We manually introduced gap-masking simulations over the sequences using various window length criteria to establish a gap-masking approach with a relatively conservative worst-case scenario. Both the trimming and gap-masking methods were evaluated for the introduction of ratio distribution biases by selected alignment characteristics. No obvious bias was observed through the introduction of our gap masking approach or alignment trimming (Fig. 3). Ortholuge produces ratios which form distributions Ortholuge was designed with the purpose of overcoming certain limitations of the RBH method, such as the problem illustrated in Figure 1. Ortholuge overcomes this problem by using ratios of phylogenetic distances http://www.biomedcentral.com/1471-2105/7/270 between genes to evaluate orthology, and using an outgroup species as a reference for two ingroup species being compared (Fig. 2). For these three species, the distances for the "ortholog triple" are calculated and the three possible ratios that can be generated are calculated (Fig. 2). With this approach, the problem illustrated in Figure 1 would be detected because the human-cattle distance is unexpectedly larger than the human-mouse distance – impacting on ratio values. We ran Ortholuge on three mouse-rat-human datasets: two sets of RBH-predicted orthologs – one based on EGO data and the other based on RefSeq data – and a third high-quality curated set. For all datasets, human was the outgroup used to help predict more precise orthologs between mouse (ingroup1) and rat (ingroup2). The resulting Ortholuge phylogenetic distance ratios are shown in Figure 4 and Supplemental Figure 3 as histograms. For each of the three ratios, we tabulated the frequency of putative orthologous groups within certain ratio value ranges. Ratio1, Ratio2, and Ratio3 each form clear distributions. Ratio3 is generally located around a ratio value of 1, which is expected if the chosen outgroup is more distant relative to the ingroups. It is centered to the left or right of 1 depending on which of the two ingroups is closer to the outgroup. The Ratio1 and Ratio2 distributions are generally located at a ratio much lower than 1, reflecting the closer relationship between the ingroup species versus any ingroup to the outgroup. We ran our analyses on both protein and nucleotide sequences and found that for closely related species Figure 4ratios Histogram putative) distance orthologous illustrating the groups distribution across the of three RBH-predicted Ortholuge(i.e. Histogram illustrating the distribution of RBH-predicted (i.e. putative) orthologous groups across the three Ortholuge distance ratios. The results for predicted mouse-rat-human RBH ortholog sets (EGO RBH data set; 19,200 ortholog groups) are shown. Each of the three ratios forms their own distribution: Ratio1 and Ratio2 are generally located at ratio values lower than 1 and Ratio3 is generally located about a ratio value of 1, reflecting the relative distances between ingroups and between each ingroup and the outgroup. A similar ratio analysis was performed on a RefSeq RBH dataset (see Figure 3 of [Additional file 1]). Page 4 of 16 (page number not for citation purposes) BMC Bioinformatics 2006, 7:270 http://www.biomedcentral.com/1471-2105/7/270 the divergence observed for most genes (i.e. the highest frequency ranges). Figure 5group Ortholuge eukaryotic ortholog R1 data, × R2 where plots each (Ratio1 pointversus represents Ratio2) one forputative selected Ortholuge R1 × R2 plots (Ratio1 versus Ratio2) for selected eukaryotic data, where each point represents one putative ortholog group. (A) Putative orthologous groups identified using RBH for mouse-rat-human (Figure 4 shows the corresponding histogram). (B) Putative orthologs groups for mouse-rat-human from a higher quality (more precise) dataset (see Methods). It is expected that this more precise data set comprises primarily true orthologs. (C) A lower quality data set of RBH-predicted orthologous groups for cattlehuman-mouse, where cattle genes have been identified from an incomplete genome sequence. (D), (E), (F) are zoomed-in versions of (A), (B), (C), respectively, with axes shown from 0 to 2 instead of 0 to 30. Note that most orthologous groups exhibit low Ratio1 and Ratio2 values, in all three data sets. For example, in panels A and D, about 86% of orthologs have Ratio1 and Ratio2 values less than 1. However, the higher quality data set (panels B and E) contains fewer points at higher Ratio values versus the RBH-predicted data set. The lower quality data set contains more points with very high Ratio2 values (i.e. only 73% of points have Ratio1 and Ratio2 values less than 1), potentially reflecting the increased occurrence of probable cattle paralogs (i.e. paralogs being misidentified as orthologs by an RBH-analysis with an incomplete cattle genome). such as these, nucleotide sequences provide a better ratio distribution resolution. However, the overall ratio distributions are similar, even when using different methods of initial ortholog detection (see Figure 4 of [Additional file 1]). We also performed this analysis with our bacterial P. putida-P. syringae-E. coli orthologs, comparing P. putida (ingroup1) and P. syringae (ingroup2) using E. coli as the outgroup. We observed very similar results: Both the eukaryotic and prokaryotic data sets are consistent in the distributions formed, and in the approximate position of the distributions. Since we expected most ssd-orthologs (see Introduction for definition) to evolve in a similar manner, we hypothesized that orthologs falling within the higher frequency ranges of the distributions are more likely to be ssd-orthologs compared to those that are outliers. In essence, what is defining the species divergence is Ortholuge ratios can also be conveniently visualized in an R1 × R2 plot Instead of histograms (Fig. 4), an alternative way to represent Ortholuge ratios is to use a 2-dimensional plot of two Ortholuge ratios, where each putative ortholog group is represented by one point in the graph. In principal, any two of the three ratios can be used for the plot, since the three ratios are related. That is, Ratio3 equals Ratio2 divided by Ratio1. Through subsequent analyses, we found that the Ratio1 and Ratio2 combination (i.e. an R1 × R2 plot) was the simplest to visualize and to work with. For the R1 × R2 plots, the eukaryotic mouse-rat-human RBH-predicted putative orthologous groups appear to occupy three types of positions (Fig. 5A and 5D). (1) The majority of points form a cluster (highest frequency range) at low Ratio1 and Ratio2 values. In fact, about 85% of orthologs have Ratio1 and Ratio2 values less than 1. (2) Some points with higher Ratio1 values are located along a curve that approaches, and then falls along, the line equation Ratio2 = 1. This is consistent with an unusually high divergence of a gene from ingroup 2. (3) Conversely, some points with higher Ratio2 values are located along a line that is roughly around line equation Ratio1 = 1. This is consistent with an unusually high divergence for a gene from ingroup 1. The RBH-predicted orthologous groups for P. putida-P. syringae-E. coli species show a similar R1 × R2 plot (Fig. 6A and 6D). Consistent with the eukaryotic results, the vast majority of orthologous groups for this prokaryotic analysis also exhibit Ratio1 and Ratio2 values less than 1. We expected most ssd-orthologs to evolve in a similar manner, and found that most orthologous groups form a cluster (high frequency range) in an R1 × R2 plot. Therefore, we hypothesized that orthologous groups falling within the high frequency range are more likely to contain ssd-orthologs. Conversely, those outside of this range (i.e. high Ratio1 or Ratio2 values) are more likely to contain, in an ingroup, either an ortholog that has undergone unusual divergence, or a paralog. "Higher quality" orthologous groups are found primarily in "low" Ortholuge ratio ranges, in R1 × R2 plots The data sets of tentative orthologs predicted above by an RBH approach will certainly contain genes that are being falsely identified as orthologs. It is difficult, if not impossible, to obtain a dataset of this size that contains only true orthologs, due to the inherent nature of inference associated with evolutionary study. However, data sets of "higher" and "lower" quality can be constructed and examined (see Methods), to observe how their Ortholuge Page 5 of 16 (page number not for citation purposes) BMC Bioinformatics 2006, 7:270 http://www.biomedcentral.com/1471-2105/7/270 EGO database (with mouse as the outgroup). The incomplete state of the bovine genome data at the time of this analysis should lead to more falsely predicted orthologs, since some true orthologs will be missing from the bovine dataset (see Fig. 1 for a scenario). These results are shown in Figure 5C and 5F. Note the higher number of points with a high Ratio2 value, falling along the line equation Ratio1 = 1; these points are consistent with how the ratio would behave if the bovine data contained paralogs that were notably more divergent than expected for most orthologs. Figure Ortholuge two ortholog 6 R1 data × R2sets plotsand forathe true-negative prokaryoticdata data, setillustrating Ortholuge R1 × R2 plots for the prokaryotic data, illustrating two ortholog data sets and a true-negative data set. (A) Putative orthologous groups from an RBH-predicted data set. (B) Probable true orthologs from a higher quality (more precise) data set. (C) True-negative orthologs (i.e. true paralogs) from the "gene-loss simulation" data set. Darker dots represent putative orthologous groups which have had an ingroup1 true-negative (paralog) introduced into the group. Lighter dots represent putative orthologous groups which have had an ingroup2 true-negative (paralog) introduced into the group. (D), (E), (F) are zoomed-in versions of (A), (B), (C), respectively, with axes shown from 0 to 2 instead of 0 to 10. Most putative ortholog groups (particularly for the high quality data set) exhibit low Ratio1 and Ratio2 values (for example, all values are less than 1 for the points in the high quality data set plot), whereas most true-negative groups exhibit higher Ratio1 and Ratio2 values (i.e. only 9% of ingroup1 true negative introductions, and 6% of ingroup2 true negative introductions, have points with Ratio1 and Ratio2 values less than 1). ratios change in comparison to each other. These data sets should contain a notably greater or smaller proportion of true orthologs, respectively. We therefore examined the behaviour of Ortholuge ratios for a higher quality data set of probable orthologs. Curated orthologs between human, mouse, and rat genomes were acquired from the Mouse Genome Database (MGD). Figure 5B and 5E illustrate that this higher quality data set occupies a smaller area of the R1 × R2 plot. This smaller area is observed, even when the number of points is normalized with the number plotted for the RBH-based data (data not shown). For this higher quality (more precise) data set there are notably fewer points along the Ratio1 = 1 line equation and the Ratio2 = 1 line in the plot, compared to the RBH-based data plot in Figure 5A and 5D. Conversely, we examined the ratios associated with a "lower quality" data set, involving RBH-predicted orthologs for bovine, human, and mouse, from TIGR's To gain a sense of the differences in plots of different quality datasets, note that below Ratio1 and Ratio2 values of 1, there lies 97% of high quality dataset points (Fig. 5B), 86% of RBH-predicted ortholog group points (Fig. 5A), and only 73% of the low quality data set points (Fig. 5C). These results suggest that true orthologs (or at least more precise ortholog data sets) tend to fall within the bulk of the highest frequency range (i.e. relatively "low" Ratio values in an R1 × R2 plot), while orthologs with unusual divergence patterns (non-ssd-orthologs) and paralogs have either high Ratio1 or high Ratio2 values. For the prokaryotic analysis, a higher quality data set was compared to the RBH-based data set as well. Figure 6A and 6B illustrate the same trend as the eukaryotic data, with respect to how the R1 × R2 plots look for more precise and less precise ortholog data sets. Known paralogs (true-negatives) introduced into orthologous groups generate either high Ratio1 or high Ratio2 values, as shown in a gene loss/incomplete genome simulation The above comparisons of higher quality (more precise) and lower quality (less precise) ortholog data sets support our hypothesis that orthologs and paralogs fall within different regions of the R1 × R2 plot. However, a stronger argument can be made by examining specifically where falsely predicted orthologs (true paralogs) occur in such distributions. A true-negative data set was therefore constructed by removing genes from one of the ingroup gene data sets and then identifying the next best reciprocal BLAST hit with the other ingroup (ensuring transitivity of this introduction with the other ingroup and outgroup). Therefore a true negative is essentially an ortholog triple which has been transformed into a false positive by introducing a less similar sequence for one of the species sequences. These true negatives represent the types of ortholog predictions that would result from an RBHmethod in scenarios such as Figure 1. Since we know that RBH can make incorrect predictions when a genome is incomplete or when gene loss has occurred, this analysis simulates what would occur with the RBH method in such cases. The benefit of this analysis is that we specifically Page 6 of 16 (page number not for citation purposes) BMC Bioinformatics 2006, 7:270 http://www.biomedcentral.com/1471-2105/7/270 orthologs and should still be falling within the main cluster of true-orthologs, as we observe. In other words, since the goal of Ortholuge is to improve ortholog identification between the two ingroups, it is beneficial that an outgroup paralog does not generally interfere with/affect the analysis. Figure R1 of negatives) introducing × R27plots, in the outgroup foranalysis the prokaryotic paralogs (outgroup data, illustrating ortholog thetrueeffect R1 × R2 plots, for the prokaryotic data, illustrating the effect of introducing outgroup paralogs (outgroup ortholog truenegatives) in the analysis. Unlike for other figures of R1 × R2 plots in the paper, only ratio ranges from 0 to 2 are shown for each axis. (A) RBH-predicted orthologous groups. (B) Outgroup paralogs from a true-negative data set where all possible outgroups were replaced with next best RBH paralogs. They cannot be well distinguished from other orthologs, however, this is actually promising, since Ortholuge is in essence identifying orthologs between the ingroups only. This analysis shows that an outgroup paralog does not interfere greatly with the identification of true orthologs shared between the ingroups. know the true-negatives introduced, allowing us to examine how the Ortholuge ratios for these true-negatives (paralogs) behave. For the E. coli-P. putida-P. syringae input ortholog groups, we constructed two true-negative data sets. In the first, we replaced P. putida genes with their next best RBH hit to P. syringae, resulting in ingroup1 paralogs. In the second, we replaced P. syringae genes with their next best RBH hit to P. putida, resulting in ingroup2 paralogs. For both, we conservatively introduced all possible paralogs into the analysis, resulting in roughly 50% of the genes converted to true-negatives (i.e. conservative, because most data sets would never contain this many true-negatives). The results from these two data sets (Fig. 6C and 6F), show that these true-negatives overlap very little with the RBHpredicted orthologs (Fig. 6A) or with the high quality (more precise) orthologs (Fig. 6B). This demonstrates that even with all possible true paralogs simulated, very few of them are falling within the higher frequency ranges of the RBH distributions. We also constructed a third true-negative data set with all outgroup genes (E. coli) replaced by their next best RBH hit to both P. syringae and P. putida. The R1 × R2 plot (Figure 7) shows that these true-negative cases plot at lower Ratio1 and Ratio2 values and do not separate well from what would be expected for true-orthologs. This is actually promising, since in the case of a paralog in an outgroup, the two ingroups should still be regarded as probable true Ortholuge ratio cut-offs, to separate orthologs from paralogs, can be determined based on an iterative-truenegative analysis After determining that the introduced true-negatives almost never fall within certain ratio ranges, it became clear that ratio cut-offs could be derived to exclude most true-negatives, and thus improve the specificity (precision) of ortholog prediction. To do this, another strategy was employed to simulate the introduction of paralogs (true-negative ortholog predictions) and then formulate ortholog identification cut-offs. This second strategy, involving an iterative-true-negative analysis, allows one to view the variance in proportion of true-negatives in a particular ratio range, and is also amenable to high throughput use for the formulation of cut-offs. For both the eukaryotic (human-mouse-rat) RBH-predicted data set (RefSeq-based), and the prokaryotic RBH-predicted data set, we conservatively modeled an incomplete genome (or gene loss) scenario by randomly replacing 25% of the genes in the RBH-predicted data set with the "next best RBH" hit (i.e. a true-negative). This randomized introduction of true-negatives was iterated at least 50 times, and each iteration was evaluated by Ortholuge. The proportion of true-negative orthologs was averaged over all iterations and the standard deviation determined. We found that that once again, the ratio values of true-negative orthologs do not overlap well with those of the bulk of RBH-predicted orthologous groups (Figure 8 and Supplemental Figures 1 and 2). For both the prokaryotic and eukaryotic RBH-based data sets, this iterative true-negative analysis was used to determine ratio ranges where true paralogs were very unlikely to land and ranges where they were very likely to land. The borders of these ranges (described in Figure 8 and Supplemental Figures 1 and 2) became the ratio cut-off values. This permitted classification of the RBH-predicted tentative orthologous groups into probable ssd-orthologs, probable paralogs, or "uncertain" categories. It should be noted that a more accurate name for the 'probable paralog' category might be 'probable non-ssd-ortholog,' because there may be true orthologs that have undergone unusual divergence in one ingroup species within this category. However, in such cases the non-ssd-orthologs may have functionally diverged, and therefore are cases that we would want to differentiate from our ssd-ortholog set. Regardless, for ease of comprehension, we propose to call those cases with very atypical ratios (in the range of what Page 7 of 16 (page number not for citation purposes) BMC Bioinformatics 2006, 7:270 http://www.biomedcentral.com/1471-2105/7/270 is observed for paralogs) "probable paralogs", since paralogs likely predominate in this region. Figure Example orthologs negative sets of true-negatives) 8analysis ofand theprobable generation (i.e. based paralogs, of oncut-offs an based introduction foronclassification an iterative-trueof random of ssdExample of the generation of cut-offs for classification of ssdorthologs and probable paralogs, based on an iterative-truenegative analysis (i.e. based on an introduction of random sets of true-negatives). The particular analysis illustrated here is a Ratio1 analysis for the mouse, rat, human RefSeq RBH dataset, with true-negatives introduced into the mouse (ingroup1) set. In panel A, the number of putative orthologous groups in each ratio range for the true-negative-transformed data set is shown for the whole data set (light shaded bars) and for just the introduced true-negatives only (dark shaded bars). Note how the distribution of the data set differs from that of the true negatives (i.e. introduced paralogs). In panel B, the proportion of randomly introduced true-negatives at 0.5 ratio range intervals is used to formulate cut-offs (denoted by dashed lines) for classifying ssd-orthologs and probable paralogs for the analysis. For the ssd-orthologs cutoff (left-most dashed line), no more than 10% true negatives in a given ratio range are permitted for the ssd-orthologs range. For the probable paralogs cut-off (right-most dashed line) the proportion of true negatives is at or above 50 percent. The resulting middle region bounded by these two cutoff points establishes the "uncertain" orthology class ratio range. Dashed-lines denoting these particular cut-offs are also illustrated on the figure in Panel A for reference. This approach for a true-negative analysis and cut-off generation is also performed for Ratio2 [Additional file 1] and the combination of cut-offs for Ratio1 and Ratio2 are used to classify putative orthologous groups from another data set (such as an RBH-predicted data set) into the three classification levels of "probable ssd-ortholog", "uncertain" and "probable paralogs". Panel C schematically shows the areas of an R1 × R2 that would be classified in this way, with the cut-off numbers in this particular example matching the RefSeq RBH-based mouse-rat-human analysis (see Table 2 for how these ranges are numerically determined). We chose a 25% true-negative introduction, since this is likely above a worst-case scenario in terms of the number of genes that may be missing in an incomplete genome, or most cases of naturally occurring gene loss. We felt it was important to "saturate" the data set with true-negatives, because any given RBH-based dataset will likely contain some proportion of false-positives in the putative orthologous groups (i.e. it is difficult to ensure one has a completely true-positive set of orthologs). Therefore, to effectively identify the ranges where true-negatives were becoming increasingly more common we needed to observe a large proportion of true-negatives. However, we did not want to transform a data set with all possible truenegatives, as this would not provide a sense of the variation in proportion of true-negatives within a given ratio range. Note that we also chose to report the results here for a transformation of an RBH-predicted data set with the true-negatives (i.e. a RefSeq-based RBH analysis), rather than a transforming a high quality dataset, since the RefSeq based analysis could be more easily fully automated (i.e. it did not require developing a curated set of high quality orthologs). However, transformation of a eukaryotic high quality dataset with true-negatives generated similar cut-off values (data not shown). Through an iterative sampling approach we were able to generate standard deviations of the proportion of true-negatives in a given ratio range (Figure 8B), providing a clearer picture of the likelihood of a true-negative occurring in that range. Ortholuge ratios in combination can help predict which gene in a given putative orthologous group is likely a paralog A closer inspection of the Ortholuge ratios shows that they behave in a predictable fashion when the ortholog group contains one or more false-positives (Table 1). For example, if ingroup1 is actually a paralog, then the distance between ingroup1-outgroup and the distance between ingroup1-ingroup2 would be larger than the norm for an ssd-ortholog. This would cause Ratio2 to increase (the degree of increase would depend on how diverged the paralog is from the missing 'true' ortholog), and Ratio1 to increase a slighter amount (depending on how distant the outgroup is). Conversely, if ingroup2 is actually the paralog, then Ratio1 would be expected to increase and Ratio2 to increase slightly. These predictable changes do indeed occur, as illustrated by an analysis of true-negatives (Figure 6C and 6F), an analysis of a dataset of tentative orthologs identified by RBH using an incomplete genome (Figure 5C and 5F), and an additional manual review of selected cases (data not shown). We propose that when unusual ratio ranges are identified for a given orthologous group, the relative changes can facilitate pre- Page 8 of 16 (page number not for citation purposes) BMC Bioinformatics 2006, 7:270 http://www.biomedcentral.com/1471-2105/7/270 Table 1: Ortholuge-ratios can help predict which gene in a given putative orthologous group is likely a paralogaa. Ratio1 Ka K ↓ K or Kc ↓ variable d Ratio2 K Ka Ratio3 K ↓ K or Kc variable d ↓ K Probable Paralog Ingroup1 paralog ↓ Ingroup2 paralog variable d Outgroup paralogb Ingroup1 & Ingroup2 paralogs Ingroup1 & Outgroup paralogs ↓ Ingroup2 & Outgroup paralogs a Only selected scenarios are listed. Arrows indicate relative increases or decreases in a ratio value, when compared to the highest frequency values in a histogram plot (i.e. "expected" ratio value). Smaller arrows indicate that the increase is less. In the case of the ingroup1 or ingroup2 paralog scenarios, it will depend on how divergent the paralog is and how distant the outgroup is. b Note that an outgroup paralog cannot be discriminated from cases of orthologs, nor does this analysis need to discriminate such cases (see text). However, this has been included in the table solely to illustrate how ortholog paralog cases can be discriminated (using Ratio 3) from cases where there is a combination of an ingroup1 (or ingroup2) paralog and an outgroup paralog. c This scenario will resemble an ingroup1 paralog scenario or ingroup2 paralog scenario, if one of the two ingroup paralogs diverged much more than the other. d The variation may be an increase or decrease, depending on which of the two paralogs is more diverged. Ratio 3 can help resolve such cases. dictions regarding which of the two ingroups may contain a paralog (or non-ssd-ortholog). ally similar to each other versus those that have diverged at different evolutionary rates in each species. Note that an outgroup paralog cannot be well predicted, however this does not affect the utility of Ortholuge, since the method is focused on characterizing the orthology of the two ingroups. It should also be noted that multipleparalog scenarios (last three rows in Table 1), are more complex. Though relatively easy to predict on paper, they are more difficult to distinguish in reality, because the amount of divergence for the two paralogs may vary greatly. In most cases they would resemble one of the first three scenarios, depending on which of the two paralogs was more diverged. Nevertheless, in the end, these rare cases (two paralogs in a group of three) will still most frequently display atypical ratios, and will not fall within probable ortholog cut-offs. Using the derived ratio cut-offs, we have constructed several data sets of probable ssd-orthologs consisting of: mouse-rat comparisons (with human as the outgroup), and one for a P. putida-P. syringae comparison (with E. coli as the outgroup). These ssd-orthologs are particularly suited for comparative genomics analyses. In addition, notations are added to all the data analysed, indicating cases of probable gene duplication after species divergence ("possible in-paralog") – a scenario that can increase the likelihood of functional divergence of the genes. These higher quality sets of orthologs can be found via the Ortholuge website [28]. The proportion of ssdorthologs in the RBH-predicted data sets is summarized in Table 2. Note that cases of in-paralogs are not counted within the counts of ssd-orthologs in Table 2. Such cases, due to their uncertain potential to have diverged in function because of a gene duplication, are counted within the "uncertain" category. Ortholuge in action: an estimation of probable ssdorthologs and probable paralogs in RBH-based data sets An example of ratio cut-offs generated based on our truenegative analysis is listed in Table 2 (see also Figure 8 and Supplemental Figures 1 and 2). Researchers are of course encouraged to choose their own cut-off to suit their needs (i.e. more sensitivity or specificity). However, based on our simulations, these cut-offs should effectively differentiate probable orthologs and paralogs for these data sets. We also propose that these cut-offs can identify those orthologs most closely following species divergence (i.e. ssd-orthologs) – orthologs which may be more function- Using the cut-offs, we were also able to estimate the proportion of RBH-predicted orthologs that are likely paralogs for these eukaryotic and prokaryotic data sets (Table 2; see also data available on the Ortholuge website [28], which includes a classification of the EGO dataset using the RefSeq analysis cut-offs). For the prokaryotic data about 5% of RBH-based predictions are probable paralogs. For the eukaryotic data, about 10% of the RBH-pre- Page 9 of 16 (page number not for citation purposes) BMC Bioinformatics 2006, 7:270 dictions are probable paralogs. These are significant numbers that validate the need for a method like Ortholuge, particularly if one is trying to use RBH-predicted orthologs for downstream analyses that require stringent ortholog prediction (for example, for regulatory element detection). Application of these cut-offs to classify the curated eukaryote and prokaryote datasets suggest that the false negative rate in is in the range of 0.7% for prokaryote data and 3% for the eukaryote data. To facilitate the analysis of other datasets, we have developed Ortholuge software that can be used to characterize any existing dataset of orthologs. If no pre-existing ortholog dataset is available, Ortholuge can also construct such a dataset using an RBH-based approach applied to whole genome datasets (or other adequate datasets of genes from three organisms that a user supplies). Ortholuge was developed using Perl under Linux (SuSE 9.0 and RH 9.0) and operates in any UNIX environment, provided all the needed tools (see Methods) are available for the user's operating system. This freely available, open source, software is available on the Ortholuge website [28]. Discussion For cross-genome comparison purposes, researchers often wish to compare orthologs – in particular orthologs that have not undergone unusual divergence rates relative to one another, and have more likely retained similar function. We propose that Ortholuge is an approach, suitable for high-throughput genome-scale analysis, which aids identification of such orthologs. The Ortholuge method significantly improves the specificity/precision of highthroughput RBH-based ortholog analysis. For example, our results indicate that roughly 1 in 10 RBH-predicted rat-mouse orthologs are very likely paralogs, and about 1 in 20 RBH-predicted orthologs for two Pseudomonas species are similarly likely incorrect. Note that our RBH analysis requires transitivity between three species, rendering it more stringent than the typical RBH analysis between two species. This suggests that the typical RBH analysis may have an even greater number of false predictions. The resulting more specific identification of orthologs by Ortholuge is an important requirement for many downstream analyses, such as identifying gene regulatory regions, or characterizing differences in microarray-measured gene expression responses across species. An automated method such as Ortholuge is of course no substitute for a more manual, comprehensive phylogenetic analysis and has some limitations as mentioned below. However, its simplicity and utility for highthroughput analyses suggest that it is a useful complement to RBH-based identification of putative orthologs using http://www.biomedcentral.com/1471-2105/7/270 whole-genome gene datasets. In addition, Ortholuge's higher specificity approach can complement other methods that may provide a higher sensitivity/recall approach for ortholog identification [13]. Ortholuge evaluates orthologs through phylogenetic distance comparisons. To perform such comparisons, an outgroup is required to assist the prediction of orthologs between the two ingroups – this has simultaneous advantages and disadvantages. The added sequence provides extra resolution and extra specificity; however, a distant outgroup may lessen the sensitivity of the approach. Presumably, though, as more genomes are sequenced, the number of possible outgroups available to choose from will increase and very distant outgroups will become less of a problem. The Ortholuge pipeline generates predictions by evaluating the entire genome at once (or at least adequate gene representation for the species). The more data points that are representative of the genome, the more confident the ratio cut-offs will be. It assumes that the majority of incoming predictions are true orthologs, will exhibit expected ratios, and will thus form the high frequency ranges of the distributions. Our analysis does suggest this assumption to be reasonable and, notably, both eukaryotic and bacterial orthologs display similar ratio distributions, despite marked differences regarding how such organisms evolve. Once the genome-wide predictions are made for a certain species combination, Ortholuge can be used to estimate how likely it is that a specific putative orthologous group contains a true-negative within its ingroups. In such cases, we can match these ratios with a category (i.e. classification shown in Table 2), to suggest which gene in the ortholog group is likely to be the paralog. However, it should be emphasized that at this time we have not exhaustively examined all possible scenarios, and so such analysis should be taken as a guide requiring further investigation. Interestingly, this method also appears to be useful to examine, in a genome-wide scale, the relationships between species. By examining the ratio values at the highest frequency ranges in the histograms, one can easily determine which two of any three organisms are more similar to each other, on average, and on a genomewide scale (for example, that cattle genes are more similar to human genes, than mouse genes are to human genes, on average). The simplicity of Ortholuge allows for many benefits. For example, it can easily be re-run when genome annotations undergo significant changes. In addition, it can easily be customized with any method of sequence alignment or phylogenetic distance calculation, depending upon the Page 10 of 16 (page number not for citation purposes) BMC Bioinformatics 2006, 7:270 http://www.biomedcentral.com/1471-2105/7/270 Table 2: Proportion of RBH-predicteda orthologs that are likely ssd-orthologsb and likely paralogs, according to Ortholuge analysis. Data setc Probable paralog Proportio n of introduce d truenegatives in a truenegative analysisd Proportio n of RBHpredicted orthologse Ratio Rangec Proportio n of introduce d truenegatives in a truenegative analysisd Proportio n of RBHpredicted orthologse Ratio Rangec Proportio n of introduce d truenegatives in a truenegative analysisd Proportio n of RBHpredicted orthologse R1 ≤ 0.60 and R2 ≤ 0.55 0.8% 76% See footnotef 16% 14% R1 > 0.80 or R2 > 0.80 77%d 10% R1 ≤ 0.55 and R2 ≤ 0.70 1.3% 91% See footnotef 24% 4% R1 > 0.75 and R2 > 0.85 87% 5% Ratio Rangec rat-mouse comparison (human outgroup) P. putida-P. syringae comparison (E. coli outgroup) Orthology uncertainf Probable ssd-ortholog a RBH-predicted = Predicted to be orthologous using a Reciprocal-best BLAST hit approach. orthologs" = orthologs that appear to have diverged only due to speciation and have diverged at an expected relative rate for the species. Such orthologs are likely to have more similar function. See text for details. c Ratio Range for both Ratio1 (R1) and Ratio2 (R2). See Figure 8C for a schematic illustration of the cut-off ranges on a R1 × R2 plot. d Proportion of introduced true-negatives for the 25% true-negative analysis is shown here, however the actual number of true-negatives will be higher due to false-positives likely occurring in the original ortholog dataset. This analysis was used to estimate % false predictions in range (see text and Figure 8. e RBH-predicted data sets were examined using the cut-offs generated by the true-negative analysis, to identify what proportion of all RBHpredicted orthologs fell within each range. For the rat-mouse comparison 6294 RefSeq-based groups were classified into "probable ssd-ortholog", "uncertain", and "probable paralog" classes. For the Pseudomonas comparison, a total of 1456 groups were classified. Note that for an analysis of the EGO-based rat-mouse data set of 19,200 groups with the same cut-offs, 76% ssd-orthologs and 16% probable paralogs were predicted (when inparalogs were not counted, because of the lack of differentiation of gene isoforms in the EGO data set). f This "uncertain" category falls between the other two ranges and is graphically illustrated, for ease of understanding, in Figure 8C. This category follows the formula (R1 > a and R1 < b and R2 < d) or (R2 > c and R2 < d and R1 < a), where a and b are the lower and upper cut-off values, respectively, for Ratio1 (i.e. lower = cut-off for ssd-orthologs and higher = cut-off for probable paralogs), and c and d are the lower and upper cutoff values, respectively, for Ratio2. Note this "uncertain" category also contains counts of in-paralogs detected (7% of eukaryotic data, and negligible for prokaryotic data) – see text for details. b "Supporting-species-divergence researcher's preference. It is expected that further analysis will reveal relationships between true-negative analyses and ratio cut-off generation, negating the need to perform a full iterative-introduced-true-negative analysis for each species comparison. Of course, users can choose their own Ortholuge ratio cut-offs, either using a true-negative analysis, or another approach of their choice, for identification of orthologs at their preferred level of specificity. Accepting only orthologs in a certain ratio range and discarding the rest will certainly eliminate a small fraction of true orthologs from the input set. For example, if the probable paralog cut-offs are applied to the "high quality" curated prokaryotic and eukaryotic data sets, we eliminate 0.7% and 3% of the prokaryotic and eukaryotic predictions, respectively. However, if the more stringent ssdortholog cut-offs are applied, we eliminate 1.4% and 8% of the predictions, respectively. While these outliers may be false-positives in the curated data, they may also be true orthologs that have undergone unusual divergence in one ingroup species. For example, if a gene duplication occurred in one ingroup species after the speciation diver- gence, the resulting duplicated gene may undergo accelerated evolution [14]. Such scenarios would result in skewed ratios for true orthologs. However, we propose that such orthologs with unusual (relative) divergence may more likely have differing function at some level. In many genome-wide studies involving comparisons between species, researchers wish to identify those genes that are more likely to be functionally equivalent – i.e. orthologs that did not experience unusual rates of evolution or gene transfer. Ortholuge improves the identification of such "supporting species divergence" ortholog pairs (i.e. ssd-orthologs). This is apparently an important issue, as illustrated by some confusion occurring in the literature regarding the definition of orthologs. The definition that we, and many evolutionary biologists use, is the one initially proposed [1] that describes orthologs as genes that have diverged due to speciation (rather than due to gene duplication, which describes paralogs). However, the term ortholog is increasingly being inferred to mean 'functionally equivalent genes in different species' – a common misconcep- Page 11 of 16 (page number not for citation purposes) BMC Bioinformatics 2006, 7:270 tion [15]. While we and others agree that orthologs tend to have similar function, this is not a requirement for orthology [16]. So, it appears that while many researchers are identifying orthologs in a genetic or genomic study, what they really wish to identify is the subset of orthologs that are specifically functionally equivalent. Some methods, such as the widely used INPARANOID, refer to all in-paralogs (i.e. genes created by gene duplication after the species divergence) in the one species as orthologous to the related gene in the other species. They do not clearly distinguish between such cases of in-parology and more simple one-to-one orthologous relationships. We believe that such cases should be differentiated because a duplication event after species divergence may have led to significant functional divergence of one or both of the duplicated genes in the one species. In Ortholuge, cases involving possible in-paralogs are flagged using a simple analysis that focuses on detecting the most clear-cut in-paralog cases. For our analysis, we did apply ratio cutoffs derived using one-to-one ortholog RBH-based (RefSeq) data to classify a same species (EGO) data set that includes both one-to-one orthologs and many-to-many orthologs. However, we recognize a need to implement more robust procedures that would consider all cases of suspected recent gene duplications in the analysis (the current method is subject to the limitations of the initial RBH-based ortholog identification). It would also be desirable to complement this analysis further by noting cases of relative gene rearrangement in the input set of orthologs. Ortholuge in its current form cannot detect gene rearrangements, however it could potentially complement other bioinformatics approaches that detect such rearrangements [17]. Ortholuge could also be adapted to contain a gene rearrangement analysis that is customized to its methodology. These additional ortholog evolutionary scenarios, involving possible inparalogy or gene rearrangements, should be specifically noted because they cannot be distinguished by examining Ortholuge distance ratios alone. They require further study in any comparative analysis, since functional equivalence between the orthologs is less likely. Regarding the limitations of this method, it should be emphasized that Ortholuge is limited by the quality of the initial ortholog-analysis (i.e. RBH can miss cases of trueorthology, and some data sets such as those from EGO are incomplete and don't clarify which genes are isoforms, which complicates in-paralog analysis). Ortholuge is also only as good as the quality of the sequence data being analyzed. We have tested our alignment trimming and masking of regions of lower alignment quality extensively to improve the critical sequence alignment component of our method; however, certainly this method will fail if low quality sequences, with many errors, are used in the http://www.biomedcentral.com/1471-2105/7/270 analysis. In addition, the top BLAST hit is not necessarily the nearest neighbour [18] and so true orthologs may be missed when using Ortholuge after initially identifying orthologs with an RBH-based approach. Ortholuge could therefore improve if the initial ortholog prediction method is improved (it should be emphasized that Ortholuge can be used with any input dataset of proposed orthologs deduced by any current or future ortholog prediction methodology – not just the ones presented). Regardless of any limitations, Ortholuge appears to effectively improve the specificity of ortholog identification and is suitable for high-throughput, genome-wide use. Given the amount of genomics data being obtained at this time, such specific, high-throughput approaches will become increasingly necessary, as genomics research moves further toward more multi-genome comparative analyses. Conclusion Ortholuge improves the specificity of ortholog identification and is suitable for high-throughput use. This precise ortholog prediction method complements other ortholog prediction methods that are not focused on precision and it potentially identifies those orthologs most likely to be functionally similar. The Ortholuge method provides important data set evaluation for a variety of analyses based on comparative approaches, including gene function prediction, prediction of conserved regulatory elements, and comparative analysis of gene order or gene/ protein expression data. Methods Data sets 1. Eukaryotic Gene Orthologs (EGO) RBH data set EGO release 8 database was obtained from The Institute of Genomic Research (TIGR) [5]. This database is composed of two files: 1) one file housing ortholog identifiers and tentative consensus sequence (TC) identifiers and 2) a second file TC sequences in FASTA format. Both files were used to extract and create 19,200 unique mouse, rat, human tentative ortholog gene sets files (TOGs) for Ortholuge analysis. 2. Eukaryotic curated orthologs ("high quality" MGD dataset) The Mouse Genome Database (MGD) is a comprehensive, high-quality database which currently includes orthology information for mouse, human, rat, and 14 other mammals [19]. Orthology annotations are manually curated from scientific literature and each orthology assertion is based on criteria recommended by the Human Genome Organisation (HUGO). A program was developed to extract the orthologous gene pairs from the MGD Sybase database for two species tri- Page 12 of 16 (page number not for citation purposes) BMC Bioinformatics 2006, 7:270 ples: 1) mouse, rat, human 2) cattle, human, mouse. All relevant human, mouse, and rat RefSeq [20] mRNA sequences and protein sequences were obtained from the National Center for Biotechnology Information (NCBI) FTP site along with the Locus Link RefSeq mappings file. FASTA-formatted ortholog sets for those ortholog pairs were created that satisfied a transitive, triple ortholog relationship and had corresponding RefSeq sequences annotated with a reviewed or validated status. 2642 mouse, rat, human mRNA, 2499 mouse, rat, human protein, and 427 cattle, human, mouse mRNA ortholog sets were created. 3. Eukaryotic Gene Orthologs (EGO) "lower quality" RBH set Cattle, human, and mouse ortholog groups, totaling 16,134 in number, were extracted from the EGO release 8 database. The cattle genome was incomplete at this time and thus we expected more incorrectly predicted orthologs by the RBH method (see Fig. 1 for the scenario). 4. Eukaryotic RefSeq-based RBH ortholog set The species-specific mouse, rat, human RefSeq files were obtained from the NCBI FTP site and BLAST databases [2] were constructed for each file. A pairwise blastall analysis was performed between each species enforcing a 10e-04 E value cut-off. 6294 ortholog FASTA-formatted sets were created from transitive, best-hit mRNA RefSeqs. We allowed one unique best-hit isoform per Locuslink ID in the RBH dataset. 5. Eukaryotic RBH Tentative Consensus (TC) ortholog set involving cattle A higher-quality, non-redundant RBH TC dataset was established using the cattle, human, and mouse tentative sequences found in the EGO release 8 database. The transitive, triple reciprocal top best BLAST hit for each unique cattle TC was used to form 15,660 ortholog groups. This approach served to reduce the over-representation of TC's found in the currently established set of EGO tentative ortholog groups (TOGs) due to the allowance of multiple RBH relationships within a specified cut-off. 6. Bacterial RBH-predicted data sets Protein sequences of Escherichia coli K12 [10], Pseudomonas syringae pv. tomato str. DC3000 [11], and Pseudomonas putida KT2440 [12] were obtained from NCBI. For the RBH analysis, first a BLASTp was performed between all pair-wise combinations, with an E-value cutoff of 10e-04. Genes that retained a transitive reciprocal best hit property and passed the BLAST cut-off were retained. There were 1456 ortholog groups constructed. 7. Bacterial higher quality orthologs A set of higher quality orthologs was constructed from a set provided by Lerat et al [21], who found all the gene families in 13 gamma-proteobacteria genomes that had http://www.biomedcentral.com/1471-2105/7/270 exactly one gene per species. For simplicity, we chose those that had annotated gene names in each of our three chosen bacterial species. Initially, there were 156 ortholog groups, and of these 143 ortholog groups passed our automated alignment editing stage. 8. OrthoMCL eukaryote ortholog dataset The OrthoMCL database files were downloaded [22] and a set of mouse-rat-human ortholog triples were extracted from the OrthoMCL clusters to construct ortholog triples. These predicted ortholog groups were analyzed using the Ortholuge analysis software. Through our analyses, we observed that the use of nucleotide sequences provided better resolution for these particular sets of eukaryotic data, at the level of divergence being examined using Ortholuge (see Figure 4 of [Additional file 1]), whereas protein sequences provided better resolution for the particular bacterial data we were analyzing (data not shown). Consequently, all analyses below were performed using nucleotide sequences for eukaryote analysis and protein sequences for the given prokaryote analysis. Ortholuge analysis pipeline The input parameters for Ortholuge include a list of tentative ortholog species groups with sets of FASTA-formatted sequences for each respective gene/protein in the tentative orthologs set. A flowchart overview of this pipeline can be seen in Figure 2. If ortholog groups have not been predetermined, the Ortholuge software we developed is capable of calculating an initial list of tentative orthologous groups, using the RBH approach. In this latter case, the input required is a FASTA-formatted list of sequences from genes predicted in three genomes to be examined (two sequences to be compared, one reference sequence as an outgroup). Note that whole-genome data does not necessarily need to be used, however the dataset should be large enough to ensure that the distribution of relative evolutionary distances will centre around what is likely the true median for the relative evolutionary distance for the given organisms being examined. 1. Sequence alignments Initial alignments of the genes/proteins for each tentative ortholog group are generated using CLUSTALW [23] with either DNA or PROTEIN alignment options. All other parameters are default. 2. Automated alignment editing All alignment overhangs and poly-A tails are removed in each aligned set of sequences. An alignment must be aligned over 300 base pairs (bp) or 100 amino acids (aa) or it is discarded from the analysis. This choice of threshold was based on previous studies that have suggested that Page 13 of 16 (page number not for citation purposes) BMC Bioinformatics 2006, 7:270 http://www.biomedcentral.com/1471-2105/7/270 it is more likely that a sequence codes for a protein if its length is over 100 amino acids [24]. vided to facilitate viewing data in low ranges in this format). Gap masking is performed to remove ambiguously aligned gap-flanking regions. A sample of RBH-predicted ortholog sequence sets were examined to identify both gaps introduced by misalignments and gaps introduced through sequence insertions and deletions. Gap-masking simulations using various window length intervals were applied to the aligned sequences to establish a gap-masking approach. Our approach entails running a sliding 25base pair window over the aligned sequences in both directions to assess gap percentages exceeding a 40% gap threshold. The window size and gap threshold were chosen such that overlapping windows exceeding the gap threshold would produce a worst-case gap masked region of 49 base pairs. True-negative introduction analyses Mean/iterative true-negative analysis For the eukaryotic RefSeq-based RBH ortholog dataset, a selected proportion of the ortholog sets were randomly transformed to true-negative ortholog sets and then run through the Ortholuge analysis. We report here the results for a 25% transformation of the data, though other percentages were examined (data not shown). To do this transformation (introduction of true-negatives), the full set of mouse, rat, and human RefSeq sequence files were obtained from NCBI and a pairwise best-hits list was created using a pairwise blastall analysis with a 10e-4 E-value cut-off. An orthologous set was transformed to a true-negative by replacing one of the species sequences with another sequence that had a greater (next highest) BLAST expect value and which still satisfied a reciprocal and transitive best BLAST hit with the two other sequences in the orthologous set. In essence, we were removing an RBHpredicted ortholog and identifying another gene that could satisfy an RBH relationship. This essentially simulated what could happen if a gene was lost in one genome, or a genome sequence was incomplete, by removing a gene from a proposed ortholog "triple" and determining what the next RBH relationship would be for the remaining genes in the triple. Care was taken to ensure that the set of original sequences in the higher quality set being transformed initially satisfy an RBH relationship. Furthermore, the algorithm mandated that the non-orthologous replacement is not an isoform of the replaced sequence. Each transformed dataset was then run through the Ortholuge analysis. Such true-negative transformations were iteratively performed 50 to 100 times for each truenegative percentage proportion. A mean true-negative value and standard deviation for each ratio value in the distribution could then be calculated. Note that this same approach was also used to perform an iterative true-negative analysis for other eukaryotic data sets, and the prokaryotic data. Both the trimming and gap-masking methods were evaluated for the introduction of ratio distribution biases by selected alignment characteristics. Selected characteristics of both trimmed and gap masked alignments were recorded and analyzed to determine whether the automated alignment editing process had created a ratio distribution bias for certain alignment characteristics. These characteristics included: number of aligned base pairs, identity over aligned length, identity over left and right ends, proportion of gaps over full length, proportion of gaps over left and right ends. Here we defined end length as MIN(.25 * alignment length, 150 bp/50aa). See also Figure 3. 3. Sequence distances and calculation of Ortholuge ratios The EDNADIST or EPROTDIST programs of EMBOSS [25] and PHYLIP 3.6 [26] software, respectively, were used to compute the nucleotide or protein distances. We opted to analyze our data using the Kimura distance formula due to its simplicity and computational efficiency. We used a conservative transition/transversion rate of 2 as an approximation, although studies do suggest that transition/transversion rates are context dependent [27]. All other parameters were defaults. The phylogenetic distances were used to compute the three ratios, Ratio1, Ratio2, and Ratio3, as described in Figure 2. Ratios are then displayed manually in two forms: Histograms and as R1 × R2 plots. The ratio frequencies are enumerated for a given interval and histograms are constructed for all three ratios, visually displaying the ratio frequencies of tentative orthologous groups within a ratio of 2.5. The R1 × R2 plots are comprised of an x-y plot of Ratio1 versus Ratio2 which facilitates visualization of the full ratio distribution range (though zoomed in versions of these plots up to ratio values of 2.0 are also pro- The true-negative mean and standard deviations were analyzed to establish conservative ratio cut-offs and estimate false-positive proportions. A three-level classification system for true mean false-positive values over defined ratio intervals was derived from this analysis (see "Establishing Cut-offs" Method's section, below). The number of ratios in a given set falling into each level (i.e. "probable ssdortholog", "uncertain", and "probable paralog" classes) was counted. Page 14 of 16 (page number not for citation purposes) BMC Bioinformatics 2006, 7:270 True-negative introduction in the bacterial set – an introduction of all possible true-negatives For the bacterial RBH-predicted data set, P. syringae genes (ingroup2) were replaced with their next best reciprocal BLAST hit to P. putida (within a 10e-4 E value cut-off), wherever possible. 668 out of 1456 ortholog triples were transformed into true-negative triples for this dataset. There are no iterations necessary here, so the transformed data set was then run through Ortholuge once. Establishing cut-offs for Ortholuge-predicted "probable paralogs", uncertain, and probable ssd-orthologs Researchers are of course encouraged to use the above true-negative analysis to formulate their own cut-offs, since cut-offs of differing levels of sensitivity and specificity are possible. In our example analysis, we examined the iterative/mean true-negative analysis for a eukaryotic and prokaryotic dataset using a histogram and examined the data in terms of the proportion of introduced truenegatives identified in each ratio range. These percentages are used to aid in identifying cut-offs for more specific (precise) identification of probable orthologs (or ssdorthologs) and probable paralogs. We examined the trend manually, and opted to identify "probable ssd-orthologs" as those occurring in ratio ranges where there were, on average, only between 0–10% introduced true-negatives (out of the total number of tentative orthologous groups in the range; see Results, Figure 8). Tentative orthologous groups falling in ratio ranges that contained between 10 to 50% introduced true-negatives (on average) were classified as orthology "uncertain". Finally, groups falling in ratio ranges that contained greater than 50% introduced true-negatives were classified as "Probable paralog". We chose this cut-off because it was at this point that the transition from few introduced true-negatives in a range, to mostly introduced true-negatives in a range, increased significantly. Note that at this point there will also likely be some true-negatives occurring in the analyzed dataset (as illustrated also by our "higher quality" data set analyses), and so the actual proportion of true-negatives at this probable paralog cut-off point will likely be much higher. As mentioned in the results, we opted to perform this analysis on completely automated RBH data (RefSeq-based), rather than high quality data, since we appear to be able to obtain meaningful results, while being able to take advantage of the automated nature of RBH data set generation. However, we did also perform this analysis on the high quality data set, and on any EGO data set, generating comparable results. Identification of in-paralogs For those tentative orthologs predicted by Ortholuge to be ssd-orthologs (and also for other classes as well, in case researchers wish to use other cut-offs), we performed an additional analysis to identify cases of in-paralogy that http://www.biomedcentral.com/1471-2105/7/270 may affect the possible functional equivalence of the ssdorthologs (see the Introduction for a discussion of this issue). To do this, we combined both ingroup species' sequences into one database and performed a BLAST analysis using all the individual sequences from each of the ingroup species as a query. We then identified individual sequence cases in which top hits (other than a query sequence self-hit) were to another sequence in its own species. If the bit score for this same-species hit was greater than the bit score for the other species hit, then the case was flagged as an in-paralog candidate (ie. a gene duplication may have occurred after the speciation, potentially affecting the function of the ssd-ortholog). Any such inparalog cases were classified under the "uncertain" category, unless they had been classified, according to a Ratio1 and Ratio2 analysis, as belonging to the "probable paralog" category (in the latter case they would remain in the probable paralogs category). Note that this analysis only identifies a proportion of all cases – in particular very clear cut cases. It does not identify all possible in-paralogs and researchers are encouraged to investigate any such cases more thoroughly. Authors' contributions BH and FSLB developed the initial framework for this computational method and FSLB led formation of the final draft of the manuscript. DLF and YYL developed the final methodology, performed the analyses of the selected data sets, and drafted the initial versions of the manuscript, with each focusing their research and analyses on eukaryotic and prokaryotic data, respectively, during their research rotations. FMR participated in the design of this study and provided critical input that improved this work. MRL used initial scripts developed by DLF and YYL to develop a software package that will perform the main component of the Ortholuge analysis. All authors read and approved the final manuscript. Additional material Additional file 1 Supplementary Figures. Supplementary Figure 1. Ratio1, Ratio2 and Ratio3 histograms of the P. putida – P. syringae – E. coli putative orthologous sets summarizing results of a true negative introduction analysis. Supplementary Figure 2. Ratio2 and Ratio3 histograms of the mouse-rathuman putative orthologous sets indicating the average proportion of true negatives observed in our simulation of an incomplete genome through the iterative introduction of a mouse (ingroup1) paralog in randomly selected ortholog sets. Supplementary Figure 3. Histograms of Ortholuge Ratios 1, 2, and 3 for the mouse-rat-human RBH RefSeq nucleotide dataset. Supplementary Figure 4. Histograms of Ortholuge Ratios 1, 2, and 3 for the mouse-rat-human OrthoMCL protein dataset. Click here for file [http://www.biomedcentral.com/content/supplementary/14712105-7-270-S1.pdf] Page 15 of 16 (page number not for citation purposes) BMC Bioinformatics 2006, 7:270 Acknowledgements The authors wish to thank members of the Brinkman Laboratory for helpful discussions and technical assistance. FSLB is a Canadian Institutes of Health Research New Investigator (CIHR) and Michael Smith Foundation for Health Research (MSFHR) Scholar. DLF and YYL are CIHR/MSFHR Bioinformatics Training Program for Health Research award recipients. All other authors of this work, as well as computer hardware resources utilized for this project, were supported by the Functional Pathogenomics of Mucosal Immunity Project and Pathogenomics of Innate Immunity Project (funded by Genome Canada/Genome Prairie/Genome BC and Inimex Pharmaceuticals) and by IBM and Sun Microsystems. http://www.biomedcentral.com/1471-2105/7/270 14. 15. 16. 17. 18. 19. References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. Fitch WM: Distinguishing homologous from analogous proteins. Syst Zool 1970, 19:99-113. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215:403-410. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: An updated version includes eukaryotes. BMC Bioinformatics 2003, 4:41. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: New developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29:22-28. Lee Y, Sultana R, Pertea G, Cho J, Karamycheva S, Tsai J, Parvizi B, Cheung F, Antonescu V, White J, Holt I, Liang F, Quackenbush J: Cross-referencing eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res 2002, 12:493-502. Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 2001, 314:1041-1052. O'Brien KP, Remm M, Sonnhammer EL: Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 2005, 33:D476-480. Kunin V, Ouzounis CA: The balance of driving forces during genome evolution in prokaryotes. Genome Res 2003, 13:1589-1594. Zhang P, Gu Z, Li WH: Different evolutionary patterns between young duplicate genes in the human genome. Genome Biol 2003, 4:R56. Blattner FR, Plunkett G 3rd, Bloch CA, Perna NT, Burland V, Riley M, Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, Gregor J, Davis NW, Kirkpatrick HA, Goeden MA, Rose DJ, Mau B, Shao Y: The complete genome sequence of escherichia coli K-12. Science 1997, 277:1453-1474. Buell CR, Joardar V, Lindeberg M, Selengut J, Paulsen IT, Gwinn ML, Dodson RJ, Deboy RT, Durkin AS, Kolonay JF, Madupu R, Daugherty S, Brinkac L, Beanan MJ, Haft DH, Nelson WC, Davidsen T, Zafar N, Zhou L, Liu J, Yuan Q, Khouri H, Fedorova N, Tran B, Russell D, Berry K, Utterback T, Van Aken SE, Feldblyum TV, D'Ascenzo M, Deng WL, Ramos AR, Alfano JR, Cartinhour S, Chatterjee AK, Delaney TP, Lazarowitz SG, Martin GB, Schneider DJ, Tang X, Bender CL, White O, Fraser CM, Collmer A: The complete genome sequence of the arabidopsis and tomato pathogen pseudomonas syringae pv. tomato DC3000. Proc Natl Acad Sci U S A 2003, 100:10181-10186. Nelson KE, Weinel C, Paulsen IT, Dodson RJ, Hilbert H, Martins dos Santos VA, Fouts DE, Gill SR, Pop M, Holmes M, Brinkac L, Beanan M, DeBoy RT, Daugherty S, Kolonay J, Madupu R, Nelson W, White O, Peterson J, Khouri H, Hance I, Chris Lee P, Holtzapple E, Scanlan D, Tran K, Moazzez A, Utterback T, Rizzo M, Lee K, Kosack D, Moestl D, Wedler H, Lauber J, Stjepandic D, Hoheisel J, Straetz M, Heim S, Kiewitz C, Eisen JA, Timmis KN, Dusterhoft A, Tummler B, Fraser CM: Complete genome sequence and comparative analysis of the metabolically versatile pseudomonas putida KT2440. Environ Microbiol 2002, 4:799-808. Zheng XH, Lu F, Wang ZY, Zhong F, Hoover J, Mural R: Using shared genomic synteny and shared protein functions to enhance the identification of orthologous gene pairs. Bioinformatics 2005, 21:703-710. 20. 21. 22. 23. 24. 25. 26. 27. 28. Castillo-Davis CI, Hartl DL, Achaz G: Cis-regulatory and protein evolution in orthologous and duplicate genes. Genome Res 2004, 14:1530-1536. Jensen RA: Orthologs and paralogs – we need to get it right. Genome Biol 2001, 2:. INTERACTIONS1002 Fitch WM: Homology a personal view on some of the problems. Trends Genet 2000, 16:227-231. Brudno M, Malde S, Poliakov A, Do CB, Couronne O, Dubchak I, Batzoglou S: Glocal alignment: Finding rearrangements during alignment. Bioinformatics 2003, 19(Suppl 1):i54-62. Koski LB, Golding GB: The closest BLAST hit is often not the nearest neighbor. J Mol Evol 2001, 52:540-542. Eppig JT, Bult CJ, Kadin JA, Richardson JE, Blake JA, Anagnostopoulos A, Baldarelli RM, Baya M, Beal JS, Bello SM, Boddy WJ, Bradt DW, Burkart DL, Butler NE, Campbell J, Cassell MA, Corbani LE, Cousins SL, Dahmen DJ, Dene H, Diehl AD, Drabkin HJ, Frazer KS, Frost P, Glass LH, Goldsmith CW, Grant PL, Lennon-Pierce M, Lewis J, Lu I, Maltais LJ, McAndrews-Hill M, McClellan L, Miers DB, Miller LA, Ni L, Ormsby JE, Qi D, Reddy TB, Reed DJ, Richards-Smith B, Shaw DR, Sinclair R, Smith CL, Szauter P, Walker MB, Walton DO, Washburn LL, Witham IT, Zhu Y, Mouse Genome Database Group: The Mouse Genome Database (MGD): from genes to mice – a community resource for mouse biology. Nucleic Acids Res 2005, 33:D471-475. Pruitt KD, Tatusova T, Maglott DR: NCBI reference sequence (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2005, 33:D501-504. Lerat E, Daubin V, Moran NA: From gene trees to organismal phylogeny in prokaryotes: The case of the gamma-proteobacteria. PLoS Biol 2003, 1:E19. Chen F, Mackey AJ, Stoeckert CJ Jr, Roos DS: OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res 2006, 34:D363-368. Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the clustal series of programs. Nucleic Acids Res 2003, 31:3497-3500. Brinkman FS, Blanchard JL, Cherkasov A, Av-Gay Y, Brunham RC, Fernandez RC, Finlay BB, Otto SP, Ouellette BF, Keeling PJ, Rose AM, Hancock RE, Jones SJ, Greberg H: Evidence that plant-like genes in chlamydia species reflect an ancestral relationship between chlamydiaceae, cyanobacteria, and the chloroplast. Genome Res 2002, 12:1159-1167. Rice P, Longden I, Bleasby A: EMBOSS: The european molecular biology open software suite. Trends Genet 2000, 16:276-277. Felsenstein J: PHYLIP-phylogeny inference package. Cladistics 1989, 5:164-166. Hwang DG, Green P: Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc Natl Acad Sci U S A 2004, 101:13994-14001. Ortholuge [http://www.pathogenomics.ca/ortholuge/] Publish with Bio Med Central and every scientist can read your work free of charge "BioMed Central will be the most significant development for disseminating the results of biomedical researc h in our lifetime." Sir Paul Nurse, Cancer Research UK Your research papers will be: available free of charge to the entire biomedical community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours — you keep the copyright BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 16 of 16 (page number not for citation purposes)

Log In

1471-2105-7-270.pdf

1471-2105-7-270.pdf

Related Papers

RELATED PAPERS