Decoding the rice genome

BioEssays, 2006
Genes and genomes

Shubha Vij, Vikrant Gupta, Dibyendu Kumar, Ravi Vydianathan, Saurabh Raghuvanshi, Paramjit Khurana, Jitendra P. Khurana, and Akhilesh K. Tyagi*

Summary

Rice cultivation is one of the most important agricultural activities on earth, and nearly 90% of the world's rice is produced in Asia. Rice belongs to the family of crops that includes wheat, maize and barley, and it supplies more than 50% of calories consumed by the world population. Its immense economic value and relatively small genome size make it a focal point for scientific investigations, so much so that four whole genome sequence drafts of varying quality have been generated by both public and privately funded ventures. The availability of a complete and high-quality map-based sequence has provided the opportunity to study genome organization and evolution. Most importantly, the order and identity of 37,544 genes of rice have been unraveled. The sequence provides the required ingredients for functional genomics and molecular breeding programs aimed at unraveling intricate cellular processes and improving rice productivity. BioEssays 28:421–432, 2006. © 2006 Wiley Periodicals, Inc.

Interdisciplinary Centre for Plant Genomics and Department of Plant Molecular Biology, University of Delhi South Campus, New Delhi 110 021, India.
Funding agency: The research work of our group is funded by the Department of Biotechnology, Government of India, New Delhi. SV, VG, DK and VR were supported by research fellowships from CSIR/UGC, Government of India, New Delhi.
*Correspondence to: Akhilesh K. Tyagi, Interdisciplinary Centre for Plant Genomics and Department of Plant Molecular Biology, University of Delhi South Campus, New Delhi 110 021, India. E-mail: akhilesh@genomeindia.org
DOI 10.1002/bies.20399
Published online in Wiley InterScience (www.interscience.wiley.com).
Abbreviations: BAC, Bacterial artificial chromosome; bp, Base pairs; EST, Expressed sequence tag; IRGSP, International Rice Genome Sequencing Project; JAK, Janus kinase; Mb, Million base pairs; MDRs, Mathematically determined reads; MOsDB, MIPS Oryza sativa DataBase; MTP, Minimum tiling path; ORF, Open reading frame; PAC, P1-derived artificial chromosome; QTL, Quantitative trait loci; RePS, Repeat masked phrap with scaffolding; RiceGAAS, Rice Genome Automated Annotation System; SNP, Single nucleotide polymorphism; SSR, Simple sequence repeat; STAT, Signal Transducers and Activators of Transcription; STCs, Sequence tag connectors; TIR-NB-LRR, Toll-Interleukin-Region-Nucleotide-Binding site-Leucine-Rich Repeat; WGS, Whole Genome Shotgun; YAC, Yeast artificial chromosome.

Introduction

Rice is one of the most important food crops of the world. More than half of the world population depends on rice as the major source of calories and protein. About 840 million people in the world are undernourished, including almost 200 million children in developing countries (http://www.fao.org/). Rice production will have to be increased substantially to meet the demand of the growing world population, especially in the Asian subcontinent. Rice production has, however, declined over the last 4 years (http://www.irri.org).(1) This is due to increasing urbanization, which leads to a shortage of cultivable land, and to deteriorating environmental conditions. To meet the growing demand, a combination of breeding strategies and molecular biology tools has to be used in synchrony to obtain varieties that are high yielding and also more resistant to various abiotic and biotic stresses.(2) Sequencing of the rice genome was initiated with the aim of using the sequence information to understand the function of its gene repertoire.
The pioneering work that laid the foundation for rice genome sequencing was initiated in the early 1990s.(3) The work centered on constructing a linkage map (http://rgp.dna.affrc.go.jp/publicdata/geneticmap2000.index.html),(4) a YAC (yeast artificial chromosome)-based physical map,(5,6) a transcript map(7,8) and a sequence-ready BAC/PAC (bacterial artificial chromosome/P1-derived artificial chromosome) physical map.(2,9) Rice is also amenable to genetic transformation, thereby providing an ideal crop system for functional genomics.(10) Moreover, the rice genome shares a syntenic relationship with other cereal crops like sorghum and maize.(11,12) Amongst the different cereal crops, rice was chosen as the best representative genome due to its relatively small estimated genome size of 430 Mb.(13) This review aims to trace the path of rice genome sequencing from its initiation to the current status and seeks to interpret the information obtained from the genome of the first food crop to be sequenced.

Strategies to sequence whole genomes

In order to sequence a large DNA molecule, it is first broken into small fragments, which are cloned and sequenced. The overlapping sequence reads are assembled into contigs using computer software programs. The quality of the sequence submitted to the database is classified as phase 0, I, II or III. The initial raw sequence generated is referred to as phase 0 and the assembled sequence represents phase I. When the contigs are ordered and oriented, the sequence is designated as phase II. All stages before the final one generate draft sequences of variable quality; ‘draft’ indicates that the sequence is still incomplete. The final step is to convert the draft into the finished sequence, also referred to as phase III (http://www.ncbi.nlm.nih.gov/HTGS/). For small genomes, like those of microbes, the finished sequence refers to a complete sequence, without gaps.
However, in the case of eukaryotes, it is virtually impossible to get the complete genome information in a single piece because they contain a large amount of repetitive sequences, which are especially concentrated in the regions spanning centromeres and telomeres.(14) The two main strategies for whole genome sequencing are discussed below.

In the Clone-by-Clone Shotgun approach, the genome is fragmented and cloned in BAC/PAC vectors. Inserts of genomic DNA fragments in the BAC/PAC vectors are anchored physically to the genome, with the help of DNA markers, to develop a minimum tiling path (MTP). The MTP is generated using a combination of techniques including fingerprint patterns, sequence tag connectors (STCs) and marker information. Each BAC/PAC (with an average insert size of 100–150 kb) present in the MTP is again broken into small-sized fragments, cloned and sequenced. The sequence of the genome is then obtained by merging the individual BAC/PAC sequences.(14,15) Although this approach is time consuming, it offers the advantage that each clone is anchored to a specific chromosome, thus making the task of finishing much easier.(16) In addition, since the finished genome of a model organism can be leveraged for other genomes, this technique is cost effective in the long run. A high-quality sequence has been generated for the human genome as well as the Arabidopsis and rice genomes using this approach.(17–19)

In the Whole Genome Shotgun (WGS) approach, the genomic DNA as such is broken into small-sized fragments, cloned and directly used for sequencing. The sequences are then assembled to reconstruct the whole genome.(15) This approach avoids the initial tasks of making BAC/PAC libraries, constructing an MTP and constructing an individual library for each BAC/PAC clone in the MTP. This strategy has been used extensively for bacterial genomes.
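The assembly step common to both strategies, merging overlapping reads into contigs, can be illustrated with a toy greedy overlap assembler. This is only a minimal sketch: the function names, the `min_len` cutoff and the 21-base example sequence are invented for illustration, and production assemblers such as PHRAP additionally use base-quality values and far more sophisticated overlap detection.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge reads into contigs by repeatedly joining the best-overlapping pair."""
    # Drop exact duplicates and reads fully contained in another read.
    unique = list(dict.fromkeys(reads))
    contigs = [r for r in unique if not any(r != o and r in o for o in unique)]

    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # no remaining overlaps: these are the final contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping "reads" drawn from a made-up 21-base sequence (hypothetical data).
reads = ["GTACGTTAGC", "ATGGCGTACG", "TTAGCCTAGGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

In the clone-by-clone approach, a procedure of this kind is applied separately to the reads of each BAC/PAC clone, whereas in WGS it is applied to reads from the whole genome at once, which is where repeated sequences can produce spurious overlaps.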
The WGS approach has also been used for sequencing the human genome as well as the indica and japonica rice genomes.(20–22) The potential problem in using WGS for eukaryotic genomes is misassembly due to a high percentage of repetitive elements.(14,16) In addition, each contig has to be individually anchored to the chromosome, which makes the task of finishing more laborious and cumbersome.

2002: The year of rice genome sequence drafts

Overview

Rice was the ideal candidate for genome sequencing after Arabidopsis since Arabidopsis and rice are widely accepted as model dicot and monocot plants, respectively.(23) Rice was the first organism whose sequencing was pursued by four groups independently, which itself speaks for the importance of its genome information.(24,25) Although the task of sequencing the rice genome was initiated by the publicly funded International Rice Genome Sequencing Project (IRGSP, see next section),(26) it was a private company, Monsanto (St Louis, MO, USA), that released the first draft of the rice genome in April 2000, based on data generated at the University of Washington. Monsanto sequenced a total of 3,391 BACs using a clone-by-clone approach, to the level of 5X coverage, to produce a draft sequence of 399 Mb. This sequence was assembled into 52,202 contigs, representing 259 Mb of non-overlapping data, which was expected to cover almost 60% of the rice genome.(27) Meanwhile, two other groups, Syngenta (Torrey Mesa Research Institute, San Diego, USA) and the Beijing Genomics Institute (BGI), China, also launched independent sequencing programs. IRGSP, Monsanto and Syngenta chose the japonica cultivar ‘Nipponbare’ while BGI used the indica cultivar ‘93-11’ for sequencing. The aim of the two private ventures (Monsanto and Syngenta) and of BGI was primarily gene discovery and the identification of molecular markers for breeding.
Hence, these groups aimed at obtaining a draft sequence to get a broad overview of the rice genome.(14) The Monsanto and Syngenta data were not released to the public database but could be accessed for academic purposes on entering a database registration agreement through their sites (http://www.riceresearch.org, http://www.tmri.org).(21,27) Both Monsanto and Syngenta also allowed their sequences to be incorporated into the IRGSP sequence as long as IRGSP used the information to improve the sequence from draft to finished level. The BGI data, unlike the Monsanto and Syngenta data, were made freely available (http://btn.genomics.org.cn/rice/). In contrast, the aim of IRGSP was to obtain a highly accurate finished sequence of the rice genome.(26) As a first step, the IRGSP announced the release of a high-quality map-based draft sequence in the public domain in December 2002.(2) As a result of these private and public ventures, the year 2002 saw the release of three draft sequences of the rice genome. The details of the participating groups and their efforts to produce the draft sequence are given below.

Detailed history

The decision to sequence the rice genome was taken at the International Plant Molecular Biology Conference held in 1997 in Singapore. Countries sharing a common interest in sequencing the rice genome joined hands to achieve this task(28) and launched the International Rice Genome Sequencing Project (IRGSP).(26) It was the third largest public genome project undertaken after the human and mouse genome projects.(28) The consortium included laboratories from Japan, USA, China, Taiwan, France, India, Korea, Brazil, Thailand and UK.(29) The participants from the member countries are: Rice Genome Research Program (RGP), Japan (http://rgp.dna.affrc.go.jp); The Institute for Genomic Research (TIGR), USA (http://www.tigr.org/tdb/e2k1/osa1); National Center for Gene Research (NCGR), China (http://www.ncgr.ac.cn/); Genoscope, France (http://www.genoscope.cns.fr/); Arizona Genomics Institute (AGI), USA (http://www.genome.arizona.edu); Cold Spring Harbor Laboratory (CSHL), USA (http://nucleus.cshl.org/riceweb); Academia Sinica Plant Genome Center (ASPGC), Taiwan (http://genome.sinica.edu.tw); Indian Initiative for Rice Genome Sequencing (IIRGS), India (http://www.genomeindia.org/); Plant Genome Initiative at Rutgers (PGIR), USA (http://pgir.rutgers.edu); Korea Rice Genome Research Program (KRGRP), Korea (http://biogen.niast.go.kr); National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand (http://www.cs.ait.ac.th/nstda/biotec/biotec.html); Brazilian Rice Genome Initiative (BRIGI), Brazil (http://www.ufpel.tche.br/faem/fitotecnia/fitomelhoramento); John Innes Centre, United Kingdom (http://www.jic.bbsrc.ac.uk); Washington University School of Medicine Genome Sequencing Center (GSC), USA (http://genome.wustl.edu/); and Wisconsin Rice Genome Project (GCOW), USA (http://www.gcow.wisc.edu).

The IRGSP effort revolved around a few basic points: the sequencing strategy, the rice cultivar to be sequenced, the accuracy of the sequence and the sequence release policy. It was decided to use the japonica cultivar ‘Nipponbare’ since it had already been used by the Rice Genome Research Program (RGP), Japan, for EST sequencing and for construction of a dense linkage map and a YAC physical map.(26) The guidelines for the method of sequencing, sequence quality and release policy were developed largely along the same lines as those of the Human Genome Project (http://www.gene.ucl.ac.uk/hugo/bermuda.htm).
The backbone of the IRGSP sequence-ready physical map was derived from a PAC library comprising 71,040 clones(30) and a BAC library consisting of 48,960 clones.(31) The other equally important sources of large insert clones for sequencing were a BAC library (90,000 clones) made at Clemson University Genomics Institute (CUGI)(9) and BAC libraries made by Monsanto.(27) The MTPs for the 12 chromosomes sequenced by IRGSP were largely constructed using these large insert clones. The clones forming the MTP were chosen using the fingerprint patterns, BAC/PAC end sequences and marker information available for each clone.(2) The general strategy employed by IRGSP for sequencing each large insert BAC/PAC clone was to shear the DNA and make two libraries for each clone, with insert sizes of 2 and 5 kb, respectively. On average, 2,000 clones from the two libraries were randomly sequenced from both ends to get 10X coverage and assembled to the phase II level, also referred to as the draft sequence.(29) Assembly used a combination of the base caller PHRED,(32,33) the assembler PHRAP (http://bozeman.mbt.washington.edu) and the sequence viewer and editor CONSED.(34)

The IRGSP had set the target to finish the rice genome sequence by 2008.(26) This goal changed when Monsanto released the draft sequence of ‘japonica’ in 2000.(27) Two other groups, Syngenta and BGI, published drafts of ‘japonica’ and ‘indica’ simultaneously in 2002.(21,22) Due to these developments, IRGSP decided to release the draft (phase II data) before releasing the finished sequence.(14) Consequently, the draft sequence was released by the consortium at a meeting held in Japan in December 2002 (http://rgp.dna.affrc.go.jp/rgp/Dec18_NEWS.html).
This task was speeded up by Monsanto’s decision to provide its BAC libraries, sequenced up to 5X coverage, to IRGSP.(27) The IRGSP draft sequence consisted of 3,380 BAC/PAC clones representing 366 Mb of the rice genome. This sequence covered 92% of the rice genome at >10X level. A total of 62,435 genes were predicted from the non-overlapping draft sequence (http://rgp.dna.affrc.go.jp/rgp/Dec18_NEWS.html).

Syngenta (Torrey Mesa Research Institute, CA, USA) collaborated with Myriad Genetics (Salt Lake City, Utah) to sequence the ‘japonica’ variety of rice.(21) The draft was completed just 14 months after inception of the program.(35) The genome was sequenced using a whole genome shotgun (WGS) strategy. The repeat sequences were removed from the data and the remaining sequence represented 390 Mb of the estimated 420 Mb genome with a coverage of 6X.(21) The number of genes was estimated to be 32,000 to 50,000 using a combination of different prediction programs [FGENESH (monocot), GeneMark.HMM (Arabidopsis and rice) and GENSCAN (Arabidopsis and maize)].

The Beijing Genomics Institute (BGI), China, announced its decision to sequence the ‘indica’ rice genome in May 2000. BGI, like Syngenta, took the WGS route to sequence the rice genome and also released its draft sequence in 2002.(22) The sequence was made public by releasing the data on the BGI website (http://www.btn.genomics.org.cn/rice/). The repeat sequences were identified mathematically: all 20-mer sequences whose frequency was above a particular threshold were categorized as mathematically determined reads (MDRs). On this basis, almost 78 Mb of sequence was identified as repeat sequence. These data were masked using RePS (Repeat masked phrap with scaffolding)(36) and the remaining sequence represented 361 Mb of the estimated 466 Mb genome with a coverage of 4X. Among the different prediction programs used for gene identification, FGENESH was found to be the most useful.
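The frequency-based repeat identification described above for the BGI draft can be sketched as follows. This is a minimal illustration and not the RePS implementation: the function names, the masking character `N`, the threshold values and the short example read are all assumptions, and the example uses 4-mers instead of 20-mers only so that the input stays readable.

```python
from collections import Counter


def kmer_counts(reads: list[str], k: int = 20) -> Counter:
    """Count every k-mer (substring of length k) across all reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def mask_repeats(read: str, counts: Counter, k: int = 20, threshold: int = 3) -> str:
    """Replace with 'N' every base covered by a k-mer seen more than `threshold` times."""
    masked = list(read)
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] > threshold:
            masked[i:i + k] = "N" * k
    return "".join(masked)


# Hypothetical toy input: the run of A's yields the 4-mer "AAAA" three times.
toy_reads = ["AAAAAAGCTTGACC"]
toy_counts = kmer_counts(toy_reads, k=4)
print(mask_repeats(toy_reads[0], toy_counts, k=4, threshold=2))
```

Masking high-frequency k-mers before assembly keeps repeated sequences from seeding false overlaps; the masked bases can be restored once the unique sequence has been assembled into scaffolds.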
FGENESH predicted 46,022 to 55,615 genes in the BGI draft sequence.(22)

After the release of the indica draft sequence,(22) BGI did additional sequencing to get a 6.28X coverage of the genome,(37) which was almost identical to the coverage obtained in the Syngenta draft,(21) although for a different cultivar. For the purpose of analysis, repeats were masked in both draft sequences, which were then reassembled.(36,38) These independently assembled scaffolds from the BGI and Syngenta draft data were combined into super scaffolds in such a way as to obtain the order and orientation information while preserving the SNP differences between the two subspecies.(37) The numbers of genes predicted in the BGI, Syngenta and IRGSP data using FGENESH(39) were 49,088, 45,824 and 43,635, respectively.(37) This reduction in gene number compared to the previous estimates (http://rgp.dna.affrc.go.jp/rgp/Dec18_NEWS.html)(21,22) can be attributed to improved identification and elimination of TE-related genes.(37) Further, the objective of obtaining almost all the rice genes in a single piece was fulfilled by checking the BGI and Syngenta assembled sequences against a collection of 19,079 full-length cDNA clones available in the KOME database. Almost 98% of the genes could be aligned in a single piece to either of the two genomes.(37) The salient features of the updated BGI and Syngenta draft sequences(37) are compared with the IRGSP finished sequence(18) in Table 1.

The draft sequences were not expected to match the quality of the finished sequence, yet proved to be quite useful to the rice research community in general.(14) They were used extensively for identifying genes in rice and for making comparisons with other plant species.
The drafts also accelerated the pace for functional genomics, since the work on microarrays, proteomics and several other genome-wide studies could move much faster due to the ready availability of the sequence information.(29) Other areas of research such as breeding for introgression of better traits and evolutionary studies also gained from the availability of the draft sequence. It was, however, necessary to have the complete sequence information not only for accurate interpretation of the rice genome in its own context, but also to serve as a standard and resource for other cereal genomes. The standard itself should be as reliable as possible to help extrapolation of information in the true sense to other economically valuable cereals.(40)

Table 1. Comparison of BGI and Syngenta draft with IRGSP finished rice genome sequence

Sequencing group   Genome size   Coverage   Assembled contigs   Subspecies/cultivar   Sequencing strategy      Predicted genes
Syngenta*          433 Mb        >6X        46,246              japonica/Nipponbare   WGS                      45,824
BGI*               466 Mb        6.28X      64,052              indica/93-11          WGS                      49,088
IRGSP**            389 Mb        >10X       57                  japonica/Nipponbare   Clone-by-clone shotgun   37,544

Based on reference 38* and 18**. The contigs in Syngenta and BGI draft sequences were linked together to create much larger scaffolds and super scaffolds.

2004, the international year of rice: IRGSP releases the map-based finished rice genome sequence

The year 2004 was declared as the International Year of Rice by the UN General Assembly. The theme of the program was ‘Rice is life’, reflecting the importance of rice as a food crop. The declaration was in recognition of the importance of rice, which provides food to more than half of the world population and is a source of income for millions of rice producers (http://www.fao.org/rice2004). The year also marked the declaration of the complete rice genome sequence by the IRGSP (http://rgp.dna.affrc.go.jp/IRGSP/celebrates/celebrates.html).
To commemorate the International Year of Rice, IRGSP received the Research Accomplishment Award at the world rice research conference for its role in decoding the rice genome sequence (http://rgp.dna.affrc.go.jp/IRGSP/WRRC2004-Award/WRRC2004-Award.html). Before the declaration of the completion of the rice genome, the finished sequence of three chromosomes (1, 4 and 10) had already been published.(41–43) To obtain the finished sequence, more than 4,000 BAC/PAC clones were sequenced, of which 3,401 clones (with at least 10X coverage and 99.9% accuracy) were used to obtain 95% coverage of the 389 Mb rice genome. The size of the genome (389 Mb) was estimated by adding the total length of non-overlapping sequence to the estimated size of the gaps. The finished sequence includes three completely sequenced centromeres (chromosomes 4, 5 and 8).

To reach the phase III level (finished sequence), the sequence of each clone was checked for problem regions. The aim was to obtain an error rate of less than one per 10 kb with the fewest possible gaps (http://demeter.bio.bnl.gov/Guidelines.html). The main problem regions were gaps (physical/sequencing), low-quality regions and misassembled regions. Generally, these problems were solved by one or a combination of the following approaches. For low-quality regions, resequencing was done using universal or custom primers. Sequencing gaps were closed by sequencing bridge clones or PCR fragments, or by direct sequencing of BAC/PAC clones. Physical gaps were filled using PCR fragments or 40 kb fosmid clones. Sequencing using alternate chemistry was done when the normally used chemistry did not yield results. For regions that could not be resolved by these conventional methods, small insert libraries of the region were made or transposons were used to disrupt the difficult region.
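The accuracy figures quoted above can be placed on the Phred quality scale, Q = -10 log10(P), which is the scale reported by base callers such as PHRED. The review does not itself express its targets as Phred scores, so the conversion below is only an illustrative sketch, and the function names are invented for this example.

```python
import math


def phred_from_error(p_error: float) -> float:
    """Phred quality score Q = -10 * log10(P) for a per-base error probability P."""
    return -10 * math.log10(p_error)


def error_from_phred(q: float) -> float:
    """Per-base error probability corresponding to a Phred quality score Q."""
    return 10 ** (-q / 10)


# 99.9% accuracy corresponds to one error per 1,000 bases (P = 1e-3).
q_clone = phred_from_error(1e-3)
# "Less than one error per 10 kb" corresponds to P < 1e-4.
q_finished = phred_from_error(1e-4)
print(round(q_clone), round(q_finished))
```

On this scale the 99.9% clone-level accuracy corresponds to about Q30, and the finishing target of fewer than one error per 10 kb corresponds to better than about Q40.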
Each finished clone was finally confirmed by comparing its in silico restriction pattern with the actual restriction pattern.(18,44) The 370,733,456 bp finished sequence was used to construct 12 chromosome-specific pseudomolecules in 57 contigs with an average continuous sequence length of 6.9 Mb (Fig. 1). A total of 62 physical gaps still remain in the finished sequence, including 9 centromeres and 17 telomeres, constituting 18.1 Mb of the rice genome. The total number of genes predicted for the finished sequence is 37,544. EST markers were used to measure the genome coverage in the finished sequence. Almost 99.4% of the available ESTs were represented in the pseudomolecules.(18) The strength and validity of the gene prediction programs were checked by comparing the predicted genes to full-length cDNAs(45) and ESTs (http://www.ncbi.nlm.nih.gov/dbEST/) available in the database. A total of 61% of the predicted genes showed a match with either a cDNA or an EST.(18)

Figure 1. Pseudomolecules of the 12 rice chromosomes. The participating nations responsible for sequencing each chromosome are given on the top. The arrowheads indicate the location of centromeres and green colour represents the position of physical gaps. The gap on the short arm of chromosome 9 represents the nucleolar organizer consisting of 17S–5.8S–25S rDNA coding units. Values given at the bottom represent estimated (red) and sequenced (green) bases for each chromosome in Mb (modified from reference 18). Estimated/sequenced values (Mb): Chr1 45.05/43.26; Chr2 36.78/35.95; Chr3 37.37/36.18; Chr4 36.15/35.48; Chr5 30.00/29.73; Chr6 31.60/30.73; Chr7 30.28/29.64; Chr8 28.57/28.43; Chr9 30.53/22.69; Chr10 23.96/22.68; Chr11 30.76/28.35; Chr12 27.77/27.56.

What does the rice genome sequence reveal?

General features

The map-based sequence of the rice genome is estimated to cover 95% of the 389 Mb rice genome.
A total of 37,544 genes have been predicted for the complete sequence, with an average gene density of one gene per 9.9 kb and an average gene length of 2,699 bp. Chromosomes 1 and 3 have the highest gene density, with one gene per 8.9 kb and one gene per 8.7 kb, respectively. Chromosomes 11 and 12 have the lowest gene density, with one gene per 10.7 kb and one gene per 11.6 kb, respectively.(18)

The rice genome was estimated to comprise 10–25% repeat elements before the genome sequence became available.(46,47) In the finished sequence, repeats constitute at least 35% of the rice genome. The transposable element content was highest for chromosomes 8 (38%) and 12 (38.3%) and lowest for chromosomes 1 (31%), 2 (29.8%) and 3 (29%). Class II repeat elements such as hAT, CACTA, IS630/Tc1/mariner, IS256/Mutator and IS5/Tourist are more than twice as numerous as class I elements such as LINEs, SINEs, Ty1/copia and Ty3/gypsy. However, class I elements contribute more to the genome (19.4%) than class II elements (12.9%). The presence of class II elements such as IS256/Mutator, IS5/Tourist and IS630/Tc1/mariner in the rice genome correlated with gene density, and they were most frequently present on the first three chromosomes.(18)

Detailed analysis of the rice genome has led to the identification of three main classes of duplications.
The first class of duplication is segmental, involving duplication of a large number of genes along the length of the chromosome.(18,37,48,49) The second class is tandem duplication involving individual genes, and the third is background duplication, accounting for all other duplications that could not be classified into either of the first two categories.(18,37) When only those rice genes showing homology to non-redundant KOME cDNAs were considered for duplication analysis, a total of 18 pairs of duplicated segments covering more than 65% of the length of the mapped superscaffolds were identified in the indica rice genome sequence.(37) Analysis of the japonica rice genome sequence has shown that almost 60% of the rice genome is duplicated.(18) All the chromosomes have duplicated segments; however, the biggest duplicated block is shared between chromosomes 11 and 12. Analysis of the duplicated segments suggests that a whole genome duplication occurred about 55 to 70 million years ago, before the divergence of the major cereals from their common ancestor. Most of the observed duplications can be attributed to this event. However, the chromosome 11–12 duplication is probably recent in origin and represents a segmental duplication, which was earlier predicted to have occurred about 20 million years ago.(21,37,49–51) A recent analysis on the basis of the finished sequence, however, estimates this to have happened as recently as 7.7 million years ago.(52) The quality of sequence and annotation is very important for assessing the age of segmental duplications. Thus, analysis of the duplication events in rice provides evidence for a whole genome duplication, a recent segmental duplication and several individual duplication events.
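The headline figures quoted earlier for the finished sequence (370,733,456 bp sequenced, 18.1 Mb of gaps, 37,544 predicted genes) can be cross-checked with simple arithmetic. The Python sketch below is only an illustration and is not part of the IRGSP analysis. It assumes that gene density is computed over sequenced bases rather than over the estimated genome size, and the function and variable names are hypothetical.

```python
# Figures quoted in the text (reference 18); names are illustrative only.
FINISHED_BP = 370_733_456   # finished sequence in the 12 pseudomolecules
GAP_BP = 18_100_000         # estimated size of the remaining gaps (18.1 Mb)
PREDICTED_GENES = 37_544    # predicted genes, transposon-related genes excluded


def genome_summary(finished_bp, gap_bp, genes):
    """Return estimated genome size (Mb), coverage (%) and kb per gene.

    Assumption: gene density is taken over sequenced bases only.
    """
    genome_bp = finished_bp + gap_bp
    return {
        "genome_mb": genome_bp / 1e6,
        "coverage_pct": 100 * finished_bp / genome_bp,
        "kb_per_gene": finished_bp / genes / 1e3,
    }


summary = genome_summary(FINISHED_BP, GAP_BP, PREDICTED_GENES)
for key, value in summary.items():
    print(f"{key}: {value:.1f}")
```

Under these assumptions the sketch gives a genome size of roughly 389 Mb, coverage of about 95% and close to one gene per 9.9 kb, in line with the values reported in the text.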
Analysis of the finished sequence for organellar insertions showed that there were at least 421 chloroplast and 909 mitochondrial DNA insertions, each contributing 0.2% of the nuclear genome. The pattern of chloroplast and mitochondrial insertions in the rice genome suggests that their transfer processes were independent of each other.(18) In another analysis, the nuclear-localized plastid DNA was similarly found to be 0.2% of the total rice nuclear genome and was predominantly present near the pericentromeric regions. The number of such insertions was highest in chromosome 1 and lowest in chromosomes 9, 10 and 11, whereas the amount of inserted DNA (in kb) was greatest in chromosome 10 and least in chromosome 11. Age distribution analysis revealed that chloroplast–nuclear DNA flux involves a constant process of integration, shuffling and elimination, with 80% of the insertions being eliminated from the nuclear genome within a million years.(53)

The GC content varies widely amongst different organisms, ranging from 26 to 65%.(54) A study of GC content in plant species showed that Gramineae genomes were richer in GC than dicot genomes.(55) The overall GC content of the Arabidopsis genome is 34.7%, with exons at 44.1% and introns at 32.7%.(17) The rice genome has an average GC content of 43.6%, with 54.2% in exons and 38.3% in introns.(18) The GC content of rice is thus much higher than that of Arabidopsis, especially in the coding regions. Another difference between the two plants is that a distinct gradient in GC content exists within rice genes, with the 5′ end having on average 25% more GC content than the 3′ end.
Such a gradient in GC content was not seen in Arabidopsis genes.(22) In another study, the GC content of two Gramineae data sets (rice and maize) was compared with two dicots (Arabidopsis and tobacco) and a similar difference in the gradient of GC content in the direction of transcription was observed.(54)

The centromere is the physical entity on the chromosome that binds microtubules and other centromere-associated proteins and so serves as the point of chromatid segregation during cell division.(56) With the exception of the yeast centromere, which is made of a 125 bp unique sequence, eukaryotic centromeres are known to contain long stretches of repetitive DNA sequences. Because of this, centromeres have long remained recalcitrant to cloning, sequencing and subsequent assembly.(56) Although most rice centromeres, like those of other eukaryotes, are large (>1 Mb) and thus difficult to sequence, some rice centromeres (chromosomes 4, 5 and 8) were smaller and hence could be sequenced fully.(18,57,58) Rice centromeres typically comprise 165 bp CentO satellite repeat sequences and retrotransposon elements.(59) The rice chromosome 4 centromere is 124 kb long, with 18 tracts of 379 CentO repeats (59 kb) and 19 centromeric retroelements forming the core centromere. There were four different types of retroelements, but LTR retrotransposons such as Ty3/gypsy-like retrotransposons constituted the largest retrotransposon family.(58) The chromosome 8 centromere was made of three clusters of CentO repeats spanning 68.5 kb, while for chromosome 5 the size was 50.3 kb. The CentO repeats were tandemly arrayed and interrupted by 220 TE-related sequences, mostly Ty3/gypsy-like retrotransposons. Chromosomes 4 and 8 had similar amounts of CentO repeats but different numbers of retroelements. Surprisingly, 201 ORFs were predicted in the 1.97 Mb region around the chromosome 8 centromere.
The majority of these predicted genes were found to code for hypothetical proteins, but at least 20% showed similarity to known proteins or rice full-length cDNAs.(57) Of these genes, 14 were present in the centromeric region and 12 of these were experimentally confirmed to be functional.(60) The presence of functional genes inside rice centromeres was an interesting finding, since centromeres were previously considered to be transcriptionally silent heterochromatic zones.(61,62) This finding is similar to the presence of genes in human neocentromeres.(63) Possibly, human neocentromeres represent an earlier stage and the rice centromeres (chromosomes 4 and 8) represent an intermediate stage in centromere evolution. Thus, the rice centromeres are probably not fully developed; in due course, the centromeric region would adapt to its role in cell division and accumulate repetitive sequences, and the genes would lose their expression and become transcriptionally silent.(60) The complete sequencing of the rice centromeres has revealed the basic structure of eukaryotic centromeres, has helped identify the minimum sequence required for centromere function and should prove useful for understanding their evolution.

Gene predictions

Gene annotation consists of two basic steps. In the first step, different computer prediction programs are used for gene prediction and, in the second step, the predicted genes are validated using information on gene function available in the database.(64) Computational gene prediction in rice is facilitated by several publicly available databases. The gene prediction programs used for annotating rice by different groups include Genscan (http://genes.mit.edu/GENSCAN.html), FGENESH (http://www.softberry.com/berry.phtml), GeneMark.hmm (http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi), GlimmerR (http://www.tigr.org/software/glimmer/), RiceHMM (http://rgp.dna.affrc.go.jp/RiceHMM/), tRNAscan-SE (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/), SplicePredictor (http://bioinformatics.iastate.edu/cgi-bin/sp.cgi), GeneSplicer (http://www.tigr.org/tdb/GeneSplicer/), GeneFinder (http://rulai.cshl.org/tools/genefinder/) and NetGene2 (http://www.cbs.dtu.dk/services/NetGene2/).(65) Of these, FGENESH has been found to be the most useful prediction tool available for rice.(22) Several websites provide detailed information about rice gene annotation. These include RiceGAAS, the Rice Genome Automated Annotation System (http://RiceGAAS.dna.affrc.go.jp), TIGR, The Institute for Genomic Research (http://www.tigr.org/tdb/e2k1/osa1), Gramene (http://www.gramene.org/) and MOsDB, the MIPS Oryza sativa DataBase (http://mips.gsf.de/proj/plant/jsf/rice/index.jsp).(66–69) The availability of 400,000 ESTs and at least 32,000 full-length cDNA clones has helped to a large extent in the validation of computational gene prediction in rice (http://www.ncbi.nlm.nih.gov/UniGene, cdna01.dna.affrc.go.jp/cDNA).(70) The rice genome annotation at TIGR, named Osa1 (Oryza sativa 1), is the most widely used database for certain major projects of gene array, transcriptomics and annotation, as it provides details of annotation and sequence assembly of the rice genome.(71) The annotation details of each gene model are linked to its functional information, such as expression data, gene ontologies and tagged lines (http://www.tigr.org/tdb/e2k1/osa1). Recently, another database, the Rice Annotation Project Database (RAP-DB), has been made public. It utilizes the IRGSP assembly and can be accessed through http://rapdb.lab.nig.ac.jp/.(72)

The number of genes predicted in the finished rice genome sequence is 37,544.(18) In addition, a 7 Mb region on chromosome 9 and a 0.25 Mb region on chromosome 11 code for ribosomal RNA. A total of 763 tRNA genes were also predicted.
The number of predicted genes (37,544) is smaller than the numbers predicted for the draft sequences. In the IRGSP data, the large difference between the number of genes predicted for the previously finished chromosomes 1, 4 and 10 or the draft sequences and the number predicted for the finished genome sequence is explained by an improved annotation process. The improved process excludes 17,752 transposon-related genes: FGENESH predicted a total of 55,296 genes for the finished sequence, a number comparable to the estimates from the previously finished chromosomes or the draft sequences. The view that the rice genes without Arabidopsis homologues could include wrongly predicted genes is also supported by the finding that this subset of genes differs markedly in its features from the rest of the rice genes. These differences include smaller size, more introns and unusual 3′ GC richness. Another striking feature of these genes is that only a very small percentage is supported by EST data.(73) To test this point further, these rice genes were annotated using the maize transcriptome data (representing more than 80% of maize genes). Only 15% of the rice genes lacking Arabidopsis homologues were supported by maize ESTs. Further, manual annotation showed that at least 30% of these genes were transposable elements.(73) This study supports the number of genes predicted by IRGSP for the finished sequence.(18) It is possible that the predicted rice genes that are not supported by ESTs could be supported by expression evidence from other functional genomics approaches, such as tiling microarrays(74–76) or MPSS (http://mpss.udel.edu/rice), and only detailed analysis will give a true picture of the nature of these genes.

The 37,544 genes predicted from the IRGSP finished sequence could be classified into 3,328 different types of domains. Out of the most abundant domains predicted, five were protein kinases.
More than half of the predicted proteins could be associated with a biological process.(18) A total of 71% of the predicted rice gene products had homologues in Arabidopsis, while the percentages of rice gene products with homologues in humans, Drosophila, C. elegans, yeast, Synechocystis and E. coli were 40.8%, 38%, 36.5%, 30.2%, 17.6% and 10.2%, respectively (Fig. 2).

Figure 2. Comparison of predicted rice proteins with proteins from model organisms at different e-value cut-offs (modified from reference 18). Bar chart; y-axis: number of predicted proteins (0 to 26,000); organisms compared: Arabidopsis, Drosophila, human, C. elegans, yeast, Synechocystis and E. coli; bars are divided by e-value bin, from E-5 to E-10 up to below E-200.

Comparison of the two sequenced plant genomes

Rice and Arabidopsis are distantly related species that diverged about 200 million years ago.(77) The rice genome is about three times larger and has almost 50% more genes than the Arabidopsis genome.(18,65) The largest syntenic region observed to date between the two organisms was identified in an analysis of the chromosome 4 finished sequence with the Arabidopsis genome. The syntenic region covered 119 Arabidopsis proteins showing an identity of at least 70% over a minimum stretch of 30 amino acids. This analysis revealed that collinearity between the two genomes exists but is preserved only to a small extent.(21) Analysis of the Arabidopsis genome revealed that only 35% of its genes were unique, while at least 17% of the genes were tandemly duplicated.(17) Similarly, in rice, almost 60% of the genome is duplicated, while 14% of its genes are tandemly duplicated.(18) The high number of duplicated genes in both plant genomes indicates that gene diversity in plants has probably arisen through genome duplication.(22) In the recent study done with the IRGSP data, almost 90% of Arabidopsis proteins had rice homologues, while 71% of predicted rice proteins had an Arabidopsis homologue. To eliminate the possibility of wrongly predicted genes, the homology search was also done using only those predicted rice genes that were supported by an EST or a cDNA, and the percentage of predicted rice genes with an Arabidopsis homologue increased to 88%.(18)

Comparison of the predicted rice and Arabidopsis genes shows that the two organisms share many genes. These include most of the disease- and flowering-related genes, phosphate transporters, transcription factors and genes involved in metabolism. In contrast, several genes that are present in other sequenced organisms are absent from both plant genomes. These include members of gene families encoding nuclear steroid receptors, p53, Notch/lin12, Janus kinase (JAK) and Signal Transducers and Activators of Transcription (STAT). Another important category comprises genes that are present in only one of the two plants. Some of the Arabidopsis genes that do not have rice homologues are FRIGIDA, FLOWERING LOCUS C, UNUSUAL FLORAL ORGANS and SUPERMAN amongst the flowering-related genes, and TIR-NB-LRR (Toll-Interleukin-Region-Nucleotide-Binding site-Leucine-Rich Repeat) amongst the disease-related genes.(21,25) Amongst the predicted rice genes, 8% do not show homologues in Arabidopsis and include well-known cereal-specific genes like prolamins along with several proteins such as chitinase precursor, seed allergen, starch branching enzyme, wound-induced protease inhibitor and abscisic stress ripening protein.(18) However, the majority of these genes either show no hits in the database or match only hypothetical proteins.
The basic difference between monocots and dicots will become clear only when the function of these largely unknown cereal-specific genes becomes clear.(18)

Comparison of rice with cereal genomes

The cereals diverged from their common ancestor around 60 million years ago.(78) Despite this period of independent evolution, the genes as well as their order in cereals are seen to be quite conserved.(12) A major advantage of sequencing the rice genome was its syntenic relationship with other cereal species.(26) Analysis of the rice genome sequence draft showed that homologues of almost 98% of wheat, barley and maize proteins could be identified in rice.(21) However, most of the analyses that report a strong syntenic relationship among cereals have been done at a low resolution due to the limited number of common markers available.(79) There are several instances where collinearity between cereals was found to be disrupted when studied at a higher resolution. For instance, high-resolution mapping was done for studying wheat–rice synteny using a total of 4,485 wheat ESTs for comparison with the rice genome sequence. The analysis revealed that there was a general conservation of genes and their order in the two species. However, several breaks in the collinearity were observed.(79) In a similar study, 2,932 predicted genes from the long arm of chromosome 11 were compared with wheat ESTs. Although the genes were conserved in the analyzed region, several rearrangements could be seen that disrupted the gene order.(80) An analysis of rice sequence with 2,629 maize markers identified 656 putative orthologs but revealed several breaks in collinearity.(81) Similar sequence-based alignments of rice with sorghum and barley revealed that there were some rearrangements along with a general conservation of synteny.(82,83) Also, there are studies where identifying candidate genes on the basis of synteny did not prove useful.
For instance, attempts to identify the Rph7 (leaf rust resistance) gene and the Ppd-H1 (photoperiod response) gene in barley on the basis of their expected collinearity in rice did not yield the expected results.(84,85) However, comparative genomics based on the syntenic relationship of rice with other cereals has helped in identifying several important genes, such as the QTL for malting quality in barley, a major heading date QTL in perennial ryegrass, the liguleless region in sorghum and Ror2, a gene conferring resistance to powdery mildew disease in barley.(86–89) Hence, from these studies, it seems that cross-species comparison would be useful in identifying genes of interest. However, in each case, collinearity will have to be investigated at the micro level in the region of interest using high-density genetic maps.(90)

Conclusions

Sequencing of the rice genome was initiated by four different groups. Monsanto and IRGSP used the clone-by-clone approach, while Syngenta and BGI made use of the WGS approach to sequence the rice genome. Amongst these groups, only the publicly funded IRGSP was interested in the complete sequence information, while the other groups pursued sequencing for gene discovery and marker information.(14) Analysis of the rice genome sequence has confirmed the syntenic relationship amongst cereal crops.(21) However, it also shows that collinearity is not as well preserved as previously thought.
Hence, information from rice can be extrapolated to other cereal crops, but only after studying microcollinearity in the region of interest.(90) A total of 18,828 SSRs and a large number of SNPs (0.5–0.8%) have been identified in the finished rice genome sequence, which will aid map-based cloning.(18) In fact, the availability of the sequence has already facilitated such efforts, and several rice genes have been cloned utilizing the rice genome sequence information: Hd1, a major photoperiod-sensitive QTL; PLASTOCHRON1, a regulator of leaf initiation; Spl7, a heat-stress transcription factor; Rf-1, a fertility-restorer gene; Xa26, a gene conferring resistance to Xanthomonas oryzae pv. oryzae; and Gn1a, a cytokinin oxidase gene representing a QTL for grain production.(91–96)

Box 1. Glossary of terms
Bacterial artificial chromosome (BAC): A bacterial cloning vector that can typically carry 100–150 kb insert DNA.
Contig: A contiguous DNA sequence generated by assembling overlapping sequences.
Draft sequence: An incomplete sequence in terms of both contiguity and likelihood of errors.
Finished sequence: A sequence with an error rate of less than one error per 10 kb, assembled in the correct order and orientation with the fewest possible gaps.
Minimum tiling path (MTP): The least number of overlapping clones that span a chromosomal region.
P1-derived artificial chromosome (PAC): A cloning vector derived from P1 phage that can typically carry 100–150 kb insert DNA.
Retrotransposon: A type of transposon that can move by producing an RNA intermediate.
Scaffolds: An ordered set of contigs placed on the chromosome.
Sequencing gap: A gap in the sequence that can be filled by sequencing of bridge clones available in the region.
Shotgun sequencing: An approach to sequence DNA by breaking it into a large number of fragments that can be sequenced individually.
Transposon: Any segment of DNA that can change its position in the genome.
Yeast artificial chromosome (YAC): A high-capacity cloning vector that can typically carry 300–400 kb DNA.

The sequence availability of two plant genomes (Arabidopsis and rice) provided an opportunity to compare their sequences and understand the special features of plant genomes. Plants have a much larger number of genes than other sequenced organisms. This is mainly due to the higher number of duplicated genes.(21) Thus, plant genomes seem to have evolved through polyploidization and subsequent gene loss.(97) There is also a need for functional validation of predicted genes in rice and Arabidopsis to serve as the landmark for other plant genomes for which full sequencing will probably never be done.(98–100) The recent use of the rice genome sequence in microarray projects indicates its importance as a tool for global gene expression profiling, discovery of new genes and validation of computational gene predictions.(74–76,101) Rice is also amongst the few organisms for which sequences are available in two subspecies. Analysis of the alignments shows that, although the genes are highly conserved in the two subspecies, the major difference lies in the intergenic regions.(37) Thus, the availability of the rice genome sequences has given a deeper insight into the gene content, regulatory elements and the nature of repeats in its genome.(18) But, in the end, the real worth of the rice genome sequence will be measured in terms of the agro-economic benefits. The huge efforts made in studying the rice genome will finally be justified when the sequence information is used in developing better rice varieties with greater yield and enhanced tolerance to various abiotic and biotic stresses.

References

1. Peng S, Huang J, Sheehy JE, Laza RC, Visperas RM, et al. 2004. Rice yields decline with higher night temperature from global warming. Proc Natl Acad Sci USA 101:9971–9975. 2.
Sasaki T, Matsumoto T, Antonio BA, Nagamura Y. 2005. From mapping to sequencing, post-sequencing and beyond. Plant Cell Physiol 46:3– 13. 3. Sasaki T. 1998. The rice genome project in Japan. Proc Natl Acad Sci USA 95:2027–2028. 4. Harushima Y, Yano M, Shomura A, Sato M, Shimano T, et al. 1998. A high-density rice genetic linkage map with 2275 markers using a single F2 population. Genetics 148:479–494. 5. Umehara Y, Inagaki A, Tanoue H, Yasukochi Y, Nagamura Y, et al. 1995. Construction and characterization of a rice YAC library for physical mapping. Mol Breed 1:79–89. 6. Saji S, Umehara Y, Antonio BA, Yamane H, Tanoue H, et al. 2001. A physical map with yeast artificial chromosome (YAC) clones covering 63% of the 12 rice chromosomes. Genome 83:32–37. 7. Yamamoto K, Sasaki T. 1997. Large-scale EST sequencing in rice. Plant Mol Biol 35:135–144. 8. Wu J, Maehara T, Shimokawa T, Yamamoto S, Harada C, et al. 2002. A comprehensive rice transcript map containing 6591 expressed sequence tag sites. Plant Cell 14:525–535. 9. Chen M, Presting G, Barbazuk WB, Goicoechea JL, Blackmon B, et al. 2002. An integrated physical and genetic map of the rice genome. Plant Cell 14:537–545. 10. Tyagi AK, Mohanty A. 2000. Rice transformation for crop improvement and functional genomics. Plant Science 158:1–18. 11. Moore G, Devos KM, Wang Z, Gale MD. 1995. Grasses, line up and form a circle. Curr Biol 5:737–739. 12. Gale MD, Devos KM. 1998. Plant comparative genetics after 10 years. Science 282:656–659. 13. Goff SA. 1999. Rice as a model for cereal genomics. Curr Opin Plant Biol 2:86–89. 14. Buell CR. 2002. Obtaining the sequence of the rice genome and lessons learned along the way. Trends Plant Sci 7:538–542. 15. Green ED. 2001. Strategies for the systematic sequencing of complex genomes. Nat Rev Genet 2:573–583. 16. Waterston RH, Lander ES, Sulston JE. 2002. On the sequencing of the human genome. Proc Natl Acad Sci USA 99:3712–3716. 17. The Arabidopsis Genome Initiative. 2000. 
Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–815. 18. International Rice Genome Sequencing Project. 2005. The map-based sequence of the rice genome. Nature 436:793–800. 19. International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921. 20. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. 2001. The sequence of the human genome. Science 291:1304–1351. 21. Goff SA, Ricke D, Lan TH, Presting G, Wang R, et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296:92–100. 22. Yu J, Hu S, Wang J, Wong GK, Li S, et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296:79–92. 23. Izawa T, Shimamoto K. 1996. Becoming a model plant: the importance of rice to plant science. Trends Plant Sci 1:95–99. 24. Buell CR. 2002. Current status of the sequence of the rice genome and prospects for finishing the first monocot genome. Plant Physiol 130:1585–1586. 25. Delseny M. 2003. Towards an accurate sequence of the rice genome. Curr Opin Plant Biol 6:101–105. 26. Sasaki T, Burr B. 2000. International Rice Genome Sequencing Project: the effort to completely sequence the rice genome. Curr Opin Plant Biol 3:138–141. 27. Barry GF. 2001. The use of the Monsanto draft rice genome sequence in research. Plant Physiol 125:1164–1165. 28. Eckardt NA. 2000. Sequencing the rice genome. Plant Cell 12:2011–2017. 29. Tyagi AK, Khurana JP, Khurana P, Raghuvanshi S, Gaur A, et al. 2004. Structural and functional analysis of rice genome. J Genet 83:79–99. 30. Baba T, Katagiri S, Tanoue H, Tanaka R, Chiden Y, et al. 2000. Construction and characterization of rice genomic libraries: PAC library of japonica variety, Nipponbare and BAC library of indica variety, Kasalath. Bulletin of the NIAR 14:41–49. 31. Wu J, Mizuno H, Hayashi-Tsugane M, Ito Y, Chiden Y, et al. 2003.
Physical maps and recombination frequency of six rice chromosomes. Plant J 36:720–730. 32. Ewing B, Hillier L, Wendl MC, Green P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8:175–185. 33. Ewing B, Green P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 8:186–194. 34. Gordon D, Abajian C, Green P. 1998. Consed: a graphical tool for sequence finishing. Genome Res 8:195–202. 35. Davenport RJ. 2001. Rice genome. Syngenta finishes, consortium goes on. Science 291:807. 36. Wang J, Wong GK, Ni P, Han Y, Huang X, et al. 2002. RePS: a sequence assembler that masks exact repeats identified from the shotgun data. Genome Res 12:824–831. 37. Yu J, Wang J, Lin W, Li S, Li H, et al. 2005. The genomes of Oryza sativa: A history of duplications. PLoS Biol 3:e38. 38. Zhong L, Zhang K, Huang X, Ni P, Han Y, et al. 2003. A statistical approach designed for finding mathematically defined repeats in shotgun data and determining the length distribution of clone-inserts. Genomics Proteomics Bioinformatics 1:43–51. 39. Salamov A, Solovyev V. 2000. Ab initio gene finding in Drosophila genomic DNA. Genome Res 10:516–522. 40. Leach J, McCouch S, Slezak T, Sasaki T, Wessler S. 2002. Why finishing the rice genome matters. Science 296:45. 41. Sasaki T, Matsumoto T, Yamamoto K, Sakata K, Baba T, et al. 2002. The genome sequence and structure of rice chromosome 1. Nature 420:312–316. 42. Feng Q, Zhang Y, Hao P, Wang S, Fu G, et al. 2002. Sequence and analysis of rice chromosome 4. Nature 420:316–320. 43. The Rice Chromosome 10 Sequencing Consortium. 2003. In-depth view of structure, activity, and evolution of rice chromosome 10. Science 300:1566–1569. 44. de la Bastide M, Johnson D, Balija V, McCombie WR. 2001. Strategies and techniques for finishing genomic sequence. In: Khush GS, Brar DS, Hardy B, editors. Rice Genetics IV. New Delhi: Science Publishers, Inc. pp 197–213. 45. 
Kikuchi S, Satoh K, Nagata T, Kawagashira N, Doi K, et al. 2003. Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science 301:376–379. 46. Mao L, Wood TC, Yu Y, Budiman MA, Tomkins J, et al. 2000. Rice transposable elements: a survey of 73,000 sequence-tagged-connectors. Genome Res 10:982–990. 47. Turcotte K, Srinivasan S, Bureau T. 2001. Survey of transposable elements from rice genomic sequences. Plant J 25:169–179. 48. Wang S, Wang J, Jiang J, Zhang Q. 2000. Mapping of centromeric regions on the molecular linkage map of rice (Oryza sativa L.) using centromere-associated sequences. Mol Gen Genet 263:165–172. 49. Paterson AH, Bowers JE, Chapman BA. 2004. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc Natl Acad Sci USA 101:9903–9908. 50. Salse J, Piégu B, Cooke R, Delseny M. 2002. Synteny between Arabidopsis thaliana and rice at the genome level: a tool to identify conservation in the ongoing rice genome sequencing project. Nucleic Acids Res 30:2316–2328. 51. Vandepoele K, Simillion C, Van de Peer Y. 2003. Evidence that rice and other cereals are ancient aneuploids. Plant Cell 15:2192–2202. 52. The Rice Chromosomes 11 and 12 Sequencing Consortia. 2005. The sequence of rice chromosomes 11 and 12, rich in disease resistance genes and recent gene duplications. BMC Biology 3:20. 53. Matsuo M, Ito Y, Yamauchi R, Obokata J. 2005. The rice nuclear genome continuously integrates, shuffles, and eliminates the chloroplast genome to cause chloroplast-nuclear DNA flux. Plant Cell 17:665–675. 54. Wong GK, Wang J, Tao L, Tan J, Zhang J, et al. 2002. Compositional gradients in Gramineae genes. Genome Res 12:851–856. 55. Carels N, Bernardi G. 2000. Two classes of genes in plants. Genetics 154:1819–1825. 56. Cooke HJ. 2004. Silence of the centromeres-not. Trends Biotechnol 22:319–321. 57. Wu J, Yamagata H, Hayashi-Tsugane M, Hijishita S, Fujisawa M, et al. 2004.
Composition and structure of the centromeric region of rice chromosome 8. Plant Cell 16:967–976.
58. Zhang Y, Huang Y, Zhang L, Li Y, Lu T, et al. 2004. Structural features of the rice chromosome 4 centromere. Nucleic Acids Res 32:2023–2030.
59. Lamb JC, Theuri J, Birchler JA. 2004. What’s in a centromere? Genome Biol 5:239.
60. Nagaki K, Cheng Z, Ouyang S, Talbert PB, Kim M, et al. 2004. Sequencing of a rice centromere uncovers active genes. Nat Genet 36:138–145.
61. Hosouchi T, Kumekawa N, Tsuruoka H, Kotani H. 2002. Physical map-based sizes of the centromeric regions of Arabidopsis thaliana chromosomes 1, 2, and 3. DNA Res 9:117–121.
62. Nagaki K, Talbert PB, Zhong CX, Dawe RK, Henikoff S, et al. 2003. Chromatin immunoprecipitation reveals that the 180-bp satellite repeat is the key functional DNA element of Arabidopsis thaliana centromeres. Genetics 163:1221–1225.
63. Saffery R, Sumer H, Hassan S, Wong LH, Craig JM, et al. 2003. Transcription within a functional human centromere. Mol Cell 12:509–516.
64. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, et al. 2000. The genome sequence of Drosophila melanogaster. Science 287:2185–2195.
65. Schoof H, Karlowski WM. 2003. Comparison of rice and Arabidopsis annotation. Curr Opin Plant Biol 6:106–112.
66. Sakata K, Nagamura Y, Numa H, Antonio BA, Nagasaki H, et al. 2002. RiceGAAS: an automated annotation system and database for rice genome sequence. Nucleic Acids Res 30:98–102.
67. Yuan Q, Ouyang S, Liu J, Suh B, Cheung F, et al. 2003. The TIGR rice genome annotation resource: annotating the rice genome and creating resources for plant biologists. Nucleic Acids Res 31:229–233.
68. Ware DH, Jaiswal P, Ni J, Yap IV, Pan X, et al. 2002. Gramene, a tool for grass genomics. Plant Physiol 130:1606–1613.
69. Karlowski WM, Schoof H, Janakiraman V, Stuempflen V, Mayer KF. 2003. MOsDB: an integrated information resource for rice genomics. Nucleic Acids Res 31:190–192.
70. Rensink WA, Buell CR. 2004.
Arabidopsis to rice. Applying knowledge from a weed to enhance our understanding of a crop species. Plant Physiol 135:622–629.
71. Yuan Q, Ouyang S, Wang A, Zhu W, Maiti R, et al. 2005. The Institute for Genomic Research Osa1 rice genome annotation database. Plant Physiol 138:18–26.
72. Ohyanagi H, Tanaka T, Sakai H, Shigemoto Y, Yamaguchi K, et al. 2006. The Rice Annotation Project Database (RAP-DB): hub for Oryza sativa ssp. japonica genome information. Nucleic Acids Res 1:D741–D744.
73. Bennetzen JL, Coleman C, Liu R, Ma J, Ramakrishna W. 2004. Consistent over-estimation of gene number in complex plant genomes. Curr Opin Plant Biol 7:732–736.
74. Jiao Y, Jia P, Wang X, Su N, Yu S, et al. 2005. A tiling microarray expression analysis of rice chromosome 4 suggests a chromosome-level regulation of transcription. Plant Cell 17:1641–1657.
75. Li L, Wang X, Xia M, Stolc V, Su N, et al. 2005. Tiling microarray analysis of rice chromosome 10 to identify the transcriptome and relate its expression to chromosomal architecture. Genome Biol 6:R52.
76. Li L, Wang X, Stolc V, Li X, Zhang D, et al. 2006. Genome-wide transcription analyses in rice using tiling microarrays. Nature Genet 38:124–129.
77. Wolfe KH, Gouy M, Yang Y-W, Sharp PM, Li W-H. 1989. Date of the monocot-dicot divergence estimated from chloroplast DNA sequence data. Proc Natl Acad Sci USA 86:6201–6205.
78. Petsko GA. 2002. Grain of truth. Genome Biol 3:1007.
79. Sorrells ME, La Rota M, Bermudez-Kandianis CE, Greene RA, Kantety R, et al. 2003. Comparative DNA sequence analysis of wheat and rice genomes. Genome Res 13:1818–1827.
80. Singh NK, Raghuvanshi S, Srivastava SK, Gaur A, Pal AK, et al. 2004. Sequence analysis of the long arm of rice chromosome 11 for rice-wheat synteny. Funct Integr Genomics 4:102–117.
81. Salse J, Piégu B, Cooke R, Delseny M. 2004. New in silico insight into the synteny between rice (Oryza sativa L.) and maize (Zea mays L.) highlights reshuffling and identifies new duplications in the rice genome. Plant J 38:396–409.
82. Dubcovsky J, Ramakrishna W, SanMiguel PJ, Busso CS, Yan L, et al. 2001. Comparative sequence analysis of colinear barley and rice bacterial artificial chromosomes. Plant Physiol 125:1342–1353.
83. Klein PE, Klein RR, Vrebalov J, Mullet JE. 2003. Sequence-based alignment of sorghum chromosome 3 and rice chromosome 1 reveals extensive conservation of gene order and one major chromosomal rearrangement. Plant J 34:605–621.
84. Dunford RP, Yano M, Kurata N, Sasaki T, Huestis G. 2002. Comparative mapping of the barley Ppd-H1 photoperiod response gene region, which lies close to a junction between two rice linkage segments. Genetics 161:825–834.
85. Brunner S, Keller B, Feuillet C. 2003. A large rearrangement involving genes and low-copy DNA interrupts the microcollinearity between rice and barley at the Rph7 locus. Genetics 164:673–683.
86. Han F, Kleinhofs A, Ullrich SE, Kilian A, Yano M. 1998. Synteny with rice-analysis of barley malting quality QTLs and RPG4 chromosomal regions. Genome 41:373–380.
87. Zwick MS, Islam-Faridi MN, Czeschin DG, Wing RA, Hart GE, et al. 1998. Physical mapping of the liguleless linkage group in Sorghum bicolor using rice RFLP-selected sorghum BACs. Genetics 148:1983–1992.
88. Collins NC, Thordal-Christensen H, Lipka V, Bau S, Kombrink E, et al. 2003. SNARE-protein-mediated disease resistance at the plant cell wall. Nature 425:973–977.
89. Armstead IP, Turner LB, Farrell M, Skot L, Gomez P, et al. 2004. Synteny between a major heading-date QTL in perennial ryegrass (Lolium perenne L.) and the Hd3 heading-date locus in rice. Theor Appl Genet 108:822–828.
90. La Rota M, Sorrells ME. 2004. Comparative DNA sequence analysis of mapped wheat ESTs reveals the complexity of genome relationships between rice and wheat. Funct Integr Genomics 4:34–46.
91. Yano M, Katayose Y, Ashikari M, Yamanouchi U, Monna L, et al. 2000. Hd1, a major photoperiod sensitivity quantitative trait locus in rice, is closely related to the Arabidopsis flowering time gene CONSTANS. Plant Cell 12:2473–2484.
92. Yamanouchi U, Yano M, Lin H, Ashikari M, Yamada K. 2002. A rice spotted leaf gene, Spl7, encodes a heat stress transcription factor protein. Proc Natl Acad Sci USA 99:7530–7535.
93. Komori T, Ohta S, Murai N, Takakura Y, Kuraya Y, et al. 2004. Map-based cloning of a fertility restorer gene, Rf-1, in rice (Oryza sativa L.). Plant J 37:315–325.
94. Miyoshi K, Ahn BO, Kawakatsu T, Ito Y, Itoh J, et al. 2004. PLASTOCHRON1, a timekeeper of leaf initiation in rice, encodes cytochrome P450. Proc Natl Acad Sci USA 101:875–880.
95. Sun X, Cao Y, Yang Z, Xu C, Li X, et al. 2004. Xa26, a gene conferring resistance to Xanthomonas oryzae pv. oryzae in rice, encodes an LRR receptor kinase-like protein. Plant J 37:517–527.
96. Ashikari M, Sakakibara H, Liu S, Yamamoto T, Takashi T, et al. 2005. Cytokinin oxidase regulates rice grain production. Science 309:741–745.
97. Bancroft I. 2002. Insights into cereal genomes from two draft genome sequences of rice. Genome Biol 3:10–15.
98. Mayer K, Mewes H-K. 2001. How can we deliver the large plant genomes? Strategies and perspectives. Curr Opin Plant Biol 5:173–177.
99. Rabinowicz PD, McCombie WR, Martienssen RA. 2003. Gene enrichment in plant genomic shotgun libraries. Curr Opin Plant Biol 6:150–156.
100. Barbazuk WB, Bedell JA, Rabinowicz PD. 2005. Reduced representation sequencing: a success in maize and a promise for other plant genomes. Bioessays 27:839–848.
101. Rensink WA, Buell CR. 2005. Microarray expression profiling resources for plant genomics. Trends Plant Sci 10:603–609.