D690–D697 Nucleic Acids Research, 2009, Vol. 37, Database issue
doi:10.1093/nar/gkn828
Published online 25 November 2008
Ensembl 2009
T. J. P. Hubbard1,*, B. L. Aken1, S. Ayling1, B. Ballester2, K. Beal2, E. Bragin1, S. Brent1,
Y. Chen2, P. Clapham1, L. Clarke1, G. Coates1, S. Fairley1, S. Fitzgerald2,
J. Fernandez-Banet1, L. Gordon2, S. Graf2, S. Haider2, M. Hammond2, R. Holland2,
K. Howe1, A. Jenkinson2, N. Johnson2, A. Kahari2, D. Keefe2, S. Keenan2, R. Kinsella2,
F. Kokocinski1, E. Kulesha2, D. Lawson2, I. Longden2, K. Megy2, P. Meidl2,
B. Overduin2, A. Parker1, B. Pritchard1, D. Rios2, M. Schuster2, G. Slater2, D. Smedley2,
W. Spooner2, G. Spudich2, S. Trevanion1, A. Vilella2, J. Vogel1, S. White1, S. Wilder2,
A. Zadissa1, E. Birney2, F. Cunningham2, V. Curwen1, R. Durbin1,
X. M. Fernandez-Suarez2, J. Herrero2, A. Kasprzyk2, G. Proctor2, J. Smith1,
S. Searle1 and P. Flicek2
1Wellcome Trust Sanger Institute and 2European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome
Campus, Hinxton, Cambridge, CB10 1SD, UK
Received and Accepted October 14, 2008
ABSTRACT
The Ensembl project (http://www.ensembl.org) is
a comprehensive genome information system featuring an integrated set of genome annotation, databases, and other information for chordate, selected
model organism and disease vector genomes. As of
release 51 (November 2008), Ensembl fully supports
45 species, and three additional species have preliminary support. New species in the past year
include orangutan and six additional low coverage
mammalian genomes. Major additions and improvements to Ensembl since our previous report include
a major redesign of our website; generation of multiple genome alignments and ancestral sequences
using the new Enredo-Pecan-Ortheus pipeline and
development of our software infrastructure, particularly to support the Ensembl Genomes project
(http://www.ensemblgenomes.org/).
INTRODUCTION
The genome sequence of an organism provides a natural
index for organizing and understanding biological data.
The Ensembl project provides a comprehensive genome
information system consisting of data storage, integration,
analysis and visualization of a wide variety of biological
data. Ensembl’s primary focus is on providing gene
annotation and comparative genome integration for chordate genomes, the vast majority of which are vertebrates. Ensembl concentrates particularly on mammalian
genomes, having developed initially around the human
genome sequence. In comparison to similar projects
based at the University of California Santa Cruz (1) and
the National Center for Biotechnology Information (2),
some of the distinguishing characteristics of the Ensembl
project are:
(1) It provides consistent sets of annotation data within
and between genomes:
– It provides a geneset for each genome, generated
from an automatic pipeline where no manually
curated geneset exists, with stable identifiers
which are tracked between Ensembl releases.
– It provides relationships between genes and genomes in a comparative genomics framework in
the form of sequence alignments, ortholog and
paralog assignments and genetrees, again generated from an automatic pipeline where no manually curated relationships exist.
(2) It is a completely open project, not only through
providing downloads of all data and software
source code, but through multiple levels of programmatic access:
– It allows its database system to be programmed
against using the Ensembl API, a powerful object-oriented
representation of biological entities
(e.g. genes) implemented in the Perl programming
language.
– It allows its genome annotations, alignments,
variation and functional genomics data to be
dynamically federated with external data sources
via the DAS protocol (3,4) and visualized
through many of its website interfaces (http://www.ensembl.org/).
– It allows its datasets to be dynamically federated
with external datasets for data mining using the
BioMart system (5).
*To whom correspondence should be addressed. Tel: +44 1223 496886; Fax: +44 1223 494919; Email: th@sanger.ac.uk
© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
RESULTS
Ensembl Web site redesign
The majority of users access Ensembl through its web
interface, making it a critical component of the project.
It is generally recognized that major factors influencing
website usability are speed and discoverability. As websites grow and their underlying databases become more
complex, individual web pages tend to become larger,
more complex and slower to display and it becomes
harder for users to discover new functionality and navigate to the pages most appropriate to their query. The case
of Ensembl is no different: the data contained in its databases is presented in different ways through a number of
different ‘views’, which have been progressively added as
The Ensembl project is now being joined by the Ensembl
Genomes project (http://www.ensemblgenomes.org/),
which will use Ensembl technology to ultimately provide
a common interface to genomes across biology.
A continuing driver for developments in Ensembl is its
active involvement in many data generation and analysis
projects. Recent examples have been the Rat haplotype
project (6–8) and the ENCODE project (9). Dealing
with data generated by the ENCODE project in particular
has led to the development of specific algorithms for
experimental data handling, such as approaches for
designing and assessing whole genome tiling arrays (10).
Ensembl has continued to be strongly involved in analysis
for publications of new vertebrate genome sequences,
particularly through its genesets (11–13) (see below).
This report lists only some of the new features, new data
and other improvements that we have added to Ensembl
since our last report (14). Users interested in the most
up-to-date details of the Ensembl project should visit the
Ensembl main page (http://www.ensembl.org) and follow
the ‘What’s new’ link and/or subscribe to the low-volume
‘Ensembl announce’ mailing list by sending an email with ‘subscribe ensembl-announce’ as the message body to majordomo@ebi.ac.uk. There is also an Ensembl blog (http://ensembl.blogspot.com/) and associated RSS feeds which
in particular cover upcoming Ensembl training courses
around the world (see below). Users with questions
about Ensembl can consult the extensive online help,
FAQ and tutorial materials (15) (including animated tutorials) or contact the Ensembl helpdesk through the website
or by emailing helpdesk@ensembl.org.
the project has developed, starting with relatively straightforward views summarizing information about a given gene,
or displaying a region of genome sequence (16), to increasingly complex views such as TranscriptSNPView (17)
showing sequence variation within a given transcript
across a set of strains or individuals. At the same time,
the amount of data contained in many views has grown,
for example the increased number of species has greatly
added to the data presented in views containing comparative genome information. It is not straightforward to identify bottlenecks for users in web based systems. For
example, analysing web log files does not easily distinguish
between web pages which are of interest to a limited
number of users and pages which most users have not
discovered. Perceived web site performance can also be
very different for different users as a result of different
browsers, desktop machines and network speeds. Since
the last report (14) considerable effort has been invested
in understanding and addressing these issues, culminating
in a substantially redesigned and reengineered website
from Ensembl release 51 (November 2008).
In the new design (release 51) the web-code has been
completely re-developed with improved speed as a high
priority. The changes result in substantially smaller web
pages which load much faster. A single page now requires
far fewer network connections to the web servers, which
substantially improves performance for users distant from
the web servers. This has been achieved through the adoption of standards compliant HTML, JavaScript and CSS;
a more streamlined use of the AJAX (Asynchronous
JavaScript and XML) protocol to include additional content; the incorporation of shared memory caching using
memcached (http://www.danga.com/memcached/); and
optimized Apache web server settings to improve browser
performance. To enable the project to prioritize improvements and measure their impact on speed, a system for
continuous automated monitoring of the response speed
of the Ensembl website from more than ten sites around
the world was developed and deployed in early 2008.
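The shared caching idea mentioned above can be illustrated with a minimal sketch. This is not Ensembl's web code, which the text does not show: a plain Python dictionary stands in for a memcached server, and the names CacheAside, render_page and the key format "gene:SLC24A5" are hypothetical. The sketch only shows the general cache-aside pattern, in which an expensive page fragment is computed once and then served from the cache until it expires.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Cache-aside lookup with a time-to-live.

    Assumption: a dict stands in for a shared memcached server.
    """

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        # key -> (expiry time, cached value)
        self.store: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key: str, compute: Callable[[], str]) -> str:
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and entry[0] > now:
            # Fresh entry: serve it without recomputing.
            self.hits += 1
            return entry[1]
        # Missing or expired: compute, store, and return.
        self.misses += 1
        value = compute()
        self.store[key] = (now + self.ttl, value)
        return value


def render_page(gene: str) -> str:
    # Hypothetical placeholder for an expensive, database-backed fragment.
    return f"<div>summary for {gene}</div>"


cache = CacheAside(ttl_seconds=60.0)
first = cache.get_or_compute("gene:SLC24A5", lambda: render_page("SLC24A5"))
second = cache.get_or_compute("gene:SLC24A5", lambda: render_page("SLC24A5"))
```

In this sketch the second request is answered from the cache, so only one call to render_page is made within the time-to-live.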
In parallel with the redevelopment of the underlying web-code, the website has been redesigned to
improve navigability and discoverability (Figure 1). The
new design organizes different views into four classes:
Location, Gene, Transcript and Variation, which can be
easily navigated between through tabs at the top of each
web page. The location class includes views of the genome
sequence at a range of resolutions and genome sequence
based comparative views (Figure 1A). Gene based views
include textual information about the gene, views of its
local genomic environment, views of the gene in the context of its orthologs and paralog relationships with other
genomes in the Ensembl system and views of sequence
variation within that population (Figure 1B). Transcript
based views are similar to the gene based ones, but
focus around individual transcript structures with more
detail (Figure 1C). Variation based views display information focused around individual SNPs (data not
shown). Information presented in a single view in previous
versions of Ensembl is now presented as separate smaller
views in the new design. The relationship between these
new views is clearly shown by the left hand hierarchical
Figure 1. Screenshots of the Release 51 Ensembl website illustrating the principles of the new design and some of the new features. The figure shows
an example of three of the four classes of display view using human gene SLC24A5 as the context. (A) An example of a location based view, showing
a region of the genome around the gene. (B) An example of a gene based view, showing the gene tree. (C) An example of a transcript based view,
showing supporting evidence for the transcript model. The three tabs across the top of the page allow rapid navigation between the three classes of
view. The fourth variation tab (data not shown) appears if an individual SNP is selected. For each class, the left hand menu lists the different views
available. For the location-based views (A), this includes views of a genome at a range of resolutions and genome sequence based comparative
genomic views. For the gene-based views (B), this includes textual information about the gene, views of its local genomic environment, views of the
gene in the context of its orthologs and paralog relationships with other genomes in the Ensembl system and views of sequence variation within that
population. The transcript based views (C) have views similar to the gene based views, but focused around individual transcript structures with more
detail. As well as the overall redesign of the navigation between views, there are substantial improvements to many individual views, based on the
much more extensive use of AJAX in the new web-code. Examples are the genetree view (B) which allows nodes to be expanded or collapsed
interactively, making the view much more usable for large gene families; the substantially redesigned supporting evidence views (C) and the page
configuration options on many views (e.g. A) which are much more intuitive than before and have a much greater range of display options.
New species and improved gene annotations
In the past year, seven new species (all mammals) were
added to Ensembl including one new high coverage
genome Pongo pygmaeus abelii (orangutan) and six new
low coverage genomes [Pteropus vampyrus (megabat),
Tursiops truncatus (dolphin), Tarsius syrichta (Philippine
tarsier), Lama pacos (alpaca), Dipodomys ordii (kangaroo
rat) and Procavia capensis (rock hyrax)]. Ensembl now
supports 19 low coverage 2× genome sequences, the
majority generated as part of the Mammalian Genome
Project (http://www.broad.mit.edu/node/296). So far
only one of the original 2× genomes, Cavia porcellus
(Guinea Pig), has been upgraded to high coverage
(6.8×). Together with the other 13 high coverage mammalian genomes, Ensembl contains a total of 32 mammals,
making it an extensive resource for mammalian comparative genomics. In total Ensembl now supports 48 genomes,
41 of which are vertebrates.
One of the major goals of Ensembl is to provide genesets
which are as accurate and complete as possible and these
continue to be used as reference genesets in analysis of new
vertebrate genomes. Recent genome publications based on
Ensembl genesets include those of the platypus Ornithorhynchus anatinus (11), the opossum Monodelphis domestica
(12) and the rhesus macaque Macaca mulatta (13). The
gene build process is based on alignments of protein and
cDNA sequences and there is continuous work to improve
it and generate updated, more accurate and complete genesets. Different gene build strategies are used depending on
the assembly, quality of the genome, its distance to high
quality genomes and the extent of its organism-specific
transcript evidence as has been previously described (18).
This year one focus has been to develop a systematic post
gene build comparative analysis process (using the Ensembl
compara homology pipeline) to identify initial gene
structures that appear to be evolutionarily inconsistent.
These regions are then subject to a second, more computationally expensive localized gene build pipeline with more
sensitive parameters. The major classes of problems identified are split genes, missing orthologous genes, partially
predicted genes and false exons. For the test case of the
horse genome with initially 20 322 gene models, this postprocessing pipeline identified 236 genes that were split;
added 1013 genes that had initially been missed, but for
which there were orthologs; extended 1330 partially predicted genes and removed 840 false exons. The process is
now being systematically applied to other high coverage
mammalian genomes. These genesets will be patched in
subsequent Ensembl releases.
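To make the notion of an evolutionarily inconsistent structure more concrete, the following sketch flags one of the problem classes named above, split genes. The text does not describe how the Ensembl pipeline detects them, so the heuristic here is purely an illustrative assumption: two neighbouring gene models are flagged when both match the same ortholog in a reference species but cover disjoint parts of it. The GeneModel fields and the function find_split_candidates are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GeneModel:
    gene_id: str
    ortholog_id: str     # best-matching gene in a reference species
    ortholog_start: int  # matched interval on the ortholog (1-based, inclusive)
    ortholog_end: int


def find_split_candidates(models: List[GeneModel]) -> List[Tuple[str, str]]:
    """Return adjacent gene pairs that may be one gene predicted as two.

    `models` must be listed in genomic order. Assumed heuristic: a pair is
    flagged when both models hit the same ortholog on non-overlapping
    intervals.
    """
    pairs = []
    for left, right in zip(models, models[1:]):
        same_ortholog = left.ortholog_id == right.ortholog_id
        disjoint = (left.ortholog_end < right.ortholog_start
                    or right.ortholog_end < left.ortholog_start)
        if same_ortholog and disjoint:
            pairs.append((left.gene_id, right.gene_id))
    return pairs


models = [
    GeneModel("g1", "refA", 1, 200),
    GeneModel("g2", "refA", 210, 480),   # continues where g1 stops
    GeneModel("g3", "refB", 1, 300),
    GeneModel("g4", "refB", 100, 250),   # overlaps g3, so not flagged
]
candidates = find_split_candidates(models)
```

Pairs whose matches overlap on the ortholog are left alone in this sketch, since they are more likely to be duplicates than two halves of one gene.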
The other major focus has been the ongoing improvement of the human geneset in collaboration with other
groups. Ensembl, together with the Sanger Institute
HAVANA group (19), is part of multiple collaborations
to refine the human geneset including the CCDS (Consensus Coding Sequence) consortium, with RefSeq at NCBI
(20) and UCSC (1), and the new ENCODE scale-up project
GENCODE (http://www.sanger.ac.uk/encode/) with multiple collaborators. CCDS (http://www.ncbi.nlm.nih.gov/CCDS/) is a stable set of protein coding gene structures
for which all consortium members agree to the base pair.
Since our previous report (14) the human CCDS set has
increased from 18 290 to 20 159 CDSs, which represents an
increase from 16 003 to 17 052 genes with at least one
CCDS entry. There is also a CCDS set for mouse, which
has increased even more, from 13 374 to 17 707 CDSs and
from 13 014 to 16 889 genes. GENCODE builds on CCDS
to validate additional transcripts and extend into UTR
regions, building on the ENCODE pilot project (9,21–23)
and incorporating additional computational and experimental input and validation (24). One new computational
approach, which is being built on within GENCODE, is to
use alignments across the many mammalian genomes now
available to evaluate the conservation of putative coding
sequences (25). Several hundred transcript predictions generated by the Ensembl gene build pipeline which were
found to have low scores in this analysis have been identified as spurious and are now filtered out. The Ensembl/
HAVANA collaboration includes further efforts to
improve geneset consistency, such as tighter links with UniProt (26) and input into the Genome Reference Consortium (http://www.sanger.ac.uk/sequencing/grc/) to flag
discrepancies between the human genome sequence and
transcript evidence.
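The CCDS counts quoted above can be turned into absolute and relative increases with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the dictionary keys and the helper name growth are illustrative.

```python
# (count in previous report, count now), as quoted in the text
counts = {
    "human CDSs": (18290, 20159),
    "human genes": (16003, 17052),
    "mouse CDSs": (13374, 17707),
    "mouse genes": (13014, 16889),
}


def growth(before: int, after: int):
    """Return (absolute increase, percentage increase)."""
    delta = after - before
    return delta, 100.0 * delta / before


summary = {name: growth(*pair) for name, pair in counts.items()}

for name, (delta, percent) in summary.items():
    print(f"{name}: +{delta} ({percent:.1f}%)")
```

The relative growth of the mouse set is roughly three times that of the human set, which is what the text means by the mouse set having increased even more.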
The Ensembl/HAVANA human geneset shown in
Ensembl is a combined output from these projects, incorporating all CCDS entries and merging HAVANA full
length transcript annotation with the Ensembl gene build.
In the last year, this process has been extended to include
4711 HAVANA pseudogenes and will be more regularly
updated in future to incorporate additional validated
annotation from GENCODE.
One additional geneset development is that the canonical transcripts are now defined for all genes and for all
species. The canonical transcript is defined as either the
menus which are context specific for each class. Each view
within a class has a common header panel, summarizing
the location or object. Clear and easy navigation between
views is provided through the left hand menu and the left
and right buttons below the header panel. Since only a
specific chunk of information is shown in each view,
this makes pages easier to read as well as improving the
responsiveness of the servers. Configuration controls have
been considerably improved and now take the form of a
context specific pop-up panel for most views, e.g. allowing
tracks to be enabled and disabled in genome sequence
based display elements. The same panel contains controls
to allow external data to be uploaded into Ensembl, or for
external data sources to be federated (DAS).
The ideas for the new design were developed and tested
through extensive interactions with users, including one to
one sessions, testing sessions of design mock ups and webbased questionnaires. Questions investigated preferences
between alternative overall layouts (e.g. use of tabs/left
hand menu bars) as well as detailed behaviour such as
the preference for a consistent name for the protein
product of a transcript (translation, peptide, protein). The
results of these surveys have led to a design which is user
driven and significantly different from the one we had
initially planned. We will be maintaining a user panel to
help in guiding interface development.
longest CDS, if the gene has translated transcripts, or the
longest cDNA. Should a transcript already regarded as
canonical not be selected using the above rules, there is
support for storing this information in the Ensembl
database.
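The canonical transcript rule described above can be sketched in a few lines. This is an illustration of the stated rule, not the Ensembl API: the Transcript class, its fields, the function pick_canonical and the identifiers T1 to T3 are hypothetical, and ties are broken by list order, which is an assumption the text does not address. The optional override_id models the stored exception for a transcript already regarded as canonical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Transcript:
    stable_id: str
    cdna_length: int
    cds_length: int  # 0 for transcripts without a translation


def pick_canonical(transcripts: List[Transcript],
                   override_id: Optional[str] = None) -> Transcript:
    """Longest CDS if any transcript is translated, else longest cDNA."""
    if not transcripts:
        raise ValueError("gene has no transcripts")
    if override_id is not None:
        # A stored choice takes precedence over the length rules.
        for transcript in transcripts:
            if transcript.stable_id == override_id:
                return transcript
        raise ValueError(f"unknown transcript: {override_id}")
    coding = [t for t in transcripts if t.cds_length > 0]
    if coding:
        return max(coding, key=lambda t: t.cds_length)
    return max(transcripts, key=lambda t: t.cdna_length)


gene = [
    Transcript("T1", cdna_length=3000, cds_length=0),
    Transcript("T2", cdna_length=1500, cds_length=900),
    Transcript("T3", cdna_length=2000, cds_length=1200),
]
canonical = pick_canonical(gene)
```

Here T3 is chosen because it has the longest CDS, even though the untranslated T1 has the longest cDNA.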
Multiple alignments for comparative genomics
Figure 2. Smoothed density plot of the GERP conservation scores (30) calculated from the 9-way EPO mammalian
genome alignment corresponding to human chromosome X (Ensembl
release 51). Four different types of genomic features are plotted: coding
exons (red), non-coding exons (pink), regulatory features (blue) and
ancestral repeats (black). A GERP score of 0 indicates no evidence
of selective constraint, whereas high GERP scores show evidence of
selective constraint. Non-coding exons include all non-coding positions
of protein-coding genes. Regulatory features include all regulatory features defined by the Ensembl regulatory build (‘Gene Associated’,
‘Non-Gene Associated’, ‘Promoter Associated’ and ‘Unclassified’).
Ancestral repeats include MER type II transposons only, as defined
by RepeatMasker. Conserved features, such as exons and regulatory
features, are clearly distinguished from repeats, a good indication of the
quality of the EPO alignments.
submitted). A significant recent change (release 50) has
been the calculation of site-wise dN/dS values in our
gene trees using the SLR programme (sitewise likelihood
ratio estimation of selection) (32). These values allow us to
detect positions in the alignments that are under different
evolutionary pressure.
Functional Genomics and Variation resources
The availability of genome wide functional data is one of
the major changes in genomics in the last few years.
Driven by involvement in analysis for the ENCODE project (9) and other international research consortia such as
the EU FP6 funded HEROIC (High-throughput Epigenetic Regulatory Organisation In Chromatin) project,
Ensembl has built up an infrastructure to support handling and display of this class of data (14). We have also
recently participated in the creation of a genome-wide
DNA methylation resource that has been incorporated
into Ensembl (33,34). With the availability of next generation sequencing technology, array based ChIP-chip functional data is very rapidly giving way to sequence based
ChIP-seq data. A major activity this year has been the
development of a ChIP-seq analysis pipeline including a
custom algorithm for the analysis of ChIP-seq data.
One of the characteristic features of the Ensembl project
has been to go beyond presenting raw data aligned to the
genome sequence by also presenting high quality consensus biological predictions, generated from automatic analysis pipelines developed to use the raw data as evidence.
Examples are the Ensembl gene build pipeline generating
The genome-wide Ensembl comparative genomics pipeline
has changed significantly over 2008, and is now based on
the Enredo-Pecan-Ortheus pipeline (EPO). These are a set
of three programs which feed into each other. The Enredo
programme (28) takes a set of genomes and creates a segmentation graph across all the genomes to extract a set
of colinear homologous segments. Unlike the algorithms
Ensembl has used previously, Enredo handles lineage specific duplications (for example, a duplication on the primate lineage giving rise to two copies of a series of genes
in primates compared to other mammals). These colinear
segments are then handed onto Pecan, a consistency based
multiple aligner, which provides a highly accurate alignment of the homologous regions. Using an assessment
based on ancestral repeats, Enredo+Pecan outperforms
other combinations of alignment programs in mammals.
Finally, the ancestral sequence reconstruction programme,
Ortheus (29), generates accurate ancestral sequences
across each region. Ortheus uses a branch transducer
model, a type of HMM, to call deletion and insertion
events, providing a realistic model under which it can
infer the ancestral sequence. Figure 2 shows the results of
GERP (30) analysis of constraint across different feature
types found in Ensembl, showing a sharp distinction
between coding exons and ancestral repeats, with regulatory regions showing an intermediate level of constraint.
Ensembl release 49 (March 2008) saw the first set of EPO
alignments on a set of seven mammals. In release 50 (July
2008), this set of alignments was extended to include low-coverage genomes, creating a 23-mammal EPO alignment.
A set of 4-way primate EPO alignments was also added
containing human, chimp, orangutan and macaque. We plan
to produce EPO multiple alignments in the teleost lineage
in the future.
To create the 23 mammal EPO alignment, the methodology had to be extended to include low-coverage
genomes. The assemblies of low-coverage genomes are
too fragmented, creating too many breakpoints in the
Enredo graph, to use Enredo directly. The Enredo graph
was therefore built using high-coverage genomes only.
Low-coverage genomes were then mapped on the colinear
regions using pairwise alignments to the human genome.
For each low-coverage genome, the segments defined by
the pairwise alignments were linked with stretches of
N’s to facilitate the process of building the final multiple
sequence alignment. After the alignment had been
obtained, the stretches of N’s were removed.
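The linking step for low-coverage genomes can be pictured with a small sketch. It is not the EPO implementation: the function names, the linker length and the choice to turn linker positions into gap characters after alignment are all assumptions made for illustration. Linker positions are tracked by coordinate, so genuine N bases inside a segment are left untouched.

```python
from typing import List, Tuple


def link_segments(segments: List[str],
                  gap: int = 10) -> Tuple[str, List[Tuple[int, int]]]:
    """Join segments with runs of N.

    Returns the joined sequence and the (start, end) of each linker run,
    0-based and half-open, in ungapped coordinates.
    """
    parts: List[str] = []
    linkers: List[Tuple[int, int]] = []
    pos = 0
    for index, segment in enumerate(segments):
        if index > 0:
            linkers.append((pos, pos + gap))
            parts.append("N" * gap)
            pos += gap
        parts.append(segment)
        pos += len(segment)
    return "".join(parts), linkers


def strip_linkers(aligned_row: str, linkers: List[Tuple[int, int]]) -> str:
    """Replace linker bases in an aligned row with '-' (assumed behaviour).

    The row keeps its width so the other rows of the alignment stay in register.
    """
    linker_positions = {p for start, end in linkers for p in range(start, end)}
    out = []
    ungapped = 0  # index into the original joined sequence
    for char in aligned_row:
        if char == "-":
            out.append(char)
            continue
        out.append("-" if ungapped in linker_positions else char)
        ungapped += 1
    return "".join(out)


joined, linkers = link_segments(["ACGT", "TTGA"], gap=3)
aligned = "AC-GTNN-NTTGA"  # the joined sequence after a hypothetical alignment
cleaned = strip_linkers(aligned, linkers)
```

In this example the joined sequence is ACGTNNNTTGA, and after the hypothetical alignment the three linker bases are replaced by gaps while the two real segments keep their columns.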
As well as providing alignments of genome sequence,
the Ensembl comparative genome analysis pipelines also
generate gene trees and orthology/paralogy prediction
across all Ensembl genomes. A full description of the pipeline, including its close collaboration with the curated
resource TreeFam (31), is forthcoming (Vilella, A. et al.,
Outreach
Ensembl continues to make a substantial investment in
training and user support. We regard this as critical not
only to help users, but also to evaluate the relevance of
the data we provide and the ease of use of the services
we provide. As discussed earlier, user engagement has
been critical in developing the web site redesign. The
Ensembl Outreach and Training group provides on-site
courses on request and has run 102 workshops since
May 2007, with an expanding effort in Asia (workshops
in China, Malaysia and India), and a substantial presence
in the USA (20 workshops) and Europe (64 workshops). In
addition to this, alongside standalone video tutorials,
eLearning courses are now being developed and piloted
within the EBI training platform (http://www.ebi.ac.uk/training/user/). Finally, the new Ensembl blog (http://ensembl.blogspot.com/) provides updates on upcoming
Ensembl training courses around the world.
FUTURE DIRECTIONS
The impact of next generation sequencing on genomics is
beginning to be felt and a major focus for Ensembl is
adapting to changes in data type and scale that will
result. As discussed last year (14) the scale of data is a
major challenge for many bioinformatics resources. For
the variation team the immediate challenge is to present
the variation landscape that will be uncovered by the 1000
Genomes Project, which is now running. The gene build
team is starting to develop pipelines that use next generation sequencing transcriptome data. We can envisage such
data being collected systematically for many different cell
types and developmental stages, providing increasingly
complete evidence for alternative splicing variants and
functional annotation of the time and localization of
their expression.
At present the focus for genome sequencing is discovery
of variation; however, as both experimental and computational techniques improve, it will become possible to
sequence and assemble large genomes de novo. At this
point it may become cost effective to sequence many
more mammalian genomes. However, a major expansion
of the number of genomes provided using Ensembl technology is already underway, in the form of the Ensembl
Genomes project, (http://www.ensemblgenomes.org/),
which will use Ensembl technology to provide a
common interface to genomes across biology. Significant
API and schema developments have already taken place to
support this, including the ability to store several species
in a single core database.
Finally, it is clear from our website performance monitoring that despite the performance improvements from
our improved web-code, network latency effects will
always reduce performance for users far from our servers.
As a result we have been investing in mirror sites in parallel, to improve performance for users and provide redundancy. We have recently deployed a mirror site in China
in collaboration with the Beijing Genomics Institute,
Shenzhen (BGI-SZ). This site’s primary service region is
our users in and around China as the connections between
the UK and China are relatively slow. We will shortly be
deploying a full mirror to the US west coast and have also
been investigating operating servers in commercially
managed cloud compute facilities.
protein coding genesets and the Ensembl comparative
analysis pipelines generating genetrees and orthology
and paralog relationships. The Ensembl regulatory build
is the latest such pipeline and provides automatic, evidence based annotation of potential regulatory regions
within the human genome. The primary inputs are maps
of open chromatin created by DNase I hypersensitivity
mapping and covalent modifications of histone protein
tails assayed by chromatin immunoprecipitation (ChIP).
The first build was released in coordination with the
ENCODE Pilot Project publication (9). Since the first
release reported last year (14), we have updated the regulatory build three times, each time adding more data
(35,36) and a more sophisticated analysis of the chromatin
conformation and modification data. The build now consists of approximately 175 000 genomic regions defined
from data collected from several cell types, including
CD4 cells which make up the majority of the supporting
data. Approximately 40 different histone modifications are
now included and more than 2700 combinations of these
factors form patterns associated with protein coding
genes or their promoters allowing over 23 000 of the
regulatory features to be classified as gene- or promoter-associated.
The rapid adoption of next generation sequencing technologies is also having a major impact on variation data in
Ensembl. Whereas data continues to be imported from
dbSNP, a major new source of computationally discovered variation data is from the processing of resequencing
data. This second data source is growing rapidly in parallel with next generation sequencing technology. This year,
Ensembl imported the data from three successive builds
of dbSNP (127, 128, and 129). It has also incorporated
resequencing-based SNPs from platypus and orangutan
as well as from the resequenced human genomes of
Watson and Venter. The platypus SNPs were submitted to
dbSNP and make up the largest set of SNPs for that species. The orangutan SNPs will be submitted in conjunction
with the publication of that genome.
Within the variation database, we have increased support for copy number variation data and annotation of
individual SNPs [e.g. with disease associations identified in
genome-wide scans and with expression QTLs (37)]. The
Ensembl variation group is synergistic with the European
Genotype Archive (EGA http://www.ebi.ac.uk/ega/) and
the 1000 Genomes Project (http://www.1000genomes.org/) data coordination centre groups at the EBI (European
Bioinformatics Institute). The EGA was launched in the
spring of 2008 and currently manages data from several
projects including the Wellcome Trust Case Control
Consortium (38) and other projects that are still in prepublication status. The synergies between these projects
will underpin the growth in variation data in Ensembl
and the start of its functional annotation.
ACKNOWLEDGEMENTS
We acknowledge those researchers and organizations that
have provided data to Ensembl prior to publication under
the understandings of the Fort Lauderdale meeting discussing Community Resource Projects. We thank all
users of our website and other resources, and those who
have provided useful feedback through our mailing list.
FUNDING
This work was supported by the Wellcome Trust [grant
numbers WT062023]; the European Molecular Biology
Laboratory (EMBL); the National Institutes of Health
(NIH) National Human Genome Research Institute
(NHGRI); the National Institutes of Health (NIH)
National Institute of Allergy and Infectious Diseases
(NIAID); the Biotechnology and Biological Sciences
Research Council (BBSRC); the Medical Research
Council (MRC); and the European Union. Funding for
open access charge: The Wellcome Trust.
Conflict of interest statement. None declared.
REFERENCES
1. Karolchik,D., Kuhn,R.M., Baertsch,R., Barber,G.P., Clawson,H.,
Diekhans,M., Giardine,B., Harte,R.A., Hinrichs,A.S., Hsu,F. et al.
(2008) The UCSC Genome Browser Database: 2008 update. Nucleic
Acids Res., 36, D773–D779.
2. Wheeler,D.L., Barrett,T., Benson,D.A., Bryant,S.H., Canese,K.,
Chetvernin,V., Church,D.M., Dicuccio,M., Edgar,R., Federhen,S.
et al. (2008) Database resources of the National Center for
Biotechnology Information. Nucleic Acids Res., 36, D13–D21.
3. Dowell,R.D., Jokerst,R.M., Day,A., Eddy,S.R. and Stein,L. (2001)
The Distributed Annotation System. BMC Bioinformatics, 2, 7.
4. Jenkinson,A.M., Albrecht,M., Birney,E., Blankenburg,H., Down,T.,
Finn,R.D., Hermjakob,H., Hubbard,T.J., Jimenez,R.C., Jones,P.
et al. (2008) Integrating biological data – the Distributed
Annotation System. BMC Bioinformatics, 9(Suppl 8), S3.
5. Kasprzyk,A., Keefe,D., Smedley,D., London,D., Spooner,W.,
Melsopp,C., Hammond,M., Rocca-Serra,P., Cox,T. and Birney,E.
(2004) EnsMart: a generic system for fast and flexible access to
biological data. Genome Res., 14, 160–169.
6. The Star Consortium (2008) SNP and haplotype mapping for
genetic analysis in the rat. Nat. Genet., 40, 560–566.
7. Twigger,S.N., Pruitt,K.D., Fernández-Suárez,X.M., Karolchik,D.,
Worley,K.C., Maglott,D.R., Brown,G., Weinstock,G., Gibbs,R.A.,
Kent,J. et al. (2008) What everybody should know about the rat
genome and its online resources. Nat. Genet., 40, 523–527.
8. Aitman,T.J., Critser,J.K., Cuppen,E., Dominiczak,A., Fernandez-Suarez,X.M., Flint,J., Gauguier,D., Geurts,A.M., Gould,M.,
Harris,P.C. et al. (2008) Progress and prospects in rat genetics: a
community view. Nat. Genet., 40, 516–522.
9. The ENCODE Project Consortium (2007) Identification and analysis of functional elements in 1% of the human genome by the
ENCODE pilot project. Nature, 447, 799–816.
10. Gräf,S., Nielsen,F.G., Kurtz,S., Huynen,M.A., Birney,E.,
Stunnenberg,H. and Flicek,P. (2007) Optimized design and
assessment of whole genome tiling arrays. Bioinformatics, 23,
i195–i204.
11. Warren,W.C., Hillier,L.W., Marshall Graves,J.A., Birney,E.,
Ponting,C.P., Grutzner,F., Belov,K., Miller,W., Clarke,L.,
Chinwalla,A.T. et al. (2008) Genome analysis of the
platypus reveals unique signatures of evolution. Nature, 453,
175–183.
12. Mikkelsen,T.S., Wakefield,M.J., Aken,B., Amemiya,C.T.,
Chang,J.L., Duke,S., Garber,M., Gentles,A.J., Goodstadt,L.,
Heger,A. et al. (2007) Genome of the marsupial Monodelphis
domestica reveals innovation in non-coding sequences. Nature, 447,
167–177.
13. Rhesus Macaque Genome Sequencing and Analysis Consortium.
(2007) Evolutionary and biomedical insights from the rhesus
macaque genome. Science, 316, 222–234.
14. Flicek,P., Aken,B.L., Beal,K., Ballester,B., Caccamo,M., Chen,Y.,
Clarke,L., Coates,G., Cunningham,F., Cutts,T. et al. (2008)
Ensembl 2008. Nucleic Acids Res., 36, D707–D714.
15. Spudich,G., Fernandez-Suarez,X.M. and Birney,E. (2007) Genome
browsing with Ensembl: a practical overview. Briefings in functional
genomics & proteomics, 6, 202–219.
16. Hubbard,T., Barker,D., Birney,E., Cameron,G., Chen,Y., Clark,L.,
Cox,T., Cuff,J., Curwen,V., Down,T. et al. (2002) The Ensembl
genome database project. Nucleic Acids Res., 30, 38–41.
17. Cunningham,F., Rios,D., Griffiths,M., Smith,J., Ning,Z., Cox,T.,
Flicek,P., Marin-Garcia,P., Herrero,J., Rogers,J. et al. (2006)
TranscriptSNPView: a genome-wide catalog of mouse coding
variation. Nat. Genet., 38, 853.
18. Hubbard,T.J., Aken,B.L., Beal,K., Ballester,B., Caccamo,M.,
Chen,Y., Clarke,L., Coates,G., Cunningham,F., Cutts,T. et al.
(2007) Ensembl 2007. Nucleic Acids Res., 35, D610–D617.
19. Wilming,L.G., Gilbert,J.G., Howe,K., Trevanion,S., Hubbard,T.
and Harrow,J.L. (2008) The vertebrate genome annotation (Vega)
database. Nucleic Acids Res., 36, D753–D760.
20. Pruitt,K.D., Tatusova,T. and Maglott,D.R. (2007) NCBI reference
sequences (RefSeq): a curated non-redundant sequence database
of genomes, transcripts and proteins. Nucleic Acids Res., 35,
D61–D65.
21. Denoeud,F., Kapranov,P., Ucla,C., Frankish,A., Castelo,R.,
Drenkow,J., Lagarde,J., Alioto,T., Manzano,C., Chrast,J. et al.
(2007) Prominent use of distal 5′ transcription start sites and
discovery of a large number of additional exons in ENCODE
regions. Genome Res., 17, 746–759.
22. Harrow,J., Denoeud,F., Frankish,A., Reymond,A., Chen,C.K.,
Chrast,J., Lagarde,J., Gilbert,J.G., Storey,R., Swarbreck,D. et al.
(2006) GENCODE: producing a reference annotation for
ENCODE. Genome Biol., 7(Suppl 1), S41–S49.
23. Guigó,R., Flicek,P., Abril,J.F., Reymond,A., Lagarde,J.,
Denoeud,F., Antonarakis,S., Ashburner,M., Bajic,V.B., Birney,B.
et al. (2006) EGASP: the human ENCODE Genome Annotation
Assessment Project. Genome Biol., 7, S2.
24. Tress,M.L., Martelli,P.L., Frankish,A., Reeves,G.A.,
Wesselink,J.J., Yeats,C., Olason,P.L., Albrecht,M., Hegyi,H.,
Giorgetti,A. et al. (2007) The implications of alternative splicing
in the ENCODE protein complement. Proc. Natl Acad. Sci. USA,
104, 5495–5500.
25. Clamp,M., Fry,B., Kamal,M., Xie,X., Cuff,J., Lin,M.F., Kellis,M.,
Lindblad-Toh,K. and Lander,E.S. (2007) Distinguishing protein-coding and noncoding genes in the human genome. Proc. Natl
Acad. Sci. USA, 104, 19428–19433.
26. The UniProt Consortium (2008) The universal protein resource
(UniProt). Nucleic Acids Res., 36, D190–D195.
27. Bruford,E.A., Lush,M.J., Wright,M.W., Sneddon,T.P., Povey,S.
and Birney,E. (2008) The HGNC Database in 2008: a resource for
the human genome. Nucleic Acids Res., 36, D445–D448.
28. Paten,B., Herrero,J., Beal,K., Fitzgerald,S. and Birney,E. (2008)
Enredo and Pecan: Genome-wide mammalian consistency based
multiple alignment with paralogs. Genome Res., 18, 1814–1828.
29. Paten,B., Herrero,J., Fitzgerald,S., Beal,K., Flicek,P., Holmes,I. and
Birney,E. (2008) Genome-wide nucleotide level mammalian ancestor
reconstruction. Genome Res., 18, 1829–1843.
30. Cooper,G.M., Stone,E.A., Asimenos,G., Program,N.C.S.,
Green,E.D., Batzoglou,S. and Sidow,A. (2005) Distribution and
intensity of constraint in mammalian genomic sequence. Genome
Res., 15, 901–913.
31. Ruan,J., Li,H., Chen,Z., Coghlan,A., Coin,L.J., Guo,Y.,
Heriche,J.K., Hu,Y., Kristiansen,K., Li,R. et al. (2008) TreeFam:
2008 Update. Nucleic Acids Res., 36, D735–D740.
32. Massingham,T. and Goldman,N. (2005) Detecting amino acid sites
under positive selection and purifying selection. Genetics, 169,
1753–1762.
33. Down,T.A., Rakyan,V.K., Turner,D.J., Flicek,P., Li,H.,
Kulesha,E., Graf,S., Johnson,N., Herrero,J., Tomazou,E.M.
et al. (2008) A Bayesian deconvolution strategy for
immunoprecipitation-based DNA methylome analysis.
Nat. Biotechnol., 26, 779–785.
34. Rakyan,V., Down,T., Thorne,N., Flicek,P., Kulesha,E., Graf,S.,
Tomazou,E., Backdahl,L., Johnson,N., Herberth,M. et al. (2008)
An integrated resource for genome-wide identification and analysis
of human tissue-specific differentially methylated regions (tDMRs).
Genome Res., 18, 1518–1529.
35. Wang,Z., Zang,C., Rosenfeld,J.A., Schones,D.E., Barski,A.,
Cuddapah,S., Cui,K., Roh,T.Y., Peng,W., Zhang,M.Q. et al. (2008)
Combinatorial patterns of histone acetylations and methylations in
the human genome. Nat. Genet., 40, 897–903.
36. Barski,A., Cuddapah,S., Cui,K., Roh,T.Y., Schones,D.E., Wang,Z.,
Wei,G., Chepelev,I. and Zhao,K. (2007) High-resolution
profiling of histone methylations in the human genome. Cell, 129,
823–837.
37. Stranger,B.E., Nica,A.C., Forrest,M.S., Dimas,A., Bird,C.P.,
Beazley,C., Ingle,C.E., Dunning,M., Flicek,P., Koller,D. et al.
(2007) Population genomics of human gene expression. Nat. Genet.,
39, 1217–1224.
38. Wellcome Trust Case Control Consortium (2007) Genome-wide
association study of 14,000 cases of seven common diseases and
3,000 shared controls. Nature, 447, 661–678.