Protein Structure Databases
1007/s12033-010-9372-4
REVIEW
Abstract
Web-based protein structure databases come in a wide variety of types and levels of information content. Those having the most general interest are the various atlases that describe each experimentally determined protein structure and provide useful links, analyses and schematic diagrams relating to its 3D structure and biological function. Also of great interest are the databases that classify 3D structures by their folds, as these can reveal evolutionary relationships which may be hard to detect from sequence comparison alone. Related to these are the numerous servers that compare folds, which are particularly useful for newly solved structures, and especially those of unknown function. Beyond these there are a vast number of databases for the more specialized user, dealing with specific families, diseases, structural features and so on.
Keywords: Protein structure; Protein Data Bank; PDB; wwPDB; Secondary structure; Fold classification
Introduction
Back in 1971, when the Protein Data Bank (PDB) was founded [1], there were only seven experimentally determined protein 3D structures. The data for each, including the protein's atomic coordinates, were stored in simple, fixed-format text files. Any structural analysis of these proteins usually required access to bulky and expensive graphics computers. Since then, the number of solved structures has increased ten thousandfold. The internet revolution has helped by making access to, and display of, protein structural data vastly easier and providing a greater level of information content. However, it has also resulted in a multiplicity of sources from which this information can be obtained. New servers and services continually appear, while existing ones are modified and improved. Conversely, some servers are abandoned or neglected, becoming more and more out of date with time, or else disappear altogether. Thus, it can be quite difficult to know where to go to get the answers one needs. Various lists are available on the web. For example, the Nucleic Acids Research (NAR) journal provides lists of databases of biological interest. One of its sub-lists, at http://www.oxfordjournals.org/nar/database/subcat/4/14, provides links to protein structure databases and, as of December 2010, contained 84 different servers. In this review, we aim to highlight some of the more useful, and up-to-date (at time of writing!), sources of information on protein structure.
R. A. Laskowski, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; e-mail: roman@ebi.ac.uk
Structures and Structural Data
Terminology
Firstly, it is important to define what is meant by the term "protein structure". It is a term that tends to be somewhat loosely used. A preferable term is "model", as the 3D structures of large molecules, such as proteins, are models of the protein atoms' locations (i.e. their x-, y-, z-coordinates), their chemical types (e.g. oxygen, nitrogen) and other parameters. The models are built, and their parameters refined, until one gets a reasonable enough fit to the experimental data. The term "model" is itself often loosely used to mean theoretical, or homology-built, models of proteins. These are indeed models, but not ones built using
experimental measurements, so they need to be prefixed by labels such as "theoretical" or "homology-built". Another loosely used term is "database". Technically, the databases mentioned here are not databases at all, but rather "data resources", many of which rely on a database for storing and serving up the data. However, the term "database" has become common for the types of resources described here, so it is the meaning we will adopt.
The Protein Data Bank (PDB) and the wwPDB
The primary repository of 3D structural data on proteins (and other biological macromolecules, including RNA, fragments of DNA, carbohydrates, and different complexes of these molecules) is the Protein Data Bank. As mentioned above, this was founded in 1971 and was located at Brookhaven National Laboratory. In October 1998, the management of the archive was taken over by the Research Collaboratory for Structural Bioinformatics (RCSB), a consortium consisting of Rutgers University, the National Institute of Standards and Technology (NIST) and the San Diego Supercomputer Center [2]. Since 2003, the archive has been managed by an international consortium called the world-wide Protein Data Bank (wwPDB) whose partners comprise: the RCSB, the PDBe (formerly the Macromolecular Structure Database, MSD) at the European Bioinformatics Institute (EBI) in the UK, the Protein Data Bank Japan (PDBj) at Osaka University and, more recently, the BioMagResBank (BMRB) at the University of Wisconsin-Madison [3]. Access to the primary data is via the wwPDB's web site: http://www.wwpdb.org. The data come in three different formats: old-style PDB-format files, macromolecular Crystallographic Information File (mmCIF) format [4] and an XML-style format called PDBML/XML [5]. For many of the structures, the wwPDB also makes the original experimental data available.
Thus, for structural models solved by X-ray crystallography, one can often download the structure factors from which the model was derived, while for structures solved by nuclear magnetic resonance (NMR) spectroscopy, the original distance and angle restraints can be obtained. As of December 2010, the wwPDB contained just under 70,000 structural models, each identified by a unique 4-character reference code, or PDB identifier.
Structural Data and Analyses
Rather than download the raw data from the wwPDB for any protein, it is usually more convenient to obtain the information of interest directly from one of the myriad protein structure databases on the web. These come in many shapes and sizes, catering for a variety of needs and interests.
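To give a sense of what the raw data look like, and why most users prefer the derived databases, the following minimal sketch reads atom records from old-style, fixed-format PDB text. The column positions are those of the legacy PDB format; the function names (`is_pdb_id`, `parse_atom_line`, `read_atoms`) are hypothetical, the sample coordinates are invented for illustration, and the identifier pattern (a digit followed by three alphanumeric characters) is an assumption about the classic 4-character codes.

```python
import re

# Assumed pattern for classic PDB identifiers: one digit followed by
# three alphanumeric characters (e.g. "1ayy").
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def is_pdb_id(code):
    """Return True if `code` looks like a 4-character PDB identifier."""
    return bool(PDB_ID.match(code))


def parse_atom_line(line):
    """Slice one fixed-width ATOM/HETATM record into a dictionary.

    The legacy format defines fields by 1-based column ranges; these
    are converted to 0-based Python slices here.
    """
    return {
        "record": line[0:6].strip(),    # columns 1-6
        "serial": int(line[6:11]),      # columns 7-11
        "name": line[12:16].strip(),    # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],              # column 22
        "res_seq": int(line[22:26]),    # columns 23-26
        "x": float(line[30:38]),        # columns 31-38
        "y": float(line[38:46]),        # columns 39-46
        "z": float(line[46:54]),        # columns 47-54
    }


def read_atoms(text):
    """Parse every ATOM/HETATM record found in a PDB-format string."""
    return [
        parse_atom_line(line)
        for line in text.splitlines()
        if line.startswith(("ATOM  ", "HETATM"))
    ]


# Build an illustrative record by formatting, so the columns line up.
# The values are invented and do not come from any real PDB entry.
sample = "ATOM  {:>5d} {:<4s} {:>3s} {:1s}{:>4d}    {:8.3f}{:8.3f}{:8.3f}".format(
    1, " CA", "ALA", "A", 23, 11.104, 6.134, -6.504
)
atoms = read_atoms(sample)
```

Because every field is located by column position rather than by a delimiter, a single misplaced space corrupts the record, which is one reason the mmCIF and PDBML/XML formats were introduced alongside the original one.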
At the simplest level are the sites that provide "atlas" pages, one for every PDB entry, each containing general information obtained from the relevant PDB file. There are usually graphical representations of the structural model together with links that provide interactive 3D visualizations using Java-based, or other, viewers. Each of the founding members of the wwPDB has its own atlas pages: the RCSB, the PDBe and PDBj. In addition, there are several other sites that have much to recommend them and some of these will be mentioned below. Beyond the atlases, there are a host of other types of sites and servers. These include those that provide information on specific structural motifs, focus on selected protein families, classify protein folds, compare protein structures, provide homology-built models for proteins for which no structure has been determined, and so on. This review will cherry-pick a few of the more interesting and useful sites to visit.
Atlases
Table 1 lists seven of the best-known and most useful atlas sites. All have been developed independently and, not unexpectedly, all have much in common, as the information comes from the same source: the PDB entry. Thus the protein name, authors, key reference, experimental methods, atomic coordinates and so on are obviously all identical. Also common to most of the atlases are certain derived data, including quality assessment of each structural model, and information about the protein's likely "biological unit". The first of these, quality assessment, is a crucial issue as not all experimentally determined protein models are equally reliable. Much has been written on this topic over the years [6-10]. The main problem is that the results of any experiment contain errors, but in protein structure determination it is difficult to estimate the magnitude of these errors. Thus it is not obvious which models are more reliable than others. For X-ray models, the resolution at which the structure was solved and its R-factor can be a rough guide, while for NMR models there is not even a rough measure available. So it is important to have some indication of whether a given structural model is reliable, and most atlases provide one. The second important issue is that of knowing what a given protein's biological unit is. This is not always obvious from the PDB entry itself. The problem is that the deposited coordinates from an X-ray crystal structure determination correspond to the molecule(s) in the asymmetric unit. This may give a false impression of how the protein operates in vivo. So, for example, what may look like a monomer from the PDB entry, is, in real life, a
Mol Biotechnol (2011) 48:183-198
Table 1 Protein structure atlases
Server   Location                                URL                                                    Ref.
JenaLib  Fritz Lipmann Institute, Jena, Germany  http://www.imb-jena.de/IMAGE.html
MMDB     NCBI, USA                               http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml
PDBe     EBI, Cambridge, UK                      http://www.ebi.ac.uk/pdbe
OCA      Weizmann Institute, Israel              http://bip.weizmann.ac.il/oca-docs/oca-home.html
PDBj     Osaka University, Japan                 http://www.pdbj.org
PDBsum   EBI, Cambridge, UK                      http://www.ebi.ac.uk/pdbsum                            [30]
RCSB     Rutgers and San Diego, USA              http://www.rcsb.org/pdb                                [2]
dimer, or a trimer, etc. Conversely, the PDB entry might give the coordinates of a dimer, yet the biological unit happens to be a monomer. For any structural analysis it is crucial to know what the true biological unit is. For some proteins the biological unit has been determined experimentally, and so is known with great confidence. In others it has to be deduced computationally by analysing how the individual chains in the crystal pack together and the extent of the interfaces between them. Some interfaces are more substantial than others and hence likely to represent genuine biological interactions rather than happenstance crystal contacts. Most of the atlases provide information on the known, or predicted, biological unit. These may come from author assignments or from a prediction method such as the Protein Interfaces, Surfaces and Assemblies (PISA) program [11]. Beyond these general similarities, the atlases differ in sufficient respects to make them complement one another; they differ in what additional information they pull in, the links they make to external resources, and the analyses of the 3D structure that they provide. Consequently, the atlas of choice for any protein structure can be either a matter of personal preference or depend on the type of information one is after. A recent review compared these atlases, or "comprehensive information resources" as it called them, and identified their similarities and differences [12]. Here, we include only those that have aspects that make them unique, useful or interesting, and we focus on those features. We start with the atlases provided by the founding members of the wwPDB, and then discuss some of the others.
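The idea that a substantial interface is more likely to be biological than a slight one can be illustrated with a deliberately crude sketch: count the atom pairs that lie within a distance cutoff between each pair of chains, and rank the chain pairs by that count. This is not how PISA works, which uses a more elaborate analysis; the function names, the 4 Å cutoff and the toy coordinates below are all assumptions made purely for illustration.

```python
from itertools import combinations, product
from math import dist


def count_interface_contacts(chain_a, chain_b, cutoff=4.0):
    """Count atom pairs, one atom from each chain, within `cutoff` angstroms."""
    return sum(1 for a, b in product(chain_a, chain_b) if dist(a, b) <= cutoff)


def rank_interfaces(chains, cutoff=4.0):
    """Rank every pair of chains by its number of inter-chain contacts.

    `chains` maps a chain identifier to a list of (x, y, z) tuples.
    Returns a list of ((id_1, id_2), contact_count), largest first.
    """
    scored = [
        ((id_1, id_2), count_interface_contacts(chains[id_1], chains[id_2], cutoff))
        for id_1, id_2 in combinations(sorted(chains), 2)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy coordinates (invented): chains A and B touch, chain C is far away.
toy_chains = {
    "A": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "B": [(2.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
    "C": [(0.0, 9.0, 0.0)],
}
ranking = rank_interfaces(toy_chains)
```

A raw contact count ignores chemistry and buried surface area, so it can only suggest which interfaces deserve a closer look; the predicted biological units shown by the atlases should be preferred.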
The RCSB PDB
The RCSB's web site (http://www.rcsb.org/pdb) is a very rich source of information about each PDB entry and can be a little overwhelming for novices. Hence there are various tutorials, including a narrated one using Flash, to help users get started [2].
Summary Page
At the top of each PDB entry's summary page (Fig. 1) is shown the primary literature citation for the entry, if there is one. The citation usually refers to the authors' description of the structure determination experiment, the analysis performed on the resultant model and its biological significance. Next comes a description of the molecules making up the structure that was solved: protein chains, DNA fragments, ligand molecules, etc. Also given are various details of the experiment, including the organism from which the protein came and how it was expressed. Additional annotations, with appropriate links to the relevant databases, are also given, including: the SCOP [13] and CATH [14] fold classifications, constituent Pfam domains [15], and Gene Ontology (GO) functional annotations [16]. Text in bold blue initiates a search for all other PDB entries having that text in common (e.g. other entries with the same author name, or species, or protein classification, etc.). The thumbnail image of the structure has two modes, and you can click between the two: one mode shows the asymmetric unit and the other shows the biological unit, as described above (although in many cases they are identical). Below the thumbnail are links to various 3D viewers which allow you to view the molecule interactively, rotating and moving it about on screen. The viewers include: Jmol [17], Molecular Biology Toolkit (MBT) SimpleViewer [18] and MBT Protein Workshop. Which of these you use soon becomes a matter of personal preference (and patience when download times are long).
Other Information
Besides the summary information, further structural details are presented on additional pages titled: Sequence details,
Fig. 1 RCSB atlas page for PDB entry 1ayy, a glycosylasparaginase, showing the summary information for this structural model, determined by X-ray crystallography at 2.32 Å resolution
Sequence Similarity, 3D Similarity, Literature, Biology & Chemistry, Methods and Geometry. Figure 2 shows an example of the Sequence details page, which includes a schematic diagram of the protein's secondary structure (showing α- and π-helices, β-sheets and turns), together with any SCOP structural domains. The 3D Similarity page allows you to find other protein structural models that are similar in overall fold to the current PDB entry.
For ligands, the summary page provides links to the Java-based Ligand Explorer program, which offers a 3D view of the protein-ligand complex and allows you to select and view different types of protein-ligand interactions (e.g. hydrogen bonds and hydrophobic contacts). Simplified 2D diagrams of the protein-ligand interactions are depicted in plots generated by the PoseView program [19].
Fig. 2 The sequence details for chain A of entry 1ayy in the RCSB server, showing a schematic diagram of the secondary structure assignments for this chain together with the SCOP structural domain
Quality Assessment
For assessing the quality of each protein model, the Geometry tab provides a Ramachandran plot of the protein's main-chain φ and ψ torsion angles as generated by the MolProbity structure validation program [20]. The tightness of the clustering of points in the core regions of the plot can indicate that
the structure is of good quality; conversely, a large number of points lying outside these regions can suggest that part, or all, of the structure may be of dubious reliability. Another quality measure is provided by the fold deviation score (FDS) which gives an indication of how far from ideal values various geometrical properties of each residue lie. The properties include bond lengths, bond angles and dihedral angles.
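The φ and ψ angles of a Ramachandran plot are ordinary dihedral (torsion) angles defined by four consecutive backbone atoms: C(i-1), N(i), Cα(i), C(i) for φ and N(i), Cα(i), C(i), N(i+1) for ψ. As a minimal sketch of the underlying geometry, and not the code used by MolProbity or any of the atlases, the angle can be computed from atomic coordinates as follows. The function names are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Return the torsion angle p0-p1-p2-p3 in degrees, in (-180, 180]."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)

    # Unit vector along the central bond.
    length = math.sqrt(_dot(b1, b1))
    b1 = tuple(c / length for c in b1)

    # Components of the outer bonds perpendicular to the central bond.
    v = _sub(b0, tuple(c * _dot(b0, b1) for c in b1))
    w = _sub(b2, tuple(c * _dot(b2, b1) for c in b1))

    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def phi(c_prev, n, ca, c):
    """Main-chain phi: C(i-1), N(i), CA(i), C(i)."""
    return dihedral(c_prev, n, ca, c)


def psi(n, ca, c, n_next):
    """Main-chain psi: N(i), CA(i), C(i), N(i+1)."""
    return dihedral(n, ca, c, n_next)
```

Applying `phi` and `psi` to every residue of a chain gives the points of the plot; the first residue has no φ and the last has no ψ, since the neighbouring atom is missing.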
Molecule of the Month
One particularly eye-catching feature of the RCSB site is the "Molecule of the Month", written by David S. Goodsell of The Scripps Research Institute and illustrated with his beautiful plots. Each month the structure and function of a different protein or protein family is described, with specific references to the PDB entries that have contributed to the understanding of how the proteins achieve their biological roles. The collection of short articles, which are suitable for specialists and non-specialists alike, dates back to the year 2000 and now numbers over 130 entries, providing a nice reference and educational resource. They can be listed by title, by date or by biological category.
The PDBe
The PDBe [21-24], formerly known as the Macromolecular Structure Database (MSD), is the European branch of the wwPDB. Its site provides an extensive set of search and analysis tools that allow one to explore and mine the structural data in the PDB (http://www.ebi.ac.uk/pdbe). The atlas pages for each entry show the usual summary information describing the structure and the experimental details used to obtain it. Additional pages provide information on the protein's Primary, Secondary, Tertiary and Quaternary structure (i.e. the probable biological unit, as predicted by PISA), Experimental method, Taxonomy information, Citation, Ligand and Visualization.
Search Tools
The full list of search tools can be found on the PDBe home page, but only a few will be mentioned here. The simplest is the Advanced PDB Search which, despite its name, is an easy-to-use web form for searching on text or sequence data. A more sophisticated search tool is PDBeMotif, which incorporates PDBeSite [25], and allows searches for sequence and structural motifs as well as for ligands and specific protein-ligand interactions. Structural motifs can be defined in terms of patterns of secondary structure, φ/ψ and χ angles, and Cα and side-chain positions.
Searches are entered either via a simple web form or using a graphical query generator. The hits from a search can be viewed in three dimensions, aligned by ligand, PROSITE pattern, active site residues or by environment. One can generate various statistics on protein-ligand interactions (e.g. to compare the different distributions of residues binding to ATP and GTP). Of particular use is an option to upload a PDB file and scan its ligands and active sites against the PDBe data.
PDBeFold uses the secondary structure similarity matching program SSM [26] to find structurally similar proteins for an uploaded PDB file. This is a fold-matching server; others will be mentioned later. PDBeAnalysis allows you to quickly analyse various structural parameters in the data across the PDB. It shows the distribution of the selected parameter as a pie chart or histogram (either 1D or 2D) which can then be explored using click-and-drag to select bins, or ranges of bins, and obtain, say, the list of PDB codes from which the data come. It can be used to perform geometric validation of a given structure, to select data based on various filters, or to perform statistical analyses of the data in the PDBe. One can even submit one's own SQL queries direct to the database.
The AstexViewer™
The PDBe's primary visualization tool, which has been partly developed by the PDBe, is the AstexViewer™ [27, 28]. This is a powerful search, comparison and display tool in its own right. It runs as a Java applet and aims to provide a graphical interface for the data in the PDBe. As some database searches can return hits to multiple structures, the viewer can present these hits, structurally superposed, along with the corresponding sequence alignments. Structural analyses are presented in a variety of graphs such as histograms, pie charts, dendrograms and so on (see Fig. 3). These are dynamically linked to one another as well as to the 3D structure and sequence views. Thus, selection of data in a graph or view (either by a mouse-click, or by click-and-drag "data brushing" to select a region of data points) is reflected by appropriate highlighting in all other graphs and views. A nice feature, also found in some other structure viewers, is the way the viewer "flies" between different views of the structure, say to centre on a different residue selected from the sequence display by the user. If the new residue is offscreen, the viewer first zooms out to the whole-molecule view before zooming in on the new residue of interest (much like the "flight" in Google Earth). Other nice features are the hyperbolic display of the protein sequence, wherein the current region of interest is magnified relative to the rest of the sequence, and a "magic lens" which, when passed over the 3D structure with the mouse, displays additional structural annotation of the macromolecule within the lens region.
JenaLib
The Jena Library of Biological Macromolecules, JenaLib [29], http://www.imb-jena.de/IMAGE.html, was one of the earliest sites offering atlas pages for each PDB entry, and it specializes in hand-curated images of the structures showing functionally informative views. Rather than split information across several pages, JenaLib shows all the
Fig. 3 The PDBe's Astex viewer showing PDB entry 2oiq, tyrosine kinase c-Src, with the bound drug molecule imatinib. Two pop-up windows are shown on the right, one giving the interactions between the ligand and protein residues and the other a Ramachandran plot of the protein's φ-ψ torsion angles
information on a single page but has a collapse/expand mechanism for controlling what is shown and what is hidden. In addition to several of the standard 3D viewers the site features its own: the JenaLib Jmol viewer. This viewer is an extension of Jmol which has a number of options not found in other viewers, such as highlighting of PROSITE motifs, single amino acid polymorphisms and CATH or SCOP domain structures. JenaLib has more links to external databases than the other atlas sites and is particularly strong on its many visualizations of each entry, both in terms of its interactive viewing options and its pre-prepared still images. A particularly useful feature is a form for generating lists of PDB entries according to a number of criteria. Additionally, there are a number of pre-computed lists of structures; e.g. all nucleic acid structures without protein, all carbohydrate structures and so on.
OCA
OCA's main difference from the other atlases is its linkage between proteins and the diseases associated with them
(http://bip.weizmann.ac.il/oca-docs/oca-home.html). Its search form has a few novel search options including gene name, function, disease and membrane orientation (for membrane-spanning proteins).
PDBsum
The last of the atlases described here is PDBsum [30], http://www.ebi.ac.uk/pdbsum. This aims to be more pictorial than the other sites, illustrating many of its structural analyses by schematic diagrams rather than as tables of numbers. Also, it allows users to upload their own PDB files and get a set of password-protected PDBsum pages generated for them.
Pfam Domain Diagrams and Domain Architecture Networks
Each entry's summary page has a few useful features not found in the other atlas sites. One of these is a clickable schematic diagram showing how much of the full-length
protein sequence is actually represented by the 3D structural model (Fig. 4a). Often, for example, the 3D structure is only of a single domain, perhaps because the full protein would not crystallize. The diagram shows the protein's secondary structure and annotates it with any Pfam sequence domains and CATH structural domains. Occasionally, crystallographers assign two separate chain identifiers to different parts of a single protein sequence (perhaps because of a sequence break), and these diagrams can show this immediately (see for example PDB entry 1ayy, where the N-terminal domain has been labelled chain A and the C-terminal domain chain B). The orange ? icon above the diagram identifies other PDB entries for the same protein (i.e. same UniProt identifier). From the list one can see if there are any better or more complete structural models of the given protein, as shown in Fig. 4c. The coloured triangular icon on the right launches the ArchSchema Java application [31], which shows a network of related protein architectures that have one or more Pfam domains in common with the protein of interest (Fig. 4b).
Quality Assessment
The summary page also provides an at-a-glance guide to the protein's likely reliability by way of a thumbnail Ramachandran plot. Hovering the mouse over the thumbnail pops up a full-size version. As before, a more reliable model will have more points in the core regions (here coloured red). Residues in the yellow, disallowed regions are shown with red labels, so if a model has many labelled residues, it might be an idea to look for an alternative. Clicking on the plot goes to a page showing the summary results from the PROCHECK quality assessment program [32], and from this page you can generate a full PROCHECK report. This is particularly useful for user-submitted structural models.
Enzyme Reactions
For enzymes, the relevant reaction catalysed by the enzyme is shown by a reaction diagram where possible. If any of
Fig. 4 Pfam domain diagrams in PDBsum showing the relationship between 3D structural models and their corresponding UniProt sequence. a A schematic Pfam diagram taken from the PDBsum atlas page for PDB entry 2cgf, the N-terminal domain of the yeast HSP90 chaperone. The Pfam domains are represented by the cylinders. Beneath these is shown how much of the protein is present in its 3D structural model in PDB entry 2cgf. This shows that, indeed, the structural model corresponds to only the N-terminal domain of this protein. Clicking on the ? icon returns all other PDB entries for the given UniProt sequence (HSP82_YEAST), as shown in (c), while clicking on the triangular icon on the right runs the ArchSchema Java program that shows all other protein architectures containing one or both of these Pfam domains. b Part of the ArchSchema network diagram for HSP82_YEAST. Its domain architecture is represented by the slightly enlarged node to the left of centre. The coloured boxes represent the two Pfam domains in (a). The other nodes in the network correspond to the most similar protein architectures containing one or both of these domains. Small boxes represent Pfam-B domains. Underlines indicate that the PDB contains 3D structural models for proteins having these architectures. For clarity, more distant nodes in the network have been removed. c The top 10 PDB entries for HSP82_YEAST, showing that the structures tend to be of either the N- or C-terminal domains. However, the top structure, PDB entry 2cg9, provides the most complete structural model of this protein, albeit at very low resolution and, at the very least, can provide information on how the two domains pack together in 3D (Color figure online)
the ligands bound to the protein correspond to any of the reactants, cofactors or products, the corresponding molecule in the diagram is boxed in red. If a ligand is merely similar to one of these, a blue box surrounds the molecule instead and a percentage similarity is quoted.
Figures from Key References
The majority of experimentally determined protein structures are reported in the scientific literature, often in high-profile journals, and each PDB file cites the "key" reference, i.e. the one describing the structure determination, analysis and biological significance of the protein. Like the other atlas sites, PDBsum cites this reference, shows its abstract and provides links to both the PubMed entry and to the online version of the article. Where PDBsum differs is that for many of these references it also gives one or two figures (plus figure legends) taken directly from the key reference itself [33]. This is done with permission from the relevant publishers and is useful for two reasons. Firstly, a carefully selected figure can speak volumes about an important aspect of the protein's structure or function. Secondly, each paper's lead author is requested to review the figures selected by PDBsum's automated process and, if need be, suggest better choices. About one in six authors takes the trouble to do this, and some even add an additional comment to appear on the entry's summary page (e.g. PDB entry 1hz0). Additionally, PDBsum provides a list of more recent scientific articles that cite the key reference. This can provide useful information on recent work on the protein in question. The citations come from the EBI's CiteXplore database (http://www.ebi.ac.uk/citexplore), plus some additional automated harvesting, although at present not all journals are covered.
Secondary Structure and Topology Diagrams
From the summary page are various additional pages giving schematic diagrams of different aspects of the 3D structure.
The Protein page shows a diagram of the chain's secondary structure elements, much like the RCSB's diagram shown in Fig. 2. Additional features include the annotation of residues that are catalytic (as defined in the Catalytic Site Atlas (CSA) [34]), or are included in the SITE records of the PDB file, or interact with a ligand, DNA/RNA or metal, or belong to a PROSITE pattern [35]. CATH structural domains are marked on the sequence, in contrast to the RCSB's diagram which uses SCOP. Where there is information on the conservation of each residue in the sequence (obtained from ConSurf-HSSP [36]), the secondary structure plot can be redisplayed with the residues coloured by their conservation.
Next to the secondary structure plot is a topology diagram either of the whole chain or, where it has been divided into its constituent CATH domains, of each domain (Fig. 5). The diagram shows the connectivity of the secondary structure elements, with the constituent β-strands of each β-sheet laid side-by-side, parallel or anti-parallel, to show how each sheet in the chain/domain is formed, and where any helices are found relative to the sheets.
Intermolecular Interactions
Some of the other pages for each PDB entry are devoted to schematic representations of intermolecular interactions. Thus, for each ligand molecule, or metal ion, in the structure there is a schematic LIGPLOT diagram [37] of the hydrogen bonds and non-bonded interactions between
Fig. 5 A topology diagram taken from PDBsum for the second domain of chain A in PDB entry 3ib0: a bovine lactoferrin. The diagram illustrates how the β-strands, represented by the block arrows, join up, side-by-side, to form the domain's central β-sheet. The diagram also shows the relative locations of the α-helices, here represented by cylinders. The small arrow indicates the directionality of the protein chain, from the N- to the C-terminus. The numbers within the secondary structural elements correspond to the residue numbering given in the PDB file
Fig. 6 LIGPLOT for PDB entry 2oiq, tyrosine kinase c-Src, as given in PDBsum, showing the interactions between the bound molecule imatinib (a drug, brand name Gleevec) and the residues of the protein. Hydrogen bonds are represented by dashed lines. Residues that interact with the ligand via non-bonded contacts only are represented by the "eyelashes"
the molecule and the residues of the protein to which it is bound (see Fig. 6). Similarly, any DNA-protein interactions are schematically depicted by a NUCPLOT diagram [38]. Protein-protein interactions at the interface between two or more chains are shown by two plots: the first shows an overview of which chains interact with which (Fig. 7b), while the second shows which residues actually interact across the interface (Fig. 7c).
Homology Models and Obsolete Entries
Homology Modelling Servers
As mentioned above, there were nearly 70,000 structural models in the wwPDB as of December 2010. However, some of these were not of proteins and many were duplicates (i.e. the same protein solved under different conditions, or with different ligands bound, or with one or more point mutations). In terms of unique protein sequences, as defined by the UniProt identifier, this 70,000 corresponded
to only about 17,000 unique sequences. (Compare this number with the 105 million sequences in EMBL-Bank [39].) Moreover, for many of these, the 3D structure represents only a part of the full sequence, say merely a fragment or a single domain. Thus, if one is interested in a protein for which there are no 3D coordinates available, or the coordinates are only of part of the protein, it is common to build a homology model based on the 3D structural model of a closely related protein (if there is one). The PDB used to accept homology-built models together with the experimentally determined ones but, as of 1 July 2002, moved its holding of theoretical models out of the standard PDB archive to a separate ftp site and then, as of 15 October 2006, stopped accepting any new ones. As of December 2010 there were only 1,358 models on the ftp site so, with such a small number, it is unlikely that one's protein of interest will be among them. The alternative is to build a homology model oneself, and there are various servers that will perform the process largely, or completely, automatically. The best-known is SWISS-MODEL [40]. This accepts a protein sequence and
123
Mol Biotechnol (2011) 48:183198 Fig. 7 Extracts from the proteinprotein interaction diagrams in PDBsum for PDB entry 1cow, bovine mitochondrial F1-ATPase. a Thumbnail image of the 3D structural model which contains seven protein chains: three of ATPA1_BOVIN (chains A, B and C), three of ATPB_BOVIN (chains D, E and F) and a fragment of ATPG_BOVIN (chain G). b Schematic diagram showing the interactions between the chains. The area of each circle is proportional to the surface area of the corresponding protein chain. The extent of the interface region on each chain is represented by a coloured wedge whose colour corresponds to the colour of the other chain and whose size signies the interface surface area. c A schematic diagram showing the residueresidue interactions across one of the interfaces, namely that between chains D and G. Hydrogen bonds and salt bridges are shown as solid lines, while nonbonded contacts are represented by dashed lines (Color gure online)
193
will return a 3D model if it is able to build one. More advanced users can submit multiple sequence alignments and manually rene the nal model. It is important to remember that any homology-built model will, at best, be imperfect and at worst totally misleadingparticularly if one or more of the structural models that act as a template for the model contain errors. So a key part of SWISSMODEL is the various validation checks applied to each model to provide the user with an idea of its likely quality. Table 2i shows a list of automated homology modelling web servers. These are regularly tested by the EVA server [41] which produces statistics on accuracy and ranks the servers by various criteria (Table 2ii). Aside from building a model yourself, it may be possible to download a ready-built, off-the-shelf one. The SWISS-MODEL Repository [42] contained over 3 million models in December 2010, each accessible by its UniProt accession number or identier. Similarly ModBase [43] contains a large number of pre-computed models for sequences in the SwissProt and TrEMBL databases14 million models for 3 million proteins in December 2010. Table 2iii gives the URLs and references for these servers.
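The gap between the number of deposited models and the number of unique sequences, noted at the start of this section, comes from grouping entries by UniProt identifier. The following is a minimal sketch of that grouping, not the procedure actually used to obtain the figures quoted. The function name and input layout are assumptions, and every PDB code other than 1cow is a made-up placeholder.

```python
from collections import defaultdict


def group_models_by_protein(entry_to_uniprot):
    """Group PDB codes by the UniProt identifiers of their chains.

    entry_to_uniprot: dict mapping a PDB code to the set of UniProt
    identifiers found in that entry (an empty set for an entry that
    contains no protein). This input layout is an assumption.
    """
    models_per_protein = defaultdict(set)
    for pdb_code, uniprot_ids in entry_to_uniprot.items():
        for uniprot_id in uniprot_ids:
            models_per_protein[uniprot_id].add(pdb_code)
    return dict(models_per_protein)


# 1cow and its three UniProt identifiers come from the Fig. 7 caption;
# the other codes are hypothetical placeholders, not real PDB entries.
entries = {
    "1cow": {"ATPA1_BOVIN", "ATPB_BOVIN", "ATPG_BOVIN"},
    "0xx1": {"ATPB_BOVIN"},  # hypothetical: same protein, different ligand
    "0xx2": {"ATPB_BOVIN"},  # hypothetical: same protein, point mutant
    "0xx3": set(),           # hypothetical: entry containing no protein
}

grouped = group_models_by_protein(entries)
print(len(entries), "models cover", len(grouped), "unique sequences")
for uniprot_id in sorted(grouped):
    print(uniprot_id, sorted(grouped[uniprot_id]))
```

In this toy input, four models collapse to three unique sequences. Duplicates and non-protein entries shrink the count in the same way across the full archive.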
Server                  Location                                     URL                                                 References
(i) Automatic homology modelling
3D-JIGSAW               Imperial Cancer Research Fund, UK            http://www.bmm.icnet.uk/servers/3djigsaw            [54]
CPHmodels               Technical University of Denmark              http://www.cbs.dtu.dk/services/CPHmodels            [55]
ESyPred3D               University of Namur, Belgium                 http://www.fundp.ac.be/urbm/bioinfo/esypred         [56]
SWISS-MODEL             Biozentrum Basel, Switzerland                http://swissmodel.expasy.org                        [40]
(ii) Evaluation of modelling servers
EVA                     Columbia University, USA                     http://cubic.bioc.columbia.edu/eva                  [41]
(iii) Pre-computed homology models
SWISS-MODEL Repository  Biozentrum Basel, Switzerland                http://swissmodel.expasy.org/repository             [42]
ModBase                 University of California San Francisco, USA  http://modbase.compbio.ucsf.edu                     [43]
PDB archive             wwPDB                                        ftp://ftp.wwpdb.org/pub/pdb/data/structures/models

Threading Servers

What about cases where there is no sufficiently similar protein of known structure and thus no possibility of building a homology model? In these cases, it is sometimes necessary to resort to desperate measures such as secondary structure prediction and fold recognition, or threading. The results from these methods need to be treated with extreme care. Occasionally, these methods approximate the right answer (usually for small, single-domain proteins, where they may produce topologically near-correct models [44]), but generally they are wildly wrong and so should be used only as a last resort. A full list of these servers can be found on the LiveBench web site (http://bioinfo.pl/LiveBench), which regularly evaluates them [45].

Obsolete Entries

As experimental methods improve, better data sets are collected, or earlier errors are detected, some structural models in the PDB become obsolete. Many are replaced by improved structural models, whereas others are simply quietly withdrawn. None of these obsolete entries disappear entirely, though. Some of the atlases mentioned above include the obsolete entries together with the current ones, and there used to be a special database devoted to their memory: the Archive of Obsolete PDB Entries. Sadly, this, too, is now obsolete. However, the original PDB files can still be found on the wwPDB's ftp servers.
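Because an obsolete entry may itself be replaced by a model that later becomes obsolete, finding the current structure can mean following a chain of replacements. The sketch below assumes a table recording which entry superseded which. The dictionary layout, function name and PDB codes are all hypothetical and do not represent any wwPDB file format or service.

```python
def resolve_entry(pdb_code, superseded_by):
    """Follow replacement links from pdb_code to the current entry.

    superseded_by maps an obsolete PDB code to the code that replaced
    it, or to None if the entry was withdrawn without replacement.
    Codes absent from the mapping are treated as current.
    Returns the current code, or None if the chain ends in a withdrawal.
    """
    seen = set()
    code = pdb_code
    while code in superseded_by:
        if code in seen:
            raise ValueError(f"circular replacement chain at {code!r}")
        seen.add(code)
        successor = superseded_by[code]
        if successor is None:
            return None  # withdrawn with no replacement
        code = successor
    return code


# Hypothetical replacement table; none of these are real PDB codes.
replacements = {
    "0aaa": "0bbb",  # replaced by an improved model ...
    "0bbb": "0ccc",  # ... which was itself later replaced
    "0ddd": None,    # quietly withdrawn
}

print(resolve_entry("0aaa", replacements))
print(resolve_entry("0ddd", replacements))
```

The `seen` set guards against a malformed table in which two entries point at each other. Without it the loop would never terminate.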
Fold Databases

Classification Schemes

There are currently around 900 known fold groups [46]. Many proteins comprise more than one structural domain, with each domain being described by its own fold and often able to fold up independently of the rest of the protein.
Table 3 Fold classification and comparison servers

Server          Location                                         URL                                                         References
(i) Fold classification
CATH            University College London, UK                    http://www.cathdb.info                                      [57]
SCOP            University of Cambridge, UK                      http://scop.mrc-lmb.cam.ac.uk/scop                          [13]
(ii) Fold comparison
CE              University of California San Diego, USA          http://cl.sdsc.edu/ce.html                                  [58]
Dali            University of Helsinki, Finland                  http://ekhidna.biocenter.helsinki.fi/dali_server            [59]
DBAli           University of California San Francisco, USA      http://www.salilab.org/DBAli/                               [60]
FATCAT          Burnham Institute, USA                           http://fatcat.burnham.org                                   [61]
MATRAS          Nara Institute of Science and Technology, Japan  http://biunit.aist-nara.ac.jp/matras                        [62]
SSM (PDBeFold)  European Bioinformatics Institute, UK            http://www.ebi.ac.uk/msd-srv/ssm                            [26]
TOPSCAN         University College London, UK                    http://www.bioinf.org.uk/topscan                            [63]
VAST            NCBI, USA                                        http://www.ncbi.nlm.nih.gov/Structure/VAST/vastsearch.html  [64]

There have been a number of efforts to classify protein domains in a hierarchical manner. The two current market leaders in this field are the SCOP and CATH hierarchical classification systems (see Table 3i). In CATH, protein structures are classified using a combination of automated and manual procedures, with four major levels in the hierarchy: Class, Architecture, Topology (fold family) and Homologous superfamily [14]. In SCOP, the classification is more manual, although some automated methods are employed. Comparisons between the two classification schemes have shown there to be much in common, although there are differences, primarily in how the structures are chopped into domains [47]. Recently, it has become apparent that protein folds are not the discrete units that these classification schemes might imply, but rather that protein structure space is a continuum [48]. However, the two databases are very
valuable resources because they group domains by their evolutionary relationships even where this is not apparent from any similarities in the sequences.

Fold Comparison

Often a given structural domain is associated with a specific biological function. However, the so-called superfolds, which are more common than other folds, tend to be responsible for a wide range of functions [49]. There are a large number of web servers that can identify all proteins sharing a given protein's fold. The main problem is knowing which server to use: each uses a different algorithm or has a different way of assessing the significance of a match. Table 3ii lists a selection of the more popular servers. A fuller list, together with brief descriptions of the algorithms, can be found in published comparisons of these servers [50, 51].
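The four-level CATH hierarchy described under Classification Schemes lends itself to a simple data structure. The sketch below assumes a domain's classification is written as four dot-separated numbers, one per level. That notation, the type and function names, and the example codes are illustrative assumptions and not part of any CATH software.

```python
from typing import NamedTuple, Optional

# The four major CATH levels, in hierarchical order.
LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted code such as '3.40.50.300' into its four levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(part) for part in parts))


def deepest_shared_level(a: CathCode, b: CathCode) -> Optional[str]:
    """Name the deepest level at which two domains are classified together."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return LEVELS[depth - 1] if depth else None


first = parse_cath_code("3.40.50.300")
second = parse_cath_code("3.40.50.720")
print(deepest_shared_level(first, second))
```

Two domains that agree down to Topology but differ at the last level share a fold without being placed in the same homologous superfamily. That is the kind of relationship these classifications are used to look up.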
Miscellaneous Databases

Selection of Data Sets

For any bioinformatics analysis involving 3D structural models it is important to get a valid and representative data set of models of as high quality as possible. To help in this process there are various servers that allow you to obtain such lists based on various selection criteria. Table 4 lists several such servers.

Uppsala Electron Density Server (EDS)

As has been mentioned a couple of times already, a key aspect of any structural model is how reliably it represents the protein in question. A poor-quality model limits what structural or functional conclusions can be drawn from it. For X-ray models, in addition to the geometrical checks mentioned above, the most useful guide to reliability is how well the model agrees with the experimental data on which it was based. The Uppsala Electron Density Server, EDS [52], displays the electron density maps for PDB entries for which the experimental structure factors are available. The server also provides various useful statistics about the models. For example, the plots of the real-space R-factor (RSR) indicate how well each residue fits its electron density; any tall red spikes are regions to be wary of. Other useful plots include the occupancy-weighted average temperature factor and a Z-score associated with the residue's RSR for the given resolution.

Curiosities

Finally, there are various sites which deal with slightly more offbeat aspects of protein structure. Some are included in Table 5. A couple detect knots in protein folds: Protein Knots and the pKnot web server. The former lists 44 PDB entries containing knotted proteins, classified according to type of knot. Another interesting site, which can while away part of an afternoon, is the Database of Macromolecular Movements, which holds many movies showing proteins in motion. Also included is a Morph Server, which will produce 2D and 3D animations by interpolating between two submitted protein conformations; this is very useful for producing animations for presentations or web sites.
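To give a feel for what interpolating between two conformations involves, here is a deliberately naive sketch that moves every atom along a straight line between its start and end positions. It is not the Morph Server's algorithm, which is not described here. The function name and the coordinates are illustrative assumptions, and straight-line paths can distort bond lengths in the intermediate frames.

```python
def interpolate_conformations(start, end, n_frames):
    """Return n_frames coordinate sets running linearly from start to end.

    start, end: equal-length lists of (x, y, z) tuples, one per atom, with
    atoms in the same order in both conformations.
    n_frames: total number of frames, including both end points.
    """
    if len(start) != len(end):
        raise ValueError("conformations must contain the same number of atoms")
    if n_frames < 2:
        raise ValueError("need at least two frames (the two end points)")
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        frame = [
            tuple(a + t * (b - a) for a, b in zip(atom_start, atom_end))
            for atom_start, atom_end in zip(start, end)
        ]
        frames.append(frame)
    return frames


# Two made-up two-atom "conformations", purely for illustration.
open_form = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
closed_form = [(0.0, 2.0, 0.0), (1.0, 0.0, 4.0)]

frames = interpolate_conformations(open_form, closed_form, 3)
for frame in frames:
    print(frame)
```

A realistic morph would typically superpose the two structures first and regularise the geometry of each intermediate frame. Both steps are omitted here to keep the idea visible.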
Table 4 Selection of data sets

Server                 Location                                          URL                                                               References
ASTRAL                 University of Berkeley, USA                       http://astral.berkeley.edu                                        [65]
JenaLib (entry lists)  Fritz Lipmann Institute, Jena, Germany            http://www.imb-jena.de/ImgLibPDB/pages/entry_list-customize.html  [29]
PDBeselect             European Bioinformatics Institute, UK             http://www.ebi.ac.uk/pdbe-as/pdbeselect
PDBselect              University of Applied Sciences, Giessen, Germany  http://bioinfo.tg.fh-giessen.de/pdbselect                         [66]
PISCES                 Fox Chase Cancer Center, Philadelphia, USA        http://dunbrack.fccc.edu/PISCES.php                               [67]

Table 5 Miscellaneous servers

Server                                         Location                                     URL                                              References
3D Complex                                     MRC, Cambridge, UK                           http://www.supfam.org/elevy/3dcomplex/Home.cgi
Allosteric Database (ASD)                      Shanghai Jiaotong University, China          http://mdl.shsmu.edu.cn/ASD                      [68]
Database of Macromolecular Movements           Yale, USA                                    http://molmovdb.org                              [69]
Electron Density Server (EDS)                  Uppsala, Sweden                              http://eds.bmc.uu.se/eds                         [52]
Orientations of Proteins in Membranes (OPM)    University of Michigan, USA                  http://opm.phar.umich.edu                        [70]
pKnot server                                   National Chiao Tung University, Taiwan       http://pknot.life.nctu.edu.tw                    [71]
Protein Knots                                  Massachusetts Institute of Technology, USA   http://knots.mit.edu                             [72]

Summary

This review has described some of the more generally useful protein structure databases. There are many, many more that have not been mentioned. Some are very small and specialized, such as the so-called hobby databases, created by a single researcher and lovingly crafted and conscientiously updated until, e.g., the funding runs out, or the researcher moves on to another post and the database is abandoned and neglected. The larger and more widely used databases have better resources to keep them ticking over, but tend to suffer from a great deal of duplication and overlap. This can be seen in the large numbers of PDB atlases and fold comparison servers. Perhaps one day, a single server of each type will emerge combining the finer aspects of all others to make life a lot easier for the end users of the data.
References
1. Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Jr., Brice, M. D., Rodgers, J. R., et al. (1977). The Protein Data Bank: a computer-based archival file of macromolecular structures. Journal of Molecular Biology, 112, 535-542.
2. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., et al. (2000). The Protein Data Bank. Nucleic Acids Research, 28, 235-242.
3. Berman, H. M., Henrick, K., & Nakamura, H. (2003). Announcing the worldwide Protein Data Bank. Nature Structural Biology, 10, 980.
4. Westbrook, J., & Fitzgerald, P. M. (2003). The PDB format, mmCIF, and other data formats. Methods of Biochemical Analysis, 44, 161-179.
5. Westbrook, J., Ito, N., Nakamura, H., Henrick, K., & Berman, H. M. (2005). PDBML: The representation of archival macromolecular structure data in XML. Bioinformatics, 21, 988-992.
6. Brändén, C.-I., & Jones, T. A. (1990). Between objectivity and subjectivity. Nature, 343, 687-689.
7. Hooft, R. W. W., Vriend, G., Sander, C., & Abola, E. E. (1996). Errors in protein structures. Nature, 381, 272.
8. Kleywegt, G. J. (2000). Validation of protein crystal structures. Acta Crystallographica, D56, 249-265.
9. Laskowski, R. A. (2009). Structural quality assurance. In J. Gu & P. E. Bourne (Eds.), Structural Bioinformatics (2nd ed., pp. 341-375). Hoboken, NJ: Wiley.
10. Brown, E. N., & Ramaswamy, S. (2007). Quality of protein crystal structures. Acta Crystallographica, D63, 941-950.
11. Krissinel, E., & Henrick, K. (2007). Inference of macromolecular assemblies from crystalline state. Journal of Molecular Biology, 372, 774-797.
12. Hühne, R., Koch, F. T., & Sühnel, J. (2007). A comparative view at comprehensive information resources on three-dimensional structures of biological macro-molecules. Briefings in Functional Genomics & Proteomics, 6, 220-239.
13. Murzin, A. G., Brenner, S. E., Hubbard, T., & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247, 536-540.
14. Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B., & Thornton, J. M. (1997). CATH: a hierarchic classification of protein domain structures. Structure, 5, 1093-1108.
15. Finn, R. D., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., et al. (2006). Pfam: Clans, web tools and services. Nucleic Acids Research, 34, D247-D251.
16. Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., et al. (2004). The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Research, 32, D262-D266.
17. Hanson, R. M. (2010). Jmol: A paradigm shift in crystallographic visualization. Journal of Applied Crystallography, 43, 1250-1260.
18. Moreland, J. L., Gramada, A., Buzko, O. V., Zhang, Q., & Bourne, P. E. (2005). The Molecular Biology Toolkit (MBT): A modular platform for developing molecular visualization applications. BMC Bioinformatics, 6, 21.
19. Stierand, K., Maass, P. C., & Rarey, M. (2006). Molecular complexes at a glance: Automated generation of two-dimensional complex diagrams. Bioinformatics, 22, 1710-1716.
20. Lovell, S. C., Davis, I. W., Arendall, W. B., III, de Bakker, P. I. W., Word, J. M., Prisant, M. G., et al. (2003). Structure validation by C-alpha geometry: phi, psi, and C-beta deviation. Proteins: Structure, Function, and Genetics, 50, 437-450.
21. Boutselakis, H., Dimitropoulos, D., Fillon, J., Golovin, A., Henrick, K., Hussain, A., et al. (2003). E-MSD: The European Bioinformatics Institute Macromolecular Structure Database. Nucleic Acids Research, 31, 458-462.
22. Golovin, A., Oldfield, T. J., Tate, J. G., Velankar, S., Barton, G. J., Boutselakis, H., et al. (2004). E-MSD: An integrated data resource for bioinformatics. Nucleic Acids Research, 32, D211-D216.
23. Velankar, S., McNeil, P., Mittard-Runte, V., Suarez, A., Barrell, D., Apweiler, R., et al. (2005). E-MSD: An integrated data resource for bioinformatics. Nucleic Acids Research, 33, D262-D265.
24. Tagari, M., Tate, J., Swaminathan, G. J., Newman, R., Naim, A., Vranken, W., et al. (2006). E-MSD: Improving data deposition and structure quality. Nucleic Acids Research, 34, D287-D290.
25. Golovin, A., Dimitropoulos, D., Oldfield, T., Rachedi, A., & Henrick, K. (2005). MSDsite: A database search and retrieval system for the analysis and viewing of bound ligands and active sites. Proteins, 58, 190-199.
26. Krissinel, E., & Henrick, K. (2004). Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallographica, D60, 2256-2268.
27. Hartshorn, M. J. (2002). AstexViewer: A visualisation aid for structure-based drug design. Journal of Computer-Aided Molecular Design, 16, 871-881.
28. Oldfield, T. J. (2004). A Java applet for multiple linked visualization of protein structure and sequence. Journal of Computer-Aided Molecular Design, 18, 225-234.
29. Reichert, J., & Sühnel, J. (2002). The IMB Jena Image Library of Biological Macromolecules: 2002 update. Nucleic Acids Research, 30, 253-254.
30. Laskowski, R. A. (2009). PDBsum new things. Nucleic Acids Research, 37, D355-D359.
31. Tamuri, A. U., & Laskowski, R. A. (2010). ArchSchema: A tool for interactive graphing of related Pfam domain architectures. Bioinformatics, 26, 1260-1261.
32. Laskowski, R. A., MacArthur, M. W., Moss, D. S., & Thornton, J. M. (1993). PROCHECK: a program to check the stereochemical quality of protein structures. Journal of Applied Crystallography, 26, 283-291.
33. Laskowski, R. A. (2007). Enhancing the functional annotation of PDB structures in PDBsum using key figures extracted from the literature. Bioinformatics, 23, 1824-1827.
34. Porter, C. T., Bartlett, G. J., & Thornton, J. M. (2004). The Catalytic Site Atlas: A resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Research, 32, D129-D133.
35. Sigrist, C. J. A., Cerutti, L., Hulo, N., Gattiker, A., Falquet, L., Pagni, M., et al. (2002). PROSITE: A documented database using patterns and profiles as motif descriptors. Briefings in Bioinformatics, 3, 265-274.
36. Glaser, F., Rosenberg, Y., Kessel, A., Pupko, T., & Ben Tal, N. (2004). The ConSurf-HSSP database: The mapping of evolutionary conservation among homologs onto PDB structures. Proteins, 58, 610-617.
37. Wallace, A. C., Laskowski, R. A., & Thornton, J. M. (1995). LIGPLOT: A program to generate schematic diagrams of protein-ligand interactions. Protein Engineering, 8, 127-134.
38. Luscombe, N. M., Laskowski, R. A., & Thornton, J. M. (1997). NUCPLOT: A program to generate schematic diagrams of protein-nucleic acid interactions. Nucleic Acids Research, 25, 4940-4945.
39. Kulikova, T., Akhtar, R., Aldebert, P., Althorpe, N., Andersson, M., Baldwin, A., et al. (2007). EMBL Nucleotide Sequence Database in 2006. Nucleic Acids Research, 35, D16-D20.
40. Schwede, T., Kopp, J., Guex, N., & Peitsch, M. C. (2003). SWISS-MODEL: An automated protein-homology server. Nucleic Acids Research, 31, 3381-3385.
41. Eyrich, V. A., Marti-Renom, M. A., Przybylski, D., Madhusudhan, M. S., Fiser, A., Pazos, F., et al. (2001). EVA: Continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242-1243.
42. Kopp, J., & Schwede, T. (2004). The SWISS-MODEL Repository of annotated three-dimensional protein structure homology models. Nucleic Acids Research, 32, D230-D234.
43. Pieper, U., Eswar, N., Braberg, H., Madhusudhan, M. S., Davis, F. P., Stuart, A. C., et al. (2004). MODBASE: A database of annotated comparative protein structure models and associated resources. Nucleic Acids Research, 32, D217-D222.
44. Moult, J. (2005). A decade of CASP: Progress, bottlenecks and prognosis in protein structure prediction. Current Opinion in Structural Biology, 15, 285-289.
45. Bujnicki, J. M., Elofsson, A., Fischer, D., & Rychlewski, L. (2001). Livebench-1: Continuous benchmarking of protein structure prediction servers. Protein Science, 10, 352-361.
46. Marsden, R. L., Ranea, J. A. G., Sillero, A., Redfern, O., Yeats, C., Maibaum, M., et al. (2006). Exploiting protein structure data to explore the evolution of protein function and biological complexity. Philosophical Transactions of the Royal Society B: Biological Sciences, 361, 425-440.
47. Jefferson, E. R., Walsh, T. P., & Barton, G. J. (2008). A comparison of SCOP and CATH with respect to domain-domain interactions. Proteins, 70, 54-62.
48. Kolodny, R., Petrey, D., & Honig, B. (2006). Protein structure comparison: Implications for the nature of fold space, and structure and function prediction. Current Opinion in Structural Biology, 16, 393-398.
49. Orengo, C. A., Jones, D. T., & Thornton, J. M. (1994). Protein superfamilies and domain superfolds. Nature, 372, 631-634.
50. Novotny, M., Madsen, D., & Kleywegt, G. J. (2004). Evaluation of protein fold comparison servers. Proteins, 54, 260-270.
51. Carugo, O. (2006). Rapid methods for comparing protein structures and scanning structure databases. Current Bioinformatics, 1, 75-83.
52. Kleywegt, G. J., Harris, M. R., Zou, J-y, Taylor, T. C., Wählby, A., & Jones, T. A. (2004). The Uppsala Electron-Density Server. Acta Crystallographica, D60, 2240-2249.
53. Chen, J., Anderson, J. B., DeWeese-Scott, C., Fedorova, N. D., Geer, L. Y., He, S., et al. (2003). MMDB: Entrez's 3D-structure database. Nucleic Acids Research, 31, 474-477.
54. Bates, P. A., Kelley, L. A., MacCallum, R. M., & Sternberg, M. J. E. (2001). Enhancement of protein modelling by human intervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM. Proteins, 5, 39-46.
55. Lund, O., Frimand, K., Gorodkin, J., Bohr, H., Bohr, J., Hansen, J., et al. (1997). Protein distance constraints predicted by neural networks and probability density functions. Protein Engineering, 10, 1241-1248.
56. Lambert, C., Leonard, N., De Bolle, X., & Depiereux, E. (2002). ESyPred3D: Prediction of proteins 3D structures. Bioinformatics, 18, 1250-1256.
57. Pearl, F., Todd, A., Sillitoe, I., Dibley, M., Redfern, O., Lewis, T., et al. (2005). The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Research, 33, D247-D251.
58. Shindyalov, I. N., & Bourne, P. E. (1998). Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering, 11, 739-747.
59. Holm, L., & Sander, C. (1996). Mapping the protein universe. Science, 273, 595-603.
60. Marti-Renom, M. A., Pieper, U., Madhusudhan, M. S., Rossi, A., Eswar, N., Davis, F. P., et al. (2007). DBAli tools: Mining the protein structure space. Nucleic Acids Research, 35, W393-W397.
61. Ye, Y., & Godzik, A. (2003). Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics, 19, ii246-ii255.
62. Kawabata, T. (2003). MATRAS: A program for protein 3D structure comparison. Nucleic Acids Research, 31, 3367-3369.
63. Martin, A. C. R. (2000). The ups and downs of protein topology; rapid comparison of protein structure. Protein Engineering, 13, 829-837.
64. Gibrat, J. F., Madej, T., & Bryant, S. H. (1996). Surprising similarities in structure comparison. Current Opinion in Structural Biology, 6, 377-385.
65. Chandonia, J. M., Hon, G., Walker, N. S., Lo Conte, L., Koehl, P., Levitt, M., et al. (2004). The ASTRAL compendium in 2004. Nucleic Acids Research, 32, D189-D192.
66. Hobohm, U., Scharf, M., Schneider, R., & Sander, C. (1992). Selection of representative protein data sets. Protein Science, 1, 409-417.
67. Wang, G., & Dunbrack, R. L., Jr. (2003). PISCES: A protein sequence culling server. Bioinformatics, 19, 1589-1591.
68. Huang, Z., Zhu, L., Cao, Y., Wu, G., Liu, X., Chen, Y., Wang, Q., Shi, T., Zhao, Y., Wang, Y., Li, W., Li, Y., Chen, H., Chen, G., & Zhang, J. (2011). ASD: A comprehensive database of allosteric proteins and modulators. Nucleic Acids Research, 39, D663-D669.
69. Gerstein, M., & Krebs, W. (1998). A database of macromolecular motions. Nucleic Acids Research, 26, 4280-4290.
70. Lomize, M. A., Lomize, A. L., Pogozheva, I. D., & Mosberg, H. I. (2006). OPM: Orientations of proteins in membranes database. Bioinformatics, 22, 623-625.
71. Lai, Y. L., Yen, S. C., Yu, S. H., & Hwang, J. K. (2007). pKNOT: The protein KNOT web server. Nucleic Acids Research, 35, W420-W424.
72. Kolesov, G., Virnau, P., Kardar, M., & Mirny, L. A. (2007). Protein knot server: Detection of knots in protein structures. Nucleic Acids Research, 35, W425-W428.