Abstract
The availability of human genome sequence has transformed biomedical research over the past decade. However, an equivalent map for the human proteome with direct measurements of proteins and peptides does not exist yet. Here we present a draft map of the human proteome using high-resolution Fourier-transform mass spectrometry. In-depth proteomic profiling of 30 histologically normal human samples, including 17 adult tissues, 7 fetal tissues and 6 purified primary haematopoietic cells, resulted in identification of proteins encoded by 17,294 genes accounting for approximately 84% of the total annotated protein-coding genes in humans. A unique and comprehensive strategy for proteogenomic analysis enabled us to discover a number of novel protein-coding regions, which include translated pseudogenes, non-coding RNAs and upstream open reading frames. This large human proteome catalogue (available as an interactive web-based resource at http://www.humanproteomemap.org) will complement available human genome and transcriptome data to accelerate biomedical research in health and disease.
Acknowledgements
We would like to acknowledge the National Development and Research Institutes for some of the tissues. We acknowledge the assistance of V. Sandhya, V. Puttamallesh, U. Guha and B. Cole for help with analysis of some of the samples. We thank L. Lane and B. Amos for their assistance with the list of missing genes. This work was supported by an NIH roadmap grant for Technology Centers of Networks and Pathways (U54GM103520), NCI's Clinical Proteomic Tumor Analysis Consortium initiative (U24CA160036), a contract (HHSN268201000032C) from the National Heart, Lung and Blood Institute and the Sol Goldman Pancreatic Cancer Research Center. The authors acknowledge the joint participation by the Adrienne Helis Malvin Medical Research Foundation and the Diana Helis Henry Medical Research Foundation through its direct engagement in the continuous active conduct of medical research in conjunction with The Johns Hopkins Hospital and the Johns Hopkins University School of Medicine and the Foundation's Parkinson's Disease Programs. The analysis work was partially supported by the National Resource for Network Biology (P41GM103504). A.Mah., S.K.Sh., P.S. and T.S.K.P. are supported by DBT Program Support on Neuroproteomics (BT/01/COE/08/05) to IOB and NIMHANS. H.G. is a Wellcome Trust-DBT India Alliance Early Career Fellow. We thank Council of Scientific and Industrial Research, University Grants Commission and Department of Science and Technology, Government of India for research fellowships for S.M.P., R.S.N., A.R., M.K., G.J.S., S.C., P.R., J.S., S.S.M., D.S.K., S.R., S.K.Sr., K.K.D., Y.S., A.S., S.D.Y., N.S., S.A. and G.D.
Author information
Contributions
A.P., H.G., R.C., M.-S.K. designed the study; A.P., H.G., M.-S.K. managed the study; D.G., C.L.K., C.A.I.-D., K.R.M. collected human cells/tissues; M.-S.K., R.C., D.G. developed the pipeline of experiment and analysis; D.G., M.-S.K., S.M.P., K.M., R.C., S.R., J.Z., X.W., P.G.S., M.S.Z., T.-C.H. prepared peptide samples for LC-MS/MS; M.-S.K., R.S.N., S.M.P., R.C., D.S.K., S.R., G.J.S. performed LC-MS/MS; M.-S.K., S.M.P., S.P., S.S.M., C.J.M., J.A. and A.K.M. processed MS data and managed data; A.K.M., S.S.M., B.G., A.H.P., Y.S., M.-S.K. performed comparison analysis with PeptideAtlas, neXtProt and GPMDB; R.I., S.Jai., G.D.B. performed interaction and complex analysis; M.-S.K., S.M.P., S.S.M., P.K., A.K.M., N.A.S., R.S.N., L.B., L.D.N.S., D.S.K., V.N., A.R., T.S., M.K., S.K.Sr., G.D., A.Mar., R.R., S.C., K.K.D., A.S., S.D.Y., S.Jay., P.R., A.H.P., B.G., J.S., N.S., R.G., G.J.S., A.A.K., S.A., D.F., T.S.K.P., H.G., A.P. performed proteogenomic analysis; A.C., H.L., R.S., J.T.S., K.K.M., S.S., A.Mah., S.K.Sh., P.S., S.D.L., C.G.D., A.Mai., M.K.H., R.H.H., C.L.K., C.A.I.-D. assisted with analysis of the data; M.-S.K., S.M.P., T.-C.H., P.L.-R. performed western blot experiments; M.-S.K., J.K.T., A.K.M., B.M., S.P., S.M.P. designed the Human Proteome Map web portal; M.-S.K., A.K.M., J.K.T. generated selected reaction monitoring (SRM) database; M.-S.K., K.M., G.D., S.M.P., S.S.M. illustrated figures with help of other authors; A.P., M.-S.K., H.G. wrote the manuscript with inputs from other authors.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Additional information
The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the PRIDE partner repository with the dataset identifier PXD000561.
Extended data figures and tables
Extended Data Figure 1 Summary of proteome analysis.
a, Mass error in parts per million for precursor ions of all identified peptides. b, Number of peptides detected per gene binned as shown. c, Distribution of sequence coverage of identified proteins. d–f, %FDR with a q value of <0.01 plotted against peptide length in number of amino acids, charge state of peptide ion and number of cleavage sites missed by enzyme. P values computed from two-tailed t-test are shown. Error bars indicate s.d. calculated from FDRs of multiple fetal samples. g, h, A comparison of peptides identified in this study with PeptideAtlas and GPMDB. i, Mass error in parts per million for precursor ions identified from proteogenomics analysis.
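The mass error shown in panels a and i is conventionally the signed difference between observed and theoretical precursor m/z, scaled to parts per million. A minimal Python sketch of that calculation follows; the function name and the example m/z values are illustrative assumptions and are not taken from the study.

```python
def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Return the signed mass error in parts per million (ppm)."""
    if theoretical_mz <= 0:
        raise ValueError("theoretical m/z must be positive")
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


# Hypothetical precursor values, chosen only to illustrate the formula.
error = ppm_error(observed_mz=500.001, theoretical_mz=500.000)
print(f"{error:.2f} ppm")
```

A positive value means the observed m/z is higher than the theoretical value; a negative value means it is lower.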
Extended Data Figure 2 Tissue-wise gene expression and housekeeping proteins.
a, A heat map shows a partial list of not well-characterized, hypothetical genes. b, The bulk of protein mass is contributed by only a small number of genes. Only 2,350 'housekeeping genes' account for ~75% of proteome mass. c, The number of cell/tissue types where a gene was observed was counted. Some genes were found to be specifically restricted in a few samples while others were observed in the majority of samples analysed. For example, 1,537 genes were detected only in one sample, and 2,350 genes were found in all samples. These latter genes can be defined as highly abundant 'housekeeping proteins'. d, Distribution of genes in the RefSeq database based on the number of protein isoforms resulting from their annotated transcripts (left). Distribution of the transcripts with two or more protein isoforms annotated based on the number of isoform-specific or shared peptides (right). e, A representative example of sequence coverage of PSMB8 protein along with tissue distribution of all of its identified peptides and the MS/MS spectrum of one of the peptides is shown along with seven selected reaction monitoring (SRM) transitions.
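The counting described in panel c can be sketched in a few lines of Python: tally the number of samples in which each gene was detected, then separate genes seen in exactly one sample from genes seen in every sample. The function names and the three-sample toy data set below are hypothetical and serve only to illustrate the tally; they do not reproduce the study's data or pipeline.

```python
from collections import Counter


def samples_per_gene(detections: dict[str, set[str]]) -> Counter:
    """Count, for each gene, the number of samples in which it was detected.

    `detections` maps a sample name to the set of genes identified in it.
    """
    counts: Counter = Counter()
    for genes in detections.values():
        counts.update(genes)  # each gene counted once per sample
    return counts


def classify(detections: dict[str, set[str]]) -> tuple[list[str], list[str]]:
    """Return (genes seen in exactly one sample, genes seen in all samples)."""
    n_samples = len(detections)
    counts = samples_per_gene(detections)
    restricted = sorted(g for g, n in counts.items() if n == 1)
    ubiquitous = sorted(g for g, n in counts.items() if n == n_samples)
    return restricted, ubiquitous


# Toy input, invented for illustration only.
toy = {
    "sample_A": {"GENE1", "GENE2"},
    "sample_B": {"GENE1", "GENE3"},
    "sample_C": {"GENE1", "GENE4"},
}
restricted, ubiquitous = classify(toy)
print(restricted, ubiquitous)
```

With the real data, the second list would correspond to the genes the caption describes as found in all samples.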
Extended Data Figure 3 Western blot analysis of select tissue-restricted proteins.
a, Eight proteins showing tissue-restricted expression were tested using western blot analysis in 17 adult tissues. GAPDH was used as a loading control. b, Four proteins were found to be expressed in a broad range of tissues, although bands that do not correspond to the expected molecular weight were also observed. CST, Cell Signalling Technology; SCB, Santa Cruz Biotechnology.
Extended Data Figure 4 Identification of novel genes/ORFs and translated non-coding RNAs.
a, An example of a novel ORF in an alternate reading frame located in the 3′ UTR of CHTF8 gene. The relative abundance of peptides from the CHTF8 protein and the protein encoded by the novel ORF is shown (bottom). b, An example of translated non-coding RNA (NR_027693.1) identified by searching 3-frame-translated transcript database. The MS/MS spectrum of one of the five identified peptides (LEVASSPPVSEAVPR) is shown along with a similar fragmentation pattern observed from the corresponding synthetic peptide.
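A 3-frame-translated transcript database, as mentioned in panel b, is built by translating each transcript in its three forward reading frames. The Python sketch below shows the idea under stated assumptions: it uses the standard genetic code, marks stop codons with `*` and unknown codons with `X`, and ignores trailing bases that do not complete a codon. The function names are hypothetical, and this is not the authors' actual pipeline.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str, frame: int) -> str:
    """Translate `seq` starting at offset `frame` (0, 1 or 2)."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(seq: str) -> list[str]:
    """Return the translations of the three forward reading frames."""
    return [translate(seq, frame) for frame in range(3)]


# Short made-up sequence, for illustration only.
frames = three_frame_translation("ATGGCC")
print(frames)
```

In a database search, each frame's translation would typically be split at stop codons so that only uninterrupted stretches are matched against MS/MS spectra.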
Extended Data Figure 5 Human genome annotation through proteogenomic analysis using GeneSpring.
a, Four genome search specific peptides (GSSPs; red boxes) map to an upstream ORF (denoted as black hashes) located in 5′ UTR of the SLC35A4 gene (ORF shown as blue rectangle). b, GSSP mapping in the intergenic region between two RefSeq annotated genes NDUFV3 and PKNOX1. The ORF region is depicted in dotted lines of human endogenous retroviral element (HERV). c, GSSPs mapping to an annotated pseudogene MAGEB6P1, the alignments of parent gene and pseudogene are shown below the peptides.
Extended Data Figure 6 Frequency of nucleotides surrounding translational start sites.
a, Frequency of nucleotides at positions ranging from −5 to +1 surrounding the AUG codon for confirmed translational start sites. b, Frequency of nucleotides at positions ranging from −5 to +1 surrounding the AUG codon for novel translational start sites identified in this study.
Supplementary information
Supplementary Information
This file contains a Supplementary Discussion and additional references. (PDF 106 kb)
Supplementary Data
This file contains Supplementary Data. (PDF 3594 kb)
Supplementary Table 1
This file contains a summary of results from proteogenomics analysis; a list of peptides indicating novel signal peptide cleavage sites; and a draft map of the human proteome. (XLSX 1178 kb)
About this article
Cite this article
Kim, MS., Pinto, S., Getnet, D. et al. A draft map of the human proteome. Nature 509, 575–581 (2014). https://doi.org/10.1038/nature13302
DOI: https://doi.org/10.1038/nature13302