Towards practical, high-capacity, low-maintenance information storage in synthesized DNA

Goldman, Nick; Bertone, Paul; Chen, Siyuan; Dessimoz, Christophe; LeProust, Emily M.; Sipos, Botond; Birney, Ewan

doi:10.1038/nature11875

Letter
Published: 23 January 2013

Nature volume 494, pages 77–80 (2013)

Abstract

Digital production, transmission and storage have revolutionized how we access and use information but have also made archiving an increasingly complex task that requires active, continuing maintenance of digital media. This challenge has focused some interest on DNA as an attractive target for information storage¹ because of its capacity for high-density information encoding, longevity under easily achieved conditions²,³,⁴ and proven track record as an information bearer. Previous DNA-based information storage approaches have encoded only trivial amounts of information⁵,⁶,⁷ or were not amenable to scaling-up⁸, and used no robust error-correction and lacked examination of their cost-efficiency for large-scale information archival⁹. Here we describe a scalable method that can reliably store more information than has been handled before. We encoded computer files totalling 739 kilobytes of hard-disk storage and with an estimated Shannon information¹⁰ of 5.2 × 10⁶ bits into a DNA code, synthesized this DNA, sequenced it and reconstructed the original files with 100% accuracy. Theoretical analysis indicates that our DNA-based storage scheme could be scaled far beyond current global information volumes and offers a realistic technology for large-scale, long-term and infrequently accessed digital archiving. In fact, current trends in technological advances are reducing DNA synthesis costs at a pace that should make our scheme cost-effective for sub-50-year archiving within a decade.
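The two size figures quoted above can be put side by side with a few lines of arithmetic. The sketch below converts 739 kilobytes to bits and compares the result with the stated Shannon estimate. It assumes 1 kilobyte = 1,024 bytes; the abstract does not say which convention applies, so the ratio is indicative only.

```python
# Assumption: binary kilobytes (1,024 bytes). The abstract does not specify.
KILOBYTE = 1024

disk_bits = 739 * KILOBYTE * 8   # size on disk, expressed in bits
shannon_bits = 5.2e6             # Shannon information estimate from the abstract

# Fraction of the on-disk bits that the Shannon estimate accounts for.
ratio = shannon_bits / disk_bits

print(f"on-disk size     : {disk_bits:,} bits")
print(f"Shannon estimate : {shannon_bits:,.0f} bits")
print(f"ratio            : {ratio:.2f}")
```

Under that assumption the Shannon estimate is somewhat smaller than the raw file size, which is what one would expect when most of the stored files are already in compressed formats.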

Figure 1: **Digital information encoding in DNA.**

Figure 2: **Scaling properties and robustness of DNA-based storage.**

Accession codes

Primary accessions

Sequence Read Archive

ERP002040

Data deposits

Data are available at http://www.ebi.ac.uk/goldman-srv/DNA-storage and in the Sequence Read Archive (SRA) with accession number ERP002040.

References

1. Baum, E. B. Building an associative memory vastly larger than the brain. Science 268, 583–585 (1995)
2. Cox, J. P. L. Long-term data storage in DNA. Trends Biotechnol. 19, 247–250 (2001)
3. Anchordoquy, T. J. & Molina, M. C. Preservation of DNA. Cell Preserv. Technol. 5, 180–188 (2007)
4. Bonnet, J. et al. Chain and conformation stability of solid-state DNA: implications for room temperature storage. Nucleic Acids Res. 38, 1531–1546 (2010)
5. Clelland, C. T., Risca, V. & Bancroft, C. Hiding messages in DNA microdots. Nature 399, 533–534 (1999)
6. Kac, E. Genesis (1999); available at http://www.ekac.org/geninfo.html (accessed 10 May 2012)
7. Ailenberg, M. & Rotstein, O. D. An improved Huffman coding method for archiving text, images, and music characters in DNA. Biotechniques 47, 747–754 (2009)
8. Gibson, D. G. et al. Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329, 52–56 (2010)
9. Church, G. M., Gao, Y. & Kosuri, S. Next-generation digital information storage in DNA. Science 337, 1628 (2012)
10. MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms (Cambridge Univ. Press, 2003)
11. Erlich, H. A., Gelfand, D. & Sninsky, J. J. Recent advances in the polymerase chain reaction. Science 252, 1643–1651 (1991)
12. Monaco, A. P. & Larin, Z. YACs, BACs, PACs and MACs: artificial chromosomes as research tools. Trends Biotechnol. 12, 280–286 (1994)
13. Carr, P. A. & Church, G. M. Genome engineering. Nature Biotechnol. 27, 1151–1162 (2009)
14. Willerslev, E. et al. Ancient biomolecules from deep ice cores reveal a forested southern Greenland. Science 317, 111–114 (2007)
15. Green, R. E. et al. A draft sequence of the Neandertal genome. Science 328, 710–722 (2010)
16. Kari, L. & Mahalingam, K. in Algorithms and Theory of Computation Handbook Vol. 2, 2nd edn (eds Atallah, M. J. & Blanton, M.) 31-1–31-24 (Chapman & Hall, 2009)
17. Păun, G., Rozenberg, G. & Salomaa, A. DNA Computing: New Computing Paradigms (Springer, 1998)
18. Watson, J. D. & Crick, F. H. C. Molecular structure of nucleic acids. Nature 171, 737–738 (1953)
19. Niedringhaus, T. P., Milanova, D., Kerby, M. B., Snyder, M. P. & Barron, A. E. Landscape of next-generation sequencing technologies. Anal. Chem. 83, 4327–4341 (2011)
20. LeProust, E. M. et al. Synthesis of high-quality libraries of long (150mer) oligonucleotides by a novel depurination controlled process. Nucleic Acids Res. 38, 2522–2540 (2010)
21. Massingham, T. & Goldman, N. All Your Base: a fast and accurate probabilistic approach to base calling. Genome Biol. 13, R13 (2012)
22. Gantz, J. & Reinsel, D. Extracting Value from Chaos (IDC, 2011)
23. Brand, S. The Clock of the Long Now (Basic Books, 1999)
24. Digital archiving. History flushed. Economist 403, 56–57 (28 April 2012); available at http://www.economist.com/node/21553410 (2012)
25. Bessone, N., Cancio, G., Murray, S. & Taurelli, G. Increasing the efficiency of tape-based storage backends. J. Phys. Conf. Ser. 219, 062038 (2010)
26. Baker, M. et al. in Proc. 1st ACM SIGOPS/EuroSys European Conf. on Computer Systems (eds Berbers, Y. & Zwaenepoel, W.) 221–234 (ACM, 2006)
27. Yuille, M. et al. The UK DNA banking network: a 'fair access' biobank. Cell Tissue Bank. 11, 241–251 (2010)
28. Global Crop Diversity Trust. Svalbard Global Seed Vault (2012); available at http://www.croptrust.org/main/content/svalbard-global-seed-vault (accessed 10 May 2012)

Acknowledgements

At the University of Cambridge: D. MacKay and G. Mitchison for advice on codes for run-length-limited channels. At CERN: B. Jones for discussions on data archival. At EBI: A. Löytynoja for custom multiple sequence alignment software, H. Marsden for computing base calls and for detecting an error in the original parity-check encoding, T. Massingham for computing base calls and advice on code theory and K. Gori, D. Henk, R. Loos, S. Parks and R. Schwarz for assistance with revisions to the manuscript. In the Genomics Core Facility at EMBL Heidelberg: V. Benes for advice on Next-Generation Sequencing protocols, D. Pavlinić for sequencing and J. Blake for data handling. C.D. is supported by a fellowship from the Swiss National Science Foundation (grant 136461). B.S. is supported by an EMBL Interdisciplinary Postdoctoral Fellowship under Marie Curie Actions (COFUND).

Author information

Authors and Affiliations

European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SD, UK
Nick Goldman, Paul Bertone, Christophe Dessimoz, Botond Sipos & Ewan Birney
Agilent Technologies, Genomics–LSSU, 5301 Stevens Creek Boulevard, Santa Clara, California 95051, USA
Siyuan Chen & Emily M. LeProust

Contributions

N.G. and E.B. conceived and planned the project and devised the information-encoding methods. P.B. advised on oligo design and Next-Generation Sequencing protocols, prepared the DNA library and managed the sequencing process. S.C. and E.M.L. provided custom oligonucleotides. N.G. wrote the software for encoding and decoding information into/from DNA and analysed the data. N.G., E.B., C.D. and B.S. modelled the scaling properties of DNA storage. N.G. wrote the paper with discussions and contributions from all other authors. N.G. and C.D. produced the figures.

Corresponding author

Correspondence to Nick Goldman.

Ethics declarations

Competing interests

S.C. and E.M.L. are employees of Agilent Technologies, a commercial provider of OLS pools. N.G. and E.B. are named inventors on a patent application on technologies described in this work.

Supplementary information

Supplementary Information 1

This file contains Supplementary Tables 1-4, Supplementary Figures 1-9, Supplementary Methods and Data, a Supplementary Discussion and Supplementary references. This file was replaced on 14 February 2013 to correct the DNA sequence in Supplementary Figure 8, which was misaligned. (PDF 2027 kb)

Supplementary Information 2

This file contains the full formal specification of the digital information encoding scheme. (PDF 244 kb)

Supplementary Information 3

This file contains the FastQC QC report on the Illumina HiSeq 2000 sequencing run. (PDF 411 kb)

Supplementary Data 1

This zipped file contains the five original files encoded and decoded in this study, namely wssnt10.txt (ASCII text file containing text of all 154 Shakespeare sonnets), watsoncrick.pdf (PDF of Watson & Crick's (1953) paper describing the structure of DNA), MLK_excerpt_VBR_45-85.mp3 (MP3 file containing a 26 s excerpt from Martin Luther King's 1963 "I Have A Dream" speech), EBI.jp2 (JPEG 2000 format medium resolution colour photograph of the European Bioinformatics Institute) and View_huff3.cd.new (ASCII text file defining the Huffman code used to convert bytes of encoded files to base 3). (ZIP 646 kb)

Supplementary Data 2

This file contains the GATK ErrorRatePerCycle report on the Illumina HiSeq 2000 sequencing run. (TXT 6 kb)
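The supplementary listing above mentions a Huffman code that converts bytes to base 3, and the acknowledgements mention run-length-limited codes. As a rough illustration of how those two ideas can fit together, the Python sketch below builds a ternary Huffman code from byte frequencies and then maps each base-3 digit to a nucleotide that differs from the previous one, so no base is ever repeated. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the frequency-derived code, the rotation order in `BASES` and the starting base `"A"` are all illustrative choices. The published code table is the one in View_huff3.cd.new, and the formal specification is in Supplementary Information 2.

```python
import heapq
from collections import Counter
from itertools import count

BASES = "ACGT"  # assumption: illustrative ordering, not the published table


def ternary_huffman(freqs):
    """Build a base-3 Huffman code: {symbol: string of digits 0-2}."""
    if len(freqs) == 1:  # a lone symbol still needs a non-empty codeword
        return {next(iter(freqs)): "0"}
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(w, next(tie), {s: ""}) for s, w in freqs.items()]
    # A full ternary tree needs an odd number of leaves; pad with a dummy.
    while (len(heap) - 1) % 2 != 0:
        heap.append((0, next(tie), {}))
    heapq.heapify(heap)
    while len(heap) > 1:
        merged, total = {}, 0
        for digit in "012":  # merge the three lightest subtrees
            w, _, codes = heapq.heappop(heap)
            total += w
            merged.update({s: digit + c for s, c in codes.items()})
        heapq.heappush(heap, (total, next(tie), merged))
    return heap[0][2]


def trits_to_dna(trits, prev="A"):
    """Map each trit to one of the three bases that differ from the last one."""
    out = []
    for t in trits:
        choices = [b for b in BASES if b != prev]
        prev = choices[int(t)]
        out.append(prev)
    return "".join(out)


def dna_to_trits(dna, prev="A"):
    """Invert trits_to_dna by replaying the same choice lists."""
    out = []
    for b in dna:
        choices = [x for x in BASES if x != prev]
        out.append(str(choices.index(b)))
        prev = b
    return "".join(out)


def decode(trits, code):
    """Decode a trit string with a prefix-free code back to bytes."""
    inverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for t in trits:
        buf += t
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return bytes(out)


message = b"Shall I compare thee to a summer's day?"
code = ternary_huffman(Counter(message))
trits = "".join(code[b] for b in message)
dna = trits_to_dna(trits)
print(dna)
print(decode(dna_to_trits(dna), code) == message)
```

Because every nucleotide is chosen from the three bases that differ from its predecessor, the output contains no homopolymer runs by construction, at the cost of carrying one base-3 digit per base rather than two bits.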

About this article

Cite this article

Goldman, N., Bertone, P., Chen, S. et al. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature 494, 77–80 (2013). https://doi.org/10.1038/nature11875

Received: 15 May 2012
Accepted: 12 December 2012
Published: 23 January 2013
Issue Date: 07 February 2013
DOI: https://doi.org/10.1038/nature11875

This article is cited by

In-vitro validated methods for encoding digital data in deoxyribonucleic acid (DNA)
- Golam Md Mortuza
- Jorge Guerrero
- Tim Andersen
BMC Bioinformatics (2023)
Magnetic DNA random access memory with nanopore readouts and exponentially-scaled combinatorial addressing
- Billy Lau
- Shubham Chandak
- Hanlee P. Ji
Scientific Reports (2023)
Performance analysis of DNA crossbar arrays for high-density memory storage applications
- Arpan De
- Hashem Mohammad
- M. P. Anantram
Scientific Reports (2023)
Digital data storage on DNA tape using CRISPR base editors
- Afsaneh Sadremomtaz
- Robert F. Glass
- Reza Zadegan
Nature Communications (2023)
Towards high-density storage of text and images into DNA by the 'Xiao-Pang' codec system
- Mingwei Lu
- Yang Wang
- Junbiao Dai
Science China Life Sciences (2023)