DOI: 10.1145/3307339.3342171
research-article

ParRefCom: Parallel Reference-based Compression of Paired-end Genomics Read Datasets

Published: 04 September 2019

Abstract

Transmission, storage, and archival of high-throughput sequencing (HTS) short-read datasets pose significant challenges due to the large size of such datasets. Constant improvements to HTS technology, in the form of increasing throughput and decreasing cost, and its increasing adoption amplify the problem. General-purpose compression algorithms have been widely adopted for representing read datasets in a compact form. However, they are unable to fully leverage the domain-specific properties of read datasets. In response, researchers proposed special-purpose compression algorithms which improve upon the compression efficiency of general-purpose compression algorithms. In this paper, we present ParRefCom, a parallel reference-based algorithm for compressing HTS genomics short-read datasets. HTS instruments are typically used to generate paired-end reads as they hold significance for biological analysis. In contrast to existing special-purpose compression algorithms, ParRefCom treats paired-end reads as first-class citizens. Owing to this treatment of paired-end reads, our algorithm is able to significantly improve compression efficiency over the state-of-the-art. More specifically, for a benchmark human dataset, the size of the compressed output is 21% smaller than that produced by the current best algorithm. Further, ParRefCom is scalable and its compression and decompression speeds are better than those of reference-free methods.
Implementation: https://github.com/ParBLiSS/refcom
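
The abstract does not spell out ParRefCom's encoding, so the following is only a generic illustration of what reference-based compression of a read pair can look like, not the paper's algorithm. It assumes both mates are already aligned to the forward strand of a reference at known positions and differ from it only by substitutions. The function names (`encode_read`, `encode_pair`, and so on) are hypothetical. The sketch stores the second mate's position as an offset from the first, which tends to be a small number because mates come from the same fragment.

```python
from typing import List, Tuple

# Hypothetical record layout: (position, [(offset_in_read, base), ...]).
Encoded = Tuple[int, List[Tuple[int, str]]]


def encode_read(ref: str, read: str, pos: int) -> Encoded:
    """Represent a read as its reference position plus its substitutions."""
    window = ref[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit in the reference at this position")
    mismatches = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return pos, mismatches


def decode_read(ref: str, enc: Encoded, length: int) -> str:
    """Rebuild a read by copying the reference window and applying substitutions."""
    pos, mismatches = enc
    bases = list(ref[pos:pos + length])
    for i, b in mismatches:
        bases[i] = b
    return "".join(bases)


def encode_pair(ref: str, read1: str, pos1: int, read2: str, pos2: int):
    """Encode both mates; mate 2's position is stored relative to mate 1.

    Assumption for illustration: strand and indels are ignored.
    """
    p1, m1 = encode_read(ref, read1, pos1)
    p2, m2 = encode_read(ref, read2, pos2)
    return (p1, m1), (p2 - p1, m2)


def decode_pair(ref: str, enc_pair, length: int) -> Tuple[str, str]:
    """Invert encode_pair for two mates of equal, known length."""
    (p1, m1), (delta, m2) = enc_pair
    return (
        decode_read(ref, (p1, m1), length),
        decode_read(ref, (p1 + delta, m2), length),
    )
```

Calling `encode_pair` on two aligned mates and passing the result to `decode_pair` with the same reference and read length returns the original pair. A real compressor would additionally handle reverse-complemented mates, insertions and deletions, unaligned reads, and entropy coding of the resulting fields.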


Information

Published In

BCB '19: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
September 2019
716 pages
ISBN:9781450366663
DOI:10.1145/3307339

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. high-throughput sequencing
  2. paired-end genomics short-read datasets
  3. parallel algorithms
  4. reference-based compression

Qualifiers

  • Research-article

Funding Sources

  • U.S. National Science Foundation

Conference

BCB '19
Acceptance Rates

BCB '19 Paper Acceptance Rate 42 of 157 submissions, 27%;
Overall Acceptance Rate 254 of 885 submissions, 29%
