research-article

ParRefCom: Parallel Reference-based Compression of Paired-end Genomics Read Datasets

Authors:

Nagakishore Jammula,

Srinivas Aluru

BCB '19: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pages 447 - 456

https://doi.org/10.1145/3307339.3342171

Published: 04 September 2019

Abstract

Transmitting, storing, and archiving high-throughput sequencing (HTS) short-read datasets is challenging because of their large size. Continual improvements in HTS technology, in the form of increasing throughput and decreasing cost, together with its growing adoption, amplify the problem. General-purpose compression algorithms have been widely adopted for representing read datasets compactly, but they cannot fully exploit the domain-specific properties of these datasets. In response, researchers have proposed special-purpose compression algorithms that improve on the compression efficiency of general-purpose ones. In this paper, we present ParRefCom, a parallel reference-based algorithm for compressing HTS genomics short-read datasets. HTS instruments are typically used to generate paired-end reads because of their significance for biological analysis. In contrast to existing special-purpose compression algorithms, ParRefCom treats paired-end reads as first-class citizens. Owing to this treatment, our algorithm significantly improves compression efficiency over the state of the art: for a benchmark human dataset, the compressed output is 21% smaller than that produced by the current best algorithm. Further, ParRefCom is scalable, and its compression and decompression speeds are better than those of reference-free methods.

Implementation: https://github.com/ParBLiSS/refcom
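
The general idea behind reference-based compression of paired-end reads can be illustrated with a small sketch. It is not taken from the paper or from the repository above, and it does not reproduce ParRefCom's actual encoding. It assumes that alignment positions are already known (a real tool would obtain them from a read aligner), that both reads of a pair are given in forward-strand orientation, and that reads differ from the reference only by substitutions. All names (`EncodedRead`, `EncodedPair`, `encode_pair`, and so on) are hypothetical. Each read is stored as a position plus a list of mismatches, and the mate's position is stored as an offset from the first read, which is typically a small number for a properly aligned pair.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (offset within the read, base observed in the read)
Mismatch = Tuple[int, str]


@dataclass
class EncodedRead:
    pos: int                    # 0-based start of the read on the reference
    length: int                 # read length in bases
    mismatches: List[Mismatch]  # substitutions relative to the reference


@dataclass
class EncodedPair:
    first: EncodedRead
    mate_delta: int             # mate start minus first-read start
    mate_length: int
    mate_mismatches: List[Mismatch]


def encode_read(reference: str, read: str, pos: int) -> EncodedRead:
    """Store a read as its position plus the bases that differ from the reference."""
    window = reference[pos:pos + len(read)]
    if pos < 0 or len(window) != len(read):
        raise ValueError("read does not fit on the reference at this position")
    mismatches = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return EncodedRead(pos, len(read), mismatches)


def decode_read(reference: str, enc: EncodedRead) -> str:
    """Rebuild the read by copying the reference window and applying substitutions."""
    bases = list(reference[enc.pos:enc.pos + enc.length])
    for offset, base in enc.mismatches:
        bases[offset] = base
    return "".join(bases)


def encode_pair(reference: str, read1: str, pos1: int,
                read2: str, pos2: int) -> EncodedPair:
    """Encode both reads, storing the mate position relative to the first read."""
    first = encode_read(reference, read1, pos1)
    mate = encode_read(reference, read2, pos2)
    return EncodedPair(first, pos2 - pos1, mate.length, mate.mismatches)


def decode_pair(reference: str, pair: EncodedPair) -> Tuple[str, str]:
    """Recover both reads of a pair from the reference and the encoded record."""
    mate = EncodedRead(pair.first.pos + pair.mate_delta,
                       pair.mate_length, pair.mate_mismatches)
    return decode_read(reference, pair.first), decode_read(reference, mate)


# Toy example: the first read has one substitution, the mate matches exactly.
reference = "ACGTACGTTAGCCGATAGGCTA"
pair = encode_pair(reference, "GTAAGT", 2, "CGATAG", 12)
assert decode_pair(reference, pair) == ("GTAAGT", "CGATAG")
```

In this toy record the bases themselves are never stored except where they differ from the reference, which is where the savings of a reference-based scheme come from. A practical compressor would additionally handle insertions, deletions, strand, unaligned reads, read identifiers, and quality scores, and would entropy-code the resulting fields.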

Index Terms

ParRefCom: Parallel Reference-based Compression of Paired-end Genomics Read Datasets
1. Applied computing
  1. Life and medical sciences
    1. Bioinformatics
2. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Data compression
    2. Parallel algorithms
      1. Shared memory algorithms

Published In

BCB '19: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

September 2019

716 pages

ISBN: 9781450366663

DOI: 10.1145/3307339

General Chairs: Xinghua (Mindy) Shi (Temple University, USA), Michael Buck (University of Buffalo, USA)

Program Chairs: Jian Ma (Carnegie Mellon University, USA), Pierangelo Veltri (University Magna Graecia of Catanzaro, Italy)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGBio: ACM Special Interest Group on Bioinformatics

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

U.S. National Science Foundation

Conference

BCB '19: 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Sponsor: SIGBio

September 7-10, 2019

Niagara Falls, NY, USA

Acceptance Rates

BCB '19 Paper Acceptance Rate 42 of 157 submissions, 27%

Overall Acceptance Rate 254 of 885 submissions, 29%
