poster

Accelerating the quality control of genetic sequences through stream processing

Authors:

Óscar Castellanos-Rodríguez,

Roberto R. Expósito,

Juan Touriño

SAC '23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing

Pages 398 - 401

https://doi.org/10.1145/3555776.3577785

Published: 07 June 2023

Abstract

Quality control of DNA sequences is an important data preprocessing step in many genomic analyses. However, all existing parallel tools for this purpose follow a batch processing model: the complete genetic dataset must be available before processing can begin. This limitation hinders quality control performance in scenarios where the dataset must first be downloaded from a remote repository and/or copied to a distributed file system for parallel processing. In this paper we present SeQual-Stream, a Big Data tool that performs quality control on genomic datasets in a fast, distributed and scalable way. To do so, the tool relies on the Apache Spark framework and the Hadoop Distributed File System (HDFS) to fully exploit the stream paradigm and accelerate the preprocessing of large datasets while they are being downloaded and/or copied to HDFS. Experimental results show significant improvements over a batch processing tool, with a maximum speedup of 2.7x.
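The difference between batch and stream processing described in the abstract can be illustrated with a minimal, single-process sketch. It is not SeQual-Stream's implementation, which runs on Apache Spark and HDFS. The example assumes plain four-line FASTQ records with Phred+33 quality encoding, and the names `parse_fastq`, `mean_phred`, `quality_filter`, `simulated_download`, as well as the mean-quality threshold of 20, are hypothetical choices made for illustration. Each record is filtered as soon as its four lines have arrived, so the complete dataset is never required up front.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per four lines, as soon as those lines are available.

    Assumption: records are not line-wrapped (exactly four lines each).
    """
    block = []
    for line in lines:
        block.append(line.rstrip("\n"))
        if len(block) == 4:
            header, sequence, _separator, quality = block
            yield FastqRecord(header[1:], sequence, quality)  # drop leading '@'
            block = []


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def quality_filter(
    records: Iterable[FastqRecord], min_mean_quality: float = 20.0
) -> Iterator[FastqRecord]:
    """Lazily keep records whose mean quality reaches the (assumed) threshold."""
    for record in records:
        if record.quality and mean_phred(record.quality) >= min_mean_quality:
            yield record


def simulated_download() -> Iterator[str]:
    """Stand-in for a dataset that arrives line by line from a remote source."""
    yield "@read1\n"
    yield "ACGT\n"
    yield "+\n"
    yield "IIII\n"  # 'I' encodes Phred 40: high quality
    yield "@read2\n"
    yield "ACGT\n"
    yield "+\n"
    yield "!!!!\n"  # '!' encodes Phred 0: low quality


# Records are parsed and filtered while the "download" is still producing lines.
kept = [record.name for record in quality_filter(parse_fastq(simulated_download()))]
print(kept)  # ['read1']
```

Because both `parse_fastq` and `quality_filter` are generators, no stage waits for the whole input. A batch tool would instead need `simulated_download()` to finish before the first record could be examined.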

Index Terms

Accelerating the quality control of genetic sequences through stream processing
1. Applied computing
  1. Life and medical sciences
    1. Bioinformatics
    2. Computational biology
      1. Computational genomics
2. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
      1. MapReduce algorithms

Published In

SAC '23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing

March 2023

1932 pages

ISBN:9781450395175

DOI:10.1145/3555776

Conference Chairs:
Jiman Hong, Soongsil University, South Korea
Maart Lanperne, Tallinn University, Estonia

Program Chairs:
Juw Won Park, University of Louisville, USA
Tomas Cerny, Baylor University, USA

Publication Chair:
Hossain Shahriar, Kennesaw State University, USA

Copyright © 2023 Owner/Author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Sponsors

SIGAPP: ACM Special Interest Group on Applied Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Poster

Funding Sources

Ministry of Science and Innovation, Spain
Xunta de Galicia and FEDER funds of the European Union

Conference

SAC '23

Sponsor:

SIGAPP

SAC '23: 38th ACM/SIGAPP Symposium on Applied Computing

March 27 - 31, 2023

Tallinn, Estonia

Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
