DOI: 10.1145/3555776.3577785
poster

Accelerating the quality control of genetic sequences through stream processing

Published: 07 June 2023

Abstract

Quality control of DNA sequences is an important data preprocessing step in many genomic analyses. However, all existing parallel tools for this purpose are based on a batch processing model and therefore need the complete genetic dataset to be available before processing can begin. This limitation hinders quality control performance in scenarios where the dataset must first be downloaded from a remote repository and/or copied to a distributed file system for parallel processing. In this paper we present SeQual-Stream, a Big Data tool that performs quality control on genomic datasets in a fast, distributed and scalable way. To do so, our tool relies on the Apache Spark framework and the Hadoop Distributed File System (HDFS) to fully exploit the stream paradigm and accelerate the preprocessing of large datasets while they are being downloaded and/or copied to HDFS. The experimental results show significant improvements over a batch processing tool, with a maximum speedup of 2.7x.
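
To make the contrast with batch processing concrete, the following is a minimal single-machine sketch of the stream idea: reads are filtered as soon as their lines arrive, rather than after the whole file is present. It is not SeQual-Stream's implementation, which the abstract says is built on Apache Spark and HDFS. The function names, the mean-quality filter, the threshold, the four-line FASTQ layout, and the Phred+33 encoding are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record layout: (header, bases, separator, qualities).
FastqRecord = Tuple[str, str, str, str]


def parse_fastq_stream(chunks: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield a FASTQ record as soon as its four lines have arrived.

    Assumes four-line records (no wrapped sequences) and no blank lines.
    Chunks may end anywhere, including in the middle of a line.
    """
    buffer = ""            # received text that does not yet form a full line
    lines: List[str] = []  # complete lines of the record being assembled
    for chunk in chunks:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            lines.append(line)
            if len(lines) == 4:
                yield (lines[0], lines[1], lines[2], lines[3])
                lines = []


def mean_quality(qualities: str, offset: int = 33) -> float:
    """Mean Phred score, assuming Sanger (Phred+33) encoding."""
    return sum(ord(c) - offset for c in qualities) / len(qualities)


def quality_filter(
    records: Iterable[FastqRecord], min_mean_quality: float
) -> Iterator[FastqRecord]:
    """Keep only reads whose mean quality reaches the threshold."""
    for record in records:
        if record[3] and mean_quality(record[3]) >= min_mean_quality:
            yield record


# Simulated download: the file arrives in arbitrary pieces.
chunks = ["@r1\nACGT\n+\nII", "II\n@r2\nAC", "GT\n+\n!!!!\n"]
kept = list(quality_filter(parse_fastq_stream(chunks), 20.0))
```

Because both functions are generators, the first read can be filtered before the last chunk has been received. That overlap between data transfer and processing is the effect the abstract attributes to the stream paradigm, here shown without any distribution across nodes.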

Published In

SAC '23: Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing
March 2023
1932 pages
ISBN:9781450395175
DOI:10.1145/3555776
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big data
  2. stream processing
  3. next generation sequencing (NGS)
  4. quality control
  5. apache spark

Qualifiers

  • Poster

Funding Sources

  • Ministry of Science and Innovation, Spain
  • Xunta de Galicia and FEDER funds of the European Union

Conference

SAC '23
Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%
