DOI: 10.1145/2807591.2807595
Research Article
Open access

Big omics data experience

Published: 15 November 2015

Abstract

As personalized medicine becomes more integrated into healthcare, the rate at which human genomes are being sequenced is rising rapidly, with a concomitant acceleration in compute and storage requirements. To achieve the most effective solution for genomic workloads without re-architecting the industry-standard software, we performed a rigorous analysis of usage statistics, benchmarks and available technologies to design a system for maximum throughput. We share our experiences designing a system optimized for the "Genome Analysis Toolkit (GATK) Best Practices" whole-genome DNA and RNA pipeline, based on an evaluation of its compute, workload and I/O characteristics. The characteristics of genomic workloads differ vastly from those of traditional HPC workloads, requiring different configurations of the scheduler and the I/O subsystem to achieve reliability, performance and scalability. By understanding how our researchers and clinicians work, we were able to employ techniques not only to speed up their workflow, yielding improved and repeatable performance, but also to make more efficient use of storage and compute resources.
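The abstract notes that genomic workloads call for different scheduler configurations than traditional HPC jobs: they are dominated by large numbers of independent, mostly single-threaded tasks rather than tightly coupled parallel jobs. As a purely illustrative sketch (not the pipeline or configuration described in the paper), the snippet below shows one common way such a step is scattered for throughput: a GATK 3-style HaplotypeCaller run split per chromosome and submitted as independent LSF jobs via bsub. All paths, the queue name and the resource requests are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: scatter a GATK 3-style HaplotypeCaller step across
chromosomes and submit each shard as an independent LSF job with bsub.
Paths, queue names and resource requests are made-up placeholders."""
import subprocess

REFERENCE = "/gpfs/ref/human_g1k_v37.fasta"   # placeholder path
BAM       = "/gpfs/data/sample.recal.bam"     # placeholder path
OUTDIR    = "/gpfs/results"                   # placeholder path
QUEUE     = "genomics"                        # placeholder LSF queue
CHROMOSOMES = [str(c) for c in range(1, 23)] + ["X", "Y"]

def submit_shard(chrom: str) -> None:
    """Build one per-chromosome HaplotypeCaller command and hand it to bsub."""
    gatk_cmd = (
        "java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller "
        f"-R {REFERENCE} -I {BAM} -L {chrom} "
        f"-o {OUTDIR}/sample.chr{chrom}.g.vcf --emitRefConfidence GVCF"
    )
    bsub_cmd = [
        "bsub",
        "-q", QUEUE,               # queue sized for many independent jobs
        "-n", "1",                 # one core per shard; this step is largely single-threaded
        "-R", "rusage[mem=8000]",  # memory reservation in MB (site-specific convention)
        "-o", f"{OUTDIR}/logs/chr{chrom}.%J.out",
        gatk_cmd,
    ]
    subprocess.run(bsub_cmd, check=True)

if __name__ == "__main__":
    for chrom in CHROMOSOMES:
        submit_shard(chrom)
```

In a real deployment the per-chromosome gVCFs would be gathered and joint-genotyped downstream, and the shard granularity, core counts and memory reservations would be tuned to whatever the site's own benchmarking indicates.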




Information

Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN:9781450337236
DOI:10.1145/2807591
General Chair: Jackie Kern
Program Chair: Jeffrey S. Vetter
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 15 November 2015

Author Tags

  1. GPFS
  2. LSF
  3. benchmarking
  4. flash memory
  5. genomic sequencing
  6. high performance
  7. high throughput and data-intensive computing
  8. parallel file systems
  9. performance analysis
  10. scheduling and resource management

Qualifiers

  • Research-article

Conference

SC15

Acceptance Rates

SC '15 paper acceptance rate: 79 of 358 submissions (22%)
Overall acceptance rate: 1,516 of 6,373 submissions (24%)

Article Metrics

  • Downloads (last 12 months): 154
  • Downloads (last 6 weeks): 24
Reflects downloads up to 08 Feb 2025

Cited By

  • (2021) Machine learning and deep learning to predict mortality in patients with spontaneous coronary artery dissection. Scientific Reports 11:1. DOI: 10.1038/s41598-021-88172-0 (26-Apr-2021)
  • (2020) Optimizing High-Performance Computing Systems for Biomedical Workloads. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 183-192. DOI: 10.1109/IPDPSW50202.2020.00040 (May-2020)
  • (2020) Computational performance of heterogeneous ensemble frameworks on high-performance computing platforms. 2020 IEEE International Conference on Big Data (Big Data), 2843-2850. DOI: 10.1109/BigData50022.2020.9378392 (10-Dec-2020)
  • (2020) Computer-aided Big Healthcare Data (BHD) Analytics. Big Data Analytics and Intelligence: A Perspective for Health Care, 115-138. DOI: 10.1108/978-1-83909-099-820201010 (30-Sep-2020)
  • (2020) A Survey on Big Data Solution for Complex Bio-medical Information. Second International Conference on Computer Networks and Communication Technologies, 229-237. DOI: 10.1007/978-3-030-37051-0_26 (22-Jan-2020)
  • (2018) High-performance genomic analysis framework with in-memory computing. ACM SIGPLAN Notices 53:1, 317-328. DOI: 10.1145/3200691.3178511 (10-Feb-2018)
  • (2018) High-performance genomic analysis framework with in-memory computing. Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 317-328. DOI: 10.1145/3178487.3178511 (10-Feb-2018)
  • (2017) Failures in large scale systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-12. DOI: 10.1145/3126908.3126937 (12-Nov-2017)
  • (2016) Big Biological Data Management. Resource Management for Big Data Platforms, 265-277. DOI: 10.1007/978-3-319-44881-7_13 (28-Oct-2016)
