DOI: 10.1145/2807591.2807595
Research Article
Open access

Big omics data experience

Published: 15 November 2015

Abstract

As personalized medicine becomes more integrated into healthcare, the rate at which human genomes are being sequenced is rising rapidly, with a concomitant acceleration in compute and storage requirements. To achieve the most effective solution for genomic workloads without re-architecting the industry-standard software, we performed a rigorous analysis of usage statistics, benchmarks and available technologies to design a system for maximum throughput. We share our experiences designing a system optimized for the "Genome Analysis Toolkit (GATK) Best Practices" whole-genome DNA and RNA pipeline, based on an evaluation of its compute, workload and I/O characteristics. The characteristics of genomic workloads differ vastly from those of traditional HPC workloads, requiring different configurations of the scheduler and the I/O subsystem to achieve reliability, performance and scalability. By understanding how our researchers and clinicians work, we were able to employ techniques not only to speed up their workflow, yielding improved and repeatable performance, but also to make more efficient use of storage and compute resources.
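The abstract notes that genomic workloads call for different scheduler configurations than traditional HPC jobs: they are dominated by large numbers of independent, mostly single-threaded tasks rather than tightly coupled parallel jobs. As a purely illustrative sketch (not the pipeline or configuration described in the paper), the snippet below shows one common way such a step is scattered for throughput: a GATK 3-style HaplotypeCaller run split per chromosome and submitted as independent LSF jobs via bsub. All paths, the queue name and the resource requests are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: scatter a GATK 3-style HaplotypeCaller step across
chromosomes and submit each shard as an independent LSF job with bsub.
Paths, queue names and resource requests are made-up placeholders."""
import subprocess

REFERENCE = "/gpfs/ref/human_g1k_v37.fasta"   # placeholder path
BAM       = "/gpfs/data/sample.recal.bam"     # placeholder path
OUTDIR    = "/gpfs/results"                   # placeholder path
QUEUE     = "genomics"                        # placeholder LSF queue
CHROMOSOMES = [str(c) for c in range(1, 23)] + ["X", "Y"]

def submit_shard(chrom: str) -> None:
    """Build one per-chromosome HaplotypeCaller command and hand it to bsub."""
    gatk_cmd = (
        "java -Xmx8g -jar GenomeAnalysisTK.jar -T HaplotypeCaller "
        f"-R {REFERENCE} -I {BAM} -L {chrom} "
        f"-o {OUTDIR}/sample.chr{chrom}.g.vcf --emitRefConfidence GVCF"
    )
    bsub_cmd = [
        "bsub",
        "-q", QUEUE,               # queue sized for many independent jobs
        "-n", "1",                 # one core per shard; this step is largely single-threaded
        "-R", "rusage[mem=8000]",  # memory reservation in MB (site-specific convention)
        "-o", f"{OUTDIR}/logs/chr{chrom}.%J.out",
        gatk_cmd,
    ]
    subprocess.run(bsub_cmd, check=True)

if __name__ == "__main__":
    for chrom in CHROMOSOMES:
        submit_shard(chrom)
```

In a real deployment the per-chromosome gVCFs would be gathered and joint-genotyped downstream, and the shard granularity, core counts and memory reservations would be tuned to whatever the site's own benchmarking indicates.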




Information

Published In

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN:9781450337236
DOI:10.1145/2807591
General Chair: Jackie Kern
Program Chair: Jeffrey S. Vetter
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 15 November 2015

Author Tags

  1. GPFS
  2. LSF
  3. benchmarking
  4. flash memory
  5. genomic sequencing
  6. high performance
  7. high throughput and data-intensive computing
  8. parallel file systems
  9. performance analysis
  10. scheduling and resource management

Qualifiers

  • Research-article

Conference

SC15

Acceptance Rates

SC '15 paper acceptance rate: 79 of 358 submissions (22%)
Overall acceptance rate: 1,516 of 6,373 submissions (24%)

Article Metrics

  • Downloads (last 12 months): 154
  • Downloads (last 6 weeks): 24
Reflects downloads up to 08 Feb 2025

Cited By

  • (2021) Machine learning and deep learning to predict mortality in patients with spontaneous coronary artery dissection. Scientific Reports 11:1. DOI: 10.1038/s41598-021-88172-0 (26-Apr-2021)
  • (2020) Optimizing High-Performance Computing Systems for Biomedical Workloads. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 183-192. DOI: 10.1109/IPDPSW50202.2020.00040 (May-2020)
  • (2020) Computational performance of heterogeneous ensemble frameworks on high-performance computing platforms. 2020 IEEE International Conference on Big Data (Big Data), 2843-2850. DOI: 10.1109/BigData50022.2020.9378392 (10-Dec-2020)
  • (2020) Computer-aided Big Healthcare Data (BHD) Analytics. Big Data Analytics and Intelligence: A Perspective for Health Care, 115-138. DOI: 10.1108/978-1-83909-099-820201010 (30-Sep-2020)
  • (2020) A Survey on Big Data Solution for Complex Bio-medical Information. Second International Conference on Computer Networks and Communication Technologies, 229-237. DOI: 10.1007/978-3-030-37051-0_26 (22-Jan-2020)
  • (2018) High-performance genomic analysis framework with in-memory computing. ACM SIGPLAN Notices 53:1, 317-328. DOI: 10.1145/3200691.3178511 (10-Feb-2018)
  • (2018) High-performance genomic analysis framework with in-memory computing. Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 317-328. DOI: 10.1145/3178487.3178511 (10-Feb-2018)
  • (2017) Failures in large scale systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-12. DOI: 10.1145/3126908.3126937 (12-Nov-2017)
  • (2016) Big Biological Data Management. Resource Management for Big Data Platforms, 265-277. DOI: 10.1007/978-3-319-44881-7_13 (28-Oct-2016)
