DOI: 10.1145/3225058.3225128

Task-parallel Analysis of Molecular Dynamics Trajectories

Published: 13 August 2018

Abstract

The HPC and Big Data communities have proposed different parallel frameworks for implementing data analysis applications. In this paper, we investigate three task-parallel frameworks, Spark, Dask, and RADICAL-Pilot, with respect to their ability to support data analytics on HPC resources, and we compare them to MPI. We investigate the data analysis requirements of Molecular Dynamics (MD) simulations, which are significant consumers of supercomputing cycles and produce immense amounts of data. A typical large-scale MD simulation of a physical system of O(100k) atoms over microseconds can produce from O(10) GB to O(1000) GB of data. We propose and evaluate different approaches for parallelizing a representative set of MD trajectory analysis algorithms, in particular the computation of path similarity and leaflet identification. We evaluate Spark, Dask, and RADICAL-Pilot with respect to the abstractions and runtime engine capabilities they offer to support these algorithms. We provide a conceptual basis for comparing and understanding the different frameworks, which enables users to select the optimal system for each application. We also provide a quantitative performance analysis of the different algorithms across the three frameworks.
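To make the path similarity workload more concrete, the following is a minimal sketch of how an all-pairs comparison of trajectories decomposes into independent tasks. It is not the implementation evaluated in the paper. It assumes the Hausdorff distance as the path metric, plain Euclidean distance between frames (MD analyses commonly use RMSD instead), and Python's standard-library thread pool as a stand-in for a task-parallel framework. The names `hausdorff` and `pairwise_hausdorff` and the toy paths are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def hausdorff(p, q):
    """Symmetric Hausdorff distance between two paths (lists of frames).

    Each frame is a tuple of coordinates; the frame-to-frame metric is
    Euclidean distance (an assumption made for this sketch).
    """
    def directed(x, y):
        # Largest distance from a frame in x to its nearest frame in y.
        return max(min(math.dist(a, b) for b in y) for a in x)

    return max(directed(p, q), directed(q, p))


def pairwise_hausdorff(paths, max_workers=4):
    """Compute the symmetric distance matrix, one task per pair of paths."""
    n = len(paths)
    pairs = list(combinations(range(n), 2))
    matrix = [[0.0] * n for _ in range(n)]
    # Each pair is independent, so it can be scheduled as a separate task.
    # A thread pool only illustrates the decomposition; it does not speed up
    # CPU-bound pure-Python code.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda ij: hausdorff(paths[ij[0]], paths[ij[1]]), pairs)
        for (i, j), d in zip(pairs, results):
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy input: three short 2D "trajectories" (hypothetical data).
paths = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],
    [(0.0, 0.0), (1.0, 0.0), (2.0, 3.0)],
]
distances = pairwise_hausdorff(paths)
```

Because every pair of paths is independent, the same decomposition could be handed to any task scheduler. The number of tasks grows quadratically with the number of trajectories, so how tasks are grouped and how data reaches the workers matters at scale.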

Published In

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing
August 2018
945 pages
ISBN: 9781450365109
DOI: 10.1145/3225058
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

In-Cooperation

  • University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Data analytics
  2. MD Simulations Analysis
  3. MD analysis
  4. task-parallel

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2018

Acceptance Rates

ICPP '18 Paper Acceptance Rate: 91 of 313 submissions (29%)
Overall Acceptance Rate: 91 of 313 submissions (29%)

Article Metrics

  • Downloads (Last 12 months): 101
  • Downloads (Last 6 weeks): 14
Reflects downloads up to 27 Feb 2025.

Cited By

  • (2023) Molecular Dynamics Simulation by Implementing Parallel Computing and Big Data Principles. 2023 IEEE IAS Global Conference on Emerging Technologies (GlobConET), 1-6. DOI: 10.1109/GlobConET56651.2023.10150063. Online publication date: 19-May-2023
  • (2023) Performance comparison of Dask and Apache Spark on HPC systems for neuroimaging. Concurrency and Computation: Practice and Experience 35:21. DOI: 10.1002/cpe.7635. Online publication date: 22-Jan-2023
  • (2022) Scalable transcriptomics analysis with Dask: applications in data science and machine learning. BMC Bioinformatics 23:1. DOI: 10.1186/s12859-022-05065-3. Online publication date: 30-Nov-2022
  • (2022) An elastic framework for ensemble-based large-scale data assimilation. The International Journal of High Performance Computing Applications 36:4, 543-563. DOI: 10.1177/10943420221110507. Online publication date: 28-Jun-2022
  • (2020) Methods and Experiences for Developing Abstractions for Data-intensive, Scientific Applications. 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 636-645. DOI: 10.1109/IPDPSW50202.2020.00106. Online publication date: May-2020
  • (2020) Parallel performance of molecular dynamics trajectory analysis. Concurrency and Computation: Practice and Experience 32:19. DOI: 10.1002/cpe.5789. Online publication date: 27-Apr-2020
  • (2019) Performance Evaluation of Big Data Processing Strategies for Neuroimaging. 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 449-458. DOI: 10.1109/CCGRID.2019.00059. Online publication date: May-2019
