Design and Evaluation of a Simple Data Interface for Efficient Data Transfer across Diverse Storage

Published: 29 May 2021

Abstract

Modern science and engineering computing environments often feature storage systems of different types, from parallel file systems in high-performance computing centers to object stores operated by cloud providers. To enable easy, reliable, secure, and performant data exchange among these different systems, we propose Connector, a pluggable data access architecture for diverse, distributed storage. By abstracting low-level storage system details, Connector permits a managed data transfer service (Globus, in our case) to interact with a large and easily extended set of storage systems. Equally important, it supports third-party transfers: that is, direct data transfers from source to destination that are initiated by a third-party client but do not engage that third party in the data path. The abstraction also enables management of transfers for performance optimization, error handling, and end-to-end integrity. We present the Connector design, describe implementations for different storage services, evaluate tradeoffs inherent in managed vs. direct transfers, motivate recommended deployment options, and propose a model-based method that allows for easy characterization of performance in different contexts without exhaustive benchmarking.
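
To make the pluggable-connector idea concrete, below is a minimal Python sketch of the kind of storage abstraction the abstract describes: a small common interface that hides storage-specific details, one POSIX-backed implementation, and a transfer loop that computes an end-to-end checksum. All names here (StorageConnector, PosixConnector, transfer) are illustrative assumptions, not the actual Globus Connector API.

```python
# Hypothetical sketch of a pluggable storage-connector interface, in the
# spirit of the architecture described above. Not the real Globus API.
import hashlib
import os
from abc import ABC, abstractmethod
from typing import BinaryIO, Iterator


class StorageConnector(ABC):
    """Minimal abstraction over a storage system (POSIX, object store, ...)."""

    @abstractmethod
    def open_read(self, path: str) -> BinaryIO:
        """Return a readable binary stream for path."""

    @abstractmethod
    def open_write(self, path: str) -> BinaryIO:
        """Return a writable binary stream for path."""

    @abstractmethod
    def listdir(self, path: str) -> Iterator[str]:
        """Yield entry names under path."""


class PosixConnector(StorageConnector):
    """Connector backed by an ordinary POSIX file system."""

    def open_read(self, path: str) -> BinaryIO:
        return open(path, "rb")

    def open_write(self, path: str) -> BinaryIO:
        return open(path, "wb")

    def listdir(self, path: str) -> Iterator[str]:
        yield from os.listdir(path)


def transfer(src: StorageConnector, src_path: str,
             dst: StorageConnector, dst_path: str,
             chunk_size: int = 4 * 1024 * 1024) -> str:
    """Stream data between any two connectors and return a SHA-256 digest,
    supporting the kind of end-to-end integrity check the abstract mentions."""
    digest = hashlib.sha256()
    with src.open_read(src_path) as fin, dst.open_write(dst_path) as fout:
        while chunk := fin.read(chunk_size):
            fout.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()
```

An object-store connector would implement the same three methods against that store's SDK; the transfer loop, error handling, and integrity logic above it would not change, which is the point of the abstraction.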

    Published In

    ACM Transactions on Modeling and Performance Evaluation of Computing Systems, Volume 6, Issue 1
    March 2021
    111 pages
    ISSN: 2376-3639
    EISSN: 2376-3647
    DOI: 10.1145/3458922

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 29 May 2021
    Accepted: 01 February 2021
    Revised: 01 February 2021
    Received: 01 July 2020
    Published in TOMPECS Volume 6, Issue 1

    Author Tags

    1. Data transfer
    2. cloud storage
    3. storage interface

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • U.S. Department of Energy, Office of Science

    Article Metrics

    • Downloads (last 12 months): 20
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 30 Aug 2024

    Cited By

    • (2022) SciStream. In Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing, 185–198. https://doi.org/10.1145/3502181.3531475. Online publication date: 27-Jun-2022.
    • (2022) Linking scientific instruments and computation: Patterns, technologies, and experiences. Patterns 3, 10 (Oct 2022), 100606. https://doi.org/10.1016/j.patter.2022.100606
    • (2021) Analyzing the Performance of the S3 Object Storage API for HPC Workloads. Applied Sciences 11, 18 (14-Sep-2021), 8540. https://doi.org/10.3390/app11188540
    • (2021) Bridging Data Center AI Systems with Edge Computing for Actionable Information Retrieval. In 2021 3rd Annual Workshop on Extreme-scale Experiment-in-the-Loop Computing (XLOOP), 15–23. https://doi.org/10.1109/XLOOP54565.2021.00008. Online publication date: Nov-2021.
    • Linking Scientific Instruments and HPC: Patterns, Technologies, Experiences. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.4141629
