DOI: 10.5555/3291656.3291753
Research article

Dac-Man: data change management for scientific datasets on HPC systems

Published: 11 November 2018 Publication History

Abstract

Scientific data is growing rapidly and often changes due to instrument configurations, software updates, or quality assessments. These changes in datasets can waste significant compute and storage resources on HPC systems as downstream pipelines are reprocessed. Data changes need to be detected, tracked, and analyzed in order to understand their impact, manage data provenance, and make efficient and effective decisions about reprocessing and the use of HPC resources. Existing methods for identifying and capturing change are often manual, domain-specific, and error-prone, and do not scale to large scientific datasets. In this paper, we describe the design and implementation of the Dac-Man framework, which identifies, captures, and manages change in large scientific datasets, and enables plug-in of domain-specific change analysis with minimal user effort. Our evaluations show that it can retrieve file changes from directories containing millions of files and terabytes of data in less than a minute.
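To make the idea of retrieving file changes between two versions of a dataset directory concrete, here is a minimal sketch. It is not Dac-Man's implementation or API: the function names `index_directory` and `diff_indexes` are hypothetical, and hashing every file with SHA-256 is an assumption chosen for simplicity, not the technique evaluated in the paper.

```python
import hashlib
import os


def index_directory(root):
    """Map each file path (relative to root) to a SHA-256 digest of its contents.

    Hypothetical helper for illustration; not part of Dac-Man.
    """
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full_path, "rb") as handle:
                # Read in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            index[os.path.relpath(full_path, root)] = digest.hexdigest()
    return index


def diff_indexes(old_index, new_index):
    """Classify paths as added, deleted, modified, or unchanged."""
    old_paths, new_paths = set(old_index), set(new_index)
    common = old_paths & new_paths
    return {
        "added": sorted(new_paths - old_paths),
        "deleted": sorted(old_paths - new_paths),
        "modified": sorted(p for p in common if old_index[p] != new_index[p]),
        "unchanged": sorted(p for p in common if old_index[p] == new_index[p]),
    }
```

Building the index once per dataset version and comparing indexes afterwards keeps the comparison itself cheap, since it touches only path names and digests rather than file contents. A call such as `diff_indexes(index_directory("v1"), index_directory("v2"))`, with `v1` and `v2` standing in for two hypothetical dataset directories, would return the four lists of relative paths.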

Published In

SC '18: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2018, 932 pages

In cooperation with: IEEE CS
Publisher: IEEE Press
