Research article · Public access · DOI: 10.1145/3295500.3356146

MIQS: metadata indexing and querying service for self-describing file formats

Published: 17 November 2019

Abstract

Scientific applications often store datasets in self-describing file formats such as HDF5 and netCDF. Searching the metadata within these files efficiently remains challenging because of the sheer size of the datasets. Existing solutions extract the metadata and store it in an external database management system (DBMS) to locate desired data, but this practice adds significant overhead and complexity to both extraction and querying. In this research, we propose a novel Metadata Indexing and Querying Service (MIQS), which removes the external DBMS and uses an in-memory index to achieve efficient metadata search. MIQS follows the self-contained data management paradigm and provides portable, schema-free metadata indexing and querying for self-describing file formats. We evaluated MIQS against a state-of-the-art MongoDB-based metadata indexing solution. MIQS achieved up to a 99% reduction in index construction time and up to a 172k× improvement in search performance, with up to a 75% reduction in memory footprint.
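To make the idea of a schema-free, in-memory metadata index concrete, here is a minimal Python sketch. It is not the MIQS implementation: the class name `MetadataIndex`, the nested-dictionary layout, and the file and object paths are all hypothetical, and the attributes are hard-coded rather than read from real HDF5 or netCDF files.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy in-memory index: attribute name -> value -> set of (file, object path).

    Illustrative only; the data structures used by MIQS itself are not shown here.
    """

    def __init__(self):
        self._index = defaultdict(lambda: defaultdict(set))

    def add(self, file_path, object_path, attributes):
        # Schema-free: any attribute name and any hashable value is accepted.
        for name, value in attributes.items():
            self._index[name][value].add((file_path, object_path))

    def query(self, name, value):
        # Exact-match lookup; sorted so the result order is stable.
        return sorted(self._index.get(name, {}).get(value, set()))


# Hypothetical files and attributes, standing in for metadata extracted
# from self-describing files.
index = MetadataIndex()
index.add("run1.h5", "/plate/3523", {"instrument": "BOSS", "mjd": 55144})
index.add("run2.h5", "/plate/4010", {"instrument": "BOSS", "mjd": 55300})
index.add("run2.h5", "/plate/4011", {"instrument": "other"})

hits = index.query("instrument", "BOSS")
print(hits)
```

Because lookups go straight to in-process dictionaries, no external database is needed to answer a query, which is the general design direction the abstract describes.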

    Published In

    SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2019
    1921 pages
    ISBN:9781450362290
    DOI:10.1145/3295500
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    In-Cooperation

    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. HDF5 metadata management
    2. metadata search

    Qualifiers

    • Research-article

    Conference

    SC '19

    Acceptance Rates

    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Cited By

    • (2024) SCIPIS: Scalable and Concurrent Persistent Indexing and Search in High-End Computing Systems. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2024.104878 (article 104878). Online publication date: Mar-2024.
    • (2024) DAI: How Pre-computation Speeds up Data Analysis. Computational Science – ICCS 2024. DOI: 10.1007/978-3-031-63751-3_8 (pp. 116-130). Online publication date: 27-Jun-2024.
    • (2023) PSQS: Parallel Semantic Querying Service for Self-describing File Formats. 2023 IEEE International Conference on Big Data (BigData). DOI: 10.1109/BigData59044.2023.10386205 (pp. 536-541). Online publication date: 15-Dec-2023.
    • (2023) Self-describing Digital Assets and Their Applications in an Integrated Science and Engineering Ecosystem. Accelerating Science and Engineering Discoveries Through Integrated Research Infrastructure for Experiment, Big Data, Modeling and Simulation. DOI: 10.1007/978-3-031-23606-8_17 (pp. 274-287). Online publication date: 18-Jan-2023.
    • (2022) SCANNS: Towards Scalable and Concurrent Data Indexing and Searching in High-End Computing System. 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid). DOI: 10.1109/CCGrid54584.2022.00014 (pp. 51-60). Online publication date: May-2022.
    • (2022) Domain-Specific Type-Safe APIs for Hierarchical Scientific Data with Modern C++. Responsible Data Science. DOI: 10.1007/978-981-19-4453-6_14 (pp. 191-204). Online publication date: 15-Nov-2022.
    • (2021) Dissecting self-describing data formats to enable advanced querying of file metadata. Proceedings of the 14th ACM International Conference on Systems and Storage. DOI: 10.1145/3456727.3463778 (pp. 1-7). Online publication date: 14-Jun-2021.
    • (2021) MOSIQS: Persistent Memory Object Storage With Metadata Indexing and Querying for Scientific Computing. IEEE Access, vol. 9. DOI: 10.1109/ACCESS.2021.3087502 (pp. 85217-85231). Online publication date: 2021.
    • (2020) Cross-facility science with the Superfacility Project at LBNL. 2020 IEEE/ACM 2nd Annual Workshop on Extreme-scale Experiment-in-the-Loop Computing (XLOOP). DOI: 10.1109/XLOOP51963.2020.00006 (pp. 1-7). Online publication date: Nov-2020.
    • (2020) Persistent Memory Object Storage and Indexing for Scientific Computing. 2020 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC). DOI: 10.1109/MCHPC51950.2020.00006 (pp. 1-9). Online publication date: Nov-2020.
