DOI: 10.1145/3472163.3472185
research-article

A Zone-Based Data Lake Architecture for IoT, Small and Big Data

Published: 07 September 2021

Abstract

Data lakes are supposed to enable analysts to perform more efficient and efficacious data analysis by crossing multiple existing data sources, processes and analyses. However, this is impossible to achieve when a data lake does not have a metadata governance system that progressively capitalizes on all the analysis experiments performed. The objective of this paper is to provide an easily accessible, reusable data lake that capitalizes on all user experiences. To meet this need, we propose an analysis-oriented metadata model for data lakes. This model includes the descriptive information of datasets and their attributes, as well as all metadata related to the machine learning analyses performed on these datasets. To illustrate our metadata solution, we implemented a web application for data lake metadata management. This application allows users to find and use existing data, processes and analyses by searching relevant metadata stored in a NoSQL data store within the data lake. To demonstrate how to easily discover metadata with the application, we present two use cases, with real data, including dataset similarity detection and machine learning guidance.
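
As a rough illustration of the kind of metadata the abstract describes (descriptive information about datasets and their attributes, plus records of the analyses run on them), the following minimal Python sketch models a small in-memory catalog with keyword search. It is not the authors' model or application: the class names, fields, sample entries, and the Jaccard-based attribute overlap are all hypothetical stand-ins chosen for illustration, and a real deployment would keep these records in a NoSQL store rather than a list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Analysis:
    """Hypothetical record of one machine learning analysis run on a dataset."""
    name: str
    algorithm: str
    target: str


@dataclass
class DatasetMetadata:
    """Hypothetical metadata entry: description, attribute names, analyses."""
    name: str
    description: str
    attributes: List[str]
    analyses: List[Analysis] = field(default_factory=list)


def search(catalog, keyword):
    """Return names of datasets whose metadata mentions the keyword."""
    needle = keyword.lower()
    hits = []
    for entry in catalog:
        texts = [entry.name, entry.description, *entry.attributes]
        texts += [a.algorithm for a in entry.analyses]
        if any(needle in t.lower() for t in texts):
            hits.append(entry.name)
    return hits


def attribute_overlap(a, b):
    """Jaccard overlap of attribute names (an assumed, crude similarity measure)."""
    sa, sb = set(a.attributes), set(b.attributes)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Made-up sample entries, for illustration only.
catalog = [
    DatasetMetadata(
        "sensors_2020", "Temperature readings",
        ["device_id", "timestamp", "temp"],
        [Analysis("forecast", "random forest", "temp")],
    ),
    DatasetMetadata(
        "sensors_2021", "Humidity readings",
        ["device_id", "timestamp", "humidity"],
    ),
]
```

Searching by an algorithm name finds datasets that have already been analyzed with it, which loosely mirrors the "machine learning guidance" use case; the overlap function hints at how shared attributes could feed dataset similarity detection.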

Supplementary Material

p94-zhao-supplement (p94-zhao-supplement.pptx)
Presentation slides

Published In

IDEAS '21: Proceedings of the 25th International Database Engineering & Applications Symposium
July 2021
308 pages
ISBN:9781450389914
DOI:10.1145/3472163
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Data Lake
  2. Metadata
  3. Stream IoT Data
  4. Technical Architecture
  5. Zone-based

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

IDEAS 2021

Acceptance Rates

Overall Acceptance Rate 74 of 210 submissions, 35%

Article Metrics

  • Downloads (Last 12 months): 77
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 01 Sep 2024
