DOI: 10.1145/3472163.3472173
short-paper

Data Management in the Data Lake: A Systematic Mapping

Published: 07 September 2021

Abstract

The computer science community is paying increasing attention to data because of its crucial role in analysis and prediction. Over the last decade, researchers have proposed many data containers, such as files, databases, data warehouses, cloud systems, and, most recently, data lakes. A data lake holds data in its native format, which makes it suitable for large-scale data prediction, particularly in real-time application development. Although data lakes are widely adopted in industry, their acceptance by the research community is still in its infancy. This paper sheds light on existing work on analysis and prediction over data stored in data lakes. Our study identifies the data management steps that a decision process needs to follow and the requirements it must respect, namely curation, quality evaluation, privacy preservation, and prediction. The study categorizes and analyzes the proposals related to each of these steps.

Cited By

  • (2023) Cloud-Based Simulation Model for Agriculture Big Data in the Kingdom of Bahrain. Proceedings of Eighth International Congress on Information and Communication Technology, 741–757. DOI: 10.1007/978-981-99-3243-6_59. Online publication date: 25-Jul-2023

Published In

IDEAS '21: Proceedings of the 25th International Database Engineering & Applications Symposium
July 2021
308 pages
ISBN: 9781450389914
DOI: 10.1145/3472163
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Data lake
  2. Data management
  3. Systematic mapping

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

IDEAS 2021

Acceptance Rates

Overall Acceptance Rate 74 of 210 submissions, 35%

Article Metrics

  • Downloads (last 12 months): 64
  • Downloads (last 6 weeks): 3
Reflects downloads up to 09 Sep 2024
