A Formal Framework for Data Lakes Based on Category Theory

Published: 13 September 2022

Abstract

The management of Big Data requires flexible systems to handle the heterogeneity of data models as well as the complexity of analytical workflows. Traditional systems like data warehouses have reached their limits due to their rigid schema-on-write paradigm, which requires well-identified and well-defined use cases before data can be ingested. Data lakes, with their schema-on-read paradigm, have been proposed as more flexible systems in which raw data are stored directly in their original format, together with associated metadata, to be accessed and transformed only when users need to process or analyze them. To use a data lake efficiently, it is therefore necessary to define and control its different levels of abstraction and the dependencies among its functionalities. In this article, we present a formal framework that aims to define a data lake pattern and to unify the interactions among the functionalities. We use category theory as the theoretical foundation to benefit from its high level of abstraction and its compositionality. By relying on different categories and functors, we ensure navigation among the functionalities and allow the composition of multiple operations while keeping track of the entire lineage of data. We also show how our framework can be applied to a simple example of a data lake.
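
To make the idea of composing operations while keeping track of lineage more concrete, the following is a minimal sketch in Python. It is not the formalism of the article: the `Morphism` class, the `identity` helper, the object names ("raw", "parsed", "metric") and the two toy operations are all hypothetical and chosen only for illustration. The sketch shows arrows between named objects that compose only when the target of one matches the source of the next, and whose composition concatenates the lineage of each step.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class Morphism:
    """An arrow between two named objects (hypothetical stand-ins for data lake stages)."""
    source: str
    target: str
    apply: Callable[[Any], Any]
    lineage: Tuple[str, ...]

    def then(self, other: "Morphism") -> "Morphism":
        """Compose self followed by other; only defined when the objects match."""
        if self.target != other.source:
            raise TypeError(f"cannot compose {self.target!r} with {other.source!r}")
        return Morphism(
            source=self.source,
            target=other.target,
            apply=lambda x: other.apply(self.apply(x)),
            # The lineage of a composite is the concatenation of its parts.
            lineage=self.lineage + other.lineage,
        )


def identity(obj: str) -> Morphism:
    """Identity arrow: leaves data unchanged and adds nothing to the lineage."""
    return Morphism(obj, obj, lambda x: x, ())


# Hypothetical operations on a toy list of comma-separated records.
parse = Morphism("raw", "parsed", lambda rows: [r.split(",") for r in rows], ("parse",))
count = Morphism("parsed", "metric", len, ("count",))

pipeline = identity("raw").then(parse).then(count)
result = pipeline.apply(["a,b", "c,d", "e,f"])
```

Here `pipeline.lineage` records the ordered names of the operations that produced `result`, and composing `count` before `parse` raises a `TypeError` because the objects do not match. Functors between categories, which the article also relies on, are not modeled in this sketch.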

Cited By

  • (2024) Metadata Management in Data Lake Environments: A Survey. Journal of Library Metadata 24:4 (215-274). DOI: 10.1080/19386389.2024.2359310. Online publication date: 15-Jul-2024

Published In

IDEAS '22: Proceedings of the 26th International Database Engineered Applications Symposium
August 2022
174 pages
ISBN:9781450397094
DOI:10.1145/3548785
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Architecture Pattern
  2. Category Theory
  3. Data Lakes

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

IDEAS'22

Acceptance Rates

Overall Acceptance Rate 74 of 210 submissions, 35%
