A semantic and service-based approach for adaptive mutli-structured data curation in data lakehouses

World Wide Web

Abstract

Several data management architectures have recently emerged to cope with the challenges imposed by big data. Among them, data lakehouses are attracting considerable interest from industry and academia because they can hold disparate multi-structured batch and streaming data sources in a single repository. The heterogeneity and complexity of such data call for a dedicated process that improves their quality and extracts value from them. Data curation is that process: it encompasses several tasks that clean and enrich data so that they continue to fit user requirements. Nevertheless, most existing data curation approaches lack the dynamism, flexibility, and customization needed to build a curation pipeline aligned with end-user requirements, which may vary with the user's decision context. Moreover, they curate only a single structure type of batch data sources (e.g., semi-structured). Considering the changing requirements of users and the need to build a curation pipeline customized to both the user and the data source characteristics, we propose a service-based framework for adaptive data curation in data lakehouses that comprises five modules: data collection, data quality evaluation, data characterization, curation service composition, and data curation. The framework is built upon a new modular ontology for data characterization and evaluation and a curation service composition approach, both of which we detail in this paper. Experimental findings validate the performance of the contributions in terms of effectiveness and execution time.
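To make the five-module flow named in the abstract concrete, the following minimal Python sketch chains collection, quality evaluation, characterization, service composition, and curation over an in-memory batch. It is an illustration under stated assumptions, not the authors' implementation: every name (`SourceProfile`, `compose`, `normalize_keys`, `drop_incomplete`), the completeness metric, and the threshold are hypothetical. The composition step is a simple rule-based stand-in, whereas the framework described here relies on an ontology and a dedicated composition approach.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict[str, object]
Service = Callable[[list[Record]], list[Record]]


@dataclass
class SourceProfile:
    """Hypothetical summary produced by evaluation and characterization."""
    structure: str       # "structured" or "semi-structured" in this sketch
    completeness: float  # share of non-missing values, in [0, 1]


def collect(raw: list[Record]) -> list[Record]:
    """Data collection: here, just a copy of an in-memory batch."""
    return [dict(r) for r in raw]


def evaluate_quality(records: list[Record]) -> float:
    """Data quality evaluation: completeness as one illustrative metric."""
    cells = [v for r in records for v in r.values()]
    if not cells:
        return 1.0
    return sum(v is not None for v in cells) / len(cells)


def characterize(records: list[Record], completeness: float) -> SourceProfile:
    """Data characterization: a crude structure guess from key uniformity."""
    key_sets = {frozenset(r) for r in records}
    structure = "structured" if len(key_sets) <= 1 else "semi-structured"
    return SourceProfile(structure, completeness)


def normalize_keys(records: list[Record]) -> list[Record]:
    """Hypothetical curation service: lower-case all field names."""
    return [{k.lower(): v for k, v in r.items()} for r in records]


def drop_incomplete(records: list[Record]) -> list[Record]:
    """Hypothetical curation service: discard records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def compose(profile: SourceProfile, min_completeness: float) -> list[Service]:
    """Curation service composition: rule-based stand-in (an assumption)."""
    pipeline: list[Service] = []
    if profile.structure == "semi-structured":
        pipeline.append(normalize_keys)
    if profile.completeness < min_completeness:
        pipeline.append(drop_incomplete)
    return pipeline


def curate(records: list[Record], pipeline: list[Service]) -> list[Record]:
    """Data curation: apply the composed services in order."""
    for service in pipeline:
        records = service(records)
    return records


raw = [{"City": "Lyon", "cases": 12}, {"city": "Paris", "cases": None}]
batch = collect(raw)
profile = characterize(batch, evaluate_quality(batch))
pipeline = compose(profile, min_completeness=0.9)  # user requirement (assumed)
curated = curate(batch, pipeline)
```

In this sketch the user requirement enters only through `min_completeness`, so a different decision context yields a different pipeline for the same source. That is the sense of "adaptive" the example tries to capture.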

Availability of data and materials

This declaration is not applicable.

Notes

  1. https://www.google.com/covid19/mobility/

  2. https://archive.ics.uci.edu/ml/datasets/Health+News+in+Twitter

  3. Data analysis is out of the scope of the present paper.

Acknowledgements

The authors would like to thank Dr. Fatma Guermazi, an oncology physician at Léon-Bérard Center, Lyon, France, for agreeing to validate the effectiveness of the medical enrichment proposed by our data curation approach.

Funding

No funding.

Author information

Contributions

F.Z. and C.G.G. wrote and reviewed the main manuscript text, and K.B. and N.K. participated in elaborating the scientific contribution.

Corresponding authors

Correspondence to Firas Zouari or Chirine Ghedira-Guegan.

Ethics declarations

Competing interests

The authors have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Ethical Approval

This declaration is not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Web Information Systems Engineering 2022. Guest Editors: Richard Chbeir, Helen Huang, Yannis Manolopoulos and Fabrizio Silvestri.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zouari, F., Ghedira-Guegan, C., Boukadi, K. et al. A semantic and service-based approach for adaptive mutli-structured data curation in data lakehouses. World Wide Web 26, 4001–4023 (2023). https://doi.org/10.1007/s11280-023-01218-3

  • DOI: https://doi.org/10.1007/s11280-023-01218-3

Keywords