Semantic Data Integration and Querying: A Survey and Challenges

Published: 26 April 2024
    Abstract

    The digital revolution produces massive, heterogeneous, and isolated data. These data remain underutilized and unsuitable for integrated querying and knowledge discovery. Hence the importance of this survey on data integration, which identifies challenging issues and trends. First, an overview of the different generations and basics of data integration is given. Then the focus turns to semantic data integration, since it semantically links data, allowing wider insights and decision-making. More than thirty works are reviewed. The goal is to help analysts identify relevant criteria for comparing and then choosing among semantic data integration approaches, focusing on the category (materialized, virtual, or hybrid) and querying techniques.

    Published In

    ACM Computing Surveys  Volume 56, Issue 8
    August 2024
    963 pages
    ISSN:0360-0300
    EISSN:1557-7341
    DOI:10.1145/3613627
    Editors: David Atienza, Michela Milano

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 April 2024
    Online AM: 23 March 2024
    Accepted: 10 March 2024
    Revised: 26 February 2024
    Received: 02 January 2023
    Published in CSUR Volume 56, Issue 8

    Author Tags

    1. Data integration
    2. ontology
    3. query processing
    4. ETL
    5. OBDA
    6. semantic mapping

    Qualifiers

    • Survey

    Funding Sources

    • OntoCommons project funded by the European Union’s Horizon 2020 research and innovation

