DOI: 10.1145/2882903.2915206
research-article

RDFind: Scalable Conditional Inclusion Dependency Discovery in RDF Datasets

Published: 14 June 2016

Abstract

Inclusion dependencies (INDs) form an important integrity constraint on relational databases, supporting data management tasks such as join path discovery and query optimization. Conditional inclusion dependencies (CINDs), which define including and included data in terms of conditions, allow these capabilities to be transferred to RDF data. However, CIND discovery is computationally much more complex than IND discovery, and the number of CINDs, even on small RDF datasets, is intractably large. To cope with both problems, we first introduce the notion of pertinent CINDs, with an adjustable relevance criterion to filter and rank CINDs based on their extent and the implications among them. Second, we present RDFind, a distributed system that efficiently discovers all pertinent CINDs in RDF data. RDFind employs a lazy pruning strategy to drastically reduce the CIND search space. In addition, its exhaustive parallelization strategy and robust data structures make it highly scalable. In our experimental evaluation, we show that RDFind is up to 419 times faster than the state of the art while considering a more general class of CINDs. Furthermore, it is capable of processing a very large dataset of billions of triples, which was entirely infeasible before.
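To make the notion of a CIND on RDF data concrete, the following is a minimal sketch of checking a single candidate by brute force. It is not RDFind's discovery algorithm and does not model pertinence, pruning, or distribution. It assumes a simplified reading in which each side of a CIND projects one triple field (subject, predicate, or object) after fixing one or two of the remaining fields to constants. The names `capture` and `holds` and the toy triples are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Mapping, Set, Tuple

Triple = Tuple[str, str, str]          # (subject, predicate, object)
Side = Tuple[str, Mapping[str, str]]   # (projected field, condition on other fields)

FIELDS = {"s": 0, "p": 1, "o": 2}


def capture(triples: Iterable[Triple], project: str,
            condition: Mapping[str, str]) -> Set[str]:
    """Collect the values of `project` over all triples that satisfy `condition`."""
    idx = FIELDS[project]
    return {
        t[idx]
        for t in triples
        if all(t[FIELDS[field]] == value for field, value in condition.items())
    }


def holds(triples: Iterable[Triple], dependent: Side, referenced: Side) -> bool:
    """A candidate CIND holds if the dependent values are a subset of the referenced ones."""
    triples = list(triples)  # allow iterating twice
    return capture(triples, *dependent) <= capture(triples, *referenced)


# Hypothetical toy dataset (assumption, not taken from the paper).
TRIPLES = [
    ("patrick", "rdf:type", "gradStudent"),
    ("mike", "rdf:type", "gradStudent"),
    ("patrick", "undergradFrom", "hpi"),
    ("mike", "undergradFrom", "cmu"),
    ("tim", "undergradFrom", "hpi"),
]

# "Every subject typed as gradStudent also appears as the subject of undergradFrom."
grad_students = ("s", {"p": "rdf:type", "o": "gradStudent"})
alumni = ("s", {"p": "undergradFrom"})

forward = holds(TRIPLES, grad_students, alumni)
backward = holds(TRIPLES, alumni, grad_students)
```

In this toy data, `forward` is true while `backward` is false, because `tim` has an `undergradFrom` triple but no `gradStudent` type. Enumerating every such pair of conditions naively is what makes the search space grow so quickly, which is the problem the abstract's pruning and relevance criteria address.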


Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN:9781450335317
DOI:10.1145/2882903

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CIND discovery
  2. RDF
  3. RDFind
  4. conditional inclusion dependencies
  5. data profiling

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'16: International Conference on Management of Data
June 26 - July 1, 2016
San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
