
Keys for graphs

Published: 01 August 2015

Abstract

Keys for graphs aim to uniquely identify entities represented by vertices in a graph. We propose a class of keys that are recursively defined in terms of graph patterns, and are interpreted with subgraph isomorphism. Extending conventional keys for relations and XML, these keys find applications in object identification, knowledge fusion and social network reconciliation. As an application, we study the entity matching problem that, given a graph G and a set Σ of keys, is to find all pairs of entities (vertices) in G that are identified by keys in Σ. We show that the problem is intractable, and cannot be parallelized in logarithmic rounds. Nonetheless, we provide two parallel scalable algorithms for entity matching, in MapReduce and a vertex-centric asynchronous model. Using real-life and synthetic data, we experimentally verify the effectiveness and scalability of the algorithms.
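As a rough illustration of the entity matching problem stated in the abstract, the following is a minimal sketch and not the paper's algorithm. It assumes a heavily simplified setting: the graph is a list of (subject, predicate, object) triples, and each key is a star-shaped pattern given as an entity type plus a list of predicates, rather than a general graph pattern interpreted with subgraph isomorphism. The recursion is kept: an object that is itself an entity counts as "the same" only once it has already been identified by some key, so the function iterates to a fixpoint. The names `entity_matching`, `TRIPLES`, and `KEYS`, as well as the toy data, are hypothetical.

```python
from itertools import combinations

# Hypothetical toy graph: subjects are entities, other objects are plain values.
TRIPLES = [
    ("alb1", "type", "album"), ("alb1", "name", "Anthology 2"),
    ("alb1", "year", "1996"), ("alb1", "by", "art1"),
    ("alb2", "type", "album"), ("alb2", "name", "Anthology 2"),
    ("alb2", "year", "1996"), ("alb2", "by", "art2"),
    ("alb3", "type", "album"), ("alb3", "name", "Anthology 2"),
    ("alb3", "year", "2001"), ("alb3", "by", "art3"),
    ("art1", "type", "artist"), ("art1", "name", "The Beatles"), ("art1", "made", "alb1"),
    ("art2", "type", "artist"), ("art2", "name", "The Beatles"), ("art2", "made", "alb2"),
    ("art3", "type", "artist"), ("art3", "name", "The Beatles"), ("art3", "made", "alb3"),
]

# Hypothetical simplified keys: (entity type, predicates that must agree).
KEYS = [
    ("album", ["name", "year"]),   # value-based key
    ("album", ["name", "by"]),     # recursive: "by" points to an artist entity
    ("artist", ["name", "made"]),  # recursive: "made" points to an album entity
]


def entity_matching(triples, keys):
    """Return all pairs of entities identified by the keys (toy fixpoint)."""
    # Index: entity -> predicate -> set of objects.
    out = {}
    for s, p, o in triples:
        out.setdefault(s, {}).setdefault(p, set()).add(o)

    # Union-find over entities to track which ones are already identified.
    parent = {e: e for e in out}

    def find(e):
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    def same(a, b):
        # Entities agree if currently identified; values agree if equal.
        if a in parent and b in parent:
            return find(a) == find(b)
        return a == b

    def satisfied(e1, e2, key_type, predicates):
        if key_type not in out[e1].get("type", ()):
            return False
        if key_type not in out[e2].get("type", ()):
            return False
        return all(
            any(same(o1, o2)
                for o1 in out[e1].get(p, ())
                for o2 in out[e2].get(p, ()))
            for p in predicates
        )

    # Repeat until no key identifies a new pair.
    changed = True
    while changed:
        changed = False
        for e1, e2 in combinations(sorted(out), 2):
            if find(e1) == find(e2):
                continue
            if any(satisfied(e1, e2, t, ps) for t, ps in keys):
                parent[find(e1)] = find(e2)
                changed = True

    return {(a, b) for a, b in combinations(sorted(out), 2) if find(a) == find(b)}


matches = entity_matching(TRIPLES, KEYS)
```

In this toy run the value-based album key identifies `alb1` with `alb2`, which in turn lets the recursive artist key identify `art1` with `art2`. Without the value-based key, the two recursive keys have nothing to start from and no pair is identified. The sketch compares all entity pairs in every round, so it says nothing about the parallel algorithms the abstract refers to.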


Published In

Proceedings of the VLDB Endowment  Volume 8, Issue 12
Proceedings of the 41st International Conference on Very Large Data Bases, Kohala Coast, Hawaii
August 2015
728 pages
ISSN:2150-8097

Publisher

VLDB Endowment
