research-article

An incremental graph-partitioning algorithm for entity resolution

Authors:

Moises Sudit

Volume 46, Issue C

Pages 171 - 183

https://doi.org/10.1016/j.inffus.2018.06.001

Published: 01 March 2019

Highlights

• A novel incremental data association algorithm is proposed for entity resolution.
• Order of magnitude faster than batch algorithms, with little/no loss in accuracy.
• Shows 30–40% better F-Score on John Smith dataset, compared to leading heuristics.
• Proposed algorithm leverages the Clique Partition Problem.
• Quickly updates solution for new references as well as changes in similarity scores.

Abstract

Entity resolution is an important data association task when fusing information from multiple sources. Often the information arrives continuously, and the entity resolution algorithm needs to update its solution efficiently as new information is received. In this work, we introduce an incremental entity resolution algorithm based on a graph partitioning formulation. The developed algorithm handles both incrementally arriving entity references and incrementally arriving information that changes the pairwise similarity scores between references. New information is handled in a way that allows the algorithm to reconsider past decisions when contradicting information arrives. Because the graph partitioning formulation used is NP-Hard, a heuristic algorithm is developed to produce good solutions; the heuristic is also compatible with a blocking technique that limits the number of required comparisons. The algorithm is tested on a variety of datasets (randomly generated and real), and it is shown that allowing the algorithm to consider revised scores and revisit prior decisions offers a substantial improvement in accuracy (approximately 30–40% better F-Score on a natural language dataset) compared to other greedy heuristics on the same set of coefficients. It is also shown that, on a test set with 100 references, the incremental algorithm is up to an order of magnitude faster than a batch approach that re-solves the entire problem.
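To make the idea of incremental updates concrete, the following is a minimal sketch of a greedy local-move heuristic for a clique-partitioning style objective (maximize the sum of pairwise scores inside clusters). It is not the algorithm from the paper, whose details are not given on this page. The class name `IncrementalPartition`, its methods, the sign convention (positive scores favor merging, negative scores favor separation), and the treatment of unscored pairs as zero are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class IncrementalPartition:
    """Hypothetical greedy heuristic: keep clusters whose internal
    pairwise scores sum as high as possible (unscored pairs count as 0)."""

    def __init__(self):
        self.score = defaultdict(dict)   # score[a][b] = similarity of a and b
        self.cluster_of = {}             # reference -> cluster id
        self.members = defaultdict(set)  # cluster id -> references in it
        self._next_id = 0

    def _gain(self, ref, cid):
        # Total score between ref and the other members of cluster cid.
        return sum(self.score[ref].get(m, 0.0)
                   for m in self.members[cid] if m != ref)

    def _relocate(self, ref):
        # Move ref only if that strictly improves the objective.
        old = self.cluster_of.get(ref)
        best_cid, best_gain = None, 0.0
        if old is not None:
            stay = self._gain(ref, old)
            if stay >= 0.0:
                best_cid, best_gain = old, stay
        # Only clusters holding a scored neighbor can offer a nonzero gain.
        candidates = {self.cluster_of[n] for n in self.score[ref]
                      if n in self.cluster_of}
        for cid in sorted(candidates):
            gain = self._gain(ref, cid)
            if gain > best_gain:
                best_cid, best_gain = cid, gain
        if old is not None and best_cid == old:
            return False
        if best_cid is None:             # nothing helps: open a new cluster
            best_cid = self._next_id
            self._next_id += 1
        if old is not None:
            self.members[old].discard(ref)
            if not self.members[old]:
                del self.members[old]
        self.cluster_of[ref] = best_cid
        self.members[best_cid].add(ref)
        return True

    def _repair(self, seeds, max_steps=10_000):
        # Revisit earlier decisions: when a reference moves, recheck
        # every reference that shares a score with it.
        queue, steps = deque(seeds), 0
        while queue and steps < max_steps:
            ref = queue.popleft()
            steps += 1
            if self._relocate(ref):
                queue.extend(n for n in self.score[ref] if n in self.cluster_of)

    def _set(self, a, b, s):
        self.score[a][b] = s
        self.score[b][a] = s

    def add_reference(self, ref, scores):
        """Insert a new reference with scores to existing references."""
        for other, s in scores.items():
            self._set(ref, other, s)
        self._repair([ref])

    def update_score(self, a, b, s):
        """Revise one pairwise score and repair the partition locally."""
        self._set(a, b, s)
        self._repair([a, b])

    def clusters(self):
        return sorted(sorted(m) for m in self.members.values() if m)


er = IncrementalPartition()
er.add_reference("r1", {})
er.add_reference("r2", {"r1": 0.8})
er.add_reference("r3", {"r1": 0.6, "r2": -0.2})
print(er.clusters())             # [['r1', 'r2', 'r3']]
er.update_score("r2", "r3", -0.9)
print(er.clusters())             # [['r1', 'r2'], ['r3']]
```

In this toy run, the revised score between `r2` and `r3` contradicts the earlier merge, so the local repair splits the cluster without re-solving the whole problem. Restricting candidate clusters to those containing a scored neighbor is only loosely analogous to blocking, and the sketch makes no claim about solution quality.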

Cited By

Aassem Y., Hafidi I., Aboutabit N., Ben Ahmed M., Boudhir A. (2020) Enhanced Duplicate Count Strategy. Proceedings of the 3rd International Conference on Networking, Information Systems & Security, pp. 1–7. Online publication date: 31-Mar-2020.
https://dl.acm.org/doi/10.1145/3386723.3387877

Published In

Information Fusion Volume 46, Issue C

Mar 2019

231 pages

ISSN:1566-2535

Copyright © 2018.

Publisher

Elsevier Science Publishers B. V.

Netherlands
