research-article

An incremental graph-partitioning algorithm for entity resolution

Published: 01 March 2019

Highlights

A novel incremental data association algorithm is proposed for entity resolution.
Up to an order of magnitude faster than batch algorithms, with little or no loss in accuracy.
Shows an approximately 30–40% better F-Score on the John Smith dataset, compared to leading heuristics.
Proposed algorithm leverages the Clique Partition Problem.
Quickly updates solution for new references as well as changes in similarity scores.
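
For orientation, the Clique Partition Problem named above is usually stated as the following integer program. This is the textbook form, not necessarily the exact model used in the article; the symbols $w_{ij}$ (pairwise similarity score, positive for "likely the same entity") and $x_{ij}$ (1 if references $i$ and $j$ are placed in the same cluster) are notation assumed here.

```latex
\begin{align}
\max \quad & \sum_{i<j} w_{ij}\, x_{ij} \\
\text{s.t.} \quad
& x_{ij} + x_{jk} - x_{ik} \le 1, && \forall\, i<j<k, \\
& x_{ij} - x_{jk} + x_{ik} \le 1, && \forall\, i<j<k, \\
& -x_{ij} + x_{jk} + x_{ik} \le 1, && \forall\, i<j<k, \\
& x_{ij} \in \{0,1\}, && \forall\, i<j.
\end{align}
```

The three inequalities enforce transitivity: if any two of the three pairs among $i$, $j$, $k$ are linked, the third must be linked as well, so every feasible solution is a partition of the references into cliques.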

Abstract

Entity resolution is an important data association task when fusing information from multiple sources. Often the information arrives continuously, and the entity resolution algorithm needs to update its solution efficiently upon receiving new information. In this work, we introduce an incremental entity resolution algorithm based on a graph partitioning formulation. The developed algorithm handles both incrementally arriving entity references and incrementally arriving information that changes the pairwise similarity scores between references. New information is handled in a way that allows the algorithm to reconsider past decisions when contradicting information arrives. Because the graph partitioning formulation used is NP-hard, a heuristic algorithm is developed to produce good solutions; the heuristic is also compatible with a blocking technique that limits the number of required comparisons. The algorithm is tested on a variety of datasets (randomly generated and real), and it is shown that allowing the algorithm to consider revised scores and revisit prior decisions offers a substantial improvement in accuracy (approximately 30–40% better F-Score on a natural language dataset) compared to other greedy heuristics on the same set of coefficients. It is also shown that, on a test set with 100 references, the incremental algorithm is up to an order of magnitude faster than a batch approach that re-solves the entire problem.
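
The behaviour described in the abstract can be illustrated with a minimal Python sketch. It is not the article's algorithm: it is a simple greedy local search over a clique-partition objective (sum of scores inside clusters), and the class name `IncrementalPartition`, its methods, and the example scores are all hypothetical. It shows the two kinds of update (a new reference, a revised score) and how a revised score can undo an earlier decision. Only clusters containing a scored neighbour are examined, which stands in crudely for blocking.

```python
from collections import defaultdict, deque


class IncrementalPartition:
    """Illustrative greedy local search; names and logic are assumptions."""

    def __init__(self):
        self.score = defaultdict(dict)   # score[a][b]: symmetric pairwise score
        self.cluster_of = {}             # reference -> cluster id
        self.members = defaultdict(set)  # cluster id -> set of references
        self._next_id = 0

    def _gain(self, ref, cid):
        # Score contributed by ref if it sits in cluster cid.
        return sum(self.score[ref].get(m, 0.0)
                   for m in self.members[cid] if m != ref)

    def _place(self, ref):
        # Move ref to its best cluster; return True if it moved.
        current = self.cluster_of.get(ref)
        best_cid, best_gain = None, 0.0  # None means "open a new singleton"
        candidates = {self.cluster_of[n] for n in self.score[ref]
                      if n in self.cluster_of}
        for cid in sorted(candidates):
            g = self._gain(ref, cid)
            if g > best_gain:
                best_cid, best_gain = cid, g
        if current is not None:
            # Only strictly improving moves, so the search terminates.
            if best_gain <= self._gain(ref, current):
                return False
            self.members[current].discard(ref)
            if not self.members[current]:
                del self.members[current]
        if best_cid is None:
            best_cid = self._next_id
            self._next_id += 1
        self.members[best_cid].add(ref)
        self.cluster_of[ref] = best_cid
        return True

    def _repair(self, seeds):
        # Re-examine seeds; a move puts the mover's neighbours back in the queue.
        queue = deque(seeds)
        while queue:
            ref = queue.popleft()
            if self._place(ref):
                queue.extend(n for n in self.score[ref] if n in self.cluster_of)

    def add_reference(self, ref, scores):
        # New reference with scores to some existing references.
        for other, s in scores.items():
            self.score[ref][other] = s
            self.score[other][ref] = s
        self._repair([ref])

    def update_score(self, a, b, s):
        # Revised similarity between two existing references.
        self.score[a][b] = s
        self.score[b][a] = s
        self._repair([a, b])

    def clusters(self):
        return {frozenset(m) for m in self.members.values()}


if __name__ == "__main__":
    p = IncrementalPartition()
    p.add_reference("a", {})
    p.add_reference("b", {"a": 1.0})
    p.add_reference("c", {"a": -2.0, "b": 0.5})
    print(p.clusters())            # c is kept apart from a and b
    p.update_score("a", "c", 3.0)  # contradicting information arrives
    print(p.clusters())            # earlier decision is revisited
```

In this toy run, reference c first lands in its own cluster because its negative score with a outweighs its positive score with b. After the a–c score is revised upward, a moves to c, and b then follows, so all three end up together. A batch approach would reach the same partition here only by re-solving from scratch.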

Published In

Information Fusion, Volume 46, Issue C
Mar 2019
231 pages

Publisher

Elsevier Science Publishers B. V.
Netherlands

Author Tags

1. Entity resolution
2. Data association
3. Graph partitioning
4. Incremental algorithm

Qualifiers

• Research-article

          Contributors

          Other Metrics

          Bibliometrics & Citations

          Bibliometrics

          Article Metrics

          • Downloads (Last 12 months)0
          • Downloads (Last 6 weeks)0
          Reflects downloads up to 22 Sep 2024

          Other Metrics

          Citations

          Cited By

          View all
          • (2020)Enhanced Duplicate Count StrategyProceedings of the 3rd International Conference on Networking, Information Systems & Security10.1145/3386723.3387877(1-7)Online publication date: 31-Mar-2020

          View Options

          View options

          Get Access

          Login options

          Media

          Figures

          Other

          Tables

          Share

          Share

          Share this Publication link

          Share on social media