Swoosh: a generic approach to entity resolution

Published: 01 January 2009

Abstract

We consider the entity resolution (ER) problem (also known as deduplication, or merge-purge), in which records determined to represent the same real-world entity are successively located and merged. We formalize the generic ER problem, treating the functions for comparing and merging records as black boxes, which permits expressive and extensible ER solutions. We identify four important properties that, if satisfied by the match and merge functions, enable much more efficient ER algorithms. We develop three efficient ER algorithms: G-Swoosh for the case where the four properties do not hold, and R-Swoosh and F-Swoosh, which exploit the four properties. F-Swoosh in addition assumes knowledge of the "features" (e.g., attributes) used by the match function. We experimentally evaluate the algorithms using comparison shopping data from Yahoo! Shopping and hotel information data from Yahoo! Travel. We also show that R-Swoosh (and F-Swoosh) can be used even when the four match and merge properties do not hold, if an "approximate" result is acceptable.
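
To make the black-box formulation concrete, the following is a minimal Python sketch of a match-and-merge loop. It is an illustration only, not the paper's pseudocode: the abstract does not spell out the algorithms, so the loop structure, the `resolve` function, the record representation (a frozenset of attribute/value pairs), and the toy `match` and `merge` functions are all assumptions made for this example. The loop is only expected to behave well when the supplied functions are well behaved (for instance, a merged record still matches whatever its parts matched), which is the kind of condition the four properties are meant to capture.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

# Hypothetical record representation: a set of (attribute, value) pairs.
Record = FrozenSet[Tuple[str, str]]


def match(r1: Record, r2: Record) -> bool:
    """Toy match function (an assumption): two records match if they
    share an identical name pair or an identical phone pair."""
    keys = {"name", "phone"}
    return any(attr in keys for attr, _ in (r1 & r2))


def merge(r1: Record, r2: Record) -> Record:
    """Toy merge function (an assumption): union of all pairs."""
    return r1 | r2


def resolve(
    records: Iterable[Record],
    match_fn: Callable[[Record, Record], bool],
    merge_fn: Callable[[Record, Record], Record],
) -> List[Record]:
    """Repeatedly locate a matching pair and merge it.

    `pending` holds records still to be compared; `resolved` holds
    records that are pairwise non-matching so far. match_fn and
    merge_fn are treated as black boxes.
    """
    pending = list(records)
    resolved: List[Record] = []
    while pending:
        current = pending.pop()
        # Find the first already-resolved record that matches, if any.
        partner = next((r for r in resolved if match_fn(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            # The merged record replaces both inputs and is compared again.
            resolved.remove(partner)
            pending.append(merge_fn(current, partner))
    return resolved


r1 = frozenset({("name", "John Doe"), ("phone", "555-1234")})
r2 = frozenset({("name", "J. Doe"), ("phone", "555-1234"),
                ("email", "jdoe@example.com")})
r3 = frozenset({("name", "John Doe"), ("email", "john@example.com")})
r4 = frozenset({("name", "Jane Roe"), ("phone", "555-9999")})

result = resolve([r1, r2, r3, r4], match, merge)
```

In this made-up data, `r2` and `r3` do not match each other directly, but each matches `r1` (by phone and by name, respectively), so all three end up in one merged record while `r4` stays separate. Every merge shrinks the total number of records by one, so the loop terminates.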

Published In

The VLDB Journal — The International Journal on Very Large Data Bases  Volume 18, Issue 1
January 2009
373 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Data cleaning
  2. Entity resolution
  3. Generic entity resolution

Qualifiers

  • Article
