DOI: 10.5555/1884017.1884103

EIF: a framework of effective entity identification

Published: 15 July 2010

Abstract

Entity identification, the task of establishing correspondences between objects and the entities they refer to in dirty data, plays an important role in data cleaning. Confusion between entities and their names is a common source of dirty data: different entities may share an identical name, and different names may refer to the same entity. The major task of entity identification is therefore to distinguish entities that share a name and to recognize different names that refer to the same entity. However, current research focuses on only one of these two aspects and cannot solve the problem completely. To address this, we propose EIF, a framework for entity identification that considers both kinds of confusion. By combining effective clustering techniques, approximate string matching algorithms, and a flexible mechanism for knowledge integration, EIF can be applied to many different kinds of entity identification problems. As an application of EIF, we solve the author identification problem. Extensive experiments verify the effectiveness of the framework.
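
The two kinds of confusion can be illustrated with a small sketch. This is not the EIF algorithm itself, whose details are not given in the abstract; it is a minimal toy, assuming that each record carries an author name plus a set of co-author names as context, that name similarity is measured with Python's `difflib.SequenceMatcher`, and that two records are merged only when their names are similar and they share at least one co-author. The function names, the 0.75 threshold, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    """Approximate string similarity in [0, 1] (illustrative choice of measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def identify_entities(records, name_threshold=0.75):
    """Group record indices that plausibly refer to the same entity.

    records: list of (name, set_of_coauthor_names) pairs.
    Two records are linked when their names are similar AND they share
    at least one co-author; linked records are merged with union-find.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        name_i, context_i = records[i]
        name_j, context_j = records[j]
        similar_name = name_similarity(name_i, name_j) >= name_threshold
        shared_context = bool(context_i & context_j)
        if similar_name and shared_context:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical sample data, invented for this illustration.
records = [
    ("Wei Wang", {"Alice Smith", "Bob Jones"}),  # 0
    ("W. Wang", {"Bob Jones"}),                  # 1: name variant of record 0
    ("Wei Wang", {"Carol White"}),               # 2: same name, unrelated context
    ("Mei Lin", {"Bob Jones"}),                  # 3: shared context, different name
]
clusters = identify_entities(records)
print(clusters)  # [[0, 1], [2], [3]]
```

Records 0 and 1 are merged despite their different spellings, while record 2 stays separate despite having the identical name, which mirrors the two cases the abstract describes. A realistic system would need richer evidence than a single shared co-author.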

Published In

WAIM'10: Proceedings of the 11th international conference on Web-age information management
July 2010
784 pages
ISBN: 3642142451
Editors: Lei Chen, Changjie Tang, Jun Yang, Yunjun Gao

Sponsors

  • NSF of China: National Natural Science Foundation of China
  • Wisesoft
  • Sichuan University
  • East China Normal University

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. data cleaning
  2. entity identification
  3. graph partition

Qualifiers

  • Article

Cited By

  • (2018) An effective weighted rule-based method for entity resolution. Distributed and Parallel Databases 36:3 (593-612). DOI: 10.1007/s10619-018-7240-6. Online publication date: 1-Sep-2018
  • (2017) CleanCloud. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (2543-2546). DOI: 10.1145/3132847.3133187. Online publication date: 6-Nov-2017
  • (2016) Cleanix. ACM SIGMOD Record 44:4 (35-40). DOI: 10.1145/2935694.2935702. Online publication date: 9-May-2016
  • (2016) Feature-Based Researcher Identification Framework Using Timeline Data. Wireless Personal Communications: An International Journal 91:4 (1653-1667). DOI: 10.1007/s11277-016-3662-5. Online publication date: 1-Dec-2016
  • (2014) Cleanix. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (2024-2026). DOI: 10.1145/2661829.2661837. Online publication date: 3-Nov-2014
