Article

EIF: a framework of effective entity identification

Authors:

Jianzhong Li

WAIM'10: Proceedings of the 11th international conference on Web-age information management

Pages 717 - 728

Published: 15 July 2010

Abstract

Entity identification, the task of establishing the correspondence between objects and entities in dirty data, plays an important role in data cleaning. Dirty data often results from confusion between entities and their names: different entities may share an identical name, and different names may refer to the same entity. The major task of entity identification is therefore to distinguish entities that share a name and to recognize different names that refer to the same entity. However, existing research addresses only one of these two aspects and so cannot solve the problem completely. To address this, this paper proposes EIF, an entity identification framework that considers both kinds of confusion. By combining effective clustering techniques, approximate string matching algorithms, and a flexible knowledge-integration mechanism, EIF can be applied to many different kinds of entity identification problems. As an application of EIF, we solve the author identification problem. Extensive experiments verify the effectiveness of the framework.
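The two kinds of confusion named in the abstract can be illustrated with a small sketch. This is not the EIF algorithm, which the abstract does not specify. It is a minimal illustration under stated assumptions: names are compared with Python's `difflib.SequenceMatcher` as a stand-in for approximate string matching, records are clustered with a simple union-find, and a shared coauthor serves as hypothetical evidence that two records refer to the same author. The records, the threshold of 0.85, and the function names are all invented for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_names(a, b, threshold=0.85):
    """Approximate string match on lowercased names (threshold is an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def identify_entities(records, threshold=0.85):
    """Cluster record indices that plausibly refer to the same entity.

    records: list of (name, set_of_coauthors) pairs.
    Two records are merged only when their names match approximately
    AND they share at least one coauthor (an illustrative rule, not EIF's).
    """
    parent = list(range(len(records)))
    for i, j in combinations(range(len(records)), 2):
        (name_i, co_i), (name_j, co_j) = records[i], records[j]
        if similar_names(name_i, name_j, threshold) and co_i & co_j:
            parent[find(parent, i)] = find(parent, j)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


# Hypothetical records: (author name as written, coauthors on that paper).
records = [
    ("John Smith", {"A. Lee", "B. Chen"}),
    ("Jon Smith", {"B. Chen"}),      # different name, same entity as record 0
    ("John Smith", {"C. Kumar"}),    # same name as record 0, different entity
    ("Mary Jones", {"C. Kumar"}),    # shares a coauthor but the name differs
]

clusters = identify_entities(records)
print(clusters)  # [[0, 1], [2], [3]]
```

Records 0 and 1 are merged despite the spelling difference, while record 2 stays separate from record 0 despite the identical name. Requiring both signals is a deliberately conservative choice for this sketch; a real system would weigh more evidence than a single shared coauthor.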



Published In

WAIM'10: Proceedings of the 11th international conference on Web-age information management

July 2010, 784 pages

ISBN: 3642142451

Editors:
Lei Chen (Hong Kong University of Science and Technology, Department of Computer Science, Hong Kong, China)
Changjie Tang (Sichuan University, Computer Department, Chengdu, China)
Jun Yang (Duke University, Department of Computer Science, Durham, NC)
Yunjun Gao (Zhejiang University, College of Computer Science, Hangzhou, China)

Sponsors

NSF of China: National Natural Science Foundation of China
Wisesoft
Sichuan University
East China Normal University

Publisher

Springer-Verlag, Berlin, Heidelberg


Cited By

Abu Ahmad H, Wang H (2018). An effective weighted rule-based method for entity resolution. Distributed and Parallel Databases 36:3 (593-612). Online publication date: 1-Sep-2018.
https://dl.acm.org/doi/10.1007/s10619-018-7240-6
Wang H, Ding X, Chen X, Li J, Gao H, Lim E, Winslett M, Sanderson M, Fu A, Sun J, Culpepper S, Lo E, Ho J, Donato D, Agrawal R, Zheng Y, Castillo C, Sun A, Tseng V, Li C (2017). CleanCloud. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (2543-2546). Online publication date: 6-Nov-2017.
https://dl.acm.org/doi/10.1145/3132847.3133187
Wang H, Li M, Bu Y, Li J, Gao H, Zhang J (2016). Cleanix. ACM SIGMOD Record 44:4 (35-40). Online publication date: 9-May-2016.
https://dl.acm.org/doi/10.1145/2935694.2935702
Gim J, Jang Y, Jung H, Jeong D (2016). Feature-Based Researcher Identification Framework Using Timeline Data. Wireless Personal Communications: An International Journal 91:4 (1653-1667). Online publication date: 1-Dec-2016.
https://dl.acm.org/doi/10.1007/s11277-016-3662-5
Wang H, Li M, Bu Y, Li J, Gao H, Zhang J, Li J, Wang X, Garofalakis M, Soboroff I, Suel T, Wang M (2014). Cleanix. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (2024-2026). Online publication date: 3-Nov-2014.
https://dl.acm.org/doi/10.1145/2661829.2661837
