DOI:10.1145/2487575.2487662
poster

Exploiting user clicks for automatic seed set generation for entity matching

Published: 11 August 2013

Abstract

Matching entities from different information sources is an important problem in data analysis and data integration. It is challenging, however, because of the number and diversity of the information sources involved and the significant editorial effort required to collect sufficient training data. In this paper, we present an approach that leverages user clicks during Web search to automatically generate training data for entity matching. The key insight of our approach is that Web pages clicked for a given query are likely to be about the same entity. We use random walk with restart to reduce data sparseness, rely on co-clustering to group queries and Web pages, and exploit page similarity to improve matching precision. Experimental results show that: (i) with 360K pages from 6 major travel websites, we obtain 84K matchings (covering 179K pages) that refer to the same entities, with an average precision of 0.826; (ii) the quality of matching obtained from a classifier trained on the resulting seed data is promising: its performance matches that of editorial data at small training-set sizes and improves as the seed set grows.
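The abstract does not give the exact formulation used in the paper, so the following is only a minimal sketch of how random walk with restart can spread relevance over a sparse query-page click graph. The toy click counts, the `q:`/`p:` node prefixes, the function name, and the restart probability of 0.15 are all assumptions made for illustration, not values from the paper.

```python
from collections import defaultdict


def random_walk_with_restart(clicks, start, restart=0.15, iterations=50):
    """Approximate random-walk-with-restart scores from `start`.

    clicks: dict mapping (query, page) -> click count (hypothetical toy data).
    Returns a dict mapping every node to its visiting probability.
    """
    # Build a weighted, undirected adjacency map over queries and pages.
    neighbors = defaultdict(dict)
    for (query, page), count in clicks.items():
        neighbors[query][page] = count
        neighbors[page][query] = count
    if start not in neighbors:
        raise KeyError(f"start node {start!r} is not in the click graph")

    # All probability mass begins on the start node.
    scores = {node: 0.0 for node in neighbors}
    scores[start] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in neighbors}
        for node, mass in scores.items():
            total = sum(neighbors[node].values())
            # Follow an edge with probability (1 - restart), choosing the
            # edge in proportion to its click count.
            for other, weight in neighbors[node].items():
                updated[other] += (1.0 - restart) * mass * weight / total
        # With probability `restart`, the walker jumps back to the start.
        updated[start] += restart
        scores = updated
    return scores


# Hypothetical click log: two hotel pages never share a query directly,
# but they are connected through a third page clicked for both queries.
clicks = {
    ("q:hilton paris", "p:siteA/hilton-paris"): 5,
    ("q:hilton paris", "p:siteB/hotel-123"): 3,
    ("q:hilton paris opera", "p:siteB/hotel-123"): 2,
    ("q:hilton paris opera", "p:siteC/h-987"): 4,
    ("q:ritz london", "p:siteA/ritz-london"): 6,
}
start = "p:siteA/hilton-paris"
scores = random_walk_with_restart(clicks, start)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{node}: {score:.4f}")
```

In this sketch the page on siteC receives a nonzero score even though it shares no query with the start page, which is the sense in which the walk densifies a sparse click graph. The unrelated page in the disconnected component keeps a score of zero.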

Cited By

  • (2015) Incorporating Social Context and Domain Knowledge for Entity Recognition. Proceedings of the 24th International Conference on World Wide Web, pp. 517-526. DOI:10.1145/2736277.2741135. Online publication date: 18-May-2015

Published In

KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2013
1534 pages
ISBN:9781450321747
DOI:10.1145/2487575

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. co-clustering
  2. entity matching
  3. random walk
  4. user clicks

Qualifiers

  • Poster

Conference

KDD '13

Acceptance Rates

KDD '13 Paper Acceptance Rate 125 of 726 submissions, 17%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (last 12 months): 1
  • Downloads (last 6 weeks): 0
Reflects downloads up to 08 Feb 2025
