Improving Entity Resolution with Global Constraints

Gemmell, Jim; Rubinstein, Benjamin I. P.; Chandra, Ashok K.

arXiv:1108.6016 (cs)

[Submitted on 30 Aug 2011]

Abstract: Some of the greatest advances in web search have come from leveraging socio-economic properties of online user behavior. Past advances include PageRank, anchor text, hubs-authorities, and TF-IDF. In this paper, we investigate another socio-economic property that, to our knowledge, has not yet been exploited: sites that create lists of entities, such as IMDB and Netflix, have an incentive to avoid gratuitous duplicates. We leverage this property to resolve entities across different web sites, and find that we can obtain substantial improvements in resolution accuracy. This improvement in accuracy also translates into robustness, which often reduces the amount of training data that must be labeled for comparing entities across many sites. Furthermore, the technique provides robustness when resolving sites that have some duplicates, even without first removing these duplicates. We present algorithms with very strong precision and recall, and show that max weight matching, while appearing to be a natural choice, turns out to have poor performance in some situations. The presented techniques are now being used in the back-end entity resolution system at a major Internet search engine.
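The abstract does not spell out the algorithms, so the following is only an illustrative sketch of how a one-to-one constraint between two duplicate-free sites might be applied, and of one situation in which max weight matching could behave poorly. The score matrix, the threshold of 0.5, and the function names greedy_one_to_one and max_weight_matching are all assumptions made for this example and are not taken from the paper.

```python
from itertools import permutations


def greedy_one_to_one(scores, threshold=0.5):
    """Accept pairs in descending score order, using each entity at most once.

    scores[i][j] is a hypothetical pairwise match score between entity i on
    one site and entity j on the other. Pairs below `threshold` are rejected.
    """
    pairs = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_rows, used_cols, matches = set(), set(), []
    for s, i, j in pairs:
        if s < threshold:
            break  # all remaining pairs score even lower
        if i not in used_rows and j not in used_cols:
            matches.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(matches)


def max_weight_matching(scores):
    """Brute-force max weight perfect matching (small square matrices only)."""
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda p: sum(scores[i][p[i]] for i in range(n)),
    )
    return [(i, best[i]) for i in range(n)]


# Made-up scores: entity 0 on each site is a strong match (0.9); the
# off-diagonal pairs are mediocre (0.6); entity 1 on each site does not match.
scores = [[0.9, 0.6],
          [0.6, 0.0]]

print(greedy_one_to_one(scores))    # keeps the single strong pair
print(max_weight_matching(scores))  # trades it for two mediocre pairs
```

On this toy input the greedy pass keeps only the 0.9 pair, whereas maximizing total weight prefers the two 0.6 pairs (total 1.2) and so discards the strongest match. Whether this is the failure mode the authors observed cannot be determined from the abstract alone.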

Comments:	10 pages, 13 figures
Subjects:	Databases (cs.DB); Information Retrieval (cs.IR)
ACM classes:	H.2; H.3.3; I.5.4
Report number:	MSR-TR-2011-100
Cite as:	arXiv:1108.6016 [cs.DB]
	(or arXiv:1108.6016v1 [cs.DB] for this version)
	https://doi.org/10.48550/arXiv.1108.6016

Submission history

From: Benjamin Rubinstein
[v1] Tue, 30 Aug 2011 17:30:54 UTC (827 KB)
