Map to humans and reduce error: crowdsourcing for deduplication applied to digital libraries

M Georgescu, DD Pham, CS Firan, W Nejdl, J Gaugaz
Proceedings of the 21st ACM international conference on Information and …, 2012 - dl.acm.org
Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Several methods try to identify duplicate entities, focusing either on accuracy or on efficiency and speed, but none offers a perfect solution. We propose a combined, layered approach to duplicate detection whose main advantage is its use of crowdsourcing as a training and feedback mechanism. By applying active learning techniques to human-provided examples, we fine-tune our algorithm toward better duplicate detection accuracy. We keep the training cost low by gathering training data on demand, only for borderline cases or inconclusive assessments. We apply our simple and powerful methods to an online publication search system. First, we perform a coarse duplicate detection in real time, relying on publication signatures. Then, a second automatic step compares duplicate candidates and increases accuracy, adjusting based on feedback from both our online users and crowdsourcing platforms. Our approach shows an improvement of 14% over the untrained setting and differs from the human assessors' accuracy by only 4%.
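The layered idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the signature (year plus first title word), the Jaccard similarity on title tokens, the `lower`/`upper` thresholds, and all function names are assumptions chosen for illustration. The sketch groups records by a coarse signature, scores candidate pairs within each group, and marks only the borderline pairs for human assessment.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lowercased alphanumeric word set of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def signature(record):
    """Hypothetical coarse blocking key: (year, first title word)."""
    words = re.findall(r"[a-z0-9]+", record["title"].lower())
    return (record["year"], words[0] if words else "")


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_pairs(records, lower=0.4, upper=0.8):
    """Label candidate pairs; thresholds are illustrative assumptions."""
    # Step 1: coarse grouping by signature, so only records that
    # share a signature are ever compared.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[signature(rec)].append(idx)

    # Step 2: score each candidate pair and route inconclusive
    # scores to human assessors instead of deciding automatically.
    decisions = {}
    for members in blocks.values():
        for i, j in combinations(members, 2):
            score = jaccard(tokens(records[i]["title"]),
                            tokens(records[j]["title"]))
            if score >= upper:
                decisions[(i, j)] = "duplicate"
            elif score <= lower:
                decisions[(i, j)] = "distinct"
            else:
                decisions[(i, j)] = "ask_crowd"
    return decisions
```

In this sketch, labels collected for the `ask_crowd` pairs would be the on-demand training data; how those labels then adjust the second step (the active learning part) is not specified in the abstract and is left out here.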
ACM Digital Library