Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/1516360.1516413acmotherconferencesArticle/Chapter ViewAbstractPublication PagesedbtConference Proceedingsconference-collections
research-article
Free access

Efficient top-k count queries over imprecise duplicates

Published: 24 March 2009 Publication History

Abstract

We propose efficient techniques for processing various Top-K count queries on data with noisy duplicates. Our method differs from existing work on duplicate elimination in two significant ways: First, we dedup on the fly only the part of the data needed for the answer --- a requirement in massive and evolving sources where batch deduplication is expensive. The non-local nature of the problem of partitioning data into duplicate groups, makes it challenging to filter only those tuples forming the K largest groups. We propose a novel method of successively collapsing and pruning records which yield an order of magnitude reduction in running time compared to deduplicating the entire data first.
Second, we return multiple high scoring answers to handle situations where it is impossible to resolve if two records are indeed duplicates of each other. Since finding even the highest scoring deduplication is NP-hard, the existing approach is to deploy one of many variants of score-based clustering algorithms which do not easily generalize to finding multiple groupings. We model deduplication as a segmentation of a linear embedding of records and present a polynomial time algorithm for finding the R highest scoring answers. This method closely matches the accuracy of an exact exponential time algorithm on several datasets.

References

[1]
R. Ananthakrishna, S. chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, 2002.
[2]
P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In ICDE, 2006.
[3]
A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006.
[4]
N. Bansal, A. Blum, and S. Chawla. Correlation clustering. In FOCS '02: Proceedings of the 43rd Symposium on Foundations of Computer Science, page 238, Washington, DC, USA, 2002. IEEE Computer Society.
[5]
I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. TKDD, 1(1), 2007.
[6]
M. Bilenko. Learnable similarity functions and their applications to clustering and record linkage. In Proceedings of the Nineteenth National Conference on Artificial Intelligence, Sixteenth Conference on Innovative Applications of Artificial Intelligence, July 25--29, 2004, San Jose, California, USA, pages 981--982. AAAI Press / The MIT Press, 2004.
[7]
M. Bilenko, S. Basu, and M. Sahami. Adaptive product normalization: Using online learning for record linkage in comparison shopping. In ICDM, 2005.
[8]
M. Bilenko, B. Kamath, and R. J. Mooney. Adaptive blocking: Learning to scale up record linkage. In ICDM, 2006.
[9]
M. Bilenko, R. Mooney, W. Cohen, P. Ravikumar, and S. Fienberg. Adaptive name-matching in information integration. IEEE Intelligent Systems, 2003.
[10]
M. Charikar, V. Guruswami, and A. Wirth. Clustering with qualitative information. J. Comput. Syst. Sci., 71(3):360--383, 2005.
[11]
S. Chaudhuri, B.-C. Chen, V. Ganti, and R. Kaushik. Example-driven design of efficient record matching queries. In VLDB, pages 327--338, 2007.
[12]
S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003.
[13]
S. Chaudhuri, V. Ganti, and R. Motwani. Robust identification of fuzzy duplicates. In ICDE, 2005.
[14]
D. Cheng, R. Kannan, S. Vempala, and G. Wang. A divide-and-merge methodology for clustering. ACM Trans. Database Syst., 31(4):1499--1525, 2006.
[15]
W. Cohen and J. Richman. Learning to match and cluster entity names. In ACM SIGIR' 01 Workshop on Mathematical/Formal Methods in Information Retrieval, 2001.
[16]
W. W. Cohen. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18(3):288--321, July 2000.
[17]
X. Dong, A. Y. Halevy, and C. Yu. Data integration with uncertainty. In VLDB '07: Proceedings of the 33rd international conference on Very large data bases, pages 687--698, 2007.
[18]
I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Society, 64:1183--1210, 1969.
[19]
L. Gravano, P. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In Proc. of the 27th Int'l Conference on Very Large Databases (VLDB), Rome, Italy, 2001.
[20]
I. F. Ilyas, G. Beskales, and M. A. Soliman. Survey of top-k query processing techniques in relational database systems,. To Appear in the ACM Computing Surveys, 2008, 2008.
[21]
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[22]
J. Ko, T. Mitamura, and E. Nyberg. Language-independent probabilistic answer ranking for question answering. In ACL, 2007.
[23]
D. Koller and N. Friedman. Structured probabilistic models. Under preparation, 2007.
[24]
Y. Koren and D. Harel. A multi-scale algorithm for the linear arrangement problem. In WG, 2002.
[25]
C. Li, K. C.-C. Chang, and I. F. Ilyas. Supporting ad-hoc ranking aggregates. In SIGMOD Conference, 2006.
[26]
A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Knowledge Discovery and Data Mining, pages 169--178, 2000.
[27]
A. McCallum and B. Wellner. Toward conditional models of identity uncertainty with application to proper noun coreference. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, pages 79--86, Acapulco, Mexico, Aug. 2003.
[28]
A. E. Monge and C. P. Elkan. The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996.
[29]
H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In Advances in Neural Processing Systems 15, Vancouver, British Columbia, 2002. MIT Press.
[30]
S. Sarawagi. The crf project: a java implementation. http://crf.sourceforge.net, 2004.
[31]
S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proc. of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(KDD-2002), Edmonton, Canada, July 2002.
[32]
S. Sarawagi and A. Kirpal. Scaling up the alias duplicate elimination system: A demostration. In Proc. of the 19th IEEE Int'l Conference on Data Engineering (ICDE), Bangalore, March 2003.
[33]
S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2004.
[34]
A. D. Sarma, X. Dong, and A. Halevy. Bootstrapping pay-as-you-go data integration systems. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, 2008.
[35]
M. A. Soliman, I. F. Ilyas, and K. C.-C. Chang. Probabilistic top- and ranking-aggregate queries. ACM Trans. Database Syst., 33(3), 2008.
[36]
M. L. Wick, K. Rohanimanesh, K. Schultz, and A. McCallum. A unified approach for schema matching, coreference and canonicalization. In KDD, 2008.
[37]
X. Yang, B. Wang, and C. Li. Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In SIGMOD Conference, pages 353--364, 2008.

Cited By

View all
  • (2013)A Web-Based Multimedia Retrieval System with MCA-Based Filtering and Subspace-Based Learning AlgorithmsInternational Journal of Multimedia Data Engineering & Management10.4018/jmdem.20130401024:2(13-45)Online publication date: 1-Apr-2013
  • (2013)Data-based research at IIT BombayACM SIGMOD Record10.1145/2481528.248153642:1(38-43)Online publication date: 1-May-2013
  • (2010)Queries over Unstructured Data: Probabilistic Methods to the RescueEnabling Real-Time Business Intelligence10.1007/978-3-642-14559-9_1(1-13)Online publication date: 2010
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Other conferences
EDBT '09: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
March 2009
1180 pages
ISBN:9781605584225
DOI:10.1145/1516360
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 March 2009

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Research-article

Conference

EDBT/ICDT '09
EDBT/ICDT '09: EDBT/ICDT '09 joint conference
March 24 - 26, 2009
Saint Petersburg, Russia

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)17
  • Downloads (Last 6 weeks)9
Reflects downloads up to 09 Sep 2024

Other Metrics

Citations

Cited By

View all
  • (2013)A Web-Based Multimedia Retrieval System with MCA-Based Filtering and Subspace-Based Learning AlgorithmsInternational Journal of Multimedia Data Engineering & Management10.4018/jmdem.20130401024:2(13-45)Online publication date: 1-Apr-2013
  • (2013)Data-based research at IIT BombayACM SIGMOD Record10.1145/2481528.248153642:1(38-43)Online publication date: 1-May-2013
  • (2010)Queries over Unstructured Data: Probabilistic Methods to the RescueEnabling Real-Time Business Intelligence10.1007/978-3-642-14559-9_1(1-13)Online publication date: 2010
  • (2009)Enhancing Concept Detection by Pruning Data with MCA-Based Transaction WeightsProceedings of the 2009 11th IEEE International Symposium on Multimedia10.1109/ISM.2009.125(304-311)Online publication date: 14-Dec-2009

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Get Access

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media