DOI: 10.1145/2594538.2594554
research-article

Is min-wise hashing optimal for summarizing set intersection?

Published: 18 June 2014

Abstract

Min-wise hashing is an important method for estimating the size of the intersection of sets, based on a succinct summary (a "min-hash") of each set. One application is estimating the number of data points that satisfy the conjunction of m >= 2 simple predicates, where a min-hash is available for the set of points satisfying each predicate. This has applications in query optimization and in approximate computation of COUNT aggregates.
In this paper we address the question: How many bits must be allocated to each summary in order to get an estimate with (1 +/- epsilon)-relative error? The state-of-the-art technique for minimizing the encoding size, for any desired estimation error, is b-bit min-wise hashing due to Li and König (Communications of the ACM, 2011). We give new lower and upper bounds:
Using information complexity arguments, we show that b-bit min-wise hashing is space optimal for m=2 predicates, in the sense that the estimator's variance is within a constant factor of the smallest possible among all summaries with the given space usage. But for conjunctions of m>2 predicates we show that the performance of b-bit min-wise hashing (and more generally any method based on "k-permutation" min-hash) deteriorates as m grows.
We describe a new summary that nearly matches our lower bound for m >= 2. It asymptotically outperforms all k-permutation schemes (by around a factor Omega(m/log m)), as well as methods based on subsampling (by a factor Omega(log n_max), where n_max is the maximum set size).
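
To make the baseline concrete, here is a minimal sketch of a "k-permutation" min-hash summary used to estimate the intersection size of m = 2 sets. It is not the paper's new summary and not b-bit min-wise hashing. It assumes that salted BLAKE2b hashes stand in for random permutations and that the exact set sizes are stored alongside each summary. The names `minhash_signature`, `estimate_jaccard`, and `estimate_intersection` are hypothetical, chosen for illustration.

```python
import hashlib


def _hash(seed: int, item: int) -> int:
    """64-bit hash of (seed, item); an assumed stand-in for a random permutation."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, k: int):
    """Summary of a set: the minimum hash value under each of k salted hashes."""
    return [min(_hash(seed, x) for x in items) for seed in range(k)]


def estimate_jaccard(sig_a, sig_b) -> float:
    """Fraction of positions where the two min-hashes agree.

    For a random permutation, the minima of A and B coincide with
    probability |A ∩ B| / |A ∪ B|, so this fraction estimates the
    Jaccard similarity J.
    """
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def estimate_intersection(sig_a, size_a: int, sig_b, size_b: int) -> float:
    """Convert the Jaccard estimate into an intersection-size estimate.

    Uses |A ∩ B| = J * (|A| + |B|) / (1 + J), which follows from
    |A ∪ B| = |A| + |B| - |A ∩ B|. Set sizes are assumed to be known.
    """
    j = estimate_jaccard(sig_a, sig_b)
    return j * (size_a + size_b) / (1.0 + j)


# Two sets of 1000 integers each, overlapping in exactly 500 elements.
A = set(range(0, 1000))
B = set(range(500, 1500))
sig_a = minhash_signature(A, 256)
sig_b = minhash_signature(B, 256)
est = estimate_intersection(sig_a, len(A), sig_b, len(B))
print(f"estimated intersection size: {est:.1f} (true value: 500)")
```

Each summary here stores k full 64-bit hash values. b-bit min-wise hashing would keep only the lowest b bits of each value and correct the estimator for accidental collisions, which is the space/variance trade-off the abstract analyzes.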

    Published In

    PODS '14: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
    June 2014
    300 pages
    ISBN:9781450323758
    DOI:10.1145/2594538
    • General Chair: Richard Hull
    • Program Chair: Martin Grohe
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. communication complexity
    2. lower bound
    3. min-wise hashing
    4. set intersection

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'14

    Acceptance Rates

    PODS '14 Paper Acceptance Rate 22 of 67 submissions, 33%;
    Overall Acceptance Rate 642 of 2,707 submissions, 24%

    Bibliometrics

    • Downloads (last 12 months): 39
    • Downloads (last 6 weeks): 3
    Reflects downloads up to 21 Sep 2024

    Cited By

    • (2024) On the Feasibility of Forgetting in Data Streams. Proceedings of the ACM on Management of Data 2:2 (1-17). DOI: 10.1145/3651603. Online publication date: 14-May-2024
    • (2024) Probabilistic Support Prediction: Fast Frequent Itemset Mining in Dense Data. IEEE Access 12 (39330-39350). DOI: 10.1109/ACCESS.2024.3376477. Online publication date: 2024
    • (2022) Efficient and Privacy Preserving Approximation of Distributed Statistical Queries. IEEE Transactions on Big Data 8:5 (1399-1413). DOI: 10.1109/TBDATA.2021.3052516. Online publication date: 1-Oct-2022
    • (2021) Competitive data-structure dynamization. Proceedings of the Thirty-Second Annual ACM-SIAM Symposium on Discrete Algorithms (2269-2287). DOI: 10.5555/3458064.3458199. Online publication date: 10-Jan-2021
    • (2020) HyperMinHash: MinHash in LogLog space. IEEE Transactions on Knowledge and Data Engineering (1-1). DOI: 10.1109/TKDE.2020.2981311. Online publication date: 2020
    • (2020) Similarity Search for Dynamic Data Streams. IEEE Transactions on Knowledge and Data Engineering 32:11 (2241-2253). DOI: 10.1109/TKDE.2019.2916858. Online publication date: 1-Nov-2020
    • (2020) Subsets and Supermajorities: Optimal Hashing-based Set Similarity Search. 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS) (728-739). DOI: 10.1109/FOCS46700.2020.00073. Online publication date: Nov-2020
    • (2019) Joins on samples. Proceedings of the VLDB Endowment 13:4 (547-560). DOI: 10.14778/3372716.3372726. Online publication date: 9-Dec-2019
    • (2019) Fast Eclat Algorithms Based on Minwise Hashing for Large Scale Transactions. IEEE Internet of Things Journal 6:2 (3948-3961). DOI: 10.1109/JIOT.2018.2885851. Online publication date: Apr-2019
    • (2019) Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment. 2019 IEEE 35th International Conference on Data Engineering (ICDE) (1190-1201). DOI: 10.1109/ICDE.2019.00109. Online publication date: Apr-2019
