DOI: 10.1145/2611462.2611501
research-article

Beyond set disjointness: the communication complexity of finding the intersection

Published: 15 July 2014

Abstract

We consider the following fundamental communication problem: there is data that is distributed among servers, and the servers want to compute the intersection of their data sets, e.g., the common records in a relational database. They want to do this with as little communication and as few messages (rounds) as possible. They are willing to use randomization, and fail with a tiny probability. Given a protocol for computing the intersection, it can also be used to compute the exact Jaccard similarity, the rarity, the number of distinct elements, and joins between databases. Computing the intersection is at least as hard as the set disjointness problem, which asks whether the intersection is empty. Formally, in the two-server setting, the players hold subsets S, T ⊆ [n]. In many realistic scenarios, the sizes of S and T are significantly smaller than n, so we impose the constraint that |S|, |T| ≤ k. We study the minimum number of bits the parties need to communicate in order to compute the intersection set S ∩ T, given a certain number r of messages that are allowed to be exchanged. While O(k log(n/k)) bits is achieved trivially and deterministically with a single message, we ask what is possible with more than one message and with randomization. We give a smooth communication/round tradeoff which shows that with O(log* k) rounds, O(k) bits of communication is possible, which improves upon the trivial protocol by an order of magnitude. This is in contrast to other basic problems such as computing the union or symmetric difference, for which Ω(k log(n/k)) bits of communication is required for any number of rounds. For two players, known lower bounds for the easier problem of set disjointness imply our algorithms are optimal up to constant factors in communication and number of rounds. We extend our protocols to m-player protocols, obtaining an optimal O(mk) bits of communication with a similarly small number of rounds.
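To make the baseline concrete, the following is a minimal sketch of the trivial one-message protocol mentioned in the abstract, not the paper's multi-round protocol. It assumes one player (called Alice here) sends her whole set and the other (Bob) intersects locally; the function names and the bit-count model are illustrative assumptions. The message cost is modelled as the number of bits needed to name one subset of [n] of size |S|, which is Θ(k log(n/k)) when |S| = k ≤ n/2, and the small extra cost of announcing |S| is ignored. The Jaccard similarity is then derived from the intersection and the two set sizes.

```python
import math


def one_message_intersection(S, T, n):
    """Trivial baseline: Alice sends S in full, Bob intersects it with T.

    Returns the intersection and a modelled message length in bits
    (hypothetical cost model: index of S among all size-|S| subsets of [n]).
    """
    # Alice's single message is her entire set.
    message = sorted(S)
    # Naming one of C(n, |S|) subsets takes ceil(log2 C(n, |S|)) bits.
    bits = math.ceil(math.log2(math.comb(n, len(S))))
    # Bob computes the intersection locally from the message and his own set.
    intersection = set(message) & set(T)
    return intersection, bits


def jaccard_from_intersection(S, T, intersection):
    """Exact Jaccard similarity |S ∩ T| / |S ∪ T| given the intersection."""
    union_size = len(S) + len(T) - len(intersection)
    # Convention (an assumption): two empty sets have similarity 1.0.
    return len(intersection) / union_size if union_size else 1.0


if __name__ == "__main__":
    S = {1, 3, 5, 7}
    T = {3, 4, 5, 6}
    common, bits = one_message_intersection(S, T, n=16)
    print(sorted(common), bits, jaccard_from_intersection(S, T, common))
```

In this model the cost grows with log(n/k) per element, which is the factor the abstract says can be removed by allowing O(log* k) rounds and randomization.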

    Published In

    PODC '14: Proceedings of the 2014 ACM symposium on Principles of distributed computing
    July 2014
    444 pages
    ISBN: 9781450329446
    DOI: 10.1145/2611462

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. big data
    2. communication complexity
    3. communication protocols
    4. databases
    5. distributed algorithms

    Qualifiers

    • Research-article

    Conference

    PODC '14

    Acceptance Rates

    PODC '14 Paper Acceptance Rate 39 of 141 submissions, 28%;
    Overall Acceptance Rate 740 of 2,477 submissions, 30%

    Cited By

    • (2024) Multi-Party Set Disjointness and Intersection with Bounded Dependence. Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing, pages 332-342. DOI: 10.1145/3662158.3662795. Online publication date: 17-Jun-2024
    • (2024) Distributed Thresholded Counting with Limited Interaction. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 4664-4675. DOI: 10.1145/3637528.3671868. Online publication date: 25-Aug-2024
    • (2023) The Communication Complexity of Functions with Large Outputs. Structural Information and Communication Complexity, pages 427-458. DOI: 10.1007/978-3-031-32733-9_19. Online publication date: 6-Jun-2023
    • (2021) Multiparty Reach and Frequency Histogram: Private, Secure, and Practical. Proceedings on Privacy Enhancing Technologies, 2022:1, pages 373-395. DOI: 10.2478/popets-2022-0019. Online publication date: 20-Nov-2021
    • (2021) The communication complexity of multiparty set disjointness under product distributions. Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 1194-1207. DOI: 10.1145/3406325.3451106. Online publication date: 15-Jun-2021
    • (2020) A Lower Bound for Sampling Disjoint Sets. ACM Transactions on Computation Theory, 12:3, pages 1-13. DOI: 10.1145/3404858. Online publication date: 20-Jul-2020
    • (2020) Communication Complexity with Small Advantage. Computational Complexity, 29:1. DOI: 10.1007/s00037-020-00192-w. Online publication date: 20-Apr-2020
    • (2020) Distributed Testing of Distance-k Colorings. Structural Information and Communication Complexity, pages 275-290. DOI: 10.1007/978-3-030-54921-3_16. Online publication date: 29-Jun-2020
    • (2019) Polynomial pass lower bounds for graph streaming algorithms. Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 265-276. DOI: 10.1145/3313276.3316361. Online publication date: 23-Jun-2019
    • (2019) On Efficient Tree-Based Tag Search in Large-Scale RFID Systems. IEEE/ACM Transactions on Networking, 27:1, pages 42-55. DOI: 10.1109/TNET.2018.2879979. Online publication date: 1-Feb-2019
