DOI: 10.1145/2063576.2063643
research-article

One is enough: distributed filtering for duplicate elimination

Published: 24 October 2011 Publication History

Abstract

The growth of online services has created the need for duplicate elimination in high-volume streams of events. The sheer volume of data in applications such as pay-per-click clickstream processing, RSS feed syndication, and notification services on social sites such as Twitter and Facebook makes traditional centralized solutions hard to scale. In this paper, we propose an approach based on distributed filtering. To this end, we introduce a suite of distributed Bloom filters that exploit different ways of partitioning the event space. To address the continuous nature of event delivery, the filters are extended to support sliding window semantics. Moreover, we examine locality-related tradeoffs and propose a tree-based architecture to allow for duplicate elimination across geographic locations. We map out the design space and present experimental results that demonstrate the pros and cons of our various solutions in different settings.
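
As a rough illustration of the two ideas the abstract names, partitioning the event space across filters and sliding window semantics, the following minimal Python sketch may help. It is not the paper's design: the class names (`BloomFilter`, `WindowedNode`, `PartitionedDeduplicator`), the hash-based routing, the pane-rotation approximation of a sliding window, and all sizes are assumptions chosen only for illustration.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over a fixed-size bit array (sizes are illustrative)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


class WindowedNode:
    """Hypothetical filtering node: approximates a sliding window with
    a fixed number of rotating sub-filters ("panes")."""

    def __init__(self, panes=3):
        self.panes = [BloomFilter() for _ in range(panes)]

    def seen_or_add(self, key):
        # An event counts as a duplicate if any live pane reports it.
        if any(key in pane for pane in self.panes):
            return True
        self.panes[-1].add(key)  # record in the newest pane
        return False

    def advance(self):
        # Expire the oldest pane and open an empty one.
        self.panes.pop(0)
        self.panes.append(BloomFilter())


class PartitionedDeduplicator:
    """Assumed partitioning scheme: each event key is hashed to exactly one node,
    so a single lookup decides whether the event is a duplicate."""

    def __init__(self, num_nodes=4, panes=3):
        self.nodes = [WindowedNode(panes) for _ in range(num_nodes)]

    def _route(self, key):
        digest = hashlib.sha256(b"route:" + key.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest[:4], "big") % len(self.nodes)]

    def is_duplicate(self, key):
        return self._route(key).seen_or_add(key)

    def advance_window(self):
        for node in self.nodes:
            node.advance()
```

In this sketch, calling `is_duplicate("click-1")` twice on a fresh `PartitionedDeduplicator` returns `False` and then `True`; once `advance_window()` has been called as many times as there are panes, the event has left the window and is treated as new again. As with any Bloom filter, unrelated keys can occasionally be reported as duplicates (false positives), while a key still inside the window is never missed.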



      Published In

      CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
      October 2011
      2712 pages
      ISBN:9781450307178
      DOI:10.1145/2063576
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. distributed bloom filters
      2. duplicate elimination

      Qualifiers

      • Research-article

      Conference

      CIKM '11

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)4
      • Downloads (Last 6 weeks)1
      Reflects downloads up to 12 Nov 2024

      Citations

      Cited By

      • (2024) Autonomous proactive data management in support of pervasive edge applications. Future Generation Computer Systems, 155 (108-120). DOI: 10.1016/j.future.2024.02.003. Online publication date: Jun-2024
      • (2024) Data management and selectivity in collaborative pervasive edge computing. Computing, 106:8 (2561-2584). DOI: 10.1007/s00607-024-01297-8. Online publication date: 1-Aug-2024
      • (2023) A Learned Cuckoo Filter for Approximate Membership Queries over Variable-sized Sliding Windows on Data Streams. Proceedings of the ACM on Management of Data, 1:4 (1-26). DOI: 10.1145/3626758. Online publication date: 12-Dec-2023
      • (2019) Inferring Insertion Times and Optimizing Error Penalties in Time-decaying Bloom Filters. ACM Transactions on Database Systems, 44:2 (1-32). DOI: 10.1145/3284552. Online publication date: 15-Mar-2019
      • (2018) rFilter: A Scalable and Space-efficient Membership Filter. 2018 5th International Conference on Signal Processing and Integrated Networks (SPIN) (478-485). DOI: 10.1109/SPIN.2018.8474044. Online publication date: Feb-2018
      • (2018) A neural data structure for novelty detection. Proceedings of the National Academy of Sciences, 115:51 (13093-13098). DOI: 10.1073/pnas.1814448115. Online publication date: 3-Dec-2018
      • (2016) False-Positive Probability and Compression Optimization for Tree-Structured Bloom Filters. ACM Transactions on Modeling and Performance Evaluation of Computing Systems, 1:4 (1-39). DOI: 10.1145/2940324. Online publication date: 21-Sep-2016
      • (2016) BEAD: Best effort autonomous deletion in content-centric networking. 2016 IFIP Networking Conference (IFIP Networking) and Workshops (180-188). DOI: 10.1109/IFIPNetworking.2016.7497241. Online publication date: May-2016
      • (2013) Inferential time-decaying Bloom filters. Proceedings of the 16th International Conference on Extending Database Technology (239-250). DOI: 10.1145/2452376.2452405. Online publication date: 18-Mar-2013
      • (2013) Duplicate Detection for Identifying Social Spam in Microblogs. Proceedings of the 2013 IEEE International Congress on Big Data (141-148). DOI: 10.1109/BigData.Congress.2013.27. Online publication date: 27-Jun-2013
