Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.1145/3485447.3512019acmconferencesArticle/Chapter ViewAbstractPublication PageswebconfConference Proceedingsconference-collections
research-article

DUET: A Generic Framework for Finding Special Quadratic Elements in Data Streams

Published: 25 April 2022 Publication History
  • Get Citation Alerts
  • Abstract

    Finding special items, like heavy hitters, top-k, and persistent items, has always been a hot issue in data stream processing for web analysis. While data streams nowadays are usually high-dimensional, most prior works focus on special items according to a certain primary dimension and yield little insight into the correlations between dimensions. Therefore, we propose to find special quadratic elements to reveal close correlations. Based on the items mentioned above, we extend our problem to three applications related to heavy hitters, top-k, and persistent items, and design a generic framework DUET to process them. Besides, we analyze the error bound of our algorithm and conduct extensive experiments on four data sets. Our experimental results show that DUET can achieve 3.5 times higher throughput and three orders of magnitude lower average relative error compared with cutting-edge algorithms.

    References

    [1]
    [n. d.]. Advanced Persistent Threat. http://www.usenix.org/event/lisa09/tech/slides/daly. pdf.
    [2]
    [n. d.]. BobHash. http://burtleburtle.net/bob/hash/evahash.html.
    [3]
    [n. d.]. The CAIDA Anonymized Internet Traces. https://www.caida.org/data/passive/passive_2016_dataset.xml.
    [4]
    [n. d.]. Social Network: Reddit Hyperlink Network. http://snap.stanford.edu/data/soc-RedditHyperlinks.html.
    [5]
    [n. d.]. Source Codes of DUET. https://github.com/callitwhatyouwannt13/WWW_DUET.
    [6]
    [n. d.]. The Source Codes of HeavyKeeper. https://github.com/papergitkeeper/heavy-keeper-project.
    [7]
    [n. d.]. Stack Overflow Temporal Network. http://snap.stanford.edu/data/sx-stackoverflow.html.
    [8]
    Juan Jose Garcia Adeva and Juan Manuel Pikatza Atxa. 2007. Intrusion detection in web applications using text mining. Engineering Applications of Artificial Intelligence 20, 4(2007), 555–566.
    [9]
    Mohammad Alizadeh, Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu, Andy Fingerhut, Vinh The Lam, Francis Matus, Rong Pan, Navindra Yadav, 2014. CONGA: Distributed congestion-aware load balancing for datacenters. In Proceedings of ACM Conference on SIGCOMM. ACM, 503–514.
    [10]
    Ran Ben Basat, Gil Einziger, Roy Friedman, and Yaron Kassner. 2017. Optimal elephant flow detection. In Proceedings of International Conference on Computer Communications. IEEE, 1–9.
    [11]
    Ran Ben Basat, Gil Einziger, Michael Mitzenmacher, and Shay Vargaftik. 2020. Faster and More Accurate Measurement through Additive-Error Counters. In Proceedings of International Conference on Computer Communications. IEEE, 1251–1260.
    [12]
    Burton H Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422–426.
    [13]
    Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In Proceedings of International Colloquium on Automata, Languages, and Programming. Springer, 693–703.
    [14]
    Graham Cormode and Shan Muthukrishnan. 2005. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55, 1 (2005), 58–75.
    [15]
    Haipeng Dai, Muhammad Shahzad, Alex X Liu, and Yuankun Zhong. 2016. Finding persistent items in data streams. Proceedings of the VLDB Endowment 10, 4 (2016), 289–300.
    [16]
    Italo Epicoco, Massimo Cafaro, and Marco Pulimeno. 2018. Fast and accurate mining of correlated heavy hitters. Data Mining and Knowledge Discovery 32, 1 (2018), 162–186.
    [17]
    Cristian Estan and George Varghese. 2002. New directions in traffic measurement and accounting. In Proceedings of ACM Conference on SIGCOMM. ACM, 323–336.
    [18]
    Qun Huang, Patrick PC Lee, and Yungang Bao. 2018. SketchLearn: relieving user burdens in approximate measurement with automated statistical inference. In Proceedings of ACM Conference on SIGCOMM. ACM, 576–590.
    [19]
    Mohammad T Irfan and Tucker Gordon. 2018. The power of context in networks: Ideal point models with social interactions. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. 910–918.
    [20]
    Bibudh Lahiri and Srikanta Tirthapura. 2009. Finding correlated heavy-hitters over data streams. In Proceedings of International Performance Computing and Communications Conference. IEEE, 307–314.
    [21]
    Bibudh Lahiri, Srikanta Tirthapura, and Jaideep Chandrashekar. 2014. Space-efficient tracking of persistent items in a massive data stream. Statistical Analysis and Data Mining: The ASA Data Science Journal 7, 1(2014), 70–92.
    [22]
    Jizhou Li, Zikun Li, Yifei Xu, Shiqi Jiang, Tong Yang, Bin Cui, Yafei Dai, and Gong Zhang. 2020. WavingSketch: An Unbiased and Generic Sketch for Finding Top-k Items in Data Streams. In Proceedings of International Conference on Knowledge Discovery & Data Mining. ACM, 1574–1584.
    [23]
    Zaoxing Liu, Ran Ben-Basat, Gil Einziger, Yaron Kassner, Vladimir Braverman, Roy Friedman, and Vyas Sekar. 2019. Nitrosketch: Robust and general sketch-based monitoring in software switches. In Proceedings of ACM Conference on SIGCOMM. ACM, 334–350.
    [24]
    Zaoxing Liu, Antonis Manousis, Gregory Vorsanger, Vyas Sekar, and Vladimir Braverman. 2016. One sketch to rule them all: Rethinking network flow monitoring with univmon. In Proceedings of ACM Conference on SIGCOMM. ACM, 101–114.
    [25]
    Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2005. Efficient computation of frequent and top-k elements in data streams. In Proceedings of International Conference on Database Theory. Springer, 398–412.
    [26]
    Katsiaryna Mirylenka, Themis Palpanas, Graham Cormode, and Divesh Srivastava. 2013. Finding interesting correlations with conditional heavy hitters. In Proceedings of International Conference on Data Engineering. IEEE, 1069–1080.
    [27]
    Jayadev Misra and David Gries. 1982. Finding repeated elements. Science of Computer Programming 2, 2 (1982), 143–152.
    [28]
    Marco Pulimeno, Italo Epicoco, Massimo Cafaro, Catiuscia Melle, and Giovanni Aloisio. 2018. Parallel Mining of Correlated Heavy Hitters on Distributed and Shared-Memory Architectures. In Proceedings of International Conference on Big Data. IEEE, 5111–5118.
    [29]
    Amin Shokrollahi. 2006. Raptor codes. IEEE Transactions on Information Theory 52, 6 (2006), 2551–2567.
    [30]
    Stuart Staniford, James A Hoagland, and Joseph M McAlerney. 2002. Practical automated detection of stealthy portscans. Journal of Computer Security 10, 1-2 (2002), 105–136.
    [31]
    Karthik Subbian, Charu Aggarwal, and Kshiteesh Hegde. 2016. Recommendations for streaming data. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. 2185–2190.
    [32]
    Lu Tang, Qun Huang, and Patrick PC Lee. 2019. MV-Sketch: A fast and compact invertible sketch for heavy flow detection in network data streams. In Proceedings of International Conference on Computer Communications. IEEE, 2026–2034.
    [33]
    Tobias Weller. 2018. Compromised account detection based on clickstream data. In Companion Proceedings of the The Web Conference 2018. 819–823.
    [34]
    Tong Yang, Junzhi Gong, Haowei Zhang, Lei Zou, Lei Shi, and Xiaoming Li. 2018. HeavyGuardian: Separate and guard hot items in data streams. In Proceedings of International Conference on Knowledge Discovery & Data Mining. ACM, 2584–2593.
    [35]
    Tong Yang, Jie Jiang, Peng Liu, Qun Huang, Junzhi Gong, Yang Zhou, Rui Miao, Xiaoming Li, and Steve Uhlig. 2018. Elastic sketch: Adaptive and fast network-wide measurements. In Proceedings of the ACM Special Interest Group on Data Communication. ACM, 561–575.
    [36]
    Yinda Zhang, Jinyang Li, Yutian Lei, Tong Yang, Zhetao Li, Gong Zhang, and Bin Cui. 2020. On-off sketch: a fast and accurate sketch on persistence. Proceedings of the VLDB Endowment 14, 2 (2020), 128–140.

    Cited By

    View all
    • (2024)DISCO: A Dynamically Configurable Sketch Framework in Skewed Data Streams2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00365(4801-4814)Online publication date: 13-May-2024
    • (2023)SteadySketch: Finding Steady Flows in Data Streams2023 IEEE/ACM 31st International Symposium on Quality of Service (IWQoS)10.1109/IWQoS57198.2023.10188743(01-09)Online publication date: 19-Jun-2023
    • (2023)Variable-length Encoding Framework: A Generic Framework for Enhancing the Accuracy of Approximate Membership Queries2023 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM58522.2023.00015(61-70)Online publication date: 1-Dec-2023

    Index Terms

    1. DUET: A Generic Framework for Finding Special Quadratic Elements in Data Streams
          Index terms have been assigned to the content through auto-classification.

          Recommendations

          Comments

          Information & Contributors

          Information

          Published In

          cover image ACM Conferences
          WWW '22: Proceedings of the ACM Web Conference 2022
          April 2022
          3764 pages
          ISBN:9781450390965
          DOI:10.1145/3485447
          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Sponsors

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          Published: 25 April 2022

          Permissions

          Request permissions for this article.

          Check for updates

          Author Tags

          1. data stream mining
          2. data structure
          3. sketch

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

          Funding Sources

          Conference

          WWW '22
          Sponsor:
          WWW '22: The ACM Web Conference 2022
          April 25 - 29, 2022
          Virtual Event, Lyon, France

          Acceptance Rates

          Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

          Contributors

          Other Metrics

          Bibliometrics & Citations

          Bibliometrics

          Article Metrics

          • Downloads (Last 12 months)75
          • Downloads (Last 6 weeks)5
          Reflects downloads up to 27 Jul 2024

          Other Metrics

          Citations

          Cited By

          View all
          • (2024)DISCO: A Dynamically Configurable Sketch Framework in Skewed Data Streams2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00365(4801-4814)Online publication date: 13-May-2024
          • (2023)SteadySketch: Finding Steady Flows in Data Streams2023 IEEE/ACM 31st International Symposium on Quality of Service (IWQoS)10.1109/IWQoS57198.2023.10188743(01-09)Online publication date: 19-Jun-2023
          • (2023)Variable-length Encoding Framework: A Generic Framework for Enhancing the Accuracy of Approximate Membership Queries2023 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM58522.2023.00015(61-70)Online publication date: 1-Dec-2023

          View Options

          Get Access

          Login options

          View options

          PDF

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader

          HTML Format

          View this article in HTML Format.

          HTML Format

          Media

          Figures

          Other

          Tables

          Share

          Share

          Share this Publication link

          Share on social media