
Persistent Data Sketching

Published: 27 May 2015

    Abstract

    A persistent data structure, also known as a multiversion data structure in the database literature, is a data structure that preserves all its previous versions as it is updated over time. Every update (inserting, deleting, or changing a data record) creates a new version, and all versions are kept in the data structure so that any previous version can still be queried.
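As a minimal illustration of this definition (not taken from the paper), the sketch below keeps a per-key history so that any earlier version can be queried. The class name `MultiversionMap` and its methods are hypothetical, chosen only for this example; note that its space grows linearly with the number of updates.

```python
import bisect

_DELETED = object()  # sentinel marking a deleted record


class MultiversionMap:
    """Toy multiversion key-value map (hypothetical name, for illustration)."""

    def __init__(self):
        self.version = 0    # version 0 is the empty map
        self._history = {}  # key -> (sorted version numbers, value at each)

    def put(self, key, value):
        """Insert or change a record; returns the new version number."""
        self.version += 1
        versions, values = self._history.setdefault(key, ([], []))
        versions.append(self.version)
        values.append(value)
        return self.version

    def delete(self, key):
        """Delete a record; this is also an update and creates a version."""
        return self.put(key, _DELETED)

    def get(self, key, version=None):
        """Return the value of `key` as of `version` (default: latest)."""
        if version is None:
            version = self.version
        versions, values = self._history.get(key, ([], []))
        # Find the last update to this key at or before the requested version.
        i = bisect.bisect_right(versions, version)
        if i == 0:
            return None  # key did not exist yet
        value = values[i - 1]
        return None if value is _DELETED else value


m = MultiversionMap()
v1 = m.put("a", 1)
v2 = m.put("a", 2)
v3 = m.delete("a")
print(m.get("a", v1), m.get("a", v2), m.get("a", v3))  # 1 2 None
```

Every call to `put` or `delete` appends one history entry, so nothing is ever overwritten and each past version stays queryable through a binary search.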
    Persistent data structures aim at recording all versions accurately, which results in a space requirement that is at least linear in the number of updates. In many of today's big data applications, in particular for high-speed streaming data, the volume and velocity of the data are so high that we cannot afford to store everything. Therefore, streaming algorithms, which use only sublinear space by sacrificing slightly on accuracy, have received a lot of attention in the research community.
    All streaming algorithms work by maintaining a small data structure in memory, which is usually called a sketch, summary, or synopsis. The sketch is updated upon the arrival of every element in the stream and is thus ephemeral, meaning that it can only answer queries about the current status of the stream. In this paper, we aim at designing persistent sketches, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.
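To make the contrast between an ephemeral and a persistent sketch concrete, here is a rough illustration that is not the paper's construction. It uses a Count-Min sketch as a familiar example of a sketch and makes it persistent in the most naive way, by logging every change to every counter. The class name, parameters, and hashing scheme are assumptions made for this example. The log grows linearly with the stream, which is exactly the cost a persistent sketch is meant to avoid.

```python
import bisect
import random


class NaivePersistentCountMin:
    """Count-Min sketch with a full per-counter history (naive baseline)."""

    def __init__(self, width=64, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.prime = (1 << 61) - 1
        # One (a, b) pair per row for hashing of the form (a*x + b) mod p.
        self.hashes = [(rng.randrange(1, self.prime), rng.randrange(self.prime))
                       for _ in range(depth)]
        # times[r][c] and values[r][c] record the history of counter (r, c);
        # every counter starts at value 0 at time 0.
        self.times = [[[0] for _ in range(width)] for _ in range(depth)]
        self.values = [[[0] for _ in range(width)] for _ in range(depth)]
        self.now = 0  # number of stream elements seen so far

    def _col(self, row, item):
        a, b = self.hashes[row]
        return ((a * hash(item) + b) % self.prime) % self.width

    def update(self, item, count=1):
        """Process one stream element (non-negative count assumed)."""
        self.now += 1
        for r in range(len(self.hashes)):
            c = self._col(r, item)
            self.times[r][c].append(self.now)
            self.values[r][c].append(self.values[r][c][-1] + count)

    def estimate(self, item, t=None):
        """Estimate the frequency of `item` among the first `t` elements."""
        if t is None:
            t = self.now
        if t < 0:
            raise ValueError("t must be non-negative")
        best = None
        for r in range(len(self.hashes)):
            c = self._col(r, item)
            # Value of counter (r, c) after the last change at or before t.
            i = bisect.bisect_right(self.times[r][c], t) - 1
            v = self.values[r][c][i]
            best = v if best is None else min(best, v)
        return best


sketch = NaivePersistentCountMin()
for x in [1, 2, 1, 3, 1]:
    sketch.update(x)
print(sketch.estimate(1, 0))  # 0, since nothing had arrived at time 0
print(sketch.estimate(1, 3), sketch.estimate(1))  # upper bounds on 2 and 3
```

Calling `estimate(item)` without `t` behaves like an ordinary ephemeral sketch; passing `t` answers the same query about an earlier point in the stream. As with any Count-Min sketch fed non-negative counts, the estimates never fall below the true frequencies but may exceed them because of hash collisions.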



      Published In

      SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
      May 2015
      2110 pages
      ISBN:9781450327589
      DOI:10.1145/2723372
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. approximation
      2. persistence
      3. sketch

      Qualifiers

      • Research-article

      Funding Sources

      • HKRGC
      • Microsoft grant
      • Research Funds of Renmin University of China
      • 973 Program of China
      • Fundamental Research Funds for the Central Universities
      • National Key Basic Research Program (973 Program) of China

      Conference

      SIGMOD/PODS'15
      SIGMOD/PODS'15: International Conference on Management of Data
      May 31 - June 4, 2015
      Melbourne, Victoria, Australia

      Acceptance Rates

      SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
      Overall Acceptance Rate 785 of 4,003 submissions, 20%
