DOI: 10.1145/2723372.2749443
research-article

Persistent Data Sketching

Published: 27 May 2015 Publication History

Abstract

A persistent data structure, also known as a multiversion data structure in the database literature, is a data structure that preserves all its previous versions as it is updated over time. Every update (inserting, deleting, or changing a data record) to the data structure creates a new version, while all the versions are kept in the data structure so that any previous version can still be queried.
Persistent data structures aim to record all versions accurately, which results in a space requirement that is at least linear in the number of updates. In many of today's big data applications, in particular for high-speed streaming data, the volume and velocity of the data are so high that we cannot afford to store everything. Streaming algorithms, which use only sublinear space at a slight cost in accuracy, have therefore received a lot of attention in the research community.
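To make the version-preserving behaviour and its linear space cost concrete, here is a minimal Python sketch. It is not taken from the paper: it assumes a plain key-value store, and the names `PersistentMap`, `set`, `delete`, `get`, and `log_size` are hypothetical. Every update is appended to a per-key history, so any earlier version can be queried, and the stored log grows by one entry per update.

```python
import bisect

_DELETED = object()  # marker recorded when a key is deleted


class PersistentMap:
    """Illustrative persistent map (hypothetical, not the paper's structure).

    Every update creates a new version number; old versions stay queryable.
    """

    def __init__(self):
        self.version = 0
        # key -> (list of version numbers, list of values written at them)
        self._history = {}

    def set(self, key, value):
        """Record a new value for key and return the new version number."""
        self.version += 1
        versions, values = self._history.setdefault(key, ([], []))
        versions.append(self.version)
        values.append(value)
        return self.version

    def delete(self, key):
        """A deletion is also an update, so it creates a new version."""
        return self.set(key, _DELETED)

    def get(self, key, version=None):
        """Return the value of key as of `version` (default: latest)."""
        if version is None:
            version = self.version
        versions, values = self._history.get(key, ([], []))
        # Last write to this key at or before the requested version.
        i = bisect.bisect_right(versions, version)
        if i == 0:
            return None
        value = values[i - 1]
        return None if value is _DELETED else value

    def log_size(self):
        """Number of stored entries: exactly one per update."""
        return sum(len(versions) for versions, _ in self._history.values())
```

After n updates `log_size()` returns n, which is the linear space cost that motivates trading accuracy for space.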
All streaming algorithms work by maintaining a small data structure in memory, which is usually called a sketch, summary, or synopsis. The sketch is updated upon the arrival of every element in the stream and is thus ephemeral, meaning that it can only answer queries about the current status of the stream. In this paper, we aim at designing persistent sketches, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.
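As an illustration of what "answering queries at any prior time" means for a sketch, the following Python sketch makes a Count-Min sketch persistent in the most naive way: each counter keeps a timestamped log of its values. This is an assumed baseline for illustration only, not the construction proposed in the paper. The class name `NaivePersistentCountMin`, its parameters, and the choice of BLAKE2b as the hash are all hypothetical. Its space grows linearly with the number of updates, which is exactly the cost a persistent sketch is meant to avoid.

```python
import bisect
import hashlib


class NaivePersistentCountMin:
    """Naive baseline (hypothetical): Count-Min with a full log per counter."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.now = 0  # logical time = number of updates seen so far
        # For each counter: timestamps of changes and the value after each.
        self._times = [[[] for _ in range(width)] for _ in range(depth)]
        self._values = [[[] for _ in range(width)] for _ in range(depth)]

    def _column(self, row, item):
        # One salted hash per row; the salt makes rows behave differently.
        digest = hashlib.blake2b(
            repr(item).encode(),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        """Process one stream element and return its timestamp."""
        self.now += 1
        for row in range(self.depth):
            col = self._column(row, item)
            times, values = self._times[row][col], self._values[row][col]
            previous = values[-1] if values else 0
            times.append(self.now)
            values.append(previous + count)
        return self.now

    def estimate(self, item, t=None):
        """Estimate the frequency of item among the first t updates."""
        if t is None:
            t = self.now
        best = None
        for row in range(self.depth):
            col = self._column(row, item)
            # Value of this counter after its last change at or before t.
            i = bisect.bisect_right(self._times[row][col], t)
            value = self._values[row][col][i - 1] if i else 0
            best = value if best is None else min(best, value)
        return best
```

With non-negative counts, `estimate(item, t)` never underestimates the true frequency of `item` at time `t`, as with an ordinary Count-Min sketch. Collisions can only inflate the estimate.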


    Published In

    SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
    May 2015
    2110 pages
    ISBN:9781450327589
    DOI:10.1145/2723372

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. approximation
    2. persistence
    3. sketch

    Qualifiers

    • Research-article

    Funding Sources

    • HKRGC
    • Microsoft grant
    • Research Funds of Renmin University of China
    • 973 Program of China
    • Fundamental Research Funds for the Central Universities
    • National Key Basic Research Program (973 Program) of China

    Conference

    SIGMOD/PODS'15
    SIGMOD/PODS'15: International Conference on Management of Data
    May 31 - June 4, 2015
    Melbourne, Victoria, Australia

    Acceptance Rates

    SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%
