research-article

FlashP: an analytical pipeline for real-time forecasting of time-series relational data

Published: 01 January 2021
    Abstract

    Interactive response time is important in analytical pipelines: it lets users explore a sufficient number of possibilities and make informed business decisions. We consider a forecasting pipeline over large volumes of high-dimensional time-series data. Real-time forecasting can be conducted in two steps. First, we specify the part of the data to focus on and the measure to be predicted by slicing, dicing, and aggregating the data. Second, a forecasting model is trained on the aggregated results to predict the trend of the specified measure. While a number of forecasting models are available, the first step is the performance bottleneck. A natural idea is to use sampling to obtain approximate aggregations in real time as the input for training the forecasting model. Our scalable real-time forecasting system FlashP (Flash Prediction) is built on this idea, and this paper resolves two major challenges: first, how approximate aggregations affect the fitting of forecasting models and the forecasting results; and second, accordingly, which sampling algorithms we should use to obtain these approximate aggregations and how large the samples should be. We introduce a new sampling scheme, called GSW sampling, and analyze error bounds for estimating aggregations using GSW samples. We show how to construct compact GSW samples in the presence of multiple measures to be analyzed. We conduct experiments to evaluate our solution and its alternatives on real data.
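
    The two-step pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the FlashP implementation and does not implement GSW sampling, which the abstract names but does not define. As stand-ins, it assumes uniform Bernoulli sampling with a Horvitz-Thompson style scale-up for the aggregation step and a least-squares linear trend for the forecasting step. All function names, the synthetic data, and the 10% sampling rate are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def exact_sums(rows):
    """Step 1, exact: SUM(value) grouped by time step over all rows."""
    sums = defaultdict(float)
    for t, value in rows:
        sums[t] += value
    return dict(sums)


def draw_sample(rows, rate, rng):
    """Keep each row independently with probability `rate` (0 < rate <= 1).

    Each kept row carries the weight 1/rate so that sums can be scaled up.
    This is uniform sampling, an assumed stand-in for the paper's GSW scheme.
    """
    weight = 1.0 / rate
    return [(t, value, weight) for t, value in rows if rng.random() < rate]


def estimate_sums(sample):
    """Step 1, approximate: weighted (Horvitz-Thompson style) SUM per time step."""
    sums = defaultdict(float)
    for t, value, weight in sample:
        sums[t] += value * weight
    return dict(sums)


def linear_forecast(series):
    """Step 2: fit a least-squares line to the series (length >= 2)
    and predict the next point. A stand-in for a real forecasting model."""
    n = len(series)
    mean_x = (n - 1) / 2.0
    mean_y = sum(series) / n
    var_x = sum((x - mean_x) ** 2 for x in range(n))
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    slope = cov_xy / var_x
    return mean_y + slope * (n - mean_x)


def make_rows(num_steps, rows_per_step, rng):
    """Hypothetical synthetic data: (time step, measure value) pairs."""
    return [(t, float(rng.randint(1, 10) + t))
            for t in range(num_steps) for _ in range(rows_per_step)]


rng = random.Random(0)
rows = make_rows(num_steps=30, rows_per_step=2000, rng=rng)

exact = exact_sums(rows)                                   # full scan
approx = estimate_sums(draw_sample(rows, rate=0.1, rng=rng))  # ~10% of rows

exact_series = [exact[t] for t in range(30)]
approx_series = [approx.get(t, 0.0) for t in range(30)]

# Compare the forecast trained on exact aggregates with the one
# trained on approximate aggregates.
print(linear_forecast(exact_series), linear_forecast(approx_series))
```

    The sketch only scans the sampled rows in the aggregation step, which is where the abstract locates the bottleneck. How the sampling error propagates into the fitted model, and which sampling scheme keeps it small, are the questions the paper itself addresses.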


    Published In

    Proceedings of the VLDB Endowment  Volume 14, Issue 5
    January 2021
    142 pages
    ISSN:2150-8097

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 January 2021
    Published in PVLDB Volume 14, Issue 5

    Qualifiers

    • Research-article

