research-article

FlashP: an analytical pipeline for real-time forecasting of time-series relational data

Authors:

Sheng Xu

Proceedings of the VLDB Endowment, Volume 14, Issue 5

Pages 721 - 729

https://doi.org/10.14778/3446095.3446096

Published: 01 January 2021

Abstract

Interactive response time is important in analytical pipelines: it lets users explore a sufficient number of possibilities and make informed business decisions. We consider a forecasting pipeline over large volumes of high-dimensional time series data. Real-time forecasting can be conducted in two steps. First, we specify the part of the data to focus on and the measure to be predicted by slicing, dicing, and aggregating the data. Second, a forecasting model is trained on the aggregated results to predict the trend of the specified measure. While a number of forecasting models are available, the first step is the performance bottleneck. A natural idea is to use sampling to obtain approximate aggregations in real time as the input for training the forecasting model. Our scalable real-time forecasting system FlashP (Flash Prediction) is built on this idea, and this paper resolves two major challenges: first, how approximate aggregations affect the fitting of forecasting models and the forecasting results; and second, accordingly, which sampling algorithms should be used to obtain these approximate aggregations and how large the samples should be. We introduce a new sampling scheme, called GSW sampling, and analyze error bounds for estimating aggregations using GSW samples. We show how to construct compact GSW samples in the presence of multiple measures to be analyzed. We conduct experiments to evaluate our solution and its alternatives on real data.
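
The first step described in the abstract (approximate aggregation from a sample) can be illustrated with a minimal sketch. The code below is not the paper's GSW sampling scheme, whose definition is not given on this page. As a stand-in, it keeps each row independently with probability proportional to its measure value (capped at 1) and rescales kept values with the standard Horvitz-Thompson estimator. The row layout (`t`, `region`, `value`), the threshold `delta`, and all function names are assumptions made for illustration, and measure values are assumed non-negative.

```python
import random
from collections import defaultdict


def weighted_sample(rows, delta, rng):
    """Keep each row independently with probability min(1, value / delta).

    Stand-in sampler (NOT the paper's GSW scheme). Assumes value >= 0.
    Returns a list of (row, inclusion_probability) pairs.
    """
    sample = []
    for row in rows:
        p = min(1.0, row["value"] / delta)
        if rng.random() < p:
            sample.append((row, p))
    return sample


def estimate_sums(sample, predicate):
    """Horvitz-Thompson estimate of SUM(value) per time step.

    Only sampled rows matching `predicate` (the slice/dice filter) count;
    each kept value is divided by its inclusion probability.
    """
    totals = defaultdict(float)
    for row, p in sample:
        if predicate(row):
            totals[row["t"]] += row["value"] / p
    return dict(totals)


def exact_sums(rows, predicate):
    """Exact SUM(value) per time step, computed by a full scan."""
    totals = defaultdict(float)
    for row in rows:
        if predicate(row):
            totals[row["t"]] += row["value"]
    return dict(totals)


def make_rows(n_steps, rows_per_step, rng):
    """Generate synthetic rows; the schema is hypothetical."""
    rows = []
    for t in range(n_steps):
        for _ in range(rows_per_step):
            rows.append({
                "t": t,
                "region": rng.choice(["east", "west"]),
                "value": rng.uniform(1.0, 10.0),
            })
    return rows


if __name__ == "__main__":
    rng = random.Random(0)
    rows = make_rows(n_steps=5, rows_per_step=2000, rng=rng)

    def in_east(row):
        return row["region"] == "east"

    sample = weighted_sample(rows, delta=50.0, rng=rng)
    approx = estimate_sums(sample, in_east)
    exact = exact_sums(rows, in_east)
    print(f"kept {len(sample)} of {len(rows)} rows")
    for t in sorted(exact):
        print(t, round(exact[t], 1), round(approx.get(t, 0.0), 1))
```

The per-time-step estimates would then be passed to a forecasting model in place of the exact aggregates. A larger `delta` gives a smaller sample and noisier estimates, which is the trade-off the abstract's two challenges refer to.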


Published In

Proceedings of the VLDB Endowment Volume 14, Issue 5

January 2021

142 pages

ISSN:2150-8097

Editors: Xin Luna Dong (Amazon), Felix Naumann (HPI, University of Potsdam)

Publisher

VLDB Endowment
