Abstract
Uncertain data streams, where data are incomplete and imprecise, have been observed in many environments. Feeding such data streams to existing stream systems produces results of unknown quality, which is of paramount concern to monitoring applications. In this paper, we present CLARO, a system that supports stream processing for uncertain data naturally captured using continuous random variables. CLARO employs a unique data model that is flexible and allows efficient computation. Built on this model, we develop evaluation techniques for relational operators by exploring statistical theory and approximation. We also consider query planning for complex queries given an accuracy requirement. Evaluation results show that our techniques achieve high performance while satisfying accuracy requirements, and that they outperform state-of-the-art sampling methods.
Cite this article
Tran, T.T.L., Peng, L., Diao, Y. et al. CLARO: modeling and processing uncertain data streams. The VLDB Journal 21, 651–676 (2012). https://doi.org/10.1007/s00778-011-0261-7
DOI: https://doi.org/10.1007/s00778-011-0261-7