Abstract
Incomplete data has been a longstanding issue in database community, and yet the subject is poorly handled by both theory and practice. In this paper, we propose to directly estimate the aggregate query result on incomplete data, rather than imputing the missing values. An interval estimation, composed of the upper and lower bound of aggregate query results among all possible interpretation of missing values, are presented to the end-users. The ground-truth aggregate result is guaranteed to be among the interval. Experimental results are consistent with the theoretical results, and suggest that the estimation is invaluable to better assess the results of aggregate queries on incomplete data.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
- 2.
TPC-H: http://www.tpc.org/tpch.
References
Osborne, J.W.: Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data. Sage, Thousand Oaks (2012)
Rahm, E., Do, H.H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)
Ebaid, A., Elmagarmid, A.K., Ilyas, I.F., Ouzzani, M., Quiané-Ruiz, J.-A., Tang, N., Yin, S.: NADEEF: a generalized data cleaning system. PVLDB 6(12), 1218–1221 (2013)
Deng, T., Fan, W., Geerts, F.: Capturing missing tuples and missing values. ACM Trans. Database Syst. 41(2), 10:1–10:47 (2016)
Guagliardo, P., Libkin, L.: Correctness of SQL queries on databases with nulls. SIGMOD Rec. 46(3), 5–16 (2017)
Fahandar, M.A., Hüllermeier, E., Couso, I.: Statistical inference for incomplete ranking data: the case of rank-dependent coarsening. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, pp. 1078–1087 (2017)
Sarabia, J.M., Shahtahmassebi, G.: Bayesian estimation of incomplete data using conditionally specified priors. Commun. Stat. Simul. Comput. 46(5), 3419–3435 (2017)
Lipski Jr., W.: On semantic issues connected with incomplete information databases. ACM Trans. Database Syst. 4(3), 262–296 (1979)
Reiter, R.: On closed world data bases. In: Logic and Data Bases, Symposium on Logic and Data Bases, Centre d’études et de recherches de Toulouse, pp. 55–76 (1977)
Codd, E.F.: Extending the database relational model to capture more meaning. ACM Trans. Database Syst. 4(4), 397–434 (1979)
Mayfield, C., Neville, J., Prabhakar, S.: ERACER: a database approach for statistical inference and data cleaning. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6–10, 2010, pp. 75–86 (2010)
Rubin, D.B., Little, R.J.A.: Statistical Analysis with Missing Data. Wiley, Hoboken (2014)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Zhang, A., Wang, J., Li, J., Gao, H. (2018). Aggregate Query Processing on Incomplete Data. In: Cai, Y., Ishikawa, Y., Xu, J. (eds) Web and Big Data. APWeb-WAIM 2018. Lecture Notes in Computer Science(), vol 10987. Springer, Cham. https://doi.org/10.1007/978-3-319-96890-2_24
Download citation
DOI: https://doi.org/10.1007/978-3-319-96890-2_24
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96889-6
Online ISBN: 978-3-319-96890-2
eBook Packages: Computer ScienceComputer Science (R0)