Abstract
This paper presents and evaluates a simple but very effective method for implementing large data warehouses on an arbitrary number of computers, achieving very high query execution performance and scalability. The data is distributed across, and processed by, a potentially large number of autonomous computers using our technique, called data warehouse striping (DWS). The major problem with the DWS technique is that it would require a very expensive cluster of fault-tolerant computers to prevent a fault in a single computer from stopping the whole system. In this paper, we propose a radically different approach to dealing with the unavailability of one or more computers in the cluster, which allows DWS to be used with a very large number of inexpensive computers. The proposed approach is based on approximate query answering techniques that make it possible to deliver an approximate answer to the user even when one or more computers in the cluster are unavailable. The evaluation presented in the paper shows, both analytically and experimentally, that the approximate results obtained this way have a very small error that can be neglected in most cases.
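To make the idea concrete, the following is a minimal sketch of how an aggregate could be estimated when one computer is unavailable. It is an illustration under stated assumptions, not the method as specified in the paper: the abstract does not say how rows are distributed or how the estimate is computed, so the round-robin striping, the scale-up estimator, and the names `stripe` and `approximate_sum` are all hypothetical, and the data is synthetic.

```python
import random


def stripe(rows, n_nodes):
    """Distribute rows round-robin across n_nodes (assumed striping scheme)."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes


def approximate_sum(nodes, unavailable):
    """Estimate SUM over all nodes using only the nodes that are available."""
    available = [part for idx, part in enumerate(nodes) if idx not in unavailable]
    if not available:
        raise ValueError("no node is available")
    partial = sum(sum(part) for part in available)
    # Assumed estimator: scale the partial result by (all nodes / available nodes),
    # which treats the missing nodes as holding data similar to the available ones.
    return partial * len(nodes) / len(available)


# Synthetic fact values, used only for illustration.
random.seed(0)
sales = [random.uniform(1, 100) for _ in range(100_000)]
nodes = stripe(sales, n_nodes=10)

exact = sum(sales)
estimate = approximate_sum(nodes, unavailable={3})
relative_error = abs(estimate - exact) / exact
print(f"exact={exact:.1f} estimate={estimate:.1f} error={relative_error:.4%}")
```

Because round-robin striping gives every node a similar share of the rows, the scaled partial sum stays close to the exact sum in this synthetic setting. Real data with skew across nodes, or queries with selective predicates and group-by clauses, may behave differently.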
Cite this article
Bernardino, J.R., Furtado, P.S. & Madeira, H.C. Approximate Query Answering Using Data Warehouse Striping. Journal of Intelligent Information Systems 19, 145–167 (2002). https://doi.org/10.1023/A:1016551309288
DOI: https://doi.org/10.1023/A:1016551309288