
Approximate Query Answering Using Data Warehouse Striping

Journal of Intelligent Information Systems


Abstract

This paper presents and evaluates a simple but very effective method for implementing large data warehouses on an arbitrary number of computers, achieving very high query execution performance and scalability. The data is distributed and processed across a potentially large number of autonomous computers using our technique, called data warehouse striping (DWS). The major problem with the DWS technique is that it would require a very expensive cluster of computers with fault-tolerant capabilities to prevent a fault in a single computer from stopping the whole system. In this paper, we propose a radically different approach to dealing with the unavailability of one or more computers in the cluster, allowing DWS to be used with a very large number of inexpensive computers. The proposed approach is based on approximate query answering techniques that make it possible to deliver an approximate answer to the user even when one or more computers in the cluster are unavailable. The evaluation presented in the paper shows, both analytically and experimentally, that the approximate results obtained this way have a very small error that can be considered negligible in most cases.
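
The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the round-robin placement, the function names (`stripe`, `approximate_sum`), and the simple scale-up estimator are all assumptions made for illustration. The sketch spreads the values of a fact column over several nodes, then estimates a SUM from the nodes that are still reachable.

```python
def stripe(rows, n_nodes):
    """Distribute rows over n_nodes round-robin (assumed placement policy)."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes


def approximate_sum(nodes, unavailable=frozenset()):
    """Estimate the SUM over all nodes using only the available ones.

    Assumption: every node holds a comparable share of the data, so the
    partial sum is scaled by (total nodes / available nodes).
    """
    available = [rows for i, rows in enumerate(nodes) if i not in unavailable]
    if not available:
        raise ValueError("no node is available to answer the query")
    partial_total = sum(sum(rows) for rows in available)
    return partial_total * len(nodes) / len(available)


# Hypothetical data: one numeric measure per fact row.
rows = list(range(1, 101))
nodes = stripe(rows, 10)

exact = approximate_sum(nodes)                       # all nodes up
estimate = approximate_sum(nodes, unavailable={0})   # node 0 is down
relative_error = abs(estimate - exact) / exact
```

With all ten nodes available the scaling factor is 1 and the result is the exact sum. With one node down, the estimate stays close to the exact value here because round-robin placement gives each node a similar slice of the data; how small the error is in practice depends on the data distribution and on the estimator actually used.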




About this article

Cite this article

Bernardino, J.R., Furtado, P.S. & Madeira, H.C. Approximate Query Answering Using Data Warehouse Striping. Journal of Intelligent Information Systems 19, 145–167 (2002). https://doi.org/10.1023/A:1016551309288


  • DOI: https://doi.org/10.1023/A:1016551309288