A cost model for query processing in high dimensional data spaces

Published: 01 June 2000

Abstract

During the last decade, multimedia databases have become increasingly important in many application areas such as medicine, CAD, geography, and molecular biology. An important research topic in multimedia databases is similarity search in large data sets. Most current approaches that address similarity search use the feature approach, which transforms important properties of the stored objects into points of a high-dimensional space (feature vectors). Thus, similarity search is transformed into a neighborhood search in feature space. Multidimensional index structures are usually applied when managing feature vectors. Query processing can be improved substantially with optimization techniques such as blocksize optimization, data space quantization, and dimension reduction. To determine optimal parameters, an accurate estimate of index-based query processing performance is crucial. In this paper we develop a cost model for index structures for point databases such as the R*-tree and the X-tree. It provides accurate estimates of the number of data page accesses for range queries and nearest-neighbor queries under a Euclidean metric and a maximum metric. The problems specific to high-dimensional data spaces, called boundary effects, are considered. The concept of the fractal dimension is used to take the effects of correlated data into account.
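
To make the two query types and the two metrics named in the abstract concrete, here is a minimal Python sketch. It answers range and k-nearest-neighbor queries over feature vectors by a plain linear scan, with no index at all, so it illustrates only what the queries return and not how an R*-tree or X-tree evaluates them. All function names (`euclidean`, `maximum`, `range_query`, `knn_query`) are hypothetical and chosen for this illustration.

```python
import heapq
import math
from typing import Callable, List, Sequence

Point = Sequence[float]
Metric = Callable[[Point, Point], float]


def euclidean(p: Point, q: Point) -> float:
    """Euclidean (L2) distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def maximum(p: Point, q: Point) -> float:
    """Maximum (L-infinity) distance: the largest per-dimension difference."""
    return max(abs(a - b) for a, b in zip(p, q))


def range_query(data: List[Point], query: Point, radius: float,
                dist: Metric = euclidean) -> List[Point]:
    """Return every point whose distance to the query is at most radius."""
    return [p for p in data if dist(p, query) <= radius]


def knn_query(data: List[Point], query: Point, k: int,
              dist: Metric = euclidean) -> List[Point]:
    """Return the k points closest to the query, nearest first."""
    return heapq.nsmallest(k, data, key=lambda p: dist(p, query))


if __name__ == "__main__":
    # Illustrative 2-d feature vectors (made up for this sketch).
    vectors = [(0.0, 0.0), (1.0, 1.0), (3.0, 0.0), (0.5, 2.0)]
    q = (0.0, 0.0)
    print(range_query(vectors, q, 1.0, euclidean))
    print(range_query(vectors, q, 1.0, maximum))
    print(knn_query(vectors, q, 2, euclidean))
```

The same radius selects different result sets under the two metrics, because the maximum-metric query region is a hypercube that encloses the Euclidean sphere of the same radius. That difference in query shape is one reason a cost estimate has to be stated per metric.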

      Reviews

      Nagiza F. Samatova

      In this lengthy paper, the author expands on his previous work on the BBKK cost model for query processing in high dimensional spaces. He provides accurate estimates of the number of data page accesses for range queries and k-nearest neighbor queries. The paper assumes the Euclidean and the maximum metric, the R*-tree and the X-tree multidimensional index structures, and the depth-first search and the HS query processing algorithms. Various factors that influence the performance of index-based query processing are taken into account. The impact of the dimension of the data space is handled by providing separate estimates for low dimensional and high dimensional cases with a criterion for distinguishing between them. The problems specific to high-dimensional data spaces, including boundary effects and correlation effects, are considered. Experimental evaluations on real data demonstrate the practical applicability of the proposed cost estimates and their higher accuracy compared with related approaches. The paper's ideas are well organized, professionally presented, and incrementally expanded to relax various constraints. A comprehensive review of cost models, with emphasis on their strengths and limitations, as well as an excellent set of references are provided. However, the paper is excessively long: proofs of many estimates could have been placed in an appendix, making it more readable. It is primarily aimed at the research community but it should be useful to database practitioners as well. I strongly recommend it to anyone researching multidimensional index structures, or as supplemental reading in advanced database management and information storage and retrieval courses.
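
The boundary effects mentioned in the review can be illustrated with a small calculation. The sketch below is a simplified illustration and not the paper's formulas: it assumes a unit data space [0, 1]^d, uniformly distributed query points, a hypercube-shaped page region, and a maximum-metric range query. Under those assumptions a page is accessed exactly when the query point falls inside the page region enlarged by the query radius on every side. The function name `access_probability` and its parameters are hypothetical.

```python
from typing import Sequence


def access_probability(lower: Sequence[float], side: float, radius: float,
                       clip: bool = True) -> float:
    """Probability that a uniform query point in [0, 1]^d touches the page.

    lower  -- lower corner of the hypercube page region, one value per dimension
    side   -- edge length of the page region
    radius -- maximum-metric query radius
    clip   -- if True, cut the enlarged region at the data space boundary
    """
    prob = 1.0
    for lo in lower:
        # Enlarge the page extent by the query radius in this dimension.
        left, right = lo - radius, lo + side + radius
        if clip:
            # Query points cannot lie outside the data space [0, 1].
            left, right = max(left, 0.0), min(right, 1.0)
        prob *= max(right - left, 0.0)
    return prob


if __name__ == "__main__":
    d = 16
    corner_page = [0.0] * d  # page region touching the data space boundary
    print(access_probability(corner_page, 0.5, 0.25, clip=False))
    print(access_probability(corner_page, 0.5, 0.25, clip=True))
```

Without clipping, the estimate is simply `(side + 2 * radius) ** d`. With clipping, each dimension contributes only the part of the enlarged extent that lies inside the data space, and the gap between the two values widens as the number of dimensions grows. This is the kind of discrepancy that a high-dimensional cost estimate has to correct for.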

      Published In

      ACM Transactions on Database Systems  Volume 25, Issue 2
      June 2000
      140 pages
      ISSN:0362-5915
      EISSN:1557-4644
      DOI:10.1145/357775
      Editor: Won Kim

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. cost model
      2. multidimensional index

      Cited By

      • (2024) DET-LSH: A Locality-Sensitive Hashing Scheme with Dynamic Encoding Tree for Approximate Nearest Neighbor Search. Proceedings of the VLDB Endowment 17:9 (2241-2254). DOI: 10.14778/3665844.3665854. Online publication date: 1-May-2024
      • (2023) Towards Efficient Index Construction and Approximate Nearest Neighbor Search in High-Dimensional Spaces. Proceedings of the VLDB Endowment 16:8 (1979-1991). DOI: 10.14778/3594512.3594527. Online publication date: 1-Apr-2023
      • (2023) Precise Quantitative Analysis of Binarized Neural Networks: A BDD-based Approach. ACM Transactions on Software Engineering and Methodology 32:3 (1-51). DOI: 10.1145/3563212. Online publication date: 27-Apr-2023
      • (2023) Mobile Localization Techniques for Wireless Sensor Networks: Survey and Recommendations. ACM Transactions on Sensor Networks 19:2 (1-39). DOI: 10.1145/3561512. Online publication date: 5-Apr-2023
      • (2023) A Query Optimizer for Range Queries over Multi-Attribute Trajectories. ACM Transactions on Intelligent Systems and Technology 14:1 (1-28). DOI: 10.1145/3555811. Online publication date: 27-Jan-2023
      • (2022) DRL based Joint Affective Services Computing and Resource Allocation in ISTN. ACM Transactions on Multimedia Computing, Communications, and Applications 18:3s (1-19). DOI: 10.1145/3561821. Online publication date: 31-Oct-2022
      • (2022) An efficient LSH indexing on discriminative short codes for high-dimensional nearest neighbors. Multimedia Tools and Applications 78:17 (24407-24429). DOI: 10.1007/s11042-018-6987-0. Online publication date: 10-Mar-2022
      • (2021) CPRQ: Cost Prediction for Range Queries in Moving Object Databases. ISPRS International Journal of Geo-Information 10:7 (468). DOI: 10.3390/ijgi10070468. Online publication date: 8-Jul-2021
      • (2019) In Search of Indoor Dense Regions: An Approach Using Indoor Positioning Data. 2019 IEEE 35th International Conference on Data Engineering (ICDE) (2127-2128). DOI: 10.1109/ICDE.2019.00258. Online publication date: Apr-2019
      • (2019) Answering why-not questions on KNN queries. Frontiers of Computer Science: Selected Publications from Chinese Universities 13:5 (1062-1071). DOI: 10.1007/s11704-018-7074-4. Online publication date: 1-Oct-2019
