Abstract
Range queries and range joins in metric spaces have applications in many areas, including GIS, computational biology, and data integration, where metric uncertain data exist in different forms, resulting from circumstances such as equipment limitations, high-throughput sequencing technologies, and privacy preservation. We represent metric uncertain data using two models: an object-level model and a bi-level model. Two novel indexes, the uncertain pivot B\(^{+}\)-tree (UPB-tree) and the uncertain pivot B\(^{+}\)-forest (UPB-forest), are proposed to support probabilistic range queries and range joins for a wide range of uncertain data types and similarity metrics. Both index structures use a small set of effective pivots chosen based on a newly defined criterion and employ B\(^{+}\)-tree(s) as the underlying index. In addition, we present efficient metric probabilistic range query and metric probabilistic range join algorithms, which utilize validation and pruning techniques based on derived lower and upper bounds on probability. Extensive experiments with both real and synthetic data sets demonstrate that, compared against existing state-of-the-art indexes for metric uncertain data, the UPB-tree and the UPB-forest incur much lower construction costs, consume less storage space, and support more efficient metric probabilistic range queries and metric probabilistic range joins.
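The validation and pruning idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the UPB-tree or the UPB-forest: there is no B\(^{+}\)-tree, no pivot selection criterion, and the bounds are the generic triangle-inequality bounds rather than the ones derived in the paper. The sketch assumes an object-level model in which each uncertain object is a finite list of (instance, probability) pairs, uses the L1 distance as a stand-in metric, and all function and parameter names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Instance = Tuple[float, ...]
# Assumed object-level model: a list of (instance, probability) pairs summing to 1.
UncertainObject = List[Tuple[Instance, float]]


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 (Manhattan) distance, used here as an example metric."""
    return sum(abs(x - y) for x, y in zip(a, b))


def probability_bounds(q_to_pivots: Sequence[float],
                       inst_to_pivots: Sequence[Sequence[float]],
                       probs: Sequence[float],
                       r: float) -> Tuple[float, float]:
    """Bound Pr[d(q, o) <= r] using only precomputed pivot distances."""
    lower = 0.0      # mass of instances certainly inside the range
    excluded = 0.0   # mass of instances certainly outside the range
    for dists, pr in zip(inst_to_pivots, probs):
        # Triangle inequality: |d(q,p) - d(x,p)| <= d(q,x) <= d(q,p) + d(x,p).
        lb = max(abs(qp - xp) for qp, xp in zip(q_to_pivots, dists))
        ub = min(qp + xp for qp, xp in zip(q_to_pivots, dists))
        if ub <= r:
            lower += pr
        elif lb > r:
            excluded += pr
    return lower, 1.0 - excluded


def probabilistic_range_query(objects: Sequence[UncertainObject],
                              pivots: Sequence[Instance],
                              q: Instance, r: float, theta: float,
                              dist: Callable = l1) -> List[int]:
    """Return indexes of objects o with Pr[d(q, o) <= r] >= theta."""
    # In a real index these pivot distances would be computed once and stored.
    table = [[[dist(x, p) for p in pivots] for x, _ in obj] for obj in objects]
    q_to_pivots = [dist(q, p) for p in pivots]
    result = []
    for idx, (obj, dists) in enumerate(zip(objects, table)):
        probs = [pr for _, pr in obj]
        lo, hi = probability_bounds(q_to_pivots, dists, probs, r)
        if hi < theta:
            continue                 # pruned: cannot reach the threshold
        if lo >= theta:
            result.append(idx)       # validated: no exact distances needed
            continue
        # Undecided: fall back to exact distance computations.
        exact = sum(pr for x, pr in obj if dist(q, x) <= r)
        if exact >= theta:
            result.append(idx)
    return result
```

With a single pivot, an object whose instances all lie far from the query is discarded from the upper bound alone, and an object with enough probability mass provably inside the range is accepted from the lower bound alone; only the remaining objects require distance computations against the query.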
Notes
Available at http://www.sisap.org/Metric_Space_Library.html.
Available at http://www.dbs.informatik.uni-muenchen.de/~seidl.
Available at http://www.ncbi.nlm.nih.gov/genome.
Acknowledgements
This work was supported in part by the 973 Program of China No. 2015CB352502, the NSFC Grant Nos. 61522208, 61379033, and 61472348, the NSFC-Zhejiang Joint Fund Grant No. U1609217, and a grant from the Obel Family Foundation.
About this article
Cite this article
Chen, L., Gao, Y., Zhong, A. et al. Indexing metric uncertain data for range queries and range joins. The VLDB Journal 26, 585–610 (2017). https://doi.org/10.1007/s00778-017-0465-6
DOI: https://doi.org/10.1007/s00778-017-0465-6