Abstract
In many real-world applications, such as business planning and sensor data monitoring, an important yet challenging task is to rank objects (e.g., products, documents, or spatial objects) by their ranking scores and efficiently return the objects with the highest scores. In practice, because data sources are unreliable, many real-world objects contain noise and are thus imprecise and uncertain. In this paper, we study the probabilistic top-k dominating (PTD) query over such large-scale uncertain data in a distributed environment. The query retrieves, from uncertain databases distributed across multiple servers, the k uncertain objects that have the largest ranking scores with high confidence. To tackle the distributed PTD problem efficiently, we propose a MapReduce framework for processing PTD queries over distributed uncertain databases. Within this framework, we design effective pruning strategies to filter out false alarms in the distributed setting, propose cost-model-based index distribution mechanisms over servers, and develop efficient distributed PTD query processing algorithms. Extensive experiments on both real and synthetic data sets, under various experimental settings, demonstrate the efficiency and effectiveness of the proposed distributed PTD approaches.
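To make the query concrete, the following is a minimal single-machine sketch, not the MapReduce framework or pruning strategies proposed in the paper. It assumes (these are illustrative assumptions, not definitions taken from the abstract) that each uncertain object is a list of discrete instances with existence probabilities, that smaller attribute values are better, that objects are independent, and that an object's score is the expected number of other objects it dominates. All function names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
# An uncertain object: a list of (instance, probability) pairs whose
# probabilities sum to 1 (assumed discrete-instance model).
UncertainObject = List[Tuple[Point, float]]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b: no worse in every dimension, strictly better in one.

    Assumption: smaller values are better.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def dominance_probability(u: UncertainObject, v: UncertainObject) -> float:
    """Probability that u dominates v, assuming u and v are independent."""
    return sum(pa * pb for (a, pa), (b, pb) in product(u, v) if dominates(a, b))


def expected_score(name: str, db: Dict[str, UncertainObject]) -> float:
    """Expected number of other objects dominated by db[name]."""
    return sum(
        dominance_probability(db[name], other)
        for key, other in db.items()
        if key != name
    )


def top_k_dominating(db: Dict[str, UncertainObject], k: int) -> List[Tuple[str, float]]:
    """Brute-force top-k by expected dominating score (ties broken by name)."""
    scored = [(expected_score(name, db), name) for name in db]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(name, score) for score, name in scored[:k]]


# Toy database with three 2-D uncertain objects (made-up values).
db = {
    "A": [((1.0, 1.0), 0.5), ((2.0, 2.0), 0.5)],
    "B": [((3.0, 3.0), 1.0)],
    "C": [((2.0, 1.0), 0.5), ((4.0, 4.0), 0.5)],
}
print(top_k_dominating(db, 2))  # [('A', 1.75), ('C', 0.75)]
```

This brute-force version compares every pair of instances, so its cost grows quadratically with the number of objects. That cost is what motivates pruning and distributing the work across servers in a setting like the one the abstract describes.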
Acknowledgements
Funding for this work was provided by NSF OAC No. 1739491, NSF CCF No. 2217104, and Lian Startup No. 220981, Kent State University.
About this article
Cite this article
Rai, N., Lian, X. Distributed probabilistic top-k dominating queries over uncertain databases. Knowl Inf Syst 65, 4939–4965 (2023). https://doi.org/10.1007/s10115-023-01917-3
DOI: https://doi.org/10.1007/s10115-023-01917-3