Evaluating top-k queries over incomplete data streams

Published: 02 November 2009

Abstract

We study the problem of continuously monitoring top-k queries over multiple non-synchronized streams. Under the sliding-window model, the general top-k monitoring problem has been well studied in recent years. Most approaches, however, assume synchronized streams in which all attributes of an object become known to the query processing engine simultaneously. In many streaming scenarios, though, different attributes of an item are reported in separate non-synchronized streams, which prevents exact score calculation. We show how the traditional notion of object dominance changes in this case so that the k-dominance set still includes all and only those objects that have a chance of being among the top-k results during their lifetime. Based on this, we propose an exact algorithm that generates multiple instances of the same object in a way that enables efficient object pruning. We show that, even with object pruning, the storage necessary for exact evaluation of top-k queries is linear in the size of the sliding window. Because data should reside in main memory to provide fast online answers and cope with high stream rates, storing all of this data may not be possible with limited resources. We therefore present an approximate algorithm that leverages correlation statistics of pairs of streams to evict more objects while maintaining accuracy. We evaluate the efficiency of the proposed algorithms with extensive experiments.
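
To make the notion of dominance-based pruning concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes each object in the window carries an expiry time and a score interval [lo, hi] that bounds its final score while some attributes are still unreported; the names Obj, dominates, and k_dominance_set are hypothetical. Under these assumptions, an object can be discarded once at least k other objects outlive it and have a worst-case score strictly above its best-case score, because it can then never enter the top-k for the rest of its lifetime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    oid: str       # object identifier (hypothetical field names throughout)
    expires: int   # time at which the object leaves the sliding window
    lo: float      # lower bound on the final score (missing attributes at their minimum)
    hi: float      # upper bound on the final score (missing attributes at their maximum)


def dominates(a: Obj, b: Obj) -> bool:
    """Conservative dominance test under the stated assumptions.

    a dominates b if a stays in the window at least as long as b and
    a's worst-case score is strictly greater than b's best-case score.
    The strict comparison keeps ties, so two identical objects never
    prune each other.
    """
    return a.oid != b.oid and a.expires >= b.expires and a.lo > b.hi


def k_dominance_set(window: list[Obj], k: int) -> list[Obj]:
    """Keep only objects dominated by fewer than k other objects."""
    return [
        o for o in window
        if sum(dominates(p, o) for p in window) < k
    ]


window = [
    Obj("a", expires=10, lo=5.0, hi=5.0),  # fully known score
    Obj("b", expires=8, lo=3.0, hi=4.0),   # partially known score
    Obj("c", expires=12, lo=1.0, hi=6.0),  # wide interval, long-lived
    Obj("d", expires=7, lo=2.0, hi=2.5),
]

kept_ids = [o.oid for o in k_dominance_set(window, k=1)]
```

With k=1, objects b and d are pruned because a outlives both and is guaranteed to score higher, while c is kept because its upper bound may still exceed every other score. This quadratic scan is only illustrative; a streaming implementation would maintain dominance counters incrementally as objects arrive and expire.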

      Published In
      CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management
      November 2009
      2162 pages
      ISBN: 9781605585123
      DOI: 10.1145/1645953

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. dominance set
      2. incomplete streams
      3. skyline
      4. top-k queries

      Qualifiers

      • Research-article

      Conference

      CIKM '09

      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
