Conditional heavy hitters: detecting interesting correlations in data streams

Published: 01 June 2015

Abstract

The notion of heavy hitters (items that make up a large fraction of the population) has been successfully used in a variety of applications, including sensor and RFID monitoring, network data analysis, and event mining. Yet this notion often fails to capture the semantics we desire when we observe data in the form of correlated pairs. Here, we are interested in items that are conditionally frequent, that is, items that are frequent within the context of their parent item. In this work, we introduce and formalize the notion of conditional heavy hitters to identify such items, with applications in network monitoring and Markov chain modeling. We explore the relationship between conditional heavy hitters and other related notions in the literature, and show analytically and experimentally the usefulness of our approach. We introduce several algorithm variations that allow us to efficiently find conditional heavy hitters for input data with very different characteristics, and provide analytical results for their performance. Finally, we perform experimental evaluations with several synthetic and real datasets to demonstrate the efficacy of our methods and to study the behavior of the proposed algorithms for different types of data.
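
To make the notion of a conditionally frequent item concrete, the following minimal sketch computes exact conditional frequencies over a small stream of (parent, child) pairs. It is an illustration only, not one of the algorithms proposed in the article: it keeps a counter for every pair, which a streaming algorithm would avoid. The function name `conditional_heavy_hitters`, the threshold `phi`, and the `min_parent_count` filter are assumptions introduced here.

```python
from collections import Counter


def conditional_heavy_hitters(pairs, phi, min_parent_count=1):
    """Return {(parent, child): conditional frequency} for pairs whose
    child accounts for at least a fraction `phi` of its parent's occurrences.

    Exact counting over the whole input (hypothetical helper, not the
    article's method); `min_parent_count` is an assumed filter that skips
    parents seen too rarely to be meaningful.
    """
    parent_counts = Counter()
    pair_counts = Counter()
    for parent, child in pairs:
        parent_counts[parent] += 1
        pair_counts[(parent, child)] += 1

    result = {}
    for (parent, child), count in pair_counts.items():
        total = parent_counts[parent]
        if total >= min_parent_count and count / total >= phi:
            result[(parent, child)] = count / total
    return result


# Toy stream: parent "a" dominates the stream but splits evenly between
# two children; parent "b" is rare but almost always followed by "z".
stream = [("a", "x")] * 50 + [("a", "y")] * 50 + [("b", "z")] * 9 + [("b", "w")]

chh = conditional_heavy_hitters(stream, phi=0.8)
global_share = {pair: n / len(stream) for pair, n in Counter(stream).items()}
```

In this toy stream the pair `("b", "z")` makes up only 9 of 110 observations, so a plain heavy-hitter query with a high threshold would miss it, yet it accounts for 9 of the 10 occurrences of its parent `"b"`. The pairs under `"a"` are globally frequent but not conditionally frequent at `phi=0.8`.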

Published In

The VLDB Journal — The International Journal on Very Large Data Bases  Volume 24, Issue 3
June 2015
145 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Heavy hitters
  2. Online algorithms
  3. Streaming data

Qualifiers

  • Article
