research-article

A Framework for Exploiting Local Information to Enhance Density Estimation of Data Streams

Authors:

Arnold P. Boedihardjo,

Bingsheng Wang

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 9, Issue 1

Article No.: 2, Pages 1 - 38

https://doi.org/10.1145/2629618

Published: 25 August 2014

Abstract

The Probability Density Function (PDF) is the fundamental data model for a variety of stream mining algorithms. Existing works apply the standard nonparametric Kernel Density Estimator (KDE) to approximate the PDF of data streams. As a result, stream-based KDEs cannot accurately capture complex local density features. In this article, we propose the use of Local Regions (LRs) to model local density information in univariate data streams. In-depth theoretical analyses are presented to justify the effectiveness of the LR-based KDE. Based on these analyses, we develop the General Local rEgion AlgorithM (GLEAM) to enhance the estimation quality of structurally complex univariate distributions for existing stream-based KDEs. A set of algorithmic optimizations is designed to improve the query throughput of GLEAM and to achieve linear-order computation. Additionally, a comprehensive suite of experiments was conducted to test the effectiveness and efficiency of GLEAM.
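To make the abstract's motivation concrete, the following is a minimal sketch of a standard fixed-bandwidth Gaussian KDE on a univariate sample. It is not GLEAM and not the LR-based KDE, whose details are not given on this page. The two-component test mixture, the normal-reference bandwidth rule, and every function name here are assumptions chosen purely for illustration.

```python
import math
import random


def normal_reference_bandwidth(samples):
    """Global rule-of-thumb bandwidth: 1.06 * sigma * n^(-1/5) (illustrative choice)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return 1.06 * math.sqrt(var) * n ** (-0.2)


def gaussian_kde(samples, h):
    """Return f_hat(x) = 1/(n*h) * sum_i K((x - x_i)/h) with a Gaussian kernel K."""
    n = len(samples)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def f_hat(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in samples)

    return f_hat


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def true_pdf(x):
    """Hypothetical test density: a broad mode at 0 and a narrow mode at 5."""
    return 0.5 * normal_pdf(x, 0.0, 1.0) + 0.5 * normal_pdf(x, 5.0, 0.1)


# Draw a fixed-seed sample from the hypothetical mixture.
rng = random.Random(0)
samples = [
    rng.gauss(0.0, 1.0) if rng.random() < 0.5 else rng.gauss(5.0, 0.1)
    for _ in range(1000)
]

h = normal_reference_bandwidth(samples)
f_hat = gaussian_kde(samples, h)

print(f"global bandwidth h = {h:.3f}")
print(f"estimate at narrow mode x=5: {f_hat(5.0):.3f}")
print(f"true density at x=5:         {true_pdf(5.0):.3f}")
```

In this sketch a single global bandwidth is driven by the overall spread of the sample, so the narrow mode is smoothed well below its true height. This is one simple instance of the local-feature problem the abstract describes; how LRs address it is covered in the article itself.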

Cited By

Lara-Benítez P, Carranza-García M, García-Gutiérrez J, Riquelme J. (2020). Asynchronous dual-pipeline deep learning framework for online data stream classification. Integrated Computer-Aided Engineering 27:2 (101-119). Online publication date: 27-Feb-2020.
https://doi.org/10.3233/ICA-200617

Index Terms

A Framework for Exploiting Local Information to Enhance Density Estimation of Data Streams
1. Information systems
  1. Information systems applications

Published In

ACM Transactions on Knowledge Discovery from Data Volume 9, Issue 1

October 2014

209 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/2663598

Editor:
Philip S. Yu
University of Illinois at Chicago, USA

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 August 2014

Accepted: 01 January 2014

Revised: 01 June 2013

Received: 01 June 2012

Published in TKDD Volume 9, Issue 1

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Article Metrics

Total Citations: 1
Total Downloads: 261
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 0

Reflects downloads up to 04 Feb 2025
