
A Framework for Exploiting Local Information to Enhance Density Estimation of Data Streams

Published: 25 August 2014

Abstract

The Probability Density Function (PDF) is the fundamental data model for a variety of stream mining algorithms. Existing works apply the standard nonparametric Kernel Density Estimator (KDE) to approximate the PDF of data streams. As a result, stream-based KDEs cannot accurately capture complex local density features. In this article, we propose the use of Local Regions (LRs) to model local density information in univariate data streams. In-depth theoretical analyses are presented to justify the effectiveness of the LR-based KDE. Based on these analyses, we develop the General Local rEgion AlgorithM (GLEAM) to enhance the estimation quality of structurally complex univariate distributions for existing stream-based KDEs. A set of algorithmic optimizations is designed to improve the query throughput of GLEAM and to achieve linear-order computation. Additionally, a comprehensive suite of experiments was conducted to test the effectiveness and efficiency of GLEAM.
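For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of a standard univariate Gaussian KDE with a single global bandwidth. It is not GLEAM and not the authors' code; the function names `gaussian_kde` and `rule_of_thumb_bandwidth`, and the choice of the normal-reference bandwidth rule, are assumptions made purely for illustration.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return f_hat(x) = 1/(n*h) * sum_i K((x - x_i)/h), with K the standard normal density.

    Illustrative only: one global bandwidth h is shared by every sample.
    """
    n = len(samples)
    if n == 0 or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in samples
        )

    return density


def rule_of_thumb_bandwidth(samples):
    """Normal-reference bandwidth h = 1.06 * sigma * n^(-1/5) (assumed choice)."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return 1.06 * math.sqrt(variance) * n ** (-0.2)


if __name__ == "__main__":
    # A tight cluster near 0 and a spread-out cluster near 10.
    data = [0.0, 0.1, -0.1, 10.0, 12.0, 8.0]
    h = rule_of_thumb_bandwidth(data)
    f_hat = gaussian_kde(data, h)
    print(f"bandwidth={h:.3f}, f_hat(0)={f_hat(0.0):.4f}, f_hat(10)={f_hat(10.0):.4f}")
```

Because the single bandwidth is driven by the overall spread of the data, the tight cluster near 0 is smoothed as heavily as the wide cluster near 10, which is the kind of lost local density detail the abstract describes.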



    Published In

    ACM Transactions on Knowledge Discovery from Data, Volume 9, Issue 1
    October 2014
    209 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/2663598

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 August 2014
    Accepted: 01 January 2014
    Revised: 01 June 2013
    Received: 01 June 2012
    Published in TKDD Volume 9, Issue 1


    Author Tags

    1. General Local rEgion AlgorithM (GLEAM)
    2. Local region information

    Qualifiers

    • Research-article
    • Research
    • Refereed
