Cell trees: An adaptive synopsis structure for clustering multi-dimensional on-line data streams

Authors:

Won Suk Lee

Data & Knowledge Engineering, Volume 63, Issue 2

Pages 528 - 549

https://doi.org/10.1016/j.datak.2007.04.003

Published: 01 November 2007

Abstract

To effectively trace the clusters of recently generated data elements in an on-line data stream, a sibling list and a cell tree are proposed in this paper. Initially, the multi-dimensional data space of a data stream is partitioned into mutually exclusive equal-sized grid-cells. Each grid-cell monitors the recent distribution statistics of data elements within its range. The old distribution statistics of each grid-cell are diminished by a predefined decay rate as time goes by, so that the effect of the obsolete information on the current result of clustering can be eliminated without maintaining any data element physically. Given a partitioning factor h, a dense grid-cell is partitioned into h equal-size smaller grid-cells. Such partitioning is continued until a grid-cell becomes the smallest one called a unit grid-cell. Conversely, a set of consecutive sparse grid-cells can be merged into a single grid-cell. A sibling list is a structure to manage the set of all grid-cells in a one-dimensional data space and it acts as an index for locating a specific grid-cell. Upon creating a dense unit grid-cell on a one-dimensional data space, a new sibling list for another dimension is created as a child of the grid-cell. In such a way, a cell tree is created. By repeating this process, a multi-dimensional dense unit grid-cell is identified by a path of a cell tree. Furthermore, in order to confine the usage of memory space, the size of a unit grid-cell is adaptively minimized such that the result of clustering becomes as accurate as possible at all times. The proposed method is comparatively analyzed by a series of experiments to identify its various characteristics.
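
The one-dimensional part of the scheme described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not give the decay formula, the density threshold, or how a parent's statistics are divided on a split, so the exponential decay and the names `GridCell`, `SiblingList`, `split_at`, and `unit_width` are assumptions made for the example.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class GridCell:
    lo: float            # inclusive lower bound of the cell's range
    hi: float            # exclusive upper bound of the cell's range
    count: float = 0.0   # decayed number of elements seen in [lo, hi)
    last: int = 0        # time stamp of the most recent update


class SiblingList:
    """Ordered grid-cells covering one dimension (hypothetical sketch)."""

    def __init__(self, lo, hi, n_cells, h, decay, split_at, unit_width):
        self.lo, self.hi = lo, hi
        width = (hi - lo) / n_cells
        # Initial partition: mutually exclusive, equal-sized grid-cells.
        self.cells = [GridCell(lo + i * width, lo + (i + 1) * width)
                      for i in range(n_cells)]
        self.h = h                    # partitioning factor
        self.decay = decay            # assumed per-time-unit decay rate
        self.split_at = split_at      # assumed density threshold
        self.unit_width = unit_width  # width of a unit grid-cell

    def locate(self, x):
        # The ordered list acts as an index: binary search on lower bounds.
        if not self.lo <= x < self.hi:
            raise ValueError("value outside the data space")
        return bisect_right([c.lo for c in self.cells], x) - 1

    def insert(self, x, now):
        i = self.locate(x)
        cell = self.cells[i]
        # Diminish the old statistics, then count the new element.
        cell.count = cell.count * self.decay ** (now - cell.last) + 1.0
        cell.last = now
        # A dense cell that is larger than a unit grid-cell is partitioned.
        if cell.count >= self.split_at and cell.hi - cell.lo > self.unit_width:
            self._split(i)

    def _split(self, i):
        cell = self.cells[i]
        step = (cell.hi - cell.lo) / self.h
        # Assumption: the parent's count is shared evenly by its h children.
        self.cells[i:i + 1] = [
            GridCell(cell.lo + k * step, cell.lo + (k + 1) * step,
                     cell.count / self.h, cell.last)
            for k in range(self.h)
        ]
```

No data element is stored; each cell keeps only a decayed count and a time stamp. Merging consecutive sparse cells and attaching a child sibling list for the next dimension to a dense unit grid-cell (which is what turns the lists into a cell tree) are left out of this sketch.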

Index Terms

Cell trees: An adaptive synopsis structure for clustering multi-dimensional on-line data streams
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
      2. Unsupervised learning
        Cluster analysis
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published In

Data & Knowledge Engineering Volume 63, Issue 2

November, 2007

408 pages

ISSN:0169-023X

Copyright © 2007 Elsevier B.V.

Publisher

Elsevier Science Publishers B. V.

Netherlands

Qualifiers

Article

Cited By

Zhuang Y, Ma C, Xie J, Li Z, Yue Y (2020). A Fast Clustering Approach for Identifying Traffic Congestions. Spatial Data and Intelligence, pp. 3-13. doi:10.1007/978-3-030-69873-7_1. Online publication date: 8-May-2020
https://dl.acm.org/doi/10.1007/978-3-030-69873-7_1
Wattanakitrungroj N, Maneeroj S, Lursinsap C (2017). Versatile Hyper-Elliptic Clustering Approach for Streaming Data Based on One-Pass-Thrown-Away Learning. Journal of Classification 34:1, pp. 108-147. doi:10.1007/s00357-017-9222-1. Online publication date: 1-Apr-2017
https://dl.acm.org/doi/10.1007/s00357-017-9222-1
Xu C (2015). A novel approach for data stream clustering using artificial bee colony algorithm. International Journal of Wireless and Mobile Computing 8:1, pp. 59-65. doi:10.1504/IJWMC.2015.066755. Online publication date: 1-Jan-2015
https://dl.acm.org/doi/10.1504/IJWMC.2015.066755
Zhuang Y, Xie J, Ma C, Li Z, Yue Y (2015). A Fast Clustering Approach for Identifying Traffic Congestions. Proceedings of the 8th ACM SIGSPATIAL International Workshop on Computational Transportation Science, pp. 21-26. doi:10.1145/2834882.2834885. Online publication date: 3-Nov-2015
https://dl.acm.org/doi/10.1145/2834882.2834885
Nguyen H, Woon Y, Ng W (2015). A survey on data stream clustering and classification. Knowledge and Information Systems 45:3, pp. 535-569. doi:10.1007/s10115-014-0808-1. Online publication date: 1-Dec-2015
https://dl.acm.org/doi/10.1007/s10115-014-0808-1
Bhatnagar V, Kaur S, Chakravarthy S (2014). Clustering data streams using grid-based synopsis. Knowledge and Information Systems 41:1, pp. 127-152. doi:10.1007/s10115-013-0659-1. Online publication date: 1-Oct-2014
https://dl.acm.org/doi/10.1007/s10115-013-0659-1
Silva J, Faria E, Barros R, Hruschka E, Carvalho A, Gama J (2013). Data stream clustering. ACM Computing Surveys 46:1, pp. 1-31. doi:10.1145/2522968.2522981. Online publication date: 11-Jul-2013
https://dl.acm.org/doi/10.1145/2522968.2522981
Lee J, Park N, Lee W (2009). Efficiently tracing clusters over high-dimensional on-line data streams. Data & Knowledge Engineering 68:3, pp. 362-379. doi:10.1016/j.datak.2008.11.004. Online publication date: 1-Mar-2009
https://dl.acm.org/doi/10.1016/j.datak.2008.11.004
Lee J, Lee W, Shanahan J, Amer-Yahia S, Manolescu I, Zhang Y, Evans D, Kolcz A, Choi K, Chowdury A (2008). A coarse-grain grid-based subspace clustering method for online multi-dimensional data streams. Proceedings of the 17th ACM conference on Information and knowledge management, pp. 1521-1522. doi:10.1145/1458082.1458366. Online publication date: 26-Oct-2008
https://dl.acm.org/doi/10.1145/1458082.1458366
Park N, Lee W, Desai B (2008). Memory efficient subspace clustering for online data streams. Proceedings of the 2008 international symposium on Database engineering & applications, pp. 199-208. doi:10.1145/1451940.1451968. Online publication date: 10-Sep-2008
https://dl.acm.org/doi/10.1145/1451940.1451968