Cell trees: An adaptive synopsis structure for clustering multi-dimensional on-line data streams

Published: 01 November 2007

Abstract

To effectively trace the clusters of recently generated data elements in an on-line data stream, a sibling list and a cell tree are proposed in this paper. Initially, the multi-dimensional data space of a data stream is partitioned into mutually exclusive equal-sized grid-cells. Each grid-cell monitors the recent distribution statistics of data elements within its range. The old distribution statistics of each grid-cell are diminished by a predefined decay rate as time goes by, so that the effect of the obsolete information on the current result of clustering can be eliminated without maintaining any data element physically. Given a partitioning factor h, a dense grid-cell is partitioned into h equal-sized smaller grid-cells. Such partitioning is continued until a grid-cell becomes the smallest one, called a unit grid-cell. Conversely, a set of consecutive sparse grid-cells can be merged into a single grid-cell. A sibling list is a structure to manage the set of all grid-cells in a one-dimensional data space, and it acts as an index for locating a specific grid-cell. Upon creating a dense unit grid-cell on a one-dimensional data space, a new sibling list for another dimension is created as a child of the grid-cell. In such a way, a cell tree is created. By repeating this process, a multi-dimensional dense unit grid-cell is identified by a path of a cell tree. Furthermore, in order to confine the usage of memory space, the size of a unit grid-cell is adaptively minimized such that the result of clustering becomes as accurate as possible at all times. The proposed method is comparatively analyzed by a series of experiments to identify its various characteristics.
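
The one-dimensional part of the idea described above can be illustrated with a small sketch. This is not the paper's implementation: the class and parameter names (`GridCell`, `SiblingList`, `dense_count`, `unit_size`) are hypothetical, the density test uses an absolute decayed count rather than whatever criterion the paper defines, a split cell's count is assumed to be shared evenly among its h children, and merging of sparse cells and the per-dimension child lists of the cell tree are left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GridCell:
    lo: float            # inclusive lower bound of the cell's range
    hi: float            # exclusive upper bound of the cell's range
    count: float = 0.0   # decayed number of elements seen in the range
    last_t: int = 0      # time step at which `count` was last brought up to date

    def decay_to(self, t: int, decay: float) -> None:
        # Fade old statistics: every elapsed time step multiplies the count by `decay`.
        self.count *= decay ** (t - self.last_t)
        self.last_t = t


class SiblingList:
    """Ordered, gap-free list of grid-cells covering [lo, hi) in one dimension."""

    def __init__(self, lo: float, hi: float, n_cells: int, h: int,
                 unit_size: float, decay: float, dense_count: float) -> None:
        width = (hi - lo) / n_cells
        edges = [lo + i * width for i in range(n_cells)] + [hi]
        self.cells: List[GridCell] = [
            GridCell(edges[i], edges[i + 1]) for i in range(n_cells)
        ]
        self.h = h                      # partitioning factor
        self.unit_size = unit_size      # width of the smallest allowed cell
        self.decay = decay              # assumed per-element decay rate in (0, 1)
        self.dense_count = dense_count  # assumed density threshold on the decayed count
        self.t = 0                      # logical clock: one tick per inserted element

    def find(self, x: float) -> int:
        # Linear scan for clarity; a sorted index could use bisect instead.
        for i, cell in enumerate(self.cells):
            if cell.lo <= x < cell.hi:
                return i
        raise ValueError(f"{x} is outside the data space")

    def insert(self, x: float) -> None:
        self.t += 1
        i = self.find(x)
        cell = self.cells[i]
        cell.decay_to(self.t, self.decay)
        cell.count += 1.0
        # A dense cell that is still larger than a unit grid-cell is partitioned.
        if cell.count >= self.dense_count and (cell.hi - cell.lo) > self.unit_size:
            self._split(i)

    def _split(self, i: int) -> None:
        cell = self.cells[i]
        width = (cell.hi - cell.lo) / self.h
        edges = [cell.lo + k * width for k in range(self.h)] + [cell.hi]
        # Assumption: the parent's decayed count is divided evenly among the children.
        children = [
            GridCell(edges[k], edges[k + 1], cell.count / self.h, cell.last_t)
            for k in range(self.h)
        ]
        self.cells[i:i + 1] = children


if __name__ == "__main__":
    sl = SiblingList(0.0, 16.0, n_cells=4, h=4, unit_size=1.0,
                     decay=0.9, dense_count=3.0)
    for _ in range(10):
        sl.insert(5.5)
    for c in sl.cells:
        print(f"[{c.lo}, {c.hi}) count={c.count:.3f}")
```

In this sketch no data element is stored; only the per-cell decayed counts are kept, which mirrors the abstract's point that obsolete information fades without maintaining elements physically. Counts are decayed lazily, so a cell that has not been touched recently still holds a stale value until `decay_to` is called on it.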

Published In

Data & Knowledge Engineering, Volume 63, Issue 2
November, 2007
408 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Adaptive memory utilization
  2. Clustering
  3. Data mining
  4. Data streams
  5. Grid-based clustering
