Abstract
Computer systems generate large volumes of data that are very expensive, or even impossible, to store in full, in terms of both space and time. At the same time, many applications need to keep a historical view of such data in order to provide aggregated historical information, perform data mining tasks, or detect anomalous behavior in computer systems. One solution is to treat the data as streams and process them on the fly to build historical summaries. Many data summarization techniques have already been developed, such as sampling, clustering, and histograms, and some of them have been extended to apply directly to data streams. This chapter presents a new approach to building such historical summaries of data streams. It is based on a combination of two existing algorithms: StreamSamp and CluStream. The combination takes advantage of the strengths of each algorithm while avoiding their drawbacks. Experiments on both real and synthetic data are presented. They show that the new approach gives better results than either of the two algorithms used alone.
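To make the idea of summarizing a stream on the fly more concrete, the following is a minimal sketch of classic reservoir sampling (in the style of Vitter's Algorithm R, which appears in the reference list). It is not the hybrid approach of this chapter, nor StreamSamp or CluStream themselves; it only illustrates how a fixed-size uniform sample can be maintained in one pass over a stream. The function name `reservoir_sample` and its parameters are assumptions chosen for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep a uniform random sample of at most k items from a stream.

    Illustrative sketch only: the stream is read once, and memory use
    is bounded by k regardless of how long the stream is.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot uniformly in [0, i]; the new item replaces a
            # stored one only if the slot falls inside the reservoir,
            # which happens with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Hypothetical usage: summarize a stream of 100,000 readings with 10 items.
    sample = reservoir_sample(range(100_000), k=10, rng=random.Random(42))
    print(sample)
```

A plain reservoir like this treats all past items equally. Approaches that favor recent data over older data need an additional mechanism on top of it, which this sketch does not attempt.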
References
Abadi, D.J., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., Stonebraker, M., Tatbul, N., Zdonik, S.: Aurora: a new model and architecture for data stream management. The VLDB Journal 12(2), 120–139 (2003), http://dx.doi.org/10.1007/s00778-003-0095-z
Aggarwal, C. (ed.): Data Streams – Models and Algorithms. Springer, Heidelberg (2007)
Aggarwal, C.C.: On biased reservoir sampling in the presence of stream evolution. In: VLDB 2006: Proceedings of the 32nd international conference on Very large data bases, VLDB Endowment, pp. 607–618 (2006)
Aggarwal, C.C., Han, J., Wang, J., Yu, P.S.: A Framework for Clustering Evolving Data Streams. In: VLDB, pp. 81–92 (2003)
Al-Kateb, M., Lee, B.S., Wang, X.S.: Adaptive-Size Reservoir Sampling over Data Streams. In: SSDBM, p. 22 (2007)
Arasu, A., Babcock, B., Babu, S., Datar, M., Ito, K., Nishizawa, I., Rosenstein, J., Widom, J.: STREAM: the stanford stream data manager (demonstration description). In: SIGMOD 2003: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, p. 665. ACM, New York (2003), http://doi.acm.org/10.1145/872757.872854
Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: PODS 2002: Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 1–16. ACM, New York (2002), http://doi.acm.org/10.1145/543613.543615
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970), http://doi.acm.org/10.1145/362686.362692
Csernel, B.: Résumé généraliste de flux de données. Ph.D. thesis, Ecole Nationale Supérieure des Télécommunications (February 2008)
Csernel, B., Clérot, F., Hébrail, G.: StreamSamp: DataStream Clustering Over Tilted Windows Through Sampling. In: ECML PKDD 2006 Workshop on Knowledge Discovery from Data Streams (2006)
Flajolet, P., Martin, G.N.: Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31(2), 182–209 (1985), http://dx.doi.org/10.1016/0022-0000(85)90041-8
Gemulla, R., Lehner, W.: Sampling time-based sliding windows in bounded space. In: SIGMOD Conference, pp. 379–392 (2008)
Golab, L., Özsu, M.T.: Issues in data stream management. SIGMOD Rec. 32(2), 5–14 (2003), http://doi.acm.org/10.1145/776985.776986
Guha, S., Harb, B.: Wavelet synopsis for data streams: minimizing non-euclidean error. In: KDD 2005: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pp. 88–97. ACM, New York (2005), http://doi.acm.org/10.1145/1081870.1081884
Guha, S., Koudas, N., Shim, K.: Data-streams and histograms. In: STOC 2001: Proceedings of the thirty-third annual ACM symposium on Theory of computing, pp. 471–475. ACM, New York (2001), http://doi.acm.org/10.1145/380752.380841
Ioannidis, Y.E., Poosala, V.: Histogram-Based Approximation of Set-Valued Query-Answers. In: VLDB, pp. 174–185 (1999)
Jagadish, H.V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K.C., Suel, T.: Optimal Histograms with Quality Guarantees. In: VLDB, pp. 275–286 (1998)
Ma, L., Nutt, W., Taylor, H.: Condensative Stream Query Language for Data Streams. In: ADC, pp. 113–122 (2007)
Muthukrishnan, S., Strauss, M., Zheng, X.: Workload-Optimal Histograms on Streams. In: Brodal, G.S., Leonardi, S. (eds.) ESA 2005. LNCS, vol. 3669, pp. 734–745. Springer, Heidelberg (2005)
Park, B.-H., Ostrouchov, G., Samatova, N.F., Geist, A.: Reservoir-Based Random Sampling with Replacement from Data Stream. In: SIAM SDM International Conference on Data Mining (2004)
Puttagunta, V., Kalpakis, K.: Adaptive Clusters and Histograms over Data Streams. In: IKE International Conference on Information and Knowledge Engineering, pp. 98–104 (2005)
Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985), http://doi.acm.org/10.1145/3147.3165
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. SIGMOD Rec. 25(2), 103–114 (1996), http://doi.acm.org/10.1145/235968.233324
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Gabsi, N., Clérot, F., Hébrail, G. (2010). An Hybrid Data Stream Summarizing Approach by Sampling and Clustering. In: Guillet, F., Ritschard, G., Zighed, D.A., Briand, H. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 292. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00580-0_11
DOI: https://doi.org/10.1007/978-3-642-00580-0_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00579-4
Online ISBN: 978-3-642-00580-0
eBook Packages: Engineering, Engineering (R0)