Abstract
Event log files are the most common source of information for characterizing events in large-scale systems. However, the large size of these files makes manual analysis of log messages difficult and error-prone. For this reason, recent research has focused on algorithms that analyse these log files automatically. In this paper we present a novel methodology for extracting templates that describe event formats from large datasets, presenting an intuitive and user-friendly output to system administrators. Our algorithm keeps up with rapidly changing environments by adapting the clusters to the incoming stream of events. To test our tool, we chose 5 log files that have different formats and that challenge different aspects of the clustering task. The experiments show that our tool outperforms all other algorithms in all tested scenarios, achieving an average precision and recall of 0.9, increasing the number of correct groups by a factor of 1.5, and decreasing the number of false positives and negatives by an average factor of 4.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gainaru, A., Cappello, F., Trausan-Matu, S., Kramer, B. (2011). Event Log Mining Tool for Large Scale HPC Systems. In: Jeannot, E., Namyst, R., Roman, J. (eds) Euro-Par 2011 Parallel Processing. Euro-Par 2011. Lecture Notes in Computer Science, vol 6852. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23400-2_6
DOI: https://doi.org/10.1007/978-3-642-23400-2_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23399-9
Online ISBN: 978-3-642-23400-2
eBook Packages: Computer Science, Computer Science (R0)