PreFix: Switch Failure Prediction in Datacenter Networks

Published: 03 April 2018


In modern datacenter networks (DCNs), failures of network devices are the norm rather than the exception, and many research efforts have focused on dealing with failures after they happen. In this paper, we take a different approach: predicting failures so that operators can intervene and "fix" potential failures before they happen. Specifically, our proposed system, named PreFix, aims to determine at runtime whether a switch failure will happen in the near future. The prediction is based on measurements of the current switch system status and on historical switch hardware failure cases that have been carefully labelled by network operators. Our key observation is that failures of the same switch model share common syslog patterns before they occur, and that machine learning methods can extract these patterns to predict switch failures. Our novel set of features for machine learning (message template sequence, frequency, seasonality, and surge) efficiently addresses the challenges of noise, sample imbalance, and computation overhead. We evaluated PreFix on a data set collected over a 2-year period from 9397 switches (3 different switch models) deployed in more than 20 datacenters owned by a top global search engine. PreFix achieved an average recall of 61.81% and an average false positive ratio of 1.84 × 10^-5. It outperforms other failure prediction methods for computers and ISP devices.
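
To make the two reported metrics concrete, the following minimal sketch shows how recall and false positive ratio are conventionally computed from confusion counts. It assumes the standard definitions (recall = TP / (TP + FN), false positive ratio = FP / (FP + TN)); the abstract does not spell out its exact formulas. The `ConfusionCounts` type and the example counts are hypothetical and are not taken from the PreFix data set.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Hypothetical container for prediction outcomes (not from the paper)."""
    true_positives: int   # failures that were predicted
    false_negatives: int  # failures that were missed
    false_positives: int  # healthy samples flagged as failing
    true_negatives: int   # healthy samples correctly left alone


def recall(c: ConfusionCounts) -> float:
    """Fraction of real failures that were predicted (assumed standard definition)."""
    denom = c.true_positives + c.false_negatives
    return c.true_positives / denom if denom else 0.0


def false_positive_ratio(c: ConfusionCounts) -> float:
    """Fraction of healthy samples wrongly flagged (assumed standard definition)."""
    denom = c.false_positives + c.true_negatives
    return c.false_positives / denom if denom else 0.0


# Illustrative counts only: failures are rare, so negatives dominate.
counts = ConfusionCounts(
    true_positives=6,
    false_negatives=4,
    false_positives=2,
    true_negatives=99_998,
)

print(f"recall = {recall(counts):.2%}")
print(f"false positive ratio = {false_positive_ratio(counts):.2e}")
```

Because healthy samples vastly outnumber failure samples in this setting, even a very small false positive ratio can correspond to a noticeable number of false alarms in absolute terms, which is why both metrics are reported together.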


