
Time series anomaly detection — with Python example

Krzysztof Drelczuk
May 15, 2020

Anomaly detection is one of the most interesting topics in data science. There are many approaches to the problem, ranging from simple global thresholds to advanced machine learning. Here I would like to show you a simple yet efficient method that uses a sliding window to set local thresholds.

Let’s first look at our data. Right after loading it, I assign the value 100 to index 270 of the Births column to create a significant outlier (anomaly). It can be clearly seen on the plot.

import numpy as np
import pandas as pd

data = pd.read_csv('daily-total-female-births.csv')
data.loc[270, 'Births'] = 100  # inject an artificial outlier
column = data['Births']
column.plot(figsize=(20, 10))

Idea:

We are going to build a sliding window and shift it one element at a time to the right until we reach the end of the data set. For each window we compute the mean and the standard deviation; the mean plus and minus three standard deviations gives the local thresholds (the upper and lower bands). In the next step we compare them with the real values: if a value is greater than the mean plus three standard deviations or smaller than the mean minus three standard deviations, we consider that point an anomaly. You can choose other thresholds, such as the 98th percentile. It is up to you and it all depends on your data.
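To make the idea concrete, here is a minimal sketch of the check for a single window. The numbers are made up for illustration and are not taken from the births data set.

```python
import numpy as np

# Hypothetical window of observations; the last value is deliberately extreme.
window = np.array([40, 42, 38, 41, 39, 43, 40, 41, 39, 42, 40, 90], dtype=float)

mean = window.mean()
std = window.std()  # population standard deviation (ddof=0), NumPy's default

upper = mean + 3 * std
lower = mean - 3 * std

# A value is flagged when it falls outside the [lower, upper] band.
is_anomaly = (window > upper) | (window < lower)
print(is_anomaly)
```

Only the last value falls outside the band here. The sliding-window method repeats this check with a fresh window around every point.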

Implementation:

Let’s start by defining the window size. The actual size is computed from the percentage assigned to the window_percentage variable; k is half of that window, so each window spans up to 2k points. A smaller percentage makes the bands follow the data more tightly, while a larger one gives smoother bands. At the end of the article I have placed charts with 1%, 5%, 10% and 20% window sizes, so you can see the difference in the bands. You can think of it as a sensitivity parameter. In this example I will go with 3%.

window_percentage = 3
k = int(len(column) * (window_percentage/2/100))
N = len(column)
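As a rough illustration of how the percentage translates into window sizes, the sketch below applies the same formula to several percentages. The series length of 365 is an assumption made for the example, and half_window is a hypothetical helper name. The last column uses a known bound (Samuelson's inequality): when a point is part of its own window of n points and the population standard deviation is used, it can lie at most sqrt(n - 1) standard deviations from the window mean.

```python
import math

def half_window(n_points, window_percentage):
    """Half-window size k, using the same formula as the snippet above."""
    return int(n_points * (window_percentage / 2 / 100))

N_EXAMPLE = 365  # assumed series length, for illustration only

for pct in (1, 3, 5, 10, 20):
    k_example = half_window(N_EXAMPLE, pct)
    n = 2 * k_example         # points in an interior window [i - k, i + k)
    max_z = math.sqrt(n - 1)  # largest z-score a single point can reach
    print(f"{pct:>2}% -> k={k_example:>2}, window={n:>2} points, max z-score={max_z:.2f}")
```

Under these assumptions, a window of ten points or fewer cannot push any value strictly beyond three standard deviations, so it is worth checking the window size against the threshold you pick.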

To compute the upper and lower bands for each window I will use the following lambda:

get_bands = lambda data: (np.mean(data) + 3*np.std(data), np.mean(data) - 3*np.std(data))
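One detail worth knowing when reusing this lambda: np.std computes the population standard deviation (ddof=0) by default, while the pandas Series.std method defaults to the sample standard deviation (ddof=1). The sketch below, on made-up numbers, shows that the two conventions give different band widths.

```python
import numpy as np
import pandas as pd

sample = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up values

population_std = np.std(sample)  # ddof=0: divide by n
sample_std = sample.std()        # ddof=1: divide by n - 1

# Bands built the same way as in get_bands, once per convention.
center = sample.mean()
bands_population = (center + 3 * population_std, center - 3 * population_std)
bands_sample = (center + 3 * sample_std, center - 3 * sample_std)

print(bands_population)
print(bands_sample)
```

The difference shrinks as the window grows, but for short windows it can decide whether a borderline point is flagged.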

If you prefer percentiles to standard deviations, you can use the 98th and 2nd percentiles of each window as the bands instead:

get_bands = lambda data: (np.nanquantile(data, 0.98), np.nanquantile(data, 0.02))

For the sliding window I will use a list comprehension to do it as a one-liner. The max and min calls clamp the slice bounds at the edges of the series, so the window never reaches below index 0 or beyond the end of the data. In this snippet k is half the window size and N is the length of the data.

bands = [get_bands(column.iloc[max(0, i - k):min(N, i + k)]) for i in range(N)]
upper, lower = zip(*bands)
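For longer series, the Python-level loop can become slow. The sketch below shows one possible vectorized alternative using pandas rolling windows, checked against an explicit loop on a made-up series. It assumes a recent pandas version, in which a centered window of size 2k with min_periods=1 covers the same positions [i - k, i + k) and is truncated at the edges; loop_bands and rolling_bands are hypothetical helper names.

```python
import numpy as np
import pandas as pd

def loop_bands(column, k):
    """Bands from an explicit window [i - k, i + k), clamped at the edges."""
    N = len(column)
    upper, lower = [], []
    for i in range(N):
        window = column.iloc[max(0, i - k):min(N, i + k)]
        mean, std = np.mean(window), np.std(window)
        upper.append(mean + 3 * std)
        lower.append(mean - 3 * std)
    return np.array(upper), np.array(lower)

def rolling_bands(column, k):
    """The same bands computed with pandas rolling windows."""
    roll = column.rolling(window=2 * k, center=True, min_periods=1)
    mean = roll.mean()
    std = roll.std(ddof=0)  # match np.std, which uses ddof=0
    return (mean + 3 * std).to_numpy(), (mean - 3 * std).to_numpy()

# Made-up series for a quick comparison of the two versions.
series = pd.Series(np.sin(np.arange(60) / 5.0) * 10 + 40)
loop_upper, loop_lower = loop_bands(series, k=5)
roll_upper, roll_lower = rolling_bands(series, k=5)
print(np.allclose(loop_upper, roll_upper), np.allclose(loop_lower, roll_lower))
```

The rolling version keeps the loop inside pandas, at the cost of being a little less explicit about where each window starts and ends.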

Finding the anomalies is now easy:

anomalies = (column > upper) | (column < lower)
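To see all the steps working together without the CSV file, here is a self-contained sketch on synthetic data. The sine-wave series, the spike at index 270 and the half-window k = 10 are assumptions chosen for the example; they are not the article's data or its 3% setting.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the births column: a smooth wave plus one injected spike.
t = np.arange(365)
column = pd.Series(42 + 5 * np.sin(2 * np.pi * t / 30))
column.iloc[270] = 100

N = len(column)
k = 10  # half-window chosen for this example

get_bands = lambda data: (np.mean(data) + 3 * np.std(data), np.mean(data) - 3 * np.std(data))
bands = [get_bands(column.iloc[max(0, i - k):min(N, i + k)]) for i in range(N)]
upper, lower = (np.array(b) for b in zip(*bands))

anomalies = (column > upper) | (column < lower)

# Boolean indexing returns the positions and values of the flagged points.
flagged = column[anomalies]
print(flagged)
```

Because anomalies is a boolean Series aligned with column, it can be used directly as a mask, as in the last step.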

Now let’s plot the data, the upper and lower bands, and the anomalies.
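A plot along these lines could be produced with matplotlib. The following is only a sketch: plot_bands is a hypothetical helper, the styling is arbitrary, and the toy inputs stand in for the column, upper, lower and anomalies variables computed earlier.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_bands(column, upper, lower, anomalies, figsize=(20, 10)):
    """Plot the series, its local bands and the flagged points."""
    fig, ax = plt.subplots(figsize=figsize)
    ax.plot(column.index, column.values, label="data")
    ax.plot(column.index, upper, linestyle="--", label="upper band")
    ax.plot(column.index, lower, linestyle="--", label="lower band")
    flagged = column[anomalies]
    ax.scatter(flagged.index, flagged.values, color="red", zorder=3, label="anomalies")
    ax.legend()
    return ax

# Toy inputs so the sketch runs on its own.
toy_column = pd.Series([40.0, 42.0, 41.0, 90.0, 39.0, 41.0])
toy_upper = np.full(len(toy_column), 60.0)
toy_lower = np.full(len(toy_column), 20.0)
toy_anomalies = (toy_column > toy_upper) | (toy_column < toy_lower)

ax = plot_bands(toy_column, toy_upper, toy_lower, toy_anomalies)
```

To zoom in on a region, as in the chart discussed next, you could call ax.set_xlim with the index range of interest.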

As you can see on this zoomed chart, three anomalies were found. The third one is the one we set at the beginning. The other two were already in the data set.

Complete charts for window sizes of 1%, 5%, 10% and 20%:

Window size = 1%
Window size = 5%
Window size = 10%
Window size = 20%
