
Time Series of Price Anomaly Detection with LSTM

Johnson and Johnson, JNJ, Keras, Autoencoder, Tensorflow

Susan Li
Towards Data Science
4 min read · Sep 8, 2020


Autoencoders are an unsupervised learning technique, although they are trained with the same machinery as supervised learning: the network learns to reconstruct its own input, and training minimizes the reconstruction error under a loss function such as the mean squared error.

In this post, we will try to detect anomalies in Johnson & Johnson's historical stock price time series with an LSTM autoencoder.

The data can be downloaded from Yahoo Finance. The time period I selected was from 1985–09–04 to 2020–09–03.

Here are the steps we will follow to detect anomalies in the Johnson & Johnson stock price data with an LSTM autoencoder:

  1. Train an LSTM autoencoder on Johnson & Johnson's stock price data from 1985-09-04 to 2013-09-03. We assume this period contains no anomalies, i.e., that the prices were normal.
  2. Use the LSTM autoencoder to reconstruct the test data from 2013-09-04 to 2020-09-03 and compute the reconstruction error.
  3. If the reconstruction error for the test data is above the threshold, we label the data point as an anomaly.

We will also break the LSTM autoencoder network down layer by layer to understand it.

The Data

LSTM_autoencoder_anomaly.py
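The embedded gist is not reproduced here, so this is a minimal sketch of the data-loading step. It assumes the Yahoo Finance export was saved as a CSV (the filename `JNJ.csv` and the tiny synthetic rows are stand-ins so the sketch runs on its own):

```python
import numpy as np
import pandas as pd

# Stand-in for the real Yahoo Finance download: synthesize a few rows so the
# sketch is self-contained ("JNJ.csv" is an assumed filename).
dates = pd.date_range("1985-09-04", periods=5, freq="B")
pd.DataFrame({"Date": dates, "Close": np.linspace(1.0, 1.2, 5)}).to_csv(
    "JNJ.csv", index=False
)

# Load the series with the Date column as a DatetimeIndex
df = pd.read_csv("JNJ.csv", parse_dates=["Date"], index_col="Date")
print(df.shape)  # → (5, 1)
```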

Visualize the timeseries

viz_timeseries.py
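A quick sketch of what the plotting gist likely does, using a synthetic random-walk series in place of the real JNJ closing prices:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for this sketch
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the JNJ closing-price series
idx = pd.date_range("1985-09-04", periods=200, freq="B")
df = pd.DataFrame({"Close": np.cumsum(np.random.randn(200)) + 30}, index=idx)

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(df.index, df["Close"], label="Close price")
ax.set_xlabel("Date")
ax.set_ylabel("Close price")
ax.legend()
fig.savefig("timeseries.png")
```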
Figure 1

Preprocessing

  • Train test split
train_test.py
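The split is by date rather than random, since this is a time series: everything up to 2013-09-03 is training data, everything after is test data. A sketch with a synthetic series standing in for the real one:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in covering the same date range as the article
idx = pd.date_range("1985-09-04", "2020-09-03", freq="B")
df = pd.DataFrame({"Close": np.cumsum(np.random.randn(len(idx))) + 30}, index=idx)

# Date-based split: train up to 2013-09-03, test from 2013-09-04 onward
train = df.loc[:"2013-09-03"]
test = df.loc["2013-09-04":]
print(train.shape, test.shape)
```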
  • Standardize the data
standardize.py
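A sketch of the standardization step. The key point is to fit the scaler on the training set only, so no statistics leak from the test period (`StandardScaler` is an assumption; any zero-mean/unit-variance scaling works the same way):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the split produced in the previous step
idx = pd.date_range("1985-09-04", periods=300, freq="B")
df = pd.DataFrame({"Close": np.cumsum(np.random.randn(300)) + 30}, index=idx)
train, test = df.iloc[:200].copy(), df.iloc[200:].copy()

# Fit on train only; apply the same transform to test
scaler = StandardScaler()
train["Close"] = scaler.fit_transform(train[["Close"]]).ravel()
test["Close"] = scaler.transform(test[["Close"]]).ravel()
```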
  • Create sequences

Convert the input data into a 3-D array of overlapping windows of length TIME_STEPS. The shape of the array should be [samples, TIME_STEPS, features], as required by an LSTM network.

We want our network to have memory of 30 days, so we set TIME_STEPS=30.

create_sequences.py
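A sketch of the windowing helper described above. It slides a 30-step window over the series and stacks the windows into the [samples, TIME_STEPS, features] array an LSTM expects (the exact off-by-one convention in the original gist may differ):

```python
import numpy as np

TIME_STEPS = 30

def create_sequences(values, time_steps=TIME_STEPS):
    # Slide a window of length time_steps over the series and stack the
    # windows into a 3-D array of shape [samples, time_steps, features].
    output = []
    for i in range(len(values) - time_steps + 1):
        output.append(values[i : i + time_steps])
    return np.stack(output)

series = np.arange(100, dtype=float).reshape(-1, 1)  # toy 1-feature series
X = create_sequences(series)
print(X.shape)  # → (71, 30, 1)
```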

Build the Model

  • We define the reconstruction LSTM Autoencoder architecture that expects input sequences with 30 time steps and one feature and outputs a sequence with 30 time steps and one feature.
  • RepeatVector() repeats the encoder's output vector 30 times, once per time step, to feed the decoder.
  • Set return_sequences=True, so the output will still be a sequence.
  • TimeDistributed(Dense(X_train.shape[2])) is added at the end to get the output, where X_train.shape[2] is the number of features in the input data.
LSTM_autoencoder.py
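Putting the bullets above together, here is a sketch of the architecture. The layer order (LSTM encoder → RepeatVector → LSTM decoder with return_sequences=True → TimeDistributed Dense) follows the description; the 128 units and 0.2 dropout rate are assumptions, not values confirmed by the article:

```python
from tensorflow import keras
from tensorflow.keras import layers

TIME_STEPS, N_FEATURES = 30, 1  # 30 time steps, one feature

model = keras.Sequential([
    layers.Input(shape=(TIME_STEPS, N_FEATURES)),
    layers.LSTM(128),                          # encoder: compress the sequence
    layers.Dropout(0.2),
    layers.RepeatVector(TIME_STEPS),           # repeat the encoding 30 times
    layers.LSTM(128, return_sequences=True),   # decoder: emit a full sequence
    layers.Dropout(0.2),
    layers.TimeDistributed(layers.Dense(N_FEATURES)),  # one value per step
])
model.compile(optimizer="adam", loss="mae")
print(model.output_shape)  # → (None, 30, 1)
```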
Figure 2

Train the Model

LSTM_autoencoder_train_model.py
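A sketch of the training step. Because an autoencoder reconstructs its own input, the targets are the inputs themselves. The tiny synthetic data, small layer sizes, epoch count, batch size, and EarlyStopping patience here are all assumptions chosen so the sketch runs quickly:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

TIME_STEPS, N_FEATURES = 30, 1
X_train = np.random.randn(64, TIME_STEPS, N_FEATURES)  # synthetic stand-in

model = keras.Sequential([
    layers.Input(shape=(TIME_STEPS, N_FEATURES)),
    layers.LSTM(16),
    layers.RepeatVector(TIME_STEPS),
    layers.LSTM(16, return_sequences=True),
    layers.TimeDistributed(layers.Dense(N_FEATURES)),
])
model.compile(optimizer="adam", loss="mae")

# The autoencoder is trained to reproduce its own input, so X doubles as y.
history = model.fit(
    X_train, X_train,
    epochs=2,              # a real run would use many more epochs
    batch_size=32,
    validation_split=0.1,
    callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)],
    verbose=0,
)
```

The `history.history` dict then holds the `loss` and `val_loss` curves plotted below.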
Figure 3
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.legend();
Figure 4
model.evaluate(X_test, y_test)

Determine Anomalies

  • Find MAE loss on the training data.
  • Use the maximum MAE loss on the training data as the reconstruction error threshold.
  • If the reconstruction loss for a data point in the test set is greater than this threshold, we label that data point as an anomaly.
LSTM_train_loss.py
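The bullets above reduce to a few lines of numpy. This sketch uses toy arrays standing in for `X_train` and `model.predict(X_train)`:

```python
import numpy as np

# Toy reconstructions standing in for model.predict(X_train)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 30, 1))
X_train_pred = X_train + rng.normal(scale=0.05, size=X_train.shape)

# Per-sequence MAE between the input and its reconstruction
train_mae_loss = np.mean(np.abs(X_train_pred - X_train), axis=(1, 2))

# Threshold = the largest reconstruction error seen on "normal" data
threshold = float(train_mae_loss.max())
```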
Figure 5
LSTM_test_loss.py
Figure 6
test_loss_vs_threshold.py
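And the flagging step: compare each test-set loss to the threshold and keep the rows that exceed it. The losses and the threshold value here are synthetic, with two spikes injected so the sketch flags something:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
test_mae_loss = rng.uniform(0.0, 0.1, size=200)
test_mae_loss[[50, 120]] = 0.5   # two injected anomalous spikes
threshold = 0.2                  # assumed value from the training step

test_score_df = pd.DataFrame({"loss": test_mae_loss})
test_score_df["threshold"] = threshold
test_score_df["anomaly"] = test_score_df["loss"] > test_score_df["threshold"]

anomalies = test_score_df.loc[test_score_df["anomaly"]]
print(anomalies.shape)  # → (2, 3)
```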
Figure 7
anomalies = test_score_df.loc[test_score_df['anomaly'] == True]
anomalies.shape

As you can see, there are 22 data points in the test set that exceeded the reconstruction error threshold.

Visualize Anomalies

plot_anomalies.py
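A sketch of the anomaly overlay: the closing price as a line, with the flagged dates marked on top. The series and the two anomalous dates are synthetic placeholders:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend for this sketch
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the test-period closing prices
idx = pd.date_range("2013-09-04", periods=200, freq="B")
close = pd.Series(np.cumsum(np.random.randn(200)) + 60, index=idx)
anomaly_idx = idx[[50, 120]]  # assumed anomalous dates, for illustration

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(close.index, close.values, label="Close price")
ax.scatter(anomaly_idx, close.loc[anomaly_idx], color="red", label="Anomaly")
ax.legend()
fig.savefig("anomalies.png")
```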
Figure 8

The model found some low-price anomalies in March and high-price anomalies in April. This matches the record: JNJ stock hit a 2020 low in March, then quickly reaccelerated to a high point less than a month later on bullish expectations for its coronavirus vaccine.

Jupyter notebook can be found on Github. Have a great week!
