
How to use Deep Learning for Time Series Forecasting

An application of the RNN family

Christophe Pere
Towards Data Science
12 min read · Sep 21, 2020


Introduction

For a long time, I heard that time series problems could only be approached with statistical methods (AR[1], MA[2], ARMA[3], ARIMA[4]). These techniques are generally used by mathematicians, who continuously try to improve them to handle both stationary and non-stationary time series.

A friend of mine (a mathematician, professor of statistics, and specialist in non-stationary time series) suggested several months ago that I work on validating and improving techniques to reconstruct the light curves of stars. Indeed, the Kepler satellite[11], like many other satellites, could not continuously measure the intensity of the luminous flux of nearby stars. Between 2009 and 2016, the Kepler satellite was dedicated to searching for planets outside our Solar System, called extrasolar planets or exoplanets.

As you have understood, we are going to travel a little further than our planet Earth and take a deep dive into a galactic journey in which machine learning will be our vessel. As you may have guessed, astrophysics has remained a strong passion of mine.

The notebook is available on Github: Here.

RNN, LSTM, GRU, Bidirectional, CNN-x

So which vessel will carry us through this study? We will use models from the recurrent neural network (RNN[5]) family: LSTM[6], GRU[7], Stacked LSTM, Stacked GRU, Bidirectional[8] LSTM, Bidirectional GRU, and also CNN-LSTM[9]. For those passionate about the tree family, you can find a great article on XGBoost and time series written by Jason Brownlee here. A great repository about time series is also available on GitHub.

For those who are not familiar with the RNN family, see them as learning methods with a memory effect and the ability to forget. The term bidirectional comes from the architecture: it consists of two RNNs that "read" the data in one direction (from left to right) and in the other (from right to left), in order to build the best possible representation of long-term dependencies.

Data

As said in the introduction, the data correspond to flux measurements of several stars. Indeed, at each temporal increment (hour), the satellite took a measurement of the flux coming from nearby stars. This flux, or magnitude, is the light intensity, and it varies over time. There are several reasons for this: the proper motion of the satellite, its rotation, the angle of sight, etc. all vary, so the number of photons measured changes. Moreover, the star is a ball of molten material (hydrogen and helium fusion) with its own motion, so the emission of photons depends on that motion. This translates into fluctuations in light intensity.

But there can also be planets, exoplanets, which disturb the star or even pass between the star and the satellite's line of sight (the transit method[12]). This passage obscures the star: the satellite receives fewer photons because they are blocked by the planet passing in front of it (a concrete example is a solar eclipse caused by the Moon).

The set of flux measurements is called a light curve. What does a light curve look like? Here are some examples:

The fluxes are very different from one star to another. Some are very noisy, while others show great stability. The fluxes nevertheless present anomalies: holes, or gaps in the measurements, are visible in the light curves. The goal is to see whether it is possible to predict the behavior of the light curves where there are no measurements.

Data Reduction

In order to use the data in the models, a data reduction step is necessary. Two approaches will be presented here: the moving average and the window method.

Moving average:

The moving average consists of taking X successive points and computing their mean. This method reduces the variability and removes noise. It also reduces the number of points; it is a downsampling method.

The following function computes a moving average from a list of points, given a number of points over which the average and the standard deviation are computed.
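The original gist is not embedded in this excerpt, so here is a minimal sketch of what such a function could look like, assuming the moving_mean(time, flux, lag) signature used further below; the notebook version may differ in its details:

import numpy as np  # needed here; the full import block follows below

def moving_mean(time, flux, lag=20):
    # downsample a light curve by averaging non-overlapping blocks of `lag` points (sketch)
    time = np.asarray(time, dtype=float)
    flux = np.asarray(flux, dtype=float)
    x, y, y_err = [], [], []
    for i in range(0, len(flux) - lag + 1, lag):  # jump from one block to the next
        x.append(np.mean(time[i:i + lag]))        # mean time of the block
        y.append(np.nanmean(flux[i:i + lag]))     # mean flux, ignoring NaN gaps
        y_err.append(np.nanstd(flux[i:i + lag]))  # standard deviation of the flux in the block
    return np.array(x), np.array(y), np.array(y_err)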

You can see that the function takes 3 input parameters. time and flux are the x and y of the time series. lag is the parameter controlling the number of points taken into account to compute the mean of the time and flux, and the standard deviation of the flux.

Now, we can take a look at how to use this function and the result obtained by the transformation.

# import the packages needed for the study
%matplotlib inline
import scipy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import tensorflow as tf
# let's see the progress bar
from tqdm import tqdm
tqdm.pandas()

Now we need to import the data. The file kep_lightcurves.csv contains the data for the 13 stars. Each star has 4 columns: the original flux ('…_orig'), the rescaled flux, which is the original flux minus the average flux ('…_rscl'), the difference ('…_diff'), and the residuals ('…_res'). So, 52 columns in total.
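The loading step itself is not shown in this excerpt; assuming the CSV file sits next to the notebook, it can be read like this:

# load the light curves (file path assumed; add index_col=0 if the time index is stored as the first column)
df = pd.read_csv("kep_lightcurves.csv")
print(df.shape)  # expect 52 columns: 13 stars x 4 flux types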

# reduce the number of points with the mean on 20 points
x, y, y_err = moving_mean(df.index, df["001724719_rscl"], 20)

df.index corresponds to the time of the time series
df["001724719_rscl"] is the rescaled flux of the star "001724719"
lag=20 is the number of points over which the mean and the standard deviation are computed

The result for the 3 previous lightcurves:

Light curves of the 3 previous stars with a moving average, showing the points between 25,000 and 30,000 measurements

Window Method:

The second method is the window method. How does it work?

You take a number of points, 20 in the previous case, and compute the mean (no difference with the previous method); this point is the beginning of the new time series and sits at position 20 (a shift of 19 points). But instead of shifting to the next 20 points, the window is shifted by one point, the mean of the 20 previous points is computed, and so on, moving one step ahead each time. It is not a downsampling method but a cleaning method, because the effect is to smooth the data points.

Let’s see it in code:
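The gist is not embedded here either, so below is a minimal sketch with the same signature as moving_mean; the original implementation may differ:

def mean_sliding_windows(time, flux, lag=20):
    # smooth a light curve with a window of `lag` points shifted one step at a time (sketch)
    time = np.asarray(time, dtype=float)
    flux = np.asarray(flux, dtype=float)
    x, y, y_err = [], [], []
    for i in range(len(flux) - lag + 1):          # shift the window by a single point
        x.append(np.mean(time[i:i + lag]))        # mean time inside the window
        y.append(np.nanmean(flux[i:i + lag]))     # mean flux inside the window
        y_err.append(np.nanstd(flux[i:i + lag]))  # standard deviation of the flux inside the window
    return np.array(x), np.array(y), np.array(y_err)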

You can easily use it like that:

# smooth the data with a sliding mean over 40 points
x, y, y_err = mean_sliding_windows(df.index, df["001724719_rscl"], 40)

df.index corresponds to the time of the time series
df["001724719_rscl"] is the rescaled flux of the star "001724719"
lag=40 is the number of points over which the mean and the standard deviation are computed

Now, look at the result:

Light curves of the 3 previous stars with the window method, showing the points between 25,000 and 30,000 measurements

Well, not so bad. Setting the lag to 40 makes it possible to "predict" or extend the new time series into the small holes. But if you look closer, you'll see a divergence at the beginning and the end of the portions of the red line. The function can be improved to avoid these artifacts.

For the rest of the study, we will use the time series obtained with the moving average method.

Change the x-axis from values to dates:

You can change the axis to dates if you want. The Kepler mission began on 2009-03-07 and ended in 2017. Pandas has a function called pd.date_range() that allows you to create a constantly incrementing sequence of dates.

df.index = pd.date_range('2009-03-07', periods=len(df.index), freq='h')

This line of code will create a new index with a frequency of hours. If you print the result (as below) you’ll find a proper real timescale.

$ df.index
DatetimeIndex(['2009-03-07 00:00:00', '2009-03-07 01:00:00',
'2009-03-07 02:00:00', '2009-03-07 03:00:00',
'2009-03-07 04:00:00', '2009-03-07 05:00:00',
'2009-03-07 06:00:00', '2009-03-07 07:00:00',
'2009-03-07 08:00:00', '2009-03-07 09:00:00',
...
'2017-04-29 17:00:00', '2017-04-29 18:00:00',
'2017-04-29 19:00:00', '2017-04-29 20:00:00',
'2017-04-29 21:00:00', '2017-04-29 22:00:00',
'2017-04-29 23:00:00', '2017-04-30 00:00:00',
'2017-04-30 01:00:00', '2017-04-30 02:00:00'],
dtype='datetime64[ns]', length=71427, freq='H')

You have now a good timescale for the original time series.

Generate the datasets

So, now that the data reduction functions have been created, we can combine them in another function (shown below), which takes the initial dataset and the names of the stars present in the dataset (this part could have been done inside the function).
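Since the gist is not embedded in this excerpt, the sketch below shows one way such a function could look; the lag values, the positional time axis, and the "…_rscl_x"/"…_rscl_y" column naming are assumptions made to match how df_model is used later:

def reduced_data(df, stars, lag=20):
    # combine both reduction methods for the rescaled flux of each star (sketch, not the original gist)
    df_mean, df_slide = pd.DataFrame(), pd.DataFrame()
    time = np.arange(len(df))  # positional time axis (the index may already hold dates)
    for star in tqdm(stars):
        # moving average (downsampling), lag=20 as above
        x, y, y_err = moving_mean(time, df[star + "_rscl"], lag)
        df_mean[star + "_rscl_x"] = x
        df_mean[star + "_rscl_y"] = y
        df_mean[star + "_rscl_y_err"] = y_err
        # sliding window (smoothing), lag=40 as above
        x, y, y_err = mean_sliding_windows(time, df[star + "_rscl"], 2 * lag)
        df_slide[star + "_rscl_x"] = x
        df_slide[star + "_rscl_y"] = y
        df_slide[star + "_rscl_y_err"] = y_err
    return df_mean, df_slide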

To generate the new data frames, do this:

stars = df.columns
stars = list(set([i.split("_")[0] for i in stars]))
print(f"The number of stars available is: {len(stars)}")
> The number of stars available is: 13

We have 13 stars with 4 data types, corresponding to 52 columns.

df_mean, df_slide = reduced_data(df, stars)

Perfect, at this point, you have two new datasets containing the data reduced by the moving average and the window method.

Methods

Prepare the data:

In order to use a machine learning algorithm to predict time series, the data must be prepared accordingly. The data cannot simply be left as (x, y) data points. The data must take the form of a series [x1, x2, x3, …, xn] and a value y to predict.

The function below shows you how to set up your dataset:
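The gist is not shown here; the sketch below follows the classic look-back windowing recipe and matches the create_dataset(train, look_back) calls used further down (the original function may differ slightly):

def create_dataset(dataset, look_back=1):
    # turn an array of shape (n, 1) into windows of `look_back` values (X) and the next value (Y)
    dataX, dataY = [], []
    for i in range(len(dataset) - look_back - 1):
        dataX.append(dataset[i:i + look_back, 0])  # window of look_back consecutive values
        dataY.append(dataset[i + look_back, 0])    # the value right after the window, to predict
    return np.array(dataX), np.array(dataY)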

Two important things before starting.

1- The data need to be rescaled
Deep learning algorithms perform better at predicting time series when the data is in the range [0, 1]. To do this simply, scikit-learn provides the MinMaxScaler() function. You can configure the feature_range parameter, but by default it is (0, 1). You also need to clean the data of NaN values, because if you don't remove them, your loss function will output NaN.

# normalize the dataset
from sklearn.preprocessing import MinMaxScaler  # scaler used to rescale the flux to [0, 1]
num = 2 # choose the third star in the dataset
values = df_model[stars[num]+"_rscl_y"].values # extract the list of values
scaler = MinMaxScaler(feature_range=(0, 1)) # make an instance of MinMaxScaler
dataset = scaler.fit_transform(values[~np.isnan(values)].reshape(-1, 1)) # the data are cleaned of NaN values, rescaled and reshaped

2- The data need to be converted into an x list and a y value
Now, we will use the create_dataset() function to generate the data for the models. But before that, I prefer to keep a copy of the original data:

df_model = df_mean.copy()  # copy() keeps the reduced data intact

# split into train and test sets
train_size = int(len(dataset) * 0.8) # make 80% data train
train = dataset[:train_size] # set the train data
test = dataset[train_size:] # set the test data
# reshape into X=t and Y=t+1
look_back = 20
trainX, trainY = create_dataset(train, look_back)
testX, testY = create_dataset(test, look_back)
# reshape input to be [samples, time steps, features]
trainX = np.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))
testX = np.reshape(testX, (testX.shape[0], testX.shape[1], 1))

Just take a look at the result:

trainX[0]
> array([[0.7414906],
[0.76628096],
[0.79901113],
[0.62779976],
[0.64012722],
[0.64934765],
[0.68549234],
[0.64054092],
[0.68075644],
[0.73782449],
[0.68319294],
[0.64330245],
[0.61339268],
[0.62758265],
[0.61779702],
[0.69994317],
[0.64737128],
[0.64122564],
[0.62016833],
[0.47867125]]) # 20 values in the first value of x train data
trainY[0]
> array([0.46174275]) # the corresponding y value

Metrics

What metrics do we use for time series prediction? We can use the mean absolute error (MAE) and the mean squared error (MSE), computed with a small helper function.

You first need to import them:

from sklearn.metrics import mean_absolute_error, mean_squared_error
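A minimal helper wrapping these two metrics could look like the sketch below; the helper name and signature are assumptions, as the notebook may organize this differently:

def compute_metrics(y_true, y_pred):
    # hypothetical helper returning both metrics at once
    mae = mean_absolute_error(y_true, y_pred)  # mean absolute error
    mse = mean_squared_error(y_true, y_pred)   # mean squared error
    return mae, mse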

RNNs:

You can easily implement the RNN family with Keras in a few lines of code. Here, you can use the function presented below, which will configure your RNN. You first need to import the different models from Keras:

# import some packages
import tensorflow as tf
from keras.layers import SimpleRNN, LSTM, GRU, Bidirectional, Conv1D, MaxPooling1D, Dropout

Now we have the models imported from Keras. The function below can generate a simple model (SimpleRNN, LSTM, GRU). Two identical models can also be stacked, or used in a Bidirectional wrapper, or as a stack of two Bidirectional models. You can also add a CNN part (Conv1D) with MaxPooling1D and dropout.

This function computes the metrics for the training part and the test part and returns the results in a data frame. Here is how to use it, with five examples.
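The full gist is not embedded in this excerpt. The sketch below reconstructs a plausible version of time_series_deep_learning from the calls shown next; the layer sizes, number of epochs, batch size, Dense output layer, and the fact that the metrics are computed here on the scaled values (the notebook inverse-transforms them with the fitted MinMaxScaler before reporting Table 1) are assumptions:

# sketch of the model builder/trainer (reconstructed from the calls below, not the original gist)
from keras.models import Sequential
from keras.layers import Dense

def time_series_deep_learning(x_train, y_train, x_test, y_test, model_dl=LSTM,
                              unit=12, look_back=20, stacked=False,
                              bidirection=False, cnn=False,
                              epochs=20, batch_size=32):
    # build a readable name such as CNN_Stack_GRU for the results table
    name = model_dl.__name__
    if bidirection:
        name = "Bi_" + name
    if stacked:
        name = "Stack_" + name
    if cnn:
        name = "CNN_" + name

    model = Sequential()
    if cnn:
        # convolutional front-end extracting local patterns before the recurrent layer
        model.add(Conv1D(64, kernel_size=3, activation="relu", input_shape=(look_back, 1)))
        model.add(MaxPooling1D(pool_size=2))
        model.add(Dropout(0.2))
        first_kwargs = {}
    else:
        # the recurrent layer is the first layer, so it carries the input shape
        first_kwargs = {"input_shape": (look_back, 1)}

    # first recurrent layer; it returns sequences only if a second one is stacked on top
    if bidirection:
        model.add(Bidirectional(model_dl(unit, return_sequences=stacked), **first_kwargs))
    else:
        model.add(model_dl(unit, return_sequences=stacked, **first_kwargs))
    if stacked:
        # second (identical) recurrent layer
        model.add(Bidirectional(model_dl(unit)) if bidirection else model_dl(unit))
    model.add(Dense(1))  # one output: the next point of the series
    model.compile(loss="mean_squared_error", optimizer="adam")
    model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, verbose=0)

    # predictions and metrics (computed here on the scaled values)
    train_predict = model.predict(x_train)
    test_predict = model.predict(x_test)
    res = pd.DataFrame({"Name": [name],
                        "MAE Train": [mean_absolute_error(y_train, train_predict)],
                        "MSE Train": [mean_squared_error(y_train, train_predict)],
                        "MAE Test": [mean_absolute_error(y_test, test_predict)],
                        "MSE Test": [mean_squared_error(y_test, test_predict)]})
    return train_predict, y_train, test_predict, y_test, res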

LSTM:

# train the model and compute the metrics
x_train_predict_lstm, y_train_lstm, x_test_predict_lstm, y_test_lstm, res = time_series_deep_learning(train_x, train_y, test_x, test_y, model_dl=LSTM, unit=12, look_back=20)
# plot the results of the prediction
plotting_predictions(dataset, look_back, x_train_predict_lstm, x_test_predict_lstm)
# save the metrics per model in the dataframe df_results
df_results = df_results.append(res)
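The plotting helper is not embedded in this excerpt either; a minimal sketch following the usual alignment of train and test predictions (the helper name and offsets are assumptions) could be:

def plotting_predictions(dataset, look_back, train_predict, test_predict):
    # plot the scaled series with the train/test predictions shifted to their true positions (sketch)
    train_plot = np.empty_like(dataset)
    train_plot[:] = np.nan
    test_plot = np.empty_like(dataset)
    test_plot[:] = np.nan
    # train predictions start after the first look_back points
    train_plot[look_back:look_back + len(train_predict), 0] = train_predict[:, 0]
    # test predictions follow the training part, shifted by another look_back + 1 points
    start = look_back + len(train_predict) + look_back + 1
    test_plot[start:start + len(test_predict), 0] = test_predict[:, 0]
    plt.figure(figsize=(12, 4))
    plt.plot(dataset, label="data", alpha=0.5)
    plt.plot(train_plot, label="train prediction")
    plt.plot(test_plot, label="test prediction")
    plt.legend()
    plt.show()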

GRU:

# train the model and compute the metrics
x_train_predict_lstm, y_train_lstm, x_test_predict_lstm, y_test_lstm, res = time_series_deep_learning(train_x, train_y, test_x, test_y, model_dl=GRU, unit=12, look_back=20)

Stacked LSTM:

# train the model and compute the metrics
x_train_predict_lstm, y_train_lstm, x_test_predict_lstm, y_test_lstm, res = time_series_deep_learning(train_x, train_y, test_x, test_y, model_dl=LSTM, unit=12, look_back=20, stacked=True)

Bidirectional LSTM:

# train the model and compute the metrics
x_train_predict_lstm, y_train_lstm, x_test_predict_lstm, y_test_lstm, res = time_series_deep_learning(train_x, train_y, test_x, test_y, model_dl=LSTM, unit=12, look_back=20, bidirection=True)

CNN-LSTM:

# train the model and compute the metrics
x_train_predict_lstm, y_train_lstm, x_test_predict_lstm, y_test_lstm, res = time_series_deep_learning(train_x, train_y, test_x, test_y, model_dl=LSTM, unit=12, look_back=20, cnn=True)

Results

The results are pretty good considering the data. We can see that the deep learning RNNs can reproduce the data with good accuracy. The figure below shows the result of the prediction by the LSTM model.

LSTM predictions

Table 1: Results of the different RNN models, showing the MAE and MSE metrics

Name             | MAE Train | MSE Train | MAE Test | MSE Test
-----------------|-----------|-----------|----------|---------
GRU              |      4.24 |     34.11 |     4.15 |    31.47
LSTM             |      4.26 |     34.54 |     4.16 |    31.64
Stack_GRU        |      4.19 |     33.89 |     4.17 |    32.01
SimpleRNN        |      4.21 |     34.07 |     4.18 |    32.41
LSTM             |      4.28 |     35.10 |     4.21 |    31.90
Bi_GRU           |      4.21 |     34.34 |     4.22 |    32.54
Stack_Bi_LSTM    |      4.45 |     36.83 |     4.24 |    32.22
Bi_LSTM          |      4.31 |     35.37 |     4.27 |    32.40
Stack_SimpleRNN  |      4.40 |     35.62 |     4.27 |    33.94
SimpleRNN        |      4.44 |     35.94 |     4.31 |    34.37
Stack_LSTM       |      4.51 |     36.78 |     4.40 |    34.28
Stacked_Bi_GRU   |      4.56 |     37.32 |     4.45 |    35.34
CNN_LSTM         |      5.01 |     45.85 |     4.55 |    36.29
CNN_GRU          |      5.05 |     46.25 |     4.66 |    37.17
CNN_Stack_GRU    |      5.07 |     45.92 |     4.70 |    38.64

Table 1 shows the mean absolute error (MAE) and the mean squared error (MSE) for the train set and the test set for the RNN family. The GRU shows the best result on the test set, with an MAE of 4.15 and an MSE of 31.47.

Discussion

The results are good and reproduce the light curves of the different stars well (see the notebook). However, the fluctuations are not perfectly reproduced: the peaks do not have the same intensity and the flux is slightly shifted. A potential correction could come from attention mechanisms (Transformers[10]). Another way is to tune the models: the number of layers (stack), the number of units (cells), the combination of different RNN algorithms, new loss or activation functions, etc.

Conclusion

This article shows the possibilities of combining so-called artificial intelligence methods with time series. The power of memory-based algorithms (RNN, LSTM, GRU) makes it possible to accurately reproduce sporadic fluctuations of events. In our case, the stellar flux exhibited quite strong and marked fluctuations that the methods were able to capture.

This study shows that time series are no longer reserved for statistical methods such as the ARIMA[4] model.

References

[1] Autoregressive model, Wikipedia
[2] Moving-average model, Wikipedia
[3] Peter Whittle, 1951. Hypothesis Testing in Time Series Analysis. Thesis
[4] Alberto Luceño & Daniel Peña, 2008. Autoregressive Integrated Moving Average (ARIMA) Modeling. Wiley Online Library. https://doi.org/10.1002/9780470061572.eqr276
[5] Rumelhart, David E. et al., 1986. Learning representations by back-propagating errors. Nature, 323(6088): 533–536
[6] Hochreiter, Sepp & Schmidhuber, Jürgen, 1997. Long Short-Term Memory. Neural Computation, 9(8): 1735–1780. doi:10.1162/neco.1997.9.8.1735
[7] Chung, Junyoung et al., 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555
[8] Schuster, M. & Paliwal, K. K., 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11): 2673–2681. doi:10.1109/78.650093
[9] Sainath, Tara N. et al., 2015. Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks. https://static.googleusercontent.com/media/research.google.com/fr//pubs/archive/43455.pdf
[10] Vaswani, Ashish et al., 2017. Attention Is All You Need. https://arxiv.org/abs/1706.03762
[11] Kepler mission, NASA
