Part #1: Deep Learning Time Series & Forecasting
Introduction to Time Series
A time series is an ordered sequence of values, usually equally spaced over time. For example, a daily weather forecast has a single value at each time step, which makes it a univariate time series. So what if we encounter a time series that has multiple values at each time step? You're probably already thinking it: yes, multivariate. Consider the chart below showing global temperature against CO2 concentration. Each univariate chart on its own might show a trend, but when the two are combined, the correlation between them becomes very easy to see, adding further value to the data.
Now you're probably wondering what kinds of things we can do with machine learning over time series. The obvious answer is correct: make predictions, or forecasts, based on the given data. In many cases we may also want to project back into the past to see how we got to where we are now, a process called imputation. Imputation is simply filling in the holes in our data, estimating values that were never recorded. Why would we do this? Maybe there are years with data and years without, and we want a complete series.
Time series can also be used to detect anomalies. Another use is to analyze a time series to spot the patterns that reveal what generated the series itself. A classic example of this is analyzing sound waves to spot the words in them, which a neural network can then use for speech recognition.
Time series come in all shapes and sizes. The first common attribute is trend, where the time series moves in a specific overall direction. Another is seasonality, which appears when patterns repeat at predictable intervals. For example, take a look at this chart showing active users at a website for software developers. It follows a very distinct pattern of regular dips. Can you guess what they are?
What if I told you that it goes up for five units and then down for two? Then you could tell that it very clearly dips on the weekends, when fewer people are working, and thus it shows seasonality. Other seasonal series could be shopping sites that peak on weekends, or sports sites that peak at various times throughout the year. Time series can have a combination of both trend and seasonality, as this chart shows: there's an overall upward trend, but there are local peaks and troughs. And of course, there are also series that are probably not predictable at all, just a complete set of random values producing what's typically called white noise.

Now consider a series with occasional spikes. You can't predict when the next spike will happen or how strong it will be, but clearly the entire series isn't random: between the spikes there's a very deterministic type of decay. We can see that the value at each time step is 99 percent of the value at the previous time step, plus an occasional spike. This is an autocorrelated time series, namely one that correlates with a delayed copy of itself, often called a lag. The unpredictable spikes are often called innovations; in other words, they cannot be predicted from past values.
Time series that we see in real life will generally have a bit of each of these features: trend, seasonality, autocorrelation, and noise. That is why time series problems are a great fit for machine learning: machine learning models are designed to spot patterns, and when we spot patterns we can make predictions. For the most part this also works with time series, except for the noise, which is unpredictable. But we should recognize that this assumes the patterns that existed in the past will continue into the future. Real life is not always that simple; behavior can change drastically over time. For example, TSLA stock had a positive trend, but then something happened to change its behavior: maybe a news release, a financial crisis, a scandal, a disruptive technological breakthrough, or news that Tesla is having trouble globally. Then we start to see a downward trend without any seasonality. Such a series is typically called non-stationary.
So how do we predict on a non-stationary time series? We could train on just a limited period of time, for example only the last 100 days. That would probably give better performance than training on the entire series, but it breaks the mold of the typical machine learning problem, where we always assume that more data is better. For time series forecasting it really depends on the series: if it's stationary, meaning its behavior does not change over time, then great, the more data we have the better!
Trend and Seasonality
Let's take a look at time series and their various attributes using Python. First, let's create a time series that just trends upward:
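Something along these lines would do it. This is a minimal NumPy/Matplotlib sketch; the helper names `plot_series` and `trend`, and the slope value, are my own choices for illustration, and the same helpers are reused in the later snippets.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_series(time, series, format="-", start=0, end=None, label=None):
    # Small plotting helper reused throughout this post
    plt.plot(time[start:end], series[start:end], format, label=label)
    plt.xlabel("Time")
    plt.ylabel("Value")
    if label:
        plt.legend()
    plt.grid(True)

def trend(time, slope=0):
    # A straight line: the value grows linearly with time
    return slope * time

time = np.arange(4 * 365 + 1)   # roughly four years of daily time steps
series = trend(time, slope=0.1)

plot_series(time, series)
plt.show()
```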
The x-axis in this case is time and the y-axis is the value of the function at that time.
Now let’s generate a time series with a seasonal pattern:
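Here is one way to sketch it: `seasonal_pattern` below is just an arbitrary repeating shape (a cosine bump followed by an exponential decay), and the amplitude is illustrative.

```python
def seasonal_pattern(season_time):
    # An arbitrary repeating shape within one period
    return np.where(season_time < 0.4,
                    np.cos(season_time * 2 * np.pi),
                    1 / np.exp(3 * season_time))

def seasonality(time, period, amplitude=1, phase=0):
    # Repeat the pattern every `period` time steps
    season_time = ((time + phase) % period) / period
    return amplitude * seasonal_pattern(season_time)

amplitude = 40
series = seasonality(time, period=365, amplitude=amplitude)

plot_series(time, series)
plt.show()
```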
As we investigate the graph above, we can see clear peaks and troughs, but in addition there are smaller, regular spikes. This could be seen as a rough simulation of a seasonal value: for example, profits for a wine shop that are negative on the day the store is closed, peak a little the day after, decay during the week, and then peak again on the weekend — cause you know it's vino time!!!
Now let's create a time series with both trend and seasonality. What if we add a trend to the seasonal data, so that while still following the same pattern it increases over time? That could simulate a growing business, so when we plot it we'll see the same pattern but with an overall upward trend (given the vino they sell is good and at a good price).
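Reusing the helpers above, a minimal sketch (baseline and slope values are illustrative):

```python
baseline = 10
slope = 0.05

series = baseline + trend(time, slope) + seasonality(time, period=365, amplitude=amplitude)

plot_series(time, series)
plt.show()
```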
Noise
Now let's add another feature that's common in time series: noise. Here's a function that adds some noise to a series; when we call it and add the result to our time series, we get a very noisy series, but one which still follows the same seasonality we saw earlier.
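A simple Gaussian noise generator is enough here; the name `white_noise` is my own shorthand.

```python
def white_noise(time, noise_level=1, seed=None):
    # One Gaussian random value per time step, scaled by noise_level
    rnd = np.random.RandomState(seed)
    return rnd.randn(len(time)) * noise_level
```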
Let’s generate some white noise:
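Calling it on its own and plotting the result (the noise level and seed are arbitrary):

```python
noise_level = 5
noise = white_noise(time, noise_level, seed=42)

plot_series(time, noise)
plt.show()
```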
What's interesting at this point is that the human eye may miss a lot of the seasonality in noisy data, but a computer will hopefully be able to spot it.
Now let’s add this white noise to the time series:
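Adding the noise to the trend-plus-seasonality series from before:

```python
series += noise

plot_series(time, series)
plt.show()
```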
All right, this looks realistic enough for now. Let’s try to forecast it.
Next we can explore a little bit of autocorrelation. Here are a couple of functions that can add it to a series for you; I'll plot the output of both so that you can see the effect of each.
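These are sketches of two possible autoregressive generators; the function names, coefficients, and lags below are my own illustrative choices, not canonical definitions.

```python
def autocorrelation_1(time, amplitude, seed=None):
    # Each value is influenced by the values 50 and 33 steps back
    rnd = np.random.RandomState(seed)
    phi1, phi2 = 0.5, -0.1
    ar = rnd.randn(len(time) + 50)
    ar[:50] = 100
    for step in range(50, len(time) + 50):
        ar[step] += phi1 * ar[step - 50]
        ar[step] += phi2 * ar[step - 33]
    return ar[50:] * amplitude

def autocorrelation_2(time, amplitude, seed=None):
    # Each value is 80% of the previous value plus fresh noise
    rnd = np.random.RandomState(seed)
    phi = 0.8
    ar = rnd.randn(len(time) + 1)
    for step in range(1, len(time) + 1):
        ar[step] += phi * ar[step - 1]
    return ar[1:] * amplitude

plot_series(time[:200], autocorrelation_1(time, amplitude=10, seed=42)[:200])
plt.show()

plot_series(time[:200], autocorrelation_2(time, amplitude=10, seed=42)[:200])
plt.show()
```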
There is a pattern and then a sharp fall-off, followed by the same pattern on a smaller scale with the same fall-off, which then shrinks again, and so on. If I switch to the other autocorrelation function and run it again, we can see its effect as well.

Now, let's add autocorrelation to the trend and seasonality to simulate a more realistic time series, and split it into two periods: the training period and the validation period (in many cases, you would also want to have a test period). The split will be at time step 1000.
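Something along these lines would do it, reusing the helpers sketched above (the amplitudes and noise level are illustrative):

```python
series = (baseline
          + trend(time, slope)
          + seasonality(time, period=365, amplitude=amplitude)
          + autocorrelation_2(time, amplitude=5, seed=42)
          + white_noise(time, noise_level=3, seed=42))

split_time = 1000
time_train, x_train = time[:split_time], series[:split_time]
time_valid, x_valid = time[split_time:], series[split_time:]

plot_series(time_train, x_train, label="Training")
plot_series(time_valid, x_valid, label="Validation")
plt.show()
```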
Naive Forecasting
We could simply take the last value and assume that the next value will be the same one; this is called naive forecasting. So how would we measure its performance?
To measure the performance of our forecasting model, we typically want to split the time series into a training period, a validation period and a test period. This is called fixed partitioning. If the time series has some seasonality, you generally want to ensure that each period contains a whole number of seasons.
Next you'll train your model on the training period and evaluate it on the validation period. Here's where you can experiment to find the right architecture and tune your hyperparameters, until you get the desired performance as measured on the validation set. Once you have that, you can retrain on both the training and validation periods and evaluate on the test period, and then, often, retrain one final time using the test data as well. But why would you do that? Well, it's because the test data is the closest data you have to the current point in time, and as such it's often the strongest signal in determining future values. If your model is not trained using that data too, it may not be optimal. Due to this, it's actually quite common to forgo a test set altogether and just train using a training period and a validation period, with the test set being, effectively, the future.
Fixed partitioning like this is very simple and very intuitive, but there's also another way. We start with a short training period and gradually increase it, say by one day at a time, or by one week at a time. At each iteration, we train the model on the current training period and use it to forecast the following day, or the following week, in the validation period. This is called roll-forward partitioning. You could see it as doing fixed partitioning a number of times, continually refining the model as you go, as sketched below.
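A tiny, purely illustrative sketch of the idea (the generator name `roll_forward_splits` is mine):

```python
def roll_forward_splits(series, start_size, step=1):
    # Grow the training window by `step` values at a time and
    # yield the slice that should be forecast next
    for end in range(start_size, len(series), step):
        train = series[:end]             # everything seen so far
        target = series[end:end + step]  # the next day/week to forecast
        yield train, target
```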
Metrics for evaluating performance
Once we have a model and an evaluation period, we can evaluate the model on it. These are the metrics we typically use (see the short code sketch after this list):
- Errors: Let's start simply by calculating the errors, which are the differences between the forecasted values from our model and the actual values over the evaluation period.
- MSE: The most common metric for evaluating forecasting performance is the mean squared error, or MSE, where we square the errors and then compute their mean. But why square them? To get rid of negative values: if one forecast is two above the actual value its error is two, and if another is two below its error is minus two, and the two could effectively cancel each other out, which would be wrong because we have two errors, not none. If we square the errors before averaging, both become four, so they no longer cancel and count equally.
- RMSE: If we want the error measure to be on the same scale as the original values, we just take the square root of the MSE, giving us the root mean squared error, or RMSE.
- MAE: The mean absolute error, or MAE, is also called the mean absolute deviation, or MAD. In this case, instead of squaring to get rid of negatives, we just take the absolute value of the errors. This does not penalize large errors as much as the MSE does. Depending on your task, you may prefer the MAE or the MSE: if large errors are potentially dangerous and cost you much more than smaller errors, you may prefer the MSE, but if your gain or loss is simply proportional to the size of the error, the MAE may be better.
- MAPE: The mean absolute percentage error, or MAPE, is the mean ratio between the absolute error and the absolute value; it gives an idea of the size of the errors relative to the values themselves.
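Here is a minimal NumPy sketch of all four metrics (the function name `forecast_metrics` is mine, purely for illustration):

```python
import numpy as np

def forecast_metrics(actual, forecast):
    errors = forecast - actual
    mse = np.square(errors).mean()         # mean squared error
    rmse = np.sqrt(mse)                    # root mean squared error
    mae = np.abs(errors).mean()            # mean absolute error (a.k.a. MAD)
    mape = np.abs(errors / actual).mean()  # mean absolute percentage error
    return {"mse": mse, "rmse": rmse, "mae": mae, "mape": mape}
```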
Let's keep this conversation moving… Next!
Moving average and differencing
A common and very simple forecasting method is to calculate a moving average. The idea is that the yellow line is a plot of the average of the blue values over a fixed period called an averaging window, for example 30 days. This nicely eliminates a lot of the noise and gives us a curve roughly emulating the original series, but it does not anticipate trend or seasonality. Depending on how far into the future you want to forecast, it can actually end up being worse than a naive forecast.
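As a rough sketch, assuming the series is a NumPy array, the forecast can be as simple as averaging the last `window_size` values at each step (the function name is mine, and the same helper is reused in the forecasting section below):

```python
import numpy as np

def moving_average_forecast(series, window_size):
    # Forecast the mean of the last `window_size` values
    # (if window_size == 1, this reduces to a naive forecast)
    forecast = []
    for t in range(len(series) - window_size):
        forecast.append(series[t:t + window_size].mean())
    return np.array(forecast)
```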
One method to avoid this is to remove the trend and seasonality from the time series with a technique called differencing. Instead of studying the time series itself, we study the difference between the value at time t and the value at an earlier period. Depending on the periodicity of our data, that period might be a year, a month, a day, or whatever makes sense. With this approach we can get surprisingly close to optimal. Keep this in mind before you rush into deep learning: simple approaches can sometimes work just fine.
Forecasting
The next code block will set up the time series with seasonality, trend and a bit of noise.
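Reusing the helper functions sketched earlier, a setup along these lines would do it (the baseline, slope, amplitude and noise level are illustrative):

```python
time = np.arange(4 * 365 + 1, dtype="float32")  # about four years of daily data
baseline = 10
amplitude = 40
slope = 0.05
noise_level = 5

series = (baseline
          + trend(time, slope)
          + seasonality(time, period=365, amplitude=amplitude)
          + white_noise(time, noise_level, seed=42))

plot_series(time, series)
plt.show()
```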
Now that we have the time series, let's split it so we can start forecasting:
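A simple split at time step 1000 could look like this:

```python
split_time = 1000
time_train, x_train = time[:split_time], series[:split_time]
time_valid, x_valid = time[split_time:], series[split_time:]
```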
Naive Forecast
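A naive forecast just shifts the series one step forward, so each prediction equals the previous actual value. A minimal sketch:

```python
# Predict that each value will simply equal the previous one
naive_forecast = series[split_time - 1:-1]

plot_series(time_valid, x_valid, label="Series")
plot_series(time_valid, naive_forecast, label="Forecast")
plt.show()
```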
Let’s zoom in on the start of the validation period:
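Using the `start` and `end` arguments of the plotting helper sketched earlier:

```python
plot_series(time_valid, x_valid, start=0, end=150, label="Series")
plot_series(time_valid, naive_forecast, start=1, end=151, label="Forecast")
plt.show()
```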
You can see that the naive forecast lags one step behind the time series. Now let's compute the mean squared error and the mean absolute error between the forecasts and the actual values over the validation period:
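Assuming TensorFlow 2.x with eager execution, Keras's metric functions make this a one-liner each:

```python
import tensorflow as tf

print(tf.keras.metrics.mean_squared_error(x_valid, naive_forecast).numpy())
print(tf.keras.metrics.mean_absolute_error(x_valid, naive_forecast).numpy())
```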
That’s our baseline, now let’s try a moving average:
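Reusing the `moving_average_forecast` helper sketched earlier, with a 30-day window:

```python
moving_avg = moving_average_forecast(series, 30)[split_time - 30:]

plot_series(time_valid, x_valid, label="Series")
plot_series(time_valid, moving_avg, label="Moving average (30 days)")
plt.show()

print(tf.keras.metrics.mean_squared_error(x_valid, moving_avg).numpy())
print(tf.keras.metrics.mean_absolute_error(x_valid, moving_avg).numpy())
```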
That's worse than the naive forecast!
The moving average does not anticipate trend or seasonality, so let's try to remove them by using differencing. Since the seasonality period is 365 days, we will subtract the value at time *t* - 365 from the value at time *t*.
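A sketch of the differencing step:

```python
# Subtract the value from 365 steps earlier to remove trend and seasonality
diff_series = series[365:] - series[:-365]
diff_time = time[365:]

plot_series(diff_time, diff_series)
plt.show()
```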
Great, the trend and seasonality seem to be gone, so now we can use the moving average:
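For example, with a 50-day window (the window size is illustrative):

```python
diff_moving_avg = moving_average_forecast(diff_series, 50)[split_time - 365 - 50:]

plot_series(time_valid, diff_series[split_time - 365:], label="Differenced series")
plot_series(time_valid, diff_moving_avg, label="Moving average")
plt.show()
```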
Now let's bring back the trend and seasonality by adding the past values from *t* - 365:
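A sketch, adding the raw values from one season earlier back onto the forecast:

```python
# Add back the value from t - 365 to restore trend and seasonality
diff_moving_avg_plus_past = series[split_time - 365:-365] + diff_moving_avg

plot_series(time_valid, x_valid, label="Series")
plot_series(time_valid, diff_moving_avg_plus_past, label="Forecast")
plt.show()

print(tf.keras.metrics.mean_squared_error(x_valid, diff_moving_avg_plus_past).numpy())
print(tf.keras.metrics.mean_absolute_error(x_valid, diff_moving_avg_plus_past).numpy())
```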
Better than a naive forecast, good. However, the forecasts look a bit too random, because we're just adding past values, which were noisy. Let's use a moving average on those past values to remove some of the noise:
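For instance, smoothing the past values with a small window roughly centered on *t* - 365 (the exact indices and the 10-step window below are illustrative, chosen so the lengths line up):

```python
# Smooth the value from about a year ago before adding it back
diff_moving_avg_plus_smooth_past = (
    moving_average_forecast(series[split_time - 370:-360], 10) + diff_moving_avg
)

plot_series(time_valid, x_valid, label="Series")
plot_series(time_valid, diff_moving_avg_plus_smooth_past, label="Forecast")
plt.show()

print(tf.keras.metrics.mean_squared_error(x_valid, diff_moving_avg_plus_smooth_past).numpy())
print(tf.keras.metrics.mean_absolute_error(x_valid, diff_moving_avg_plus_smooth_past).numpy())
```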
Conclusion
So we explored the nature of time series data and saw some of its more common attributes, including seasonality and trend. We also looked at some statistical methods for predicting time series data. Wait, where is the Deep Learning? Tune in to Part #2, where we slowly begin to dive deep into DNNs for time series classification and more!!!
References
All the notes in this post were taken from https://www.coursera.org/professional-certificates/tensorflow-in-practice. If you are interested in Deep Learning & AI, I highly recommend taking this course.
TensorFlow API docs: https://www.tensorflow.org/api_docs