Machine Learning Strategies for Time Series Prediction
Machine Learning Summer School
(Hammamet, 2013)
Gianluca Bontempi
Machine Learning Group, Computer Science Department
Boulevard du Triomphe - CP 212
http://www.ulb.ac.be/di
Introducing myself
1992: Computer science engineer (Politecnico di Milano, Italy),
1994: Researcher in robotics in IRST, Trento, Italy,
1995: Researcher in IRIDIA, ULB Artificial Intelligence Lab, Brussels,
1996-97: Researcher in IDSIA, Artificial Intelligence Lab, Lugano,
Switzerland,
1998-2000: Marie Curie fellowship in IRIDIA, ULB Artificial Intelligence
Lab, Brussels,
2000-2001: Scientist in Philips Research, Eindhoven, The Netherlands,
2001-2002: Scientist in IMEC, Microelectronics Institute, Leuven,
Belgium,
since 2002: professor in Machine Learning, Modeling and Simulation,
Bioinformatics in ULB Computer Science Dept., Brussels,
since 2004: head of the ULB Machine Learning Group (MLG).
since 2013: director of the Interuniversity Institute of Bioinformatics in
Brussels (IB)2 , ibsquare.be.
Website: mlg.ulb.ac.be.
Scientific collaborations outside ULB: Harvard Dana Farber (US), UCL Machine
Learning Group (B), Politecnico di Milano (I), Università del Sannio (I), Inst Rech
Cliniques Montréal (CAN).
Outline
Notions of time series (30 mins)
conditional probability
Machine learning for prediction (45 mins)
bias/variance
parametric and structural identification
validation
model selection
feature selection
COFFEE BREAK
Local learning (15 mins)
Forecasting: one-step and multi-step-ahead (30 mins)
Some applications (15 mins)
time series competitions
wireless sensor
biomedical
marketing
ML paved the way for the treatment of real problems related to data analysis
that are sometimes overlooked by statisticians (nonlinearity,
classification, pattern recognition, missing variables, adaptivity,
optimization, massive datasets, data management, causality,
representation of knowledge, parallelisation)
Positive attitude:
Interdisciplinary attitude:
Time series
Definition: A time series is a sequence of observations $s_t \in \mathbb{R}$, usually
ordered in time.
Understanding
Description
A general model
Let an observed discrete univariate time series be $s_1, \ldots, s_T$. This means
that we have $T$ numbers which are observations on some variable made at $T$
equally spaced time points, which for convenience we label $t = 1, \ldots, T$.
A general model describes the series as the sum of two components:
Systematic part: a deterministic function of time.
Stochastic sequence: a random component governed by a probability law.
Types of variation
Traditional methods of time-series analysis are mainly concerned with
decomposing the variation of a series $s_t$ into:
Trend
Seasonal effect
Irregular fluctuations
We will assume here that once we have detrended and deseasonalized the
series, we can still extract information about the dependency between the
past and the future. Henceforth $\varphi_t$ will denote the detrended and
deseasonalized series.
[Figure: classical decomposition of an observed series (1960-1990) into trend, seasonal and random components.]
Conditional probability: $p(\varphi_2 \mid \varphi_1) = \dfrac{p(\varphi_1, \varphi_2)}{p(\varphi_1)}$
Stochastic processes
The stochastic approach to time series makes the assumption that a
time series is a realization of a stochastic process (like tossing an
unbiased coin is the realization of a discrete random variable with equal
head/tail probability).
A discrete-time stochastic process is a collection of random variables $\varphi_t$,
$t = 1, \ldots, T$, defined by a joint density
$$p(\varphi_1, \ldots, \varphi_T)$$
Statistical time-series analysis is concerned with evaluating the
properties of the probability model which generated the observed time
series.
Statistical time-series modeling is concerned with inferring the properties
of the probability model which generated the observed time series from a
limited set of observations.
Properties
If $\varphi_t$ is strictly stationary and its first two moments are finite, we have
$n = 1$: $\quad E[\varphi_t] = \mu_t = \mu$
$n = 2$: $\quad \text{Var}[\varphi_t] = \sigma_t^2 = \sigma^2$, and the autocovariance $\gamma(k)$ depends only on the lag $k$. The autocorrelation function is
$$\rho(k) = \frac{\gamma(k)}{\sigma^2} = \frac{\gamma(k)}{\gamma(0)}$$
Another relevant function is the partial autocorrelation function $\pi(k)$,
where $\pi(k)$, $k > 1$, measures the degree of association between $\varphi_t$ and
$\varphi_{t-k}$ when the effects of the intermediate lags $1, \ldots, k-1$ are removed.
Weak stationarity
A less restricted definition of stationarity concerns only the first two
moments of $\varphi_t$.
Definition A process is called second-order stationary or weakly stationary
if its mean is constant and its autocovariance function depends only on
the lag.
No assumptions are made about higher moments than those of second
order.
Strict stationarity implies weak stationarity but not vice versa in general.
Definition: A process is called normal if the joint distribution of
$\varphi_{t_1}, \varphi_{t_2}, \ldots, \varphi_{t_n}$ is multivariate normal for all $t_1, \ldots, t_n$.
In the special case of normal processes, weak stationarity implies strict
stationarity. This is due to the fact that a normal process is completely
specified by the mean and the autocovariance function.
The mean, autocovariance and autocorrelation can be estimated from the observed series by
$$\hat{\mu} = \frac{\sum_{t=1}^{T} \varphi_t}{T}, \qquad \hat{\gamma}(k) = \frac{\sum_{t=1}^{T-k} (\varphi_t - \hat{\mu})(\varphi_{t+k} - \hat{\mu})}{T - k - 1}, \quad k < T/2, \qquad \hat{\rho}(k) = \frac{\hat{\gamma}(k)}{\hat{\gamma}(0)}$$
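As an illustration (not part of the original slides), a minimal Python sketch of these estimators; the function name `sample_acf` and the white-noise test case are illustrative assumptions.

```python
import numpy as np

def sample_acf(phi, max_lag):
    """Sample autocovariance and autocorrelation, following the estimators above."""
    phi = np.asarray(phi, dtype=float)
    T = len(phi)
    mu_hat = phi.mean()
    gamma_hat = np.empty(max_lag + 1)
    for k in range(max_lag + 1):
        # sum_{t=1}^{T-k} (phi_t - mu)(phi_{t+k} - mu) / (T - k - 1)
        gamma_hat[k] = np.sum((phi[:T - k] - mu_hat) * (phi[k:] - mu_hat)) / (T - k - 1)
    rho_hat = gamma_hat / gamma_hat[0]
    return gamma_hat, rho_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    white_noise = rng.normal(size=1000)
    _, rho = sample_acf(white_noise, 30)
    print(np.round(rho[:5], 3))  # close to [1, 0, 0, 0, 0] for white noise
```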
White noise
[Figure: a realization of a white-noise series of length 1000 and its sample ACF up to lag 30; the autocorrelations at nonzero lags are negligible.]
Random walk
Suppose that $w_t$ is a discrete, purely random process with mean $\mu$ and
variance $\sigma_w^2$.
A process $\varphi_t$ is said to be a random walk if
$$\varphi_t = \varphi_{t-1} + w_t$$
The next value of a random walk is obtained by adding a random
shock to the latest value.
If $\varphi_0 = 0$ then
$$\varphi_t = \sum_{i=1}^{t} w_i$$
so that $E[\varphi_t] = t\mu$ and $\text{Var}[\varphi_t] = t\sigma_w^2$.
[Figure: a realization of a random walk of length 500.]
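A short sketch (illustrative, not from the slides) simulating random walks and checking empirically that $E[\varphi_t] \approx t\mu$ and $\text{Var}[\varphi_t] \approx t\sigma_w^2$; the values of $\mu$, $\sigma_w$ and the number of paths are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma_w, t_max, n_paths = 0.1, 1.0, 500, 2000

# each path: phi_t = cumulative sum of the shocks w_i (with phi_0 = 0)
shocks = rng.normal(loc=mu, scale=sigma_w, size=(n_paths, t_max))
paths = np.cumsum(shocks, axis=1)

t = t_max
print("empirical mean:", paths[:, t - 1].mean(), "theory:", t * mu)
print("empirical var :", paths[:, t - 1].var(), "theory:", t * sigma_w**2)
```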
Autoregressive processes
Suppose that $w_t$ is a purely random process with mean zero and
variance $\sigma_w^2$.
A process $\varphi_t$ is said to be an autoregressive process of order $n$ (also an
AR(n) process) if
$$\varphi_t = \alpha_1 \varphi_{t-1} + \cdots + \alpha_n \varphi_{t-n} + w_t$$
This means that the next value is a linear weighted sum of the past $n$
values plus a random shock.
Finite memory filter.
If $w$ is a normal variable, $\varphi_t$ will be normal too.
Note that this is like a linear regression model where $\varphi$ is regressed not
on independent variables but on its own past values (hence the prefix auto).
The properties of stationarity depend on the values $\alpha_i$, $i = 1, \ldots, n$.
For an AR(1) process with coefficient $\alpha$, $|\alpha| < 1$:
$$\text{Var}[\varphi_t] = \sigma_w^2 (1 + \alpha^2 + \alpha^4 + \cdots) = \frac{\sigma_w^2}{1 - \alpha^2}, \qquad \rho(k) = \alpha^{k}, \quad k = 0, 1, 2, \ldots$$
Example: AR(2)
[Figure: a simulated AR(2) series of length 1000 together with its sample ACF and partial ACF up to lag 30.]
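For instance, an AR(2) series can be simulated and its sample ACF inspected with a few lines of Python (a sketch; the coefficients 0.6 and 0.3 are arbitrary stationary choices, not those of the original figure).

```python
import numpy as np

rng = np.random.default_rng(2)
T, a1, a2 = 1000, 0.6, 0.3   # arbitrary stationary AR(2) coefficients

phi = np.zeros(T)
w = rng.normal(size=T)
for t in range(2, T):
    phi[t] = a1 * phi[t - 1] + a2 * phi[t - 2] + w[t]

# sample autocorrelation up to lag 30 (same estimator as before)
mu_hat = phi.mean()
gamma = np.array([np.sum((phi[:T - k] - mu_hat) * (phi[k:] - mu_hat)) / (T - k - 1)
                  for k in range(31)])
print(np.round(gamma / gamma[0], 2))   # slowly decaying ACF, typical of an AR process
```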
The AR(n) coefficients can be estimated by least squares:
$$\hat{\alpha} = \arg\min_{\alpha} \sum_{t=n+1}^{T} \left[ \varphi_t - \alpha_1 \varphi_{t-1} - \cdots - \alpha_n \varphi_{t-n} \right]^2$$
In matrix form, with
$$Y = \begin{bmatrix} \varphi_T \\ \varphi_{T-1} \\ \vdots \\ \varphi_{n+1} \end{bmatrix}, \qquad X = \begin{bmatrix} \varphi_{T-1} & \varphi_{T-2} & \cdots & \varphi_{T-n} \\ \varphi_{T-2} & \varphi_{T-3} & \cdots & \varphi_{T-n-1} \\ \vdots & \vdots & \ddots & \vdots \\ \varphi_{n} & \varphi_{n-1} & \cdots & \varphi_{1} \end{bmatrix} \qquad (1)$$
the least-squares estimate is
$$\hat{\alpha} = (X^T X)^{-1} X^T Y$$
where the $X^T X$ matrix is a symmetric $[n \times n]$ matrix which plays an
important role in multiple linear regression.
Conventional linear regression theory also provides confidence intervals
and significance tests for the AR(n) coefficients.
A recursive version of least-squares, i.e. where time samples arrive
sequentially, is provided by the RLS algorithm.
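A minimal sketch of the least-squares identification of an AR(n) model (illustrative; `np.linalg.lstsq` is used instead of forming $(X^TX)^{-1}$ explicitly, for numerical stability).

```python
import numpy as np

def fit_ar_ls(phi, n):
    """Least-squares estimate of the AR(n) coefficients of a series phi."""
    phi = np.asarray(phi, dtype=float)
    T = len(phi)
    # design matrix: the row for time t contains [phi_{t-1}, ..., phi_{t-n}]
    X = np.column_stack([phi[n - j - 1: T - j - 1] for j in range(n)])
    Y = phi[n:]
    alpha_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return alpha_hat

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    T, true_alpha = 2000, np.array([0.6, 0.3])
    phi = np.zeros(T)
    for t in range(2, T):
        phi[t] = true_alpha @ phi[t - 2:t][::-1] + rng.normal()
    print(fit_ar_ls(phi, 2))   # should be close to [0.6, 0.3]
```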
Supervised learning
[Figure: the supervised learning setting: an unknown dependency links input and output; a model is estimated from a training dataset and assessed through its prediction error.]
$$y = f(x) + w$$
where $f(\cdot)$ is a deterministic function and the term $w$ represents the
noise or random error. It is typically assumed that $w$ is independent of $x$
and $E[w] = 0$.
Suppose that we have available a training set $\{\langle x_i, y_i \rangle : i = 1, \ldots, N\}$,
where $x_i = (x_{i1}, \ldots, x_{in})$ and $y_i$, generated according to the previous
model.
The goal of a learning procedure is to estimate a model $\hat{f}(x)$ which is
able to give a good approximation of the unknown function $f(x)$.
But how to choose $\hat{f}$, if we do not know the probability distribution
underlying the data and we have only a limited training set?
[Figure: fits of polynomial models of increasing degree to the same dataset.]
Model degree 1: $\hat{f}(x) = \beta_0 + \beta_1 x$
Model degree 3: $\hat{f}(x) = \beta_0 + \beta_1 x + \cdots + \beta_3 x^3$
Model degree 18: $\hat{f}(x) = \beta_0 + \beta_1 x + \cdots + \beta_{18} x^{18}$
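The effect of the model degree can be reproduced with a small Python sketch (illustrative; the data-generating function and the noise level are arbitrary assumptions, not the dataset of the original figure).

```python
import numpy as np

rng = np.random.default_rng(4)
N = 25
x = np.linspace(-2, 2, N)
y = x**3 - 2 * x + rng.normal(scale=1.0, size=N)   # assumed cubic target + noise

for degree in (1, 3, 18):
    coeffs = np.polyfit(x, y, degree)           # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    mse_train = np.mean((y - y_hat) ** 2)
    print(f"degree {degree:2d}: training MSE = {mse_train:.3f}")
# the training error always decreases with the degree, which says nothing
# about the generalization error (the degree-18 model typically overfits).
```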
The expected prediction error can be decomposed as
$$E\left[ (y - \hat{f}(x))^2 \right] = \sigma_w^2 + \left( f(x) - E[\hat{f}(x)] \right)^2 + E\left[ \left( \hat{f}(x) - E[\hat{f}(x)] \right)^2 \right] = \text{noise} + \text{bias}^2 + \text{variance}$$
(the expectations of the bias and variance terms are taken over the training sets),
where the intrinsic noise term reflects the target alone, the bias reflects
the target's relation with the learning algorithm and the variance term
reflects the learning algorithm alone.
This result is purely theoretical since these quantities cannot be
measured on the basis of a finite amount of data.
However, this result provides insight about what makes a learning
process accurate.
In other terms, it is commonly said that a hypothesis with large bias but
low variance underfits the data while a hypothesis with low bias but
large variance overfits the data.
In both cases, the hypothesis gives a poor representation of the target
and a reasonable trade-off needs to be found.
The task of the model designer is to search for the optimal trade-off
between the variance and the bias term, on the basis of the available
training set.
Bias/variance trade-off
[Figure: generalization error vs. model complexity: the bias term decreases and the variance term increases with complexity; low complexity leads to underfitting, high complexity to overfitting.]
Parametric identification
The parametric identification of the hypothesis is done according to the ERM
(Empirical Risk Minimization) principle, where
$$\hat{\alpha}_N = \alpha(D_N) = \arg\min_{\alpha} \widehat{\text{MISE}}_{\text{emp}}(\alpha)$$
and
$$\widehat{\text{MISE}}_{\text{emp}}(\alpha) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{f}(x_i, \alpha) \right)^2$$
Model assessment
We have seen before that the training error is not a good estimator (i.e. it
is too optimistic) of the generalization capability of the learned model.
Two alternatives exist:
1. Complexity-based penalty criteria
2. Data-driven validation techniques
Complexity-based penalization
In conventional statistics, various criteria have been developed, often in
the context of linear models, for assessing the generalization
performance of the learned hypothesis without the use of further
validation data.
Such criteria take the form of a sum of two terms
$$\widehat{\text{PE}} = \widehat{\text{MISE}}_{\text{emp}} + \text{complexity term}$$
where the complexity term represents a penalty which grows as the
number of free parameters in the model grows.
This expression quantifies the qualitative consideration that simple
models return high training error with a reduced complexity term while
complex models have a low training error thanks to the high number of
parameters.
The minimum for the criterion represents a trade-off between
performance on the training set and complexity.
Examples are the GCV and PSE criteria, e.g.
$$\text{GCV} = \frac{\widehat{\text{MISE}}_{\text{emp}}}{\left(1 - \frac{p}{N}\right)^2}, \qquad \text{PSE} = \widehat{\text{MISE}}_{\text{emp}} + 2 \hat{\sigma}_w^2 \frac{p}{N}$$
where $p$ is the number of free parameters, $L(N)$ denotes the log-likelihood used by likelihood-based criteria (e.g. AIC) and $\hat{\sigma}_w^2$ is an estimate of the variance of the noise.
Testing: the model is assessed on a separate test set, not used for the parametric identification.
Holdout: the available dataset is split once into a training portion and a validation portion.
Cross-validation: the K-fold cross-validation estimate of the generalization error is
$$\widehat{\text{MISE}}_{\text{CV}} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i^{k(i)} \right)^2 = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{f}\left(x_i, \hat{\alpha}^{k(i)}\right) \right)^2$$
where $\hat{y}_i^{k(i)}$ denotes the fitted value for the $i$th observation returned by the
model estimated with the $k(i)$th part of the data removed.
10-fold cross-validation
K = 10: at each iteration 90% of the data are used for training and the remaining
10% for testing.
[Figure: partition of the dataset into 10 folds; at each iteration one fold (10%) is held out for testing and the rest (90%) is used for training.]
Leave-one-out ($K = N$): each observation is left out in turn, giving
$$\widehat{\text{MISE}}_{\text{LOO}} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i^{-i} \right)^2 = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{f}\left(x_i, \hat{\alpha}^{-i}\right) \right)^2$$
where $\hat{\alpha}^{-i}$ is estimated with the $i$th observation removed.
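As an illustration (not from the original slides), a minimal sketch of the K-fold cross-validation estimate; a polynomial learner stands in for $\hat{f}$ and the function name `kfold_cv_mse` is an assumption.

```python
import numpy as np

def kfold_cv_mse(x, y, degree, K=10, seed=0):
    """K-fold cross-validation estimate of the MISE for a polynomial model."""
    N = len(x)
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, K)
    sq_errors = np.empty(N)
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        coeffs = np.polyfit(x[train], y[train], degree)   # parametric identification
        sq_errors[fold] = (y[fold] - np.polyval(coeffs, x[fold])) ** 2
    return sq_errors.mean()

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    x = np.linspace(-2, 2, 200)
    y = x**3 - 2 * x + rng.normal(scale=1.0, size=200)
    for d in (1, 3, 18):
        print(f"degree {d:2d}: 10-fold CV MSE = {kfold_cv_mse(x, y, d):.3f}")
```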
Model selection
Model selection concerns the final choice of the model structure
By structure we mean:
family of the approximator (e.g. linear, nonlinear) and, if nonlinear,
which kind of learner (e.g. neural networks, support vector
machines, nearest neighbours, regression trees)
the value of hyperparameters (e.g. number of hidden layers and of
hidden nodes in NN, number of neighbors in KNN, number of
levels in trees)
number and set of input variables
This choice is typically the result of a compromise between different
factors, like the quantitative measures, the personal experience of the
designer and the effort required to implement a particular model in
practice.
Here we will consider only quantitative criteria. There are two possible
approaches:
1. the winner-takes-all approach
2. the combination of estimators approach.
Model selection
[Figure: the learning procedure: a realization of the stochastic process provides the training set $D_N$; for each class of hypotheses a parametric identification returns the candidate models, which are compared in a validation step; the structural identification (model selection) returns the learned model.]
Winner-takes-all
The best hypothesis is selected in the set $\{\hat{\alpha}^s_N\}$, with $s = 1, \ldots, S$, according
to
$$\hat{s} = \arg\min_{s = 1, \ldots, S} \widehat{\text{MISE}}(s)$$
A model with complexity $\hat{s}$ is trained on the whole dataset $D_N$ and used for
future predictions.
Winner-takes-all pseudo-code
1. for $s = 1, \ldots, S$ (structural loop):
   for $j = 1, \ldots, N$:
   (a) Inner parametric identification (for leave-one-out): $\hat{\alpha}^s_{N-1} = \arg\min_{\alpha} \sum_{i=1, i \neq j}^{N} \left( y_i - \hat{f}(x_i, \alpha) \right)^2$
   (b) $e_j = y_j - \hat{f}(x_j, \hat{\alpha}^s_{N-1})$
   $\widehat{\text{MISE}}_{\text{LOO}}(s) = \frac{1}{N} \sum_{j=1}^{N} e_j^2$
2. Model selection: $\hat{s} = \arg\min_{s = 1, \ldots, S} \widehat{\text{MISE}}_{\text{LOO}}(s)$
3. Final parametric identification: $\hat{\alpha}^{\hat{s}}_N = \arg\min_{\alpha} \sum_{i=1}^{N} \left( y_i - \hat{f}(x_i, \alpha) \right)^2$
4. The output prediction model is $\hat{f}(\cdot, \hat{\alpha}^{\hat{s}}_N)$.
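A sketch of the winner-takes-all procedure above, using leave-one-out and a polynomial family indexed by its degree $s$ (illustrative assumptions; any learner could replace `np.polyfit`).

```python
import numpy as np

def loo_mse(x, y, degree):
    """Leave-one-out estimate of the MISE for a polynomial of given degree."""
    N = len(x)
    errors = np.empty(N)
    for j in range(N):
        keep = np.arange(N) != j                       # remove the j-th sample
        coeffs = np.polyfit(x[keep], y[keep], degree)  # inner parametric identification
        errors[j] = y[j] - np.polyval(coeffs, x[j])
    return np.mean(errors**2)

def winner_takes_all(x, y, degrees):
    loo = {s: loo_mse(x, y, s) for s in degrees}
    s_best = min(loo, key=loo.get)                     # model selection
    final_coeffs = np.polyfit(x, y, s_best)            # final parametric identification
    return s_best, final_coeffs

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    x = np.linspace(-2, 2, 60)
    y = x**3 - 2 * x + rng.normal(scale=1.0, size=60)
    s_best, _ = winner_takes_all(x, y, degrees=range(1, 8))
    print("selected degree:", s_best)
```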
Model combination
The winner-takes-all approach is intuitively the approach which should
work the best.
However, recent results in machine learning show that the performance
of the final model can be improved not by choosing the model structure
which is expected to predict the best but by creating a model whose
output is the combination of the output of models having different
structures.
The reason is that in reality any chosen hypothesis f(, N ) is only an
estimate of the real target and, like any estimate, is affected by a bias
and a variance term.
Theoretical results on the combination of estimators show that the
combination of unbiased estimators leads to an unbiased estimator with
reduced variance.
This principle is at the basis of approaches like bagging or boosting.
Curse of dimensionality
The error of the best model decreases with n but the mean integrated
squared error of models increases faster than linearly in n.
In high dimensions, all data sets are sparse.
In high dimensions, the number of possible models to consider
increases super-exponentially in $n$.
In high dimensions, all datasets show multicollinearity.
As $n$ increases the amount of local data goes to zero.
For a uniform distribution around a query point $x_q$, the fraction of data
points contained in a ball of radius $r < 1$ centered at $x_q$ grows like $r^n$.
[Figure: fraction of local points falling within radius $r$ of the query point, for dimensions $n = 1, 2, 3, \ldots, 100$: as $n$ grows the fraction becomes negligible unless $r$ approaches 1.]
The size of the neighborhood on which we can estimate local features of the
output (e.g. E[y|x]) increases with dimension n, making the estimation
coarser and coarser.
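The $r^n$ behaviour can be checked numerically with a few lines of Python (a small sketch assuming a uniform distribution on the unit ball around the query point).

```python
import numpy as np

# fraction of a uniform distribution (on the unit ball around x_q)
# that falls within radius r: fraction = r**n
for n in (1, 2, 3, 10, 100):
    for r in (0.5, 0.9):
        print(f"n={n:3d}, r={r}: fraction of local points = {r**n:.2e}")
    # radius needed to capture 10% of the data: r = 0.1**(1/n)
    print(f"n={n:3d}: radius containing 10% of the data = {0.1 ** (1 / n):.3f}")
```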
Filter methods: rank or select the input variables according to a relevance measure computed independently of the learning algorithm.
Wrapper methods: assess subsets of variables through the validation error of the learner trained on them.
Embedded methods: perform the selection as part of the learning procedure itself.
Local learning
[Figure: local model fits with large and small bandwidth; generalization error vs. 1/bandwidth: many neighbors (large bandwidth) give high bias and underfitting, few neighbors (small bandwidth) give high variance and overfitting.]
[Figure: two ways of computing the leave-one-out error: a single parametric identification on the N samples followed by the PRESS statistic (left), or N repetitions of training and test (right).]
The leave-one-out error can be computed in two equivalent ways: the slowest
way (on the right), which repeats the training and test procedure N times, and
the fastest way (on the left), which performs the parametric identification and
the computation of the PRESS statistic only once.
1. The parametric identification is carried out on all the $N$ samples:
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
2. This procedure is performed only once on the $N$ samples and
returns as a by-product the Hat matrix
$$H = X (X^T X)^{-1} X^T$$
3. We compute the residual vector $e$, whose $j$th term is $e_j = y_j - x_j^T \hat{\beta}$.
4. We use the PRESS statistic to compute the leave-one-out residual $e^{\text{loo}}_j$ as
$$e^{\text{loo}}_j = \frac{e_j}{1 - H_{jj}}, \qquad \widehat{\text{MISE}}_{\text{LOO}} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{e_i}{1 - H_{ii}} \right)^2$$
Note that PRESS is not an approximation of the loo error but simply a faster
way of computing it.
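A sketch verifying numerically that the PRESS residuals coincide with the explicit leave-one-out residuals of a linear model (illustrative; the data are randomly generated).

```python
import numpy as np

rng = np.random.default_rng(7)
N, n = 50, 3
X = rng.normal(size=(N, n))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=N)

# single fit on the N samples + hat matrix
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
H = X @ XtX_inv @ X.T
e = y - X @ beta
e_press = e / (1 - np.diag(H))              # PRESS leave-one-out residuals

# explicit leave-one-out (N separate fits), for comparison
e_loo = np.empty(N)
for j in range(N):
    keep = np.arange(N) != j
    b_j = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    e_loo[j] = y[j] - X[j] @ b_j

print(np.allclose(e_press, e_loo))          # True: PRESS is exact, not an approximation
print("MISE_loo =", np.mean(e_press**2))
```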
The prediction in a query point $x_q$ can be returned by the local model with the best bandwidth (number of neighbors) $k$:
$$\hat{y}_q = x_q^T \hat{\beta}(\hat{k}), \qquad \text{with } \hat{k} = \arg\min_{k} \widehat{\text{MISE}}_{\text{LOO}}(k)$$
or by a combination of the $b$ best local models:
$$\hat{y}_q = \frac{\sum_{i=1}^{b} \zeta_i \, \hat{y}_q(k_i)}{\sum_{i=1}^{b} \zeta_i},$$
where the weights are the inverse of the leave-one-out mean square errors: $\zeta_i = 1 / \widehat{\text{MISE}}_{\text{LOO}}(k_i)$.
Forecasting
[Figure: one-step-ahead predictor: the approximator $\hat{f}$ takes as inputs the delayed values $\varphi_{t-1}, \varphi_{t-2}, \ldots, \varphi_{t-n}$ produced by a chain of unit-delay operators $z^{-1}$.]
The approximator $\hat{f}$ returns the prediction of the value of the time series at
time $t + 1$ as a function of the $n$ previous values (the rectangular box
containing $z^{-1}$ represents a unit delay operator, i.e., $\varphi_{t-1} = z^{-1} \varphi_t$).
Iterated prediction
[Figure: iterated (recursive) predictor: the predicted value is fed back through the chain of unit-delay operators $z^{-1}$ and reused as an input for the next prediction.]
The approximator $\hat{f}$ returns the prediction of the value of the time series at
time $t + 1$ by iterating the predictions obtained in the previous steps (the
rectangular box containing $z^{-1}$ represents a unit delay operator, i.e.,
$\varphi_{t-1} = z^{-1} \varphi_t$).
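A minimal sketch of the recursive (iterated) strategy, with a linear AR model used as the one-step approximator $\hat{f}$ (an illustrative assumption; any one-step learner could be used instead).

```python
import numpy as np

def recursive_forecast(phi, n, H):
    """Fit a linear AR(n) one-step model and iterate it H steps ahead."""
    phi = np.asarray(phi, dtype=float)
    T = len(phi)
    X = np.column_stack([phi[n - j - 1: T - j - 1] for j in range(n)])
    Y = phi[n:]
    alpha, *_ = np.linalg.lstsq(X, Y, rcond=None)

    window = list(phi[-n:][::-1])          # [phi_{T-1}, ..., phi_{T-n}]
    forecasts = []
    for _ in range(H):
        pred = float(np.dot(alpha, window))
        forecasts.append(pred)
        window = [pred] + window[:-1]      # feed the prediction back as an input
    return np.array(forecasts)

if __name__ == "__main__":
    t = np.arange(1000)
    series = np.sin(2 * np.pi * t / 50) + 0.1 * np.random.default_rng(8).normal(size=1000)
    print(np.round(recursive_forecast(series, n=10, H=20), 2))
```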
The A chaotic time series has a training set of 1000 values: the task is to
predict the continuation for 100 steps, starting from different points.
[Figure: examples of predicted 100-step continuations of the A series.]
Direct strategy
The Direct strategy [22, 17, 7] learns independently $H$ models $f_h$
$$\varphi_{t+h-1} = f_h(\varphi_{t-1}, \ldots, \varphi_{t-n}) + w_{t+h-1}$$
with $h \in \{1, \ldots, H\}$ and returns a multi-step forecast by concatenating
the $H$ predictions.
Several machine learning models have been used to implement the
Direct strategy for multi-step forecasting tasks, for instance neural
networks [10], nearest neighbors [17] and decision trees [21].
Since the Direct strategy does not use any approximated values to
compute the forecasts, it is not prone to accumulation of errors:
each model is tailored to the horizon it is supposed to predict.
Notwithstanding, it has some weaknesses.
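A sketch of the Direct strategy, again with linear least-squares models as the $f_h$ (illustrative; in practice any learner can be plugged in for each horizon).

```python
import numpy as np

def direct_forecast(phi, n, H):
    """Learn one linear model per horizon h and concatenate the H predictions."""
    phi = np.asarray(phi, dtype=float)
    T = len(phi)
    forecasts = np.empty(H)
    for h in range(1, H + 1):
        # targets phi_{t+h-1}, inputs [phi_{t-1}, ..., phi_{t-n}]
        rows = range(n, T - h + 1)
        X = np.array([phi[t - n:t][::-1] for t in rows])
        Y = np.array([phi[t + h - 1] for t in rows])
        alpha_h, *_ = np.linalg.lstsq(X, Y, rcond=None)
        forecasts[h - 1] = alpha_h @ phi[-n:][::-1]    # forecast at horizon h
    return forecasts

if __name__ == "__main__":
    t = np.arange(1000)
    series = np.sin(2 * np.pi * t / 50) + 0.1 * np.random.default_rng(9).normal(size=1000)
    print(np.round(direct_forecast(series, n=10, H=20), 2))
```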
DirRec strategy
The DirRec strategy [18] combines the architectures and the principles
underlying the Direct and the Recursive strategies.
DirRec computes the forecasts with different models for every
horizon (like the Direct strategy) and, at each time step, it enlarges the
set of inputs by adding variables corresponding to the forecasts of the
previous step (like the Recursive strategy).
Unlike the previous strategies, the embedding size $n$ is not the same for
all the horizons. In other terms, the DirRec strategy learns $H$ models $f_h$
from the time series, where
$$\varphi_{t+h-1} = f_h(\varphi_{t+h-2}, \ldots, \varphi_{t-n}) + w_{t+h-1}$$
with $h \in \{1, \ldots, H\}$.
The technique is prone to the curse of dimensionality. The use of feature
selection is recommended for large h.
MIMO strategy
This strategy [3, 5] (also known as Joint strategy [10]) avoids the
simplistic assumption of conditional independence between future
values made by the Direct strategy by learning a single
multiple-output model
$$[\varphi_{t+H-1}, \ldots, \varphi_{t}] = F(\varphi_{t-1}, \ldots, \varphi_{t-n}) + \mathbf{w}$$
where $F : \mathbb{R}^d \to \mathbb{R}^H$ is a vector-valued function [15], and $\mathbf{w} \in \mathbb{R}^H$ is a
noise vector with a covariance that is not necessarily diagonal [13].
The forecasts are returned in one step by a multiple-input
multiple-output regression model.
In [5] we proposed a multi-output extension of the local learning
algorithm.
Other multi-output regression models could be taken into consideration,
like multi-output neural networks or partial least squares.
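A sketch of the MIMO strategy with a single multi-output linear least-squares model as $F$ (an illustrative stand-in; [5] uses a multi-output local learning algorithm instead).

```python
import numpy as np

def mimo_forecast(phi, n, H):
    """Learn a single multiple-output model mapping n past values to H future ones."""
    phi = np.asarray(phi, dtype=float)
    T = len(phi)
    rows = range(n, T - H + 1)
    X = np.array([phi[t - n:t][::-1] for t in rows])    # [phi_{t-1}, ..., phi_{t-n}]
    Y = np.array([phi[t:t + H] for t in rows])          # [phi_t, ..., phi_{t+H-1}]
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)           # one coefficient column per horizon
    return phi[-n:][::-1] @ A                           # the H forecasts in one step

if __name__ == "__main__":
    t = np.arange(1000)
    series = np.sin(2 * np.pi * t / 50) + 0.1 * np.random.default_rng(10).normal(size=1000)
    print(np.round(mimo_forecast(series, n=10, H=20), 2))
```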
[Figure: graphical representation of an $n = 2$ NAR dependency $\varphi_t = f(\varphi_{t-1}, \varphi_{t-2}) + w(t)$ and of the forecasts at horizons $t+1$, $t+2$, $t+3$.]
MIMO strategy
The rationale of the MIMO strategy is to model the stochastic dependency
between the predicted values that characterizes the time series. This
strategy avoids the conditional independence assumption made by the
Direct strategy as well as the accumulation of errors which plagues the
Recursive strategy.
So far, this strategy has been successfully applied to several real-world
multi-step time series forecasting tasks [3, 5, 19, 2].
However, the wish to preserve the stochastic dependencies constrains
all the horizons to be forecasted with the same model structure. Since
this constraint could reduce the flexibility of the forecasting
approach [19], a variant of the MIMO strategy (called DIRMO) has been
proposed in [19, 2].
Extensive validation on the 111 time series of the NN5 competition
showed that MIMO approaches are invariably better than single-output approaches.
Competitions
Santa Fe Time Series Prediction and Analysis Competition (1994) [22]:
International Workshop on Advanced Black-box techniques for nonlinear
modeling Competition (Leuven, Belgium; 1998)
NN3 competition [8]: 111 monthly time series drawn from homogeneous
population of empirical business time series.
NN5 competition [1]: 111 time series of the daily withdrawal amounts
from independent cash machines at different, randomly selected
locations across England.
Kaggle competition.
Accuracy measures
Let
$$e_{t+h} = \varphi_{t+h} - \hat{\varphi}_{t+h}$$
represent the error of the forecast $\hat{\varphi}_{t+h}$ at the horizon $h = 1, \ldots, H$. A
conventional measure of accuracy is the Normalized Mean Squared Error
$$\text{NMSE} = \frac{\sum_{h=1}^{H} (\varphi_{t+h} - \hat{\varphi}_{t+h})^2}{\sum_{h=1}^{H} (\varphi_{t+h} - \bar{\varphi})^2}$$
This quantity is smaller than one if the predictor performs better than the
naivest predictor, i.e. the average $\bar{\varphi}$.
Other measures rely on the relative or percentage error
$$pe_{t+h} = 100 \, \frac{\varphi_{t+h} - \hat{\varphi}_{t+h}}{\varphi_{t+h}}$$
like
$$\text{MAPE} = \frac{\sum_{h=1}^{H} |pe_{t+h}|}{H}$$
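These measures can be computed directly, as in the sketch below (illustrative; here the naive benchmark $\bar{\varphi}$ is taken as the mean of the actual values over the forecast horizon).

```python
import numpy as np

def nmse(actual, forecast):
    """Normalized MSE: < 1 means better than predicting the average."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.sum((actual - forecast) ** 2) / np.sum((actual - actual.mean()) ** 2)

def mape(actual, forecast):
    """Mean absolute percentage error (assumes no zero actual values)."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs(100 * (actual - forecast) / actual))

if __name__ == "__main__":
    actual = np.array([102.0, 98.0, 105.0, 110.0, 95.0])
    forecast = np.array([100.0, 100.0, 100.0, 100.0, 100.0])
    print(f"NMSE = {nmse(actual, forecast):.3f}, MAPE = {mape(actual, forecast):.2f}%")
```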
Applications in my lab
Wireless sensor networks
[Figure: hourly temperature measurements (°C) collected by a sensor node; a constant model achieves an accuracy of about 2°C.]
Metrics: communication costs, model error, model complexity.
The measurements of sensor $i$ are modelled by an autoregressive model
$$\text{AR}(p): \quad s_i[t] = \sum_{j=1}^{p} \alpha_j \, s_i[t-j]$$
Side-channel attack
For each key guess $Q_j$, a model $\hat{f}_{Q_j}$ predicts the observed trace $T$ and the discrepancy
$$D(Q_j, T) = \frac{1}{N - n + 1} \sum_{t=n}^{N} \left( \hat{f}_{Q_j}\!\left( T(t-1), T(t-2), \ldots, T(t-n+1) \right) - T(t) \right)^2$$
is computed; the attack then chooses the key minimizing it:
$$\hat{Q} = \arg\min_{j \in [0,\, 2^{(k-1)}]} D(Q_j, T)$$
Conclusions
Open-source software
Many commercial solutions exist but only open-source software can cope with
fast integration of new algorithms
portability over several platforms
new paradigms of data storage (e.g. Hadoop)
integration with different data formats and architectures
A de-facto standard in computational statistics, machine learning,
bioinformatics, geostatistics or more general analytics is nowadays R.
Highly recommended!
Conclusion
Popper claimed that if a theory is falsifiable (i.e. it can be contradicted
by an observation or the outcome of a physical experiment), then it is
scientific. Since prediction is the most falsifiable aspect of science, it is
also the most scientific one.
Effective machine learning is an extension of statistics, in no way an
alternative.
Simplest (i.e. linear) model first.
Local learning techniques represent an effective trade-off between
linearity and nonlinearity.
Modelling is more an art than an automatic process... so experienced
data analysts are more valuable than expensive tools.
Expert knowledge matters..., data too
Understanding what is predictable is as important as trying to predict it.
References
[1] Robert R. Andrawis, Amir F. Atiya, and Hisham El-Shishiny. Forecast combinations of computational intelligence and linear models for the NN5 time series forecasting competition. International Journal of Forecasting, January 2011.
[2] S. Ben Taieb, A. Sorjamaa, and G. Bontempi. Multiple-output modelling for multi-step-ahead forecasting. Neurocomputing, 73:1950–1957, 2010.
[3] G. Bontempi. Long term time series prediction with multi-input multi-output local learning. In Proceedings of the 2nd European Symposium on Time Series Prediction (TSP), ESTSP'08, pages 145–154, Helsinki, Finland, February 2008.
[4] G. Bontempi, M. Birattari, and H. Bersini. Local learning for iterated time-series prediction. In I. Bratko and S. Dzeroski, editors, Machine Learning: Proceedings of the Sixteenth International Conference, pages 32–38, San Francisco, CA, 1999. Morgan Kaufmann Publishers.
[5] G. Bontempi and S. Ben Taieb. Conditionally dependent strategies for multiple-step-ahead prediction in local learning.
[11] Yann-Aël Le Borgne, Silvia Santini, and Gianluca Bontempi. Adaptive model selection for time series prediction in wireless sensor networks. Signal Processing, 87(12):3010–3020, 2007.
[12] Liran Lerman, Gianluca Bontempi, Souhaib Ben Taieb, and Olivier Markowitch. A time series approach for profiling attack. In SPACE, pages 75–94, 2013.
[13] José M. Matías. Multi-output nonparametric regression. In Carlos Bento, Amílcar Cardoso, and Gaël Dias, editors, EPIA, volume 3808 of Lecture Notes in Computer Science, pages 288–292. Springer, 2005.
[14] J. McNames. A nearest trajectory strategy for time series prediction. In Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling, pages 112–128, Belgium, 1998. K.U. Leuven.
[15] Charles A. Micchelli and Massimiliano A. Pontil. On learning vector-valued functions. Neural Comput., 17(1):177–204, 2005.
[16] T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
[17] A. Sorjamaa, J. Hao, N. Reyhani, Y. Ji, and A. Lendasse. Methodology for long-term prediction of time series. Neurocomputing, 70(16–18):2861–2869, October 2007.
[18] A. Sorjamaa and A. Lendasse. Time series prediction using DirRec strategy. In M. Verleysen, editor, ESANN'06, European Symposium on Artificial Neural Networks, pages 143–148, Bruges, Belgium, April 26-28 2006.
[19] Souhaib Ben Taieb, Gianluca Bontempi, Antti Sorjamaa, and Amaury Lendasse. Long-term prediction of time series by combining direct and MIMO strategies. In International Joint Conference on Neural Networks, 2009.
[20] H. Tong. Threshold models in Nonlinear Time Series Analysis. Springer Verlag, Berlin, 1983.
[21] Van Tung Tran, Bo-Suk Yang, and Andy Chit Chiow Tan. Multi-step ahead direct prediction for the machine condition prognosis using regression trees and neuro-fuzzy systems. Expert Syst. Appl., 36(5):9378–9387, 2009.
[22] A.S. Weigend and N.A. Gershenfeld. Time Series Prediction: forecasting the future and understanding the past.
Addison Wesley, Harlow, UK, 1994.