On the use of cross-validation for local
modeling in regression and time series
prediction
Gianluca Bontempi
gbonte@ulb.ac.be
Machine Learning Group
Departement d’Informatique, ULB
Boulevard de Triomphe - CP 212
http://www.ulb.ac.be/di/mlg
On the use of cross-validation for local modeling in regression and time series prediction – p.1/75
Outline
- The Machine Learning Group.
- A local learning algorithm: the Lazy Learning.
- Lazy Learning for multivariate regression modeling.
- Lazy Learning for multi-step-ahead time series prediction.
- Lazy Learning for feature selection.
- Applications.
- Future work.
On the use of cross-validation for local modeling in regression and time series prediction – p.2/75
Machine Learning: a definition
The field of machine learning is concerned with the question of how to
construct computer programs that automatically improve with
experience. [35]
On the use of cross-validation for local modeling in regression and time series prediction – p.3/75
The Machine Learning Group (MLG)
- 7 researchers (1 professor, 6 PhD students), 4 graduate students.
- Research topics: Bioinformatics, Classification, Computational statistics, Data mining, Regression, Time series prediction, Sensor networks.
- Computing facilities: cluster of 16 processors, LEGO Robotics Lab.
- Website: www.ulb.ac.be/di/mlg.
- Scientific collaborations within ULB: IRIDIA (Sciences Appliquées), Physiologie Moléculaire de la Cellule (IBMM), Conformation des Macromolécules Biologiques et Bioinformatique (IBMM), CENOLI (Sciences), Microarray Unit (Hôpital Jules Bordet), Service d'Anesthésie (ERASME).
- Scientific collaborations outside ULB: UCL Machine Learning Group (B), Politecnico di Milano (I), Università del Sannio (I), George Mason University (US).
- The MLG is part of the "Groupe de Contact FNRS" on Machine Learning.
On the use of cross-validation for local modeling in regression and time series prediction – p.4/75
MLG: running projects
1. "Integrating experimental and theoretical approaches to decipher the molecular
networks of nitrogen utilisation in yeast": ARC (Action de Recherche Concertée)
funded by the Communauté Française de Belgique (2004-2009). Partners: IBMM
(Gosselies and La Plaine), CENOLI.
2. "COMP2SYS" (COMPutational intelligence methods for COMPlex SYStems)
MARIE CURIE Early Stage Research Training funded by the European Union
(2004-2008). Main contractor: IRIDIA (ULB).
3. "Predictive data mining techniques in anaesthesia": FIRST Europe Objectif 1
funded by the Région wallonne and the Fonds Social Européen (2004-2009).
Partners: Service d’anesthesie (ERASME).
4. "AIDAR - Adressage et Indexation de Documents Multimédias Assistés par des
techniques de Reconnaissance Vocale": funded by Région Bruxelles-Capitale
(2004-2006). Partners: Voice Insight, RTBF, Titan.
On the use of cross-validation for local modeling in regression and time series prediction – p.5/75
Machine learning and applied statistics
Reductionist attitude: ML is a modern buzzword which equates to
statistics plus marketing
Positive attitude: ML paved the way to the treatment of real problems
related to data analysis, sometimes overlooked by statisticians
(nonlinearity, classification, pattern recognition, missing variables,
adaptivity, optimization, massive datasets, data management,
causality, representation of knowledge, parallelisation)
Interdisciplinary attitude: ML should have its roots in statistics and
complement it by focusing on algorithmic issues, computational
efficiency, and data engineering.
On the use of cross-validation for local modeling in regression and time series prediction – p.6/75
Motivations
- A wide body of theoretical and practical results exists for linear methods in statistics, forecasting, and control.
- However, real settings often present nonlinear problems.
- Nonlinear methods are generally harder to analyze than linear ones, rarely produce closed-form or analytically tractable expressions, and are not easy to manipulate and implement.
- Local learning techniques are a powerful way of re-using linear techniques in a nonlinear setting.
On the use of cross-validation for local modeling in regression and time series prediction – p.7/75
Prediction models from data
[Figure: block diagram of prediction from data — training data are used to build a prediction model that maps an input to a predicted output, which is compared with the target to yield the prediction error.]
On the use of cross-validation for local modeling in regression and time series prediction – p.8/75
Regression setting
- Multidimensional input $x \in \mathbb{R}^n$ and scalar output $y \in \mathbb{R}$, related by
  $$y = f(x) + \varepsilon,$$
  where $f$ is the unknown regression function and $\varepsilon$ is the random error term.
- A finite number of noisy input/output observations (the training set).
- A test set of input values for which an accurate generalization or prediction of the output is required.
- A learning machine which returns an input/output model on the basis of the training set.
Assumption: no a priori knowledge on the process underlying the data.
On the use of cross-validation for local modeling in regression and time series prediction – p.9/75
The global modeling approach
[Figure sequence (output y vs. input x, query point q): the input-output regression problem, the training data set, a global model fitted to all the data, and two predictions obtained by evaluating the fitted global model at different query points.]
On the use of cross-validation for local modeling in regression and time series prediction – p.10/75
The local modeling approach
[Figure sequence (output y vs. input x, query point q): the same input-output regression problem and training data set, followed by a local fit and prediction around one query point and another local fit and prediction around a different query point.]
On the use of cross-validation for local modeling in regression and time series prediction – p.11/75
Global vs. local modeling
- The traditional approach to supervised learning is global modeling, which describes the relationship between the input and the output with an analytical function valid over the whole input domain.
- Even for huge datasets, a parametric model can be stored in a small amount of memory. Also, the evaluation of a parametric model requires a short program that can be executed in a reduced amount of time.
- Modeling complex input/output relations often requires the adoption of global nonlinear models, whose learning procedures are typically slow and analytically intractable. In particular, validation methods, which address the problem of assessing a global model on the basis of a finite amount of noisy samples, are computationally prohibitive.
- For these reasons, in recent years, interest has grown in pursuing alternatives (divide-and-conquer) to global modeling techniques.
On the use of cross-validation for local modeling in regression and time series prediction – p.12/75
Global vs. local modeling
- The divide-and-conquer strategy consists in attacking a complex problem by dividing it into simpler problems whose solutions can be combined to yield a solution to the original problem.
- Instances of the divide-and-conquer approach are modular techniques (e.g. local model networks [36], regression trees [19], splines [45]) and local modeling (aka smoothing) techniques.
- The principle underlying local modeling is that a smooth function can be well approximated by a low degree polynomial in the neighborhood of any query point.
- Local modeling techniques do not return a global fit of the available dataset but perform the prediction of the output for specific test input values, also called queries.
- The talk presents our contribution to local modeling techniques and their application to a number of experimental problems.
On the use of cross-validation for local modeling in regression and time series prediction – p.13/75
Lazy vs. eager modeling
- Eager techniques perform a large amount of computation for tuning the model before observing the new query.
- An eager technique must then commit to a specific hypothesis that covers all the future queries.
- Lazy techniques [1] wait for the query to be defined before starting the learning procedure.
- For that purpose, the database of observed input/output data is always kept in memory and the output prediction is obtained by interpolating the samples in the neighborhood of the query point.
- Lazy methods will generally require less computation during training but more computation when they must predict the target value for a new query.
On the use of cross-validation for local modeling in regression and time series prediction – p.14/75
Examples
- Classical linear regression is an example of the global, eager, and linear approach.
- Neural networks (NN) are instances of the global, eager, and nonlinear approach: NN are global in the sense that a single representation covers the whole input space. They are eager in the sense that the examples are used for tuning the network and then they are discarded without waiting for any query. Finally, NN are nonlinear in the sense that the relation between the weights and the output is nonlinear.
- The technique we are going to discuss here is a lazy and local approach.
- Remark: we can imagine a local technique (e.g. a K-nearest neighbor) where the most important parameter (i.e. the number of neighbors) is defined in an eager fashion.
On the use of cross-validation for local modeling in regression and time series prediction – p.15/75
Some history
- Local regression estimation was independently introduced in several different fields in the late nineteenth [42] and early twentieth century [28].
- In the statistical literature, the method was independently introduced from different viewpoints in the late 1970's [20, 31, 43].
- Reference books are Fan and Gijbels [26] and Loader [32].
- In the machine learning literature, work on local techniques for classification dates back to 1967 [24]. A more recent reference is the special issue on Lazy Learning [1].
On the use of cross-validation for local modeling in regression and time series prediction – p.16/75
Local modeling procedure
The identification of a local model [3] can be summarized in these
steps:
1. Compute the distance between the query and the training
samples according to a predefined metric.
2. Rank the neighbors on the basis of their distance to the query.
3. Select a subset of the nearest neighbors according to the
bandwidth which measures the size of the neighborhood.
4. Fit a local model (e.g. constant, linear,...).
Each of the local approaches has one or more structural (or
smoothing) parameters that control the amount of smoothing
performed.
In this talk we will focus on the bandwidth selection.
On the use of cross-validation for local modeling in regression and time series prediction – p.17/75
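To make the four steps above concrete, here is a minimal sketch (Euclidean metric, rectangular kernel of k neighbours, local linear fit). It is only an illustration under those assumptions, not the Lazy Learning toolbox discussed later; the function name and the toy data are invented for the example.

```python
import numpy as np

def local_linear_predict(X, y, x_q, k):
    """Predict y at query x_q from its k nearest neighbours (local linear fit)."""
    d = np.linalg.norm(X - x_q, axis=1)          # 1. distances to the query
    idx = np.argsort(d)[:k]                      # 2.-3. rank and select k neighbours
    Xk = np.column_stack([np.ones(k), X[idx]])   # 4. local linear model (with intercept)
    beta, *_ = np.linalg.lstsq(Xk, y[idx], rcond=None)
    return np.concatenate([[1.0], x_q]) @ beta

# toy example: noisy sine, one query point
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
print(local_linear_predict(X, y, np.array([3.0]), k=20))
```

The number of neighbours k plays the role of the bandwidth: the rest of the talk is about choosing it by cross-validation.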
The bandwidth trade-off: overfit
[Figure: output y vs. input x around the query point q — a local fit with a very narrow bandwidth chases the noisy samples.]
Too narrow a bandwidth → overfitting → large prediction error.
In terms of the bias/variance trade-off, this is typically a situation of high variance.
On the use of cross-validation for local modeling in regression and time series prediction – p.18/75
The bandwidth trade-off: underfit
[Figure: output y vs. input x around the query point q — a local fit with a very large bandwidth ignores the local structure of the data.]
Too large a bandwidth → underfitting → large prediction error.
In terms of the bias/variance trade-off, this is typically a situation of high bias.
On the use of cross-validation for local modeling in regression and time series prediction – p.19/75
Bandwidth and bias/variance trade-off
[Figure: mean squared error vs. 1/bandwidth — moving from many neighbors to few neighbors, the bias term decreases while the variance term increases; underfitting on the left, overfitting on the right.]
On the use of cross-validation for local modeling in regression and time series prediction – p.20/75
Existing work on bandwidth selection
Rule-of-thumb methods. They provide a crude bandwidth selection which in some situations may prove sufficient. Examples of rules of thumb are in [25], [27].
Plug-in techniques. The exact expression of the optimal bandwidth can be obtained from the asymptotic expressions of bias and variance, which unfortunately depend on unknown terms. The idea of the direct plug-in method is to replace these terms with estimates. This method was first introduced by Woodrofe [47] in density estimation. Examples of plug-in methods for nonparametric regression are reported in Ruppert et al. [41].
Data-driven estimation. A selection procedure which estimates the generalization error directly from data. Unlike the previous approaches, this method does not rely on asymptotic expressions but estimates the relevant quantities directly from the finite data set. To this group belong methods like cross-validation, Mallows' $C_p$, Akaike's AIC, and other extensions of methods used in classical parametric modeling.
On the use of cross-validation for local modeling in regression and time series prediction – p.21/75
Existing work (II)
- The debate on the superiority of plug-in methods over data-driven methods is still open and the experimental evidence is contrasting. Results in favor of plug-in methods come from [47, 41, 38].
- Loader [33] showed how the supposed superior performance of plug-in approaches is a complete myth. The use of cross-validation for bandwidth selection has been investigated in several papers, mainly in the case of density estimation [30].
- In regression, an adaptation of Mallows' $C_p$ was introduced by Rice [40] for constant fitting and by Cleveland and Devlin [21] in local polynomial regression. Cleveland and Loader [22] suggested local $C_p$ and local PRESS for choosing both the degree of the local polynomial mixing and the bandwidth.
- We believe that plug-in methods are built on a series of assumptions about the statistical process underlying the data set and on theoretical results which become more reliable as the number of points tends to infinity.
- In a common black-box situation where no a priori information is available, the adoption of data-driven techniques can be a promising approach to the problem.
On the use of cross-validation for local modeling in regression and time series prediction – p.22/75
Data-driven bandwidth selection
[Figure: for a query point q, the training set feeds a local weighted regression whose structural identification considers different bandwidths; each candidate model β(k) is assessed by its leave-one-out error MSE_loo(k), and the model selection step returns the prediction ŷ_q.]
On the use of cross-validation for local modeling in regression and time series prediction – p.23/75
Original contributions
Problem 1: identifying a sequence of local models is expensive.
Solution 1: we propose recursive least squares (RLS) to speed up the identification of a sequence of models with an increasing number of neighbors [6, 13].
Problem 2: validating a local model by cross-validation is expensive.
Solution 2: we compute the leave-one-out cross-validation error by obtaining the PRESS statistic from the terms of RLS [9].
Problem 3: choosing the best model is prone to errors.
Solution 3: we combine the best models [7].
On the use of cross-validation for local modeling in regression and time series prediction – p.24/75
Recursive-least-squares in space
[Figure: for a query point q, the sequence of local models β_m(k), β_{m+1}(k), ..., β_M(k) with a growing number of neighbors can be identified from scratch (slow identification) or obtained incrementally through RLS updates (fast identification).]
On the use of cross-validation for local modeling in regression and time series prediction – p.25/75
PRESS statistic and leave-one-out
[Figure: leave-one-out cross-validation repeats N times the cycle "put the j-th sample aside, perform the parametric identification on the remaining N-1 samples, test on the j-th sample"; the PRESS statistic returns the same errors from a single parametric identification on all N samples of the training set.]
PRESS was first introduced by Allen [2].
On the use of cross-validation for local modeling in regression and time series prediction – p.26/75
The regression task
Given two variables $x \in \mathbb{R}^n$ and $y \in \mathbb{R}$, let us consider the mapping $f: \mathbb{R}^n \to \mathbb{R}$, known only through a set of $N$ examples $\{(x_i, y_i)\}_{i=1}^{N}$ obtained as follows:
$$y_i = f(x_i) + \varepsilon_i,$$
where, $\forall i$, $\varepsilon_i$ is a random variable such that $E[\varepsilon_i] = 0$ and $E[\varepsilon_i \varepsilon_j] = 0$, $\forall j \neq i$, and such that $E[\varepsilon_i^m] = \mu_m(x_i)$, $\forall m \geq 2$, where $\mu_m(\cdot)$ is the unknown $m$-th moment of the distribution of $\varepsilon_i$ and is defined as a function of $x_i$.
In particular for $m = 2$, the last of the above mentioned properties implies that no assumption of global homoscedasticity is made.
On the use of cross-validation for local modeling in regression and time series prediction – p.27/75
Local Weighted Regression
- The problem of local regression can be stated as the problem of estimating the value that the regression function $f(x) = E[y \mid x]$ assumes for a specific query point $x$, using information pertaining only to a neighborhood of $x$.
- Given a query point $x_q$, and under the hypothesis of local homoscedasticity of $\varepsilon_i$, the parameter $\beta$ of a local linear approximation of $f(\cdot)$ in a neighborhood of $x_q$ can be obtained by solving the locally weighted regression
$$\hat\beta = \arg\min_{\beta} \sum_{i=1}^{N} \left\{ (y_i - x_i^T \beta)^2 \, K\!\left(\frac{d(x_i, x_q)}{h}\right) \right\},$$
where, given a metric on the space $\mathbb{R}^n$,
- $d(x_i, x_q)$ is the distance from the query point to the $i$-th example, $i = 1, \dots, N$,
- $K(\cdot)$ is a weight (aka kernel) function,
- $h$ is the bandwidth.
On the use of cross-validation for local modeling in regression and time series prediction – p.28/75
Local Weighted Regression (II)
- In matrix notation, the solution of the above stated weighted least squares problem is given by:
$$\hat\beta = (X^T W^T W X)^{-1} X^T W^T W y = (Z^T Z)^{-1} Z^T v = P Z^T v,$$
where $X$ is a matrix whose $i$-th row is $x_i^T$, $y$ is a vector whose $i$-th element is $y_i$, $W$ is a diagonal matrix whose $i$-th diagonal element is $w_{ii} = K(d(x_i, x_q)/h)$, $Z = WX$, $v = Wy$, and the matrix $X^T W^T W X = Z^T Z$ is assumed to be non-singular so that its inverse $P = (Z^T Z)^{-1}$ is defined.
- Once the local linear approximation is obtained, a prediction of $y_q = f(x_q)$ is finally given by:
$$\hat y_q = x_q^T \hat\beta.$$
On the use of cross-validation for local modeling in regression and time series prediction – p.29/75
Linear Leave-one-out
- By exploiting the linearity of the local approximator, a leave-one-out cross-validation estimate of the mean squared error $E[(f(x_q) - \hat y_q)^2]$ can be obtained without any significant computational overload.
- In fact, using the PRESS statistic [2, 37], it is possible to calculate the error $e^{\mathrm{cv}}_j = y_j - x_j^T \hat\beta_{-j}$ without explicitly identifying the parameters $\hat\beta_{-j}$ from the examples available with the $j$-th one removed.
- The formulation of the PRESS statistic for the case at hand is the following:
$$e^{\mathrm{cv}}_j = y_j - x_j^T \hat\beta_{-j} = \frac{y_j - x_j^T P Z^T v}{1 - z_j^T P z_j} = \frac{y_j - x_j^T \hat\beta}{1 - h_{jj}},$$
where $z_j^T$ is the $j$-th row of $Z$ (and therefore $z_j = w_{jj} x_j$), and where $h_{jj}$ is the $j$-th diagonal element of the Hat matrix $H = Z P Z^T = Z (Z^T Z)^{-1} Z^T$.
On the use of cross-validation for local modeling in regression and time series prediction – p.30/75
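A small numerical check of the PRESS identity above, written under the same assumptions (linear model, diagonal weight matrix W); all names and the synthetic data are illustrative. The leave-one-out errors obtained from the hat matrix coincide with those obtained by explicitly refitting without each sample.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 30, 3
X = rng.standard_normal((N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)
w = rng.uniform(0.5, 1.0, N)                  # diagonal of W (kernel weights)

Z, v = w[:, None] * X, w * y                  # Z = W X, v = W y
P = np.linalg.inv(Z.T @ Z)
beta = P @ Z.T @ v
H = Z @ P @ Z.T                               # hat matrix
e_press = (y - X @ beta) / (1.0 - np.diag(H)) # PRESS leave-one-out errors

# explicit leave-one-out refits for comparison
e_loo = np.empty(N)
for j in range(N):
    m = np.arange(N) != j
    bj = np.linalg.lstsq(Z[m], v[m], rcond=None)[0]
    e_loo[j] = y[j] - X[j] @ bj
print(np.allclose(e_press, e_loo))            # True
```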
Rectangular weight function
- In what follows, for the sake of simplicity, we will focus on the linear approximator. An extension to generic polynomial approximators of any degree is straightforward. We will also assume that a metric on the space $\mathbb{R}^n$ is given. All the attention will thus be centered on the problem of bandwidth selection.
- If as weight function $K(\cdot)$ the indicator function
$$K\!\left(\frac{d(x_i, x_q)}{h}\right) = \begin{cases} 1 & \text{if } d(x_i, x_q) \le h, \\ 0 & \text{otherwise,} \end{cases}$$
is adopted, the optimization of the parameter $h$ can be conveniently reduced to the optimization of the number $k$ of neighbors to which a unitary weight is assigned in the local regression evaluation.
- In other words, we reduce the problem of bandwidth selection to a search in the space of $h(k) = d(x_{(k)}, x_q)$, where $x_{(k)}$ is the $k$-th nearest neighbor of the query point.
On the use of cross-validation for local modeling in regression and time series prediction – p.31/75
Recursive local regression
The main advantage deriving from the adoption of the rectangular weight function is that, simply by updating the parameter $\hat\beta(k)$ of the model identified using the $k$ nearest neighbors, it is straightforward and inexpensive to obtain $\hat\beta(k+1)$. In fact, performing a step of the standard recursive least squares algorithm [4], we have:
$$
\begin{aligned}
P(k+1) &= P(k) - \frac{P(k)\, x(k+1)\, x^T(k+1)\, P(k)}{1 + x^T(k+1)\, P(k)\, x(k+1)}, \\
\gamma(k+1) &= P(k+1)\, x(k+1), \\
e(k+1) &= y(k+1) - x^T(k+1)\, \hat\beta(k), \\
\hat\beta(k+1) &= \hat\beta(k) + \gamma(k+1)\, e(k+1),
\end{aligned}
$$
where $P(k) = (Z^T Z)^{-1}$ when $h = h(k)$, and where $x(k+1)$ is the $(k+1)$-th nearest neighbor of the query point.
On the use of cross-validation for local modeling in regression and time series prediction – p.32/75
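A minimal sketch of the update above (rectangular kernel, so unit weights, and an invertible initial matrix are assumed; the function name and test data are illustrative): β̂(k+1) and P(k+1) are obtained from β̂(k) and P(k) when the (k+1)-th nearest neighbour is added, and checked against a direct least squares fit.

```python
import numpy as np

def rls_add_neighbor(beta, P, x_new, y_new):
    """One recursive least squares step: add one neighbour to the local model."""
    Px = P @ x_new
    P_new = P - np.outer(Px, Px) / (1.0 + x_new @ Px)    # P(k+1)
    gamma = P_new @ x_new                                # gain
    err = y_new - x_new @ beta                           # a priori error
    beta_new = beta + gamma * err                        # beta(k+1)
    return beta_new, P_new

# check against a direct least squares fit on k+1 points
rng = np.random.default_rng(2)
Xk = rng.standard_normal((10, 3)); yk = rng.standard_normal(10)
P = np.linalg.inv(Xk.T @ Xk); beta = P @ Xk.T @ yk       # model on the first k neighbours
x1, y1 = rng.standard_normal(3), rng.standard_normal()
beta1, P1 = rls_add_neighbor(beta, P, x1, y1)
Xf = np.vstack([Xk, x1]); yf = np.append(yk, y1)
print(np.allclose(beta1, np.linalg.lstsq(Xf, yf, rcond=None)[0]))  # True
```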
Recursive PRESS computation
Moreover, once the matrix $P(k+1)$ is available, the leave-one-out cross-validation errors can be directly calculated without the need of any further model identification:
$$e^{\mathrm{cv}}_j(k+1) = \frac{y_j - x_j^T \hat\beta(k+1)}{1 - x_j^T P(k+1)\, x_j}, \qquad j = 1, \dots, k+1.$$
Let us define for each value of $k$ the vector $e^{\mathrm{cv}}(k)$ that contains all the leave-one-out errors associated to the model $\hat\beta(k)$.
On the use of cross-validation for local modeling in regression and time series prediction – p.33/75
Model selection
- The recursive algorithm returns, for a given query point $x_q$, a set of predictions $\hat y_q(k) = x_q^T \hat\beta(k)$, together with a set of associated leave-one-out error vectors $e^{\mathrm{cv}}(k)$.
- If the selection paradigm, frequently called winner-takes-all, is adopted, the most natural way to extract a final prediction $\hat y_q$ consists in comparing the predictions obtained for each value of $k$ on the basis of the classical mean squared error criterion:
$$\hat y_q = x_q^T \hat\beta(\hat k) \quad \text{with} \quad \hat k = \arg\min_k \widehat{\mathrm{MSE}}(k), \qquad \widehat{\mathrm{MSE}}(k) = \frac{\sum_{i=1}^{k} w_{ii}\, \big(e^{\mathrm{cv}}_i(k)\big)^2}{\sum_{i=1}^{k} w_{ii}}.$$
With the rectangular kernel the weights are unitary, so $\widehat{\mathrm{MSE}}(k)$ is simply the mean of the squared leave-one-out errors.
On the use of cross-validation for local modeling in regression and time series prediction – p.34/75
Local Model combination
- As an alternative to the winner-takes-all paradigm, we explored also the effectiveness of local combinations of estimates [46].
- The final prediction of the value $y_q$ is obtained as a weighted average of the best $b$ models, where $b$ is a parameter of the algorithm.
- Suppose the predictions $\hat y_q(k)$ and the error vectors $e^{\mathrm{cv}}(k)$ have been ordered creating a sequence of integers $\{k_i\}$ so that $\widehat{\mathrm{MSE}}(k_i) \le \widehat{\mathrm{MSE}}(k_j)$ for $i \le j$. The prediction $\hat y_q$ is given by
$$\hat y_q = \frac{\sum_{i=1}^{b} \zeta_i \, \hat y_q(k_i)}{\sum_{i=1}^{b} \zeta_i},$$
where the weights are the inverse of the mean squared errors: $\zeta_i = 1/\widehat{\mathrm{MSE}}(k_i)$.
This is an example of the generalized ensemble method [39].
On the use of cross-validation for local modeling in regression and time series prediction – p.35/75
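A sketch of this final step under the rectangular-kernel assumption (so the leave-one-out MSE is a plain mean): given, for each candidate number of neighbours k, the prediction ŷ_q(k) and its vector of leave-one-out errors, either pick the winner or average the b best candidates with weights 1/MSE. The data and names below are illustrative only.

```python
import numpy as np

def select_or_combine(preds, loo_errors, b=None):
    """preds[k]: prediction with k neighbours; loo_errors[k]: its leave-one-out errors.
    b=None -> winner-takes-all; otherwise weighted average of the b best candidates."""
    mse = {k: float(np.mean(e ** 2)) for k, e in loo_errors.items()}
    ranking = sorted(mse, key=mse.get)
    if b is None:
        return preds[ranking[0]]
    best = ranking[:b]
    weights = np.array([1.0 / mse[k] for k in best])
    values = np.array([preds[k] for k in best])
    return float(weights @ values / weights.sum())

# illustrative candidates: prediction and leave-one-out error vector for each k
rng = np.random.default_rng(4)
preds = {k: 1.0 + 0.05 * rng.standard_normal() for k in (3, 5, 8, 12)}
loo = {k: 0.2 * rng.standard_normal(k) for k in (3, 5, 8, 12)}
print(select_or_combine(preds, loo))        # winner-takes-all
print(select_or_combine(preds, loo, b=3))   # combination of the 3 best models
```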
From local learning to Lazy Learning (LL)
- By speeding up the local learning procedure, we can delay learning to the moment when a prediction at a query point is required (query-by-query learning).
- The combination approach makes it possible to integrate local models of different order (e.g. constant and linear) and different bandwidths.
- This method is called lazy since the whole learning procedure (i.e. the parametric and the structural identification) is deferred until a prediction is required.
On the use of cross-validation for local modeling in regression and time series prediction – p.36/75
Experimental setup for regression
Datasets: 23 real and artificial datasets from the ML repository.
Methods: Lazy Learning, Local modeling, Feed Forward Neural
Networks, Mixtures of Experts, Neuro Fuzzy, Regression Trees
(Cubist).
Experimental methodology: 10-fold cross-validation.
Results: Mean absolute error (Table 7.2), relative error (Table 7.3) and
paired t-test (Appendix C) [7].
On the use of cross-validation for local modeling in regression and time series prediction – p.37/75
Regression datasets
Dataset Number of examples Number of regressors
Housing 330 8
Cpu 506 13
Prices 209 6
Mpg 159 16
Servo 392 7
Ozone 167 8
Bodyfat 252 13
Pool 253 3
Energy 2444 5
Breast 699 9
Abalone 4177 10
Sonar 208 60
Bupa 345 6
Iono 351 34
Pima 768 8
Kin_8fh 8192 8
Kin_8nh 8192 8
Kin_8fm 8192 8
Kin_8nm 8192 8
Kin_32fh 8192 32
Kin_32nh 8192 32
Kin_32fm 8192 32
Kin_32nm 8192 32
On the use of cross-validation for local modeling in regression and time series prediction – p.38/75
Experimental results: paired comparison
Each method is statistically compared with all the others
(9 * 23 =207 comparisons).
Method
Number of times the method
was significantly worse than another
LL linear 74
LL constant 96
LL combination 23
Local modeling linear 58
Local modeling constant 81
Cubist 40
Feed Forward NN 53
Mixtures of Experts 80
Local Model Network (fuzzy) 132
Local Model Network (k-mean) 145
The lower, the better!
On the use of cross-validation for local modeling in regression and time series prediction – p.39/75
Award in EUFIT competition
Data analysis competition on regression: awarded as a runner-up at the Third International Erudit competition on Protecting rivers and streams by monitoring chemical concentrations and algae communities [10].
On the use of cross-validation for local modeling in regression and time series prediction – p.40/75
Lazy Learning for dynamic tasks
Multi-step-ahead prediction: [12]
long horizon forecasting based on the iteration of a LL
one-step-ahead predictor.
Nonlinear control: [11]
1. Lazy Learning inverse/forward control.
2. Lazy Learning self-tuning control.
3. Lazy Learning optimal control.
On the use of cross-validation for local modeling in regression and time series prediction – p.41/75
Embedding in time series
Consider a sequence $\{\varphi_t\}$ of measurements $\varphi_t \in \mathbb{R}$ of an observable at equal time intervals.
We express the present value as a function of the previous $n$ values of the time series itself:
$$\varphi_t = f(\varphi_{t-1}, \varphi_{t-2}, \dots, \varphi_{t-n}),$$
where $f$ is an unknown nonlinear function and the vector $[\varphi_{t-1}, \varphi_{t-2}, \dots, \varphi_{t-n}]$ lies in the $n$-dimensional time delay space, or lag space.
This standard approach is called "state-space reconstruction" in the physics community, "tapped delay line" in the engineering community, and Nonlinear Autoregressive (NAR) modeling in the forecasting community.
On the use of cross-validation for local modeling in regression and time series prediction – p.42/75
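A minimal sketch of the lag-space embedding just described: a scalar series is turned into the input/output pairs $(\varphi_{t-1}, \dots, \varphi_{t-n}) \to \varphi_t$ that any supervised learner (e.g. the local predictor above) can consume. The function name and the toy series are illustrative.

```python
import numpy as np

def embed(series, n):
    """Return (X, y) with X[t] = [phi_{t-1}, ..., phi_{t-n}] and y[t] = phi_t."""
    series = np.asarray(series, dtype=float)
    X = np.column_stack([series[n - j - 1: len(series) - j - 1] for j in range(n)])
    y = series[n:]
    return X, y

phi = np.sin(0.3 * np.arange(50))
X, y = embed(phi, n=3)
print(X.shape, y.shape)        # (47, 3) (47,)
print(X[0], y[0])              # [phi_2, phi_1, phi_0] -> phi_3
```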
[Figure: three views of the same data — the temporal representation (time plot of the series), the embedding representation $\varphi_{t+1} = f(\varphi_t, \varphi_{t-1}, \dots, \varphi_{t-n+1})$, and the input/output representation ($\varphi_{t+1}$ vs. $\varphi_t$) used for supervised learning; numbered points mark corresponding samples.]
On the use of cross-validation for local modeling in regression and time series prediction – p.43/75
One-step and multi-step-ahead prediction
One-step-ahead prediction: the $n$ previous values of the series are assumed to be available for the prediction of the next value.
This is equivalent to a problem of supervised learning. LL was used in this way in several prediction tasks: finance, economic variables, environmental modeling [23].
Multi-step-ahead prediction: we predict the value of the series for the next $h$ steps.
We can classify the methods for multiple step prediction according to two features: the horizon of the predictor and the training criterion.
On the use of cross-validation for local modeling in regression and time series prediction – p.44/75
Multi-step-ahead prediction
One-step-ahead predictor and one-step-ahead training criterion. The model predicts $h$ steps ahead by iterating a one-step-ahead predictor whose parameters are optimized to minimize the training error on the one-step-ahead forecast.
One-step-ahead predictor and $h$-step-ahead training criterion. The model predicts $h$ steps ahead by iterating a one-step-ahead predictor whose parameters are optimized to minimize the training error on the iterated $h$-step-ahead forecast.
Direct forecasting. The model makes a direct forecast at time $t + h$:
$$\varphi_{t+h} = f_h(\varphi_t, \varphi_{t-1}, \dots, \varphi_{t-n+1}).$$
On the use of cross-validation for local modeling in regression and time series prediction – p.45/75
Iteration of a one-step-ahead predictor
[Figure: block diagram — the one-step model $f(\varphi_{t-1}, \varphi_{t-2}, \dots, \varphi_{t-n})$ produces $\hat\varphi_t$, which is fed back through the delay elements $z^{-1}$ so that predicted values replace the unavailable future observations at the next iteration.]
On the use of cross-validation for local modeling in regression and time series prediction – p.46/75
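A sketch of the iteration in the diagram above: the last n observed values are fed to a one-step predictor and each prediction is pushed back into the lag vector (the feedback through $z^{-1}$). Any one-step predictor can be plugged in, for instance the local_linear_predict sketch given earlier; the fixed linear rule used here is purely illustrative.

```python
import numpy as np

def iterate_forecast(one_step, last_values, horizon):
    """Iterate a one-step-ahead predictor; last_values = [phi_{t-1}, ..., phi_{t-n}]."""
    lags = list(last_values)
    out = []
    for _ in range(horizon):
        pred = one_step(np.array(lags))
        out.append(pred)
        lags = [pred] + lags[:-1]          # the prediction re-enters the lag vector
    return np.array(out)

# illustrative one-step model: a fixed linear AR(3) rule
phi_hat = iterate_forecast(lambda lags: 0.6 * lags[0] - 0.2 * lags[2],
                           last_values=[0.9, 0.5, 0.1], horizon=5)
print(phi_hat)
```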
Local Modeling in the time domain
Consider the embedding $\varphi_{t+1} = f(\varphi_t, \varphi_{t-1}, \dots, \varphi_{t-15})$ of order $n = 16$.
[Figure: time plot of the series with the embedding window (values at times $t, t-1, \dots, t-16$) highlighted directly in the time domain.]
On the use of cross-validation for local modeling in regression and time series prediction – p.47/75
Local Modeling in the I/O space
Consider the embedding $\varphi_{t+1} = f(\varphi_t)$ of order $n = 1$.
[Figure: the same data plotted in the input/output space, $\varphi_{t+1}$ vs. $\varphi_t$, with the query point q and its nearest neighbors. Note the labels of the axes!]
On the use of cross-validation for local modeling in regression and time series prediction – p.48/75
Local modeling in the embedding space
Consider the embedding $\varphi_{t+1} = f(\varphi_t, \varphi_{t-1})$ of order $n = 2$.
[Figure: the same data in the two-dimensional lag space; the query point and its nearest neighbors (numbered 1-5) are selected in this space before fitting the local model.]
On the use of cross-validation for local modeling in regression and time series prediction – p.49/75
Conventional and iterated leave-one-out
[Figure: comparison on five samples of (a) the conventional leave-one-out error e_cv(3), computed on the one-step-ahead prediction of the removed sample 3, and (b) the iterated leave-one-out error e_it(3), computed when the removed sample is predicted through the iteration of the one-step-ahead model.]
On the use of cross-validation for local modeling in regression and time series prediction – p.50/75
Iterated PRESS in the space
[Figure: iterated PRESS illustrated in the input/output space with the pairs $(x_i, y_i)$ and $(y_i, z_i)$, the leave-one-out models $\beta_{xy}^{-3}$ and $\beta_{yz}^{-3}$, the corresponding leave-one-out errors $e_{xy}(3)$ and $e_{yz}(3)$, and the iterated error $e_{xz}(3)$ for the held-out sample 3.]
Here $x$ represents the value of the time series (of order $n = 1$) at time $t - 1$, $y$ represents the value of the time series at time $t$, and $z$ represents the value of the time series at time $t + 1$.
On the use of cross-validation for local modeling in regression and time series prediction – p.51/75
From conventional to iterated PRESS
- The PRESS statistic returns the leave-one-out errors as a by-product of the local weighted regression.
- We derived in [12] an analytical iterated formulation of the PRESS statistic for long-horizon assessment.
- The iterated assessment criterion improves stability and prediction accuracy.
On the use of cross-validation for local modeling in regression and time series prediction – p.52/75
The iterated multi-step-ahead algorithm
1. The time series is embedded as an input/output mapping $f: \mathbb{R}^n \to \mathbb{R}$.
2. The one-step-ahead predictor is a local estimate of the mapping $f$.
3. The $h$-step-ahead prediction is performed by iterating the one-step-ahead estimator.
4. Local structure identification is performed in a space of alternative model configurations, each characterized by a different bandwidth.
5. Prediction ability is assessed by the iterated formulation of the cross-validation PRESS statistic ($h$-step-ahead criterion).
On the use of cross-validation for local modeling in regression and time series prediction – p.53/75
The Santa Fe time series
- The iterated PRESS approach has been applied both to the prediction of a real-world data set (A) and to a computer-generated time series (D) from the Santa Fe Time Series Prediction and Analysis Competition.
- The A time series has a training set of 1000 values and a test set of 10000 samples: the task is to predict the continuation for 100 steps, starting from different points.
- The D time series has a training set of 100000 values and a test set of 500 samples: the task is to predict the continuation for 25 steps, starting from different points.
On the use of cross-validation for local modeling in regression and time series prediction – p.54/75
A series: training set
[Figure: the 1000-sample training set of the Santa Fe A series.]
On the use of cross-validation for local modeling in regression and time series prediction – p.55/75
A series: one-step criterion
[Figure: 100-step continuation of the A series predicted with the one-step (non-iterated) assessment criterion.]
On the use of cross-validation for local modeling in regression and time series prediction – p.56/75
A series: multi-step criterion
[Figure: 100-step continuation of the A series predicted with the iterated multi-step assessment criterion.]
On the use of cross-validation for local modeling in regression and time series prediction – p.57/75
Experiments: The Santa Fe Time Series A
order n=16 Training set: 1000 values Test set: 100 steps
Test data Non iter. PRESS Iter. PRESS Sauer Wan
1-100 0.350 0.029 0.077 0.055
1180-1280 0.379 0.131 0.174 0.065
2870-2970 0.793 0.055 0.183 0.487
3000-3100 0.003 0.003 0.006 0.023
4180-4280 1.134 0.051 0.111 0.160
Sauer: combination of iterated and direct local models.
Wan: recurrent network.
On the use of cross-validation for local modeling in regression and time series prediction – p.58/75
The Santa Fe Time Series D
order n = 20   Training set: 100,000 values   Test set: 25 steps
Test data Non iter. PRESS Iter. PRESS Zhang Hutchinson
0-24 0.1255 0.0492 0.0665
100-124 0.0460 0.0363 0.0616
200-224 0.2635 0.1692 0.1475
300-324 0.0461 0.0405 0.0541
400-424 0.1610 0.0644 0.0720
Zhang: combination of iterated and direct multilayer perceptron.
On the use of cross-validation for local modeling in regression and time series prediction – p.59/75
Award in Leuven Competition
Training set made of 2000 points.
Task: predict the continuation for the next 200 points.
[Figure: the predicted 200-point continuation of the Leuven competition time series.]
Iterated Lazy Learning ranked second and fourth [8].
On the use of cross-validation for local modeling in regression and time series prediction – p.60/75
Lazy Learning for iterated prediction
Multi-step ahead by iteration of a one-step predictor.
Lazy learning to implement the one-step predictor.
Selection of the local structure by an iterated PRESS.
Iterated criterion avoids the accumulation of prediction errors and
improves the performance.
On the use of cross-validation for local modeling in regression and time series prediction – p.61/75
Complexity in global and local modeling
Consider N training samples, n features, and Q query points.

                                          GLOBAL          LAZY
Parametric identification                 O(NLS)          O(Nn) + O(LS)
Structural identification (K-fold CV)     K O(NLS)        small
Prediction for Q queries                  negligible      Q [O(Nn) + O(LS)]
TOTAL                                     K O(NLS)        Q [O(Nn) + O(LS)]

where O(NLS) stands for the cost of a nonlinear least squares identification and O(LS) stands for the cost of a linear least squares identification.
On the use of cross-validation for local modeling in regression and time series prediction – p.62/75
Feature selection and LL
- Local modeling techniques are known to be weak in high-dimensional spaces.
- A way to counter the curse of dimensionality is dimensionality reduction (aka feature selection).
- It requires the assessment of an exponential number of alternatives ($2^n$ subsets of input variables) and the choice of the best one.
- Several techniques exist: we focus here on wrappers.
- Wrappers rely on expensive cross-validation (e.g. leave-one-out assessment).
- Our idea: combine racing [34] and sub-sampling [29] to accelerate the wrapper feature selection procedure in LL.
On the use of cross-validation for local modeling in regression and time series prediction – p.63/75
On the use of cross-validation for local modeling in regression and time series prediction – p.64/75
Racing for feature selection
- Suppose we have several sets of different input variables.
- The computational cost of making a selection results from the cost of identification and the cost of validation.
- The validation cost required by a global model is independent of Q, while this is not the case for LL.
- The idea of racing techniques consists in using blocking and paired multiple tests to compare different models in similar conditions and to discard the worst ones as soon as possible.
- Racing reduces the number of tests to be made.
- This makes the wrapper LL approach more competitive.
On the use of cross-validation for local modeling in regression and time series prediction – p.65/75
On the use of cross-validation for local modeling in regression and time series prediction – p.66/75
On the use of cross-validation for local modeling in regression and time series prediction – p.67/75
Sub-sampling and LL
- The goal of model selection is to find the best hypothesis in a set of alternatives.
- What is relevant is the ordering of the different alternatives, e.g. M2 ≻ M3 ≻ M5 ≻ M1.
- By reducing the training set size N, we expect to reduce the accuracy of each single model but not necessarily to alter their ordering.
- In LL, reducing the training set size N reduces the cost.
- The idea of sub-sampling is to reduce the size of the training set without altering the ranking of the different models.
- This makes the LL approach more competitive.
On the use of cross-validation for local modeling in regression and time series prediction – p.68/75
RACSAM for feature selection
We proposed the following algorithm [14]:
1. Define an initial group of promising feature subsets.
2. Start with small training and test sets.
3. Discard by racing all the feature subsets that appear significantly worse than the others.
4. Increase the training and test size until at most a predefined number of winner models remain.
5. Update the group with new candidates to be assessed and go back to 3.
On the use of cross-validation for local modeling in regression and time series prediction – p.69/75
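A rough sketch of the racing idea in steps 2-4, not the actual RACSAM implementation: it assumes an assess(c, idx) function returning the leave-one-out errors of feature subset c on a shared sub-sample of query indices, uses scipy.stats.ttest_rel as the paired test, and drops candidates significantly worse than the current best as the sample grows. Candidates are assumed hashable (e.g. tuples of column indices); step 5 (adding new candidates) is omitted.

```python
import numpy as np
from scipy import stats

def race(candidates, assess, sample_sizes, n_total, alpha=0.01, seed=0):
    """Racing sketch: grow the evaluation sample and discard feature subsets that are
    significantly worse (paired t-test) than the current best."""
    order = np.random.default_rng(seed).permutation(n_total)
    alive = list(candidates)
    for n in sample_sizes:
        idx = order[:n]                               # same sub-sample for all candidates (paired)
        errors = {c: np.asarray(assess(c, idx)) for c in alive}
        best = min(alive, key=lambda c: errors[c].mean())
        keep = []
        for c in alive:
            if c is best:
                keep.append(c)
                continue
            _, p = stats.ttest_rel(errors[c], errors[best])
            significantly_worse = p / 2 < alpha and errors[c].mean() > errors[best].mean()
            if not significantly_worse:
                keep.append(c)
        alive = keep
        if len(alive) == 1:
            break
    return alive
```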
Experimental session
- We compare the prediction accuracy of the LL algorithm enhanced by the RACSAM procedure to the accuracy of two state-of-the-art algorithms: an SVM for regression and a regression tree (RTREE).
- Two versions of the RACSAM algorithm were tested: the first (LL-RAC1) takes as feature set the best one (in terms of estimated Mean Absolute Error (MAE)) among the winning candidates; the second (LL-RAC2) averages the predictions of the winning LL predictors.
- The p-value used in the racing tests is 0.01.
On the use of cross-validation for local modeling in regression and time series prediction – p.70/75
Experimental results
Five-fold cross-validation on six real datasets of high dimensionality: Ailerons, Pole, Elevators, Triazines, Wisconsin, and Census.
Dataset AIL POL ELE TRI WIS CEN
LL-RAC1 9.7e-5 3.12 1.6e-3 0.21 27.39 0.17
LL-RAC2 9.0e-5 3.13 1.5e-3 0.12 27.41 0.16
SVM 1.3e-4 26.5 1.9e-3 0.11 29.91 0.21
RTREE 1.8e-4 8.80 3.1e-3 0.11 33.02 0.17
On the use of cross-validation for local modeling in regression and time series prediction – p.71/75
Applications
- Financial prediction of stock markets: in collaboration with Masterfood, Belgium.
- Prediction of yearly sales: in collaboration with D'Ieteren, Belgium, the leading Belgian car dealer.
- Nonlinear control and identification tasks in power systems: in collaboration with Università del Sannio (I) [44, 18].
- Modeling of industrial processes: in collaboration with the FaFer Usinor steel company (B) and the Honeywell Technology Center (US).
- Performance modeling of embedded systems: during my stay at Philips Research [16], Eindhoven (NL).
- Quality of service: during my stay at IMEC, Leuven (B) [17].
- Black-box simulators: in collaboration with CENAERO, Gosselies (B) [15].
- Environmental predictions: in collaboration with Politecnico di Milano (I) [23].
On the use of cross-validation for local modeling in regression and time series prediction – p.72/75
Software
- MATLAB toolbox on Lazy Learning [5].
- R contributed package lazy.
- Joint work with Dr. Mauro Birattari (IRIDIA).
- Web page: http://iridia.ulb.ac.be/~lazy.
- About 5000 accesses since October 2002.
On the use of cross-validation for local modeling in regression and time series prediction – p.73/75
The importance of being Lazy
- Fast data-driven design.
- No global assumption on the noise.
- Linear methods still effective in a multivariate nonlinear setting (LWR, PRESS).
- An estimate of the variance is returned with each prediction.
- Intrinsically adaptive.
On the use of cross-validation for local modeling in regression and time series prediction – p.74/75
Future work
- Extension of the LL method to other local selection criteria (VC dimension, GCV).
- Classification applications.
- Integration with powerful software and hardware devices.
- From large to huge databases.
- New applications: bioinformatics, text mining, medical data, sensor networks, power systems.
On the use of cross-validation for local modeling in regression and time series prediction – p.75/75
References
[1] D. W. Aha. Editorial of special issue on lazy learning. Artificial Intelligence Review, 11(1–5):1–6, 1997.
[2] D. M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125–127, 1974.
[3] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1–5):11–73, 1997.
[4] G. J. Bierman. Factorization Methods for Discrete Sequential Estimation. Academic Press, New York, NY, 1977.
[5] M. Birattari and G. Bontempi. The lazy learning toolbox, for use with MATLAB. Technical Report TR/IRIDIA/99-7, IRIDIA-ULB, Brussels, Belgium, 1999.
[6] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive least-squares algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, NIPS 11, pages 375–381, Cambridge, 1999. MIT Press.
[7] G. Bontempi. Local Learning Techniques for Modeling, Prediction and Control. PhD thesis, IRIDIA, Université Libre de Bruxelles, 1999.
[8] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for iterated time series prediction. In J. A. K. Suykens and J. Vandewalle, editors, Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling, pages 62–68. Katholieke Universiteit Leuven, Belgium, 1998.
[9] G. Bontempi, M. Birattari, and H. Bersini. Recursive lazy learning for modeling and control. In Machine Learning: ECML-98 (10th European Conference on Machine Learning), pages 292–303. Springer, 1998.
[10] G. Bontempi, M. Birattari, and H. Bersini. Lazy learners at work: the lazy learning toolbox. In Proceedings of the 7th European Congress on Intelligent Techniques and Soft Computing EUFIT '99, 1999.
[11] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for modeling and control design. International Journal of Control, 72(7/8):643–658, 1999.
[12] G. Bontempi, M. Birattari, and H. Bersini. Local learning for iterated time-series prediction. In I. Bratko and S. Dzeroski, editors, Machine Learning: Proceedings of the Sixteenth International Conference, pages 32–38, San Francisco, CA, 1999. Morgan Kaufmann Publishers.
[13] G. Bontempi, M. Birattari, and H. Bersini. A model selection approach for local learning. Artificial Intelligence Communications, 121(1), 2000.
[14] G. Bontempi, M. Birattari, and P. E. Meyer. Combining lazy learning, racing and subsampling for effective feature selection. In Proceedings of the International Conference on Adaptive and Natural Computing Algorithms. Springer Verlag, 2005. To appear.
[15] G. Bontempi, O. Caelen, S. Pierret, and C. Goffaux. On the use of supervised learning techniques to speed up the design of aeronautics components. WSEAS Transactions on Systems, 10(3):3098–3103, 2005.
[16] G. Bontempi and W. Kruijtzer. The use of intelligent data analysis techniques for system-level design: a software estimation example. Soft Computing, 8(7):477–490, 2004.
[17] G. Bontempi and G. Lafruit. Enabling multimedia QoS control with black-box modelling. In D. Bustard, W. Liu, and R. Sterritt, editors, Soft-Ware 2002: Computing in an Imperfect World, Lecture Notes in Computer Science, pages 46–59, 2002.
[18] G. Bontempi, A. Vaccaro, and D. Villacci. A semi-physical modelling architecture for dynamic assessment of power components loading capability. IEE Proceedings of Generation, Transmission and Distribution, 151(4):533–542, 2004.
[19] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984.
[20] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74:829–836, 1979.
[21] W. S. Cleveland and S. J. Devlin. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83:596–610, 1988.
[22] W. S. Cleveland and C. Loader. Smoothing by local regression: Principles and methods. Computational Statistics, 11, 1995.
[23] G. Corani. Air quality prediction in Milan: feed-forward neural networks, pruned neural networks and lazy learning. Ecological Modelling, 2005. In press.
[24] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, pages 21–27, 1967.
[25] J. Fan and I. Gijbels. Adaptive order polynomial fitting: bandwidth robustification and bias reduction. Journal of Computational and Graphical Statistics, 4:213–227, 1995.
[26] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman and Hall, 1996.
[27] W. Hardle and J. S. Marron. Fast and simple scatterplot smoothing. Computational Statistics and Data Analysis, 20:1–17, 1995.
[28] R. Henderson. Note on graduation by adjusted average. Transactions of the Actuarial Society of America, 17:43–48, 1916.
[29] G. H. John and P. Langley. Static versus dynamic sampling for data mining. In Proceedings of the Second International Conference on Knowledge Discovery in Databases and Data Mining. AAAI/MIT Press, 1996.
[30] M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 90, 1995.
[31] V. Y. Katkovnik. Linear and nonlinear methods of nonparametric regression analysis. Soviet Automatic Control, 5:25–34, 1979.
[32] C. Loader. Local Regression and Likelihood. Springer, New York, 1999.
[33] C. R. Loader. Old faithful erupts: Bandwidth selection reviewed. Technical report, Bell Labs, 1987.
[34] O. Maron and A. Moore. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11(1–5):193–225, 1997.
[35] T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
[36] R. Murray-Smith and T. A. Johansen. Local learning in local model networks. In R. Murray-Smith and T. A. Johansen, editors, Multiple Model Approaches to Modeling and Control, chapter 7, pages 185–210. Taylor and Francis, 1997.
[37] R. H. Myers. Classical and Modern Regression with Applications. PWS-KENT Publishing Company, Boston, MA, second edition, 1994.
[38] B. U. Park and J. S. Marron. Comparison of data-driven bandwidth selectors. Journal of the American Statistical Association, 85:66–72, 1990.
[39] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for hybrid neural networks. In R. J. Mammone, editor, Artificial Neural Networks for Speech and Vision, pages 126–142. Chapman and Hall, 1993.
[40] J. Rice. Bandwidth choice for nonparametric regression. The Annals of Statistics, 12:1215–1230, 1984.
[41] D. Ruppert, S. J. Sheather, and M. P. Wand. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90:1257–1270, 1995.
[42] G. V. Schiaparelli. Sul modo di ricavare la vera espressione delle leggi della natura dalle curve empiricae. Effemeridi Astronomiche di Milano per l'Anno, 857:3–56, 1886.
[43] C. Stone. Consistent nonparametric regression. The Annals of Statistics, 5:595–645, 1977.
[44] D. Villacci, G. Bontempi, A. Vaccaro, and M. Birattari. The role of learning methods in the dynamic assessment of power components loading capability. IEEE Transactions on Industrial Electronics, 52(1), 2005.
[45] G. Wahba and S. Wold. A completely automatic french curve: Fitting spline functions by cross-validation. Communications in Statistics, 4(1), 1975.
[46] D. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992.
[47] M. Woodrofe. On choosing a delta-sequence. Annals of Mathematical Statistics, 41:1665–1671, 1970.
UmeshchandraYadav5
 
Post Graduate Admission Prediction System
Post Graduate Admission Prediction SystemPost Graduate Admission Prediction System
Post Graduate Admission Prediction System
IRJET Journal
 
Declarative data analysis
Declarative data analysisDeclarative data analysis
Declarative data analysis
South West Data Meetup
 
AIAA-SDM-SequentialSampling-2012
AIAA-SDM-SequentialSampling-2012AIAA-SDM-SequentialSampling-2012
AIAA-SDM-SequentialSampling-2012
OptiModel
 
AIAA-MAO-DSUS-2012
AIAA-MAO-DSUS-2012AIAA-MAO-DSUS-2012
AIAA-MAO-DSUS-2012
OptiModel
 
Rides Request Demand Forecast- OLA Bike
Rides Request Demand Forecast- OLA BikeRides Request Demand Forecast- OLA Bike
Rides Request Demand Forecast- OLA Bike
IRJET Journal
 

Similar to Local modeling in regression and time series prediction (20)

Autonomy Incubator Seminar Series: Tractable Robust Planning and Model Learni...
Autonomy Incubator Seminar Series: Tractable Robust Planning and Model Learni...Autonomy Incubator Seminar Series: Tractable Robust Planning and Model Learni...
Autonomy Incubator Seminar Series: Tractable Robust Planning and Model Learni...
 
Computational optimization, modelling and simulation: Recent advances and ove...
Computational optimization, modelling and simulation: Recent advances and ove...Computational optimization, modelling and simulation: Recent advances and ove...
Computational optimization, modelling and simulation: Recent advances and ove...
 
Dj4201737746
Dj4201737746Dj4201737746
Dj4201737746
 
Dolap13 v9 7.docx
Dolap13 v9 7.docxDolap13 v9 7.docx
Dolap13 v9 7.docx
 
achine Learning and Model Risk
achine Learning and Model Riskachine Learning and Model Risk
achine Learning and Model Risk
 
DSUS_MAO_2012_Jie
DSUS_MAO_2012_JieDSUS_MAO_2012_Jie
DSUS_MAO_2012_Jie
 
MACHINE LEARNING FOR SATELLITE-GUIDED WATER QUALITY MONITORING
MACHINE LEARNING FOR SATELLITE-GUIDED WATER QUALITY MONITORINGMACHINE LEARNING FOR SATELLITE-GUIDED WATER QUALITY MONITORING
MACHINE LEARNING FOR SATELLITE-GUIDED WATER QUALITY MONITORING
 
Tutorial on Deep Generative Models
 Tutorial on Deep Generative Models Tutorial on Deep Generative Models
Tutorial on Deep Generative Models
 
Predictive Analytics for Transportation in a High Dimensional Heterogeneous D...
Predictive Analytics for Transportation in a High Dimensional Heterogeneous D...Predictive Analytics for Transportation in a High Dimensional Heterogeneous D...
Predictive Analytics for Transportation in a High Dimensional Heterogeneous D...
 
zdx
zdxzdx
zdx
 
ASS_SDM2012_Ali
ASS_SDM2012_AliASS_SDM2012_Ali
ASS_SDM2012_Ali
 
Chapter_6_Prescriptive_Analytics_Optimization_and_Simulation.pptx.pdf
Chapter_6_Prescriptive_Analytics_Optimization_and_Simulation.pptx.pdfChapter_6_Prescriptive_Analytics_Optimization_and_Simulation.pptx.pdf
Chapter_6_Prescriptive_Analytics_Optimization_and_Simulation.pptx.pdf
 
HPC Deployment / Use Cases (EVEREST + DAPHNE: Workshop on Design and Programm...
HPC Deployment / Use Cases (EVEREST + DAPHNE: Workshop on Design and Programm...HPC Deployment / Use Cases (EVEREST + DAPHNE: Workshop on Design and Programm...
HPC Deployment / Use Cases (EVEREST + DAPHNE: Workshop on Design and Programm...
 
[20240318_LabSeminar_Huy]GSTNet: Global Spatial-Temporal Network for Traffic ...
[20240318_LabSeminar_Huy]GSTNet: Global Spatial-Temporal Network for Traffic ...[20240318_LabSeminar_Huy]GSTNet: Global Spatial-Temporal Network for Traffic ...
[20240318_LabSeminar_Huy]GSTNet: Global Spatial-Temporal Network for Traffic ...
 
A value added predictive defect type distribution model
A value added predictive defect type distribution modelA value added predictive defect type distribution model
A value added predictive defect type distribution model
 
Post Graduate Admission Prediction System
Post Graduate Admission Prediction SystemPost Graduate Admission Prediction System
Post Graduate Admission Prediction System
 
Declarative data analysis
Declarative data analysisDeclarative data analysis
Declarative data analysis
 
AIAA-SDM-SequentialSampling-2012
AIAA-SDM-SequentialSampling-2012AIAA-SDM-SequentialSampling-2012
AIAA-SDM-SequentialSampling-2012
 
AIAA-MAO-DSUS-2012
AIAA-MAO-DSUS-2012AIAA-MAO-DSUS-2012
AIAA-MAO-DSUS-2012
 
Rides Request Demand Forecast- OLA Bike
Rides Request Demand Forecast- OLA BikeRides Request Demand Forecast- OLA Bike
Rides Request Demand Forecast- OLA Bike
 

More from Gianluca Bontempi

A statistical criterion for reducing indeterminacy in linear causal modeling
A statistical criterion for reducing indeterminacy in linear causal modelingA statistical criterion for reducing indeterminacy in linear causal modeling
A statistical criterion for reducing indeterminacy in linear causal modeling
Gianluca Bontempi
 
Adaptive model selection in Wireless Sensor Networks
Adaptive model selection in Wireless Sensor NetworksAdaptive model selection in Wireless Sensor Networks
Adaptive model selection in Wireless Sensor Networks
Gianluca Bontempi
 
Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection
Combining Lazy Learning, Racing and Subsampling for Effective Feature SelectionCombining Lazy Learning, Racing and Subsampling for Effective Feature Selection
Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection
Gianluca Bontempi
 
A model-based relevance estimation approach for feature selection in microarr...
A model-based relevance estimation approach for feature selection in microarr...A model-based relevance estimation approach for feature selection in microarr...
A model-based relevance estimation approach for feature selection in microarr...
Gianluca Bontempi
 
Machine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series PredictionMachine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series Prediction
Gianluca Bontempi
 
Feature selection and microarray data
Feature selection and microarray dataFeature selection and microarray data
Feature selection and microarray data
Gianluca Bontempi
 
A Monte Carlo strategy for structure multiple-step-head time series prediction
A Monte Carlo strategy for structure multiple-step-head time series predictionA Monte Carlo strategy for structure multiple-step-head time series prediction
A Monte Carlo strategy for structure multiple-step-head time series prediction
Gianluca Bontempi
 
Some Take-Home Message about Machine Learning
Some Take-Home Message about Machine LearningSome Take-Home Message about Machine Learning
Some Take-Home Message about Machine Learning
Gianluca Bontempi
 
FP7 evaluation & selection: the point of view of an evaluator
FP7 evaluation & selection: the point of view of an evaluatorFP7 evaluation & selection: the point of view of an evaluator
FP7 evaluation & selection: the point of view of an evaluator
Gianluca Bontempi
 
Perspective of feature selection in bioinformatics
Perspective of feature selection in bioinformaticsPerspective of feature selection in bioinformatics
Perspective of feature selection in bioinformatics
Gianluca Bontempi
 
Computational Intelligence for Time Series Prediction
Computational Intelligence for Time Series PredictionComputational Intelligence for Time Series Prediction
Computational Intelligence for Time Series Prediction
Gianluca Bontempi
 

More from Gianluca Bontempi (11)

A statistical criterion for reducing indeterminacy in linear causal modeling
A statistical criterion for reducing indeterminacy in linear causal modelingA statistical criterion for reducing indeterminacy in linear causal modeling
A statistical criterion for reducing indeterminacy in linear causal modeling
 
Adaptive model selection in Wireless Sensor Networks
Adaptive model selection in Wireless Sensor NetworksAdaptive model selection in Wireless Sensor Networks
Adaptive model selection in Wireless Sensor Networks
 
Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection
Combining Lazy Learning, Racing and Subsampling for Effective Feature SelectionCombining Lazy Learning, Racing and Subsampling for Effective Feature Selection
Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection
 
A model-based relevance estimation approach for feature selection in microarr...
A model-based relevance estimation approach for feature selection in microarr...A model-based relevance estimation approach for feature selection in microarr...
A model-based relevance estimation approach for feature selection in microarr...
 
Machine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series PredictionMachine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series Prediction
 
Feature selection and microarray data
Feature selection and microarray dataFeature selection and microarray data
Feature selection and microarray data
 
A Monte Carlo strategy for structure multiple-step-head time series prediction
A Monte Carlo strategy for structure multiple-step-head time series predictionA Monte Carlo strategy for structure multiple-step-head time series prediction
A Monte Carlo strategy for structure multiple-step-head time series prediction
 
Some Take-Home Message about Machine Learning
Some Take-Home Message about Machine LearningSome Take-Home Message about Machine Learning
Some Take-Home Message about Machine Learning
 
FP7 evaluation & selection: the point of view of an evaluator
FP7 evaluation & selection: the point of view of an evaluatorFP7 evaluation & selection: the point of view of an evaluator
FP7 evaluation & selection: the point of view of an evaluator
 
Perspective of feature selection in bioinformatics
Perspective of feature selection in bioinformaticsPerspective of feature selection in bioinformatics
Perspective of feature selection in bioinformatics
 
Computational Intelligence for Time Series Prediction
Computational Intelligence for Time Series PredictionComputational Intelligence for Time Series Prediction
Computational Intelligence for Time Series Prediction
 

Recently uploaded

chapter one 1 cloudcomputing .pptx someone
chapter one 1 cloudcomputing .pptx someonechapter one 1 cloudcomputing .pptx someone
chapter one 1 cloudcomputing .pptx someone
abeeeeeeeer588
 
Hadoop Vs Snowflake Blog PDF Submission.pptx
Hadoop Vs Snowflake Blog PDF Submission.pptxHadoop Vs Snowflake Blog PDF Submission.pptx
Hadoop Vs Snowflake Blog PDF Submission.pptx
dewsharon760
 
emotional interface - dehligame satta for you
emotional interface  -  dehligame satta for youemotional interface  -  dehligame satta for you
emotional interface - dehligame satta for you
bkldehligame1
 
KeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZ
KeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZKeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZ
KeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZ
jp3113ig
 
Toward a National Research Platform to Enable Data-Intensive Computing
Toward a National Research Platform to Enable Data-Intensive ComputingToward a National Research Platform to Enable Data-Intensive Computing
Toward a National Research Platform to Enable Data-Intensive Computing
Larry Smarr
 
Indian KS Unit 2 Mathematicians (1).pptx
Indian KS Unit 2 Mathematicians (1).pptxIndian KS Unit 2 Mathematicians (1).pptx
Indian KS Unit 2 Mathematicians (1).pptx
Nikita Gaikwad
 
ChessMaster Project Presentation for Batch 1643.pptx
ChessMaster Project Presentation for Batch 1643.pptxChessMaster Project Presentation for Batch 1643.pptx
ChessMaster Project Presentation for Batch 1643.pptx
duduphc
 
Data Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 SolutionsData Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 Solutions
District 11 Solutions
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptxParcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
AltanAtabarut
 
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptxSAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
wojakmodern
 
Graph Machine Learning - Past, Present, and Future -
Graph Machine Learning - Past, Present, and Future -Graph Machine Learning - Past, Present, and Future -
Graph Machine Learning - Past, Present, and Future -
kashipong
 
19328-48051-2-PB.pdf jurnal ttg analisis
19328-48051-2-PB.pdf jurnal ttg analisis19328-48051-2-PB.pdf jurnal ttg analisis
19328-48051-2-PB.pdf jurnal ttg analisis
IndahMaimunah1
 
Unit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptxUnit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptx
Priyanka Jadhav
 
Data management and excel appication.pptx
Data management and excel appication.pptxData management and excel appication.pptx
Data management and excel appication.pptx
OlabodeSamuel3
 
Systane Global education training centre
Systane Global education training centreSystane Global education training centre
Systane Global education training centre
AkhinaRomdoni
 
一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理
一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理
一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理
ks1ni2di
 
Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635
HeidiLivengood
 
一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理
一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理
一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理
da42ki0
 
Accounting and Auditing Laws-Rules-and-Regulations
Accounting and Auditing Laws-Rules-and-RegulationsAccounting and Auditing Laws-Rules-and-Regulations
Accounting and Auditing Laws-Rules-and-Regulations
DALubis
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdfParcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
AltanAtabarut
 

Recently uploaded (20)

chapter one 1 cloudcomputing .pptx someone
chapter one 1 cloudcomputing .pptx someonechapter one 1 cloudcomputing .pptx someone
chapter one 1 cloudcomputing .pptx someone
 
Hadoop Vs Snowflake Blog PDF Submission.pptx
Hadoop Vs Snowflake Blog PDF Submission.pptxHadoop Vs Snowflake Blog PDF Submission.pptx
Hadoop Vs Snowflake Blog PDF Submission.pptx
 
emotional interface - dehligame satta for you
emotional interface  -  dehligame satta for youemotional interface  -  dehligame satta for you
emotional interface - dehligame satta for you
 
KeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZ
KeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZKeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZ
KeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZ
 
Toward a National Research Platform to Enable Data-Intensive Computing
Toward a National Research Platform to Enable Data-Intensive ComputingToward a National Research Platform to Enable Data-Intensive Computing
Toward a National Research Platform to Enable Data-Intensive Computing
 
Indian KS Unit 2 Mathematicians (1).pptx
Indian KS Unit 2 Mathematicians (1).pptxIndian KS Unit 2 Mathematicians (1).pptx
Indian KS Unit 2 Mathematicians (1).pptx
 
ChessMaster Project Presentation for Batch 1643.pptx
ChessMaster Project Presentation for Batch 1643.pptxChessMaster Project Presentation for Batch 1643.pptx
ChessMaster Project Presentation for Batch 1643.pptx
 
Data Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 SolutionsData Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 Solutions
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptxParcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
 
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptxSAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
 
Graph Machine Learning - Past, Present, and Future -
Graph Machine Learning - Past, Present, and Future -Graph Machine Learning - Past, Present, and Future -
Graph Machine Learning - Past, Present, and Future -
 
19328-48051-2-PB.pdf jurnal ttg analisis
19328-48051-2-PB.pdf jurnal ttg analisis19328-48051-2-PB.pdf jurnal ttg analisis
19328-48051-2-PB.pdf jurnal ttg analisis
 
Unit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptxUnit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptx
 
Data management and excel appication.pptx
Data management and excel appication.pptxData management and excel appication.pptx
Data management and excel appication.pptx
 
Systane Global education training centre
Systane Global education training centreSystane Global education training centre
Systane Global education training centre
 
一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理
一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理
一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理
 
Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635
 
一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理
一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理
一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理
 
Accounting and Auditing Laws-Rules-and-Regulations
Accounting and Auditing Laws-Rules-and-RegulationsAccounting and Auditing Laws-Rules-and-Regulations
Accounting and Auditing Laws-Rules-and-Regulations
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdfParcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
 

Local modeling in regression and time series prediction

  • 1. On the use of cross-validation for local modeling in regression and time series prediction Gianluca Bontempi gbonte@ulb.ac.be Machine Learning Group Departement d’Informatique, ULB Boulevard de Triomphe - CP 212 http://www.ulb.ac.be/di/mlg On the use of cross-validation for local modeling in regression and time series prediction – p.1/75
  • 2. Outline   The Machine Learning Group  A local learning algorithm: the Lazy Learning.   Lazy Learning for multivariate regression modeling.   Lazy Learning for multi-step-ahead time series prediction.   Lazy Learning for feature selection.   Applications.   Future work. On the use of cross-validation for local modeling in regression and time series prediction – p.2/75
  • 3. Machine Learning: a definition The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience. [35] On the use of cross-validation for local modeling in regression and time series prediction – p.3/75
  • 4. The Machine Learning Group (MLG) ¡ 7 researchers (1 prof, 6 PhD students), 4 graduate students).¡ Research topics: Bioinformatics, Classification, Computational statistics, Data mining, Regression, Time series prediction, Sensor networks. ¡ Computing facilities: cluster of 16 processors, LEGO Robotics Lab. ¡ Website: www.ulb.ac.be/di/mlg. ¡ Scientific collaborations in ULB: IRIDIA (Sciences Appliquées), Physiologie Moléculaire de la Cellule (IBMM), Conformation des Macromolécules Biologiques et Bioinformatique (IBMM), CENOLI (Sciences), Microarray Unit (Hopital Jules Bordet), Service d’Anesthesie (ERASME). ¡ Scientific collaborations outside ULB: UCL Machine Learning Group (B), Politecnico di Milano (I), Universitá del Sannio (I), George Mason University (US). ¡ The MLG is part to the "Groupe de Contact FNRS" on Machine Learning. On the use of cross-validation for local modeling in regression and time series prediction – p.4/75
  • 5. MLG: running projects 1. "Integrating experimental and theoretical approaches to decipher the molecular networks of nitrogen utilisation in yeast": ARC (Action de Recherche Concertée) funded by the Communauté Française de Belgique (2004-2009). Partners: IBMM (Gosselies and La Plaine), CENOLI. 2. "COMP2SYS" (COMPutational intelligence methods for COMPlex SYStems) MARIE CURIE Early Stage Research Training funded by the European Union (2004-2008). Main contractor: IRIDIA (ULB). 3. "Predictive data mining techniques in anaesthesia": FIRST Europe Objectif 1 funded by the Région wallonne and the Fonds Social Européen (2004-2009). Partners: Service d’anesthesie (ERASME). 4. "AIDAR - Adressage et Indexation de Documents Multimédias Assistés par des techniques de Reconnaissance Vocale": funded by Région Bruxelles-Capitale (2004-2006). Partners: Voice Insight, RTBF, Titan. On the use of cross-validation for local modeling in regression and time series prediction – p.5/75
  • 6. Machine learning and applied statistics Reductionist attitude: ML is a modern buzzword which equates to statistics plus marketing Positive attitude: ML paved the way to the treatment of real problems related to data analysis, sometimes overlooked by statisticians (nonlinearity, classification, pattern recognition, missing variables, adaptivity, optimization, massive datasets, data management, causality, representation of knowledge, parallelisation) Interdisciplinary attitude: ML should have its roots on statistics and complements it by focusing on: algorithmic issues, computational efficiency, data engineering. On the use of cross-validation for local modeling in regression and time series prediction – p.6/75
  • 7. Motivations
    - There exists a wide body of theoretical and practical results for linear methods in statistics, forecasting and control.
    - However, in real settings we often encounter nonlinear problems.
    - Nonlinear methods are generally more difficult to analyze than linear ones, rarely produce closed-form or analytically tractable expressions, and are not easy to manipulate and implement.
    - Local learning techniques are a powerful way of re-using linear techniques in a nonlinear setting.
  • 8. Prediction models from data
    [Block diagram: training data are used to build a prediction model; the model maps inputs to predicted outputs, and the prediction error with respect to the target drives the training.]
  • 9. Regression setting
    - Multidimensional input $x \in \mathbb{R}^n$ and scalar output $y \in \mathbb{R}$, with $y = f(x) + \varepsilon$, where $f$ is the unknown regression function and $\varepsilon$ is the random error term.
    - A finite number of noisy input/output observations (training set).
    - A test set of input values for which an accurate generalization or prediction of the output is required.
    - A learning machine which returns an input/output model on the basis of the training set.
    Assumption: no a priori knowledge on the process underlying the data.
  • 10. The global modeling approach
    [Figure: input-output regression problem in the (x, y) plane with a query point q.]
  • 11. The global modeling approach
    [Figure: training data set.]
  • 12. The global modeling approach
    [Figure: global model fitting.]
  • 13. The global modeling approach
    [Figure: prediction at the query point by using the fitted global model.]
  • 14. The global modeling approach
    [Figure: another prediction by using the fitted global model.]
  • 15. The local modeling approach
    [Figure: input-output regression problem in the (x, y) plane with a query point q.]
  • 16. The local modeling approach
    [Figure: training data set.]
  • 17. The local modeling approach
    [Figure: local fitting and prediction in a neighborhood of the query point.]
  • 18. The local modeling approach
    [Figure: another local fitting and prediction for a different query point.]
  • 19. Global vs. local modeling
    - The traditional approach to supervised learning is global modeling, which describes the relationship between the input and the output with an analytical function over the whole input domain.
    - Even for huge datasets, a parametric model can be stored in a small memory. Also, the evaluation of the parametric model requires a short program that can be executed in a reduced amount of time.
    - Modeling complex input/output relations often requires the adoption of global nonlinear models, whose learning procedures are typically slow and analytically intractable. In particular, validation methods, which address the problem of assessing a global model on the basis of a finite amount of noisy samples, are computationally prohibitive.
    - For these reasons, in recent years interest has grown in pursuing (divide-and-conquer) alternatives to global modeling techniques.
  • 20. Global vs. local modeling
    - The divide-and-conquer strategy consists in attacking a complex problem by dividing it into simpler problems whose solutions can be combined to yield a solution to the original problem.
    - Instances of the divide-and-conquer approach are modular techniques (e.g. local model networks [36], regression trees [19], splines [45]) and local modeling (aka smoothing) techniques.
    - The principle underlying local modeling is that a smooth function can be well approximated by a low-degree polynomial in the neighborhood of any query point.
    - Local modeling techniques do not return a global fit of the available dataset but perform the prediction of the output for specific test input values, also called queries.
    - The talk presents our contribution to local modeling techniques and their application to a number of experimental problems.
  • 21. Lazy vs. eager modeling
    - Eager techniques perform a large amount of computation for tuning the model before observing the new query.
    - An eager technique must therefore commit to a specific hypothesis that covers all future queries.
    - Lazy techniques [1] wait for the query to be defined before starting the learning procedure.
    - For that purpose, the database of observed input/output data is always kept in memory and the output prediction is obtained by interpolating the samples in the neighborhood of the query point.
    - Lazy methods generally require less computation during training but more computation when they must predict the target value for a new query.
  • 22. Examples
    - Classical linear regression is an example of a global, eager, and linear approach.
    - Neural networks (NN) are instances of the global, eager, and nonlinear approach: NN are global in the sense that a single representation covers the whole input space. They are eager in the sense that the examples are used for tuning the network and are then discarded without waiting for any query. Finally, NN are nonlinear in the sense that the relation between the weights and the output is nonlinear.
    - The technique we are going to discuss here is a lazy and local approach.
    - Remark: we can imagine a local technique (e.g. a K-nearest neighbor) where the most important parameter (i.e. the number of neighbors) is defined in an eager fashion.
  • 23. Some history
    - Local regression estimation was independently introduced in several different fields in the late nineteenth [42] and early twentieth century [28].
    - In the statistical literature, the method was independently introduced from different viewpoints in the late 1970s [20, 31, 43].
    - Reference books are Fan and Gijbels [26] and Loader [32].
    - In the machine learning literature, work on local techniques for classification dates back to 1967 [24]. A more recent reference is the special issue on Lazy Learning [1].
  • 24. Local modeling procedure
    The identification of a local model [3] can be summarized in these steps (see the sketch below):
    1. Compute the distance between the query and the training samples according to a predefined metric.
    2. Rank the neighbors on the basis of their distance to the query.
    3. Select a subset of the nearest neighbors according to the bandwidth, which measures the size of the neighborhood.
    4. Fit a local model (e.g. constant, linear, ...).
    Each of the local approaches has one or more structural (or smoothing) parameters that control the amount of smoothing performed. In this talk we will focus on bandwidth selection.
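A minimal sketch of these four steps, assuming a Euclidean metric and a rectangular kernel; the function name and the NumPy-based implementation are illustrative, not the original Lazy Learning code:

```python
import numpy as np

def local_predict(X, y, x_query, k, degree=1):
    """Steps 1-4: distance, ranking, neighbor selection, local fit."""
    # 1. distance between the query and the training samples (Euclidean metric)
    d = np.sqrt(((X - x_query) ** 2).sum(axis=1))
    # 2. rank the neighbors on the basis of their distance to the query
    order = np.argsort(d)
    # 3. select the k nearest neighbors (bandwidth = distance of the k-th neighbor)
    idx = order[:k]
    Xk, yk = X[idx], y[idx]
    # 4. fit a local model: constant (degree 0) or linear (degree 1)
    if degree == 0:
        return yk.mean()
    Zk = np.hstack([Xk, np.ones((k, 1))])           # local design matrix with intercept
    beta, *_ = np.linalg.lstsq(Zk, yk, rcond=None)  # local least squares fit
    return np.append(x_query, 1.0) @ beta
```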
  • 25. The bandwidth trade-off: overfit
    [Figure: a very narrow neighborhood around the query point q and the resulting local fit.]
    Too narrow a bandwidth leads to overfitting and a large prediction error e. In terms of the bias/variance trade-off, this is typically a situation of high variance.
  • 26. The bandwidth trade-off: underfit
    [Figure: a very large neighborhood around the query point q and the resulting local fit.]
    Too large a bandwidth leads to underfitting and a large prediction error e. In terms of the bias/variance trade-off, this is typically a situation of high bias.
  • 27. Bandwidth and bias/variance trade-off
    [Figure: mean squared error versus 1/bandwidth; the bias term dominates when many neighbors are used (underfitting), the variance term dominates when few neighbors are used (overfitting).]
  • 28. Existing work on bandwidth selection
    Rule-of-thumb methods. They provide a crude bandwidth selection which in some situations may turn out to be sufficient. Examples of rules of thumb are in [25], [27].
    Plug-in techniques. The exact expression of the optimal bandwidth can be obtained from the asymptotic expressions of bias and variance, which unfortunately depend on unknown terms. The idea of the direct plug-in method is to replace these terms with estimates. This method was first introduced by Woodrofe [47] in density estimation. Examples of plug-in methods for nonparametric regression are reported in Ruppert et al. [41].
    Data-driven estimation. A selection procedure which estimates the generalization error directly from the data. Unlike the previous approach, this method does not rely on asymptotic expressions but estimates the values directly from the finite data set. To this group belong methods like cross-validation, Mallows' $C_p$, Akaike's AIC and other extensions of methods used in classical parametric modeling.
  • 29. Existing work (II)
    - The debate on the superiority of plug-in methods over data-driven methods is still open and the experimental evidence is contrasting. Results on behalf of plug-in methods come from [47, 41, 38].
    - Loader [33] showed how the supposed superior performance of plug-in approaches is a complete myth. The use of cross-validation for bandwidth selection has been investigated in several papers, mainly in the case of density estimation [30].
    - In regression, an adaptation of Mallows' $C_p$ was introduced by Rice [40] for constant fitting and by Cleveland and Devlin [21] in local polynomial regression. Cleveland and Loader [22] suggested local $C_p$ and local PRESS for choosing both the degree of the local polynomial mixing and the bandwidth.
    - We believe that plug-in methods are built on a series of assumptions about the statistical process underlying the data set and on theoretical results which become more reliable as the number of points tends to infinity.
    - In a common black-box situation where no a priori information is available, the adoption of data-driven techniques is a promising approach to the problem.
  • 30. Data-driven bandwidth selection
    [Diagram: for a given query, local weighted regressions are identified on the training set for a range of bandwidths; each candidate model $\hat\beta(k)$ is assessed by its leave-one-out error $\widehat{\text{MSE}}_{\text{loo}}(k)$, the structural identification (model selection) picks the best bandwidth, and the selected model produces the prediction $\hat y_q$.]
  • 31. Original contributions
    Problem 1: identifying a sequence of local models is expensive.
    Solution 1: we propose recursive least squares (RLS) to speed up the identification of a sequence of models with an increasing number of neighbors [6, 13].
    Problem 2: validating a local model by cross-validation is expensive.
    Solution 2: we compute the leave-one-out cross-validation by obtaining the PRESS statistic through the terms of RLS [9].
    Problem 3: choosing the best model is prone to errors.
    Solution 3: we combine the best models [7].
  • 32. Recursive-least-squares in space
    [Figure: identifying the sequence of local models $\hat\beta(k_m), \hat\beta(k_{m+1}), \ldots, \hat\beta(k_M)$ from scratch for each neighborhood size is slow; updating them recursively (RLS) as one more neighbor is added is fast.]
  • 33. PRESS statistic and leave-one-out
    [Diagram: leave-one-out requires N parametric identifications, each on N-1 samples with the j-th sample put aside and used for testing; the PRESS statistic returns the same errors from a single parametric identification on all N samples.]
    PRESS was first introduced by Allen [2].
  • 34. The regression task
    Given two variables $x \in \mathbb{R}^n$ and $y \in \mathbb{R}$, let us consider the mapping $f: \mathbb{R}^n \to \mathbb{R}$, known only through a set of $N$ examples $\{(x_i, y_i)\}_{i=1}^{N}$ obtained as follows:
    $y_i = f(x_i) + \varepsilon_i$,
    where, for each $i$,
    - $\varepsilon_i$ is a random variable such that $E[\varepsilon_i] = 0$ and $E[\varepsilon_i \varepsilon_j] = 0$ for all $j \neq i$,
    - $E[\varepsilon_i^m] = \mu_m(x_i)$ for all $m \geq 2$, where $\mu_m(\cdot)$ is the unknown $m$-th moment of the distribution of $\varepsilon_i$ and is defined as a function of $x_i$.
    In particular, for $m = 2$, the last of the above properties implies that no assumption of global homoscedasticity is made.
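Purely as an illustration of these assumptions (the function f and the noise model below are invented for the example, not taken from the talk), data with an input-dependent noise variance can be simulated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 200, 2
X = rng.uniform(-1, 1, size=(N, n))

def f(x):
    # hypothetical regression function used only for this illustration
    return np.sin(3 * x[0]) + x[1] ** 2

def noise_std(x):
    # heteroscedastic noise: the second moment depends on x, as allowed above
    return 0.05 + 0.2 * abs(x[0])

y = np.array([f(x) + rng.normal(0.0, noise_std(x)) for x in X])
```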
  • 35. Local Weighted Regression
    - The problem of local regression can be stated as the problem of estimating the value that the regression function $f(x) = E[y \mid x]$ assumes for a specific query point $x$, using information pertaining only to a neighborhood of $x$.
    - Given a query point $x_q$, and under the hypothesis of local homoscedasticity of $\varepsilon_i$, the parameter $\beta$ of a local linear approximation of $f(\cdot)$ in a neighborhood of $x_q$ can be obtained by solving the local polynomial regression:
    $\min_\beta \; \sum_{i=1}^{N} \left( y_i - x_i^T \beta \right)^2 K\!\left( \frac{d(x_i, x_q)}{h} \right)$
    where, given a metric on the space $\mathbb{R}^n$,
    - $d(x_i, x_q)$ is the distance from the query point to the $i$-th example, $i = 1, \ldots, N$,
    - $K(\cdot)$ is a weight (aka kernel) function,
    - $h$ is the bandwidth.
  • 36. Local Weighted Regression (II)
    - In matrix notation, the solution of the above stated weighted least squares problem is given by:
    $\hat\beta = (X^T W^T W X)^{-1} X^T W^T W y = (Z^T Z)^{-1} Z^T v = P Z^T v$,
    where $X$ is the matrix whose $i$-th row is $x_i^T$, $y$ is the vector whose $i$-th element is $y_i$, $W$ is the diagonal matrix whose $i$-th diagonal element is $w_{ii} = K(d(x_i, x_q)/h)$, $Z = W X$, $v = W y$, and the matrix $X^T W^T W X = Z^T Z$ is assumed to be non-singular so that its inverse $P = (Z^T Z)^{-1}$ is defined.
    - Once the local linear polynomial approximation is obtained, a prediction of $y_q = f(x_q)$ is finally given by:
    $\hat y_q = x_q^T \hat\beta$.
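A minimal sketch of this weighted least squares solution, assuming a Gaussian kernel and a Euclidean distance (both illustrative choices, as are the function and variable names):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2)

def locally_weighted_prediction(X, y, x_q, h, kernel=gaussian_kernel):
    """Local linear fit around the query x_q with bandwidth h."""
    N = X.shape[0]
    Xd = np.hstack([X, np.ones((N, 1))])        # design matrix with intercept
    d = np.sqrt(((X - x_q) ** 2).sum(axis=1))   # distances d(x_i, x_q)
    w = kernel(d / h)                           # diagonal of W
    Z = w[:, None] * Xd                         # Z = W X
    v = w * y                                   # v = W y
    P = np.linalg.inv(Z.T @ Z)                  # P = (Z'Z)^{-1}, assumed non-singular
    beta = P @ Z.T @ v                          # beta_hat = P Z' v
    y_hat = np.append(x_q, 1.0) @ beta          # prediction x_q' beta_hat
    return y_hat, beta, P, Z, v
```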
  • 37. Linear leave-one-out
    - By exploiting the linearity of the local approximator, a leave-one-out cross-validation estimate of the mean squared error $E[(f(x_q) - \hat y_q)^2]$ can be obtained without any significant overhead.
    - In fact, using the PRESS statistic [2, 37], it is possible to calculate the error $e_{cv}(j) = y_j - x_j^T \hat\beta_{-j}$ without explicitly identifying the parameters $\hat\beta_{-j}$ from the examples available with the $j$-th removed.
    - The formulation of the PRESS statistic for the case at hand is the following:
    $e_{cv}(j) = y_j - x_j^T \hat\beta_{-j} = \frac{y_j - x_j^T P Z^T v}{1 - z_j^T P z_j} = \frac{y_j - x_j^T \hat\beta}{1 - h_{jj}}$,
    where $z_j^T$ is the $j$-th row of $Z$ (therefore $z_j = w_{jj} x_j$), and $h_{jj}$ is the $j$-th diagonal element of the Hat matrix $H = Z P Z^T = Z (Z^T Z)^{-1} Z^T$.
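Continuing the sketch above (Z, P and beta are the quantities returned by the illustrative locally_weighted_prediction function), the leave-one-out errors follow from the PRESS formula without any refitting:

```python
import numpy as np

def press_loo_errors(X, y, Z, P, beta):
    """PRESS leave-one-out errors e_cv(j) = (y_j - x_j' beta) / (1 - h_jj)."""
    N = X.shape[0]
    Xd = np.hstack([X, np.ones((N, 1))])        # rows x_j' (with intercept)
    h_diag = np.einsum('ij,jk,ik->i', Z, P, Z)  # diagonal of the Hat matrix Z P Z'
    return (y - Xd @ beta) / (1.0 - h_diag)
```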
  • 38. Rectangular weight function
    - In what follows, for the sake of simplicity, we will focus on the linear approximator. An extension to generic polynomial approximators of any degree is straightforward. We will assume also that a metric on the space $\mathbb{R}^n$ is given. All the attention will thus be centered on the problem of bandwidth selection.
    - If the indicator function
    $K\!\left( \frac{d(x_i, x_q)}{h} \right) = \begin{cases} 1 & \text{if } d(x_i, x_q) \le h, \\ 0 & \text{otherwise} \end{cases}$
    is adopted as the weight function $K(\cdot)$, the optimization of the parameter $h$ can be conveniently reduced to the optimization of the number $k$ of neighbors to which a unitary weight is assigned in the local regression evaluation.
    - In other words, we reduce the problem of bandwidth selection to a search in the space of $h(k) = d(x_{(k)}, x_q)$, where $x_{(k)}$ is the $k$-th nearest neighbor of the query point.
  • 39. Recursive local regression
    The main advantage deriving from the adoption of the rectangular weight function is that, simply by updating the parameter $\hat\beta(k)$ of the model identified using the $k$ nearest neighbors, it is straightforward and inexpensive to obtain $\hat\beta(k+1)$. In fact, performing a step of the standard recursive least squares algorithm [4], we have:
    $P(k+1) = P(k) - \dfrac{P(k)\, x(k+1)\, x^T(k+1)\, P(k)}{1 + x^T(k+1)\, P(k)\, x(k+1)}$
    $\gamma(k+1) = P(k+1)\, x(k+1)$
    $e(k+1) = y(k+1) - x^T(k+1)\, \hat\beta(k)$
    $\hat\beta(k+1) = \hat\beta(k) + \gamma(k+1)\, e(k+1)$
    where $P(k) = (Z^T Z)^{-1}$ when $h = h(k)$, and where $x(k+1)$ is the $(k+1)$-th nearest neighbor of the query point.
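One such update step, written as a small NumPy sketch (x_new and y_new denote the (k+1)-th nearest neighbor, with the intercept term already appended to x_new; the naming is illustrative):

```python
import numpy as np

def rls_step(P, beta, x_new, y_new):
    """One recursive least squares update: from (P(k), beta(k)) to (P(k+1), beta(k+1))."""
    Px = P @ x_new
    denom = 1.0 + x_new @ Px
    P_new = P - np.outer(Px, Px) / denom        # P(k+1)
    gamma = P_new @ x_new                       # gain gamma(k+1)
    e = y_new - x_new @ beta                    # a priori error e(k+1)
    beta_new = beta + gamma * e                 # beta(k+1)
    return P_new, beta_new
```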
  • 40. Recursive PRESS computation
    Moreover, once the matrix $P(k)$ is available, the leave-one-out cross-validation errors can be directly calculated without the need of any further model identification:
    $e_{cv}(j; k) = \dfrac{y_j - x_j^T \hat\beta(k)}{1 - x_j^T P(k)\, x_j}, \qquad j = 1, \ldots, k.$
    Let us define, for each value of $k$, the $k \times 1$ vector $e_{cv}(k)$ that contains all the leave-one-out errors associated with the model $\hat\beta(k)$.
  • 41. Model selection
    - For a given query point $x_q$, the recursive algorithm returns a set of predictions $\hat y_q(k) = x_q^T \hat\beta(k)$, together with a set of associated leave-one-out error vectors $e_{cv}(k)$.
    - If the selection paradigm, frequently called winner-takes-all, is adopted, the most natural way to extract a final prediction $\hat y_q$ consists in comparing the predictions obtained for each value of $k$ on the basis of the classical mean squared error criterion:
    $\hat y_q = x_q^T \hat\beta(\hat k) \quad \text{with} \quad \hat k = \arg\min_k \widehat{\text{MSE}}(k) = \arg\min_k \dfrac{\sum_{j} w_j \left( e_{cv}(j; k) \right)^2}{\sum_{j} w_j}.$
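A sketch of the whole winner-takes-all loop, reusing the illustrative rls_step and PRESS expressions above; kmin, kmax, the small ridge term and the Euclidean ranking are assumptions of the example, not part of the original algorithm specification:

```python
import numpy as np

def lazy_winner_takes_all(X, y, x_q, kmin=3, kmax=30):
    """Grow the neighborhood with RLS and score each k by its leave-one-out MSE."""
    d = np.sqrt(((X - x_q) ** 2).sum(axis=1))
    order = np.argsort(d)
    Xd = np.hstack([X, np.ones((X.shape[0], 1))])
    x_aug = np.append(x_q, 1.0)
    p = Xd.shape[1]
    kmin = max(kmin, p + 1)                 # need more neighbors than parameters
    kmax = min(kmax, len(y))

    # initialize on the first kmin neighbors with an ordinary least squares fit
    idx = order[:kmin]
    Zk = Xd[idx]
    P = np.linalg.inv(Zk.T @ Zk + 1e-8 * np.eye(p))   # small ridge for stability
    beta = P @ Zk.T @ y[idx]

    best = (np.inf, None)
    for k in range(kmin, kmax + 1):
        if k > kmin:                        # add the k-th nearest neighbor by RLS
            j = order[k - 1]
            P, beta = rls_step(P, beta, Xd[j], y[j])
        nb = order[:k]
        h_diag = np.einsum('ij,jk,ik->i', Xd[nb], P, Xd[nb])
        e_cv = (y[nb] - Xd[nb] @ beta) / (1.0 - h_diag)   # recursive PRESS errors
        mse = np.mean(e_cv ** 2)
        if mse < best[0]:                   # winner-takes-all on the loo MSE
            best = (mse, x_aug @ beta)
    return best[1]
```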
  • 42. Local Model combination
    - As an alternative to the winner-takes-all paradigm, we explored also the effectiveness of local combinations of estimates [46].
    - The final prediction of the value $y_q$ is obtained as a weighted average of the best $b$ models, where $b$ is a parameter of the algorithm.
    - Suppose the predictions $\hat y_q(k)$ and the error vectors $e_{cv}(k)$ have been ordered, creating a sequence of integers $\{k_i\}$ so that $\widehat{\text{MSE}}(k_i) \le \widehat{\text{MSE}}(k_j)$ for all $i < j$. The prediction of $y_q$ is given by
    $\hat y_q = \dfrac{\sum_{i=1}^{b} \zeta_i\, \hat y_q(k_i)}{\sum_{i=1}^{b} \zeta_i},$
    where the weights are the inverse of the mean squared errors: $\zeta_i = 1 / \widehat{\text{MSE}}(k_i)$. This is an example of the generalized ensemble method [39].
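The winner-takes-all loop sketched above can instead collect the prediction and the leave-one-out MSE of every candidate k and pass them to a small combination step; a minimal variant (b and all names are again illustrative):

```python
import numpy as np

def combine_best_models(predictions, mse_values, b=3):
    """Weighted average of the b candidate models with the lowest leave-one-out MSE."""
    order = np.argsort(mse_values)[:b]          # indices of the b best bandwidths
    zeta = 1.0 / np.asarray(mse_values)[order]  # weights = inverse of the loo MSE
    preds = np.asarray(predictions)[order]
    return np.sum(zeta * preds) / np.sum(zeta)
```

In practice, one would append (x_aug @ beta, mse) for every k in the loop above and call this function on the two lists instead of keeping only the minimum.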
  • 43. From local learning to Lazy Learning (LL)
    - By speeding up the local learning procedure, we can delay the learning procedure to the moment when a prediction at a query point is required (query-by-query learning).
    - The combination approach makes it possible to integrate local models of different order (e.g. constant and linear) and different bandwidths.
    - This method is called lazy since the whole learning procedure (i.e. the parametric and the structural identification) is deferred until a prediction is required.
  • 44. Experimental setup for regression
    Datasets: 23 real and artificial datasets from the ML repository.
    Methods: Lazy Learning, local modeling, feed-forward neural networks, mixtures of experts, neuro-fuzzy models, regression trees (Cubist).
    Experimental methodology: 10-fold cross-validation.
    Results: mean absolute error (Table 7.2), relative error (Table 7.3) and paired t-test (Appendix C) [7].
  • 45. Regression datasets
    Dataset     Examples  Regressors
    Housing     330       8
    Cpu         506       13
    Prices      209       6
    Mpg         159       16
    Servo       392       7
    Ozone       167       8
    Bodyfat     252       13
    Pool        253       3
    Energy      2444      5
    Breast      699       9
    Abalone     4177      10
    Sonar       208       60
    Bupa        345       6
    Iono        351       34
    Pima        768       8
    Kin_8fh     8192      8
    Kin_8nh     8192      8
    Kin_8fm     8192      8
    Kin_8nm     8192      8
    Kin_32fh    8192      32
    Kin_32nh    8192      32
    Kin_32fm    8192      32
    Kin_32nm    8192      32
  • 46. Experimental results: paired comparison
    Each method is statistically compared with all the others (9 x 23 = 207 comparisons per method).
    Method                          Times significantly worse than another method
    LL linear                       74
    LL constant                     96
    LL combination                  23
    Local modeling linear           58
    Local modeling constant         81
    Cubist                          40
    Feed-forward NN                 53
    Mixtures of Experts             80
    Local Model Network (fuzzy)     132
    Local Model Network (k-mean)    145
    The lower, the better.
  • 47. Award in EUFIT competition
    Data analysis competition on regression: awarded as runner-up among 21 participants at the Third International ERUDIT competition on "Protecting rivers and streams by monitoring chemical concentrations and algae communities" [10].
  • 48. Lazy Learning for dynamic tasks
    Multi-step-ahead prediction [12]: long-horizon forecasting based on the iteration of a LL one-step-ahead predictor.
    Nonlinear control [11]:
    1. Lazy Learning inverse/forward control.
    2. Lazy Learning self-tuning control.
    3. Lazy Learning optimal control.
  • 49. Embedding in time series
    Consider a sequence of measurements $\varphi_t \in \mathbb{R}$ of an observable at equal time intervals. We express the present value as a function of the previous $n$ values of the time series itself:
    $\varphi_t = f(\varphi_{t-1}, \varphi_{t-2}, \ldots, \varphi_{t-n})$
    where $f$ is an unknown nonlinear function and the vector $(\varphi_{t-1}, \varphi_{t-2}, \ldots, \varphi_{t-n})$ lies in the $n$-dimensional time delay space or lag space.
    This standard approach is called "state-space reconstruction" in the physics community, "tapped delay line" in the engineering community and Nonlinear AutoRegressive (NAR) in the forecasting community.
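A small sketch of how such a lag matrix can be built, turning the NAR formulation into an ordinary input/output regression dataset (the function name and layout are illustrative):

```python
import numpy as np

def embed(series, n):
    """Return (X, y): each row of X holds the n previous values, y the next value."""
    series = np.asarray(series, dtype=float)
    T = len(series)
    X = np.array([series[t - n:t][::-1] for t in range(n, T)])  # (phi_{t-1}, ..., phi_{t-n})
    y = series[n:]                                              # phi_t
    return X, y
```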
  • 50. [Figure: the same time series shown in three ways: the temporal representation, the embedding representation $\varphi_{t+1} = f(\varphi_t, \varphi_{t-1}, \ldots, \varphi_{t-n+1})$, and the resulting input/output representation used to fit the model.]
  • 51. One-step and multi-step-ahead prediction
    One-step-ahead prediction: the $n$ previous values of the series are assumed to be available for the prediction of the next value. This is equivalent to a problem of supervised learning. LL was used in this way in several prediction tasks: finance, economic variables, environmental modeling [23].
    Multi-step-ahead prediction: we predict the value of the series for the next $H$ steps. We can classify the methods for multiple-step prediction according to two features: the horizon of the predictor and the training criterion.
  • 52. Multi-step-ahead prediction
    One-step-ahead predictor and one-step-ahead training criterion. The model predicts $H$ steps ahead by iterating a one-step-ahead predictor whose parameters are optimized to minimize the training error on the one-step-ahead forecast.
    One-step-ahead predictor and $H$-step-ahead training criterion. The model predicts $H$ steps ahead by iterating a one-step-ahead predictor whose parameters are optimized to minimize the training error on the iterated $H$-step-ahead forecast.
    Direct forecasting. The model makes a direct forecast at time $t + H$:
    $\varphi_{t+H} = f_H(\varphi_t, \varphi_{t-1}, \ldots, \varphi_{t-n+1}).$
  • 53. Iteration of a one-step-ahead predictor
    [Figure: the predictor f is fed with the delayed values $\varphi_{t-1}, \varphi_{t-2}, \ldots, \varphi_{t-n}$ (unit delays $z^{-1}$); its output $\varphi_t$ is fed back into the delay line to produce the next prediction.]
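A sketch of this iterated strategy, where an assumed one-step predictor (here the illustrative lazy_winner_takes_all from the earlier sketch, but any one-step model would do) is applied H times, each prediction being fed back into the lag vector:

```python
import numpy as np

def iterated_forecast(series, n, H, one_step_predict):
    """Predict H steps ahead by iterating a one-step-ahead predictor."""
    X, y = embed(series, n)                 # training pairs from the observed series
    lag = list(np.asarray(series)[-n:][::-1])   # current lag vector (phi_{t-1}, ..., phi_{t-n})
    forecasts = []
    for _ in range(H):
        phi_next = one_step_predict(X, y, np.array(lag))
        forecasts.append(phi_next)
        lag = [phi_next] + lag[:-1]         # feed the prediction back into the delay line
    return np.array(forecasts)

# usage sketch (train_series is a hypothetical observed series):
# horizon = iterated_forecast(train_series, n=16, H=100, one_step_predict=lazy_winner_takes_all)
```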
  • 54. Local Modeling in the time domain
    Consider the embedding $\varphi_{t+1} = f(\varphi_t, \varphi_{t-1}, \ldots, \varphi_{t-5})$ of order $n = 6$.
    [Figure: the series in the time domain, with the most recent window (lags around $t-1, \ldots, t-6$) and a similar past window (around $t-11, \ldots, t-16$) highlighted.]
  • 55. Local Modeling in the I/O space
    Consider the embedding $\varphi_{t+1} = f(\varphi_t)$ of order $n = 1$.
    [Figure: the same data plotted in the input/output space $(\varphi_t, \varphi_{t+1})$ with the query point q. Note the labels of the axes.]
  • 56. Local modeling in the embedding space
    Consider the embedding $\varphi_{t+1} = f(\varphi_t, \varphi_{t-1})$ of order $n = 2$.
    [Figure: the trajectory in the lag space $(\varphi_{t-1}, \varphi_t)$ with the neighbors of the current point highlighted; the local model fitted on them predicts the continuation.]
  • 57. Conventional and iterated leave-one-out
    [Figure: panels (a) and (b) compare, on a five-sample example with sample 3 held out, the conventional leave-one-out error $e_{cv}(3)$ with the iterated leave-one-out error $e_{it}(3)$.]
  • 58. Iterated PRESS in the space
    [Figure: leave-one-out and iterated errors in the $(x, y)$ and $(y, z)$ spaces: the local fits $\beta_{xy,-3}$ and $\beta_{yz,-3}$, the intermediate prediction $\hat y_{-3}$, and the errors $e_{xy}(3)$, $e_{yz}(3)$ and the iterated $e_{xz}(3)$.]
    Here $x$ represents the value of the time series (order $n = 1$) at time $t - 1$, $y$ represents the value of the time series at time $t$, and $z$ represents the value of the time series at time $t + 1$.
  • 59. From conventional to iterated PRESS
    - The PRESS statistic returns leave-one-out errors as a by-product of the local weighted regression.
    - We derived in [12] an analytical iterated formulation of the PRESS statistic for long-horizon assessment.
    - The iterated assessment criterion improves stability and prediction accuracy.
  • 60. The iterated multi-step-ahead algorithm
    1. The time series is embedded as an input/output mapping $f: \mathbb{R}^n \to \mathbb{R}$.
    2. The one-step-ahead predictor is a local estimate of the mapping $f$.
    3. The $H$-step-ahead prediction is performed by iterating the one-step-ahead estimator.
    4. Local structure identification is performed in a space of alternative model configurations, each characterized by a different bandwidth.
    5. Prediction ability is assessed by the iterated formulation of the cross-validation PRESS statistic (the $H$-step-ahead criterion).
  • 61. The Santa Fe time series
    - The iterated PRESS approach has been applied both to the prediction of a real-world data set (A) and to a computer-generated time series (D) from the Santa Fe Time Series Prediction and Analysis Competition.
    - The A time series has a training set of 1000 values and a test set of 10000 samples: the task is to predict the continuation for 100 steps, starting from different points.
    - The D time series has a training set of 100000 values and a test set of 500 samples: the task is to predict the continuation for 25 steps, starting from different points.
• 62. A series: training set. [Plot of the 1000-sample training set of the A series; values range roughly between 0 and 300.] On the use of cross-validation for local modeling in regression and time series prediction – p.55/75
• 63. A series: one-step criterion. [Plot of the 100-step continuation of the A series predicted with the one-step (non-iterated) selection criterion.] On the use of cross-validation for local modeling in regression and time series prediction – p.56/75
• 64. A series: multi-step criterion. [Plot of the 100-step continuation of the A series predicted with the multi-step (iterated) selection criterion.] On the use of cross-validation for local modeling in regression and time series prediction – p.57/75
• 65. Experiments: The Santa Fe Time Series A (order n = 16; training set: 1000 values; test set: 100 steps).
Test data  | Non-iter. PRESS | Iter. PRESS | Sauer | Wan
1-100      | 0.350           | 0.029       | 0.077 | 0.055
1180-1280  | 0.379           | 0.131       | 0.174 | 0.065
2870-2970  | 0.793           | 0.055       | 0.183 | 0.487
3000-3100  | 0.003           | 0.003       | 0.006 | 0.023
4180-4280  | 1.134           | 0.051       | 0.111 | 0.160
Sauer: combination of iterated and direct local models. Wan: recurrent network. On the use of cross-validation for local modeling in regression and time series prediction – p.58/75
• 66. The Santa Fe Time Series D (training set: 100,000 values; test set: 25-step continuations).
Test data | Non-iter. PRESS | Iter. PRESS | Zhang-Hutchinson
0-24      | 0.1255          | 0.0492      | 0.0665
100-124   | 0.0460          | 0.0363      | 0.0616
200-224   | 0.2635          | 0.1692      | 0.1475
300-324   | 0.0461          | 0.0405      | 0.0541
400-424   | 0.1610          | 0.0644      | 0.0720
Zhang: combination of iterated and direct multilayer perceptrons. On the use of cross-validation for local modeling in regression and time series prediction – p.59/75
• 67. Award in the Leuven Competition. Training set made of 2000 points. Task: predict the continuation for the next 200 points. [Plot of the predicted 200-point continuation; values between −0.5 and 0.5.] Iterated Lazy Learning ranked second and fourth [8]. On the use of cross-validation for local modeling in regression and time series prediction – p.60/75
  • 68. Lazy Learning for iterated prediction Multi-step ahead by iteration of a one-step predictor. Lazy learning to implement the one-step predictor. Selection of the local structure by an iterated PRESS. Iterated criterion avoids the accumulation of prediction errors and improves the performance. On the use of cross-validation for local modeling in regression and time series prediction – p.61/75
• 69. Complexity in global and local modeling. Consider N training samples, n features and Q query points.
                                      | GLOBAL       | LAZY
Parametric identification             | C(NLS)       | C(Nn) + C(LS)
Structural identification (K-fold CV) | K · C(NLS)   | small
Prediction for Q queries              | negligible   | Q · [C(Nn) + C(LS)]
TOTAL                                 | K · C(NLS)   | Q · [C(Nn) + C(LS)]
where C(NLS) stands for the cost of a nonlinear least-squares fit and C(LS) for the cost of a linear least-squares fit. On the use of cross-validation for local modeling in regression and time series prediction – p.62/75
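As a purely illustrative back-of-envelope rendering of the table (arbitrary cost units, not measurements; all constants are assumptions):

```python
# Crude cost model: global = K nonlinear fits up front, cheap queries;
# lazy = nothing up front, one neighbour search + local least squares per query.
def global_cost(K, c_nls):
    return K * c_nls

def lazy_cost(Q, N, n, c_ls):
    return Q * (N * n + c_ls)

c_nls, c_ls, N, n, K = 1e7, 1e3, 1000, 16, 10
for Q in (1, 100, 10000):
    print(Q, global_cost(K, c_nls), lazy_cost(Q, N, n, c_ls))
# The lazy approach wins when few queries are needed and loses as Q grows.
```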
• 70. Feature selection and LL. Local modeling techniques are known to be weak in large-dimensional spaces. A way to defy the curse of dimensionality is dimensionality reduction (a.k.a. feature selection). It requires the assessment of an exponential number of alternatives (2^n subsets of the input variables) and the choice of the best one. Several techniques exist: we focus here on wrappers. Wrappers rely on expensive cross-validation (e.g. leave-one-out assessment). Our idea: combine racing [34] and sub-sampling [29] to accelerate the wrapper feature-selection procedure in LL. On the use of cross-validation for local modeling in regression and time series prediction – p.63/75
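For reference, the plain wrapper (the expensive baseline that racing and sub-sampling are meant to accelerate) might look like this greedy forward-selection sketch built around a leave-one-out assessment of a k-NN lazy learner; the helper names and the choice k = 5 are illustrative.

```python
import numpy as np

def loo_knn_mae(X, y, k=5):
    """Leave-one-out mean absolute error of a k-NN (lazy) regressor."""
    errs = []
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                              # exclude the held-out point
        nn = np.argsort(d)[:k]
        errs.append(abs(y[i] - y[nn].mean()))
    return float(np.mean(errs))

def forward_selection(X, y, max_features=5):
    """Greedy wrapper: add, one at a time, the feature that most reduces
    the leave-one-out error of the local model."""
    selected, remaining, best_err = [], list(range(X.shape[1])), np.inf
    while remaining and len(selected) < max_features:
        scores = {j: loo_knn_mae(X[:, selected + [j]], y) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_err:
            break                                  # no further improvement
        selected.append(j_best)
        remaining.remove(j_best)
        best_err = scores[j_best]
    return selected, best_err
```

Each call to `loo_knn_mae` requires O(N²) distance computations per candidate subset, which is exactly the expense attacked by the racing and sub-sampling ideas of the following slides.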
• 72. Racing for feature selection. Suppose we have several candidate sets of input variables. The computational cost of making a selection results from the cost of identification and the cost of validation. The validation cost required by a global model is independent of Q, while this is not the case for LL. The idea of racing techniques consists in using blocking and paired multiple tests to compare different models in similar conditions and to discard the worst ones as soon as possible. Racing reduces the number of tests to be made. This makes the wrapper LL approach more competitive. On the use of cross-validation for local modeling in regression and time series prediction – p.65/75
• 75. Sub-sampling and LL. The goal of model selection is to find the best hypothesis in a set of alternatives. What is relevant is the ordering of the alternatives, e.g. M2 ≻ M3 ≻ M5 ≻ M1 ≻ M4. By reducing the training set size N, we expect to reduce the accuracy of each single model but not necessarily to alter their ordering. In LL, reducing the training set size reduces the cost. The idea of sub-sampling is to reduce the size of the training set without altering the ranking of the different models. This makes the LL approach more competitive. On the use of cross-validation for local modeling in regression and time series prediction – p.68/75
• 76. RACSAM for feature selection. We proposed the following algorithm [14]: 1. Define an initial group of promising feature subsets. 2. Start with small training and test sets. 3. Discard by racing all the feature subsets that appear significantly worse than the others. 4. Increase the training and test size until at most a fixed number of winning models remain. 5. Update the group with new candidates to be assessed and go back to 3. On the use of cross-validation for local modeling in regression and time series prediction – p.69/75
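A schematic Python rendering of the RACSAM loop under simplifying assumptions: a k-NN lazy learner as the wrapped model, a common random subsample for all candidates (blocking), and a crude "two standard errors of the paired differences" elimination rule in place of the paired multiple tests of [14]. The default sizes, growth factor and thresholds are arbitrary.

```python
import numpy as np

def holdout_errors(X, y, features, n_train, n_test, k=5, seed=0):
    """Per-point absolute errors of a k-NN lazy learner on a random subsample;
    the same split (same seed) is reused for every candidate, so comparisons
    between candidates are blocked/paired."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    tr, te = idx[:n_train], idx[n_train:n_train + n_test]
    Xf = X[:, list(features)]
    errs = []
    for i in te:
        d = np.linalg.norm(Xf[tr] - Xf[i], axis=1)
        nn = tr[np.argsort(d)[:k]]
        errs.append(abs(y[i] - y[nn].mean()))
    return np.array(errs)

def racsam(X, y, candidates, n0=50, n_test=50, grow=2.0, max_winners=2):
    """Racing + sub-sampling sketch: evaluate all candidate feature subsets on
    a small common subsample, drop those clearly worse than the current best,
    then enlarge the subsample and repeat until few winners remain."""
    alive, n_train = list(candidates), n0
    while len(alive) > max_winners and n_train + n_test <= len(y):
        errs = {tuple(f): holdout_errors(X, y, f, n_train, n_test) for f in alive}
        best = min(errs, key=lambda f: errs[f].mean())
        keep = []
        for f in alive:
            d = errs[tuple(f)] - errs[best]          # paired differences
            se = d.std(ddof=1) / np.sqrt(len(d)) + 1e-12
            if d.mean() <= 2 * se:                   # not clearly worse: survives
                keep.append(f)
        alive, n_train = keep, int(n_train * grow)
    return alive
```

For instance, `racsam(X, y, candidates=[(0,), (0, 1), (2, 3)])` returns the subsets still alive when the race stops; in a real run the candidate pool, the elimination test and the growth schedule would all be tuned.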
• 77. Experimental session. We compare the prediction accuracy of the LL algorithm enhanced by the RACSAM procedure to the accuracy of two state-of-the-art algorithms: a SVM for regression and a regression tree (RTREE). Two versions of the RACSAM algorithm were tested: the first (LL-RAC1) takes as feature set the best one (in terms of estimated mean absolute error, MAE) among the winning candidates; the second (LL-RAC2) averages the predictions of the winning LL predictors. The p-value of the racing tests is set to 0.01. On the use of cross-validation for local modeling in regression and time series prediction – p.70/75
• 78. Experimental results. Five-fold cross-validation on six real datasets of high dimensionality: Ailerons, Pole, Elevators, Triazines, Wisconsin and Census.
Dataset | AIL    | POL  | ELE    | TRI  | WIS   | CEN
LL-RAC1 | 9.7e-5 | 3.12 | 1.6e-3 | 0.21 | 27.39 | 0.17
LL-RAC2 | 9.0e-5 | 3.13 | 1.5e-3 | 0.12 | 27.41 | 0.16
SVM     | 1.3e-4 | 26.5 | 1.9e-3 | 0.11 | 29.91 | 0.21
RTREE   | 1.8e-4 | 8.80 | 3.1e-3 | 0.11 | 33.02 | 0.17
On the use of cross-validation for local modeling in regression and time series prediction – p.71/75
• 79. Applications. Financial prediction of stock markets: in collaboration with Masterfood, Belgium. Prediction of yearly sales: in collaboration with Dieteren, Belgium, the leading Belgian car dealer. Nonlinear control and identification tasks in power systems: in collaboration with Università del Sannio (I) [44, 18]. Modeling of industrial processes: in collaboration with the FaFer Usinor steel company (B) and the Honeywell Technology Center (US). Performance modelling of embedded systems: during my stay at Philips Research, Eindhoven (NL) [16]. Quality of service: during my stay at IMEC, Leuven (B) [17]. Black-box simulators: in collaboration with CENAERO, Gosselies (B) [15]. Environmental predictions: in collaboration with Politecnico di Milano (I) [23]. On the use of cross-validation for local modeling in regression and time series prediction – p.72/75
  • 80. Software   MATLAB toolbox on Lazy Learning [5].  R contributed package lazy.   Joint work with Dr. Mauro Birattari (IRIDIA).   Web page: http://iridia.ulb.ac.be/~lazy.   About 5000 accesses since October 2002. On the use of cross-validation for local modeling in regression and time series prediction – p.73/75
  • 81. The importance of being Lazy   Fast data-driven design.  No global assumption on the noise.   Linear methods still effective in a multivariate non-linear setting (LWR, PRESS).   An estimate of the variance is returned with each prediction.   Intrinsically adaptive. On the use of cross-validation for local modeling in regression and time series prediction – p.74/75
  • 82. Future work   Extension of the LL method to other local selection criteria (VC dimension, GCV).   Classification applications.   Integration with powerful software and hardware devices.   From large to huge databases.   New applications: bioinformatics, text mining, medical data, sensor networks, power systems. On the use of cross-validation for local modeling in regression and time series prediction – p.75/75
• 83. References [1] D. W. Aha. Editorial of special issue on lazy learning. Artificial Intelligence Review, 11(1–5):1–6, 1997. [2] D. M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125–127, 1974. [3] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1–5):11–73, 1997. [4] G. J. Bierman. Factorization Methods for Discrete Sequential Estimation. Academic Press, New York, NY, 1977. [5] M. Birattari and G. Bontempi. The lazy learning toolbox, for use with MATLAB. Technical Report TR/IRIDIA/99-7, IRIDIA-ULB, Brussels, Belgium, 1999. [6] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive least-squares algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, NIPS 11, pages 375–381, Cambridge, 1999. MIT Press.
• 84. [7] G. Bontempi. Local Learning Techniques for Modeling, Prediction and Control. PhD thesis, IRIDIA - Université Libre de Bruxelles, 1999. [8] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for iterated time series prediction. In J. A. K. Suykens and J. Vandewalle, editors, Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling, pages 62–68. Katholieke Universiteit Leuven, Belgium, 1998. [9] G. Bontempi, M. Birattari, and H. Bersini. Recursive lazy learning for modeling and control. In Machine Learning: ECML-98 (10th European Conference on Machine Learning), pages 292–303. Springer, 1998. [10] G. Bontempi, M. Birattari, and H. Bersini. Lazy learners at work: the lazy learning toolbox. In Proceedings of the 7th European Congress on Intelligent Techniques and Soft Computing EUFIT '99, 1999. [11] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for modeling and control design. International Journal of Control, 72(7/8):643–658, 1999.
• 85. [12] G. Bontempi, M. Birattari, and H. Bersini. Local learning for iterated time-series prediction. In I. Bratko and S. Dzeroski, editors, Machine Learning: Proceedings of the Sixteenth International Conference, pages 32–38, San Francisco, CA, 1999. Morgan Kaufmann Publishers. [13] G. Bontempi, M. Birattari, and H. Bersini. A model selection approach for local learning. Artificial Intelligence Communications, 121(1), 2000. [14] G. Bontempi, M. Birattari, and P. E. Meyer. Combining lazy learning, racing and subsampling for effective feature selection. In Proceedings of the International Conference on Adaptive and Natural Computing Algorithms. Springer Verlag, 2005. To appear. [15] G. Bontempi, O. Caelen, S. Pierret, and C. Goffaux. On the use of supervised learning techniques to speed up the design of aeronautics components. WSEAS Transactions on Systems, 10(3):3098–3103, 2005. [16] G. Bontempi and W. Kruijtzer. The use of intelligent data analysis techniques for system-level design: a software estimation example. Soft Computing, 8(7):477–490, 2004.
• 86. [17] G. Bontempi and G. Lafruit. Enabling multimedia QoS control with black-box modeling. In D. Bustard, W. Liu, and R. Sterritt, editors, Soft-Ware 2002: Computing in an Imperfect World, Lecture Notes in Computer Science, pages 46–59, 2002. [18] G. Bontempi, A. Vaccaro, and D. Villacci. A semi-physical modelling architecture for dynamic assessment of power components loading capability. IEE Proceedings of Generation, Transmission and Distribution, 151(4):533–542, 2004. [19] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984. [20] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74:829–836, 1979. [21] W. S. Cleveland and S. J. Devlin. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83:596–610, 1988.
• 87. [22] W. S. Cleveland and C. Loader. Smoothing by local regression: Principles and methods. Computational Statistics, 11, 1995. [23] G. Corani. Air quality prediction in Milan: feed-forward neural networks, pruned neural networks and lazy learning. Ecological Modelling, 2005. In press. [24] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, pages 21–27, 1967. [25] J. Fan and I. Gijbels. Adaptive order polynomial fitting: bandwidth robustification and bias reduction. Journal of Computational and Graphical Statistics, 4:213–227, 1995. [26] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman and Hall, 1996. [27] W. Hardle and J. S. Marron. Fast and simple scatterplot smoothing. Computational Statistics and Data Analysis, 20:1–17, 1995. [28] R. Henderson. Note on graduation by adjusted average. Transactions of the Actuarial Society of America, 17:43–48, 1916. [29] G. H. John and P. Langley. Static versus dynamic sampling for data mining. In Proceedings of the Second International Conference on Knowledge Discovery in Databases and Data Mining. AAAI/MIT Press, 1996.
  • 88. ference on Knowledge Discovery in Databases and Data Mining. AAAI/MIT Press, 1996. [30] M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection for density estimation. Journal of American Statistical Association, 90, 1995. [31] V. Y. Katkovnik. Linear and nonlinear methods of nonparametric regression analysis. Soviet Automatic Control, 5:25–34, 1979. [32] C. Loader. Local Regression and Likelihood. Springer, New York, 1999. [33] C. R. Loader. Old faithful erupts: Bandwidth selection reviewed. Technical report, Bell-Labs, 1987. [34] O. Maron and A. Moore. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11(1–5):193–225, 1997. [35] T. M. Mitchell. Machine Learning. McGraw Hill, 1997. [36] R. Murray-Smith and T. A. Johansen. Local learning in local model networks. In R. Murray-Smith and T. A. Johansen, editors, 75-1
• 89. [37] R. H. Myers. Classical and Modern Regression with Applications. PWS-KENT Publishing Company, Boston, MA, second edition, 1994. [38] B. U. Park and J. S. Marron. Comparison of data-driven bandwidth selectors. Journal of the American Statistical Association, 85:66–72, 1990. [39] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for hybrid neural networks. In R. J. Mammone, editor, Artificial Neural Networks for Speech and Vision, pages 126–142. Chapman and Hall, 1993. [40] J. Rice. Bandwidth choice for nonparametric regression. The Annals of Statistics, 12:1215–1230, 1984. [41] D. Ruppert, S. J. Sheather, and M. P. Wand. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90:1257–1270, 1995.
• 90. [42] G. V. Schiaparelli. Sul modo di ricavare la vera espressione delle leggi della natura dalle curve empiriche. Effemeridi Astronomiche di Milano per l'anno, 857:3–56, 1886. [43] C. Stone. Consistent nonparametric regression. The Annals of Statistics, 5:595–645, 1977. [44] D. Villacci, G. Bontempi, A. Vaccaro, and M. Birattari. The role of learning methods in the dynamic assessment of power components loading capability. IEEE Transactions on Industrial Electronics, 52(1), 2005. [45] G. Wahba and S. Wold. A completely automatic French curve: Fitting spline functions by cross-validation. Communications in Statistics, 4(1), 1975. [46] D. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992. [47] M. Woodroofe. On choosing a delta-sequence. Annals of Mathematical Statistics, 41:1665–1671, 1970.