On the use of cross-validation for local
modeling in regression and time series
prediction
Gianluca Bontempi
gbonte@ulb.ac.be
Machine Learning Group
Departement d’Informatique, ULB
Boulevard de Triomphe - CP 212
http://www.ulb.ac.be/di/mlg
On the use of cross-validation for local modeling in regression and time series prediction – p.1/75
Outline
- The Machine Learning Group.
- A local learning algorithm: the Lazy Learning.
- Lazy Learning for multivariate regression modeling.
- Lazy Learning for multi-step-ahead time series prediction.
- Lazy Learning for feature selection.
- Applications.
- Future work.
On the use of cross-validation for local modeling in regression and time series prediction – p.2/75
Machine Learning: a definition
The field of machine learning is concerned with the question of how to
construct computer programs that automatically improve with
experience. [35]
On the use of cross-validation for local modeling in regression and time series prediction – p.3/75
The Machine Learning Group (MLG)
- 7 researchers (1 professor, 6 PhD students), 4 graduate students.
- Research topics: Bioinformatics, Classification, Computational statistics, Data mining, Regression, Time series prediction, Sensor networks.
- Computing facilities: cluster of 16 processors, LEGO Robotics Lab.
- Website: www.ulb.ac.be/di/mlg.
- Scientific collaborations within ULB: IRIDIA (Sciences Appliquées), Physiologie Moléculaire de la Cellule (IBMM), Conformation des Macromolécules Biologiques et Bioinformatique (IBMM), CENOLI (Sciences), Microarray Unit (Hôpital Jules Bordet), Service d'Anesthésie (ERASME).
- Scientific collaborations outside ULB: UCL Machine Learning Group (B), Politecnico di Milano (I), Università del Sannio (I), George Mason University (US).
- The MLG is part of the "Groupe de Contact FNRS" on Machine Learning.
On the use of cross-validation for local modeling in regression and time series prediction – p.4/75
MLG: running projects
1. "Integrating experimental and theoretical approaches to decipher the molecular
networks of nitrogen utilisation in yeast": ARC (Action de Recherche Concertée)
funded by the Communauté Française de Belgique (2004-2009). Partners: IBMM
(Gosselies and La Plaine), CENOLI.
2. "COMP2SYS" (COMPutational intelligence methods for COMPlex SYStems)
MARIE CURIE Early Stage Research Training funded by the European Union
(2004-2008). Main contractor: IRIDIA (ULB).
3. "Predictive data mining techniques in anaesthesia": FIRST Europe Objectif 1
funded by the Région wallonne and the Fonds Social Européen (2004-2009).
Partners: Service d’anesthesie (ERASME).
4. "AIDAR - Adressage et Indexation de Documents Multimédias Assistés par des
techniques de Reconnaissance Vocale": funded by Région Bruxelles-Capitale
(2004-2006). Partners: Voice Insight, RTBF, Titan.
On the use of cross-validation for local modeling in regression and time series prediction – p.5/75
Machine learning and applied statistics
Reductionist attitude: ML is a modern buzzword which equates to
statistics plus marketing
Positive attitude: ML paved the way to the treatment of real problems
related to data analysis, sometimes overlooked by statisticians
(nonlinearity, classification, pattern recognition, missing variables,
adaptivity, optimization, massive datasets, data management,
causality, representation of knowledge, parallelisation)
Interdisciplinary attitude: ML should have its roots in statistics and
complement it by focusing on algorithmic issues, computational
efficiency, and data engineering.
On the use of cross-validation for local modeling in regression and time series prediction – p.6/75
Motivations
- A wide body of theoretical and practical results exists for linear methods in statistics, forecasting, and control.
- However, real settings often present nonlinear problems.
- Nonlinear methods are generally harder to analyze than linear ones, rarely produce closed-form or analytically tractable expressions, and are not easy to manipulate and implement.
- Local learning techniques are a powerful way of re-using linear techniques in a nonlinear setting.
On the use of cross-validation for local modeling in regression and time series prediction – p.7/75
Prediction models from data
[Figure: block diagram of prediction from data — training data are used to build a prediction model that maps an input to a predicted output, which is compared with the target to yield the prediction error.]
On the use of cross-validation for local modeling in regression and time series prediction – p.8/75
Regression setting
- Multidimensional input $x \in \mathbb{R}^n$ and scalar output $y \in \mathbb{R}$, related by
  $$y = f(x) + \varepsilon,$$
  where $f$ is the unknown regression function and $\varepsilon$ is the random error term.
- A finite number of noisy input/output observations (the training set).
- A test set of input values for which an accurate generalization or prediction of the output is required.
- A learning machine which returns an input/output model on the basis of the training set.
Assumption: no a priori knowledge on the process underlying the data.
On the use of cross-validation for local modeling in regression and time series prediction – p.9/75
The global modeling approach
[Figure sequence (output y vs. input x, query point q): the input-output regression problem, the training data set, a global model fitted to all the data, and two predictions obtained by evaluating the fitted global model at different query points.]
On the use of cross-validation for local modeling in regression and time series prediction – p.10/75
The local modeling approach
[Figure sequence (output y vs. input x, query point q): the same input-output regression problem and training data set, followed by a local fit and prediction around one query point and another local fit and prediction around a different query point.]
On the use of cross-validation for local modeling in regression and time series prediction – p.11/75
Global vs. local modeling
- The traditional approach to supervised learning is global modeling, which describes the relationship between the input and the output with an analytical function valid over the whole input domain.
- Even for huge datasets, a parametric model can be stored in a small amount of memory. Also, the evaluation of a parametric model requires a short program that can be executed in a reduced amount of time.
- Modeling complex input/output relations often requires the adoption of global nonlinear models, whose learning procedures are typically slow and analytically intractable. In particular, validation methods, which address the problem of assessing a global model on the basis of a finite amount of noisy samples, are computationally prohibitive.
- For these reasons, in recent years, interest has grown in pursuing alternatives (divide-and-conquer) to global modeling techniques.
On the use of cross-validation for local modeling in regression and time series prediction – p.12/75
Global vs. local modeling
- The divide-and-conquer strategy consists in attacking a complex problem by dividing it into simpler problems whose solutions can be combined to yield a solution to the original problem.
- Instances of the divide-and-conquer approach are modular techniques (e.g. local model networks [36], regression trees [19], splines [45]) and local modeling (aka smoothing) techniques.
- The principle underlying local modeling is that a smooth function can be well approximated by a low degree polynomial in the neighborhood of any query point.
- Local modeling techniques do not return a global fit of the available dataset but perform the prediction of the output for specific test input values, also called queries.
- The talk presents our contribution to local modeling techniques and their application to a number of experimental problems.
On the use of cross-validation for local modeling in regression and time series prediction – p.13/75
Lazy vs. eager modeling
- Eager techniques perform a large amount of computation for tuning the model before observing the new query.
- An eager technique must then commit to a specific hypothesis that covers all the future queries.
- Lazy techniques [1] wait for the query to be defined before starting the learning procedure.
- For that purpose, the database of observed input/output data is always kept in memory and the output prediction is obtained by interpolating the samples in the neighborhood of the query point.
- Lazy methods will generally require less computation during training but more computation when they must predict the target value for a new query.
On the use of cross-validation for local modeling in regression and time series prediction – p.14/75
Examples
- Classical linear regression is an example of the global, eager, and linear approach.
- Neural networks (NN) are instances of the global, eager, and nonlinear approach: NN are global in the sense that a single representation covers the whole input space. They are eager in the sense that the examples are used for tuning the network and then they are discarded without waiting for any query. Finally, NN are nonlinear in the sense that the relation between the weights and the output is nonlinear.
- The technique we are going to discuss here is a lazy and local approach.
- Remark: we can imagine a local technique (e.g. a K-nearest neighbor) where the most important parameter (i.e. the number of neighbors) is defined in an eager fashion.
On the use of cross-validation for local modeling in regression and time series prediction – p.15/75
Some history
- Local regression estimation was independently introduced in several different fields in the late nineteenth [42] and early twentieth century [28].
- In the statistical literature, the method was independently introduced from different viewpoints in the late 1970's [20, 31, 43].
- Reference books are Fan and Gijbels [26] and Loader [32].
- In the machine learning literature, work on local techniques for classification dates back to 1967 [24]. A more recent reference is the special issue on Lazy Learning [1].
On the use of cross-validation for local modeling in regression and time series prediction – p.16/75
Local modeling procedure
The identification of a local model [3] can be summarized in these
steps:
1. Compute the distance between the query and the training
samples according to a predefined metric.
2. Rank the neighbors on the basis of their distance to the query.
3. Select a subset of the nearest neighbors according to the
bandwidth which measures the size of the neighborhood.
4. Fit a local model (e.g. constant, linear,...).
Each of the local approaches has one or more structural (or
smoothing) parameters that control the amount of smoothing
performed.
In this talk we will focus on the bandwidth selection.
On the use of cross-validation for local modeling in regression and time series prediction – p.17/75
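To make the four steps above concrete, here is a minimal sketch (Euclidean metric, rectangular kernel of k neighbours, local linear fit). It is only an illustration under those assumptions, not the Lazy Learning toolbox discussed later; the function name and the toy data are invented for the example.

```python
import numpy as np

def local_linear_predict(X, y, x_q, k):
    """Predict y at query x_q from its k nearest neighbours (local linear fit)."""
    d = np.linalg.norm(X - x_q, axis=1)          # 1. distances to the query
    idx = np.argsort(d)[:k]                      # 2.-3. rank and select k neighbours
    Xk = np.column_stack([np.ones(k), X[idx]])   # 4. local linear model (with intercept)
    beta, *_ = np.linalg.lstsq(Xk, y[idx], rcond=None)
    return np.concatenate([[1.0], x_q]) @ beta

# toy example: noisy sine, one query point
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
print(local_linear_predict(X, y, np.array([3.0]), k=20))
```

The number of neighbours k plays the role of the bandwidth: the rest of the talk is about choosing it by cross-validation.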
The bandwidth trade-off: overfit
[Figure: output y vs. input x around the query point q — a local fit with a very narrow bandwidth chases the noisy samples.]
Too narrow a bandwidth → overfitting → large prediction error.
In terms of the bias/variance trade-off, this is typically a situation of high variance.
On the use of cross-validation for local modeling in regression and time series prediction – p.18/75
The bandwidth trade-off: underfit
[Figure: output y vs. input x around the query point q — a local fit with a very large bandwidth ignores the local structure of the data.]
Too large a bandwidth → underfitting → large prediction error.
In terms of the bias/variance trade-off, this is typically a situation of high bias.
On the use of cross-validation for local modeling in regression and time series prediction – p.19/75
Bandwidth and bias/variance trade-off
[Figure: mean squared error vs. 1/bandwidth — moving from many neighbors to few neighbors, the bias term decreases while the variance term increases; underfitting on the left, overfitting on the right.]
On the use of cross-validation for local modeling in regression and time series prediction – p.20/75
Existing work on bandwidth selection
Rule-of-thumb methods. They provide a crude bandwidth selection which in some situations may prove sufficient. Examples of rules of thumb are in [25], [27].
Plug-in techniques. The exact expression of the optimal bandwidth can be obtained from the asymptotic expressions of bias and variance, which unfortunately depend on unknown terms. The idea of the direct plug-in method is to replace these terms with estimates. This method was first introduced by Woodrofe [47] in density estimation. Examples of plug-in methods for nonparametric regression are reported in Ruppert et al. [41].
Data-driven estimation. A selection procedure which estimates the generalization error directly from data. Unlike the previous approaches, this method does not rely on asymptotic expressions but estimates the relevant quantities directly from the finite data set. To this group belong methods like cross-validation, Mallows' $C_p$, Akaike's AIC, and other extensions of methods used in classical parametric modeling.
On the use of cross-validation for local modeling in regression and time series prediction – p.21/75
Existing work (II)
- The debate on the superiority of plug-in methods over data-driven methods is still open and the experimental evidence is contrasting. Results in favor of plug-in methods come from [47, 41, 38].
- Loader [33] showed how the supposed superior performance of plug-in approaches is a complete myth. The use of cross-validation for bandwidth selection has been investigated in several papers, mainly in the case of density estimation [30].
- In regression, an adaptation of Mallows' $C_p$ was introduced by Rice [40] for constant fitting and by Cleveland and Devlin [21] in local polynomial regression. Cleveland and Loader [22] suggested local $C_p$ and local PRESS for choosing both the degree of the local polynomial mixing and the bandwidth.
- We believe that plug-in methods are built on a series of assumptions about the statistical process underlying the data set and on theoretical results which become more reliable as the number of points tends to infinity.
- In a common black-box situation where no a priori information is available, the adoption of data-driven techniques can be a promising approach to the problem.
On the use of cross-validation for local modeling in regression and time series prediction – p.22/75
Data-driven bandwidth selection
[Figure: for a query point q, the training set feeds a local weighted regression whose structural identification considers different bandwidths; each candidate model β(k) is assessed by its leave-one-out error MSE_loo(k), and the model selection step returns the prediction ŷ_q.]
On the use of cross-validation for local modeling in regression and time series prediction – p.23/75
Original contributions
Problem 1: identifying a sequence of local models is expensive.
Solution 1: we propose recursive least squares (RLS) to speed up the identification of a sequence of models with an increasing number of neighbors [6, 13].
Problem 2: validating a local model by cross-validation is expensive.
Solution 2: we compute the leave-one-out cross-validation error by obtaining the PRESS statistic from the terms of RLS [9].
Problem 3: choosing the best model is prone to errors.
Solution 3: we combine the best models [7].
On the use of cross-validation for local modeling in regression and time series prediction – p.24/75
Recursive-least-squares in space
[Figure: for a query point q, the sequence of local models β_m(k), β_{m+1}(k), ..., β_M(k) with a growing number of neighbors can be identified from scratch (slow identification) or obtained incrementally through RLS updates (fast identification).]
On the use of cross-validation for local modeling in regression and time series prediction – p.25/75
PRESS statistic and leave-one-out
[Figure: leave-one-out cross-validation repeats N times the cycle "put the j-th sample aside, perform the parametric identification on the remaining N-1 samples, test on the j-th sample"; the PRESS statistic returns the same errors from a single parametric identification on all N samples of the training set.]
PRESS was first introduced by Allen [2].
On the use of cross-validation for local modeling in regression and time series prediction – p.26/75
The regression task
Given two variables $x \in \mathbb{R}^n$ and $y \in \mathbb{R}$, let us consider the mapping $f: \mathbb{R}^n \to \mathbb{R}$, known only through a set of $N$ examples $\{(x_i, y_i)\}_{i=1}^{N}$ obtained as follows:
$$y_i = f(x_i) + \varepsilon_i,$$
where, $\forall i$, $\varepsilon_i$ is a random variable such that $E[\varepsilon_i] = 0$ and $E[\varepsilon_i \varepsilon_j] = 0$, $\forall j \neq i$, and such that $E[\varepsilon_i^m] = \mu_m(x_i)$, $\forall m \geq 2$, where $\mu_m(\cdot)$ is the unknown $m$-th moment of the distribution of $\varepsilon_i$ and is defined as a function of $x_i$.
In particular for $m = 2$, the last of the above mentioned properties implies that no assumption of global homoscedasticity is made.
On the use of cross-validation for local modeling in regression and time series prediction – p.27/75
Local Weighted Regression
- The problem of local regression can be stated as the problem of estimating the value that the regression function $f(x) = E[y \mid x]$ assumes for a specific query point $x$, using information pertaining only to a neighborhood of $x$.
- Given a query point $x_q$, and under the hypothesis of local homoscedasticity of $\varepsilon_i$, the parameter $\beta$ of a local linear approximation of $f(\cdot)$ in a neighborhood of $x_q$ can be obtained by solving the locally weighted regression
$$\hat\beta = \arg\min_{\beta} \sum_{i=1}^{N} \left\{ (y_i - x_i^T \beta)^2 \, K\!\left(\frac{d(x_i, x_q)}{h}\right) \right\},$$
where, given a metric on the space $\mathbb{R}^n$,
- $d(x_i, x_q)$ is the distance from the query point to the $i$-th example, $i = 1, \dots, N$,
- $K(\cdot)$ is a weight (aka kernel) function,
- $h$ is the bandwidth.
On the use of cross-validation for local modeling in regression and time series prediction – p.28/75
Local Weighted Regression (II)
- In matrix notation, the solution of the above stated weighted least squares problem is given by:
$$\hat\beta = (X^T W^T W X)^{-1} X^T W^T W y = (Z^T Z)^{-1} Z^T v = P Z^T v,$$
where $X$ is a matrix whose $i$-th row is $x_i^T$, $y$ is a vector whose $i$-th element is $y_i$, $W$ is a diagonal matrix whose $i$-th diagonal element is $w_{ii} = K(d(x_i, x_q)/h)$, $Z = WX$, $v = Wy$, and the matrix $X^T W^T W X = Z^T Z$ is assumed to be non-singular so that its inverse $P = (Z^T Z)^{-1}$ is defined.
- Once the local linear approximation is obtained, a prediction of $y_q = f(x_q)$ is finally given by:
$$\hat y_q = x_q^T \hat\beta.$$
On the use of cross-validation for local modeling in regression and time series prediction – p.29/75
Linear Leave-one-out
- By exploiting the linearity of the local approximator, a leave-one-out cross-validation estimate of the mean squared error $E[(f(x_q) - \hat y_q)^2]$ can be obtained without any significant computational overload.
- In fact, using the PRESS statistic [2, 37], it is possible to calculate the error $e^{\mathrm{cv}}_j = y_j - x_j^T \hat\beta_{-j}$ without explicitly identifying the parameters $\hat\beta_{-j}$ from the examples available with the $j$-th one removed.
- The formulation of the PRESS statistic for the case at hand is the following:
$$e^{\mathrm{cv}}_j = y_j - x_j^T \hat\beta_{-j} = \frac{y_j - x_j^T P Z^T v}{1 - z_j^T P z_j} = \frac{y_j - x_j^T \hat\beta}{1 - h_{jj}},$$
where $z_j^T$ is the $j$-th row of $Z$ (and therefore $z_j = w_{jj} x_j$), and where $h_{jj}$ is the $j$-th diagonal element of the Hat matrix $H = Z P Z^T = Z (Z^T Z)^{-1} Z^T$.
On the use of cross-validation for local modeling in regression and time series prediction – p.30/75
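A small numerical check of the PRESS identity above, written under the same assumptions (linear model, diagonal weight matrix W); all names and the synthetic data are illustrative. The leave-one-out errors obtained from the hat matrix coincide with those obtained by explicitly refitting without each sample.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 30, 3
X = rng.standard_normal((N, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(N)
w = rng.uniform(0.5, 1.0, N)                  # diagonal of W (kernel weights)

Z, v = w[:, None] * X, w * y                  # Z = W X, v = W y
P = np.linalg.inv(Z.T @ Z)
beta = P @ Z.T @ v
H = Z @ P @ Z.T                               # hat matrix
e_press = (y - X @ beta) / (1.0 - np.diag(H)) # PRESS leave-one-out errors

# explicit leave-one-out refits for comparison
e_loo = np.empty(N)
for j in range(N):
    m = np.arange(N) != j
    bj = np.linalg.lstsq(Z[m], v[m], rcond=None)[0]
    e_loo[j] = y[j] - X[j] @ bj
print(np.allclose(e_press, e_loo))            # True
```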
Rectangular weight function
- In what follows, for the sake of simplicity, we will focus on the linear approximator. An extension to generic polynomial approximators of any degree is straightforward. We will also assume that a metric on the space $\mathbb{R}^n$ is given. All the attention will thus be centered on the problem of bandwidth selection.
- If as weight function $K(\cdot)$ the indicator function
$$K\!\left(\frac{d(x_i, x_q)}{h}\right) = \begin{cases} 1 & \text{if } d(x_i, x_q) \le h, \\ 0 & \text{otherwise,} \end{cases}$$
is adopted, the optimization of the parameter $h$ can be conveniently reduced to the optimization of the number $k$ of neighbors to which a unitary weight is assigned in the local regression evaluation.
- In other words, we reduce the problem of bandwidth selection to a search in the space of $h(k) = d(x_{(k)}, x_q)$, where $x_{(k)}$ is the $k$-th nearest neighbor of the query point.
On the use of cross-validation for local modeling in regression and time series prediction – p.31/75
Recursive local regression
The main advantage deriving from the adoption of the rectangular weight function is that, simply by updating the parameter $\hat\beta(k)$ of the model identified using the $k$ nearest neighbors, it is straightforward and inexpensive to obtain $\hat\beta(k+1)$. In fact, performing a step of the standard recursive least squares algorithm [4], we have:
$$
\begin{aligned}
P(k+1) &= P(k) - \frac{P(k)\, x(k+1)\, x^T(k+1)\, P(k)}{1 + x^T(k+1)\, P(k)\, x(k+1)}, \\
\gamma(k+1) &= P(k+1)\, x(k+1), \\
e(k+1) &= y(k+1) - x^T(k+1)\, \hat\beta(k), \\
\hat\beta(k+1) &= \hat\beta(k) + \gamma(k+1)\, e(k+1),
\end{aligned}
$$
where $P(k) = (Z^T Z)^{-1}$ when $h = h(k)$, and where $x(k+1)$ is the $(k+1)$-th nearest neighbor of the query point.
On the use of cross-validation for local modeling in regression and time series prediction – p.32/75
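A minimal sketch of the update above (rectangular kernel, so unit weights, and an invertible initial matrix are assumed; the function name and test data are illustrative): β̂(k+1) and P(k+1) are obtained from β̂(k) and P(k) when the (k+1)-th nearest neighbour is added, and checked against a direct least squares fit.

```python
import numpy as np

def rls_add_neighbor(beta, P, x_new, y_new):
    """One recursive least squares step: add one neighbour to the local model."""
    Px = P @ x_new
    P_new = P - np.outer(Px, Px) / (1.0 + x_new @ Px)    # P(k+1)
    gamma = P_new @ x_new                                # gain
    err = y_new - x_new @ beta                           # a priori error
    beta_new = beta + gamma * err                        # beta(k+1)
    return beta_new, P_new

# check against a direct least squares fit on k+1 points
rng = np.random.default_rng(2)
Xk = rng.standard_normal((10, 3)); yk = rng.standard_normal(10)
P = np.linalg.inv(Xk.T @ Xk); beta = P @ Xk.T @ yk       # model on the first k neighbours
x1, y1 = rng.standard_normal(3), rng.standard_normal()
beta1, P1 = rls_add_neighbor(beta, P, x1, y1)
Xf = np.vstack([Xk, x1]); yf = np.append(yk, y1)
print(np.allclose(beta1, np.linalg.lstsq(Xf, yf, rcond=None)[0]))  # True
```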
Recursive PRESS computation
Moreover, once the matrix $P(k+1)$ is available, the leave-one-out cross-validation errors can be directly calculated without the need of any further model identification:
$$e^{\mathrm{cv}}_j(k+1) = \frac{y_j - x_j^T \hat\beta(k+1)}{1 - x_j^T P(k+1)\, x_j}, \qquad j = 1, \dots, k+1.$$
Let us define for each value of $k$ the vector $e^{\mathrm{cv}}(k)$ that contains all the leave-one-out errors associated to the model $\hat\beta(k)$.
On the use of cross-validation for local modeling in regression and time series prediction – p.33/75
Model selection
- The recursive algorithm returns, for a given query point $x_q$, a set of predictions $\hat y_q(k) = x_q^T \hat\beta(k)$, together with a set of associated leave-one-out error vectors $e^{\mathrm{cv}}(k)$.
- If the selection paradigm, frequently called winner-takes-all, is adopted, the most natural way to extract a final prediction $\hat y_q$ consists in comparing the predictions obtained for each value of $k$ on the basis of the classical mean squared error criterion:
$$\hat y_q = x_q^T \hat\beta(\hat k) \quad \text{with} \quad \hat k = \arg\min_k \widehat{\mathrm{MSE}}(k), \qquad \widehat{\mathrm{MSE}}(k) = \frac{\sum_{i=1}^{k} w_{ii}\, \big(e^{\mathrm{cv}}_i(k)\big)^2}{\sum_{i=1}^{k} w_{ii}}.$$
With the rectangular kernel the weights are unitary, so $\widehat{\mathrm{MSE}}(k)$ is simply the mean of the squared leave-one-out errors.
On the use of cross-validation for local modeling in regression and time series prediction – p.34/75
Local Model combination
- As an alternative to the winner-takes-all paradigm, we explored also the effectiveness of local combinations of estimates [46].
- The final prediction of the value $y_q$ is obtained as a weighted average of the best $b$ models, where $b$ is a parameter of the algorithm.
- Suppose the predictions $\hat y_q(k)$ and the error vectors $e^{\mathrm{cv}}(k)$ have been ordered creating a sequence of integers $\{k_i\}$ so that $\widehat{\mathrm{MSE}}(k_i) \le \widehat{\mathrm{MSE}}(k_j)$ for $i \le j$. The prediction $\hat y_q$ is given by
$$\hat y_q = \frac{\sum_{i=1}^{b} \zeta_i \, \hat y_q(k_i)}{\sum_{i=1}^{b} \zeta_i},$$
where the weights are the inverse of the mean squared errors: $\zeta_i = 1/\widehat{\mathrm{MSE}}(k_i)$.
This is an example of the generalized ensemble method [39].
On the use of cross-validation for local modeling in regression and time series prediction – p.35/75
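A sketch of this final step under the rectangular-kernel assumption (so the leave-one-out MSE is a plain mean): given, for each candidate number of neighbours k, the prediction ŷ_q(k) and its vector of leave-one-out errors, either pick the winner or average the b best candidates with weights 1/MSE. The data and names below are illustrative only.

```python
import numpy as np

def select_or_combine(preds, loo_errors, b=None):
    """preds[k]: prediction with k neighbours; loo_errors[k]: its leave-one-out errors.
    b=None -> winner-takes-all; otherwise weighted average of the b best candidates."""
    mse = {k: float(np.mean(e ** 2)) for k, e in loo_errors.items()}
    ranking = sorted(mse, key=mse.get)
    if b is None:
        return preds[ranking[0]]
    best = ranking[:b]
    weights = np.array([1.0 / mse[k] for k in best])
    values = np.array([preds[k] for k in best])
    return float(weights @ values / weights.sum())

# illustrative candidates: prediction and leave-one-out error vector for each k
rng = np.random.default_rng(4)
preds = {k: 1.0 + 0.05 * rng.standard_normal() for k in (3, 5, 8, 12)}
loo = {k: 0.2 * rng.standard_normal(k) for k in (3, 5, 8, 12)}
print(select_or_combine(preds, loo))        # winner-takes-all
print(select_or_combine(preds, loo, b=3))   # combination of the 3 best models
```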
From local learning to Lazy Learning (LL)
- By speeding up the local learning procedure, we can delay learning to the moment when a prediction at a query point is required (query-by-query learning).
- The combination approach makes it possible to integrate local models of different order (e.g. constant and linear) and different bandwidths.
- This method is called lazy since the whole learning procedure (i.e. the parametric and the structural identification) is deferred until a prediction is required.
On the use of cross-validation for local modeling in regression and time series prediction – p.36/75
Experimental setup for regression
Datasets: 23 real and artificial datasets from the ML repository.
Methods: Lazy Learning, Local modeling, Feed Forward Neural
Networks, Mixtures of Experts, Neuro Fuzzy, Regression Trees
(Cubist).
Experimental methodology: 10-fold cross-validation.
Results: Mean absolute error (Table 7.2), relative error (Table 7.3) and
paired t-test (Appendix C) [7].
On the use of cross-validation for local modeling in regression and time series prediction – p.37/75
Regression datasets
Dataset Number of examples Number of regressors
Housing 330 8
Cpu 506 13
Prices 209 6
Mpg 159 16
Servo 392 7
Ozone 167 8
Bodyfat 252 13
Pool 253 3
Energy 2444 5
Breast 699 9
Abalone 4177 10
Sonar 208 60
Bupa 345 6
Iono 351 34
Pima 768 8
Kin_8fh 8192 8
Kin_8nh 8192 8
Kin_8fm 8192 8
Kin_8nm 8192 8
Kin_32fh 8192 32
Kin_32nh 8192 32
Kin_32fm 8192 32
Kin_32nm 8192 32
On the use of cross-validation for local modeling in regression and time series prediction – p.38/75
Experimental results: paired comparison
Each method is statistically compared with all the others
(9 * 23 =207 comparisons).
Method
Number of times the method
was significantly worse than another
LL linear 74
LL constant 96
LL combination 23
Local modeling linear 58
Local modeling constant 81
Cubist 40
Feed Forward NN 53
Mixtures of Experts 80
Local Model Network (fuzzy) 132
Local Model Network (k-mean) 145
The lower, the better!
On the use of cross-validation for local modeling in regression and time series prediction – p.39/75
Award in EUFIT competition
Data analysis competition on regression: awarded as a runner-up at the Third International Erudit competition on Protecting rivers and streams by monitoring chemical concentrations and algae communities [10].
On the use of cross-validation for local modeling in regression and time series prediction – p.40/75
Lazy Learning for dynamic tasks
Multi-step-ahead prediction: [12]
long horizon forecasting based on the iteration of a LL
one-step-ahead predictor.
Nonlinear control: [11]
1. Lazy Learning inverse/forward control.
2. Lazy Learning self-tuning control.
3. Lazy Learning optimal control.
On the use of cross-validation for local modeling in regression and time series prediction – p.41/75
Embedding in time series
Consider a sequence $\{\varphi_t\}$ of measurements $\varphi_t \in \mathbb{R}$ of an observable at equal time intervals.
We express the present value as a function of the previous $n$ values of the time series itself:
$$\varphi_t = f(\varphi_{t-1}, \varphi_{t-2}, \dots, \varphi_{t-n}),$$
where $f$ is an unknown nonlinear function and the vector $[\varphi_{t-1}, \varphi_{t-2}, \dots, \varphi_{t-n}]$ lies in the $n$-dimensional time delay space, or lag space.
This standard approach is called "state-space reconstruction" in the physics community, "tapped delay line" in the engineering community, and Nonlinear Autoregressive (NAR) modeling in the forecasting community.
On the use of cross-validation for local modeling in regression and time series prediction – p.42/75
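A minimal sketch of the lag-space embedding just described: a scalar series is turned into the input/output pairs $(\varphi_{t-1}, \dots, \varphi_{t-n}) \to \varphi_t$ that any supervised learner (e.g. the local predictor above) can consume. The function name and the toy series are illustrative.

```python
import numpy as np

def embed(series, n):
    """Return (X, y) with X[t] = [phi_{t-1}, ..., phi_{t-n}] and y[t] = phi_t."""
    series = np.asarray(series, dtype=float)
    X = np.column_stack([series[n - j - 1: len(series) - j - 1] for j in range(n)])
    y = series[n:]
    return X, y

phi = np.sin(0.3 * np.arange(50))
X, y = embed(phi, n=3)
print(X.shape, y.shape)        # (47, 3) (47,)
print(X[0], y[0])              # [phi_2, phi_1, phi_0] -> phi_3
```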
[Figure: three views of the same data — the temporal representation (time plot of the series), the embedding representation $\varphi_{t+1} = f(\varphi_t, \varphi_{t-1}, \dots, \varphi_{t-n+1})$, and the input/output representation ($\varphi_{t+1}$ vs. $\varphi_t$) used for supervised learning; numbered points mark corresponding samples.]
On the use of cross-validation for local modeling in regression and time series prediction – p.43/75
One-step and multi-step-ahead prediction
One-step-ahead prediction: the $n$ previous values of the series are assumed to be available for the prediction of the next value.
This is equivalent to a problem of supervised learning. LL was used in this way in several prediction tasks: finance, economic variables, environmental modeling [23].
Multi-step-ahead prediction: we predict the value of the series for the next $h$ steps.
We can classify the methods for multiple step prediction according to two features: the horizon of the predictor and the training criterion.
On the use of cross-validation for local modeling in regression and time series prediction – p.44/75
Multi-step-ahead prediction
One-step-ahead predictor and one-step-ahead training criterion. The model predicts $h$ steps ahead by iterating a one-step-ahead predictor whose parameters are optimized to minimize the training error on the one-step-ahead forecast.
One-step-ahead predictor and $h$-step-ahead training criterion. The model predicts $h$ steps ahead by iterating a one-step-ahead predictor whose parameters are optimized to minimize the training error on the iterated $h$-step-ahead forecast.
Direct forecasting. The model makes a direct forecast at time $t + h$:
$$\varphi_{t+h} = f_h(\varphi_t, \varphi_{t-1}, \dots, \varphi_{t-n+1}).$$
On the use of cross-validation for local modeling in regression and time series prediction – p.45/75
Iteration of a one-step-ahead predictor
[Figure: block diagram — the one-step model $f(\varphi_{t-1}, \varphi_{t-2}, \dots, \varphi_{t-n})$ produces $\hat\varphi_t$, which is fed back through the delay elements $z^{-1}$ so that predicted values replace the unavailable future observations at the next iteration.]
On the use of cross-validation for local modeling in regression and time series prediction – p.46/75
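A sketch of the iteration in the diagram above: the last n observed values are fed to a one-step predictor and each prediction is pushed back into the lag vector (the feedback through $z^{-1}$). Any one-step predictor can be plugged in, for instance the local_linear_predict sketch given earlier; the fixed linear rule used here is purely illustrative.

```python
import numpy as np

def iterate_forecast(one_step, last_values, horizon):
    """Iterate a one-step-ahead predictor; last_values = [phi_{t-1}, ..., phi_{t-n}]."""
    lags = list(last_values)
    out = []
    for _ in range(horizon):
        pred = one_step(np.array(lags))
        out.append(pred)
        lags = [pred] + lags[:-1]          # the prediction re-enters the lag vector
    return np.array(out)

# illustrative one-step model: a fixed linear AR(3) rule
phi_hat = iterate_forecast(lambda lags: 0.6 * lags[0] - 0.2 * lags[2],
                           last_values=[0.9, 0.5, 0.1], horizon=5)
print(phi_hat)
```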
Local Modeling in the time domain
Consider the embedding $\varphi_{t+1} = f(\varphi_t, \varphi_{t-1}, \dots, \varphi_{t-15})$ of order $n = 16$.
[Figure: time plot of the series with the embedding window (values at times $t, t-1, \dots, t-16$) highlighted directly in the time domain.]
On the use of cross-validation for local modeling in regression and time series prediction – p.47/75
Local Modeling in the I/O space
Consider the embedding $\varphi_{t+1} = f(\varphi_t)$ of order $n = 1$.
[Figure: the same data plotted in the input/output space, $\varphi_{t+1}$ vs. $\varphi_t$, with the query point q and its nearest neighbors. Note the labels of the axes!]
On the use of cross-validation for local modeling in regression and time series prediction – p.48/75
Local modeling in the embedding space
Consider the embedding $\varphi_{t+1} = f(\varphi_t, \varphi_{t-1})$ of order $n = 2$.
[Figure: the same data in the two-dimensional lag space; the query point and its nearest neighbors (numbered 1-5) are selected in this space before fitting the local model.]
On the use of cross-validation for local modeling in regression and time series prediction – p.49/75
Conventional and iterated leave-one-out
[Figure: comparison on five samples of (a) the conventional leave-one-out error e_cv(3), computed on the one-step-ahead prediction of the removed sample 3, and (b) the iterated leave-one-out error e_it(3), computed when the removed sample is predicted through the iteration of the one-step-ahead model.]
On the use of cross-validation for local modeling in regression and time series prediction – p.50/75
Iterated PRESS in the space
[Figure: iterated PRESS illustrated in the input/output space with the pairs $(x_i, y_i)$ and $(y_i, z_i)$, the leave-one-out models $\beta_{xy}^{-3}$ and $\beta_{yz}^{-3}$, the corresponding leave-one-out errors $e_{xy}(3)$ and $e_{yz}(3)$, and the iterated error $e_{xz}(3)$ for the held-out sample 3.]
Here $x$ represents the value of the time series (of order $n = 1$) at time $t - 1$, $y$ represents the value of the time series at time $t$, and $z$ represents the value of the time series at time $t + 1$.
On the use of cross-validation for local modeling in regression and time series prediction – p.51/75
From conventional to iterated PRESS
- The PRESS statistic returns the leave-one-out errors as a by-product of the local weighted regression.
- We derived in [12] an analytical iterated formulation of the PRESS statistic for long-horizon assessment.
- The iterated assessment criterion improves stability and prediction accuracy.
On the use of cross-validation for local modeling in regression and time series prediction – p.52/75
The iterated multi-step-ahead algorithm
1. The time series is embedded as an input/output mapping $f: \mathbb{R}^n \to \mathbb{R}$.
2. The one-step-ahead predictor is a local estimate of the mapping $f$.
3. The $h$-step-ahead prediction is performed by iterating the one-step-ahead estimator.
4. Local structure identification is performed in a space of alternative model configurations, each characterized by a different bandwidth.
5. Prediction ability is assessed by the iterated formulation of the cross-validation PRESS statistic ($h$-step-ahead criterion).
On the use of cross-validation for local modeling in regression and time series prediction – p.53/75
The Santa Fe time series
- The iterated PRESS approach has been applied both to the prediction of a real-world data set (A) and to a computer-generated time series (D) from the Santa Fe Time Series Prediction and Analysis Competition.
- The A time series has a training set of 1000 values and a test set of 10000 samples: the task is to predict the continuation for 100 steps, starting from different points.
- The D time series has a training set of 100000 values and a test set of 500 samples: the task is to predict the continuation for 25 steps, starting from different points.
On the use of cross-validation for local modeling in regression and time series prediction – p.54/75
A series: training set
[Figure: the 1000-sample training set of the Santa Fe A series.]
On the use of cross-validation for local modeling in regression and time series prediction – p.55/75
A series: one-step criterion
[Figure: 100-step continuation of the A series predicted with the one-step (non-iterated) assessment criterion.]
On the use of cross-validation for local modeling in regression and time series prediction – p.56/75
A series: multi-step criterion
[Figure: 100-step continuation of the A series predicted with the iterated multi-step assessment criterion.]
On the use of cross-validation for local modeling in regression and time series prediction – p.57/75
Experiments: The Santa Fe Time Series A
order n=16 Training set: 1000 values Test set: 100 steps
Test data Non iter. PRESS Iter. PRESS Sauer Wan
1-100 0.350 0.029 0.077 0.055
1180-1280 0.379 0.131 0.174 0.065
2870-2970 0.793 0.055 0.183 0.487
3000-3100 0.003 0.003 0.006 0.023
4180-4280 1.134 0.051 0.111 0.160
Sauer: combination of iterated and direct local models.
Wan: recurrent network.
On the use of cross-validation for local modeling in regression and time series prediction – p.58/75
The Santa Fe Time Series D
order n = 20   Training set: 100,000 values   Test set: 25 steps
Test data Non iter. PRESS Iter. PRESS Zhang Hutchinson
0-24 0.1255 0.0492 0.0665
100-124 0.0460 0.0363 0.0616
200-224 0.2635 0.1692 0.1475
300-324 0.0461 0.0405 0.0541
400-424 0.1610 0.0644 0.0720
Zhang: combination of iterated and direct multilayer perceptron.
On the use of cross-validation for local modeling in regression and time series prediction – p.59/75
Award in Leuven Competition
Training set made of 2000 points.
Task: predict the continuation for the next 200 points.
[Figure: the predicted 200-point continuation of the Leuven competition time series.]
Iterated Lazy Learning ranked second and fourth [8].
On the use of cross-validation for local modeling in regression and time series prediction – p.60/75
Lazy Learning for iterated prediction
Multi-step ahead by iteration of a one-step predictor.
Lazy learning to implement the one-step predictor.
Selection of the local structure by an iterated PRESS.
Iterated criterion avoids the accumulation of prediction errors and
improves the performance.
On the use of cross-validation for local modeling in regression and time series prediction – p.61/75
Complexity in global and local modeling
Consider N training samples, n features, and Q query points.

                                          GLOBAL          LAZY
Parametric identification                 O(NLS)          O(Nn) + O(LS)
Structural identification (K-fold CV)     K O(NLS)        small
Prediction for Q queries                  negligible      Q [O(Nn) + O(LS)]
TOTAL                                     K O(NLS)        Q [O(Nn) + O(LS)]

where O(NLS) stands for the cost of a nonlinear least squares identification and O(LS) stands for the cost of a linear least squares identification.
On the use of cross-validation for local modeling in regression and time series prediction – p.62/75
Feature selection and LL
- Local modeling techniques are known to be weak in high-dimensional spaces.
- A way to counter the curse of dimensionality is dimensionality reduction (aka feature selection).
- It requires the assessment of an exponential number of alternatives ($2^n$ subsets of input variables) and the choice of the best one.
- Several techniques exist: we focus here on wrappers.
- Wrappers rely on expensive cross-validation (e.g. leave-one-out assessment).
- Our idea: combine racing [34] and sub-sampling [29] to accelerate the wrapper feature selection procedure in LL.
On the use of cross-validation for local modeling in regression and time series prediction – p.63/75
On the use of cross-validation for local modeling in regression and time series prediction – p.64/75
Racing for feature selection
- Suppose we have several sets of different input variables.
- The computational cost of making a selection results from the cost of identification and the cost of validation.
- The validation cost required by a global model is independent of Q, while this is not the case for LL.
- The idea of racing techniques consists in using blocking and paired multiple tests to compare different models in similar conditions and to discard the worst ones as soon as possible.
- Racing reduces the number of tests to be made.
- This makes the wrapper LL approach more competitive.
On the use of cross-validation for local modeling in regression and time series prediction – p.65/75
On the use of cross-validation for local modeling in regression and time series prediction – p.66/75
On the use of cross-validation for local modeling in regression and time series prediction – p.67/75
Sub-sampling and LL
- The goal of model selection is to find the best hypothesis in a set of alternatives.
- What is relevant is the ordering of the different alternatives, e.g. M2 ≻ M3 ≻ M5 ≻ M1.
- By reducing the training set size N, we expect to reduce the accuracy of each single model but not necessarily to alter their ordering.
- In LL, reducing the training set size N reduces the cost.
- The idea of sub-sampling is to reduce the size of the training set without altering the ranking of the different models.
- This makes the LL approach more competitive.
On the use of cross-validation for local modeling in regression and time series prediction – p.68/75
RACSAM for feature selection
We proposed the following algorithm [14]:
1. Define an initial group of promising feature subsets.
2. Start with small training and test sets.
3. Discard by racing all the feature subsets that appear significantly worse than the others.
4. Increase the training and test size until at most a predefined number of winner models remain.
5. Update the group with new candidates to be assessed and go back to 3.
On the use of cross-validation for local modeling in regression and time series prediction – p.69/75
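A rough sketch of the racing idea in steps 2-4, not the actual RACSAM implementation: it assumes an assess(c, idx) function returning the leave-one-out errors of feature subset c on a shared sub-sample of query indices, uses scipy.stats.ttest_rel as the paired test, and drops candidates significantly worse than the current best as the sample grows. Candidates are assumed hashable (e.g. tuples of column indices); step 5 (adding new candidates) is omitted.

```python
import numpy as np
from scipy import stats

def race(candidates, assess, sample_sizes, n_total, alpha=0.01, seed=0):
    """Racing sketch: grow the evaluation sample and discard feature subsets that are
    significantly worse (paired t-test) than the current best."""
    order = np.random.default_rng(seed).permutation(n_total)
    alive = list(candidates)
    for n in sample_sizes:
        idx = order[:n]                               # same sub-sample for all candidates (paired)
        errors = {c: np.asarray(assess(c, idx)) for c in alive}
        best = min(alive, key=lambda c: errors[c].mean())
        keep = []
        for c in alive:
            if c is best:
                keep.append(c)
                continue
            _, p = stats.ttest_rel(errors[c], errors[best])
            significantly_worse = p / 2 < alpha and errors[c].mean() > errors[best].mean()
            if not significantly_worse:
                keep.append(c)
        alive = keep
        if len(alive) == 1:
            break
    return alive
```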
Experimental session
- We compare the prediction accuracy of the LL algorithm enhanced by the RACSAM procedure to the accuracy of two state-of-the-art algorithms: an SVM for regression and a regression tree (RTREE).
- Two versions of the RACSAM algorithm were tested: the first (LL-RAC1) takes as feature set the best one (in terms of estimated Mean Absolute Error (MAE)) among the winning candidates; the second (LL-RAC2) averages the predictions of the winning LL predictors.
- The p-value used in the racing tests is 0.01.
On the use of cross-validation for local modeling in regression and time series prediction – p.70/75
Experimental results
Five-fold cross-validation on six real datasets of high dimensionality: Ailerons, Pole, Elevators, Triazines, Wisconsin, and Census.
Dataset AIL POL ELE TRI WIS CEN
LL-RAC1 9.7e-5 3.12 1.6e-3 0.21 27.39 0.17
LL-RAC2 9.0e-5 3.13 1.5e-3 0.12 27.41 0.16
SVM 1.3e-4 26.5 1.9e-3 0.11 29.91 0.21
RTREE 1.8e-4 8.80 3.1e-3 0.11 33.02 0.17
On the use of cross-validation for local modeling in regression and time series prediction – p.71/75
Applications
- Financial prediction of stock markets: in collaboration with Masterfood, Belgium.
- Prediction of yearly sales: in collaboration with D'Ieteren, Belgium, the leading Belgian car dealer.
- Nonlinear control and identification tasks in power systems: in collaboration with Università del Sannio (I) [44, 18].
- Modeling of industrial processes: in collaboration with the FaFer Usinor steel company (B) and the Honeywell Technology Center (US).
- Performance modeling of embedded systems: during my stay at Philips Research [16], Eindhoven (NL).
- Quality of service: during my stay at IMEC, Leuven (B) [17].
- Black-box simulators: in collaboration with CENAERO, Gosselies (B) [15].
- Environmental predictions: in collaboration with Politecnico di Milano (I) [23].
On the use of cross-validation for local modeling in regression and time series prediction – p.72/75
Software
- MATLAB toolbox on Lazy Learning [5].
- R contributed package lazy.
- Joint work with Dr. Mauro Birattari (IRIDIA).
- Web page: http://iridia.ulb.ac.be/~lazy.
- About 5000 accesses since October 2002.
On the use of cross-validation for local modeling in regression and time series prediction – p.73/75
The importance of being Lazy
- Fast data-driven design.
- No global assumption on the noise.
- Linear methods still effective in a multivariate nonlinear setting (LWR, PRESS).
- An estimate of the variance is returned with each prediction.
- Intrinsically adaptive.
On the use of cross-validation for local modeling in regression and time series prediction – p.74/75
Future work
- Extension of the LL method to other local selection criteria (VC dimension, GCV).
- Classification applications.
- Integration with powerful software and hardware devices.
- From large to huge databases.
- New applications: bioinformatics, text mining, medical data, sensor networks, power systems.
On the use of cross-validation for local modeling in regression and time series prediction – p.75/75
References
[1] D. W. Aha. Editorial of special issue on lazy learning. Artificial Intelligence Review, 11(1–5):1–6, 1997.
[2] D. M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125–127, 1974.
[3] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1–5):11–73, 1997.
[4] G. J. Bierman. Factorization Methods for Discrete Sequential Estimation. Academic Press, New York, NY, 1977.
[5] M. Birattari and G. Bontempi. The lazy learning toolbox, for use with MATLAB. Technical Report TR/IRIDIA/99-7, IRIDIA-ULB, Brussels, Belgium, 1999.
[6] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive least-squares algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, NIPS 11, pages 375–381, Cambridge, 1999. MIT Press.
[7] G. Bontempi. Local Learning Techniques for Modeling, Prediction and Control. PhD thesis, IRIDIA, Université Libre de Bruxelles, 1999.
[8] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for iterated time series prediction. In J. A. K. Suykens and J. Vandewalle, editors, Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling, pages 62–68. Katholieke Universiteit Leuven, Belgium, 1998.
[9] G. Bontempi, M. Birattari, and H. Bersini. Recursive lazy learning for modeling and control. In Machine Learning: ECML-98 (10th European Conference on Machine Learning), pages 292–303. Springer, 1998.
[10] G. Bontempi, M. Birattari, and H. Bersini. Lazy learners at work: the lazy learning toolbox. In Proceedings of the 7th European Congress on Intelligent Techniques and Soft Computing EUFIT '99, 1999.
[11] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for modeling and control design. International Journal of Control, 72(7/8):643–658, 1999.
[12] G. Bontempi, M. Birattari, and H. Bersini. Local learning for iterated time-series prediction. In I. Bratko and S. Dzeroski, editors, Machine Learning: Proceedings of the Sixteenth International Conference, pages 32–38, San Francisco, CA, 1999. Morgan Kaufmann Publishers.
[13] G. Bontempi, M. Birattari, and H. Bersini. A model selection approach for local learning. Artificial Intelligence Communications, 121(1), 2000.
[14] G. Bontempi, M. Birattari, and P. E. Meyer. Combining lazy learning, racing and subsampling for effective feature selection. In Proceedings of the International Conference on Adaptive and Natural Computing Algorithms. Springer Verlag, 2005. To appear.
[15] G. Bontempi, O. Caelen, S. Pierret, and C. Goffaux. On the use of supervised learning techniques to speed up the design of aeronautics components. WSEAS Transactions on Systems, 10(3):3098–3103, 2005.
[16] G. Bontempi and W. Kruijtzer. The use of intelligent data analysis techniques for system-level design: a software estimation example. Soft Computing, 8(7):477–490, 2004.
[17] G. Bontempi and G. Lafruit. Enabling multimedia QoS control with black-box modelling. In D. Bustard, W. Liu, and R. Sterritt, editors, Soft-Ware 2002: Computing in an Imperfect World, Lecture Notes in Computer Science, pages 46–59, 2002.
[18] G. Bontempi, A. Vaccaro, and D. Villacci. A semi-physical modelling architecture for dynamic assessment of power components loading capability. IEE Proceedings of Generation, Transmission and Distribution, 151(4):533–542, 2004.
[19] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984.
[20] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74:829–836, 1979.
[21] W. S. Cleveland and S. J. Devlin. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83:596–610, 1988.
[22] W. S. Cleveland and C. Loader. Smoothing by local regression: Principles and methods. Computational Statistics, 11, 1995.
[23] G. Corani. Air quality prediction in Milan: feed-forward neural networks, pruned neural networks and lazy learning. Ecological Modelling, 2005. In press.
[24] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, pages 21–27, 1967.
[25] J. Fan and I. Gijbels. Adaptive order polynomial fitting: bandwidth robustification and bias reduction. Journal of Computational and Graphical Statistics, 4:213–227, 1995.
[26] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman and Hall, 1996.
[27] W. Hardle and J. S. Marron. Fast and simple scatterplot smoothing. Computational Statistics and Data Analysis, 20:1–17, 1995.
[28] R. Henderson. Note on graduation by adjusted average. Transactions of the Actuarial Society of America, 17:43–48, 1916.
[29] G. H. John and P. Langley. Static versus dynamic sampling for data mining. In Proceedings of the Second International Conference on Knowledge Discovery in Databases and Data Mining. AAAI/MIT Press, 1996.
[30] M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 90, 1995.
[31] V. Y. Katkovnik. Linear and nonlinear methods of nonparametric regression analysis. Soviet Automatic Control, 5:25–34, 1979.
[32] C. Loader. Local Regression and Likelihood. Springer, New York, 1999.
[33] C. R. Loader. Old faithful erupts: Bandwidth selection reviewed. Technical report, Bell Labs, 1987.
[34] O. Maron and A. Moore. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11(1–5):193–225, 1997.
[35] T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
[36] R. Murray-Smith and T. A. Johansen. Local learning in local model networks. In R. Murray-Smith and T. A. Johansen, editors, Multiple Model Approaches to Modeling and Control, chapter 7, pages 185–210. Taylor and Francis, 1997.
[37] R. H. Myers. Classical and Modern Regression with Applications. PWS-KENT Publishing Company, Boston, MA, second edition, 1994.
[38] B. U. Park and J. S. Marron. Comparison of data-driven bandwidth selectors. Journal of the American Statistical Association, 85:66–72, 1990.
[39] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for hybrid neural networks. In R. J. Mammone, editor, Artificial Neural Networks for Speech and Vision, pages 126–142. Chapman and Hall, 1993.
[40] J. Rice. Bandwidth choice for nonparametric regression. The Annals of Statistics, 12:1215–1230, 1984.
[41] D. Ruppert, S. J. Sheather, and M. P. Wand. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90:1257–1270, 1995.
[42] G. V. Schiaparelli. Sul modo di ricavare la vera espressione delle leggi della natura dalle curve empiricae. Effemeridi Astronomiche di Milano per l'Anno, 857:3–56, 1886.
[43] C. Stone. Consistent nonparametric regression. The Annals of Statistics, 5:595–645, 1977.
[44] D. Villacci, G. Bontempi, A. Vaccaro, and M. Birattari. The role of learning methods in the dynamic assessment of power components loading capability. IEEE Transactions on Industrial Electronics, 52(1), 2005.
[45] G. Wahba and S. Wold. A completely automatic french curve: Fitting spline functions by cross-validation. Communications in Statistics, 4(1), 1975.
[46] D. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992.
[47] M. Woodrofe. On choosing a delta-sequence. Annals of Mathematical Statistics, 41:1665–1671, 1970.
UmeshchandraYadav5
 
Post Graduate Admission Prediction System
Post Graduate Admission Prediction SystemPost Graduate Admission Prediction System
Post Graduate Admission Prediction System
IRJET Journal
 
Declarative data analysis
Declarative data analysisDeclarative data analysis
Declarative data analysis
South West Data Meetup
 
AIAA-SDM-SequentialSampling-2012
AIAA-SDM-SequentialSampling-2012AIAA-SDM-SequentialSampling-2012
AIAA-SDM-SequentialSampling-2012
OptiModel
 
AIAA-MAO-DSUS-2012
AIAA-MAO-DSUS-2012AIAA-MAO-DSUS-2012
AIAA-MAO-DSUS-2012
OptiModel
 
Rides Request Demand Forecast- OLA Bike
Rides Request Demand Forecast- OLA BikeRides Request Demand Forecast- OLA Bike
Rides Request Demand Forecast- OLA Bike
IRJET Journal
 

Similar to Local modeling in regression and time series prediction (20)

Autonomy Incubator Seminar Series: Tractable Robust Planning and Model Learni...
Autonomy Incubator Seminar Series: Tractable Robust Planning and Model Learni...Autonomy Incubator Seminar Series: Tractable Robust Planning and Model Learni...
Autonomy Incubator Seminar Series: Tractable Robust Planning and Model Learni...
 
Computational optimization, modelling and simulation: Recent advances and ove...
Computational optimization, modelling and simulation: Recent advances and ove...Computational optimization, modelling and simulation: Recent advances and ove...
Computational optimization, modelling and simulation: Recent advances and ove...
 
Dj4201737746
Dj4201737746Dj4201737746
Dj4201737746
 
Dolap13 v9 7.docx
Dolap13 v9 7.docxDolap13 v9 7.docx
Dolap13 v9 7.docx
 
achine Learning and Model Risk
achine Learning and Model Riskachine Learning and Model Risk
achine Learning and Model Risk
 
DSUS_MAO_2012_Jie
DSUS_MAO_2012_JieDSUS_MAO_2012_Jie
DSUS_MAO_2012_Jie
 
MACHINE LEARNING FOR SATELLITE-GUIDED WATER QUALITY MONITORING
MACHINE LEARNING FOR SATELLITE-GUIDED WATER QUALITY MONITORINGMACHINE LEARNING FOR SATELLITE-GUIDED WATER QUALITY MONITORING
MACHINE LEARNING FOR SATELLITE-GUIDED WATER QUALITY MONITORING
 
Tutorial on Deep Generative Models
 Tutorial on Deep Generative Models Tutorial on Deep Generative Models
Tutorial on Deep Generative Models
 
Predictive Analytics for Transportation in a High Dimensional Heterogeneous D...
Predictive Analytics for Transportation in a High Dimensional Heterogeneous D...Predictive Analytics for Transportation in a High Dimensional Heterogeneous D...
Predictive Analytics for Transportation in a High Dimensional Heterogeneous D...
 
zdx
zdxzdx
zdx
 
ASS_SDM2012_Ali
ASS_SDM2012_AliASS_SDM2012_Ali
ASS_SDM2012_Ali
 
Chapter_6_Prescriptive_Analytics_Optimization_and_Simulation.pptx.pdf
Chapter_6_Prescriptive_Analytics_Optimization_and_Simulation.pptx.pdfChapter_6_Prescriptive_Analytics_Optimization_and_Simulation.pptx.pdf
Chapter_6_Prescriptive_Analytics_Optimization_and_Simulation.pptx.pdf
 
HPC Deployment / Use Cases (EVEREST + DAPHNE: Workshop on Design and Programm...
HPC Deployment / Use Cases (EVEREST + DAPHNE: Workshop on Design and Programm...HPC Deployment / Use Cases (EVEREST + DAPHNE: Workshop on Design and Programm...
HPC Deployment / Use Cases (EVEREST + DAPHNE: Workshop on Design and Programm...
 
[20240318_LabSeminar_Huy]GSTNet: Global Spatial-Temporal Network for Traffic ...
[20240318_LabSeminar_Huy]GSTNet: Global Spatial-Temporal Network for Traffic ...[20240318_LabSeminar_Huy]GSTNet: Global Spatial-Temporal Network for Traffic ...
[20240318_LabSeminar_Huy]GSTNet: Global Spatial-Temporal Network for Traffic ...
 
A value added predictive defect type distribution model
A value added predictive defect type distribution modelA value added predictive defect type distribution model
A value added predictive defect type distribution model
 
Post Graduate Admission Prediction System
Post Graduate Admission Prediction SystemPost Graduate Admission Prediction System
Post Graduate Admission Prediction System
 
Declarative data analysis
Declarative data analysisDeclarative data analysis
Declarative data analysis
 
AIAA-SDM-SequentialSampling-2012
AIAA-SDM-SequentialSampling-2012AIAA-SDM-SequentialSampling-2012
AIAA-SDM-SequentialSampling-2012
 
AIAA-MAO-DSUS-2012
AIAA-MAO-DSUS-2012AIAA-MAO-DSUS-2012
AIAA-MAO-DSUS-2012
 
Rides Request Demand Forecast- OLA Bike
Rides Request Demand Forecast- OLA BikeRides Request Demand Forecast- OLA Bike
Rides Request Demand Forecast- OLA Bike
 

More from Gianluca Bontempi

A statistical criterion for reducing indeterminacy in linear causal modeling
A statistical criterion for reducing indeterminacy in linear causal modelingA statistical criterion for reducing indeterminacy in linear causal modeling
A statistical criterion for reducing indeterminacy in linear causal modeling
Gianluca Bontempi
 
Adaptive model selection in Wireless Sensor Networks
Adaptive model selection in Wireless Sensor NetworksAdaptive model selection in Wireless Sensor Networks
Adaptive model selection in Wireless Sensor Networks
Gianluca Bontempi
 
Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection
Combining Lazy Learning, Racing and Subsampling for Effective Feature SelectionCombining Lazy Learning, Racing and Subsampling for Effective Feature Selection
Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection
Gianluca Bontempi
 
A model-based relevance estimation approach for feature selection in microarr...
A model-based relevance estimation approach for feature selection in microarr...A model-based relevance estimation approach for feature selection in microarr...
A model-based relevance estimation approach for feature selection in microarr...
Gianluca Bontempi
 
Machine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series PredictionMachine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series Prediction
Gianluca Bontempi
 
Feature selection and microarray data
Feature selection and microarray dataFeature selection and microarray data
Feature selection and microarray data
Gianluca Bontempi
 
A Monte Carlo strategy for structure multiple-step-head time series prediction
A Monte Carlo strategy for structure multiple-step-head time series predictionA Monte Carlo strategy for structure multiple-step-head time series prediction
A Monte Carlo strategy for structure multiple-step-head time series prediction
Gianluca Bontempi
 
Some Take-Home Message about Machine Learning
Some Take-Home Message about Machine LearningSome Take-Home Message about Machine Learning
Some Take-Home Message about Machine Learning
Gianluca Bontempi
 
FP7 evaluation & selection: the point of view of an evaluator
FP7 evaluation & selection: the point of view of an evaluatorFP7 evaluation & selection: the point of view of an evaluator
FP7 evaluation & selection: the point of view of an evaluator
Gianluca Bontempi
 
Perspective of feature selection in bioinformatics
Perspective of feature selection in bioinformaticsPerspective of feature selection in bioinformatics
Perspective of feature selection in bioinformatics
Gianluca Bontempi
 
Computational Intelligence for Time Series Prediction
Computational Intelligence for Time Series PredictionComputational Intelligence for Time Series Prediction
Computational Intelligence for Time Series Prediction
Gianluca Bontempi
 

More from Gianluca Bontempi (11)

A statistical criterion for reducing indeterminacy in linear causal modeling
A statistical criterion for reducing indeterminacy in linear causal modelingA statistical criterion for reducing indeterminacy in linear causal modeling
A statistical criterion for reducing indeterminacy in linear causal modeling
 
Adaptive model selection in Wireless Sensor Networks
Adaptive model selection in Wireless Sensor NetworksAdaptive model selection in Wireless Sensor Networks
Adaptive model selection in Wireless Sensor Networks
 
Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection
Combining Lazy Learning, Racing and Subsampling for Effective Feature SelectionCombining Lazy Learning, Racing and Subsampling for Effective Feature Selection
Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection
 
A model-based relevance estimation approach for feature selection in microarr...
A model-based relevance estimation approach for feature selection in microarr...A model-based relevance estimation approach for feature selection in microarr...
A model-based relevance estimation approach for feature selection in microarr...
 
Machine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series PredictionMachine Learning Strategies for Time Series Prediction
Machine Learning Strategies for Time Series Prediction
 
Feature selection and microarray data
Feature selection and microarray dataFeature selection and microarray data
Feature selection and microarray data
 
A Monte Carlo strategy for structure multiple-step-head time series prediction
A Monte Carlo strategy for structure multiple-step-head time series predictionA Monte Carlo strategy for structure multiple-step-head time series prediction
A Monte Carlo strategy for structure multiple-step-head time series prediction
 
Some Take-Home Message about Machine Learning
Some Take-Home Message about Machine LearningSome Take-Home Message about Machine Learning
Some Take-Home Message about Machine Learning
 
FP7 evaluation & selection: the point of view of an evaluator
FP7 evaluation & selection: the point of view of an evaluatorFP7 evaluation & selection: the point of view of an evaluator
FP7 evaluation & selection: the point of view of an evaluator
 
Perspective of feature selection in bioinformatics
Perspective of feature selection in bioinformaticsPerspective of feature selection in bioinformatics
Perspective of feature selection in bioinformatics
 
Computational Intelligence for Time Series Prediction
Computational Intelligence for Time Series PredictionComputational Intelligence for Time Series Prediction
Computational Intelligence for Time Series Prediction
 

Recently uploaded

chapter one 1 cloudcomputing .pptx someone
chapter one 1 cloudcomputing .pptx someonechapter one 1 cloudcomputing .pptx someone
chapter one 1 cloudcomputing .pptx someone
abeeeeeeeer588
 
Hadoop Vs Snowflake Blog PDF Submission.pptx
Hadoop Vs Snowflake Blog PDF Submission.pptxHadoop Vs Snowflake Blog PDF Submission.pptx
Hadoop Vs Snowflake Blog PDF Submission.pptx
dewsharon760
 
emotional interface - dehligame satta for you
emotional interface  -  dehligame satta for youemotional interface  -  dehligame satta for you
emotional interface - dehligame satta for you
bkldehligame1
 
KeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZ
KeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZKeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZ
KeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZ
jp3113ig
 
Toward a National Research Platform to Enable Data-Intensive Computing
Toward a National Research Platform to Enable Data-Intensive ComputingToward a National Research Platform to Enable Data-Intensive Computing
Toward a National Research Platform to Enable Data-Intensive Computing
Larry Smarr
 
Indian KS Unit 2 Mathematicians (1).pptx
Indian KS Unit 2 Mathematicians (1).pptxIndian KS Unit 2 Mathematicians (1).pptx
Indian KS Unit 2 Mathematicians (1).pptx
Nikita Gaikwad
 
ChessMaster Project Presentation for Batch 1643.pptx
ChessMaster Project Presentation for Batch 1643.pptxChessMaster Project Presentation for Batch 1643.pptx
ChessMaster Project Presentation for Batch 1643.pptx
duduphc
 
Data Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 SolutionsData Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 Solutions
District 11 Solutions
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptxParcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
AltanAtabarut
 
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptxSAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
wojakmodern
 
Graph Machine Learning - Past, Present, and Future -
Graph Machine Learning - Past, Present, and Future -Graph Machine Learning - Past, Present, and Future -
Graph Machine Learning - Past, Present, and Future -
kashipong
 
19328-48051-2-PB.pdf jurnal ttg analisis
19328-48051-2-PB.pdf jurnal ttg analisis19328-48051-2-PB.pdf jurnal ttg analisis
19328-48051-2-PB.pdf jurnal ttg analisis
IndahMaimunah1
 
Unit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptxUnit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptx
Priyanka Jadhav
 
Data management and excel appication.pptx
Data management and excel appication.pptxData management and excel appication.pptx
Data management and excel appication.pptx
OlabodeSamuel3
 
Systane Global education training centre
Systane Global education training centreSystane Global education training centre
Systane Global education training centre
AkhinaRomdoni
 
一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理
一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理
一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理
ks1ni2di
 
Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635
HeidiLivengood
 
一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理
一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理
一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理
da42ki0
 
Accounting and Auditing Laws-Rules-and-Regulations
Accounting and Auditing Laws-Rules-and-RegulationsAccounting and Auditing Laws-Rules-and-Regulations
Accounting and Auditing Laws-Rules-and-Regulations
DALubis
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdfParcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
AltanAtabarut
 

Recently uploaded (20)

chapter one 1 cloudcomputing .pptx someone
chapter one 1 cloudcomputing .pptx someonechapter one 1 cloudcomputing .pptx someone
chapter one 1 cloudcomputing .pptx someone
 
Hadoop Vs Snowflake Blog PDF Submission.pptx
Hadoop Vs Snowflake Blog PDF Submission.pptxHadoop Vs Snowflake Blog PDF Submission.pptx
Hadoop Vs Snowflake Blog PDF Submission.pptx
 
emotional interface - dehligame satta for you
emotional interface  -  dehligame satta for youemotional interface  -  dehligame satta for you
emotional interface - dehligame satta for you
 
KeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZ
KeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZKeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZ
KeynoteUploadJRP ABCDEFGHIJKLMNOPQRSTUVWXYZ
 
Toward a National Research Platform to Enable Data-Intensive Computing
Toward a National Research Platform to Enable Data-Intensive ComputingToward a National Research Platform to Enable Data-Intensive Computing
Toward a National Research Platform to Enable Data-Intensive Computing
 
Indian KS Unit 2 Mathematicians (1).pptx
Indian KS Unit 2 Mathematicians (1).pptxIndian KS Unit 2 Mathematicians (1).pptx
Indian KS Unit 2 Mathematicians (1).pptx
 
ChessMaster Project Presentation for Batch 1643.pptx
ChessMaster Project Presentation for Batch 1643.pptxChessMaster Project Presentation for Batch 1643.pptx
ChessMaster Project Presentation for Batch 1643.pptx
 
Data Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 SolutionsData Analytics for Decision Making By District 11 Solutions
Data Analytics for Decision Making By District 11 Solutions
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptxParcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
 
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptxSAMPLE PRODUCT RESEARCH PR - strikingly.pptx
SAMPLE PRODUCT RESEARCH PR - strikingly.pptx
 
Graph Machine Learning - Past, Present, and Future -
Graph Machine Learning - Past, Present, and Future -Graph Machine Learning - Past, Present, and Future -
Graph Machine Learning - Past, Present, and Future -
 
19328-48051-2-PB.pdf jurnal ttg analisis
19328-48051-2-PB.pdf jurnal ttg analisis19328-48051-2-PB.pdf jurnal ttg analisis
19328-48051-2-PB.pdf jurnal ttg analisis
 
Unit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptxUnit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptx
 
Data management and excel appication.pptx
Data management and excel appication.pptxData management and excel appication.pptx
Data management and excel appication.pptx
 
Systane Global education training centre
Systane Global education training centreSystane Global education training centre
Systane Global education training centre
 
一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理
一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理
一比一原版(unb毕业证书)新布伦瑞克大学毕业证如何办理
 
Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635
 
一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理
一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理
一比一原版(uc毕业证书)加拿大卡尔加里大学毕业证如何办理
 
Accounting and Auditing Laws-Rules-and-Regulations
Accounting and Auditing Laws-Rules-and-RegulationsAccounting and Auditing Laws-Rules-and-Regulations
Accounting and Auditing Laws-Rules-and-Regulations
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdfParcel Delivery - Intel Segmentation and Last Mile Opt.pdf
Parcel Delivery - Intel Segmentation and Last Mile Opt.pdf
 

Local modeling in regression and time series prediction

  • 1. On the use of cross-validation for local modeling in regression and time series prediction Gianluca Bontempi gbonte@ulb.ac.be Machine Learning Group Departement d’Informatique, ULB Boulevard de Triomphe - CP 212 http://www.ulb.ac.be/di/mlg On the use of cross-validation for local modeling in regression and time series prediction – p.1/75
  • 2. Outline   The Machine Learning Group  A local learning algorithm: the Lazy Learning.   Lazy Learning for multivariate regression modeling.   Lazy Learning for multi-step-ahead time series prediction.   Lazy Learning for feature selection.   Applications.   Future work. On the use of cross-validation for local modeling in regression and time series prediction – p.2/75
  • 3. Machine Learning: a definition The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience. [35] On the use of cross-validation for local modeling in regression and time series prediction – p.3/75
  • 4. The Machine Learning Group (MLG) ¡ 7 researchers (1 prof, 6 PhD students), 4 graduate students).¡ Research topics: Bioinformatics, Classification, Computational statistics, Data mining, Regression, Time series prediction, Sensor networks. ¡ Computing facilities: cluster of 16 processors, LEGO Robotics Lab. ¡ Website: www.ulb.ac.be/di/mlg. ¡ Scientific collaborations in ULB: IRIDIA (Sciences Appliquées), Physiologie Moléculaire de la Cellule (IBMM), Conformation des Macromolécules Biologiques et Bioinformatique (IBMM), CENOLI (Sciences), Microarray Unit (Hopital Jules Bordet), Service d’Anesthesie (ERASME). ¡ Scientific collaborations outside ULB: UCL Machine Learning Group (B), Politecnico di Milano (I), Universitá del Sannio (I), George Mason University (US). ¡ The MLG is part to the "Groupe de Contact FNRS" on Machine Learning. On the use of cross-validation for local modeling in regression and time series prediction – p.4/75
  • 5. MLG: running projects 1. "Integrating experimental and theoretical approaches to decipher the molecular networks of nitrogen utilisation in yeast": ARC (Action de Recherche Concertée) funded by the Communauté Française de Belgique (2004-2009). Partners: IBMM (Gosselies and La Plaine), CENOLI. 2. "COMP2SYS" (COMPutational intelligence methods for COMPlex SYStems) MARIE CURIE Early Stage Research Training funded by the European Union (2004-2008). Main contractor: IRIDIA (ULB). 3. "Predictive data mining techniques in anaesthesia": FIRST Europe Objectif 1 funded by the Région wallonne and the Fonds Social Européen (2004-2009). Partners: Service d’anesthesie (ERASME). 4. "AIDAR - Adressage et Indexation de Documents Multimédias Assistés par des techniques de Reconnaissance Vocale": funded by Région Bruxelles-Capitale (2004-2006). Partners: Voice Insight, RTBF, Titan. On the use of cross-validation for local modeling in regression and time series prediction – p.5/75
  • 6. Machine learning and applied statistics Reductionist attitude: ML is a modern buzzword which equates to statistics plus marketing Positive attitude: ML paved the way to the treatment of real problems related to data analysis, sometimes overlooked by statisticians (nonlinearity, classification, pattern recognition, missing variables, adaptivity, optimization, massive datasets, data management, causality, representation of knowledge, parallelisation) Interdisciplinary attitude: ML should have its roots on statistics and complements it by focusing on: algorithmic issues, computational efficiency, data engineering. On the use of cross-validation for local modeling in regression and time series prediction – p.6/75
  • 7. Motivations
    - There exists a wide body of theoretical and practical results for linear methods in statistics, forecasting and control.
    - However, in real settings we often encounter nonlinear problems.
    - Nonlinear methods are generally more difficult to analyze than linear ones, rarely produce closed-form or analytically tractable expressions, and are not easy to manipulate and implement.
    - Local learning techniques are a powerful way of re-using linear techniques in a nonlinear setting.
  • 8. Prediction models from data
    [Block diagram: training data are used to build a prediction model; the model maps inputs to predicted outputs, and the prediction error with respect to the target drives the training.]
  • 9. Regression setting
    - Multidimensional input $x \in \mathbb{R}^n$ and scalar output $y \in \mathbb{R}$, with $y = f(x) + \varepsilon$, where $f$ is the unknown regression function and $\varepsilon$ is the random error term.
    - A finite number of noisy input/output observations (training set).
    - A test set of input values for which an accurate generalization or prediction of the output is required.
    - A learning machine which returns an input/output model on the basis of the training set.
    Assumption: no a priori knowledge on the process underlying the data.
  • 10. The global modeling approach
    [Figure: input-output regression problem in the (x, y) plane with a query point q.]
  • 11. The global modeling approach
    [Figure: training data set.]
  • 12. The global modeling approach
    [Figure: global model fitting.]
  • 13. The global modeling approach
    [Figure: prediction at the query point by using the fitted global model.]
  • 14. The global modeling approach
    [Figure: another prediction by using the fitted global model.]
  • 15. The local modeling approach
    [Figure: input-output regression problem in the (x, y) plane with a query point q.]
  • 16. The local modeling approach
    [Figure: training data set.]
  • 17. The local modeling approach
    [Figure: local fitting and prediction in a neighborhood of the query point.]
  • 18. The local modeling approach
    [Figure: another local fitting and prediction for a different query point.]
  • 19. Global vs. local modeling
    - The traditional approach to supervised learning is global modeling, which describes the relationship between the input and the output with an analytical function over the whole input domain.
    - Even for huge datasets, a parametric model can be stored in a small memory. Also, the evaluation of the parametric model requires a short program that can be executed in a reduced amount of time.
    - Modeling complex input/output relations often requires the adoption of global nonlinear models, whose learning procedures are typically slow and analytically intractable. In particular, validation methods, which address the problem of assessing a global model on the basis of a finite amount of noisy samples, are computationally prohibitive.
    - For these reasons, in recent years interest has grown in pursuing (divide-and-conquer) alternatives to global modeling techniques.
  • 20. Global vs. local modeling
    - The divide-and-conquer strategy consists in attacking a complex problem by dividing it into simpler problems whose solutions can be combined to yield a solution to the original problem.
    - Instances of the divide-and-conquer approach are modular techniques (e.g. local model networks [36], regression trees [19], splines [45]) and local modeling (aka smoothing) techniques.
    - The principle underlying local modeling is that a smooth function can be well approximated by a low-degree polynomial in the neighborhood of any query point.
    - Local modeling techniques do not return a global fit of the available dataset but perform the prediction of the output for specific test input values, also called queries.
    - The talk presents our contribution to local modeling techniques and their application to a number of experimental problems.
  • 21. Lazy vs. eager modeling
    - Eager techniques perform a large amount of computation for tuning the model before observing the new query.
    - An eager technique must therefore commit to a specific hypothesis that covers all future queries.
    - Lazy techniques [1] wait for the query to be defined before starting the learning procedure.
    - For that purpose, the database of observed input/output data is always kept in memory and the output prediction is obtained by interpolating the samples in the neighborhood of the query point.
    - Lazy methods generally require less computation during training but more computation when they must predict the target value for a new query.
  • 22. Examples
    - Classical linear regression is an example of a global, eager, and linear approach.
    - Neural networks (NN) are instances of the global, eager, and nonlinear approach: NN are global in the sense that a single representation covers the whole input space. They are eager in the sense that the examples are used for tuning the network and are then discarded without waiting for any query. Finally, NN are nonlinear in the sense that the relation between the weights and the output is nonlinear.
    - The technique we are going to discuss here is a lazy and local approach.
    - Remark: we can imagine a local technique (e.g. a K-nearest neighbor) where the most important parameter (i.e. the number of neighbors) is defined in an eager fashion.
  • 23. Some history
    - Local regression estimation was independently introduced in several different fields in the late nineteenth [42] and early twentieth century [28].
    - In the statistical literature, the method was independently introduced from different viewpoints in the late 1970s [20, 31, 43].
    - Reference books are Fan and Gijbels [26] and Loader [32].
    - In the machine learning literature, work on local techniques for classification dates back to 1967 [24]. A more recent reference is the special issue on Lazy Learning [1].
  • 24. Local modeling procedure
    The identification of a local model [3] can be summarized in these steps (see the sketch below):
    1. Compute the distance between the query and the training samples according to a predefined metric.
    2. Rank the neighbors on the basis of their distance to the query.
    3. Select a subset of the nearest neighbors according to the bandwidth, which measures the size of the neighborhood.
    4. Fit a local model (e.g. constant, linear, ...).
    Each of the local approaches has one or more structural (or smoothing) parameters that control the amount of smoothing performed. In this talk we will focus on bandwidth selection.
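A minimal sketch of these four steps, assuming a Euclidean metric and a rectangular kernel; the function name and the NumPy-based implementation are illustrative, not the original Lazy Learning code:

```python
import numpy as np

def local_predict(X, y, x_query, k, degree=1):
    """Steps 1-4: distance, ranking, neighbor selection, local fit."""
    # 1. distance between the query and the training samples (Euclidean metric)
    d = np.sqrt(((X - x_query) ** 2).sum(axis=1))
    # 2. rank the neighbors on the basis of their distance to the query
    order = np.argsort(d)
    # 3. select the k nearest neighbors (bandwidth = distance of the k-th neighbor)
    idx = order[:k]
    Xk, yk = X[idx], y[idx]
    # 4. fit a local model: constant (degree 0) or linear (degree 1)
    if degree == 0:
        return yk.mean()
    Zk = np.hstack([Xk, np.ones((k, 1))])           # local design matrix with intercept
    beta, *_ = np.linalg.lstsq(Zk, yk, rcond=None)  # local least squares fit
    return np.append(x_query, 1.0) @ beta
```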
  • 25. The bandwidth trade-off: overfit
    [Figure: a very narrow neighborhood around the query point q and the resulting local fit.]
    Too narrow a bandwidth leads to overfitting and a large prediction error e. In terms of the bias/variance trade-off, this is typically a situation of high variance.
  • 26. The bandwidth trade-off: underfit
    [Figure: a very large neighborhood around the query point q and the resulting local fit.]
    Too large a bandwidth leads to underfitting and a large prediction error e. In terms of the bias/variance trade-off, this is typically a situation of high bias.
  • 27. Bandwidth and bias/variance trade-off
    [Figure: mean squared error versus 1/bandwidth; the bias term dominates when many neighbors are used (underfitting), the variance term dominates when few neighbors are used (overfitting).]
  • 28. Existing work on bandwidth selection
    Rule-of-thumb methods. They provide a crude bandwidth selection which in some situations may turn out to be sufficient. Examples of rules of thumb are in [25], [27].
    Plug-in techniques. The exact expression of the optimal bandwidth can be obtained from the asymptotic expressions of bias and variance, which unfortunately depend on unknown terms. The idea of the direct plug-in method is to replace these terms with estimates. This method was first introduced by Woodrofe [47] in density estimation. Examples of plug-in methods for nonparametric regression are reported in Ruppert et al. [41].
    Data-driven estimation. A selection procedure which estimates the generalization error directly from the data. Unlike the previous approach, this method does not rely on asymptotic expressions but estimates the values directly from the finite data set. To this group belong methods like cross-validation, Mallows' $C_p$, Akaike's AIC and other extensions of methods used in classical parametric modeling.
  • 29. Existing work (II)
    - The debate on the superiority of plug-in methods over data-driven methods is still open and the experimental evidence is contrasting. Results on behalf of plug-in methods come from [47, 41, 38].
    - Loader [33] showed how the supposed superior performance of plug-in approaches is a complete myth. The use of cross-validation for bandwidth selection has been investigated in several papers, mainly in the case of density estimation [30].
    - In regression, an adaptation of Mallows' $C_p$ was introduced by Rice [40] for constant fitting and by Cleveland and Devlin [21] in local polynomial regression. Cleveland and Loader [22] suggested local $C_p$ and local PRESS for choosing both the degree of the local polynomial mixing and the bandwidth.
    - We believe that plug-in methods are built on a series of assumptions about the statistical process underlying the data set and on theoretical results which become more reliable as the number of points tends to infinity.
    - In a common black-box situation where no a priori information is available, the adoption of data-driven techniques is a promising approach to the problem.
  • 30. Data-driven bandwidth selection
    [Diagram: for a given query, local weighted regressions are identified on the training set for a range of bandwidths; each candidate model $\hat\beta(k)$ is assessed by its leave-one-out error $\widehat{\text{MSE}}_{\text{loo}}(k)$, the structural identification (model selection) picks the best bandwidth, and the selected model produces the prediction $\hat y_q$.]
  • 31. Original contributions
    Problem 1: identifying a sequence of local models is expensive.
    Solution 1: we propose recursive least squares (RLS) to speed up the identification of a sequence of models with an increasing number of neighbors [6, 13].
    Problem 2: validating a local model by cross-validation is expensive.
    Solution 2: we compute the leave-one-out cross-validation by obtaining the PRESS statistic through the terms of RLS [9].
    Problem 3: choosing the best model is prone to errors.
    Solution 3: we combine the best models [7].
  • 32. Recursive-least-squares in space
    [Figure: identifying the sequence of local models $\hat\beta(k_m), \hat\beta(k_{m+1}), \ldots, \hat\beta(k_M)$ from scratch for each neighborhood size is slow; updating them recursively (RLS) as one more neighbor is added is fast.]
  • 33. PRESS statistic and leave-one-out
    [Diagram: leave-one-out requires N parametric identifications, each on N-1 samples with the j-th sample put aside and used for testing; the PRESS statistic returns the same errors from a single parametric identification on all N samples.]
    PRESS was first introduced by Allen [2].
  • 34. The regression task
    Given two variables $x \in \mathbb{R}^n$ and $y \in \mathbb{R}$, let us consider the mapping $f: \mathbb{R}^n \to \mathbb{R}$, known only through a set of $N$ examples $\{(x_i, y_i)\}_{i=1}^{N}$ obtained as follows:
    $y_i = f(x_i) + \varepsilon_i$,
    where, for each $i$,
    - $\varepsilon_i$ is a random variable such that $E[\varepsilon_i] = 0$ and $E[\varepsilon_i \varepsilon_j] = 0$ for all $j \neq i$,
    - $E[\varepsilon_i^m] = \mu_m(x_i)$ for all $m \geq 2$, where $\mu_m(\cdot)$ is the unknown $m$-th moment of the distribution of $\varepsilon_i$ and is defined as a function of $x_i$.
    In particular, for $m = 2$, the last of the above properties implies that no assumption of global homoscedasticity is made.
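Purely as an illustration of these assumptions (the function f and the noise model below are invented for the example, not taken from the talk), data with an input-dependent noise variance can be simulated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 200, 2
X = rng.uniform(-1, 1, size=(N, n))

def f(x):
    # hypothetical regression function used only for this illustration
    return np.sin(3 * x[0]) + x[1] ** 2

def noise_std(x):
    # heteroscedastic noise: the second moment depends on x, as allowed above
    return 0.05 + 0.2 * abs(x[0])

y = np.array([f(x) + rng.normal(0.0, noise_std(x)) for x in X])
```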
  • 35. Local Weighted Regression
    - The problem of local regression can be stated as the problem of estimating the value that the regression function $f(x) = E[y \mid x]$ assumes for a specific query point $x$, using information pertaining only to a neighborhood of $x$.
    - Given a query point $x_q$, and under the hypothesis of local homoscedasticity of $\varepsilon_i$, the parameter $\beta$ of a local linear approximation of $f(\cdot)$ in a neighborhood of $x_q$ can be obtained by solving the local polynomial regression:
    $\min_\beta \; \sum_{i=1}^{N} \left( y_i - x_i^T \beta \right)^2 K\!\left( \frac{d(x_i, x_q)}{h} \right)$
    where, given a metric on the space $\mathbb{R}^n$,
    - $d(x_i, x_q)$ is the distance from the query point to the $i$-th example, $i = 1, \ldots, N$,
    - $K(\cdot)$ is a weight (aka kernel) function,
    - $h$ is the bandwidth.
  • 36. Local Weighted Regression (II)
    - In matrix notation, the solution of the above stated weighted least squares problem is given by:
    $\hat\beta = (X^T W^T W X)^{-1} X^T W^T W y = (Z^T Z)^{-1} Z^T v = P Z^T v$,
    where $X$ is the matrix whose $i$-th row is $x_i^T$, $y$ is the vector whose $i$-th element is $y_i$, $W$ is the diagonal matrix whose $i$-th diagonal element is $w_{ii} = K(d(x_i, x_q)/h)$, $Z = W X$, $v = W y$, and the matrix $X^T W^T W X = Z^T Z$ is assumed to be non-singular so that its inverse $P = (Z^T Z)^{-1}$ is defined.
    - Once the local linear polynomial approximation is obtained, a prediction of $y_q = f(x_q)$ is finally given by:
    $\hat y_q = x_q^T \hat\beta$.
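A minimal sketch of this weighted least squares solution, assuming a Gaussian kernel and a Euclidean distance (both illustrative choices, as are the function and variable names):

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2)

def locally_weighted_prediction(X, y, x_q, h, kernel=gaussian_kernel):
    """Local linear fit around the query x_q with bandwidth h."""
    N = X.shape[0]
    Xd = np.hstack([X, np.ones((N, 1))])        # design matrix with intercept
    d = np.sqrt(((X - x_q) ** 2).sum(axis=1))   # distances d(x_i, x_q)
    w = kernel(d / h)                           # diagonal of W
    Z = w[:, None] * Xd                         # Z = W X
    v = w * y                                   # v = W y
    P = np.linalg.inv(Z.T @ Z)                  # P = (Z'Z)^{-1}, assumed non-singular
    beta = P @ Z.T @ v                          # beta_hat = P Z' v
    y_hat = np.append(x_q, 1.0) @ beta          # prediction x_q' beta_hat
    return y_hat, beta, P, Z, v
```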
  • 37. Linear leave-one-out
    - By exploiting the linearity of the local approximator, a leave-one-out cross-validation estimate of the mean squared error $E[(f(x_q) - \hat y_q)^2]$ can be obtained without any significant overhead.
    - In fact, using the PRESS statistic [2, 37], it is possible to calculate the error $e_{cv}(j) = y_j - x_j^T \hat\beta_{-j}$ without explicitly identifying the parameters $\hat\beta_{-j}$ from the examples available with the $j$-th removed.
    - The formulation of the PRESS statistic for the case at hand is the following:
    $e_{cv}(j) = y_j - x_j^T \hat\beta_{-j} = \frac{y_j - x_j^T P Z^T v}{1 - z_j^T P z_j} = \frac{y_j - x_j^T \hat\beta}{1 - h_{jj}}$,
    where $z_j^T$ is the $j$-th row of $Z$ (therefore $z_j = w_{jj} x_j$), and $h_{jj}$ is the $j$-th diagonal element of the Hat matrix $H = Z P Z^T = Z (Z^T Z)^{-1} Z^T$.
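Continuing the sketch above (Z, P and beta are the quantities returned by the illustrative locally_weighted_prediction function), the leave-one-out errors follow from the PRESS formula without any refitting:

```python
import numpy as np

def press_loo_errors(X, y, Z, P, beta):
    """PRESS leave-one-out errors e_cv(j) = (y_j - x_j' beta) / (1 - h_jj)."""
    N = X.shape[0]
    Xd = np.hstack([X, np.ones((N, 1))])        # rows x_j' (with intercept)
    h_diag = np.einsum('ij,jk,ik->i', Z, P, Z)  # diagonal of the Hat matrix Z P Z'
    return (y - Xd @ beta) / (1.0 - h_diag)
```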
  • 38. Rectangular weight function
    - In what follows, for the sake of simplicity, we will focus on the linear approximator. An extension to generic polynomial approximators of any degree is straightforward. We will assume also that a metric on the space $\mathbb{R}^n$ is given. All the attention will thus be centered on the problem of bandwidth selection.
    - If the indicator function
    $K\!\left( \frac{d(x_i, x_q)}{h} \right) = \begin{cases} 1 & \text{if } d(x_i, x_q) \le h, \\ 0 & \text{otherwise} \end{cases}$
    is adopted as the weight function $K(\cdot)$, the optimization of the parameter $h$ can be conveniently reduced to the optimization of the number $k$ of neighbors to which a unitary weight is assigned in the local regression evaluation.
    - In other words, we reduce the problem of bandwidth selection to a search in the space of $h(k) = d(x_{(k)}, x_q)$, where $x_{(k)}$ is the $k$-th nearest neighbor of the query point.
  • 39. Recursive local regression
    The main advantage deriving from the adoption of the rectangular weight function is that, simply by updating the parameter $\hat\beta(k)$ of the model identified using the $k$ nearest neighbors, it is straightforward and inexpensive to obtain $\hat\beta(k+1)$. In fact, performing a step of the standard recursive least squares algorithm [4], we have:
    $P(k+1) = P(k) - \dfrac{P(k)\, x(k+1)\, x^T(k+1)\, P(k)}{1 + x^T(k+1)\, P(k)\, x(k+1)}$
    $\gamma(k+1) = P(k+1)\, x(k+1)$
    $e(k+1) = y(k+1) - x^T(k+1)\, \hat\beta(k)$
    $\hat\beta(k+1) = \hat\beta(k) + \gamma(k+1)\, e(k+1)$
    where $P(k) = (Z^T Z)^{-1}$ when $h = h(k)$, and where $x(k+1)$ is the $(k+1)$-th nearest neighbor of the query point.
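One such update step, written as a small NumPy sketch (x_new and y_new denote the (k+1)-th nearest neighbor, with the intercept term already appended to x_new; the naming is illustrative):

```python
import numpy as np

def rls_step(P, beta, x_new, y_new):
    """One recursive least squares update: from (P(k), beta(k)) to (P(k+1), beta(k+1))."""
    Px = P @ x_new
    denom = 1.0 + x_new @ Px
    P_new = P - np.outer(Px, Px) / denom        # P(k+1)
    gamma = P_new @ x_new                       # gain gamma(k+1)
    e = y_new - x_new @ beta                    # a priori error e(k+1)
    beta_new = beta + gamma * e                 # beta(k+1)
    return P_new, beta_new
```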
  • 40. Recursive PRESS computation
    Moreover, once the matrix $P(k)$ is available, the leave-one-out cross-validation errors can be directly calculated without the need of any further model identification:
    $e_{cv}(j; k) = \dfrac{y_j - x_j^T \hat\beta(k)}{1 - x_j^T P(k)\, x_j}, \qquad j = 1, \ldots, k.$
    Let us define, for each value of $k$, the $k \times 1$ vector $e_{cv}(k)$ that contains all the leave-one-out errors associated with the model $\hat\beta(k)$.
  • 41. Model selection
    - For a given query point $x_q$, the recursive algorithm returns a set of predictions $\hat y_q(k) = x_q^T \hat\beta(k)$, together with a set of associated leave-one-out error vectors $e_{cv}(k)$.
    - If the selection paradigm, frequently called winner-takes-all, is adopted, the most natural way to extract a final prediction $\hat y_q$ consists in comparing the predictions obtained for each value of $k$ on the basis of the classical mean squared error criterion:
    $\hat y_q = x_q^T \hat\beta(\hat k) \quad \text{with} \quad \hat k = \arg\min_k \widehat{\text{MSE}}(k) = \arg\min_k \dfrac{\sum_{j} w_j \left( e_{cv}(j; k) \right)^2}{\sum_{j} w_j}.$
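A sketch of the whole winner-takes-all loop, reusing the illustrative rls_step and PRESS expressions above; kmin, kmax, the small ridge term and the Euclidean ranking are assumptions of the example, not part of the original algorithm specification:

```python
import numpy as np

def lazy_winner_takes_all(X, y, x_q, kmin=3, kmax=30):
    """Grow the neighborhood with RLS and score each k by its leave-one-out MSE."""
    d = np.sqrt(((X - x_q) ** 2).sum(axis=1))
    order = np.argsort(d)
    Xd = np.hstack([X, np.ones((X.shape[0], 1))])
    x_aug = np.append(x_q, 1.0)
    p = Xd.shape[1]
    kmin = max(kmin, p + 1)                 # need more neighbors than parameters
    kmax = min(kmax, len(y))

    # initialize on the first kmin neighbors with an ordinary least squares fit
    idx = order[:kmin]
    Zk = Xd[idx]
    P = np.linalg.inv(Zk.T @ Zk + 1e-8 * np.eye(p))   # small ridge for stability
    beta = P @ Zk.T @ y[idx]

    best = (np.inf, None)
    for k in range(kmin, kmax + 1):
        if k > kmin:                        # add the k-th nearest neighbor by RLS
            j = order[k - 1]
            P, beta = rls_step(P, beta, Xd[j], y[j])
        nb = order[:k]
        h_diag = np.einsum('ij,jk,ik->i', Xd[nb], P, Xd[nb])
        e_cv = (y[nb] - Xd[nb] @ beta) / (1.0 - h_diag)   # recursive PRESS errors
        mse = np.mean(e_cv ** 2)
        if mse < best[0]:                   # winner-takes-all on the loo MSE
            best = (mse, x_aug @ beta)
    return best[1]
```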
  • 42. Local Model combination
    - As an alternative to the winner-takes-all paradigm, we explored also the effectiveness of local combinations of estimates [46].
    - The final prediction of the value $y_q$ is obtained as a weighted average of the best $b$ models, where $b$ is a parameter of the algorithm.
    - Suppose the predictions $\hat y_q(k)$ and the error vectors $e_{cv}(k)$ have been ordered, creating a sequence of integers $\{k_i\}$ so that $\widehat{\text{MSE}}(k_i) \le \widehat{\text{MSE}}(k_j)$ for all $i < j$. The prediction of $y_q$ is given by
    $\hat y_q = \dfrac{\sum_{i=1}^{b} \zeta_i\, \hat y_q(k_i)}{\sum_{i=1}^{b} \zeta_i},$
    where the weights are the inverse of the mean squared errors: $\zeta_i = 1 / \widehat{\text{MSE}}(k_i)$. This is an example of the generalized ensemble method [39].
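The winner-takes-all loop sketched above can instead collect the prediction and the leave-one-out MSE of every candidate k and pass them to a small combination step; a minimal variant (b and all names are again illustrative):

```python
import numpy as np

def combine_best_models(predictions, mse_values, b=3):
    """Weighted average of the b candidate models with the lowest leave-one-out MSE."""
    order = np.argsort(mse_values)[:b]          # indices of the b best bandwidths
    zeta = 1.0 / np.asarray(mse_values)[order]  # weights = inverse of the loo MSE
    preds = np.asarray(predictions)[order]
    return np.sum(zeta * preds) / np.sum(zeta)
```

In practice, one would append (x_aug @ beta, mse) for every k in the loop above and call this function on the two lists instead of keeping only the minimum.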
  • 43. From local learning to Lazy Learning (LL)
    - By speeding up the local learning procedure, we can delay the learning procedure to the moment when a prediction at a query point is required (query-by-query learning).
    - The combination approach makes it possible to integrate local models of different order (e.g. constant and linear) and different bandwidths.
    - This method is called lazy since the whole learning procedure (i.e. the parametric and the structural identification) is deferred until a prediction is required.
  • 44. Experimental setup for regression
    Datasets: 23 real and artificial datasets from the ML repository.
    Methods: Lazy Learning, local modeling, feed-forward neural networks, mixtures of experts, neuro-fuzzy models, regression trees (Cubist).
    Experimental methodology: 10-fold cross-validation.
    Results: mean absolute error (Table 7.2), relative error (Table 7.3) and paired t-test (Appendix C) [7].
  • 45. Regression datasets
    Dataset     Examples  Regressors
    Housing     330       8
    Cpu         506       13
    Prices      209       6
    Mpg         159       16
    Servo       392       7
    Ozone       167       8
    Bodyfat     252       13
    Pool        253       3
    Energy      2444      5
    Breast      699       9
    Abalone     4177      10
    Sonar       208       60
    Bupa        345       6
    Iono        351       34
    Pima        768       8
    Kin_8fh     8192      8
    Kin_8nh     8192      8
    Kin_8fm     8192      8
    Kin_8nm     8192      8
    Kin_32fh    8192      32
    Kin_32nh    8192      32
    Kin_32fm    8192      32
    Kin_32nm    8192      32
  • 46. Experimental results: paired comparison
    Each method is statistically compared with all the others (9 x 23 = 207 comparisons per method).
    Method                          Times significantly worse than another method
    LL linear                       74
    LL constant                     96
    LL combination                  23
    Local modeling linear           58
    Local modeling constant         81
    Cubist                          40
    Feed-forward NN                 53
    Mixtures of Experts             80
    Local Model Network (fuzzy)     132
    Local Model Network (k-mean)    145
    The lower, the better.
  • 47. Award in EUFIT competition
    Data analysis competition on regression: awarded as runner-up among 21 participants at the Third International ERUDIT competition on "Protecting rivers and streams by monitoring chemical concentrations and algae communities" [10].
  • 48. Lazy Learning for dynamic tasks
    Multi-step-ahead prediction [12]: long-horizon forecasting based on the iteration of a LL one-step-ahead predictor.
    Nonlinear control [11]:
    1. Lazy Learning inverse/forward control.
    2. Lazy Learning self-tuning control.
    3. Lazy Learning optimal control.
  • 49. Embedding in time series
    Consider a sequence of measurements $\varphi_t \in \mathbb{R}$ of an observable at equal time intervals. We express the present value as a function of the previous $n$ values of the time series itself:
    $\varphi_t = f(\varphi_{t-1}, \varphi_{t-2}, \ldots, \varphi_{t-n})$
    where $f$ is an unknown nonlinear function and the vector $(\varphi_{t-1}, \varphi_{t-2}, \ldots, \varphi_{t-n})$ lies in the $n$-dimensional time delay space or lag space.
    This standard approach is called "state-space reconstruction" in the physics community, "tapped delay line" in the engineering community and Nonlinear AutoRegressive (NAR) in the forecasting community.
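A small sketch of how such a lag matrix can be built, turning the NAR formulation into an ordinary input/output regression dataset (the function name and layout are illustrative):

```python
import numpy as np

def embed(series, n):
    """Return (X, y): each row of X holds the n previous values, y the next value."""
    series = np.asarray(series, dtype=float)
    T = len(series)
    X = np.array([series[t - n:t][::-1] for t in range(n, T)])  # (phi_{t-1}, ..., phi_{t-n})
    y = series[n:]                                              # phi_t
    return X, y
```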
  • 50. [Figure: the same time series shown in three ways: the temporal representation, the embedding representation $\varphi_{t+1} = f(\varphi_t, \varphi_{t-1}, \ldots, \varphi_{t-n+1})$, and the resulting input/output representation used to fit the model.]
  • 51. One-step and multi-step-ahead prediction
    One-step-ahead prediction: the $n$ previous values of the series are assumed to be available for the prediction of the next value. This is equivalent to a problem of supervised learning. LL was used in this way in several prediction tasks: finance, economic variables, environmental modeling [23].
    Multi-step-ahead prediction: we predict the value of the series for the next $H$ steps. We can classify the methods for multiple-step prediction according to two features: the horizon of the predictor and the training criterion.
  • 52. Multi-step-ahead prediction
    One-step-ahead predictor and one-step-ahead training criterion. The model predicts $H$ steps ahead by iterating a one-step-ahead predictor whose parameters are optimized to minimize the training error on the one-step-ahead forecast.
    One-step-ahead predictor and $H$-step-ahead training criterion. The model predicts $H$ steps ahead by iterating a one-step-ahead predictor whose parameters are optimized to minimize the training error on the iterated $H$-step-ahead forecast.
    Direct forecasting. The model makes a direct forecast at time $t + H$:
    $\varphi_{t+H} = f_H(\varphi_t, \varphi_{t-1}, \ldots, \varphi_{t-n+1}).$
  • 53. Iteration of a one-step-ahead predictor
    [Figure: the predictor f is fed with the delayed values $\varphi_{t-1}, \varphi_{t-2}, \ldots, \varphi_{t-n}$ (unit delays $z^{-1}$); its output $\varphi_t$ is fed back into the delay line to produce the next prediction.]
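A sketch of this iterated strategy, where an assumed one-step predictor (here the illustrative lazy_winner_takes_all from the earlier sketch, but any one-step model would do) is applied H times, each prediction being fed back into the lag vector:

```python
import numpy as np

def iterated_forecast(series, n, H, one_step_predict):
    """Predict H steps ahead by iterating a one-step-ahead predictor."""
    X, y = embed(series, n)                 # training pairs from the observed series
    lag = list(np.asarray(series)[-n:][::-1])   # current lag vector (phi_{t-1}, ..., phi_{t-n})
    forecasts = []
    for _ in range(H):
        phi_next = one_step_predict(X, y, np.array(lag))
        forecasts.append(phi_next)
        lag = [phi_next] + lag[:-1]         # feed the prediction back into the delay line
    return np.array(forecasts)

# usage sketch (train_series is a hypothetical observed series):
# horizon = iterated_forecast(train_series, n=16, H=100, one_step_predict=lazy_winner_takes_all)
```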
  • 54. Local Modeling in the time domain
    Consider the embedding $\varphi_{t+1} = f(\varphi_t, \varphi_{t-1}, \ldots, \varphi_{t-5})$ of order $n = 6$.
    [Figure: the series in the time domain, with the most recent window (lags around $t-1, \ldots, t-6$) and a similar past window (around $t-11, \ldots, t-16$) highlighted.]
  • 55. Local Modeling in the I/O space
    Consider the embedding $\varphi_{t+1} = f(\varphi_t)$ of order $n = 1$.
    [Figure: the same data plotted in the input/output space $(\varphi_t, \varphi_{t+1})$ with the query point q. Note the labels of the axes.]
  • 56. Local modeling in the embedding space
    Consider the embedding $\varphi_{t+1} = f(\varphi_t, \varphi_{t-1})$ of order $n = 2$.
    [Figure: the trajectory in the lag space $(\varphi_{t-1}, \varphi_t)$ with the neighbors of the current point highlighted; the local model fitted on them predicts the continuation.]
  • 57. Conventional and iterated leave-one-out
    [Figure: panels (a) and (b) compare, on a five-sample example with sample 3 held out, the conventional leave-one-out error $e_{cv}(3)$ with the iterated leave-one-out error $e_{it}(3)$.]
  • 58. Iterated PRESS in the space
    [Figure: leave-one-out and iterated errors in the $(x, y)$ and $(y, z)$ spaces: the local fits $\beta_{xy,-3}$ and $\beta_{yz,-3}$, the intermediate prediction $\hat y_{-3}$, and the errors $e_{xy}(3)$, $e_{yz}(3)$ and the iterated $e_{xz}(3)$.]
    Here $x$ represents the value of the time series (order $n = 1$) at time $t - 1$, $y$ represents the value of the time series at time $t$, and $z$ represents the value of the time series at time $t + 1$.
  • 59. From conventional to iterated PRESS
    - The PRESS statistic returns leave-one-out errors as a by-product of the local weighted regression.
    - We derived in [12] an analytical iterated formulation of the PRESS statistic for long-horizon assessment.
    - The iterated assessment criterion improves stability and prediction accuracy.
  • 60. The iterated multi-step-ahead algorithm
    1. The time series is embedded as an input/output mapping $f: \mathbb{R}^n \to \mathbb{R}$.
    2. The one-step-ahead predictor is a local estimate of the mapping $f$.
    3. The $H$-step-ahead prediction is performed by iterating the one-step-ahead estimator.
    4. Local structure identification is performed in a space of alternative model configurations, each characterized by a different bandwidth.
    5. Prediction ability is assessed by the iterated formulation of the cross-validation PRESS statistic (the $H$-step-ahead criterion).
  • 61. The Santa Fe time series
    - The iterated PRESS approach has been applied both to the prediction of a real-world data set (A) and to a computer-generated time series (D) from the Santa Fe Time Series Prediction and Analysis Competition.
    - The A time series has a training set of 1000 values and a test set of 10000 samples: the task is to predict the continuation for 100 steps, starting from different points.
    - The D time series has a training set of 100000 values and a test set of 500 samples: the task is to predict the continuation for 25 steps, starting from different points.
• 62. A series: training set. [Plot of the 1000-sample training set of the A series; values range roughly between 0 and 300.] On the use of cross-validation for local modeling in regression and time series prediction – p.55/75
• 63. A series: one-step criterion. [Plot of the 100-step continuation of the A series predicted with the one-step (non-iterated) selection criterion.] On the use of cross-validation for local modeling in regression and time series prediction – p.56/75
• 64. A series: multi-step criterion. [Plot of the 100-step continuation of the A series predicted with the multi-step (iterated) selection criterion.] On the use of cross-validation for local modeling in regression and time series prediction – p.57/75
• 65. Experiments: The Santa Fe Time Series A (order n = 16; training set: 1000 values; test set: 100 steps).
Test data  | Non-iter. PRESS | Iter. PRESS | Sauer | Wan
1-100      | 0.350           | 0.029       | 0.077 | 0.055
1180-1280  | 0.379           | 0.131       | 0.174 | 0.065
2870-2970  | 0.793           | 0.055       | 0.183 | 0.487
3000-3100  | 0.003           | 0.003       | 0.006 | 0.023
4180-4280  | 1.134           | 0.051       | 0.111 | 0.160
Sauer: combination of iterated and direct local models. Wan: recurrent network. On the use of cross-validation for local modeling in regression and time series prediction – p.58/75
• 66. The Santa Fe Time Series D (training set: 100,000 values; test set: 25-step continuations).
Test data | Non-iter. PRESS | Iter. PRESS | Zhang-Hutchinson
0-24      | 0.1255          | 0.0492      | 0.0665
100-124   | 0.0460          | 0.0363      | 0.0616
200-224   | 0.2635          | 0.1692      | 0.1475
300-324   | 0.0461          | 0.0405      | 0.0541
400-424   | 0.1610          | 0.0644      | 0.0720
Zhang: combination of iterated and direct multilayer perceptrons. On the use of cross-validation for local modeling in regression and time series prediction – p.59/75
• 67. Award in the Leuven Competition. Training set made of 2000 points. Task: predict the continuation for the next 200 points. [Plot of the predicted 200-point continuation; values between −0.5 and 0.5.] Iterated Lazy Learning ranked second and fourth [8]. On the use of cross-validation for local modeling in regression and time series prediction – p.60/75
  • 68. Lazy Learning for iterated prediction Multi-step ahead by iteration of a one-step predictor. Lazy learning to implement the one-step predictor. Selection of the local structure by an iterated PRESS. Iterated criterion avoids the accumulation of prediction errors and improves the performance. On the use of cross-validation for local modeling in regression and time series prediction – p.61/75
• 69. Complexity in global and local modeling. Consider N training samples, n features and Q query points.
                                      | GLOBAL       | LAZY
Parametric identification             | C(NLS)       | C(Nn) + C(LS)
Structural identification (K-fold CV) | K · C(NLS)   | small
Prediction for Q queries              | negligible   | Q · [C(Nn) + C(LS)]
TOTAL                                 | K · C(NLS)   | Q · [C(Nn) + C(LS)]
where C(NLS) stands for the cost of a nonlinear least-squares fit and C(LS) for the cost of a linear least-squares fit. On the use of cross-validation for local modeling in regression and time series prediction – p.62/75
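As a purely illustrative back-of-envelope rendering of the table (arbitrary cost units, not measurements; all constants are assumptions):

```python
# Crude cost model: global = K nonlinear fits up front, cheap queries;
# lazy = nothing up front, one neighbour search + local least squares per query.
def global_cost(K, c_nls):
    return K * c_nls

def lazy_cost(Q, N, n, c_ls):
    return Q * (N * n + c_ls)

c_nls, c_ls, N, n, K = 1e7, 1e3, 1000, 16, 10
for Q in (1, 100, 10000):
    print(Q, global_cost(K, c_nls), lazy_cost(Q, N, n, c_ls))
# The lazy approach wins when few queries are needed and loses as Q grows.
```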
• 70. Feature selection and LL. Local modeling techniques are known to be weak in large-dimensional spaces. A way to defy the curse of dimensionality is dimensionality reduction (a.k.a. feature selection). It requires the assessment of an exponential number of alternatives (2^n subsets of the input variables) and the choice of the best one. Several techniques exist: we focus here on wrappers. Wrappers rely on expensive cross-validation (e.g. leave-one-out assessment). Our idea: combine racing [34] and sub-sampling [29] to accelerate the wrapper feature-selection procedure in LL. On the use of cross-validation for local modeling in regression and time series prediction – p.63/75
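For reference, the plain wrapper (the expensive baseline that racing and sub-sampling are meant to accelerate) might look like this greedy forward-selection sketch built around a leave-one-out assessment of a k-NN lazy learner; the helper names and the choice k = 5 are illustrative.

```python
import numpy as np

def loo_knn_mae(X, y, k=5):
    """Leave-one-out mean absolute error of a k-NN (lazy) regressor."""
    errs = []
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                              # exclude the held-out point
        nn = np.argsort(d)[:k]
        errs.append(abs(y[i] - y[nn].mean()))
    return float(np.mean(errs))

def forward_selection(X, y, max_features=5):
    """Greedy wrapper: add, one at a time, the feature that most reduces
    the leave-one-out error of the local model."""
    selected, remaining, best_err = [], list(range(X.shape[1])), np.inf
    while remaining and len(selected) < max_features:
        scores = {j: loo_knn_mae(X[:, selected + [j]], y) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_err:
            break                                  # no further improvement
        selected.append(j_best)
        remaining.remove(j_best)
        best_err = scores[j_best]
    return selected, best_err
```

Each call to `loo_knn_mae` requires O(N²) distance computations per candidate subset, which is exactly the expense attacked by the racing and sub-sampling ideas of the following slides.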
• 72. Racing for feature selection. Suppose we have several candidate sets of input variables. The computational cost of making a selection results from the cost of identification and the cost of validation. The validation cost required by a global model is independent of Q, while this is not the case for LL. The idea of racing techniques consists in using blocking and paired multiple tests to compare different models in similar conditions and to discard the worst ones as soon as possible. Racing reduces the number of tests to be made. This makes the wrapper LL approach more competitive. On the use of cross-validation for local modeling in regression and time series prediction – p.65/75
• 75. Sub-sampling and LL. The goal of model selection is to find the best hypothesis in a set of alternatives. What is relevant is the ordering of the alternatives, e.g. M2 ≻ M3 ≻ M5 ≻ M1 ≻ M4. By reducing the training set size N, we expect to reduce the accuracy of each single model but not necessarily to alter their ordering. In LL, reducing the training set size reduces the cost. The idea of sub-sampling is to reduce the size of the training set without altering the ranking of the different models. This makes the LL approach more competitive. On the use of cross-validation for local modeling in regression and time series prediction – p.68/75
• 76. RACSAM for feature selection. We proposed the following algorithm [14]: 1. Define an initial group of promising feature subsets. 2. Start with small training and test sets. 3. Discard by racing all the feature subsets that appear significantly worse than the others. 4. Increase the training and test size until at most a fixed number of winning models remain. 5. Update the group with new candidates to be assessed and go back to 3. On the use of cross-validation for local modeling in regression and time series prediction – p.69/75
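A schematic Python rendering of the RACSAM loop under simplifying assumptions: a k-NN lazy learner as the wrapped model, a common random subsample for all candidates (blocking), and a crude "two standard errors of the paired differences" elimination rule in place of the paired multiple tests of [14]. The default sizes, growth factor and thresholds are arbitrary.

```python
import numpy as np

def holdout_errors(X, y, features, n_train, n_test, k=5, seed=0):
    """Per-point absolute errors of a k-NN lazy learner on a random subsample;
    the same split (same seed) is reused for every candidate, so comparisons
    between candidates are blocked/paired."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    tr, te = idx[:n_train], idx[n_train:n_train + n_test]
    Xf = X[:, list(features)]
    errs = []
    for i in te:
        d = np.linalg.norm(Xf[tr] - Xf[i], axis=1)
        nn = tr[np.argsort(d)[:k]]
        errs.append(abs(y[i] - y[nn].mean()))
    return np.array(errs)

def racsam(X, y, candidates, n0=50, n_test=50, grow=2.0, max_winners=2):
    """Racing + sub-sampling sketch: evaluate all candidate feature subsets on
    a small common subsample, drop those clearly worse than the current best,
    then enlarge the subsample and repeat until few winners remain."""
    alive, n_train = list(candidates), n0
    while len(alive) > max_winners and n_train + n_test <= len(y):
        errs = {tuple(f): holdout_errors(X, y, f, n_train, n_test) for f in alive}
        best = min(errs, key=lambda f: errs[f].mean())
        keep = []
        for f in alive:
            d = errs[tuple(f)] - errs[best]          # paired differences
            se = d.std(ddof=1) / np.sqrt(len(d)) + 1e-12
            if d.mean() <= 2 * se:                   # not clearly worse: survives
                keep.append(f)
        alive, n_train = keep, int(n_train * grow)
    return alive
```

For instance, `racsam(X, y, candidates=[(0,), (0, 1), (2, 3)])` returns the subsets still alive when the race stops; in a real run the candidate pool, the elimination test and the growth schedule would all be tuned.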
• 77. Experimental session. We compare the prediction accuracy of the LL algorithm enhanced by the RACSAM procedure to the accuracy of two state-of-the-art algorithms: a SVM for regression and a regression tree (RTREE). Two versions of the RACSAM algorithm were tested: the first (LL-RAC1) takes as feature set the best one (in terms of estimated mean absolute error, MAE) among the winning candidates; the second (LL-RAC2) averages the predictions of the winning LL predictors. The p-value of the racing tests is set to 0.01. On the use of cross-validation for local modeling in regression and time series prediction – p.70/75
• 78. Experimental results. Five-fold cross-validation on six real datasets of high dimensionality: Ailerons, Pole, Elevators, Triazines, Wisconsin and Census.
Dataset | AIL    | POL  | ELE    | TRI  | WIS   | CEN
LL-RAC1 | 9.7e-5 | 3.12 | 1.6e-3 | 0.21 | 27.39 | 0.17
LL-RAC2 | 9.0e-5 | 3.13 | 1.5e-3 | 0.12 | 27.41 | 0.16
SVM     | 1.3e-4 | 26.5 | 1.9e-3 | 0.11 | 29.91 | 0.21
RTREE   | 1.8e-4 | 8.80 | 3.1e-3 | 0.11 | 33.02 | 0.17
On the use of cross-validation for local modeling in regression and time series prediction – p.71/75
• 79. Applications. Financial prediction of stock markets: in collaboration with Masterfood, Belgium. Prediction of yearly sales: in collaboration with Dieteren, Belgium, the leading Belgian car dealer. Nonlinear control and identification tasks in power systems: in collaboration with Università del Sannio (I) [44, 18]. Modeling of industrial processes: in collaboration with the FaFer Usinor steel company (B) and the Honeywell Technology Center (US). Performance modelling of embedded systems: during my stay at Philips Research, Eindhoven (NL) [16]. Quality of service: during my stay at IMEC, Leuven (B) [17]. Black-box simulators: in collaboration with CENAERO, Gosselies (B) [15]. Environmental predictions: in collaboration with Politecnico di Milano (I) [23]. On the use of cross-validation for local modeling in regression and time series prediction – p.72/75
  • 80. Software   MATLAB toolbox on Lazy Learning [5].  R contributed package lazy.   Joint work with Dr. Mauro Birattari (IRIDIA).   Web page: http://iridia.ulb.ac.be/~lazy.   About 5000 accesses since October 2002. On the use of cross-validation for local modeling in regression and time series prediction – p.73/75
  • 81. The importance of being Lazy   Fast data-driven design.  No global assumption on the noise.   Linear methods still effective in a multivariate non-linear setting (LWR, PRESS).   An estimate of the variance is returned with each prediction.   Intrinsically adaptive. On the use of cross-validation for local modeling in regression and time series prediction – p.74/75
  • 82. Future work   Extension of the LL method to other local selection criteria (VC dimension, GCV).   Classification applications.   Integration with powerful software and hardware devices.   From large to huge databases.   New applications: bioinformatics, text mining, medical data, sensor networks, power systems. On the use of cross-validation for local modeling in regression and time series prediction – p.75/75
• 83. References [1] D. W. Aha. Editorial of special issue on lazy learning. Artificial Intelligence Review, 11(1–5):1–6, 1997. [2] D. M. Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16:125–127, 1974. [3] C. G. Atkeson, A. W. Moore, and S. Schaal. Locally weighted learning. Artificial Intelligence Review, 11(1–5):11–73, 1997. [4] G. J. Bierman. Factorization Methods for Discrete Sequential Estimation. Academic Press, New York, NY, 1977. [5] M. Birattari and G. Bontempi. The lazy learning toolbox, for use with MATLAB. Technical Report TR/IRIDIA/99-7, IRIDIA-ULB, Brussels, Belgium, 1999. [6] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive least-squares algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, NIPS 11, pages 375–381, Cambridge, 1999. MIT Press.
• 84. [7] G. Bontempi. Local Learning Techniques for Modeling, Prediction and Control. PhD thesis, IRIDIA - Université Libre de Bruxelles, 1999. [8] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for iterated time series prediction. In J. A. K. Suykens and J. Vandewalle, editors, Proceedings of the International Workshop on Advanced Black-Box Techniques for Nonlinear Modeling, pages 62–68. Katholieke Universiteit Leuven, Belgium, 1998. [9] G. Bontempi, M. Birattari, and H. Bersini. Recursive lazy learning for modeling and control. In Machine Learning: ECML-98 (10th European Conference on Machine Learning), pages 292–303. Springer, 1998. [10] G. Bontempi, M. Birattari, and H. Bersini. Lazy learners at work: the lazy learning toolbox. In Proceedings of the 7th European Congress on Intelligent Techniques and Soft Computing EUFIT '99, 1999. [11] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for modeling and control design. International Journal of Control, 72(7/8):643–658, 1999.
• 85. [12] G. Bontempi, M. Birattari, and H. Bersini. Local learning for iterated time-series prediction. In I. Bratko and S. Dzeroski, editors, Machine Learning: Proceedings of the Sixteenth International Conference, pages 32–38, San Francisco, CA, 1999. Morgan Kaufmann Publishers. [13] G. Bontempi, M. Birattari, and H. Bersini. A model selection approach for local learning. Artificial Intelligence Communications, 121(1), 2000. [14] G. Bontempi, M. Birattari, and P. E. Meyer. Combining lazy learning, racing and subsampling for effective feature selection. In Proceedings of the International Conference on Adaptive and Natural Computing Algorithms. Springer Verlag, 2005. To appear. [15] G. Bontempi, O. Caelen, S. Pierret, and C. Goffaux. On the use of supervised learning techniques to speed up the design of aeronautics components. WSEAS Transactions on Systems, 10(3):3098–3103, 2005. [16] G. Bontempi and W. Kruijtzer. The use of intelligent data analysis techniques for system-level design: a software estimation example. Soft Computing, 8(7):477–490, 2004.
• 86. [17] G. Bontempi and G. Lafruit. Enabling multimedia QoS control with black-box modeling. In D. Bustard, W. Liu, and R. Sterritt, editors, Soft-Ware 2002: Computing in an Imperfect World, Lecture Notes in Computer Science, pages 46–59, 2002. [18] G. Bontempi, A. Vaccaro, and D. Villacci. A semi-physical modelling architecture for dynamic assessment of power components loading capability. IEE Proceedings of Generation, Transmission and Distribution, 151(4):533–542, 2004. [19] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, CA, 1984. [20] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74:829–836, 1979. [21] W. S. Cleveland and S. J. Devlin. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83:596–610, 1988.
• 87. [22] W. S. Cleveland and C. Loader. Smoothing by local regression: Principles and methods. Computational Statistics, 11, 1995. [23] G. Corani. Air quality prediction in Milan: feed-forward neural networks, pruned neural networks and lazy learning. Ecological Modelling, 2005. In press. [24] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, pages 21–27, 1967. [25] J. Fan and I. Gijbels. Adaptive order polynomial fitting: bandwidth robustification and bias reduction. Journal of Computational and Graphical Statistics, 4:213–227, 1995. [26] J. Fan and I. Gijbels. Local Polynomial Modelling and Its Applications. Chapman and Hall, 1996. [27] W. Hardle and J. S. Marron. Fast and simple scatterplot smoothing. Computational Statistics and Data Analysis, 20:1–17, 1995. [28] R. Henderson. Note on graduation by adjusted average. Transactions of the Actuarial Society of America, 17:43–48, 1916. [29] G. H. John and P. Langley. Static versus dynamic sampling for data mining. In Proceedings of the Second International Conference on Knowledge Discovery in Databases and Data Mining. AAAI/MIT Press, 1996.
  • 88. ference on Knowledge Discovery in Databases and Data Mining. AAAI/MIT Press, 1996. [30] M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection for density estimation. Journal of American Statistical Association, 90, 1995. [31] V. Y. Katkovnik. Linear and nonlinear methods of nonparametric regression analysis. Soviet Automatic Control, 5:25–34, 1979. [32] C. Loader. Local Regression and Likelihood. Springer, New York, 1999. [33] C. R. Loader. Old faithful erupts: Bandwidth selection reviewed. Technical report, Bell-Labs, 1987. [34] O. Maron and A. Moore. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11(1–5):193–225, 1997. [35] T. M. Mitchell. Machine Learning. McGraw Hill, 1997. [36] R. Murray-Smith and T. A. Johansen. Local learning in local model networks. In R. Murray-Smith and T. A. Johansen, editors, 75-1
• 89. [37] R. H. Myers. Classical and Modern Regression with Applications. PWS-KENT Publishing Company, Boston, MA, second edition, 1994. [38] B. U. Park and J. S. Marron. Comparison of data-driven bandwidth selectors. Journal of the American Statistical Association, 85:66–72, 1990. [39] M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for hybrid neural networks. In R. J. Mammone, editor, Artificial Neural Networks for Speech and Vision, pages 126–142. Chapman and Hall, 1993. [40] J. Rice. Bandwidth choice for nonparametric regression. The Annals of Statistics, 12:1215–1230, 1984. [41] D. Ruppert, S. J. Sheather, and M. P. Wand. An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90:1257–1270, 1995.
• 90. [42] G. V. Schiaparelli. Sul modo di ricavare la vera espressione delle leggi della natura dalle curve empiriche. Effemeridi Astronomiche di Milano per l'anno, 857:3–56, 1886. [43] C. Stone. Consistent nonparametric regression. The Annals of Statistics, 5:595–645, 1977. [44] D. Villacci, G. Bontempi, A. Vaccaro, and M. Birattari. The role of learning methods in the dynamic assessment of power components loading capability. IEEE Transactions on Industrial Electronics, 52(1), 2005. [45] G. Wahba and S. Wold. A completely automatic French curve: Fitting spline functions by cross-validation. Communications in Statistics, 4(1), 1975. [46] D. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992. [47] M. Woodroofe. On choosing a delta-sequence. Annals of Mathematical Statistics, 41:1665–1671, 1970.