Statistics and Its Interface Volume 2 (2009) 361–368
Boosting on the functional ANOVA
decomposition
Yongdai Kim∗, Yuwon Kim, Jinseog Kim, Sangin Lee, and Sunghoon Kwon
∗ Corresponding author.
A boosting algorithm on the functional ANOVA decomposition, called ANOVA boosting, is proposed. The main
idea of ANOVA boosting is to estimate each component in
the functional ANOVA decomposition by combining many
base (weak) learners. A regularization procedure based on
the L1 penalty is proposed to give a componentwise sparse
solution and an efficient computing algorithm is developed.
Simulated as well as benchmark data sets are analyzed to
compare ANOVA boosting and standard boosting. ANOVA
boosting improves prediction accuracy as well as interpretability by estimating the components directly and providing componentwisely sparser models.
Keywords and phrases: Functional ANOVA decomposition, Boosting, Variable selection.
1. INTRODUCTION
Given an output y ∈ Y and its corresponding input
x = (x(1) , . . . , x(p) ) ∈ X , suppose we are interested in a
functional relationship f : X → Y between x and y. When
the dimension of the input (i.e. p) is high, one has a lot
of difficulties in estimating and interpreting f . One of the
most important learning methods for high dimensional data
is a boosting method, which constructs a strong learner by
combining many base (weak) learners. The boosting method
has shown great success in statistics and machine learning
for its significant improvement in prediction accuracy. Since [1] introduced the first boosting algorithm, AdaBoost, various extensions have been proposed by [2] and
[3].
In this paper, we develop a way of using a boosting algorithm to estimate the components in the functional ANOVA
decomposition, which is given as
(1)    f(x) = f_1(x) + f_2(x) + · · · + f_K(x)
where the components fk (x) depend only on low dimensional elements of an input vector x. The main idea of the
proposed boosting algorithm is to estimate each component
f_k, k = 1, . . . , K, by combining base learners. We call the
proposed boosting method “ANOVA boosting.” First, we
propose sets of base learners for the components in the functional ANOVA decomposition to make the model identifiable. In particular, we use stumps (decision trees with only
two terminal nodes) as base learners for main effect terms
and their tensor products as base learners for interaction
effect terms. Second, we develop a regularization procedure
which gives a componentwisely sparse solution. Finally, we
implement an efficient computational algorithm.
An advantage of ANOVA boosting over standard boosting methods is that ANOVA boosting can estimate and identify important components and their influence on the output simultaneously. In contrast, as [3] explained, standard
boosting methods estimate only the highest order interaction components, and so estimating lower order components
requires additional post-processing procedures. See, also, [4].
This advantage of ANOVA boosting makes it possible to select (or delete) relevant (or irrelevant) input variables. When
the dimension of input is high, the final estimated model of
a standard boosting method includes many noisy components and we need to identify which components are real
signals and which are noise. Since ANOVA boosting can
estimate each component simultaneously, we can easily develop a method which can identify signal and noisy components in the estimated model. For this purpose, we develop
a componentwise sparse regularization procedure called the
componentwisely adaptive L1 penalty, which is motivated by
the adaptive lasso by [5].
There are several modified boosting algorithms which
give sparser solutions than standard boosting. [6] developed a similar boosting algorithm for the generalized additive model, and [7] proposed a boosting method called
sparse boosting, which yields a sparser solution than standard boosting. ANOVA boosting can estimate higher order interaction terms, while the algorithm of [6] can estimate only main effect terms. Also, ANOVA boosting gives a componentwisely sparser solution, in contrast to the sparse boosting of [7], which only gives a sparser solution in terms of base learners. That is, important components can be selected by ANOVA boosting but not by sparse boosting.
ANOVA boosting has several advantages over the kernel
based method for the functional ANOVA decomposition. [8]
used the kernel machine for the functional ANOVA decomposition to improve the interpretability, and their idea has
been studied and extended by [9], [10] and [11]. However,
the kernel machine has a problem with categorical inputs
since the Gram matrix can be singular and so the algorithm
fails to converge. Also, when the dimension of the input is
high, the computational cost for inverting the Gram matrix
is expensive. In contrast, categorical inputs can be processed
easily and computation is simpler since no matrix inversion
is required in ANOVA boosting.
The paper is organized as follows. Section 2 presents the ingredients of ANOVA boosting: the model, the choice of base learners, and the regularization procedure. In Section 3, a computational algorithm is presented. Simulated as
well as real datasets are analyzed in Section 4. Concluding
remarks follow in Section 5.
2. ANOVA BOOSTING

2.1 Model

Let (x_1, y_1), . . . , (x_n, y_n) be n input-output pairs of a training dataset, where x_i ∈ X ⊂ R^p and y_i ∈ Y, which are assumed to be a random sample from a probability measure P on X × Y. Let x_i = (x_i^{(1)}, . . . , x_i^{(p)}), where x_i^{(j)} ∈ X_j ⊂ R and X = X_1 × · · · × X_p. Let F be a given set of functions on R^p and let l : Y × R → R be a loss function. The objective of statistical learning is to find a function f* ∈ F which minimizes E_P(l(Y, f(X))) among f ∈ F.

The functional ANOVA decomposition of f is

    f(x) = \beta_0 + \sum_{j=1}^{p} f_j(x^{(j)}) + \sum_{j<k} f_{jk}(x^{(j)}, x^{(k)}) + \cdots

where β_0 is a constant, the f_j are the main effect components, the f_{jk} are the second order interaction components, and so on. For simplicity, we consider the model truncated at the second order interaction components. That is, F consists of functions having the form

    f(x) = \beta_0 + \sum_{j=1}^{p} f_j(x^{(j)}) + \sum_{j<k} f_{jk}(x^{(j)}, x^{(k)}).

Given predefined probability measures μ_j on X_j, let F_j be the set of functions f_j in L_2(μ_j) satisfying

(2)    \int_{X_j} f_j(x^{(j)}) \, \mu_j(dx^{(j)}) = 0   for f_j ∈ F_j,

and let F_{jk} be the set of functions f_{jk} in L_2(μ_j × μ_k) satisfying

(3)    \int_{X_j} f_{jk}(x^{(j)}, x^{(k)}) \, \mu_j(dx^{(j)}) = 0,   \int_{X_k} f_{jk}(x^{(j)}, x^{(k)}) \, \mu_k(dx^{(k)}) = 0.

Then, we can write

    F = \{1\} \oplus \Big[ \bigoplus_{j=1}^{p} F_j \Big] \oplus \Big[ \bigoplus_{j<k} F_{jk} \Big]

where all subspaces {1}, F_j, F_{jk}, j = 1, . . . , p, j < k, are orthogonal in L_2(μ) with μ = \prod_{j=1}^{p} μ_j, and hence all components are identifiable.

2.2 Choice of base learners

The basic idea of ANOVA boosting is to estimate each component (i.e. the f_j and the f_{jk}) by a linear combination of base learners. For this, we have to choose sets of base learners G_j and G_{jk} for the components f_j and f_{jk}, respectively.

For G_j, we use the set of decision trees with only two terminal nodes split by the variable x^{(j)}. For the side condition, we enforce

(4)    \int_{X_j} g_j(x^{(j)}) \, \mu_j(dx^{(j)}) = 0

for g_j ∈ G_j, and hence the resulting f_j satisfies (2). For a continuous input variable, let g_j(x^{(j)}) = θ_L I(x^{(j)} ≤ s) + θ_R I(x^{(j)} > s). To satisfy the side condition (4), we should have

(5)    μ_j(x^{(j)} ≤ s) θ_L + μ_j(x^{(j)} > s) θ_R = 0.

That is, we can choose the split value s freely, but the predictive values θ_L and θ_R should be selected to satisfy (5). Categorical inputs can be treated similarly.

For G_{jk}, we use the tensor products of the base learners in G_j and G_k. That is, we let G_{jk} = G_j ⊗ G_k, so that for any g_{jk} ∈ G_{jk} there exist g_j ∈ G_j and g_k ∈ G_k such that g_{jk}(x^{(j)}, x^{(k)}) = g_j(x^{(j)}) g_k(x^{(k)}). With G_{jk}, the resulting f_{jk} automatically satisfies the identifiability condition (3). Note that g_{jk} has the form

    g_{jk}(x^{(j)}, x^{(k)}) = θ_{LL} I(x^{(j)} ≤ s_j, x^{(k)} ≤ s_k) + θ_{LR} I(x^{(j)} ≤ s_j, x^{(k)} > s_k)
                             + θ_{RL} I(x^{(j)} > s_j, x^{(k)} ≤ s_k) + θ_{RR} I(x^{(j)} > s_j, x^{(k)} > s_k)

with the identifiability conditions

(6)    μ_j(x^{(j)} ≤ s_j) θ_{LL} + μ_j(x^{(j)} > s_j) θ_{RL} = 0,
       μ_j(x^{(j)} ≤ s_j) θ_{LR} + μ_j(x^{(j)} > s_j) θ_{RR} = 0,
       μ_k(x^{(k)} ≤ s_k) θ_{LL} + μ_k(x^{(k)} > s_k) θ_{LR} = 0,
       μ_k(x^{(k)} ≤ s_k) θ_{RL} + μ_k(x^{(k)} > s_k) θ_{RR} = 0.

It is easy to see that any one of θ_{LL}, θ_{LR}, θ_{RL} and θ_{RR} uniquely determines the other three values. In this view, we may say that the degree of freedom of g_{jk} is the same as that of g_j and g_k.

For the choice of μ_j, the most natural one is P_j, the marginal probability measure of x^{(j)}, which is unknown. We estimate P_j(x^{(j)} ≤ s) by its empirical counterpart \sum_{i=1}^{n} I(x_i^{(j)} ≤ s)/n.
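To make the side condition concrete, the following sketch (Python with NumPy; the function name centered_stump and its interface are our own illustration, not the authors' code) builds a main-effect stump for a continuous input: the split value s and the left value θ_L can be chosen freely, and θ_R is then implied by (5) with μ_j replaced by its empirical counterpart.

    import numpy as np

    def centered_stump(xj, s, theta_L=1.0):
        """Main-effect base learner g_j(x) = theta_L*I(x <= s) + theta_R*I(x > s)
        whose empirical mean is zero, i.e. which satisfies the side condition (4)-(5)."""
        xj = np.asarray(xj, dtype=float)
        p_left = np.mean(xj <= s)                 # empirical mu_j(x^(j) <= s)
        p_right = 1.0 - p_left                    # empirical mu_j(x^(j) > s)
        if p_left == 0.0 or p_right == 0.0:
            raise ValueError("the split s leaves one terminal node empty")
        theta_R = -theta_L * p_left / p_right     # solves (5)

        def g(x):
            values = np.where(np.asarray(x, dtype=float) <= s, theta_L, theta_R)
            return values / max(abs(theta_L), abs(theta_R))   # keep sup |g| <= 1
        return g

    # An interaction base learner in G_jk is the tensor product of two such stumps,
    # g_jk(u, v) = g_j(u) * g_k(v), which automatically satisfies (3) and (6).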
2.3 Regularization

In ANOVA boosting, the final model has the form

(7)    f(x) = \beta_0 + f_0(x)

where

(8)    f_0(x) = \sum_{j=1}^{p} \sum_{g \in G_j} \beta_g \, g(x^{(j)}) + \sum_{j<k} \sum_{g \in G_{jk}} \beta_g \, g(x^{(j)}, x^{(k)}).

First, we need to control the norm of the base learners to make the β_g estimable. For this, we require

    \sup_{x^{(j)} \in X_j} |g(x^{(j)})| \le 1   for all g ∈ G_j,

and

    \sup_{(x^{(j)}, x^{(k)}) \in X_j \times X_k} |g(x^{(j)}, x^{(k)})| \le 1   for all g ∈ G_{jk},

for all j, k.

Second, we need a regularization procedure for the β_g to avoid overfitting and to ensure componentwise sparsity. For this, we propose the componentwisely adaptive L1 constraint, defined as follows. Let β_g^{(0)} be the initial estimates obtained by a standard boosting method, and let

(9)    w_j = \Big( \sum_{g \in G_j} |\beta_g^{(0)}| \Big)^{\gamma}   and   w_{jk} = \Big( \sum_{g \in G_{jk}} |\beta_g^{(0)}| \Big)^{\gamma}

for some γ ≥ 0. Then, the componentwisely adaptive L1 constraint is defined by

(10)    \sum_{j=1}^{p} \sum_{g \in G_j} \frac{|\beta_g|}{w_j} + \sum_{j<k} \sum_{g \in G_{jk}} \frac{|\beta_g|}{w_{jk}} \le \lambda

where γ and λ are regularization parameters which can be selected by using test samples or cross-validation. The proposed constraint (10) is motivated by the adaptive lasso of [5]. Finally, we propose to estimate the β_g by minimizing the empirical risk C_n(\beta_0, f_0) = \sum_{i=1}^{n} l(y_i, f(x_i)) subject to the constraint (10).

Remark. It would be possible to use different regularization parameters for the components. That is, we let \sum_{g \in G_j} |\beta_g| / w_j \le \lambda_j and \sum_{g \in G_{jk}} |\beta_g| / w_{jk} \le \lambda_{jk}. This is useful when we have prior information about the importance of the components. For example, to incorporate the prior information that the main effect components are more important than the higher order interaction components, we let λ_{jk} ≤ λ_j. The algorithm developed in the next section can be modified easily for this case.
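As an illustration of (9) and (10), the sketch below (hypothetical helper names, not the authors' implementation) computes the componentwise weights from an initial boosting fit and evaluates the left-hand side of the constraint, which λ then caps.

    import numpy as np

    def componentwise_weights(beta0_by_component, gamma):
        """Weights (9): w_C = (sum of |beta_g^(0)| over g in G_C)^gamma."""
        return {c: np.sum(np.abs(b)) ** gamma
                for c, b in beta0_by_component.items()}

    def constraint_lhs(beta_by_component, weights):
        """Left-hand side of (10): sum over components of sum_g |beta_g| / w_C."""
        return sum(np.sum(np.abs(b)) / weights[c]
                   for c, b in beta_by_component.items() if weights[c] > 0)

    # Components with a small initial estimate receive a small weight w_C and hence
    # a heavier penalty, which pushes the whole component out of the model.
    init = {"x1": np.array([0.8, 0.3]), ("x1", "x2"): np.array([0.02])}
    w = componentwise_weights(init, gamma=1.0)
    print(constraint_lhs(init, w))   # the constraint (10) requires this to be <= lambda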
3. COMPUTATIONAL ALGORITHM

Given g in G = (\bigcup_j G_j) \cup (\bigcup_{j<k} G_{jk}), let h_g(x) = λ w_g g(x), where w_g = w_j if g ∈ G_j and w_g = w_{jk} if g ∈ G_{jk}. Then, we can rewrite (8) as

(11)    f_0(x) = \sum_{j=1}^{p} \sum_{g \in G_j} \theta_g h_g(x^{(j)}) + \sum_{j<k} \sum_{g \in G_{jk}} \theta_g h_g(x^{(j)}, x^{(k)})

and the constraint (10) becomes \sum_{g \in G} |\theta_g| \le 1, where θ_g = β_g/(λ w_g). Hence, for fixed β_0, we can use the MarginBoost.L1 algorithm of [12]. However, there is room to improve the MarginBoost.L1 algorithm: the final estimated model from the algorithm may be less sparse than it should be. This is because the MarginBoost.L1 algorithm keeps adding base learners to update the model, so when unnecessary base learners are added in the early stage of the iteration, they are never deleted from the estimated model. This may not be a serious problem for prediction accuracy, but it largely affects the sparsity of the estimated model. To resolve this problem, we employ a deletion step after each iteration. In the deletion step, some base learners in the model are deleted. By doing so, we improve the convergence speed and ensure the sparsity of the final estimated model.

The idea of the deletion step is as follows. After m iterations, there are at most m base learners whose coefficients are not zero. We then move the non-zero coefficients in their gradient direction until either a non-zero coefficient becomes zero or the optimization criterion is satisfied. To explain in more detail, given a current estimated model f_0, let G^+ = {g ∈ G : θ_g > 0}; that is, f_0(x) = \sum_{g \in G^+} θ_g h_g(x). Since the set of base learners is negation closed (i.e. if g ∈ G, then −g ∈ G), we assume that all the non-zero coefficients θ_g are positive and \sum_{g \in G^+} θ_g ≤ 1. Let ∇_g = ∂C_n(β_0, f_0)/∂θ_g for g ∈ G^+, and let ∇*_g = ∇_g − \sum_{g' \in G^+} ∇_{g'} / #G^+, where #G^+ is the cardinality of G^+. Consider new coefficients θ_g(v) = θ_g − v∇*_g for some v ≥ 0. Since \sum_{g \in G^+} ∇*_g = 0, we have θ_g(v) ≥ 0 for all g ∈ G^+ and \sum_{g \in G^+} θ_g(v) ≤ 1 as long as 0 ≤ v ≤ η, where η = min{θ_g/∇*_g : ∇*_g > 0}. We update θ_g by θ_g(v̂), where v̂ = argmin_{v ∈ [0, η]} C_n(β_0, f_0^v) and f_0^v(x) = \sum_{g \in G^+} θ_g(v) h_g(x). When v̂ = η, at least one of the θ_g, g ∈ G^+, becomes 0 and hence the corresponding base learner is deleted from the estimated model. Note that the deletion step always reduces the empirical risk, and hence the algorithm converges to the global optimum, as the MarginBoost.L1 algorithm does, under regularity conditions.
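A minimal sketch of one deletion step under the assumptions above (all active coefficients positive, base-learner values precomputed; the interface and the crude grid line search are our own illustration, not the authors' code):

    import numpy as np

    def deletion_step(theta, H, y, beta0, loss, grad_loss):
        """One deletion step on the active coefficients theta (all > 0, sum <= 1).
        H is the n x m matrix whose columns hold h_g(x_i) for the active learners;
        loss(y, f) is the vectorized loss and grad_loss(y, f) its derivative in f."""
        f = beta0 + H @ theta
        z = grad_loss(y, f)                       # z_i = dl(y_i, a)/da at a = f(x_i)
        grad = H.T @ z                            # nabla_g
        grad_c = grad - grad.mean()               # centered nabla*_g, sums to zero
        positive = grad_c > 0
        if not np.any(positive):                  # no coefficient moves toward zero
            return theta
        eta = np.min(theta[positive] / grad_c[positive])

        def risk(v):                              # C_n(beta_0, f_0^v)
            return loss(y, beta0 + H @ (theta - v * grad_c)).sum()

        grid = np.linspace(0.0, eta, 50)          # crude line search over [0, eta]
        v_hat = grid[np.argmin([risk(v) for v in grid])]
        return np.maximum(theta - v_hat * grad_c, 0.0)   # exact zeros are deleted learners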
The MarginBoost.L1 algorithm combined with the deletion step, which we call the ANOVA boosting algorithm, is presented in Fig. 1.

Figure 1. The ANOVA boosting algorithm.

1. Let β_0 and f_0 be the initial estimates from a standard boosting algorithm.
2. Let λ_j = |f_j|^γ and λ_{jk} = |f_{jk}|^γ, where |f_j| = \sum_{g \in G_j} |\beta_g| and |f_{jk}| = \sum_{g \in G_{jk}} |\beta_g|.
3. Repeat until convergence:
   • Addition step (MarginBoost.L1 algorithm):
     (a) Find ĝ in G which minimizes \sum_{i=1}^{n} h_g(x_i) z_i, where z_i = ∂l(y_i, a)/∂a evaluated at a = f(x_i).
     (b) Find α̂ = argmin_{α ∈ [0,1]} C_n(β_0, (1 − α) f_0 + α h_ĝ).
     (c) Update f_0 = (1 − α̂) f_0 + α̂ h_ĝ.
   • Deletion step:
     (a) Let f_0(x) = \sum_{g \in G^+} θ_g h_g(x), where G^+ = {g : θ_g > 0}.
     (b) Let ∇_g = \sum_{i=1}^{n} h_g(x_i) z_i for g ∈ G^+ and let ∇*_g = ∇_g − \sum_{g' \in G^+} ∇_{g'} / #G^+.
     (c) Find v̂ = argmin_{v ∈ [0, η]} C_n(β_0, f_0^v), where f_0^v(x) = \sum_{g \in G^+} (θ_g − v∇*_g) h_g(x) and η = min{θ_g/∇*_g : ∇*_g > 0}.
     (d) Update f_0 = f_0^{v̂}.
   • Update β_0:
     (a) Update β_0 = argmin_{γ ∈ R} C_n(γ, f_0).
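For completeness, a matching sketch of the addition step of Fig. 1 under the same hypothetical conventions as the deletion-step sketch (candidate base learners are assumed to be supplied as precomputed vectors of values h_g(x_1), . . . , h_g(x_n)):

    import numpy as np

    def addition_step(theta, H, candidates, y, beta0, loss, grad_loss):
        """One MarginBoost.L1-style addition step.
        candidates is a list of (label, h) pairs, where h is the length-n vector of
        h_g(x_i) = lambda * w_g * g(x_i) for a candidate base learner g."""
        z = grad_loss(y, beta0 + H @ theta)
        # (a) the steepest-descent candidate minimizes sum_i h_g(x_i) z_i
        scores = [h @ z for _, h in candidates]
        label, h_best = candidates[int(np.argmin(scores))]

        # (b) line search for alpha in [0, 1]: f_0 <- (1 - alpha) f_0 + alpha h_g
        def risk(alpha):
            return loss(y, beta0 + (1 - alpha) * (H @ theta) + alpha * h_best).sum()
        grid = np.linspace(0.0, 1.0, 51)
        alpha = grid[np.argmin([risk(a) for a in grid])]

        # (c) shrink the old coefficients and append the new learner
        return np.append((1 - alpha) * theta, alpha), np.column_stack([H, h_best]), label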
Figure 2 compares the convergence speeds of the ANOVA boosting and MarginBoost.L1 algorithms on a simulated data set from Model 1 in Section 4.1. It is clear that the ANOVA boosting algorithm converges much faster than the MarginBoost.L1 algorithm. The training error measured by the empirical risk (the average loss over the training samples) achieves its minimum after around 25 iterations of the ANOVA boosting algorithm, while the training error keeps decreasing even after 200 iterations of the MarginBoost.L1 algorithm.

Figure 2. Training error (empirical risk) curves against the number of iterations for the MarginBoost.L1 (dashed line) and ANOVA boosting (solid line) algorithms.

The ANOVA boosting algorithm always converges since the empirical risk C_n(β_0, f_0) decreases after each iteration. The ANOVA boosting algorithm differs from standard boosting algorithms such as AdaBoost [1] and gradient boosting [3], which need a stopping rule to avoid overfitting. This is another advantage of the ANOVA boosting algorithm.

4. EXPERIMENTS

We compare the empirical performance of ANOVA boosting with a standard boosting method in terms of prediction accuracy and variable selectivity. For the standard boosting method, we use the MarginBoost.L1 algorithm of [12]. For variable selectivity, we compute the relative frequencies of the components selected. The regularization parameters γ and λ are selected by 5-fold cross-validation. We search for the optimal value of γ only on {0, 0.5, 1} to save computing time.
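The parameter search described above can be organized as a plain nested grid; the sketch below assumes a generic fit/error interface (fit and error are hypothetical placeholders, not part of the paper).

    import numpy as np
    from sklearn.model_selection import KFold

    def select_gamma_lambda(X, y, fit, error, lambdas,
                            gammas=(0.0, 0.5, 1.0), n_splits=5, seed=0):
        """5-fold cross-validation over the grid gammas x lambdas.
        fit(X_tr, y_tr, gamma, lam) returns a fitted model and
        error(model, X_te, y_te) returns its validation loss."""
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        best_pair, best_score = None, np.inf
        for gamma in gammas:
            for lam in lambdas:
                scores = [error(fit(X[tr], y[tr], gamma, lam), X[te], y[te])
                          for tr, te in cv.split(X)]
                if np.mean(scores) < best_score:
                    best_pair, best_score = (gamma, lam), float(np.mean(scores))
        return best_pair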
4.1 Simulation

We consider the following four models for the simulation. The first two models are regression problems and the last two models are logistic regression problems.

Model 1: The input vector x is generated from a 10 dimensional uniform distribution on [0, 1]^10. For given x, y is generated from the model y = f(x) + ε, where

    f(x) = 5 g_1(x^{(1)}) + 3 g_2(x^{(2)}) + 4 g_3(x^{(3)}) + 6 g_4(x^{(4)})

and ε is a normal variate with mean 0 and variance σ², which is selected to give a signal to noise ratio of 3:1. Here,

    g_1(t) = t;   g_2(t) = (2t − 1)^2;   g_3(t) = sin(2πt) / (2 − sin(2πt));
    g_4(t) = 0.1 sin(2πt) + 0.2 cos(2πt) + 0.3 sin^2(2πt) + 0.4 cos^3(2πt) + 0.5 sin^3(2πt).

This model is used by [9]. The model has only main effect components, and x_5, . . . , x_10 are noisy input variables. We apply the boosting algorithms with the squared error loss.

Model 2: Model 2 is the same as Model 1 except that

    f(x) = g_1(x^{(1)}) + g_2(x^{(2)}) + g_3(x^{(3)}) + g_4(x^{(4)}) + g_1(x^{(3)} x^{(4)}) + g_2((x^{(1)} + x^{(3)})/2) + g_3(x^{(1)} x^{(2)}).

That is, there are three interaction terms in the true model.

Model 3: The input vector x is generated from a 10 dimensional multivariate normal distribution with mean 0 and variance matrix Σ, the off-diagonals of which are 0.2 and the diagonals 1. For given x, y is generated from the Bernoulli distribution with Pr(Y = 1|x) = exp(f(x))/(1 + exp(f(x))), where

    f(x) = (4/3) x_1 + π sin(π x_2) + (1/10) x_3^5 + 3 e^{−x_4^2/2} − 1.5.

The model has only main effect components, and x_5, . . . , x_10 are noisy input variables. We apply the boosting algorithms with the negative log-likelihood loss.

Model 4: The input vector x is generated from a 5 dimensional multivariate normal distribution with mean 0 and variance matrix Σ, the off-diagonals of which are 0.2 and the diagonals 1. For given x, y is generated from the Bernoulli distribution with Pr(Y = 1|x) = exp(f(x))/(1 + exp(f(x))), where

    f(x) = 2 x_1 + π sin(π x_1) + x_2 − 2 x_2^3 + 4 exp(−2 |x_1 − x_2|).

The model has two main effect components and one second order interaction component, and x_3, x_4, x_5 are noisy input variables.

Table 1 compares the prediction accuracy and sparsity. Sparsity is measured by the number of non-zero components. We simulate 100 data sets of size 250. The error rate is evaluated on 10,000 testing points. In the table, the MIS-rate is the average misclassification error rate on the test samples and the NNZ is the average number of non-zero components.

Table 1. Estimates of the error rate and sparsity (standard errors) in 100 simulations

            Method            MIS-rate           NNZ
  Model 1   Boosting          1.2881 (0.0210)     9.94 (0.0239)
            ANOVA boosting    1.1155 (0.0146)     4.58 (0.0768)
  Model 2   Boosting          0.1908 (0.0007)    49.81 (0.0466)
            ANOVA boosting    0.1641 (0.0015)    11.15 (0.2556)
  Model 3   Boosting          0.2397 (0.0010)     9.78 (0.0628)
            ANOVA boosting    0.2253 (0.0011)     7.42 (0.1646)
  Model 4   Boosting          0.1781 (0.0012)    12.61 (0.2755)
            ANOVA boosting    0.1606 (0.0007)     3.92 (0.1468)

From Table 1, we can see that ANOVA boosting is more accurate and selects fewer components than the standard boosting. That is, ANOVA boosting has superior prediction power as well as interpretability compared to the standard boosting. The better performance of ANOVA boosting is expected since the true models are sparse.

Table 2 shows the relative frequency with which each component appears in the 100 estimated models; it shows that ANOVA boosting successfully deletes many noisy components compared to the standard boosting.

4.2 Analysis of real data sets

We analyze four real data sets which are available from the UCI machine learning repository. A description of the four data sets is presented in Table 3. In the table, Type indicates whether the data set is a regression problem (R) or a classification problem (C), N.obs is the number of observations, Cont. is the number of continuous inputs and Categ. is the number of categorical inputs.

The main effect model as well as the second order interaction model are fitted. Table 4 summarizes the prediction accuracy as well as the sparsity of ANOVA boosting and the standard boosting on the four data sets. The error rates are calculated by 10-fold cross-validation.

The results show that ANOVA boosting is more accurate than the standard boosting in most cases (the one exception being "Bupa" with the main effect model). Also, ANOVA boosting produces sparser models than the standard boosting. In particular, for the data set "Sonar" with the second order interaction model, the ANOVA boosting model consists of only 25.7 components while the standard boosting model has 111 components (i.e. a 75% reduction).
Table 2. The relative frequencies of appearance of the components in the models chosen in 100 runs

  Model 1   Method            X1      X2      X3      X4      Others
            Boosting          1.00    1.00    1.00    0.99
            ANOVA Boosting    1.00    1.00    1.00    1.00    0.10
  Model 2   Method            X1 ∼ X4 X1X2    X1X3    X3X4    Others
            Boosting          1.00    1.00    1.00    0.69
            ANOVA Boosting    1.00    1.00    0.78    0.11    0.05
  Model 3   Method            X1      X2      X3      X4      Others
            Boosting          1.00    1.00    1.00    0.96
            ANOVA Boosting    1.00    1.00    0.90    0.93    0.59
  Model 4   Method            X1      X2      X1X2    Others
            Boosting          1.00    1.00    0.80
            ANOVA Boosting    1.00    1.00    0.41    0.12

Table 3. Description of the four data sets

  Name      Type   N.obs   Cont. inputs   Categ. inputs
  Bupa      C      345     6              0
  Breast    C      286     3              6
  Sonar     C      210     60             0
  Housing   R      506     12             1

4.3 Illustration on the data set "Breast"

We investigate further the components selected for the breast cancer data set. This data set includes 201 instances of one class (no-recurrence-events) and 85 instances of the other class (recurrence-events). The instances are described by 9 attributes: X1: age, X2: menopause (lt40, ge40, premeno), X3: tumor-size, X4: inv-nodes, X5: node-caps (yes or no), X6: deg-malig (degree of malignancy), X7: breast location (left or right), X8: breast-quad (left-up, left-low, right-up, right-low, central), X9: irradiated (yes or no).
Since the second order interaction model has better prediction accuracy than the main effect model in Table 4, we present the results from the second order interaction model. Figure 3 gives the L1 norms of the 12 selected components out of the 45 candidate components. Among these, Fig. 4 shows the estimated functional forms of the first 6 components having the largest L1 norms. There are three main effect components and three second order interaction components. The risk of recurrence of the breast tumor increases as deg-malig, inv-nodes and tumor-size increase. Also, the three interaction components show that the location of the cancer interacts with the status of menopause and age. These results suggest that different treatments could be applied according to the age of a patient, the status of menopause and the location of the cancer.
Table 4. Estimates of the accuracies and the number of non-zero components (standard errors) in the four data sets

  Data      Model          Method            MIS-rate             NNZ
  Bupa      Main effect    Boosting           0.2868 (0.0247)      6.0 (0.0000)
                           ANOVA Boosting     0.2926 (0.0237)      6.0 (0.0000)
            Second order   Boosting           0.3362 (0.0223)     20.3 (0.3000)
                           ANOVA Boosting     0.3187 (0.0175)     12.7 (1.0333)
  Sonar     Main effect    Boosting           0.1583 (0.0235)     40.8 (1.7048)
                           ANOVA Boosting     0.1529 (0.0193)     24.7 (1.0005)
            Second order   Boosting           0.1631 (0.0225)    111   (10.8443)
                           ANOVA Boosting     0.1575 (0.0225)     25.7 (1.6401)
  Breast    Main effect    Boosting           0.2494 (0.0130)      7.1 (0.5467)
                           ANOVA Boosting     0.2449 (0.0094)      4.8 (0.5537)
            Second order   Boosting           0.2462 (0.0186)     21.2 (2.5638)
                           ANOVA Boosting     0.2421 (0.0156)     14.7 (3.0112)
  Housing   Main effect    Boosting          15.8973 (1.8412)     11.8 (0.1334)
                           ANOVA Boosting    14.6608 (1.7641)      9.4 (0.4760)
            Second order   Boosting          14.6569 (1.8556)     39.1 (0.6904)
                           ANOVA Boosting    13.4980 (1.4603)     23.3 (0.8171)
Figure 3. L1 norms of the 12 selected components.

Figure 4. Estimated functional forms of the 6 components having the largest L1 norms for the breast cancer data.

5. CONCLUDING REMARKS

By simulations and analysis of real data sets, we have illustrated that ANOVA boosting significantly improves the interpretability of standard boosting by estimating the components directly and providing componentwisely sparser models, without sacrificing prediction accuracy. Also, the newly proposed computational algorithm converges faster and can be applied to high dimensional data.

The final estimated components of ANOVA boosting are not smooth because decision trees are used as base learners. If one wants smooth estimates, one can use smooth base learners such as radial basis functions and smoothing splines. As long as we have base learners for the main effect components, base learners for higher order interactions can be constructed via the tensor product operation; see [13] for this approach. However, there is an advantage of using decision trees as base learners: ANOVA boosting is expected to be robust to input noise since decision trees are. This is because decision trees are invariant to monotone transformations of an input. So, in practice, we can use ANOVA boosting without preprocessing the input variables.
ACKNOWLEDGEMENTS
This work was supported by the Korea Science and
Engineering Foundation (KOSEF) grant funded by the
Korea government (MEST) R01-2007-000-20045-0 and
the Engineering Research Center of Excellence Program
of Korea Ministry of Education, Science and Technology (MEST)/Korea Science and Engineering Foundation (KOSEF), grant number R11-2008-007-01002-0.
Received 26 May 2009
REFERENCES
[1] Freund, Y. and Schapire, R. (1997). Journal of Computer and
System Sciences 55 119–139. MR1473055
[2] Schapire, R. and Singer, Y. (1999). Machine Learning 37 297–
336.
[3] Friedman, J. H. (2001). Annals of Statistics 29 1189–1232.
MR1873328
[4] Friedman, J. H. and Popescu, B. E. (2005). Predictive learning
via rule ensembles. Technical report, Stanford University.
[5] Zou, H. (2006). Journal of the American Statistical Association
101 1418–1429. MR2279469
[6] Tutz, G. and Binder, H. (2006). Biometrics 62 961–971. MR2297666
[7] Buhlmann, P. and Yu, B. (2006). Journal of Machine Learning Research 7 1001–1024. MR2274395
[8] Gunn, S. R. and Kandola, J. S. (2002). Machine Learning 48
137–163.
[9] Lin, Y. and Zhang, H. (2003). Component Selection and Smoothing in Smoothing Spline Analysis of Variance Models. Technical Report 1072, Department of Statistics, University of Wisconsin-Madison.
[10] Lee, Y., Kim, Y., Lee, S., and Koo, J.-Y. (2006). Biometrika 93
555–571. MR2261442
[11] Zhang, H. H. and Lin, Y. (2006). Statistica Sinica 16 1021–1041. MR2281313
[12] Mason, L., Baxter, J., Bartlett, P. L., and Frean, M. (2000). Functional gradient techniques for combining hypotheses. In Smola, A. J., Bartlett, P. L., Scholkopf, B., and Schuurmans, D. (eds.), Advances in Large Margin Classifiers, pp. 221–246. MIT Press, Cambridge. MR1820960
[13] Zhang, H., Wahba, G., Lin, Y., Voelker, M., Ferris, M., Klein, R., and Klein, B. (2004). Journal of the American Statistical Association 99 659–672. MR2090901
Yongdai Kim
Department of Statistics
Seoul National University
Korea
E-mail address: ydkim0903@gmail.com
Yuwon Kim
NHN Corp.
Korea
E-mail address: gary@stats.snu.ac.kr
Jinseog Kim
Department of Statistics
Dongguk University
Korea
E-mail address: jskim@stats.snu.ac.kr
Sangin Lee
Department of Statistics
Seoul National University
Korea
E-mail address: lsi44@statcom.snu.ac.kr
Sunghoon Kwon
Department of Statistics
Seoul National University
Korea
E-mail address: shkwon0522@gmail.com