Statistics and Its Interface Volume 2 (2009) 361–368
Boosting on the functional ANOVA
decomposition
Yongdai Kim∗, Yuwon Kim, Jinseog Kim, Sangin Lee, and Sunghoon Kwon
∗ Corresponding author.
A boosting algorithm on the functional ANOVA decomposition, called ANOVA boosting, is proposed. The main
idea of ANOVA boosting is to estimate each component in
the functional ANOVA decomposition by combining many
base (weak) learners. A regularization procedure based on
the L1 penalty is proposed to give a componentwise sparse
solution and an efficient computing algorithm is developed.
Simulated as well as benchmark data sets are analyzed to
compare ANOVA boosting and standard boosting. ANOVA
boosting improves prediction accuracy as well as interpretability by estimating the components directly and providing componentwisely sparser models.
Keywords and phrases: Functional ANOVA decomposition, Boosting, Variable selection.
1. INTRODUCTION
Given an output y ∈ Y and its corresponding input
x = (x(1) , . . . , x(p) ) ∈ X , suppose we are interested in a
functional relationship f : X → Y between x and y. When
the dimension of the input (i.e. p) is high, one has a lot
of difficulties in estimating and interpreting f . One of the
most important learning methods for high dimensional data
is a boosting method, which constructs a strong learner by
combining many base (weak) learners. The boosting method
has shown great success in statistics and machine learning
for its significant improvement in prediction accuracy. Since [1] introduced the first boosting algorithm, AdaBoost, various extensions have been proposed by [2] and
[3].
In this paper, we develop a way of using a boosting algorithm to estimate the components in the functional ANOVA
decomposition, which is given as
(1)    f(x) = f_1(x) + f_2(x) + · · · + f_K(x)
where the components fk (x) depend only on low dimensional elements of an input vector x. The main idea of the
proposed boosting algorithm is to estimate each component
f_k, k = 1, . . . , K, by combining base learners. We call the
proposed boosting method “ANOVA boosting.” First, we
propose sets of base learners for the components in the functional ANOVA decomposition to make the model identifiable. In particular, we use stumps (decision trees with only
two terminal nodes) as base learners for main effect terms
and their tensor products as base learners for interaction
effect terms. Second, we develop a regularization procedure
which gives a componentwisely sparse solution. Finally, we
implement an efficient computational algorithm.
An advantage of ANOVA boosting over standard boosting methods is that ANOVA boosting can estimate and identify important components and their influence on the output simultaneously. In contrast, as [3] explained, standard
boosting methods estimate only the highest order interaction components, and so estimating lower order components
requires additional post-processing procedures. See, also, [4].
This advantage of ANOVA boosting makes it possible to select (or delete) relevant (or irrelevant) input variables. When
the dimension of input is high, the final estimated model of
a standard boosting method includes many noisy components and we need to identify which components are real
signals and which are noise. Since ANOVA boosting can
estimate each component simultaneously, we can easily develop a method which can identify signal and noisy components in the estimated model. For this purpose, we develop
a componentwise sparse regularization procedure called the
componentwisely adaptive L1 penalty, which is motivated by
the adaptive lasso by [5].
There are several modified boosting algorithms which
give sparser solutions than standard boosting. [6] developed a similar boosting algorithm for the generalized additive model, and [7] proposed a boosting method called
sparse boosting, which yields a sparser solution than standard boosting. ANOVA boosting can estimate higher order interaction terms, while the algorithm of [6] can estimate only main effect terms. Also, ANOVA boosting gives a componentwisely sparser solution, in contrast to the sparse boosting of [7], which only gives a sparser solution in terms of base learners. That is, important components can be selected by ANOVA boosting but not by sparse boosting.
ANOVA boosting has several advantages over the kernel
based method for the functional ANOVA decomposition. [8]
used the kernel machine for the functional ANOVA decomposition to improve the interpretability, and their idea has
been studied and extended by [9], [10] and [11]. However,
the kernel machine has a problem with categorical inputs
since the Gram matrix can be singular and so the algorithm
fails to converge. Also, when the dimension of the input is
high, the computational cost for inverting the Gram matrix
is expensive. In contrast, categorical inputs can be processed
easily and computation is simpler since no matrix inversion
is required in ANOVA boosting.
The paper is organized as follows. Section 2 presents the ingredients of ANOVA boosting: the model, the choice of base learners, and the regularization procedure. In Section 3, a computational algorithm is presented. Simulated as
well as real datasets are analyzed in Section 4. Concluding
remarks follow in Section 5.
2. ANOVA BOOSTING

2.1 Model

Let (x_1, y_1), . . . , (x_n, y_n) be n input-output pairs of a training dataset, where x_i ∈ X ⊂ R^p and y_i ∈ Y, which are assumed to be a random sample from a probability measure P on X × Y. Let x_i = (x_i^{(1)}, . . . , x_i^{(p)}), where x_i^{(j)} ∈ X_j ⊂ R and X = X_1 × · · · × X_p. Let F be a given set of functions on R^p and let l : Y × R → R be a loss function. The objective of statistical learning is to find a function f* ∈ F which minimizes E_P(l(Y, f(X))) among f ∈ F.

The functional ANOVA decomposition of f is

    f(x) = \beta_0 + \sum_{j=1}^{p} f_j(x^{(j)}) + \sum_{j<k} f_{jk}(x^{(j)}, x^{(k)}) + \cdots

where β_0 is a constant, the f_j are the main effect components, the f_{jk} are the second order interaction components, and so on. For simplicity, we consider the model truncated at the second order interaction components. That is, F consists of functions having the form

    f(x) = \beta_0 + \sum_{j=1}^{p} f_j(x^{(j)}) + \sum_{j<k} f_{jk}(x^{(j)}, x^{(k)}).

Given predefined probability measures μ_j on X_j, let F_j be the set of functions f_j in L_2(μ_j) satisfying

(2)    \int_{X_j} f_j(x^{(j)}) \, \mu_j(dx^{(j)}) = 0   for f_j ∈ F_j,

and let F_{jk} be the set of functions f_{jk} in L_2(μ_j × μ_k) satisfying

(3)    \int_{X_j} f_{jk}(x^{(j)}, x^{(k)}) \, \mu_j(dx^{(j)}) = 0,   \int_{X_k} f_{jk}(x^{(j)}, x^{(k)}) \, \mu_k(dx^{(k)}) = 0.

Then, we can write

    F = \{1\} \oplus \Big[ \bigoplus_{j=1}^{p} F_j \Big] \oplus \Big[ \bigoplus_{j<k} F_{jk} \Big]

where all subspaces {1}, F_j, F_{jk}, j = 1, . . . , p, j < k, are orthogonal in L_2(μ) with μ = \prod_{j=1}^{p} μ_j, and hence all components are identifiable.

2.2 Choice of base learners

The basic idea of ANOVA boosting is to estimate each component (i.e. the f_j and the f_{jk}) by a linear combination of base learners. For this, we have to choose sets of base learners G_j and G_{jk} for the components f_j and f_{jk}, respectively.

For G_j, we use the set of decision trees with only two terminal nodes split by the variable x^{(j)}. For the side condition, we enforce

(4)    \int_{X_j} g_j(x^{(j)}) \, \mu_j(dx^{(j)}) = 0

for g_j ∈ G_j, and hence the resulting f_j satisfies (2). For a continuous input variable, let g_j(x^{(j)}) = θ_L I(x^{(j)} ≤ s) + θ_R I(x^{(j)} > s). To satisfy the side condition (4), we should have

(5)    μ_j(x^{(j)} ≤ s) θ_L + μ_j(x^{(j)} > s) θ_R = 0.

That is, we can choose the split value s freely, but the predictive values θ_L and θ_R should be selected to satisfy (5). Categorical inputs can be treated similarly.

For G_{jk}, we use the tensor products of the base learners in G_j and G_k. That is, we let G_{jk} = G_j ⊗ G_k, so that for any g_{jk} ∈ G_{jk} there exist g_j ∈ G_j and g_k ∈ G_k such that g_{jk}(x^{(j)}, x^{(k)}) = g_j(x^{(j)}) g_k(x^{(k)}). With G_{jk}, the resulting f_{jk} automatically satisfies the identifiability condition (3). Note that g_{jk} has the form

    g_{jk}(x^{(j)}, x^{(k)}) = θ_{LL} I(x^{(j)} ≤ s_j, x^{(k)} ≤ s_k) + θ_{LR} I(x^{(j)} ≤ s_j, x^{(k)} > s_k)
                             + θ_{RL} I(x^{(j)} > s_j, x^{(k)} ≤ s_k) + θ_{RR} I(x^{(j)} > s_j, x^{(k)} > s_k)

with the identifiability conditions

(6)    μ_j(x^{(j)} ≤ s_j) θ_{LL} + μ_j(x^{(j)} > s_j) θ_{RL} = 0,
       μ_j(x^{(j)} ≤ s_j) θ_{LR} + μ_j(x^{(j)} > s_j) θ_{RR} = 0,
       μ_k(x^{(k)} ≤ s_k) θ_{LL} + μ_k(x^{(k)} > s_k) θ_{LR} = 0,
       μ_k(x^{(k)} ≤ s_k) θ_{RL} + μ_k(x^{(k)} > s_k) θ_{RR} = 0.

It is easy to see that any one of θ_{LL}, θ_{LR}, θ_{RL} and θ_{RR} uniquely determines the other three values. In this view, we may say that the degree of freedom of g_{jk} is the same as that of g_j and g_k.

For the choice of μ_j, the most natural one is P_j, the marginal probability measure of x^{(j)}, which is unknown. We estimate P_j(x^{(j)} ≤ s) by its empirical counterpart \sum_{i=1}^{n} I(x_i^{(j)} ≤ s)/n.
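To make the side condition concrete, the following sketch (Python with NumPy; the function name centered_stump and its interface are our own illustration, not the authors' code) builds a main-effect stump for a continuous input: the split value s and the left value θ_L can be chosen freely, and θ_R is then implied by (5) with μ_j replaced by its empirical counterpart.

    import numpy as np

    def centered_stump(xj, s, theta_L=1.0):
        """Main-effect base learner g_j(x) = theta_L*I(x <= s) + theta_R*I(x > s)
        whose empirical mean is zero, i.e. which satisfies the side condition (4)-(5)."""
        xj = np.asarray(xj, dtype=float)
        p_left = np.mean(xj <= s)                 # empirical mu_j(x^(j) <= s)
        p_right = 1.0 - p_left                    # empirical mu_j(x^(j) > s)
        if p_left == 0.0 or p_right == 0.0:
            raise ValueError("the split s leaves one terminal node empty")
        theta_R = -theta_L * p_left / p_right     # solves (5)

        def g(x):
            values = np.where(np.asarray(x, dtype=float) <= s, theta_L, theta_R)
            return values / max(abs(theta_L), abs(theta_R))   # keep sup |g| <= 1
        return g

    # An interaction base learner in G_jk is the tensor product of two such stumps,
    # g_jk(u, v) = g_j(u) * g_k(v), which automatically satisfies (3) and (6).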
2.3 Regularization

In ANOVA boosting, the final model has the form

(7)    f(x) = \beta_0 + f_0(x)

where

(8)    f_0(x) = \sum_{j=1}^{p} \sum_{g \in G_j} \beta_g \, g(x^{(j)}) + \sum_{j<k} \sum_{g \in G_{jk}} \beta_g \, g(x^{(j)}, x^{(k)}).

First, we need to control the norm of the base learners to make the β_g estimable. For this, we require

    \sup_{x^{(j)} \in X_j} |g(x^{(j)})| \le 1   for all g ∈ G_j,

and

    \sup_{(x^{(j)}, x^{(k)}) \in X_j \times X_k} |g(x^{(j)}, x^{(k)})| \le 1   for all g ∈ G_{jk},

for all j, k.

Second, we need a regularization procedure for the β_g to avoid overfitting and to ensure componentwise sparsity. For this, we propose the componentwisely adaptive L1 constraint, defined as follows. Let β_g^{(0)} be the initial estimates obtained by a standard boosting method, and let

(9)    w_j = \Big( \sum_{g \in G_j} |\beta_g^{(0)}| \Big)^{\gamma}   and   w_{jk} = \Big( \sum_{g \in G_{jk}} |\beta_g^{(0)}| \Big)^{\gamma}

for some γ ≥ 0. Then, the componentwisely adaptive L1 constraint is defined by

(10)    \sum_{j=1}^{p} \sum_{g \in G_j} \frac{|\beta_g|}{w_j} + \sum_{j<k} \sum_{g \in G_{jk}} \frac{|\beta_g|}{w_{jk}} \le \lambda

where γ and λ are regularization parameters which can be selected by using test samples or cross-validation. The proposed constraint (10) is motivated by the adaptive lasso of [5]. Finally, we propose to estimate the β_g by minimizing the empirical risk C_n(\beta_0, f_0) = \sum_{i=1}^{n} l(y_i, f(x_i)) subject to the constraint (10).

Remark. It would be possible to use different regularization parameters for the components. That is, we let \sum_{g \in G_j} |\beta_g| / w_j \le \lambda_j and \sum_{g \in G_{jk}} |\beta_g| / w_{jk} \le \lambda_{jk}. This is useful when we have prior information about the importance of the components. For example, to incorporate the prior information that the main effect components are more important than the higher order interaction components, we let λ_{jk} ≤ λ_j. The algorithm developed in the next section can be modified easily for this case.
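As an illustration of (9) and (10), the sketch below (hypothetical helper names, not the authors' implementation) computes the componentwise weights from an initial boosting fit and evaluates the left-hand side of the constraint, which λ then caps.

    import numpy as np

    def componentwise_weights(beta0_by_component, gamma):
        """Weights (9): w_C = (sum of |beta_g^(0)| over g in G_C)^gamma."""
        return {c: np.sum(np.abs(b)) ** gamma
                for c, b in beta0_by_component.items()}

    def constraint_lhs(beta_by_component, weights):
        """Left-hand side of (10): sum over components of sum_g |beta_g| / w_C."""
        return sum(np.sum(np.abs(b)) / weights[c]
                   for c, b in beta_by_component.items() if weights[c] > 0)

    # Components with a small initial estimate receive a small weight w_C and hence
    # a heavier penalty, which pushes the whole component out of the model.
    init = {"x1": np.array([0.8, 0.3]), ("x1", "x2"): np.array([0.02])}
    w = componentwise_weights(init, gamma=1.0)
    print(constraint_lhs(init, w))   # the constraint (10) requires this to be <= lambda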
3. COMPUTATIONAL ALGORITHM

Given g in G = (\bigcup_j G_j) \cup (\bigcup_{j<k} G_{jk}), let h_g(x) = λ w_g g(x), where w_g = w_j if g ∈ G_j and w_g = w_{jk} if g ∈ G_{jk}. Then, we can rewrite (8) as

(11)    f_0(x) = \sum_{j=1}^{p} \sum_{g \in G_j} \theta_g h_g(x^{(j)}) + \sum_{j<k} \sum_{g \in G_{jk}} \theta_g h_g(x^{(j)}, x^{(k)})

and the constraint (10) becomes \sum_{g \in G} |\theta_g| \le 1, where θ_g = β_g/(λ w_g). Hence, for fixed β_0, we can use the MarginBoost.L1 algorithm of [12]. However, there is room to improve the MarginBoost.L1 algorithm: the final estimated model from the algorithm may be less sparse than it should be. This is because the MarginBoost.L1 algorithm keeps adding base learners to update the model, so when unnecessary base learners are added in the early stage of the iteration, they are never deleted from the estimated model. This may not be a serious problem for prediction accuracy, but it largely affects the sparsity of the estimated model. To resolve this problem, we employ a deletion step after each iteration. In the deletion step, some base learners in the model are deleted. By doing so, we improve the convergence speed and ensure the sparsity of the final estimated model.

The idea of the deletion step is as follows. After m iterations, there are at most m base learners whose coefficients are not zero. We then move the non-zero coefficients in their gradient direction until either a non-zero coefficient becomes zero or the optimization criterion is satisfied. To explain in more detail, given a current estimated model f_0, let G^+ = {g ∈ G : θ_g > 0}; that is, f_0(x) = \sum_{g \in G^+} θ_g h_g(x). Since the set of base learners is negation closed (i.e. if g ∈ G, then −g ∈ G), we assume that all the non-zero coefficients θ_g are positive and \sum_{g \in G^+} θ_g ≤ 1. Let ∇_g = ∂C_n(β_0, f_0)/∂θ_g for g ∈ G^+, and let ∇*_g = ∇_g − \sum_{g' \in G^+} ∇_{g'} / #G^+, where #G^+ is the cardinality of G^+. Consider new coefficients θ_g(v) = θ_g − v∇*_g for some v ≥ 0. Since \sum_{g \in G^+} ∇*_g = 0, we have θ_g(v) ≥ 0 for all g ∈ G^+ and \sum_{g \in G^+} θ_g(v) ≤ 1 as long as 0 ≤ v ≤ η, where η = min{θ_g/∇*_g : ∇*_g > 0}. We update θ_g by θ_g(v̂), where v̂ = argmin_{v ∈ [0, η]} C_n(β_0, f_0^v) and f_0^v(x) = \sum_{g \in G^+} θ_g(v) h_g(x). When v̂ = η, at least one of the θ_g, g ∈ G^+, becomes 0 and hence the corresponding base learner is deleted from the estimated model. Note that the deletion step always reduces the empirical risk, and hence the algorithm converges to the global optimum, as the MarginBoost.L1 algorithm does, under regularity conditions.
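A minimal sketch of one deletion step under the assumptions above (all active coefficients positive, base-learner values precomputed; the interface and the crude grid line search are our own illustration, not the authors' code):

    import numpy as np

    def deletion_step(theta, H, y, beta0, loss, grad_loss):
        """One deletion step on the active coefficients theta (all > 0, sum <= 1).
        H is the n x m matrix whose columns hold h_g(x_i) for the active learners;
        loss(y, f) is the vectorized loss and grad_loss(y, f) its derivative in f."""
        f = beta0 + H @ theta
        z = grad_loss(y, f)                       # z_i = dl(y_i, a)/da at a = f(x_i)
        grad = H.T @ z                            # nabla_g
        grad_c = grad - grad.mean()               # centered nabla*_g, sums to zero
        positive = grad_c > 0
        if not np.any(positive):                  # no coefficient moves toward zero
            return theta
        eta = np.min(theta[positive] / grad_c[positive])

        def risk(v):                              # C_n(beta_0, f_0^v)
            return loss(y, beta0 + H @ (theta - v * grad_c)).sum()

        grid = np.linspace(0.0, eta, 50)          # crude line search over [0, eta]
        v_hat = grid[np.argmin([risk(v) for v in grid])]
        return np.maximum(theta - v_hat * grad_c, 0.0)   # exact zeros are deleted learners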
The MarginBoost.L1 algorithm combined with the deletion step, which we call the ANOVA boosting algorithm, is presented in Fig. 1.

Figure 1. The ANOVA boosting algorithm.

1. Let β_0 and f_0 be the initial estimates from a standard boosting algorithm.
2. Let λ_j = |f_j|^γ and λ_{jk} = |f_{jk}|^γ, where |f_j| = \sum_{g \in G_j} |\beta_g| and |f_{jk}| = \sum_{g \in G_{jk}} |\beta_g|.
3. Repeat until convergence:
   • Addition step (MarginBoost.L1 algorithm):
     (a) Find ĝ in G which minimizes \sum_{i=1}^{n} h_g(x_i) z_i, where z_i = ∂l(y_i, a)/∂a evaluated at a = f(x_i).
     (b) Find α̂ = argmin_{α ∈ [0,1]} C_n(β_0, (1 − α) f_0 + α h_ĝ).
     (c) Update f_0 = (1 − α̂) f_0 + α̂ h_ĝ.
   • Deletion step:
     (a) Let f_0(x) = \sum_{g \in G^+} θ_g h_g(x), where G^+ = {g : θ_g > 0}.
     (b) Let ∇_g = \sum_{i=1}^{n} h_g(x_i) z_i for g ∈ G^+ and let ∇*_g = ∇_g − \sum_{g' \in G^+} ∇_{g'} / #G^+.
     (c) Find v̂ = argmin_{v ∈ [0, η]} C_n(β_0, f_0^v), where f_0^v(x) = \sum_{g \in G^+} (θ_g − v∇*_g) h_g(x) and η = min{θ_g/∇*_g : ∇*_g > 0}.
     (d) Update f_0 = f_0^{v̂}.
   • Update β_0:
     (a) Update β_0 = argmin_{γ ∈ R} C_n(γ, f_0).
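For completeness, a matching sketch of the addition step of Fig. 1 under the same hypothetical conventions as the deletion-step sketch (candidate base learners are assumed to be supplied as precomputed vectors of values h_g(x_1), . . . , h_g(x_n)):

    import numpy as np

    def addition_step(theta, H, candidates, y, beta0, loss, grad_loss):
        """One MarginBoost.L1-style addition step.
        candidates is a list of (label, h) pairs, where h is the length-n vector of
        h_g(x_i) = lambda * w_g * g(x_i) for a candidate base learner g."""
        z = grad_loss(y, beta0 + H @ theta)
        # (a) the steepest-descent candidate minimizes sum_i h_g(x_i) z_i
        scores = [h @ z for _, h in candidates]
        label, h_best = candidates[int(np.argmin(scores))]

        # (b) line search for alpha in [0, 1]: f_0 <- (1 - alpha) f_0 + alpha h_g
        def risk(alpha):
            return loss(y, beta0 + (1 - alpha) * (H @ theta) + alpha * h_best).sum()
        grid = np.linspace(0.0, 1.0, 51)
        alpha = grid[np.argmin([risk(a) for a in grid])]

        # (c) shrink the old coefficients and append the new learner
        return np.append((1 - alpha) * theta, alpha), np.column_stack([H, h_best]), label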
Figure 2 compares the convergence speeds of the ANOVA boosting and MarginBoost.L1 algorithms on a simulated data set from Model 1 in Section 4.1. It is clear that the ANOVA boosting algorithm converges much faster than the MarginBoost.L1 algorithm. The training error measured by the empirical risk (the average loss over the training samples) achieves its minimum after around 25 iterations of the ANOVA boosting algorithm, while the training error keeps decreasing even after 200 iterations of the MarginBoost.L1 algorithm.

Figure 2. Training error (empirical risk) curves against the number of iterations for the MarginBoost.L1 (dashed line) and ANOVA boosting (solid line) algorithms.

The ANOVA boosting algorithm always converges since the empirical risk C_n(β_0, f_0) decreases after each iteration. The ANOVA boosting algorithm differs from standard boosting algorithms such as AdaBoost [1] and gradient boosting [3], which need a stopping rule to avoid overfitting. This is another advantage of the ANOVA boosting algorithm.

4. EXPERIMENTS

We compare the empirical performance of ANOVA boosting with a standard boosting method in terms of prediction accuracy and variable selectivity. For the standard boosting method, we use the MarginBoost.L1 algorithm of [12]. For variable selectivity, we compute the relative frequencies of the components selected. The regularization parameters γ and λ are selected by 5-fold cross-validation. We search for the optimal value of γ only on {0, 0.5, 1} to save computing time.
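The parameter search described above can be organized as a plain nested grid; the sketch below assumes a generic fit/error interface (fit and error are hypothetical placeholders, not part of the paper).

    import numpy as np
    from sklearn.model_selection import KFold

    def select_gamma_lambda(X, y, fit, error, lambdas,
                            gammas=(0.0, 0.5, 1.0), n_splits=5, seed=0):
        """5-fold cross-validation over the grid gammas x lambdas.
        fit(X_tr, y_tr, gamma, lam) returns a fitted model and
        error(model, X_te, y_te) returns its validation loss."""
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        best_pair, best_score = None, np.inf
        for gamma in gammas:
            for lam in lambdas:
                scores = [error(fit(X[tr], y[tr], gamma, lam), X[te], y[te])
                          for tr, te in cv.split(X)]
                if np.mean(scores) < best_score:
                    best_pair, best_score = (gamma, lam), float(np.mean(scores))
        return best_pair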
4.1 Simulation

We consider the following four models for the simulation. The first two models are regression problems and the last two models are logistic regression problems.

Model 1: The input vector x is generated from a 10 dimensional uniform distribution on [0, 1]^10. For given x, y is generated from the model y = f(x) + ε, where

    f(x) = 5 g_1(x^{(1)}) + 3 g_2(x^{(2)}) + 4 g_3(x^{(3)}) + 6 g_4(x^{(4)})

and ε is a normal variate with mean 0 and variance σ², which is selected to give a signal to noise ratio of 3:1. Here,

    g_1(t) = t;   g_2(t) = (2t − 1)^2;   g_3(t) = sin(2πt) / (2 − sin(2πt));
    g_4(t) = 0.1 sin(2πt) + 0.2 cos(2πt) + 0.3 sin^2(2πt) + 0.4 cos^3(2πt) + 0.5 sin^3(2πt).

This model is used by [9]. The model has only main effect components, and x_5, . . . , x_10 are noisy input variables. We apply the boosting algorithms with the squared error loss.

Model 2: Model 2 is the same as Model 1 except that

    f(x) = g_1(x^{(1)}) + g_2(x^{(2)}) + g_3(x^{(3)}) + g_4(x^{(4)}) + g_1(x^{(3)} x^{(4)}) + g_2((x^{(1)} + x^{(3)})/2) + g_3(x^{(1)} x^{(2)}).

That is, there are three interaction terms in the true model.

Model 3: The input vector x is generated from a 10 dimensional multivariate normal distribution with mean 0 and variance matrix Σ, the off-diagonals of which are 0.2 and the diagonals 1. For given x, y is generated from the Bernoulli distribution with Pr(Y = 1|x) = exp(f(x))/(1 + exp(f(x))), where

    f(x) = (4/3) x_1 + π sin(π x_2) + (1/10) x_3^5 + 3 e^{−x_4^2/2} − 1.5.

The model has only main effect components, and x_5, . . . , x_10 are noisy input variables. We apply the boosting algorithms with the negative log-likelihood loss.

Model 4: The input vector x is generated from a 5 dimensional multivariate normal distribution with mean 0 and variance matrix Σ, the off-diagonals of which are 0.2 and the diagonals 1. For given x, y is generated from the Bernoulli distribution with Pr(Y = 1|x) = exp(f(x))/(1 + exp(f(x))), where

    f(x) = 2 x_1 + π sin(π x_1) + x_2 − 2 x_2^3 + 4 exp(−2 |x_1 − x_2|).

The model has two main effect components and one second order interaction component, and x_3, x_4, x_5 are noisy input variables.

Table 1 compares the prediction accuracy and sparsity. Sparsity is measured by the number of non-zero components. We simulate 100 data sets of size 250. The error rate is evaluated on 10,000 testing points. In the table, the MIS-rate is the average misclassification error rate on the test samples and the NNZ is the average number of non-zero components.

Table 1. Estimates of the error rate and sparsity (standard errors) in 100 simulations

            Method            MIS-rate           NNZ
  Model 1   Boosting          1.2881 (0.0210)     9.94 (0.0239)
            ANOVA boosting    1.1155 (0.0146)     4.58 (0.0768)
  Model 2   Boosting          0.1908 (0.0007)    49.81 (0.0466)
            ANOVA boosting    0.1641 (0.0015)    11.15 (0.2556)
  Model 3   Boosting          0.2397 (0.0010)     9.78 (0.0628)
            ANOVA boosting    0.2253 (0.0011)     7.42 (0.1646)
  Model 4   Boosting          0.1781 (0.0012)    12.61 (0.2755)
            ANOVA boosting    0.1606 (0.0007)     3.92 (0.1468)

From Table 1, we can see that ANOVA boosting is more accurate and selects fewer components than the standard boosting. That is, ANOVA boosting has superior prediction power as well as interpretability compared to the standard boosting. The better performance of ANOVA boosting is expected since the true models are sparse.

Table 2 shows the relative frequency with which each component appears in the 100 estimated models; it shows that ANOVA boosting successfully deletes many noisy components compared to the standard boosting.

4.2 Analysis of real data sets

We analyze four real data sets which are available from the UCI machine learning repository. A description of the four data sets is presented in Table 3. In the table, Type indicates whether the data set is a regression problem (R) or a classification problem (C), N.obs is the number of observations, Cont. is the number of continuous inputs and Categ. is the number of categorical inputs.

The main effect model as well as the second order interaction model are fitted. Table 4 summarizes the prediction accuracy as well as the sparsity of ANOVA boosting and the standard boosting on the four data sets. The error rates are calculated by 10-fold cross-validation.

The results show that ANOVA boosting is more accurate than the standard boosting in most cases (the one exception being "Bupa" with the main effect model). Also, ANOVA boosting produces sparser models than the standard boosting. In particular, for the data set "Sonar" with the second order interaction model, the ANOVA boosting model consists of only 25.7 components while the standard boosting model has 111 components (i.e. a 75% reduction).
Table 2. The relative frequencies of appearance of the components in the models chosen in 100 runs

  Model 1   Method            X1      X2      X3      X4      Others
            Boosting          1.00    1.00    1.00    0.99
            ANOVA Boosting    1.00    1.00    1.00    1.00    0.10
  Model 2   Method            X1 ∼ X4 X1X2    X1X3    X3X4    Others
            Boosting          1.00    1.00    1.00    0.69
            ANOVA Boosting    1.00    1.00    0.78    0.11    0.05
  Model 3   Method            X1      X2      X3      X4      Others
            Boosting          1.00    1.00    1.00    0.96
            ANOVA Boosting    1.00    1.00    0.90    0.93    0.59
  Model 4   Method            X1      X2      X1X2    Others
            Boosting          1.00    1.00    0.80
            ANOVA Boosting    1.00    1.00    0.41    0.12

Table 3. Description of the four data sets

  Name      Type   N.obs   Cont. inputs   Categ. inputs
  Bupa      C      345     6              0
  Breast    C      286     3              6
  Sonar     C      210     60             0
  Housing   R      506     12             1

4.3 Illustration on the data set "Breast"

We investigate further the components selected for the breast cancer data set. This data set includes 201 instances of one class (no-recurrence-events) and 85 instances of the other class (recurrence-events). The instances are described by 9 attributes: X1: age, X2: menopause (lt40, ge40, premeno), X3: tumor-size, X4: inv-nodes, X5: node-caps (yes or no), X6: deg-malig (degree of malignancy), X7: breast location (left or right), X8: breast-quad (left-up, left-low, right-up, right-low, central), X9: irradiated (yes or no).
Since the second order interaction model has better prediction accuracy than the main effect model in Table 4, we present the results from the second order interaction model. Figure 3 gives the L1 norms of the 12 selected components out of the 45 candidate components. Among these, Fig. 4 shows the estimated functional forms of the first 6 components having the largest L1 norms. There are three main effect components and three second order interaction components. The risk of recurrence of the breast tumor increases as deg-malig, inv-nodes and tumor-size increase. Also, the three interaction components show that the location of the cancer interacts with the status of menopause and age. These results suggest that different treatments could be applied according to the age of a patient, the status of menopause and the location of the cancer.
Table 4. Estimates of the accuracies and the number of non-zero components (standard errors) in the four data sets

  Data      Model          Method            MIS-rate             NNZ
  Bupa      Main effect    Boosting           0.2868 (0.0247)      6.0 (0.0000)
                           ANOVA Boosting     0.2926 (0.0237)      6.0 (0.0000)
            Second order   Boosting           0.3362 (0.0223)     20.3 (0.3000)
                           ANOVA Boosting     0.3187 (0.0175)     12.7 (1.0333)
  Sonar     Main effect    Boosting           0.1583 (0.0235)     40.8 (1.7048)
                           ANOVA Boosting     0.1529 (0.0193)     24.7 (1.0005)
            Second order   Boosting           0.1631 (0.0225)    111   (10.8443)
                           ANOVA Boosting     0.1575 (0.0225)     25.7 (1.6401)
  Breast    Main effect    Boosting           0.2494 (0.0130)      7.1 (0.5467)
                           ANOVA Boosting     0.2449 (0.0094)      4.8 (0.5537)
            Second order   Boosting           0.2462 (0.0186)     21.2 (2.5638)
                           ANOVA Boosting     0.2421 (0.0156)     14.7 (3.0112)
  Housing   Main effect    Boosting          15.8973 (1.8412)     11.8 (0.1334)
                           ANOVA Boosting    14.6608 (1.7641)      9.4 (0.4760)
            Second order   Boosting          14.6569 (1.8556)     39.1 (0.6904)
                           ANOVA Boosting    13.4980 (1.4603)     23.3 (0.8171)
Figure 3. L1 norms of the 12 selected components.

Figure 4. Estimated functional forms of the 6 components having the largest L1 norms for the breast cancer data.

5. CONCLUDING REMARKS

By simulations and analysis of real data sets, we have illustrated that ANOVA boosting significantly improves the interpretability of standard boosting by estimating the components directly and providing componentwisely sparser models, without sacrificing prediction accuracy. Also, the newly proposed computational algorithm converges faster and can be applied to high dimensional data.

The final estimated components of ANOVA boosting are not smooth because decision trees are used as base learners. If one wants smooth estimates, one can use smooth base learners such as radial basis functions and smoothing splines. As long as we have base learners for the main effect components, base learners for higher order interactions can be constructed via the tensor product operation; see [13] for this approach. However, there is an advantage of using decision trees as base learners: ANOVA boosting is expected to be robust to input noise since decision trees are. This is because decision trees are invariant to monotone transformations of an input. So, in practice, we can use ANOVA boosting without preprocessing the input variables.
ACKNOWLEDGEMENTS
This work was supported by the Korea Science and
Engineering Foundation (KOSEF) grant funded by the
Korea government (MEST) R01-2007-000-20045-0 and
the Engineering Research Center of Excellence Program
of Korea Ministry of Education, Science and Technology (MEST)/Korea Science and Engineering Foundation (KOSEF), grant number R11-2008-007-01002-0.
Received 26 May 2009
REFERENCES
[1] Freund, Y. and Schapire, R. (1997). Journal of Computer and
System Sciences 55 119–139. MR1473055
[2] Schapire, R. and Singer, Y. (1999). Machine Learning 37 297–
336.
[3] Friedman, J. H. (2001). Annals of Statistics 29 1189–1232.
MR1873328
[4] Friedman, J. H. and Popescu, B. E. (2005). Predictive learning
via rule ensembles. Technical report, Stanford University.
[5] Zou, H. (2006). Journal of the American Statistical Association
101 1418–1429. MR2279469
[6] Tutz, G. and Binder, H. (2006). Biometrics 62 961–971. MR2297666
[7] Buhlmann, P. and Yu, B. (2006). Journal of Machine Learning Research 7 1001–1024. MR2274395
[8] Gunn, S. R. and Kandola, J. S. (2002). Machine Learning 48
137–163.
[9] Lin, Y. and Zhang, H. (2003). Component Selection and Smoothing in Smoothing Spline Analysis of Variance Models. Technical Report 1072, Department of Statistics, University of Wisconsin-Madison.
[10] Lee, Y., Kim, Y., Lee, S., and Koo, J.-Y. (2006). Biometrika 93
555–571. MR2261442
[11] Zhang, H. H. and Lin, Y. (2006). Statistica Sinica 16 1021–1041. MR2281313
[12] Mason, L., Baxter, J., Bartlett, P. L., and Frean, M. (2000). Functional gradient techniques for combining hypotheses. In Smola, A. J., Bartlett, P. L., Scholkopf, B., and Schuurmans, D. (eds.), Advances in Large Margin Classifiers, pp. 221–246. MIT Press, Cambridge. MR1820960
[13] Zhang, H., Wahba, G., Lin, Y., Voelker, M., Ferris, M., Klein, R., and Klein, B. (2004). Journal of the American Statistical Association 99 659–672. MR2090901
Yongdai Kim
Department of Statistics
Seoul National University
Korea
E-mail address: ydkim0903@gmail.com
Yuwon Kim
NHN Corp.
Korea
E-mail address: gary@stats.snu.ac.kr
Jinseog Kim
Department of Statistics
Dongguk University
Korea
E-mail address: jskim@stats.snu.ac.kr
Sangin Lee
Department of Statistics
Seoul National University
Korea
E-mail address: lsi44@statcom.snu.ac.kr
Sunghoon Kwon
Department of Statistics
Seoul National University
Korea
E-mail address: shkwon0522@gmail.com