Unbiased Recursive Partitioning I: A Non-Parametric Conditional Inference Framework
Recursive Partitioning
Morgan and Sonquist (1963): Automatic Interaction Detection (AID).
Many variants have been (and still are) published, the majority of which are special
cases of a simple two-stage algorithm:
Step 1: partition the observations by univariate splits in a recursive way
Step 2: fit a constant model in each cell of the resulting partition.
Most prominent representatives: CART (Breiman et al., 1984) and C4.5 (Quinlan,
1993), both implementing an exhaustive search.
A Statistical Approach
We enter at the point where White and Liu (1994) call for
[. . . ] a statistical approach [to recursive partitioning] which takes into account
the distributional properties of the measures.
and present a unified framework embedding recursive binary partitioning into the
well-defined theory of
Part I: permutation tests developed by Strasser and Weber (1999),
Part II: tests for parameter instability in (parametric) regression models.
A Generic Algorithm
1. For case weights w test the global null hypothesis of independence between any
of the m covariates and the response. Stop if this hypothesis cannot be rejected.
Otherwise select the covariate Xj with strongest association to Y.
2. Choose a set A ⊆ Xj in order to split Xj into two disjoint sets A and Xj \
A. The case weights wleft and wright determine the two subgroups, with wleft,i =
wi I(Xji ∈ A) and wright,i = wi I(Xji ∉ A) for all i = 1, . . . , n (I(·) denotes
the indicator function).
3. Recursively repeat steps 1 and 2 with modified case weights wleft and wright,
respectively.
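The three steps above can be sketched in code. The following is a hypothetical, simplified illustration (a plain covariance-based permutation test and a median split stand in for the conditional inference machinery; it is not the party implementation):

```python
import numpy as np

def perm_pvalue(x, y, n_perm=199, seed=0):
    """Permutation p-value for the absolute covariance between x and y."""
    rng = np.random.default_rng(seed)
    obs = abs(np.cov(x, y)[0, 1])
    hits = sum(abs(np.cov(x, rng.permutation(y))[0, 1]) >= obs
               for _ in range(n_perm))
    return (1 + hits) / (n_perm + 1)

def grow(X, y, w, alpha=0.05, min_n=20):
    """Step 1: global independence test; step 2: binary split;
    step 3: recursion with modified case weights."""
    idx = np.flatnonzero(w)
    if idx.size < min_n:
        return {"leaf": True, "value": y[idx].mean()}
    pvals = [perm_pvalue(X[idx, j], y[idx]) for j in range(X.shape[1])]
    j = int(np.argmin(pvals))
    if pvals[j] * X.shape[1] > alpha:        # Bonferroni-adjusted stop
        return {"leaf": True, "value": y[idx].mean()}
    cut = float(np.median(X[idx, j]))        # crude split point, for brevity
    w_left = w * (X[:, j] <= cut)
    w_right = w * (X[:, j] > cut)
    if w_left[idx].sum() == 0 or w_right[idx].sum() == 0:
        return {"leaf": True, "value": y[idx].mean()}
    return {"leaf": False, "var": j, "cut": cut,
            "left": grow(X, y, w_left, alpha, min_n),
            "right": grow(X, y, w_right, alpha, min_n)}

# Example: the response depends on the first covariate only.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(float) + 0.1 * rng.normal(size=200)
tree = grow(X, y, np.ones(200, dtype=int))
```

Because variable selection (step 1) is separated from split point search (step 2), the noise covariates cannot win merely by offering more candidate cutpoints.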
Linear Statistics
Use a (multivariate) linear statistic

$$T_j(\mathcal{L}_n, w) = \mathrm{vec}\left( \sum_{i=1}^{n} w_i g_j(X_{ji}) h(Y_i, (Y_1, \dots, Y_n))^\top \right) \in \mathbb{R}^{p_j q}$$

with conditional expectation and covariance, given all permutations $S(\mathcal{L}_n, w)$ of the responses,

$$\mu_j = \mathrm{vec}\left( \left( \sum_{i=1}^{n} w_i g_j(X_{ji}) \right) \mathbb{E}(h \mid S(\mathcal{L}_n, w))^\top \right),$$

$$\Sigma_j = \frac{w_\cdot}{w_\cdot - 1} \, \mathbb{V}(h \mid S(\mathcal{L}_n, w)) \otimes \sum_{i} w_i g_j(X_{ji}) g_j(X_{ji})^\top - \frac{1}{w_\cdot - 1} \, \mathbb{V}(h \mid S(\mathcal{L}_n, w)) \otimes \left( \sum_{i} w_i g_j(X_{ji}) \right) \left( \sum_{i} w_i g_j(X_{ji}) \right)^\top$$

with $w_\cdot = \sum_{i=1}^{n} w_i$.
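These moments can be checked numerically. The sketch below (not from the slides) takes the scalar case with both the transformation g and the influence function h equal to the identity, so pj = q = 1 and unit weights, and compares the conditional mean and variance with the exact permutation distribution:

```python
import numpy as np
from itertools import permutations

def linear_statistic_moments(x, y, w):
    """Scalar case of T_j = sum_i w_i g(x_i) h(y_i) with g = h = identity,
    together with its conditional mean mu and variance sigma^2 under
    permutations of the responses (Strasser & Weber, 1999)."""
    ws = w.sum()                              # w. = sum of case weights
    T = np.sum(w * x * y)
    Eh = np.sum(w * y) / ws                   # E(h | S(L_n, w))
    Vh = np.sum(w * (y - Eh) ** 2) / ws       # V(h | S(L_n, w))
    mu = np.sum(w * x) * Eh
    sigma2 = (ws / (ws - 1)) * Vh * np.sum(w * x ** 2) \
        - (1 / (ws - 1)) * Vh * np.sum(w * x) ** 2
    return T, mu, sigma2

# Brute-force check against all 6! = 720 permutations (unit weights).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.3, -1.0, 2.0, 0.1, 1.5, -0.4])
T, mu, sigma2 = linear_statistic_moments(x, y, np.ones(6))
Ts = np.array([np.sum(x * np.array(p)) for p in permutations(y)])
print(abs(Ts.mean() - mu) < 1e-9, abs(Ts.var() - sigma2) < 1e-9)
```

In this scalar case the formula collapses to the classical permutation variance of a linear rank statistic, Sxx Syy / (n - 1), which the brute-force enumeration confirms.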
Test Statistics
A (multivariate) linear statistic $T_j$ can now be used to construct a test statistic for
testing $H_0^j$, for example via

$$c_{\max}(t, \mu, \Sigma) = \max_{k=1,\dots,p_j q} \left| \frac{(t - \mu)_k}{\sqrt{(\Sigma)_{kk}}} \right|$$

or

$$c_{\text{quad}}(t, \mu, \Sigma) = (t - \mu) \Sigma^+ (t - \mu)^\top$$

conditional on all permutations $S(\mathcal{L}_n, w)$ of the data. This solves the overfitting
problem.

When we can reject $H_0$ in step 1 of the generic algorithm, we select the covariate with
minimum $P$-value

$$P_j = \mathbb{P}_{H_0^j}\left( c(T_j(\mathcal{L}_n, w), \mu_j, \Sigma_j) \geq c(t_j, \mu_j, \Sigma_j) \mid S(\mathcal{L}_n, w) \right)$$

of the conditional test for $H_0^j$. This prevents variable selection bias.
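For small samples the conditional P-value can be computed exactly by enumerating all permutations. A sketch (scalar case, identity g and h; hypothetical helper names, not the party API):

```python
import numpy as np
from itertools import permutations

def c_max(t, mu, sigma):
    """Maximum absolute standardized component of t."""
    return float(np.max(np.abs((t - mu) / np.sqrt(np.diag(sigma)))))

def c_quad(t, mu, sigma):
    """Quadratic form with the Moore-Penrose inverse Sigma^+."""
    d = np.atleast_1d(t - mu)
    return float(d @ np.linalg.pinv(np.atleast_2d(sigma)) @ d)

def conditional_pvalue(x, y, c=c_quad):
    """Exact conditional P-value: proportion of permutations of y whose
    test statistic is at least as large as the observed one."""
    n = len(y)
    mu = np.array([x.sum() * y.mean()])
    sxx = np.sum((x - x.mean()) ** 2)
    syy = np.sum((y - y.mean()) ** 2)
    sigma = np.array([[sxx * syy / (n - 1)]])   # permutation variance
    obs = c(np.array([np.sum(x * y)]), mu, sigma)
    vals = [c(np.array([np.sum(x * np.array(p))]), mu, sigma)
            for p in permutations(y)]
    return float(np.mean([v >= obs for v in vals]))

x = np.arange(1.0, 7.0)
p_strong = conditional_pvalue(x, x.copy())                       # perfect association
p_weak = conditional_pvalue(x, np.array([0.3, -1.0, 2.0, 0.1, 1.5, -0.4]))
```

In practice the exact distribution is replaced by Monte Carlo resampling or the asymptotic distribution when n is large; the enumeration above is only feasible for tiny samples.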
Splitting Criteria
The goodness of a split is evaluated by two-sample linear statistics which are special
cases of the linear statistic $T_j$. For all possible subsets $A$ of the sample space $\mathcal{X}_j$, the
linear statistic

$$T_j^A(\mathcal{L}_n, w) = \mathrm{vec}\left( \sum_{i=1}^{n} w_i I(X_{ji} \in A) h(Y_i, (Y_1, \dots, Y_n))^\top \right) \in \mathbb{R}^q$$

measures the discrepancy between the two subgroups. The split $A^*$ with maximal
standardized statistic $c(t_j^A, \mu_j^A, \Sigma_j^A)$ is implemented.
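The split search can be illustrated in a few lines. A minimal sketch (assuming an ordered covariate, identity influence function h, and unit weights; not the party implementation):

```python
import numpy as np

def best_split(x, y):
    """Return the cutpoint a whose two-sample statistic
    T^A = sum_i I(x_i <= a) y_i deviates most, in standardized units,
    from its conditional expectation under permutations of y."""
    n = len(y)
    eh, vh = y.mean(), y.var()              # E(h | S) and V(h | S)
    best_a, best_c = None, -np.inf
    for a in np.unique(x)[:-1]:             # all possible binary splits
        m = int(np.sum(x <= a))             # size of the left subgroup
        T = np.sum(y[x <= a])
        mu = m * eh
        s2 = vh * m * (n - m) / (n - 1)     # two-sample permutation variance
        c = abs(T - mu) / np.sqrt(s2)
        if c > best_c:
            best_a, best_c = float(a), c
    return best_a, best_c

# Example: a clean shift in y at x = 5 is recovered as the best cutpoint.
x = np.arange(10.0)
y = (x >= 5).astype(float)
a_star, c_star = best_split(x, y)
```

Because every candidate statistic is standardized by its own permutation moments, cutpoints producing very unbalanced subgroups are not artificially favored.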
Laser scanning images taken from the eye background are expected to serve as the
basis of an automated system for glaucoma diagnosis. For 98 patients and 98 controls,
matched by age and gender, 62 covariates describing the eye morphology are available.
[Figure: conditional inference tree for the glaucoma data. Inner nodes split on vari (node 1, p < 0.001, cutpoint 0.059), vasg (node 2, p < 0.001, cutpoint 0.046), and vart (node 3, p = 0.001, cutpoint 0.005); terminal nodes 4 (n = 51), 5 (n = 22), 6 (n = 14), and 7 (n = 109) show bar plots of the glaucoma vs. normal class distribution.]
Interested in the class distribution in each inner node? Want to explore the process of
the split statistics in each inner node?
[Figure: class distributions (glaucoma vs. normal) in all seven nodes, node 1 (n = 196), node 2 (n = 87), node 3 (n = 73), node 4 (n = 51), node 5 (n = 22), node 6 (n = 14), and node 7 (n = 109), together with the curves of the split statistics over the candidate cutpoints of vari, vasg, and vart in the inner nodes 1, 2, and 3.]
Evaluation of prognostic factors for the German Breast Cancer Study Group (GBSG2)
data, a prospective controlled clinical trial on the treatment of node-positive breast
cancer patients. Complete data on seven prognostic factors of 686 women are used for
prognostic modeling.
[Figure: conditional inference tree for the GBSG2 data. Node 1 splits on pnodes (p < 0.001, ≤ 3 vs. > 3), node 2 on horTh (p = 0.035, no vs. yes), and node 5 on progrec (p < 0.001, ≤ 20 vs. > 20); terminal nodes 3 (n = 248), 4 (n = 128), 6 (n = 144), and 7 (n = 166) show survival curves over follow-up time (500 to 2500 days).]
[Figure: conditional inference tree for the mammography experience data. Node 1 splits on SYMPT (p < 0.001, ≤ Agree vs. > Agree) and node 3 on PB (p = 0.012, ≤ 8 vs. > 8); terminal nodes 2 (n = 113), 4 (n = 208), and 5 (n = 91) show bar plots of the ordered response (first category: Never).]
Benchmark Experiments
Hypothesis 1: Conditional inference trees with a statistical stop criterion perform as
well as an exhaustive search algorithm with pruning.
Hypothesis 2: Conditional inference trees with a statistical stop criterion perform as
well as parametric unbiased recursive partitioning (QUEST, GUIDE; Loh, 2002, is a
starting point).
Equivalence is measured by the ratio of misclassification or mean squared errors with an
equivalence margin of 10% and Fieller confidence intervals.
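For illustration only, a rough sketch of how a large-sample Fieller interval for a ratio of mean performances might be computed, assuming the two performance samples are independent (the actual benchmark design may differ):

```python
import numpy as np

def fieller_ci(a, b, z=1.96):
    """Approximate 95% Fieller confidence interval for mean(a) / mean(b),
    treating a and b as independent samples (large-sample z quantile)."""
    ma, mb = a.mean(), b.mean()
    va = a.var(ddof=1) / len(a)            # variance of mean(a)
    vb = b.var(ddof=1) / len(b)            # variance of mean(b)
    # Solve (ma - theta*mb)^2 <= z^2 (va + theta^2 vb) for theta,
    # i.e. the quadratic A*theta^2 - 2*B*theta + C <= 0:
    A = mb ** 2 - z ** 2 * vb
    B = ma * mb
    C = ma ** 2 - z ** 2 * va
    disc = B ** 2 - A * C
    if A <= 0 or disc < 0:                 # denominator mean too noisy
        return -np.inf, np.inf
    return (B - np.sqrt(disc)) / A, (B + np.sqrt(disc)) / A

# Two hypothetical samples of misclassification errors with equal means.
rng = np.random.default_rng(0)
err_a = 0.2 + 0.02 * rng.normal(size=100)
err_b = 0.2 + 0.02 * rng.normal(size=100)
lo, hi = fieller_ci(err_a, err_b)
```

Equivalence within the 10% margin would then correspond to the whole interval lying inside (0.9, 1.1).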
[Figure: Fieller confidence intervals for the performance ratios on the benchmark datasets (Breast Cancer, Diabetes, Glass, Glaucoma, Ionosphere, Ozone, Servo, Sonar, Soybean, Vehicle, Vowel), one panel per hypothesis.]
Summary
The separation of variable selection and split point estimation first implemented in
CHAID (Kass, 1980) is the basis for unbiased recursive partitioning for responses and
covariates measured at arbitrary scales.
The statistical internal stop criterion ensures that interpretations drawn from such trees
are valid in a statistical sense, i.e., with appropriate control of type I errors.
Even though the algorithm has no concept of prediction error, its performance is at least
equivalent to that of established procedures.
We are committed to reproducible research, see
R> vignette(package = "party")
References
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth,
California, 1984.
David W. Hosmer and Stanley Lemeshow. Applied Logistic Regression. John Wiley & Sons, New York,
2nd edition, 2000.
G. V. Kass. An exploratory technique for investigating large quantities of categorical data. Applied
Statistics, 29(2):119–127, 1980.
Wei-Yin Loh. Regression trees with unbiased variable selection and interaction detection. Statistica
Sinica, 12:361–386, 2002.
John Mingers. Expert systems: Rule induction with statistical data. Journal of the Operational Research
Society, 38(1):39–47, 1987.
James N. Morgan and John A. Sonquist. Problems in the analysis of survey data, and a proposal. Journal
of the American Statistical Association, 58:415–434, 1963.
John R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, California,
1993.
Helmut Strasser and Christian Weber. On the asymptotic theory of permutation statistics. Mathematical
Methods of Statistics, 8:220–250, 1999.
Allan P. White and Wei Zhong Liu. Bias in information-based measures in decision tree induction.
Machine Learning, 15:321–329, 1994.