An Introduction To Splines
James H. Steiger
Introduction
When transformation won’t linearize your model, the underlying function is complicated, and you
don’t have strong theoretical predictions about the nature of the X–Y regression
relationship, but you still want to characterize it, at least well enough to
predict new values, you may want to consider a generalized additive model (GAM).
A generalized additive model represents E(Y | X = x) as a weighted sum of smooth
functions of x.
We’ll briefly discuss two examples, polynomial regression and spline regression.
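As a minimal sketch of what fitting a GAM looks like in R (this uses the mgcv package and simulated data; the names x, y, and fit.gam are illustrative and not from the original notes):

> library(mgcv)
> set.seed(1)
> x <- runif(200)
> y <- sin(2*pi*x) + rnorm(200, 0, 0.3)
> fit.gam <- gam(y ~ s(x))   # s() requests a smooth function of x
> plot(fit.gam)              # display the estimated smooth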
Piecewise Regression
Nonlinear relationships between a predictor and a response can sometimes be difficult to fit
with a single parametric function or with a polynomial of “reasonable” degree, say between 2
and 5.
For example, you are already familiar with the UN data relating per capita GDP
to infant mortality rates per 1000 live births. We’ve seen before that these data are difficult to analyze
in their original form, but they can be linearized by log-transforming both the predictor and
the response.
Here are the original data from the car package.
> library(car)
> data(UN)
> attach(UN)
> plot(gdp, infant.mortality)
[Figure: scatterplot of infant.mortality versus gdp]
> plot(log(gdp),log(infant.mortality))
[Figure: scatterplot of log(infant.mortality) versus log(gdp)]
Here we fit the log-log model, then back-transform it to the original metric and plot the curve.
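A sketch of one way to do this (the object names m.loglog and gdp.grid are ours):

> m.loglog <- lm(log(infant.mortality) ~ log(gdp))
> plot(gdp, infant.mortality)
> gdp.grid <- seq(min(gdp, na.rm=TRUE), max(gdp, na.rm=TRUE), length=200)
> # back-transform the fit: exp() of the predicted log response
> lines(gdp.grid, exp(predict(m.loglog,
+       newdata=data.frame(gdp=gdp.grid))), col="red")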
[Figure: back-transformed log-log fit superimposed on the gdp–infant.mortality scatterplot]
This works quite a bit better than, say, fitting a polynomial of order 5, because polynomials
can be very unstable at their boundaries!
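For reference, a sketch of the degree-5 polynomial fit shown in the next figure (poly() cannot handle the missing values in UN, so we drop them first; the object names are ours):

> UN.c <- na.omit(UN)
> m.poly <- lm(infant.mortality ~ poly(gdp, 5), data=UN.c)
> plot(UN.c$gdp, UN.c$infant.mortality)
> o <- order(UN.c$gdp)
> lines(UN.c$gdp[o], fitted(m.poly)[o], col="red")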
[Figure: degree-5 polynomial fit superimposed on the gdp–infant.mortality scatterplot]
Another approach: define an indicator (dummy) variable and use it as a predictor, while also
allowing an interaction between this dummy predictor and gdp.
Placing the knot at 1750 (the value used in the fit reported below), we can express the model as

E(Y | X) = β0 + β1 X + β2 (X − 1750)+   (1)

The dummy variable (gdp > 1750) takes on the value 1 when gdp exceeds 1750, zero
otherwise; the product (gdp − 1750)(gdp > 1750) is the truncated term written (X − 1750)+ above.
You can see that for observations where gdp exceeds 1750, the model becomes

E(Y | X) = (β0 − 1750 β2) + (β1 + β2) X

so the slope changes from β1 to β1 + β2 at the knot, while the two line segments join continuously there.
More generally, a piecewise-linear (linear spline) model with knots a1, a2, . . . , ak can be written

E(Y | X) = β0 + β1 X + β2 (X − a1)+ + β3 (X − a2)+ + . . . + βk+1 (X − ak)+   (2)
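Fitting the single-knot model to the UN data (a sketch; the object name m.pw is ours, but the formula matches the output below):

> m.pw <- lm(infant.mortality ~ 1 + gdp + I((gdp - 1750) * (gdp > 1750)))
> summary(m.pw)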
Call:
lm(formula = infant.mortality ~ 1 + gdp + I((gdp - 1750) * (gdp >
1750)))
Residuals:
Min 1Q Median 3Q Max
-69.045 -11.923 -2.760 8.761 127.998
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 92.152745 4.061900 22.69 <2e-16 ***
gdp -0.037298 0.003347 -11.14 <2e-16 ***
I((gdp - 1750) * (gdp > 1750)) 0.036496 0.003474 10.51 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[Figure: piecewise-linear fit superimposed on the gdp–infant.mortality scatterplot]
Cubic Spline Regression
Cubic spline regression fits cubic functions that are joined at a series of k knots.
These functions will look really smooth if they have the same first and second derivatives
at the knots.
Such a system follows the form

E(Y | X) = β0 + β1 X + β2 X² + β3 X³ + β4 (X − a1)³+ + β5 (X − a2)³+ + . . . + βk+3 (X − ak)³+   (3)
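A minimal sketch of fitting equation (3) directly via its truncated-power basis (the helper tp() and the knot locations are ours, chosen only for illustration):

> tp <- function(x, a) ifelse(x > a, (x - a)^3, 0)   # the (x - a)^3+ term
> knots <- c(1000, 5000, 15000)
> m.cub <- lm(infant.mortality ~ gdp + I(gdp^2) + I(gdp^3) +
+             tp(gdp, knots[1]) + tp(gdp, knots[2]) + tp(gdp, knots[3]))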
With enough knots, cubic spline regression can work very well.
However, as with polynomial regression, the system sometimes behaves very poorly at the
outer ranges of X.
A solution to this problem is to restrict the two outer segments, below the lowest knot and
above the highest, to be straight lines; the result is known as a restricted or natural cubic spline.
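A sketch of both variants in R using the splines package (the df values and object names are ours; bs() builds the unrestricted cubic spline basis of equation (3), while ns() builds the restricted, natural version; neither function accepts missing values, so we drop them first):

> library(splines)
> UN.c <- na.omit(UN)
> m.bs <- lm(infant.mortality ~ bs(gdp, df=5), data=UN.c)   # cubic spline
> m.ns <- lm(infant.mortality ~ ns(gdp, df=5), data=UN.c)   # natural cubic spline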
In the following figures from Fox’s Applied Regression text, we see a progression of fits to
these data.
Finally, here is a small penalized-spline demonstration: we simulate noisy data around a
known smooth function and then recover it with the pspline package.
> library(pspline)
> n <- 100
> x <- (1:n)/n
> true <- ((exp(1.2*x)+1.5*sin(7*x))-1)/3   # known smooth mean function
> noise <- rnorm(n, 0, 0.15)
> y <- true + noise                         # observed = truth + noise
> fit <- smooth.Pspline(x, y, method=3)   # method=3: smoothing parameter chosen by GCV
> plot(x, y)
> lines(x, fit$ysmth, col="blue")          # penalized-spline fit
> curve(((exp(1.2*x)+1.5*sin(7*x))-1)/3,
+       0, 1, add=TRUE, col="red")         # true mean function