Linear Regression Analysis For STARDEX: Trend Calculation
Trend Calculation
The probability that our data (± some fixed $\Delta y$ at each point) occurred is the product of the probabilities at each point:

$$ P \propto \prod_{i=1}^{N} \exp\left[ -\frac{1}{2} \left( \frac{y_i - (a + b x_i)}{\sigma_i} \right)^2 \right] \Delta y $$
Maximising this is equivalent to minimising:

$$ \sum_i \left( \frac{y_i - (a + b x_i)}{\sigma_i} \right)^2 $$
If the standard deviation $\sigma_i$ at each point is the same, then this is equivalent to minimising:

$$ \sum_i \left( y_i - (a + b x_i) \right)^2 $$
Solving this by finding the $a$ and $b$ for which the partial derivatives with respect to $a$ and $b$ are zero gives the best-fit parameters for the regression constant and coefficient ($\alpha$ and $\beta$):

$$ \alpha = \frac{S_{xx} S_y - S_x S_{xy}}{\Delta}, \qquad \beta = \frac{N S_{xy} - S_x S_y}{\Delta} $$

where $\Delta = N S_{xx} - (S_x)^2$ and $S_x = \sum x_i$, $S_y = \sum y_i$, $S_{xy} = \sum x_i y_i$, $S_{xx} = \sum x_i^2$.
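As an illustration, the closed-form solution can be implemented directly from the sums above. The following Python sketch is ours (the function name is illustrative, not part of any STARDEX code):

import numpy as np

def least_squares_fit(x, y):
    # Closed-form least squares fit of y = a + b*x using the sums
    # S_x, S_y, S_xy and S_xx defined in the text.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(x)
    Sx, Sy = x.sum(), y.sum()
    Sxy, Sxx = (x * y).sum(), (x * x).sum()
    delta = N * Sxx - Sx**2            # the determinant Delta
    a = (Sxx * Sy - Sx * Sxy) / delta  # regression constant (alpha)
    b = (N * Sxy - Sx * Sy) / delta    # regression coefficient (beta)
    return a, b

In practice the sums should be accumulated in double precision, since $\Delta$ involves the difference of two large, nearly equal quantities when the $x_i$ are far from zero.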
A more resistant fit, less influenced by outlying points, is obtained by minimising the sum of absolute deviations:

$$ \sum_{i=1}^{N} \left| y_i - (a + b x_i) \right| $$

The solution to this needs to be found numerically. Example code can be found in Press et al. (1986), and resistant line-fitting methods are discussed in Hoaglin et al. (1983).
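As no closed form exists, a derivative-free minimisation is one simple option. The sketch below uses scipy.optimize rather than the routine in Press et al. (1986); it is a minimal illustration, and the function name is ours:

import numpy as np
from scipy.optimize import minimize

def least_absolute_deviations_fit(x, y):
    # Fit y = a + b*x by numerically minimising the sum of
    # absolute deviations (no closed-form solution exists).
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def total_abs_deviation(params):
        a, b = params
        return np.abs(y - (a + b * x)).sum()

    # Start from the least squares estimate and refine with a
    # derivative-free simplex search (the objective is not smooth).
    b0, a0 = np.polyfit(x, y, 1)
    result = minimize(total_abs_deviation, x0=[a0, b0], method="Nelder-Mead")
    return result.x  # (a, b)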
Logistic Regression
Linear regression has been generalised within the framework of generalised linear modelling, of which logistic regression is a special case. This method utilises the binomial distribution and can therefore be used to model counts of extreme events.
Often in a series, the variance of the residuals (from the linear model) varies with the magnitude of the data. This violates the assumption of least squares regression that the residuals have constant variance, but it is a natural feature of the binomial distribution and of logistic regression, so the data do not need to be normalised.
The logistic regression model expresses the probability $\pi$ of a success (e.g. an event above a particular threshold) as a function of time:

$$ \eta(\pi) = \alpha + \beta t $$
Since the probability of a success is in the range $[0, 1]$, it needs to be transformed to the range $(-\infty, \infty)$ using a link function:

$$ \eta(x) = \log\left( \frac{x}{1 - x} \right) $$
Solving for $\pi$ gives:

$$ \pi(t; \alpha, \beta) = \frac{e^{\alpha + \beta t}}{1 + e^{\alpha + \beta t}} $$
We are not fitting a straight line to the counts and therefore cannot refer to a single trend value. Instead, the odds ratio is used to express the relative change in the ratio of events to non-events over the period $(t_1, t_2)$:

$$ \Theta \equiv \frac{\pi(t_2) / (1 - \pi(t_2))}{\pi(t_1) / (1 - \pi(t_1))} = e^{\beta (t_2 - t_1)} $$
Model fitting can be done using a maximum likelihood method.
Further information about logistic regression, together with an example using extreme
precipitation in Switzerland, can be found in Frei and Schär (2001).
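As an illustration of the maximum likelihood fit mentioned above, the following Python sketch minimises the negative log-likelihood of the binomial (Bernoulli) model for a series of event/non-event indicators. It is a minimal sketch under our own naming, not the implementation used by Frei and Schär (2001):

import numpy as np
from scipy.optimize import minimize

def fit_logistic_trend(t, events):
    # Maximum likelihood estimates of alpha and beta in
    # pi(t) = exp(alpha + beta*t) / (1 + exp(alpha + beta*t)).
    t = np.asarray(t, dtype=float)
    k = np.asarray(events, dtype=float)  # 1 if an event occurred, else 0

    def neg_log_likelihood(params):
        alpha, beta = params
        eta = alpha + beta * t
        # Bernoulli log-likelihood: sum of k*eta - log(1 + e^eta),
        # with log(1 + e^eta) evaluated stably via logaddexp.
        return -(k * eta - np.logaddexp(0.0, eta)).sum()

    result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="BFGS")
    return result.x  # (alpha_hat, beta_hat)

The odds ratio over a period then follows directly from the fitted coefficient, e.g. np.exp(beta_hat * (t2 - t1)).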
Significance Testing
Under the least squares model, the variance of the residuals about the fitted line is estimated by:

$$ \hat{\sigma}^2 = \frac{\sum_i \left( y_i - (a + b x_i) \right)^2}{N - 2} $$

with $N - 2$ appearing in the denominator because two parameters are estimated.
From the above, it can be shown that the regression coefficient $b$ will be normally distributed with variance:

$$ \mathrm{Var}[b] = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2} $$
Since the variance of $b$ is estimated, Student's t-distribution is used to define the multiplier $t$ for the confidence limits for the regression coefficient:

$$ b = \beta \pm t \sqrt{\mathrm{Var}(b)} $$
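Combining the last three results, a minimal Python sketch (the function name is ours) for the confidence limits on the fitted coefficient:

import numpy as np
from scipy import stats

def slope_confidence_interval(x, y, level=0.95):
    # Student's-t confidence limits for the regression coefficient b.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(x)
    b, a = np.polyfit(x, y, 1)                  # least squares fit
    residuals = y - (a + b * x)
    sigma2 = (residuals**2).sum() / (N - 2)     # residual variance
    var_b = sigma2 / ((x - x.mean())**2).sum()  # Var[b]
    t_mult = stats.t.ppf(0.5 + level / 2.0, df=N - 2)
    half_width = t_mult * np.sqrt(var_b)
    return b - half_width, b + half_width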
The assumption that the residuals are normally distributed can be tested with a quantile-
quantile (Q-Q) plot of the residuals against the quantiles from a Gaussian distribution.
For further information, see Wilks (1995) or Press et al. (1986).
Linear Correlation
The linear correlation coefficient (Pearson product-moment coefficient of linear correlation)
is used widely to assess relationships between variables and has a close relationship to least
squares regression.
The correlation coefficient is defined by:

$$ r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} $$

i.e. the ratio of the covariance of $x$ and $y$ to the product of their standard deviations.
In a least squares linear model, the variance of the predictand can be partitioned into the variance of the regression line and the variance of the predictand around the line:
SST=SSR+SSE
Sum of Squares Total = Sum of Squares Regression + Sum of Squares Error
For a good linear relationship between the predictor and predictand, SSE will be much smaller than SSR, i.e. the spread of the points around the line will be much smaller than the variance of the line itself. This goodness of fit can be described by the coefficient of determination:
$$ R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} $$

i.e. the fraction of the variance of the predictand explained by the predictor.
It can be shown that the coefficient of determination is the same as the square of the correlation coefficient. The correlation coefficient can therefore be used to assess how well the linear model fits the data. Assessing the significance of a sample correlation is difficult, however, as its distribution under the null hypothesis (that the variables are not correlated) cannot be calculated in general. Most tables of significance use the approximation that, for a small number of points and normally distributed data, the following statistic is distributed under the null hypothesis like Student's t-distribution with $N - 2$ degrees of freedom:
$$ t = r \sqrt{\frac{N - 2}{1 - r^2}} $$
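A small Python sketch of this test (function name ours, assuming scipy is available), returning $r$, the $t$ statistic and a two-sided p-value:

import numpy as np
from scipy import stats

def correlation_significance(x, y):
    # Pearson r and the t statistic used to assess its significance
    # under the null hypothesis of no correlation.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(x)
    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt((N - 2) / (1.0 - r**2))
    # two-sided p-value from Student's t with N-2 degrees of freedom
    p = 2.0 * stats.t.sf(abs(t), df=N - 2)
    return r, t, p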
The common basis of the correlation coefficient and least squares linear regression means that they share the same shortcomings, such as limited resistance to outliers.
See Wilks (1995) or Press et al. (1986) for further information.
Resampling
Resampling procedures are used extensively by climatologists and could be used to assess the
significance of a linear trend. The bootstrap method involves randomly resampling data (with
replacement) to create new samples, from which the distribution of the null hypothesis can be
estimated. Therefore no assumption needs to be made about the sample distribution. If enough
random samples are generated, the significance of an observed linear trend can be assessed by
where it appears in the distribution of trends from the random samples.
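A minimal sketch of this procedure in Python (the resampling scheme shown, drawing data values with replacement to destroy any real time ordering, is one simple choice among several):

import numpy as np

def bootstrap_trend_significance(t, y, n_samples=2000, seed=0):
    # Build a null distribution of least squares slopes by resampling
    # the data with replacement, then locate the observed slope in it.
    rng = np.random.default_rng(seed)
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    observed_slope = np.polyfit(t, y, 1)[0]
    null_slopes = np.empty(n_samples)
    for i in range(n_samples):
        resampled = rng.choice(y, size=y.size, replace=True)
        null_slopes[i] = np.polyfit(t, resampled, 1)[0]
    # two-sided p-value: fraction of null slopes at least as extreme
    p = np.mean(np.abs(null_slopes) >= abs(observed_slope))
    return observed_slope, p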
A problem, however, is that the maximum likelihood derivation of the least squares estimate for the linear trend assumes that the residuals about the line are normally distributed. If the distribution of the residuals is not Gaussian, the least squares estimate is therefore not strictly valid. Bootstrapping could still be used to test the significance of a least squares linear trend, bearing in mind that this may not be the best trend estimate.
An important assumption in resampling is that observations are independent. Zwiers (1990)
showed that, for the case of assessing the significance of the difference in two sample means,
the presence of serial correlation greatly affected the results. A method has been proposed by
Ebisuzaki (1997) whereby random samples are taken in the frequency domain (with random
phase) to retain the serial correlation of the data in each sample.
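A minimal sketch of generating one such phase-randomised surrogate in Python (an illustration of the idea, not Ebisuzaki's exact implementation):

import numpy as np

def phase_randomised_surrogate(y, rng):
    # Surrogate series with the same power spectrum (and hence the same
    # serial correlation) as y, but with random Fourier phases.
    y = np.asarray(y, dtype=float)
    spectrum = np.fft.rfft(y)
    phases = rng.uniform(0.0, 2.0 * np.pi, spectrum.size)
    phases[0] = 0.0  # keep the zero-frequency (mean) component
    if y.size % 2 == 0:
        phases[-1] = 0.0  # keep the Nyquist component real for even lengths
    surrogate_spectrum = np.abs(spectrum) * np.exp(1j * phases)
    return np.fft.irfft(surrogate_spectrum, n=y.size)

Trends computed from many such surrogates give a null distribution that retains the serial correlation of the original series.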
References
Ebisuzaki, W., 1997: A method to estimate the statistical significance of a correlation when the data are serially correlated. J. Clim., 10, 2147-2153.
Frei, C. and C. Schär, 2001: Detection probability of trends in rare events: theory and application to heavy precipitation in the alpine region. J. Clim., 14, 1568-1584.
Hoaglin, D.C., F. Mosteller and J.W. Tukey, 1983: Understanding Robust and Exploratory Data Analysis. Wiley, 129-165.
Press, W.H., B.P. Flannery, S.A. Teukolsky and W.T. Vetterling, 1986: Numerical Recipes: The Art of Scientific Computing. Cambridge Univ. Press, 488-493.
Wilks, D.S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 160-176.
Zwiers, F.W., 1990: The effect of serial correlation on statistical inferences made with resampling procedures. J. Clim., 3, 1452-1461.