Linear Regression Analysis For STARDEX: Trend Calculation
Trend Calculation
The probability that our data (± some fixed $\Delta y$ at each point) occurred is the product of the probabilities at each point:

$$ P \propto \prod_{i=1}^{N} \exp\left[ -\frac{1}{2} \left( \frac{y_i - (a + b x_i)}{\sigma_i} \right)^2 \right] \Delta y $$
Maximising this is equivalent to minimising:

$$ \sum_i \left( \frac{y_i - (a + b x_i)}{\sigma_i} \right)^2 $$
If the standard deviation $\sigma_i$ at each point is the same, then this is equivalent to minimising:

$$ \sum_i \left( y_i - (a + b x_i) \right)^2 $$
Solving this by finding the $a$ and $b$ for which the partial derivatives with respect to $a$ and $b$ are zero gives the best-fit parameters for the regression constant and coefficient ($\alpha$ and $\beta$):

$$ \alpha = \frac{S_{xx} S_y - S_x S_{xy}}{\Delta}, \qquad \beta = \frac{N S_{xy} - S_x S_y}{\Delta} $$

where $\Delta = N S_{xx} - (S_x)^2$ and $S_x = \sum x_i$, $S_y = \sum y_i$, $S_{xy} = \sum x_i y_i$, $S_{xx} = \sum x_i^2$.
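As an illustration, the closed-form solution can be implemented directly from the sums above. The following Python sketch is ours (the function name is illustrative, not part of any STARDEX code):

import numpy as np

def least_squares_fit(x, y):
    # Closed-form least squares fit of y = a + b*x using the sums
    # S_x, S_y, S_xy and S_xx defined in the text.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(x)
    Sx, Sy = x.sum(), y.sum()
    Sxy, Sxx = (x * y).sum(), (x * x).sum()
    delta = N * Sxx - Sx**2            # the determinant Delta
    a = (Sxx * Sy - Sx * Sxy) / delta  # regression constant (alpha)
    b = (N * Sxy - Sx * Sy) / delta    # regression coefficient (beta)
    return a, b

In practice the sums should be accumulated in double precision, since $\Delta$ involves the difference of two large, nearly equal quantities when the $x_i$ are far from zero.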
A more resistant fit, less influenced by outlying points, is obtained by minimising the sum of absolute deviations:

$$ \sum_{i=1}^{N} \left| y_i - (a + b x_i) \right| $$

The solution to this needs to be found numerically. Example code can be found in Press et al. (1986), and resistant line-fitting methods are discussed in Hoaglin et al. (1983).
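As no closed form exists, a derivative-free minimisation is one simple option. The sketch below uses scipy.optimize rather than the routine in Press et al. (1986); it is a minimal illustration, and the function name is ours:

import numpy as np
from scipy.optimize import minimize

def least_absolute_deviations_fit(x, y):
    # Fit y = a + b*x by numerically minimising the sum of
    # absolute deviations (no closed-form solution exists).
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def total_abs_deviation(params):
        a, b = params
        return np.abs(y - (a + b * x)).sum()

    # Start from the least squares estimate and refine with a
    # derivative-free simplex search (the objective is not smooth).
    b0, a0 = np.polyfit(x, y, 1)
    result = minimize(total_abs_deviation, x0=[a0, b0], method="Nelder-Mead")
    return result.x  # (a, b)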
Logistic Regression
Linear regression has been generalised within the framework of generalised linear modelling, of which logistic regression is a special case. This method utilises the binomial distribution and can therefore be used to model counts of extreme events.
Often in a series, the variance of the residuals (from the linear model) varies with the magnitude of the data. This violates the assumption of least squares regression that the residuals have constant variance, but it is a natural feature of the binomial distribution and of logistic regression, so the data do not need to be normalised.
The logistic regression model expresses the probability $\pi$ of a success (e.g. an event above a particular threshold) as a function of time:

$$ \eta(\pi) = \alpha + \beta t $$
Since the probability of a success is in the range $[0, 1]$, it needs to be transformed to the range $(-\infty, \infty)$ using a link function:

$$ \eta(x) = \log\left( \frac{x}{1 - x} \right) $$
Solving for $\pi$ gives:

$$ \pi(t; \alpha, \beta) = \frac{e^{\alpha + \beta t}}{1 + e^{\alpha + \beta t}} $$
We are not fitting a straight line to the counts and therefore cannot refer to a single trend value. Instead, the odds ratio is used to express the relative change in the ratio of events to non-events over the period $(t_1, t_2)$:

$$ \Theta \equiv \frac{\pi(t_2) / (1 - \pi(t_2))}{\pi(t_1) / (1 - \pi(t_1))} = e^{\beta (t_2 - t_1)} $$
Model fitting can be done using a maximum likelihood method.
Further information about logistic regression, together with an example using extreme
precipitation in Switzerland, can be found in Frei and Schär (2001).
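As an illustration of the maximum likelihood fit mentioned above, the following Python sketch minimises the negative log-likelihood of the binomial (Bernoulli) model for a series of event/non-event indicators. It is a minimal sketch under our own naming, not the implementation used by Frei and Schär (2001):

import numpy as np
from scipy.optimize import minimize

def fit_logistic_trend(t, events):
    # Maximum likelihood estimates of alpha and beta in
    # pi(t) = exp(alpha + beta*t) / (1 + exp(alpha + beta*t)).
    t = np.asarray(t, dtype=float)
    k = np.asarray(events, dtype=float)  # 1 if an event occurred, else 0

    def neg_log_likelihood(params):
        alpha, beta = params
        eta = alpha + beta * t
        # Bernoulli log-likelihood: sum of k*eta - log(1 + e^eta),
        # with log(1 + e^eta) evaluated stably via logaddexp.
        return -(k * eta - np.logaddexp(0.0, eta)).sum()

    result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method="BFGS")
    return result.x  # (alpha_hat, beta_hat)

The odds ratio over a period then follows directly from the fitted coefficient, e.g. np.exp(beta_hat * (t2 - t1)).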
Significance Testing
Under the least squares model, the variance of the residuals about the fitted line is estimated by:

$$ \hat{\sigma}^2 = \frac{\sum_i \left( y_i - (a + b x_i) \right)^2}{N - 2} $$

with $N - 2$ appearing in the denominator because two parameters are estimated.
From the above, it can be shown that the regression coefficient $b$ will be normally distributed with variance:

$$ \mathrm{Var}[b] = \frac{\sigma^2}{\sum_i (x_i - \bar{x})^2} $$
Since the variance of $b$ is estimated, Student's t-distribution is used to define the multiplier $t$ for the confidence limits for the regression coefficient:

$$ b = \beta \pm t \sqrt{\mathrm{Var}(b)} $$
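Combining the last three results, a minimal Python sketch (the function name is ours) for the confidence limits on the fitted coefficient:

import numpy as np
from scipy import stats

def slope_confidence_interval(x, y, level=0.95):
    # Student's-t confidence limits for the regression coefficient b.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(x)
    b, a = np.polyfit(x, y, 1)                  # least squares fit
    residuals = y - (a + b * x)
    sigma2 = (residuals**2).sum() / (N - 2)     # residual variance
    var_b = sigma2 / ((x - x.mean())**2).sum()  # Var[b]
    t_mult = stats.t.ppf(0.5 + level / 2.0, df=N - 2)
    half_width = t_mult * np.sqrt(var_b)
    return b - half_width, b + half_width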
The assumption that the residuals are normally distributed can be tested with a quantile-
quantile (Q-Q) plot of the residuals against the quantiles from a Gaussian distribution.
For further information, see Wilks (1995) or Press et al. (1986).
Linear Correlation
The linear correlation coefficient (Pearson product-moment coefficient of linear correlation)
is used widely to assess relationships between variables and has a close relationship to least
squares regression.
The correlation coefficient is defined by:

$$ r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} $$

i.e. the ratio of the covariance of $x$ and $y$ to the product of their standard deviations.
In a least squares linear model, the variance of the predictand can be partitioned into the variance of the regression line and the variance of the predictand around the line:
SST=SSR+SSE
Sum of Squares Total = Sum of Squares Regression + Sum of Squares Error
For a good linear relationship between the predictor and predictand, SSE will be much smaller than SSR, i.e. the spread of the points around the line will be much smaller than the variance of the line itself. This goodness of fit can be described by the coefficient of determination:
$$ R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} $$

i.e. the fraction of the variance of the predictand explained by the predictor.
It can be shown that the coefficient of determination is the same as the square of the correlation coefficient. The correlation coefficient can therefore be used to assess how well the linear model fits the data. Assessing the significance of a sample correlation is difficult, however, as its distribution under the null hypothesis (that the variables are not correlated) cannot be calculated in general. Most tables of significance use the approximation that, for a small number of points and normally distributed data, the following statistic is distributed under the null hypothesis like Student's t-distribution with $N - 2$ degrees of freedom:
$$ t = r \sqrt{\frac{N - 2}{1 - r^2}} $$
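A small Python sketch of this test (function name ours, assuming scipy is available), returning $r$, the $t$ statistic and a two-sided p-value:

import numpy as np
from scipy import stats

def correlation_significance(x, y):
    # Pearson r and the t statistic used to assess its significance
    # under the null hypothesis of no correlation.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    N = len(x)
    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt((N - 2) / (1.0 - r**2))
    # two-sided p-value from Student's t with N-2 degrees of freedom
    p = 2.0 * stats.t.sf(abs(t), df=N - 2)
    return r, t, p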
The common basis of the correlation coefficient and least squares linear regression means that they share the same shortcomings, such as limited resistance to outliers.
See Wilks (1995) or Press et al. (1986) for further information.
Resampling
Resampling procedures are used extensively by climatologists and could be used to assess the
significance of a linear trend. The bootstrap method involves randomly resampling data (with
replacement) to create new samples, from which the distribution of the null hypothesis can be
estimated. Therefore no assumption needs to be made about the sample distribution. If enough
random samples are generated, the significance of an observed linear trend can be assessed by
where it appears in the distribution of trends from the random samples.
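A minimal sketch of this procedure in Python (the resampling scheme shown, drawing data values with replacement to destroy any real time ordering, is one simple choice among several):

import numpy as np

def bootstrap_trend_significance(t, y, n_samples=2000, seed=0):
    # Build a null distribution of least squares slopes by resampling
    # the data with replacement, then locate the observed slope in it.
    rng = np.random.default_rng(seed)
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    observed_slope = np.polyfit(t, y, 1)[0]
    null_slopes = np.empty(n_samples)
    for i in range(n_samples):
        resampled = rng.choice(y, size=y.size, replace=True)
        null_slopes[i] = np.polyfit(t, resampled, 1)[0]
    # two-sided p-value: fraction of null slopes at least as extreme
    p = np.mean(np.abs(null_slopes) >= abs(observed_slope))
    return observed_slope, p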
A problem, however, is that the maximum likelihood derivation of the least squares estimate for the linear trend assumes that the residuals about the line are normally distributed. If the distribution of the residuals is not Gaussian, the least squares estimate is therefore not strictly valid. Bootstrapping could still be used to test the significance of a least squares linear trend, bearing in mind that this may not be the best trend estimate.
An important assumption in resampling is that observations are independent. Zwiers (1990)
showed that, for the case of assessing the significance of the difference in two sample means,
the presence of serial correlation greatly affected the results. A method has been proposed by
Ebisuzaki (1997) whereby random samples are taken in the frequency domain (with random
phase) to retain the serial correlation of the data in each sample.
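A minimal sketch of generating one such phase-randomised surrogate in Python (an illustration of the idea, not Ebisuzaki's exact implementation):

import numpy as np

def phase_randomised_surrogate(y, rng):
    # Surrogate series with the same power spectrum (and hence the same
    # serial correlation) as y, but with random Fourier phases.
    y = np.asarray(y, dtype=float)
    spectrum = np.fft.rfft(y)
    phases = rng.uniform(0.0, 2.0 * np.pi, spectrum.size)
    phases[0] = 0.0  # keep the zero-frequency (mean) component
    if y.size % 2 == 0:
        phases[-1] = 0.0  # keep the Nyquist component real for even lengths
    surrogate_spectrum = np.abs(spectrum) * np.exp(1j * phases)
    return np.fft.irfft(surrogate_spectrum, n=y.size)

Trends computed from many such surrogates give a null distribution that retains the serial correlation of the original series.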
References
Ebisuzaki, W., 1997: A method to estimate the statistical significance of a correlation when the data are serially correlated. J. Clim., 10, 2147-2153.
Frei, C. and C. Schär, 2001: Detection probability of trends in rare events: theory and application to heavy precipitation in the alpine region. J. Clim., 14, 1568-1584.
Hoaglin, D.C., F. Mosteller and J.W. Tukey, 1983: Understanding Robust and Exploratory Data Analysis. Wiley, 129-165.
Press, W.H., B.P. Flannery, S.A. Teukolsky and W.T. Vetterling, 1986: Numerical Recipes: The Art of Scientific Computing. Cambridge Univ. Press, 488-493.
Wilks, D.S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, 160-176.
Zwiers, F.W., 1990: The effect of serial correlation on statistical inferences made with resampling procedures. J. Clim., 3, 1452-1461.