Autocorrelation
Autocorrelation refers to the degree of correlation between the values of the same
variables across different observations in the data. The concept of autocorrelation
is most often discussed in the context of time series data in which observations
occur at different points in time (e.g., air temperature measured on different days
of the month). For example, one might expect the air temperature on the 1st day of
the month to be more similar to the temperature on the 2nd day compared to the
31st day. If the temperature values that occurred closer together in time are, in
fact, more similar than the temperature values that occurred farther apart in time,
the data would be autocorrelated.
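As a brief, hedged illustration (the temperature values are made up), the lag-1 autocorrelation of such a series can be computed with pandas:
```python
# A minimal sketch (hypothetical temperature data) of measuring autocorrelation
import pandas as pd

# Hypothetical daily air temperatures over part of a month
temps = pd.Series([21.0, 21.4, 22.1, 22.0, 21.6, 20.9, 20.5, 20.8,
                   21.2, 21.9, 22.5, 23.0, 23.2, 22.8, 22.1, 21.5])

# Correlation of the series with itself shifted by one day (lag-1 autocorrelation)
lag1 = temps.autocorr(lag=1)
print(f"lag-1 autocorrelation: {lag1:.2f}")  # close to +1 -> adjacent days are similar
```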
Multicollinearity
Multicollinearity is a state of very high intercorrelations or inter-associations
among the independent variables. It is therefore a type of disturbance in the data;
if it is present, the statistical inferences made about the data may not be reliable.
There are certain signals which help the researcher to detect the degree of
multicollinearity.
One such signal is if the individual outcome of a statistic is not significant but the
overall outcome of the statistic is significant. In this instance, the researcher might
get a mix of significant and insignificant results that show the presence of
multicollinearity. Suppose the researcher, after dividing the sample into two parts,
finds that the coefficients of the sample differ drastically. This indicates the
presence of multicollinearity. This means that the coefficients are unstable due to
the presence of multicollinearity. Suppose the researcher observes drastic change
in the model by simply adding or dropping some variable. This also indicates that
multicollinearity is present in the data. Multicollinearity can also be detected
with the help of tolerance and its reciprocal, called variance inflation factor
(VIF). If the value of tolerance is less than 0.2 or 0.1 and, simultaneously, the
value of VIF is 10 or above, then the multicollinearity is problematic.
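As an illustration, the hedged sketch below computes VIF and tolerance for a small simulated data set with statsmodels; the variable names and data are assumptions, not part of the notes:
```python
# Sketch: tolerance and VIF for hypothetical predictors x1, x2, x3
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools import add_constant

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 * 0.9 + rng.normal(scale=0.3, size=100)   # deliberately collinear with x1
x3 = rng.normal(size=100)
X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")
# A VIF of 10 or above (tolerance of 0.1 or below) would flag problematic multicollinearity.
```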
PREDICTED AND RESIDUAL SCORES
The regression line expresses the best prediction of the dependent variable (Y),
given the independent variables (X). However, nature is rarely (if ever) perfectly
predictable, and usually there is substantial variation of the observed points around
the fitted regression line (as in the scatterplot shown earlier). The deviation of a
particular point from the regression line (its predicted value) is called
the residual value.
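A minimal sketch of predicted values and residuals for a least-squares line, using made-up x and y values:
```python
# Sketch: predicted values and residuals around a fitted regression line
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # hypothetical predictor
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3])      # hypothetical outcome

slope, intercept = np.polyfit(x, y, deg=1)    # least-squares fit
predicted = intercept + slope * x
residuals = y - predicted                     # deviation of each point from the line
print(residuals)
```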
The graph allows you to evaluate the normality of the empirical distribution
because it also shows the normal curve superimposed over the histogram. It also
allows you to examine various aspects of the distribution qualitatively. For
example, the distribution could be bimodal (have 2 peaks). This might suggest that
the sample is not homogeneous but possibly its elements came from two different
populations, each more or less normally distributed. In such cases, in order to
understand the nature of the variable in question, you should look for a way to
quantitatively identify the two sub-samples.
Correlations
Purpose (What is Correlation?) Correlation is a measure of the relation between
two or more variables. The measurement scales used should be at least interval
scales, but other correlation coefficients are available to handle other types of data.
Correlation coefficients can range from -1.00 to +1.00. The value of -1.00
represents a perfect negative correlation while a value of +1.00 represents a
perfect positive correlation. A value of 0.00 represents a lack of correlation.
Typically, we believe that outliers represent a random error that we would like to
be able to control. Unfortunately, there is no widely accepted method to remove
outliers automatically (however, see the next paragraph), thus what we are left with
is to identify any outliers by examining a scatterplot of each important correlation.
Needless to say, outliers may not only artificially increase the value of a
correlation coefficient, but they can also decrease the value of a "legitimate"
correlation.
See also Confidence Ellipse.
Quantitative Approach to Outliers. Some researchers use quantitative methods
to exclude outliers. For example, they exclude observations that are outside the
range of ±2 standard deviations (or even ±1.5 sd's) around the group or design cell
mean. In some areas of research, such "cleaning" of the data is absolutely
necessary. For example, in cognitive psychology research on reaction times, even
if almost all scores in an experiment are in the range of 300-700 milliseconds, just
a few "distracted reactions" of 10-15 seconds will completely change the overall
picture. Unfortunately, defining an outlier is subjective (as it should be), and the
decisions concerning how to identify them must be made on an individual basis
(taking into account specific experimental paradigms and/or "accepted practice"
and general research experience in the respective area). It should also be noted that
in some rare cases, the relative frequency of outliers across a number of groups or
cells of a design can be subjected to analysis and provide interpretable results. For
example, outliers could be indicative of the occurrence of a phenomenon that is
qualitatively different than the typical pattern observed or expected in the sample,
thus the relative frequency of outliers could provide evidence of a relative
frequency of departure from the process or phenomenon that is typical for the
majority of cases in a group. See also Confidence Ellipse.
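A minimal sketch of such a quantitative cleaning rule, assuming made-up reaction times in milliseconds:
```python
# Sketch: excluding observations outside +/- 2 standard deviations of the mean
import numpy as np

rt = np.array([412, 385, 530, 467, 501, 450, 478, 395, 440, 12500])  # ms, one "distracted" reaction
mean, sd = rt.mean(), rt.std(ddof=1)
keep = np.abs(rt - mean) <= 2 * sd
print("excluded:", rt[~keep])   # the 12.5-second distracted reaction
print("kept:", rt[keep])
```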
Correlations in Non-homogeneous Groups. A lack of homogeneity in the sample
from which a correlation was calculated can be another factor that biases the value
of the correlation. Imagine a case where a correlation coefficient is calculated from
data points which came from two different experimental groups but this fact is
ignored when the correlation is calculated. Let us assume that the experimental
manipulation in one of the groups increased the values of both correlated variables
and thus the data from each group form a distinctive "cloud" in the scatterplot (as
shown in the graph below).
In such cases, a high correlation may result that is entirely due to the arrangement
of the two groups, but which does not represent the "true" relation between the two
variables, which may practically be equal to 0 (as could be seen if we looked at
each group separately, see the following graph).
If you suspect the influence of such a phenomenon on your correlations and know
how to identify such "subsets" of data, try to run the correlations separately in each
subset of observations. If you do not know how to identify the hypothetical
subsets, try to examine the data with some exploratory multivariate techniques
(e.g., Cluster Analysis).
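A hedged sketch of this phenomenon with simulated data: the pooled correlation is high while the within-group correlations are near zero.
```python
# Sketch: a high pooled correlation driven entirely by a group difference
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
g1 = pd.DataFrame({"x": rng.normal(0, 1, 100), "y": rng.normal(0, 1, 100), "group": "A"})
g2 = pd.DataFrame({"x": rng.normal(5, 1, 100), "y": rng.normal(5, 1, 100), "group": "B"})
data = pd.concat([g1, g2], ignore_index=True)

print("pooled r:", round(data["x"].corr(data["y"]), 2))     # high (roughly 0.8-0.9 here)
for name, d in data.groupby("group"):
    print(f"r within group {name}:", round(d["x"].corr(d["y"]), 2))  # each close to 0
```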
Nonlinear Relations between Variables. Another potential source of problems
with the linear (Pearson r) correlation is the shape of the relation. As mentioned
before, Pearson r measures a relation between two variables only to the extent to
which it is linear; deviations from linearity will increase the total sum of squared
distances from the regression line even if they represent a "true" and very close
relationship between two variables. The possibility of such non-linear relationships
is another reason why examining scatterplots is a necessary step in evaluating
every correlation. For example, the following graph demonstrates an extremely
strong correlation between the two variables which is not well described by the
linear function.
Measuring Nonlinear Relations. What do you do if a correlation is strong but
clearly nonlinear (as concluded from examining scatterplots)? Unfortunately, there
is no simple answer to this question, because there is no easy-to-use equivalent
of Pearson r that is capable of handling nonlinear relations. If the curve is
monotonous (continuously decreasing or increasing) you could try to transform
one or both of the variables to remove the curvilinearity and then recalculate the
correlation. For example, a typical transformation used in such cases is the
logarithmic function which will "squeeze" together the values at one end of the
range. Another option available if the relation is monotonous is to try a
nonparametric correlation (e.g., Spearman R, see Nonparametrics and Distribution
Fitting) which is sensitive only to the ordinal arrangement of values, thus, by
definition, it ignores monotonous curvilinearity. However, nonparametric
correlations are generally less sensitive and sometimes this method will not
produce any gains. Unfortunately, the two most precise methods are not easy to use
and require a good deal of "experimentation" with the data. Therefore you could:
1. Try to identify the specific function that best describes the curve. After a
function has been found, you can test its "goodness-of-fit" to your data.
2. Alternatively, you could experiment with dividing one of the variables into a
number of segments (e.g., 4 or 5) of an equal width, treat this new variable as
a grouping variable and run an analysis of variance on the data.
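As a hedged illustration of the transformation and nonparametric options discussed above (made-up data, not from the notes), the sketch below compares Pearson r and Spearman R on a monotonic but nonlinear relation and shows the effect of a log transformation:
```python
# Sketch: Pearson vs. Spearman on a monotonic, nonlinear relation (made-up data)
import numpy as np
from scipy import stats

x = np.linspace(1, 100, 50)
y = np.exp(x / 20.0)              # strongly related to x, but not linearly

print("Pearson r :", round(stats.pearsonr(x, y)[0], 3))    # understates the relation
print("Spearman R:", round(stats.spearmanr(x, y)[0], 3))   # 1.0, the ordinal relation is perfect
print("Pearson r after log transform:", round(stats.pearsonr(x, np.log(y))[0], 3))  # 1.0
```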
Exploratory Examination of Correlation Matrices. A common first step of
many data analyses that involve more than a very few variables is to run a
correlation matrix of all variables and then examine it for expected (and
unexpected) significant relations. When this is done, you need to be aware of the
general nature of statistical significance (see Elementary Concepts); specifically, if
you run many tests (in this case, many correlations), then significant results will be
found "surprisingly often" due to pure chance. For example, by definition, a
coefficient significant at the .05 level will occur by chance once in every 20
coefficients. There is no "automatic" way to weed out the "true" correlations. Thus,
you should treat all results that were not predicted or planned with particular
caution and look for their consistency with other results; ultimately, though, the
most conclusive (although costly) control for such a randomness factor is to
replicate the study. This issue is general and it pertains to all analyses that involve
"multiple comparisons and statistical significance." This problem is also briefly
discussed in the context of post-hoc comparisons of means and
the Breakdowns option.
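Purely as an illustration of this "chance significance" point, the sketch below simulates 20 unrelated variables and counts how many of the 190 pairwise correlations come out "significant" at the .05 level; the numbers are simulated, not from any real study.
```python
# Sketch: "significant" correlations arising by chance among unrelated variables
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_vars, n_obs = 20, 50
data = rng.normal(size=(n_obs, n_vars))   # 20 mutually independent variables

significant, total = 0, 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        total += 1
        significant += (p < 0.05)

print(f"{significant} of {total} correlations 'significant' at .05 by chance alone")
# Roughly 5% are expected to be "significant" even though no true relations exist.
```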
Casewise vs. Pairwise Deletion of Missing Data. The default way of deleting
missing data while calculating a correlation matrix is to exclude all cases that have
missing data in at least one of the selected variables; that is, by casewise
deletion of missing data. Only this way will you get a "true" correlation matrix,
where all correlations are obtained from the same set of observations. However, if
missing data are randomly distributed across cases, you could easily end up with
no "valid" cases in the data set, because each of them will have at least one missing
data in some variable. The most common solution used in such instances is to use
so-called pairwise deletion of missing data in correlation matrices, where a
correlation between each pair of variables is calculated from all cases that have
valid data on those two variables. In many instances there is nothing wrong with
that method, especially when the total percentage of missing data is low, say 10%,
and they are relatively randomly distributed between cases and variables.
However, it may sometimes lead to serious problems.
For example, a systematic bias may result from a "hidden" systematic distribution
of missing data, causing different correlation coefficients in the same correlation
matrix to be based on different subsets of subjects. In addition to the possibly
biased conclusions that you could derive from such "pairwise calculated"
correlation matrices, real problems may occur when you subject such matrices to
another analysis (e.g., multiple regression, factor analysis, or cluster analysis) that
expects a "true correlation matrix," with a certain level of consistency and
"transitivity" between different coefficients. Thus, if you are using the pairwise
method of deleting the missing data, be sure to examine the distribution of missing
data across the cells of the matrix for possible systematic "patterns."
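A hedged sketch of the two deletion strategies with pandas (the small data frame and its missing values are invented): pandas computes correlations pairwise by default, while dropping incomplete rows first corresponds to casewise deletion.
```python
# Sketch: pairwise vs. casewise deletion of missing data in a correlation matrix
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "B": [2.0, np.nan, 3.0, 5.0, 4.0, 7.0],
    "C": [1.5, 2.5, 3.0, np.nan, 5.5, 6.0],
})

# Pairwise deletion: each correlation uses all cases valid for that pair (pandas default)
print(df.corr())

# Casewise (listwise) deletion: only complete cases enter every correlation
print(df.dropna().corr())
```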
How to Identify Biases Caused by Pairwise Deletion of Missing Data. If the
pairwise deletion of missing data does not introduce any systematic
bias to the correlation matrix, then all those pairwise descriptive statistics for one
variable should be very similar. However, if they differ, then there are good
reasons to suspect a bias. For example, if the mean (or standard deviation) of the
values of variable A that were taken into account in calculating its correlation with
variable B is much lower than the mean (or standard deviation) of those values of
variable A that were used in calculating its correlation with variable C, then we
would have good reason to suspect that those two correlations (A-B and A-C) are
based on different subsets of data, and thus, that there is a bias in the correlation
matrix caused by a non-random distribution of missing data.
Pairwise Deletion of Missing Data vs. Mean Substitution. Another common
method to avoid losing data due to casewise deletion is the so-called mean
substitution of missing data (replacing all missing data in a variable by the mean of
that variable). Mean substitution offers some advantages and some disadvantages
as compared to pairwise deletion. Its main advantage is that it produces "internally
consistent" sets of results ("true" correlation matrices). The main disadvantages
are:
1. Mean substitution artificially decreases the variation of scores, and this
decrease in individual variables is proportional to the number of missing data
(i.e., the more missing data, the more "perfectly average scores" will be
artificially added to the data set).
2. Because it substitutes missing data with artificially created "average" data
points, mean substitution may considerably change the values of correlations.
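A minimal sketch of mean substitution with pandas (invented values), which also shows the artificial decrease in variation described in point 1:
```python
# Sketch: mean substitution of missing data and its effect on the variation of scores
import numpy as np
import pandas as pd

a = pd.Series([10.0, 12.0, np.nan, 15.0, np.nan, 18.0, 20.0, np.nan])

filled = a.fillna(a.mean())          # replace missing values by the variable's mean
print("sd before substitution:", round(a.std(), 2))       # uses available cases only
print("sd after substitution :", round(filled.std(), 2))  # artificially smaller
```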
Spurious Correlations. Although you cannot prove causal relations based on
correlation coefficients (see Elementary Concepts), you can still identify so-
called spurious correlations; that is, correlations that are due mostly to the
influences of "other" variables. For example, there is a correlation between the
total amount of losses in a fire and the number of firemen that were putting out the
fire; however, what this correlation does not indicate is that if you call fewer
firemen then you would lower the losses. There is a third variable (the
initial size of the fire) that influences both the amount of losses and the number of
firemen. If you "control" for this variable (e.g., consider only fires of a fixed size),
then the correlation will either disappear or perhaps even change its sign. The main
problem with spurious correlations is that we typically do not know what the
"hidden" agent is. However, in cases when we know where to look, we can
use partial correlations that control for (partial out) the influence of specified
variables.
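Assuming simulated data for the fire example (fire size, number of firemen, losses), the sketch below computes a partial correlation by correlating the residuals left after regressing each variable on the control variable; this is one standard way to partial out a third variable, not necessarily the notes' own procedure.
```python
# Sketch: partial correlation of losses and firemen, controlling for fire size
import numpy as np

rng = np.random.default_rng(7)
size = rng.uniform(1, 10, 200)                    # the hidden "third variable"
firemen = 2 * size + rng.normal(0, 1, 200)
losses = 5 * size + rng.normal(0, 2, 200)

def residuals(y, x):
    # residuals of y after a least-squares regression on x
    slope, intercept = np.polyfit(x, y, 1)
    return y - (intercept + slope * x)

raw_r = np.corrcoef(losses, firemen)[0, 1]
partial_r = np.corrcoef(residuals(losses, size), residuals(firemen, size))[0, 1]
print(f"raw r = {raw_r:.2f}, partial r (controlling for size) = {partial_r:.2f}")
# The raw correlation is large; the partial correlation is close to zero.
```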
Are correlation coefficients "additive?" No, they are not. For example, an
average of correlation coefficients in a number of samples does not represent an
"average correlation" in all those samples. Because the value of the correlation
coefficient is not a linear function of the magnitude of the relation between the
variables, correlation coefficients cannot simply be averaged. In cases when you
need to average correlations, they first have to be converted into additive measures.
For example, before averaging, you can square them to obtain coefficients of
determination, which are additive (as explained before in this section), or convert
them into so-called Fisher z values, which are also additive.
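A short hedged sketch of the Fisher z route (the three correlation values are arbitrary):
```python
# Sketch: averaging correlation coefficients via the Fisher z transformation
import numpy as np

rs = np.array([0.30, 0.55, 0.80])        # correlations from several samples
zs = np.arctanh(rs)                      # Fisher z values are (approximately) additive
mean_r = np.tanh(zs.mean())              # back-transform the averaged z
print(f"naive mean r = {rs.mean():.3f}, Fisher-z averaged r = {mean_r:.3f}")
```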
How to Determine Whether Two Correlation Coefficients are Significantly Different. A
test is available that will evaluate the significance of differences between two
correlation coefficients in two samples. The outcome of this test depends not only
on the size of the raw difference between the two coefficients but also on the size
of the samples and on the size of the coefficients themselves. Consistent with the
previously discussed principle, the larger the sample size, the smaller the effect
that can be proven significant in that sample. In general, due to the fact that the
reliability of the correlation coefficient increases with its absolute value, relatively
small differences between large correlation coefficients can be significant. For
example, a difference of .10 between two correlations may not be significant if the
two coefficients are .15 and .25, although in the same sample, the same difference
of .10 can be highly significant if the two coefficients are .80 and .90.
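One standard version of such a test is the Fisher z test for two independent correlations; the hedged sketch below (the sample sizes are assumptions) reproduces the .15 vs. .25 and .80 vs. .90 comparisons from the text.
```python
# Sketch: testing the difference between two independent correlation coefficients
import numpy as np
from scipy import stats

def compare_correlations(r1, n1, r2, n2):
    # Fisher z transform each r, then compare with a normal test statistic
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * stats.norm.sf(abs(z))        # two-tailed p-value
    return z, p

print(compare_correlations(0.15, 100, 0.25, 100))   # same .10 difference, small rs: not significant
print(compare_correlations(0.80, 100, 0.90, 100))   # same .10 difference, large rs: significant
```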
t-Test for Independent Samples
Purpose, Assumptions. The t-test is the most commonly used method to evaluate
the differences in means between two groups. For example, the t-test can be used
to test for a difference in test scores between a group of patients who were given a
drug and a control group who received a placebo. Theoretically, the t-test can be
used even if the sample sizes are very small (e.g., as small as 10; some researchers
claim that even smaller n's are possible), as long as the variables are normally
distributed within each group and the variation of scores in the two groups is not
reliably different (see also Elementary Concepts). As mentioned before, the
normality assumption can be evaluated by looking at the distribution of the data
(via histograms) or by performing a normality test. The equality of variances
assumption can be verified with the F test, or you can use the more robust Levene's
test. If these conditions are not met, then you can evaluate the differences in means
between two groups using one of the nonparametric alternatives to the t-test
(see Nonparametrics and Distribution Fitting).
The p-level reported with a t-test represents the probability of error involved in
accepting our research hypothesis about the existence of a difference. Technically
speaking, this is the probability of error associated with rejecting the hypothesis of
no difference between the two categories of observations (corresponding to the
groups) in the population when, in fact, the hypothesis is true. Some researchers
suggest that if the difference is in the predicted direction, you can consider only
one half (one "tail") of the probability distribution and thus divide the standard p-
level reported with a t-test (a "two-tailed" probability) by two. Others, however,
suggest that you should always report the standard, two-tailed t-test probability.
See also, Student's t Distribution.
Arrangement of Data. In order to perform the t-test for independent samples, one
independent (grouping) variable (e.g., Gender: male/female) and at least one
dependent variable (e.g., a test score) are required. The means of the dependent
variable will be compared between selected groups based on the specified values
(e.g., male and female) of the independent variable. The following data set can be
analyzed with a t-test comparing the average WCC score in males and females.
         GENDER   WCC
case 1   male     111
case 2   male     110
case 3   male     109
case 4   female   102
case 5   female   104

mean WCC in males = 110
mean WCC in females = 103
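A hedged sketch running this comparison with scipy on the tiny WCC data set above (such a small n is for illustration only):
```python
# Sketch: independent-samples t-test on the WCC data shown above
from scipy import stats

males = [111, 110, 109]
females = [102, 104]

# Levene's test for the equality-of-variances assumption
print(stats.levene(males, females))

# Two-sided t-test assuming equal variances
t, p = stats.ttest_ind(males, females)
print(f"t = {t:.2f}, two-tailed p = {p:.4f}")
```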
t-Test for Dependent Samples
Within-group Variation. As explained in Elementary Concepts, the size of a
relation between two variables, such as the one measured by a difference in means
between two groups, depends to a large extent on the differentiation of
values within the group. Depending on how differentiated the values are in each
group, a given "raw difference" in group means will indicate either a stronger or
weaker relationship between the independent (grouping) and dependent variable.
For example, if the mean WCC (White Cell Count) was 102 in males and 104 in
females, then this difference of "only" 2 points would be extremely important if all
values for males fell within a range of 101 to 103, and all scores for females fell
within a range of 103 to 105; for example, we would be able to predict WCC pretty
well based on gender. However, if the same difference of 2 was obtained from very
differentiated scores (e.g., if their range was 0-200), then we would consider the
difference entirely negligible. That is to say, reduction of the within-group
variation increases the sensitivity of our test.
Purpose. The t-test for dependent samples helps us to take advantage of one
specific type of design in which an important source of within-group variation (or
so-called, error) can be easily identified and excluded from the analysis.
Specifically, if two groups of observations (that are to be compared) are based on
the same sample of subjects who were tested twice (e.g., before and after a
treatment), then a considerable part of the within-group variation in both groups of
scores can be attributed to the initial individual differences between subjects. Note
that, in a sense, this fact is not much different than in cases when the two groups
are entirely independent (see t-test for independent samples), where individual
differences also contribute to the error variance; but in the case of independent
samples, we cannot do anything about it because we cannot identify (or "subtract")
the variation due to individual differences in subjects. However, if the same sample
was tested twice, then we can easily identify (or "subtract") this variation.
Specifically, instead of treating each group separately, and analyzing raw scores,
we can look only at the differences between the two measures (e.g., "pre-test" and
"post test") in each subject. By subtracting the first score from the second for each
subject and then analyzing only those "pure (paired) differences," we will exclude
the entire part of the variation in our data set that results from unequal base levels
of individual subjects. This is precisely what is being done in the t-test for
dependent samples, and, as compared to the t-test for independent samples, it
always produces "better" results (i.e., it is always more sensitive).
Assumptions. The theoretical assumptions of the t-test for independent
samples also apply to the dependent samples test; that is, the paired differences
should be normally distributed. If these assumptions are clearly not met, then one
of the nonparametric alternative tests should be used.
See also, Student's t Distribution.
Arrangement of Data. Technically, we can apply the t-test for dependent samples
to any two variables in our data set. However, applying this test will make very
little sense if the values of the two variables in the data set are not logically and
methodologically comparable. For example, if you compare the average WCC in a
sample of patients before and after a treatment, but using a different counting
method or different units in the second measurement, then a highly significant t-
test value could be obtained due to an artifact; that is, to the change of units of
measurement. Following, is an example of a data set that can be analyzed using
the t-test for dependent samples.
         WCC before   WCC after
case 1   111.9        113
case 2   109          110
case 3   143          144
case 4   101          102
case 5   80           80.9
...      ...          ...

average change between WCC "before" and "after" = 1
The average difference between the two conditions is relatively small (d=1) as
compared to the differentiation (range) of the raw scores (from 80 to 143, in the
first sample). However, the t-test for dependent samples analysis is performed only
on the paired differences, "ignoring" the raw scores and their potential
differentiation. Thus, the size of this particular difference of 1 will be compared
not to the differentiation of raw scores but to the differentiation of the individual
difference scores, which is relatively small: 0.2 (from 0.9 to 1.1). Compared to that
variability, the difference of 1 is extremely large and can yield a highly
significant t value.
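A hedged sketch of the paired test with scipy on the before/after values listed above:
```python
# Sketch: t-test for dependent (paired) samples on the before/after WCC values
from scipy import stats

before = [111.9, 109, 143, 101, 80]
after = [113, 110, 144, 102, 80.9]

t, p = stats.ttest_rel(after, before)
print(f"t = {t:.2f}, two-tailed p = {p:.6f}")
# The test is computed on the paired differences (1.1, 1, 1, 1, 0.9),
# so the small mean change of 1 is large relative to their tiny spread.
```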
Matrices of t-tests. t-tests for dependent samples can be calculated for long lists of
variables, and reviewed in the form of matrices produced
with casewise or pairwise deletion of missing data, much like the correlation
matrices. Thus, the precautions discussed in the context of correlations also apply
to t-test matrices; see:
1. the issue of artifacts caused by the pairwise deletion of missing data in t-
tests and
2. the issue of "randomly" significant test values.
More Complex Group Comparisons. If there are more than two "correlated
samples" (e.g., before treatment, after treatment 1, and after treatment 2), then
analysis of variance with repeated measures should be used. The repeated
measures ANOVA can be considered a generalization of the t-test for dependent
samples and it offers various features that increase the overall sensitivity of the
analysis. For example, it can simultaneously control not only for the base level of
the dependent variable, but it can control for other factors and/or include in the
design more than one interrelated dependent variable (MANOVA; for additional
details, refer to ANOVA/MANOVA).
ANOVA
Contents:
1. The ANOVA Test
2. One Way ANOVA
3. Two Way ANOVA
4. What is MANOVA?
5. What is Factorial ANOVA?
6. How to run an ANOVA
7. ANOVA vs. T Test
8. Repeated Measures ANOVA
9. Sphericity
10. Related Articles
The ANOVA Test
An ANOVA test is a way to find out if survey or experiment results are significant.
In other words, it helps you to figure out if you need to reject the null
hypothesis or accept the alternate hypothesis. Basically, you’re testing groups to
see if there’s a difference between them. Examples of when you might want to test
different groups:
A group of psychiatric patients are trying three different therapies:
counseling, medication and biofeedback. You want to see if one therapy is
better than the others.
A manufacturer has two different processes to make light bulbs. They want
to know if one process is better than the other.
Students from different colleges take the same exam. You want to see if one
college outperforms the other.
What Does “One-Way” or “Two-Way” Mean?
One-way or two-way refers to the number of independent variables (IVs) in your
Analysis of Variance test. One-way has one independent variable (with 2 levels)
and two-way has two independent variables (can have multiple levels). For
example, a one-way Analysis of Variance could have one IV (brand of cereal) and
a two-way Analysis of Variance has two IVs (brand of cereal, calories).
What are “Groups” or “Levels”?
Groups or levels are different groups in the same independent variable. In the
above example, your levels for “brand of cereal” might be Lucky Charms, Raisin
Bran, Cornflakes — a total of three levels. Your levels for “Calories” might be:
sweetened, unsweetened — a total of two levels.
Let’s say you are studying if Alcoholics Anonymous and individual counseling
combined is the most effective treatment for lowering alcohol consumption. You
might split the study participants into three groups or levels: medication only,
medication and counseling, and counseling only. Your dependent variable would
be the number of alcoholic beverages consumed per day.
If your groups or levels have a hierarchical structure (each level has unique
subgroups), then use a nested ANOVA for the analysis.
What Does “Replication” Mean?
It’s whether you are replicating your test(s) with multiple groups. With a two way
ANOVA with replication, you have two groups and individuals within that group
are doing more than one thing (i.e. two groups of students from two colleges taking
two tests). If you only have one group taking two tests, you would use without
replication.
Types of Tests.
There are two main types: one-way and two-way. Two-way tests can be with or
without replication.
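As a hedged sketch, a one-way ANOVA on made-up scores for the three therapy groups mentioned earlier could look like this with scipy:
```python
# Sketch: one-way ANOVA comparing three (hypothetical) therapy groups
from scipy import stats

counseling  = [24, 27, 22, 26, 25]
medication  = [30, 31, 28, 33, 29]
biofeedback = [25, 24, 26, 23, 27]

f, p = stats.f_oneway(counseling, medication, biofeedback)
print(f"F = {f:.2f}, p = {p:.4f}")
# A small p suggests at least one group mean differs; follow up with post-hoc tests.
```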
MULTIPLE REGRESSION
Read the results. As well as creating a regression graph, Minitab will give you
values for S, R-sq and R-sq(adj) in the top right corner of the fitted line plot
window.
S = standard error of the regression (the standard error of the estimate).
R-Sq = Coefficient of Determination
R-Sq(adj) = Adjusted Coefficient of Determination (Adjusted R Squared).
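The notes refer to Minitab output; a roughly equivalent hedged sketch with statsmodels (simulated data, not a Minitab session) prints the same three quantities:
```python
# Sketch: fitting a regression and reading S, R-Sq and R-Sq(adj) analogues in statsmodels
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 60)
y = 2.0 + 1.5 * x + rng.normal(0, 2, 60)

model = sm.OLS(y, sm.add_constant(x)).fit()

print("S (standard error of the regression):", round(np.sqrt(model.mse_resid), 3))
print("R-Sq:", round(model.rsquared, 3))
print("R-Sq(adj):", round(model.rsquared_adj, 3))
```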
AUTOCORRELATION TESTS
What Is the Durbin Watson Statistic?
The Durbin Watson (DW) statistic is a test for autocorrelation in the residuals from
a statistical regression analysis. The Durbin-Watson statistic will always have a
value between 0 and 4. A value of 2.0 means that there is no autocorrelation
detected in the sample. Values from 0 to less than 2 indicate positive
autocorrelation and values from 2 to 4 indicate negative autocorrelation.
A stock price displaying positive autocorrelation would indicate that the price
yesterday is positively correlated with the price today, so if the stock fell
yesterday, it is also likely that it falls today. A security that has a negative
autocorrelation, on the other hand, has a negative influence on itself over time, so
that if it fell yesterday, there is a greater likelihood it will rise today.
KEY TAKEAWAYS
2 is no autocorrelation.
0 to <2 is positive autocorrelation (common in time series data).
>2 to 4 is negative autocorrelation (less common in time series data).
A rule of thumb is that test statistic values in the range of 1.5 to 2.5 are
relatively normal. Values outside of this range could be cause for concern.
Field (2009) suggests that values under 1 or more than 3 are a definite cause
for concern.
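A hedged sketch computing the Durbin-Watson statistic on the residuals of a fitted regression with statsmodels (simulated data):
```python
# Sketch: Durbin-Watson statistic on the residuals of a fitted regression
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(5)
x = np.arange(100, dtype=float)
y = 0.5 * x + rng.normal(0, 3, 100)

model = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(model.resid)
print(f"Durbin-Watson = {dw:.2f}")   # ~2 means no autocorrelation; under 1 or over 3 is a concern
```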
Regression Analysis
Suppose we want to assess the association between total cholesterol and body mass
index (BMI) in which total cholesterol is the dependent variable, and BMI is the
independent variable. In regression analysis, the dependent variable is denoted Y
and the independent variable is denoted X. So, in this case, Y=total cholesterol and
X=BMI.
The simple linear regression equation is Ŷ = b0 + b1*X, where Ŷ is the predicted or
expected value of the outcome, X is the predictor, b0 is the estimated Y-intercept,
and b1 is the estimated slope. The Y-intercept and slope are estimated from the
sample data so as to minimize the sum of the squared differences between the
observed and the predicted values of the outcome, i.e., the estimates minimize
Σ(Y - Ŷ)². These differences between observed and predicted values of the outcome
are called residuals. The estimates of the Y-intercept and slope minimize the sum
of the squared residuals, and are called the least squares estimates.
When the predictor is a risk factor X1, the simple linear regression model is
Ŷ = b0 + b1*X1, where b1 is the estimated regression coefficient that quantifies the
association between the risk factor and the outcome. Adding a potential confounder
X2 gives the multiple linear regression model Ŷ = b0 + b1*X1 + b2*X2.
In the multiple linear regression equation, b1 is the estimated regression coefficient
that quantifies the association between the risk factor X1 and the outcome, adjusted
for X2 (b2 is the estimated regression coefficient that quantifies the association
between the potential confounder and the outcome). As noted earlier, some
investigators assess confounding by assessing how much the regression coefficient
associated with the risk factor (i.e., the measure of association) changes after
adjusting for the potential confounder. In this case, we compare b1 from the simple
linear regression model to b1 from the multiple linear regression model. As a rule
of thumb, if the regression coefficient from the simple linear regression model
changes by more than 10%, then X2 is said to be a confounder.
Once a variable is identified as a confounder, we can then use multiple linear
regression analysis to estimate the association between the risk factor and the
outcome adjusting for that confounder. The test of significance of the regression
coefficient associated with the risk factor can be used to assess whether the
association between the risk factor and the outcome is statistically significant after accounting for
one or more confounding variables. This is also illustrated below.
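The original illustration is not reproduced here; as a hedged substitute with simulated data (a hypothetical confounder "age"), the sketch below compares b1 from the simple and multiple models as described:
```python
# Sketch: comparing b1 from simple vs. multiple regression to assess confounding
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 200
age = rng.uniform(30, 70, n)                                 # hypothetical confounder (X2)
bmi = 20 + 0.15 * age + rng.normal(0, 3, n)                  # risk factor (X1), related to age
chol = 120 + 1.0 * bmi + 1.2 * age + rng.normal(0, 10, n)    # outcome (Y)
df = pd.DataFrame({"chol": chol, "bmi": bmi, "age": age})

b1_simple = smf.ols("chol ~ bmi", data=df).fit().params["bmi"]
b1_adjusted = smf.ols("chol ~ bmi + age", data=df).fit().params["bmi"]
change = abs(b1_simple - b1_adjusted) / abs(b1_simple) * 100
print(f"b1 simple = {b1_simple:.2f}, b1 adjusted = {b1_adjusted:.2f}, change = {change:.0f}%")
# A change of more than 10% would flag age as a confounder by the rule of thumb.
```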
Autocorrelation
Informally, autocorrelation is the similarity between observations as a function of
the time lag between them.
Example of an autocorrelation plot
Notice how the plot looks like a sinusoidal function. This is a hint
for seasonality, and you can find its value by finding the period in the plot above,
which would give 24h.
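The plot itself is not reproduced here; a hedged sketch that generates a comparable autocorrelation plot for a simulated hourly series with a 24h cycle:
```python
# Sketch: autocorrelation plot of a simulated hourly series with a 24h cycle
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

rng = np.random.default_rng(0)
hours = np.arange(24 * 14)                           # two weeks of hourly data
series = 10 + 5 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, hours.size)

plot_acf(series, lags=72)    # sinusoidal ACF with peaks every 24 lags -> daily seasonality
plt.show()
```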
Seasonality
Seasonality refers to periodic fluctuations. For example, electricity consumption is
high during the day and low during night, or online sales increase during Christmas
before slowing down again.
Example of seasonality
As you can see above, there is a clear daily seasonality. Every day, you see a peak
towards the evening, and the lowest points are the beginning and the end of each
day.
Remember that seasonality can also be derived from an autocorrelation plot if it has
a sinusoidal shape. Simply look at the period, and it gives the length of the season.
Stationarity
Stationarity is an important characteristic of time series. A time series is said to be
stationary if its statistical properties do not change over time. In other words, it
has constant mean and variance, and covariance is independent of time.
Example of a stationary process
Looking again at the same plot, we see that the process above is stationary. The
mean and variance do not vary over time.
Often, stock prices are not a stationary process, since we might see a growing trend,
or its volatility might increase over time (meaning that variance is changing).
Ideally, we want to have a stationary time series for modelling. Of course, not all of
them are stationary, but we can make different transformations to make them
stationary.
Without going into the technicalities of the Dickey-Fuller test, it tests the null
hypothesis that a unit root is present. If a unit root is present, then p > 0 and the
process is not stationary. Otherwise, p = 0, the null hypothesis is rejected, and the
process is considered to be stationary.
As an example, the process below is not stationary. Notice how the mean is not
constant through time.
Example of a non-stationary process
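A hedged sketch of the (augmented) Dickey-Fuller test with statsmodels, applied to a simulated random walk before and after differencing:
```python
# Sketch: augmented Dickey-Fuller test before and after differencing
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(2)
walk = pd.Series(np.cumsum(rng.normal(size=500)))   # random walk: not stationary

print("p-value, raw series       :", round(adfuller(walk)[1], 3))                   # large -> unit root kept
print("p-value, differenced once :", round(adfuller(walk.diff().dropna())[1], 3))   # small -> stationary
```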
There are many ways to model a time series in order to make predictions. Common
approaches include:
moving average
exponential smoothing
ARIMA
Moving average
The moving average model is probably the most naive approach to time series
modelling. This model simply states that the next observation is the mean of all past
observations.
Although simple, this model might be surprisingly good and it represents a good
starting point.
Otherwise, the moving average can be used to identify interesting trends in the data.
We can define a window to apply the moving average model to smooth the time
series, and highlight different trends.
Example of a moving average on a 24h window
In the plot above, we applied the moving average model to a 24h window. The
green line smoothed the time series, and we can see that there are 2 peaks in a 24h
period.
Of course, the longer the window, the smoother the trend will be. Below is an
example of moving average on a smaller window.
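A hedged sketch of moving-average smoothing with pandas (simulated hourly data; the 24h and shorter windows mirror the example above):
```python
# Sketch: smoothing a time series with moving-average windows of different lengths
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
hours = np.arange(24 * 14)
series = pd.Series(10 + 5 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, hours.size))

smooth_24h = series.rolling(window=24).mean()   # longer window -> smoother trend
smooth_6h = series.rolling(window=6).mean()     # shorter window -> follows the data more closely
print(smooth_24h.tail(3))
print(smooth_6h.tail(3))
```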
Exponential smoothing
Exponential smoothing uses a similar logic to moving average, but this time, a
different decreasing weight is assigned to each observation. In other words, less
importance is given to observations as we move further from the present.
As you can see, the smaller the smoothing factor, the smoother the time series will
be. This makes sense, because as the smoothing factor approaches 0, we approach
the moving average model.
Mathematically, simple exponential smoothing computes each smoothed value as a
weighted average of the current observation and the previous smoothed value, with
the smoothing factor alpha controlling the weights. Double exponential smoothing
applies the same idea recursively so that a trend can also be captured, using a
second smoothing factor beta; different values of alpha and beta affect the shape
of the smoothed time series. Triple exponential smoothing adds a seasonal
component, where gamma is the seasonal smoothing factor and L is the length of the
season.
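A hedged sketch with the statsmodels Holt-Winters implementations: simple exponential smoothing with a fixed alpha, and triple (seasonal) exponential smoothing where the seasonal period plays the role of L; the hourly series is simulated.
```python
# Sketch: simple and triple (Holt-Winters) exponential smoothing with statsmodels
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing

rng = np.random.default_rng(6)
hours = np.arange(24 * 14)
series = pd.Series(10 + 5 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 1, hours.size))

# Simple exponential smoothing: smaller alpha -> smoother fitted values
simple = SimpleExpSmoothing(series).fit(smoothing_level=0.2, optimized=False)

# Triple exponential smoothing: additive trend and seasonality, season length L = 24
triple = ExponentialSmoothing(series, trend="add", seasonal="add",
                              seasonal_periods=24).fit()

print(simple.fittedvalues.tail(3))
print(triple.forecast(5))   # forecast the next 5 hours
```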
TIME SERIES
General Introduction
In the following topics, we will review techniques that are useful for analyzing
time series data, that is, sequences of measurements that follow non-random
orders. Unlike the analyses of random samples of observations that are discussed in
the context of most other statistics, the analysis of time series is based on the
assumption that successive values in the data file represent consecutive
measurements taken at equally spaced time intervals.
Detailed discussions of the methods described in this section can be found in
Anderson (1976), Box and Jenkins (1976), Kendall (1984), Kendall and Ord
(1990), Montgomery, Johnson, and Gardiner (1990), Pankratz (1983), Shumway
(1988), Vandaele (1983), Walker (1991), and Wei (1989).
AUTOREGRESSIVE AND MOVING AVERAGE PROCESSES
Autoregressive process. Each observation in the series can be expressed as a linear
combination of previous (time-lagged) observations plus a random error:
xt = ξ + φ1*x(t-1) + φ2*x(t-2) + φ3*x(t-3) + ... + εt
Where:
ξ is a constant (intercept), and
φ1, φ2, φ3 are the autoregressive model parameters.
Put into words, each observation is made up of a random error component (random
shock, εt) and a linear combination of prior observations.
Stationarity requirement. Note that an autoregressive process will only be stable
if the parameters are within a certain range; for example, if there is only one
autoregressive parameter, then φ1 must fall within the interval -1 < φ1 < 1.
Otherwise, past effects would accumulate and the values of successive xt's would
move towards infinity, that is, the series would not be stationary. If there is more
than one autoregressive parameter, similar (general) restrictions on the parameter
values can be defined (e.g., see Box & Jenkins, 1976; Montgomery, 1990).
Moving average process. Independent from the autoregressive process, each
element in the series can also be affected by the past error (or random shock) that
cannot be accounted for by the autoregressive component, that is:
xt = µ + εt - θ1*ε(t-1) - θ2*ε(t-2) - θ3*ε(t-3) - ...
Where:
µ is a constant, and
θ1, θ2, θ3 are the moving average model parameters.
Put into words, each observation is made up of a random error component (random
shock, εt) and a linear combination of prior random shocks.
Invertibility requirement. Without going into too much detail, there is a "duality"
between the moving average process and the autoregressive process (e.g., see Box
& Jenkins, 1976; Montgomery, Johnson, & Gardiner, 1990), that is, the moving
average equation above can be rewritten (inverted) into an autoregressive form (of
infinite order). However, analogous to the stationarity condition described above,
this can only be done if the moving average parameters follow certain conditions,
that is, if the model is invertible. Otherwise, the series will not be stationary.
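A hedged sketch simulating a stationary AR(1) and an invertible MA(1) process with statsmodels (note that statsmodels writes the MA part with plus signs, whereas the equation above uses minus signs):
```python
# Sketch: simulating a stationary AR(1) and an invertible MA(1) process
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

# ArmaProcess uses lag-polynomial coefficients: [1, -phi1] for AR, [1, theta1] for MA
ar1 = ArmaProcess(ar=[1, -0.7], ma=[1])        # x_t = 0.7*x_(t-1) + e_t  (|phi| < 1)
ma1 = ArmaProcess(ar=[1], ma=[1, 0.5])         # x_t = e_t + 0.5*e_(t-1)

print("AR(1) stationary:", ar1.isstationary)
print("MA(1) invertible:", ma1.isinvertible)

x = ar1.generate_sample(nsample=200)           # a stationary simulated series
print(x[:5])
```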
ARIMA METHODOLOGY
PARAMETER ESTIMATION
There are several different methods for estimating the parameters. All of them
should produce very similar estimates, but may be more or less efficient for any
given model. In general, during the parameter estimation phase a function
minimization algorithm is used (the so-called quasi-Newton method; refer to the
description of the Nonlinear Estimation method) to maximize the likelihood
(probability) of the observed series, given the parameter values. In practice, this
requires the calculation of the (conditional) sums of squares (SS) of the residuals,
given the respective parameters. Different methods have been proposed to compute
the SS for the residuals: (1) the approximate maximum likelihood method
according to McLeod and Sales (1983), (2) the approximate maximum likelihood
method with backcasting, and (3) the exact maximum likelihood method according
to Melard (1984).
Comparison of methods. In general, all methods should yield very similar
parameter estimates. Also, all methods are about equally efficient in most real-
world time series applications. However, method 1 above, (approximate maximum
likelihood, no backcasts) is the fastest, and should be used in particular for very
long time series (e.g., with more than 30,000 observations). Melard's exact
maximum likelihood method (number 3 above) may also become inefficient when
used to estimate parameters for seasonal models with long seasonal lags (e.g., with
yearly lags of 365 days). On the other hand, you should always use the
approximate maximum likelihood method first in order to establish initial
parameter estimates that are very close to the actual final values; thus, usually only
a few iterations with the exact maximum likelihood method (3, above) are
necessary to finalize the parameter estimates.
Parameter standard errors. For all parameter estimates, so-called asymptotic
standard errors are reported. These are computed from the matrix of second-
order partial derivatives that is approximated via finite differencing (see also the
respective discussion in Nonlinear Estimation).
Penalty value. As mentioned above, the estimation procedure requires that the
(conditional) sums of squares of the ARIMA residuals be minimized. If the model
is inappropriate, it may happen during the iterative estimation process that the
parameter estimates become very large and, in fact, invalid. In that case, the
estimation procedure will assign a very large value (a so-called penalty value) to
the SS. This usually
"entices" the iteration process to move the parameters away from invalid ranges.
However, in some cases even this strategy fails, and you may see on the screen
(during the Estimation procedure) very large values for the SS in consecutive
iterations. In that case, carefully evaluate the appropriateness of your model. If
your model contains many parameters, and perhaps an intervention component (see
below), you may try again with different parameter start values.
Stationarity
A common assumption in many time series techniques is that the data are stationary.
Any ‘non-seasonal’ time series that exhibits patterns and is not a random
white noise can be modeled with ARIMA models.
An ARIMA model is characterized by 3 terms: p, d, and q, which are defined below.
If a time series has seasonal patterns, then you need to add seasonal terms and
it becomes SARIMA, short for ‘Seasonal ARIMA’. More on that once we
finish ARIMA.
So, what does the ‘order of the AR term’ even mean? Before we go there, let’s
first look at the ‘d’ term. Because ARIMA models require the series to be
stationary, the first step is to make the time series stationary. The most common
approach is to difference it, that is, subtract the previous value from the current
value. Sometimes, depending on the complexity of the series, more than one
differencing may be needed.
The value of d, therefore, is the minimum number of differencing needed to
make the series stationary. And if the time series is already stationary, then d
= 0.
‘p’ is the order of the ‘Auto Regressive’ (AR) term. It refers to the number of
lags of Y to be used as predictors. And ‘q’ is the order of the ‘Moving
Average’ (MA) term. It refers to the number of lagged forecast errors that
should go into the ARIMA Model.
This acronym is descriptive, capturing the key aspects of the model itself. Briefly,
they are:
p: The number of lag observations included in the model, also called the lag
order.
d: The number of times that the raw observations are differenced, also called
the degree of differencing.
q: The size of the moving average window, also called the order of moving
average.
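A hedged sketch fitting an ARIMA(p, d, q) with statsmodels; the order (1, 1, 1) and the simulated series are illustrative choices only:
```python
# Sketch: fitting an ARIMA(p, d, q) model with statsmodels
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(8)
series = pd.Series(np.cumsum(rng.normal(size=300)))   # simulated non-stationary series

model = ARIMA(series, order=(1, 1, 1))   # p=1 AR lag, d=1 difference, q=1 MA term
result = model.fit()
print(result.summary().tables[1])        # estimated AR and MA coefficients
print(result.forecast(steps=5))          # forecast the next 5 observations
```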
The 2016 5th edition of the textbook (Part Two, page 177) refers to the process
as stochastic model building, an iterative approach that consists of
the following 3 steps:
1. Identification. Use the data and all related information to help select a sub-
class of model that may best summarize the data.
2. Estimation. Use the data to train the parameters of the model (i.e. the
coefficients).
3. Diagnostic Checking. Evaluate the fitted model in the context of the
available data and check for areas where the model may be improved.
It is an iterative process, so that as new information is gained during diagnostics,
you can circle back to step 1 and incorporate that into new model classes.
The identification step can be broken down further:
1. Assess whether the time series is stationary, and if not, how many
differences are required to make it stationary.
2. Identify the parameters of an ARMA model for the data.
1.1 Differencing
Below are some tips during identification.
Unit Root Tests. Use unit root statistical tests on the time series to
determine whether or not it is stationary. Repeat after each round of differencing.
Avoid over differencing. Differencing the time series more than is required
can result in the addition of extra serial correlation and additional complexity.
1.2 Configuring AR and MA
Two diagnostic plots can be used to help choose the p and q parameters of the
ARMA or ARIMA. They are:
Autocorrelation Function (ACF). The plot summarizes the correlation of
an observation with lag values. The x-axis shows the lag and the y-axis shows the
correlation coefficient between -1 and 1 for negative and positive correlation.
Partial Autocorrelation Function (PACF). The plot summarizes the
correlations for an observation with lag values that is not accounted for by prior
lagged observations.
Both plots are drawn as bar charts showing the 95% and 99% confidence intervals
as horizontal lines. Bars that cross these confidence intervals are therefore more
significant and worth noting.
The model is AR if the ACF trails off after a lag and has a hard cut-off in the
PACF after a lag. This lag is taken as the value for p.
The model is MA if the PACF trails off after a lag and has a hard cut-off in
the ACF after the lag. This lag value is taken as the value for q.
The model is a mix of AR and MA if both the ACF and PACF trail off.
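A hedged sketch drawing the ACF and PACF with their confidence bands, using data simulated from an AR(2) process so that the PACF should cut off after two lags:
```python
# Sketch: ACF and PACF plots for choosing p and q (simulated AR(2) data)
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

series = ArmaProcess(ar=[1, -0.6, -0.3], ma=[1]).generate_sample(nsample=500)

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(series, lags=30, ax=axes[0])    # trails off gradually for an AR process
plot_pacf(series, lags=30, ax=axes[1])   # hard cut-off after lag 2 -> p = 2
plt.tight_layout()
plt.show()
```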
2. Estimation
Estimation involves using numerical methods to minimize a loss or error term.
We will not go into the details of estimating model parameters as these details are
handled by the chosen library or tool.
3. Diagnostic Checking
Two useful areas to investigate during diagnostic checking are:
1. Overfitting
2. Residual Errors
3.1 Overfitting
The first check is whether the model overfits the data. Generally, this
means that the model is more complex than it needs to be and captures random
noise in the training data.
This is a problem for time series forecasting because it negatively impacts the
ability of the model to generalize, resulting in poor forecast performance on out of
sample data.
3.2 Residual Errors
A review of the distribution of errors can help tease out bias in the model. The
errors from an ideal model would resemble white noise, that is a Gaussian
distribution with a mean of zero and a symmetrical variance.
For this, you may use density plots, histograms, and Q-Q plots that compare the
distribution of errors to the expected distribution. A non-Gaussian distribution may
suggest an opportunity for data pre-processing. A skew in the distribution or a non-
zero mean may suggest a bias in forecasts that may be corrected.
Additionally, an ideal model would leave no temporal structure in the time series
of forecast residuals. These can be checked by creating ACF and PACF plots of the
residual error time series.
The presence of serial correlation in the residual errors suggests further opportunity
for using this information in the model.
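A hedged sketch of these residual checks (density and ACF/PACF of the residual errors) for an illustrative ARIMA fit on simulated data:
```python
# Sketch: diagnostic checks on ARIMA residual errors
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

rng = np.random.default_rng(9)
series = pd.Series(np.cumsum(rng.normal(size=300)))    # simulated series
residuals = ARIMA(series, order=(1, 1, 1)).fit().resid

print(residuals.describe())                            # mean near zero suggests little bias
residuals.plot(kind="kde", title="Residual density")   # should look roughly Gaussian

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(residuals, lags=30, ax=axes[0])    # no significant spikes -> no temporal structure left
plot_pacf(residuals, lags=30, ax=axes[1])
plt.tight_layout()
plt.show()
```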