R Regression Exercise 2019
[Figure 1: more relevant statistical insight, from xkcd.]
Following our trend, now onwards into the realm of big data analysis. In this worksheet we’re going to learn the background we need for the first multivariate
statistical technique of the course – multiple regression. In order to get to multiple regression, however,
we’re going to start with its bivariate ancestor, simple linear regression (a.k.a. ordinary least
squares/OLS).
On the previous worksheet we concentrated on describing the data that we have examined. In this
worksheet we will move to quantifying the relationships between two variables. It is no longer enough
to simply say that two variables are different – we must now quantitatively determine how one
influences the other. Through the course of this worksheet we will learn about the following concepts,
and how to implement them in R:
- Dependent vs. independent variables (or whatever else you would care to call them)
- Ordinary Least Squares Regression and the coefficient of determination (r2)
- Residuals, leverage, heteroscedasticity, and normality (i.e. checking that you’re not just wasting your time with all of this)
Exercise
As with all of my exercises, the questions you will be expected to answer are interspersed throughout
the text. Please read carefully to make sure you don’t miss anything, and be sure to include answers to
every question in your write-up.
Part 1 – Preparation
As you did last week, create a directory to hold this week’s data and analyses somewhere convenient.
Switch to this directory as your working directory.
Download the file USexports.csv from Learn to your computer and save it to your working directory.
Now load these data into R as a data frame called US.exports. Remember to use the row.names
command to label your rows:
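A minimal sketch of the import step (the file name comes from the worksheet; the assumption that the row labels sit in the first column of the csv is mine):

```r
# Read the csv, taking row labels from column 1 (an assumption about the
# file's layout), then convert to a numeric matrix with data.matrix(),
# as explained below.
US.exports <- data.matrix(read.csv("USexports.csv", row.names = 1))
```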
Note that the data.matrix() command is being used here to avoid R importing the data as a list of text
values, rather than numbers. This is due to the odd way that the data I downloaded from the USDA are
formatted – you shouldn’t have to do this every time (but have it in your arsenal in case you hit this
problem again)!
Check that your data have imported correctly before moving on to the remainder of the worksheet. This
week I have sourced data about the values (in millions of dollars) of the 20 major US agricultural
exports over the 14 years from 2000-2013.
Question 1) You’ll notice that R has added an “X” in front of each of the column names (it’s getting
confused because the column names are just numbers). Rectify this by renaming the
columns without the “X”s. As we’ve imported our data as a matrix, rather than a data
frame, you’ll need to use the command colnames() to name your columns, rather than
just names(). Add a row to your dataset, name it “Total Exports” and fill it with the total
dollar amount of the major US agricultural exports for each year.
What do you notice about the total exports through this period? It might be helpful to plot this up. As
we’ve renamed our column labels as numbers, we can do this pretty easily:
> plot(as.numeric(colnames(US.exports)), US.exports[21, ])
In this case, this command produces an xy plot, with the first variable on the x-axis and the second on
the y-axis. We can fancy this up a bit by connecting the points, which is appropriate given this is a time
series:
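One way to do this (the axis labels and plot type here are my additions):

```r
# type = "o" draws a line through the points and overplots the points,
# which suits a time series.
plot(as.numeric(colnames(US.exports)), US.exports[21, ], type = "o",
     xlab = "Year", ylab = "Total exports (million USD)")
```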
Your next task is to see what each of the individual exports is doing through this period. It would be
possible to plot each different commodity in the same way as we have for the totals, but that would be
tedious. Or maybe not if we knew the right commands…
R actually lets you plot multiple graphs in the same frame using the layout() command. Let’s first
create a 4x5 matrix into which to put the plots:
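A sketch of the call, assuming one plot per commodity, filled row by row:

```r
# layout() takes a matrix whose entries give the order in which successive
# plots fill the device: here a 4-row by 5-column grid holding 20 plots.
layout(matrix(1:20, nrow = 4, ncol = 5, byrow = TRUE))
```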
[Note that, depending on the size of your laptop’s screen, a 4x5 matrix may be too large to display. If this is the case, you have two options: output the plot to a pdf, rather than your screen, by wrapping your code in the pdf() command, or plot several smaller sets of graphs (2x2 or 3x3, for example). The former wraps everything in one set of commands but is harder to double-check in real time; the latter is easier to troubleshoot but lengthier to code.]
Now we just need to fill it with one plot derived from each of the first 20 rows of our dataset. Sounds
like we’re in need of a for loop. A for loop has the following structure:
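In R, the general shape is (x, y, and the body are placeholders):

```r
for (i in x:y) {
  # the code in these curly brackets runs once for each value of i,
  # with i taking the values x, x+1, ..., y in turn
}
```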
The code repeats whatever is in the curly brackets for every value of i between x and y. An example
would be:
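A hypothetical version of the birthday example (friend.ages and its age column are invented names, not part of the worksheet’s data):

```r
# friend.ages is assumed to be a data frame with one row per friend and
# a numeric "age" column; each pass through the loop updates one friend.
for (i in 1:20) {
  friend.ages$age[i] <- friend.ages$age[i] + 1
}
```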
This code takes a data frame that contains the ages of 20 of your friends and adds one to each of those
ages, so (when you run this on Jan 1st each year) you’ll know how old they’ll be on their next birthday.
The first time the loop runs, i will have a value of 1, and the first friend’s age will update, the second
time it will have a value of 2, and the second friend’s age will update, etc. Useful, eh? Loops are a
powerful time saver. Depending on what code you write in the curly brackets, you can do an amazing
variety of things with a loop. Note that I chose to use i as the variable that is incremented by the loop
out of convention. You can use any variable name that you would like. I can hear the cheers already!
Question 2) Write a for loop to replicate (as near as you can) the plot above. I’d recommend iterating
to the solution.
This is interesting… Although the overall pattern for each commodity shows increasing exports with
time, not all of the commodities are behaving the same way. Exports of processed fruits and nuts seem
to increase almost monotonically through time, whereas wheat and corn exports are very volatile.
Why might we be seeing these patterns? We can hypothesise that it is unlikely that the physical volume of US exports has doubled since 2000 – perhaps something else is inflating the dollar values. Without further data we cannot test this yet, but it seems reasonable. A significant proportion of US agricultural exports go to the EU, however. Perhaps some of this pattern is caused by the relative strength of the US dollar (USD) and the Euro (EUR)?
Let’s look at how many US dollars one Euro would buy you on January 1st of each year:
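One way to sketch this, assuming you append the exchange rates as a 22nd row of the matrix (the vector name usd.per.eur is a placeholder, and you should use the rate values supplied with the worksheet, not invented ones):

```r
# usd.per.eur should hold one USD-per-EUR rate per year, 2000-2013.
US.exports <- rbind(US.exports, usd.per.eur)
rownames(US.exports)[22] <- "USD/EUR"
plot(as.numeric(colnames(US.exports)), US.exports[22, ], type = "o",
     xlab = "Year", ylab = "USD per EUR (Jan 1st)")
```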
The plot looks pretty similar to some of the patterns that we see in the export data, but how similar is
it?
Question 4) By eye, which commodities do you think show some influence of the USD/EUR exchange rate on the value of their exports?
As the good big data analysts that we all are, our gut intuition isn’t sufficient for us to call these
patterns. What we want to do instead is quantify this relationship. We can do this through OLS
regression (I bet you were wondering when we were going to get to that…)!
OLS regression tests for a linear relationship between two variables of the form:
y = mx + c
Where x is the independent variable (the variable whose variations drive the relationship) and y is your
dependent variable (the variable that is changed in response to x). A simple example of this would be in
the correlation between how far you depress the accelerator on your car and how fast you go in a single
gear. Accelerator depression is the independent variable (x) and speed is the dependent variable (y). As
you change x, y changes.
As I hope you will remember from high school, m represents the slope of the line being described here,
and c represents the intercept on the y axis. OLS regression calculates the best-fitting line through a
cloud of data, and hence the best-fitting equation of the form y=mx+c that describes the relationship
between the dependent and independent variables. We won’t derive how it does this, but the principle is
that the regression algorithms find the line through the data that minimises the sum of the squared
residuals between the observed values of y and the values predicted by the equation. This is shown in
the following image (gratuitously stolen from G. D. Hutcheson, 2011) by the large dots (observed
values, off the trend line) and small dots (predicted values, on the trend line).
Any other line through these data would increase the size of the residuals (lines connecting the large
and small dots), and so would be a worse fit to the data and a poorer approximation of the relationship
between x and y.
You can calculate two very useful summary statistics from a regression. The first shows the probability that the linear relationship modelled by your y=mx+c trendline is “real”, i.e. that this trendline does a better job of explaining the data than a horizontal line at the mean of y (no relationship between x and y). This is a p-value, calculated from the F-distribution. Provided your p-value is statistically significant (as we said last time, this cut-off is conventionally chosen as p < 0.05, but this is a rule of thumb, and other values can be justified), you can interpret the proportion of the variation in your y values that is explained by the model – the goodness of fit of the model (r2). This is expressed as a proportion from zero to one,
with a value of one meaning your model perfectly describes your data (there are no residuals). An r2 of
0.62 would mean that 62% of the variation in your y values can be explained using y=mx+c, and 38%
cannot.
This seems like an ideal technique to investigate our agricultural export data further, so let’s prepare to
carry out an OLS regression to see if there is a relationship between USD/EUR and total US
agricultural exports.
First off, rule number one of data analysis – if in doubt, plot it up.
Question 5) Which of the two variables is the dependent and which is the independent? Why? Create
a labelled plot of x vs. y and provide it in your write-up. Are the points random, or do
they show a pattern?
Now to proceed on to quantify the relationship between the two variables using OLS regression. Use
the following command:
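Assuming your exports total sits in row 21 of the matrix and the USD/EUR rates in row 22, the call would look like this:

```r
# lm() fits y ~ x by ordinary least squares; here total exports is the
# dependent variable and USD/EUR the independent one.
reg1 <- lm(US.exports[21, ] ~ US.exports[22, ])
```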
What you’ve done here is to calculate the best fitting linear model to explain your dependent variable
in terms of your independent variable (I might just have given away part of the answer to Q5 here if
you’ve read ahead, and can decode the R syntax…), and saved this output in a variable called reg1.
You’ll notice, however, that you haven’t got any output to know the results of the regression. You can
start by just calling the variable:
> reg1

Call:
lm(formula = US.exports[21, ] ~ US.exports[22, ])

Coefficients:
     (Intercept)  US.exports[22, ]
          -72862            131885
This provides you with the coefficients (i.e. values) for c and m, respectively, in y = mx + c. You can
now calculate the predicted export total for any USD/EUR, simply by plugging the values into your
equation. Cool, eh?
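For example, extracting the fitted coefficients and plugging in an illustrative rate of 1.00 USD per EUR (a made-up value, chosen so as not to answer the next question for you):

```r
c.hat <- coef(reg1)[1]  # intercept (c)
m.hat <- coef(reg1)[2]  # slope (m)
m.hat * 1.00 + c.hat    # predicted total exports, y = mx + c
```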
Question 6) What total value of US agricultural exports would you expect to see when €1 buys you $1.20? Does this correspond well to what you’d expect from looking at your plot?
Unfortunately, this regression output doesn’t provide you with all of the useful statistics (p, r2) to assess
the quality of your model that we detailed earlier. To get these, try:
> summary(reg1)
You’ll see a summary of your residuals, some more detailed data regarding both of your coefficients,
then some r2 values (only worry about the r2 at the moment) and a p-value.
Question 7) How do you interpret your p-value and adjusted r2? Is the model statistically significant,
and what proportion of the variation in exports can be explained by the variation in
USD/EUR? Is this a good model?
You can also get a little more information from the output by looking at the Pr(>|t|) column in the coefficients section. This tells you whether each of your coefficients (rather than the whole model) is statistically significantly different from zero (i.e. is important in understanding the relationship you are modelling).
Question 8) Are both of your coefficients significant? What do your results tell you?
Finally, returning to the first rule of data analysis, let’s plot our linear model on our data and take a look
at how good a fit we think it is.
If you closed it, call up your plot of exports vs. USD/EUR again. Then add the regression line to it (this
is something that R makes simple – hurrah!):
> abline(reg1)
Hmmm… That’s a little unfortunate. As we should, perhaps, have expected, our model doesn’t look the
greatest. Perhaps OLS wasn’t the best technique to describe these data? OLS assumes that (a) the residuals of the model are normally distributed, and (b) the relationship we are modelling is linear. Maybe we’ve violated one or both of these assumptions? Let’s check:
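A sketch of the check, using par(mfrow = ...) to split the plotting device and R’s built-in diagnostics for lm objects:

```r
par(mfrow = c(2, 2))  # a 2x2 grid of plots
plot(reg1)            # draws the four standard regression diagnostic plots
```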
This produces a 2x2 matrix in which to place our plots (another way would be to use the layout() command from earlier), and then plots the four regression diagnostic plots in this new matrix. Hopefully
you’ll see something that looks a bit like this:
Each of these plots allows you to visually assess whether your data fit different assumptions of the OLS
model. In all cases, points that violate each assumption will be labelled to allow you to investigate them
further. You can see that the same three culprits are labelled in each of the four plots. There’s something
odd about 2005, 2012 and 2013…
The top-left plot (Residuals vs. Fitted) shows any heteroscedasticity within your data – where the spread of the residuals changes across the range of fitted values. Here, we can see that for low fitted values all the data have low residuals, but for high fitted values there is a lot more variability in residuals – i.e. at high fitted values some data do not fit the model well, and so are affected by something else that is not incorporated into the model. Any systematic pattern in this plot (i.e. anything other than an evenly distributed cloud) indicates failure of this assumption.
The top-right plot (Normal Q-Q) is a test of whether the residuals are normally distributed. If so, all of the points will fall on the dotted line; if not, they will (usually) form an s-curve off the line. Our residuals appear not to be normally distributed. Oops. Fortunately, with large datasets, violation of this assumption is the least serious, but you still need to be aware of it.
The bottom-left plot (Scale-Location) is another way of looking at heteroscedasticity (it plots the square root of the standardised residuals from the Residuals vs. Fitted plot, so all values are positive). The interpretation is the same as in the top-left plot.
The bottom-right plot (Residuals vs. Leverage) shows whether any of your data points are outliers from
the remainder of the dataset – i.e. they have values that are markedly different from the bulk of your
data. In our plot we can see that the usual suspects are marked as exerting particularly high leverage on
the solution of the model, so we suspect yet further that something else is going on with these years.
What can we say from these analyses? Well, it looks like an OLS regression isn’t quite the right
technique to describe the relationship between USD/EUR and US agricultural exports. Although it
gives a statistically significant relationship, and explains a fair portion of the variability in the data, the
data don’t conform to a simple linear model. Looking back at your plot from question 5, this should be
obvious. As to where we can go to try and solve this problem, see the next worksheet…
Question 9) Choose another pair of variables from your data matrix that could conceivably have a
causal relationship, identify your dependent and independent variable, carry out an OLS
regression between the two, describe the meaning of your results, and test your
regression model for violations of assumptions.