Simple Linear Regression Using a Real Dataset in R and Excel
Earlier, the marketing dataset from the datarium package was used in R to explore relationships in data. A strong
linear relationship was found between the variables YouTube and sales. This section explores the relationship
between these two variables further using simple linear regression.
Load the marketing dataset into a variable in R using the following code:
library(datarium)  # provides the marketing dataset
md <- marketing    # store the dataset in the variable md
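Use the function lm() to perform linear regression with the variables sales and youtube (the dataset column that
holds the YouTube budget). In this case, sales is the response variable, and youtube is the explanatory variable.
Store the fitted model in a variable and display it with summary():
model <- lm(sales ~ youtube, data = md)
summary(model)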
Call:
Residuals:
Coefficients:
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
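The individual parts of this summary can also be extracted in code, which is convenient when only the estimates are needed. The following is a minimal sketch, assuming the datarium package is installed; the variable name coef_table is illustrative and does not come from the lesson.

```r
library(datarium)  # assumed to be installed; provides the marketing dataset

md <- marketing
model <- lm(sales ~ youtube, data = md)

# Named vector holding the intercept and the youtube slope
coef(model)

# Full coefficient table: estimate, standard error, t value, and p-value
coef_table <- coef(summary(model))
coef_table

# A single entry, here the estimated slope for youtube
coef_table["youtube", "Estimate"]
```

Indexing the table by row and column name avoids depending on the position of a value in the printed output.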
Interpretation of R Output
o The estimated regression line equation is sales = 8.439 + 0.048 · youtube. The value of the
y-intercept is 8.439 (to 3 decimal places) and the slope is 0.048 (to 3 decimal places). Note: under
"Coefficients", the value in the "Estimate" column of the youtube row is the slope of the estimated
regression line, and the value in the (Intercept) row is the y-intercept.
o Alternatively, compute the slope of the regression line using the formula $b_1 = r \cdot \frac{s_y}{s_x}$,
where $r$ is the correlation coefficient, $s_y$ is the standard deviation of the dependent variable (sales),
and $s_x$ is the standard deviation of the independent variable (YouTube). This gives 0.048 (to 3 decimal
places). Calculate this value in R using the following code:
cor(md$youtube, md$sales) * (sd(md$sales) / sd(md$youtube))
[1] 0.04753664
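o Alternatively, compute the y-intercept of the regression line using the formula $b_0 = \bar{y} - b_1 \bar{x}$,
where $b_1$ is the slope, $\bar{y}$ is the mean of sales, and $\bar{x}$ is the mean of YouTube. This gives 8.4 (to 1
decimal place). Calculate this value in R using the following code:
mean(md$sales) - 0.048 * mean(md$youtube)
[1] 8.357352
Note: The y-intercept calculated here differs slightly from the value in the R linear regression output (8.357
versus 8.439) because the slope was rounded to two significant figures (0.048) before it was used in the formula.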
o When the YouTube advertising budget is 0, sales are expected to amount to 8.439 units, that is,
8.439 × 1,000 = 8,439 dollars (recall that sales and YouTube units are in thousands of dollars).
o A 1 unit increase in the YouTube budget is associated with an average increase of 0.048 units in sales.
As sales and YouTube units are given in thousands of dollars, a 1,000-dollar increase
in the YouTube budget corresponds to an expected increase of about 48 dollars in sales.
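The points above can be checked directly in R. The sketch below assumes the datarium package is installed; the names b0, b1, manual, from_lm, budgets, and pred are illustrative. It recomputes the slope and the y-intercept by hand with the slope left unrounded, compares them with the lm() estimates, and then predicts sales at budgets of 0, 1, and 2 units to show the meaning of the two coefficients.

```r
library(datarium)  # assumed to be installed; provides the marketing dataset

md <- marketing
model <- lm(sales ~ youtube, data = md)

# Slope from the correlation coefficient and the two standard deviations
b1 <- cor(md$youtube, md$sales) * sd(md$sales) / sd(md$youtube)

# Intercept from the sample means, keeping the slope unrounded
b0 <- mean(md$sales) - b1 * mean(md$youtube)

# Side-by-side comparison with the lm() estimates
manual <- c(intercept = b0, slope = b1)
from_lm <- setNames(coef(model), c("intercept", "slope"))
rbind(manual, from_lm)

# Predicted sales at budgets of 0, 1, and 2 units (thousands of dollars)
budgets <- data.frame(youtube = c(0, 1, 2))
pred <- predict(model, newdata = budgets)
pred

# Consecutive predictions differ by exactly the slope
diff(pred)
```

With the unrounded slope, the hand-computed intercept agrees with the lm() value, so the gap between 8.357 and 8.439 comes only from rounding the slope to 0.048.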
Import the marketing.csv dataset in the folder DATA to a new Excel worksheet.
The table in Figure 3-24 provides statistical measures that indicate how well the model fits the data.
Figure 3-24
R-square is a statistical measure of how much of the variance in the response variable (sales) is explained
by the explanatory variable (YouTube). Generally, the larger the value of R-square, the better the regression model
fits the observations. The R-square value of 0.6119 indicates that the YouTube predictor accounts for approximately
61% of the variance in sales. Multiple R is the correlation coefficient computed earlier; in simple linear
regression, R-square is the square of this value.
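The same quantities can be reproduced in R as a cross-check on the Excel table. This is a minimal sketch, assuming the datarium package is installed; the variable names are illustrative. It reads R-square from the model summary and recomputes it from the residual and total sums of squares.

```r
library(datarium)  # assumed to be installed; provides the marketing dataset

md <- marketing
model <- lm(sales ~ youtube, data = md)

# R-square as reported by the model summary
r_squared <- summary(model)$r.squared

# Correlation coefficient between the two variables (Multiple R in Excel)
multiple_r <- cor(md$youtube, md$sales)

# R-square from its definition: 1 - residual sum of squares / total sum of squares
ss_res <- sum(residuals(model)^2)
ss_tot <- sum((md$sales - mean(md$sales))^2)
r_squared_manual <- 1 - ss_res / ss_tot

round(c(multiple_r = multiple_r,
        r_squared = r_squared,
        manual = r_squared_manual), 4)
```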
The standard error of the regression summarizes the typical distance between the observed values and the regression
line, measured in the units of the response variable. In this example, that distance is 3.91, or about 3,910 dollars
of sales. A lower value is better: it indicates that the distances between the data points and the fitted values
are small.
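As a cross-check, the standard error of the regression can be computed from the residuals in R. The sketch below assumes the datarium package is installed; the names rse_manual and rse_summary are illustrative. It divides the residual sum of squares by the residual degrees of freedom (n - 2 for one explanatory variable plus an intercept) and takes the square root.

```r
library(datarium)  # assumed to be installed; provides the marketing dataset

md <- marketing
model <- lm(sales ~ youtube, data = md)

# Number of observations
n <- nrow(md)

# Square root of the residual sum of squares over the residual degrees of freedom
rse_manual <- sqrt(sum(residuals(model)^2) / (n - 2))

# The same quantity as stored in the model summary
rse_summary <- summary(model)$sigma

round(c(manual = rse_manual, summary = rse_summary), 2)
```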
An analyst can show the relationship between sales and YouTube using a linear regression chart.
To create this chart, first, create a scatter plot of sales against YouTube using the method from earlier in the lesson.
Now, draw the least squares regression line. Right-click any point and select Add Trendline from the context
menu.
In the pane that appears on the right, select Linear under Trendline Options and check Display Equation on Chart.
Use the Fill & Line tab to customize the line, e.g., change the color of the line to red and the dash type to a
solid (unbroken) line. Figure 3-27 shows the resulting linear regression chart.
Figure 3-27
You will see that the regression line in Figure 3-27 has the same coefficients as those obtained from the R and Excel
regression outputs.
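For comparison, a similar chart can be drawn in R with base graphics. This is only a sketch of one possible approach, assuming the datarium package is installed; the axis labels, title, and the variable names b and eq are illustrative choices, not part of the lesson.

```r
library(datarium)  # assumed to be installed; provides the marketing dataset

md <- marketing
model <- lm(sales ~ youtube, data = md)
b <- coef(model)

# Scatter plot of sales against the YouTube budget
plot(md$youtube, md$sales,
     xlab = "YouTube budget (thousands of dollars)",
     ylab = "Sales (thousands of dollars)",
     main = "Sales against YouTube budget")

# Least squares regression line taken from the fitted model
abline(model, col = "red", lwd = 2)

# Regression equation shown on the chart, rounded to 3 decimal places
eq <- sprintf("sales = %.3f + %.3f * youtube", b[1], b[2])
legend("topleft", legend = eq, bty = "n")
```

Because abline() reads the coefficients from the fitted model, the line drawn uses the same intercept and slope as the regression output.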