Math 3080 Final Project
Curtis Miller
4/10/14
MATH 3080
Final Project
Problem 1: Car Data
The first question asks for an analysis of car data. The data was collected from the Kelley
Blue Book by S. Kuiper in 2008 and covers a number of variables on 2005 General Motors
vehicles: the vehicle's suggested price; the number of miles the vehicle has been driven
(mileage); the manufacturer of the vehicle (make); the model of the vehicle; the specific type of
car model of the vehicle (trim); the body type; the number of cylinders the vehicle has in the
engine; the engine size (liter); the number of doors; whether the vehicle has cruise control;
whether the vehicle has upgraded speakers; and whether the vehicle has leather seats. I
developed linear regression models to describe the price of the vehicle as a function of
the other variables.
My first model was a simple linear regression model that did not transform the data
other than to turn certain categorical variables (make, model, trim, type, cruise, sound, and
leather) into dummy variables in the model. In the case of cruise, sound, and leather, there is
only one dummy variable in the model that takes the value 1 if the respective condition is met
and 0 otherwise. Make, model, trim, and type have many more levels, so a dummy is created
for each of the possible levels save one (there are dummy variables for all but one make of
car, for example). The coefficients of all the variables are then estimated, and the resulting
model is analyzed. All estimated models for the car data are listed in Table 1.
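The dummy coding described above can be sketched in a few lines. This is only an illustration in Python, not the R code used for the project; the sample values of make and the helper name `dummy_code` are hypothetical. It also shows why one level is left out: with an intercept in the model, the full set of dummies for a variable always sums to a column of ones.

```python
# Hypothetical sample of the categorical column "make" (values are illustrative).
makes = ["Buick", "Cadillac", "Chevrolet", "Buick", "Chevrolet"]


def dummy_code(values, drop_first=True):
    """Return (column_names, rows) of 0/1 indicators for a categorical variable.

    With drop_first=True the first level (in sorted order) is the baseline
    and gets no column, so k levels produce k - 1 dummies.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


# k - 1 coding: Buick is the baseline and has no column of its own.
names, rows = dummy_code(makes)

# Full k-column coding: every row sums to 1, which duplicates the intercept
# column and makes the design matrix linearly dependent.
all_names, all_rows = dummy_code(makes, drop_first=False)
row_sums = [sum(r) for r in all_rows]
```

The coefficient on each remaining dummy is then read as a difference from the baseline level.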
The coefficients for the type dummy variables and liters could not be estimated and
were dropped by the R statistical package. Those variables must have been linearly dependent
on other variables in the model. A number of coefficients are statistically different from zero,
and of all the models, model 1 has the highest adjusted $R^2$ [...]
of the model dropped, but some of the problems that were apparent in
the diagnostic plot of the first model improved, though the model is still not perfect (see Figure
3). The variance of the residuals is less a function of the fitted values than before, and the Q-Q plot of the
residuals suggests they are closer to normally distributed than before, but there is still
heteroskedasticity in the residuals and they are still not normally distributed. There are still
influential data points, as indicated by the residual vs. leverage plot, and the residuals vs. index
plot is still not perfect, but the other plots have improved. Meanwhile, most of the coefficients in
the second model are statistically different from zero, save for the coefficients of cruise and
sound, and leather is statistically significant only at the [...] level. So I felt I should consider
two more models.
The third model I estimated is similar to the second, but I removed the variables for
cruise, sound, and leather. Dropping those variables changed little in the model; no variables
changed level of significance, the adjusted $R^2$ [...]
change in price.
But by transforming mileage, the new interpretation of the coefficient is that a 1% change in
mileage is associated with a [...]
of all the models, and the diagnostic plots (Figure 5) are not much better
than the diagnostic plots of the third model; while the residuals appear to be somewhat more
homoskedastic and appropriately centered, they are less normal than the residuals of the third
model. Worse, the residual vs. mileage plot deteriorated; it appears that the variances of the
residuals are not independent of mileage. I conclude that the fourth model is not much of an
improvement over the third model.
1. Note that log is the natural log, sometimes denoted ln.
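The percentage interpretation of a logged regressor can be checked numerically. The sketch below assumes the fourth model regresses log(price) on log(mileage), which the surviving text suggests but does not state outright; the coefficient value is hypothetical and is not an estimate from Table 1.

```python
def pct_change_in_price(beta, pct_change_in_mileage):
    """Exact percent change in price implied by a log-log coefficient.

    Assumes log(price) = a + beta * log(mileage) + ..., so multiplying
    mileage by a factor f multiplies price by f ** beta.
    """
    factor = 1 + pct_change_in_mileage / 100
    return (factor ** beta - 1) * 100


beta = -0.05  # hypothetical elasticity, chosen only for illustration

exact = pct_change_in_price(beta, 1.0)  # effect of a 1% rise in mileage
approx = beta                           # rule of thumb: about beta percent
```

For small changes the exact figure and the rule of thumb agree closely, which is why the coefficient is usually read directly as a percentage.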
Given the choice between these four models, I prefer the third. It fits the data reasonably
well, and while it does not fit the assumptions necessary for statistical inference in linear
regression very well, it fares better than any other model estimated. The third model also does
not include variables that have been found to have minimal impact on price (save for some
dummies). Thus I feel that the third model is the best model for predicting price.
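The model comparisons above rely on adjusted $R^2$, which penalizes a model for each extra regressor. A minimal sketch of the standard formula follows; the function name and the numbers are hypothetical and are not the values from Table 1.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors (not counting the
    intercept) fitted to n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical numbers, chosen only to show the penalty at work: the larger
# model has a slightly higher R^2 but a lower adjusted R^2.
small = adjusted_r2(0.900, n=100, p=5)
large = adjusted_r2(0.901, n=100, p=20)
```

This is why dropping variables with little explanatory power, as in the third model, need not lower adjusted $R^2$ by much.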
Problem 2: Dolphin Data
The second question asks for an analysis of the sound pressure of dolphin sonar signals
compared to the distance (range) from the dolphin to the target. The data is from Marianne
Rasmussen, collected off the coast of Iceland near Keflavik. The pressures were corrected for
water density and were expected to increase with distance.
The first model I estimated was a simple linear model. The equation of the model is:
$$\text{pressure} = \beta_0 + \beta_1\,\text{range} + \varepsilon$$
The estimated regression coefficients are listed in Table 2 (along with the coefficients of all other
regression equations I estimated). As expected, the coefficient for range is positive and
statistically different from zero. However, the adjusted $R^2$ [...]
Note that when interpreting the coefficient [...]
(There is no simple interpretation of the coefficients of the quadratic model.)
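To see why, assume the quadratic model has the form below (the symbols are hypothetical labels, not taken from Table 2). The change in expected pressure associated with a small change in range is then not a single coefficient; it depends on the range itself:

```latex
\begin{align*}
\text{pressure} &= \beta_0 + \beta_1\,\text{range} + \beta_2\,\text{range}^2 + \varepsilon \\
\frac{\partial\,\mathrm{E}[\text{pressure}]}{\partial\,\text{range}} &= \beta_1 + 2\beta_2\,\text{range}
\end{align*}
```

Neither $\beta_1$ nor $\beta_2$ alone describes the effect of range, so the two have to be read together at a chosen value of range.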
The diagnostic plots for the logarithmic and quadratic models are shown in Figure 7 and
Figure 8, respectively. Both models have similar benefits and pitfalls. Their adjusted
$R^2$ values are
higher than the adjusted