This assignment explores the impact of transmission type (am) on the miles per gallon (mpg) of a car using the mtcars dataset. The data are loaded and explored with a boxplot, to visualize the relationship between mpg and am, and a pairs plot, to visualize the pairwise relationships among all variables. Three models are then fitted to the data and their diagnostic plots examined. Model 3 is selected; it includes am as the predictor of interest, with disp (displacement), wt (weight), and qsec (1/4 mile time) as confounding variables.
By interpreting the coefficients, we conclude that manual transmission cars have higher miles per gallon than automatic transmission cars, ceteris paribus.
library(ggplot2)
library(GGally)
library(dplyr)
library(MASS)
data(mtcars)
mtcars2 <- mtcars # working copy, so the original mtcars stays unmodified
First, we look at the boxplot of Miles per gallon (mpg) vs Transmission (am).
mtcars2$transmission <- factor(mtcars$am, labels = c("Auto", "Manual"))
ggplot(data = mtcars2, aes(x = transmission, y = mpg, group = transmission, fill = transmission)) +
geom_boxplot()
The relationship is apparent: manual transmission cars tend to have higher miles per gallon than automatic transmission cars.
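As a numerical complement to the boxplot, the group means and an unadjusted two-sample comparison can be computed directly. This is a minimal sketch: the names `trans_df`, `mpg_by_trans`, and `welch` are illustrative and not part of the original analysis, and the Welch t-test is one reasonable choice of test rather than one used in this report.

```r
# Illustrative check, not part of the original analysis
data(mtcars)

# Label the transmission type the same way as in the boxplot
trans_df <- data.frame(
  mpg = mtcars$mpg,
  transmission = factor(mtcars$am, labels = c("Auto", "Manual"))
)

# Mean mpg within each transmission group
mpg_by_trans <- tapply(trans_df$mpg, trans_df$transmission, mean)
print(mpg_by_trans)

# Welch two-sample t-test of mpg between the two groups (no adjustment
# for other variables, so this only mirrors what the boxplot shows)
welch <- t.test(mpg ~ transmission, data = trans_df)
print(welch)
```

Because this comparison ignores every other variable, it describes the raw gap only; the regression models are what adjust for confounding.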
Next, we do a pairs plot to look at the relationship between variables.
ggpairs(data = mtcars)
Hence, we can suggest a few models.
Model 1: Include all predictors.
Model 2: mpg ~ am + wt + hp - 1 **
Model 3: Use pre-built stepwise algorithms to determine the model.
** Chosen because wt and hp have a relatively low correlation with am and an apparent linear relationship with the outcome mpg.
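The pairs plot is read by eye, so it may help to check numerically the correlations that the rationale for Model 2 relies on. A minimal sketch, assuming Pearson correlation is the measure of interest (the object name `cors` is illustrative):

```r
# Illustrative check, not part of the original analysis
data(mtcars)

# Pearson correlations of wt and hp with the outcome (mpg)
# and with the predictor of interest (am)
cors <- cor(mtcars)[c("wt", "hp"), c("mpg", "am")]
print(round(cors, 2))
```

Values near zero in the am column would support treating a variable as weakly related to transmission; values far from zero in the mpg column would support its linear relationship with the outcome.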
fit1 <- lm(formula = mpg ~ .-1, data = mtcars)
summary(fit1)
Call:
lm(formula = mpg ~ . - 1, data = mtcars)
Residuals:
    Min      1Q  Median      3Q     Max
-3.7721 -1.6249  0.1699  1.1068  4.4666
Coefficients:
     Estimate Std. Error t value Pr(>|t|)
cyl   0.35083    0.76292   0.460   0.6501
disp  0.01354    0.01762   0.768   0.4504
hp   -0.02055    0.02144  -0.958   0.3483
drat  1.24158    1.46277   0.849   0.4051
wt   -3.82613    1.86238  -2.054   0.0520 .
qsec  1.19140    0.45942   2.593   0.0166 *
vs    0.18972    2.06825   0.092   0.9277
am    2.83222    1.97513   1.434   0.1656
gear  1.05426    1.34669   0.783   0.4421
carb -0.26321    0.81236  -0.324   0.7490
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.616 on 22 degrees of freedom
Multiple R-squared: 0.9893, Adjusted R-squared: 0.9844
F-statistic: 203 on 10 and 22 DF, p-value: < 2.2e-16
fit2 <- lm(formula = mpg ~ am+hp+wt-1, data = mtcars)
summary(fit2)
Call:
lm(formula = mpg ~ am + hp + wt - 1, data = mtcars)
Residuals:
    Min      1Q  Median      3Q     Max
-13.111  -3.553   1.554   5.723  11.581
Coefficients:
   Estimate Std. Error t value Pr(>|t|)
am 16.35773    2.10496   7.771 1.43e-08 ***
hp -0.07726    0.02350  -3.288  0.00265 **
wt  7.39707    1.09971   6.726 2.22e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.556 on 29 degrees of freedom
Multiple R-squared: 0.9112, Adjusted R-squared: 0.9021
F-statistic: 99.25 on 3 and 29 DF, p-value: 2.366e-15
# trace = FALSE suppresses the lengthy step-by-step printout
step_fit <- stepAIC(fit1, direction = "both", trace = FALSE)
# The stepwise algorithm suggests this formula: mpg ~ disp + wt + qsec + am - 1
fit3 <- lm(formula = mpg ~ disp + wt + qsec + am - 1, data = mtcars)
summary(fit3)
Call:
lm(formula = mpg ~ disp + wt + qsec + am - 1, data = mtcars)
Residuals:
    Min      1Q  Median      3Q     Max
-3.7169 -1.4638 -0.5382  1.7825  4.3566
Coefficients:
      Estimate Std. Error t value Pr(>|t|)
disp  0.012020   0.008891   1.352 0.187238
wt   -4.612795   1.158173  -3.983 0.000440 ***
qsec  1.705510   0.127486  13.378  1.1e-13 ***
am    4.180854   1.013616   4.125 0.000301 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.462 on 28 degrees of freedom
Multiple R-squared: 0.9879, Adjusted R-squared: 0.9862
F-statistic: 572.1 on 4 and 28 DF, p-value: < 2.2e-16
Fortunately, transmission (am) is retained in the model chosen by the stepwise algorithm.
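Instead of copying the selected formula by hand, it can be read from the object that stepAIC returns. The following is a sketch under the assumption that the same full no-intercept model is the starting point; `full_fit` and `sel_fit` are illustrative names.

```r
# Illustrative sketch, not part of the original analysis
library(MASS)
data(mtcars)

# Same starting point as Model 1: all predictors, no intercept
full_fit <- lm(mpg ~ . - 1, data = mtcars)

# Stepwise search in both directions; trace = FALSE keeps the console quiet
sel_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)

# Formula of the selected model
print(formula(sel_fit))

# Record of the terms dropped or added at each step, with the AIC after each
print(sel_fit$anova)
```

The `anova` component gives a compact record of the search, which is easier to read than the full trace.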
layout(matrix(c(1,2,3,4),2,2)) # 4 graphs/page
plot(fit1)
The residuals vs fitted values plot shows no obvious pattern, which suggests that the included variables explain the variation in the outcome quite well. The normal Q-Q plot looks fine too: most of the standardized residuals lie close to the 45-degree line.
layout(matrix(c(1,2,3,4),2,2)) # 4 graphs/page
plot(fit2)
The residuals vs fitted values plot seems to show a negative relationship between the residuals and the fitted values. This suggests that variables excluded from the model may account for this pattern. Hence, Model 2 may not be the better model.
layout(matrix(c(1,2,3,4),2,2)) # 4 graphs/page
plot(fit3)
Both the residuals vs fitted values plot and the normal Q-Q plot look fine.
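The visual impression from the Q-Q plot can be supplemented with a formal test of the residuals. This is a minimal sketch, assuming the Shapiro-Wilk test is an acceptable check here; it is not part of the original analysis, and `fit3_check` and `sw` are illustrative names.

```r
# Illustrative check, not part of the original analysis
data(mtcars)

# Refit Model 3 so the example is self-contained
fit3_check <- lm(mpg ~ disp + wt + qsec + am - 1, data = mtcars)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normal
sw <- shapiro.test(residuals(fit3_check))
print(sw)
```

A large p-value is consistent with normally distributed residuals, although with only 32 observations the test has limited power, so the plots remain the primary evidence.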
Therefore, based on the diagnostic plots and the model summaries, Model 3 is selected: it has a higher adjusted R-squared than Model 1 (0.9862 vs 0.9844) and includes fewer variables, which makes it more parsimonious.
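Adjusted R-squared is one criterion; the comparison can be cross-checked with AIC and, since Model 3 is nested in Model 1, with a partial F-test. Note that because all three models omit the intercept, R computes R-squared relative to zero rather than to the mean of mpg, so the reported values are comparable among these models but not with models that include an intercept. The sketch below refits the three models under the illustrative names `m1`, `m2`, and `m3`.

```r
# Illustrative comparison, not part of the original analysis
data(mtcars)

m1 <- lm(mpg ~ . - 1, data = mtcars)                     # Model 1
m2 <- lm(mpg ~ am + hp + wt - 1, data = mtcars)          # Model 2
m3 <- lm(mpg ~ disp + wt + qsec + am - 1, data = mtcars) # Model 3

# Lower AIC indicates a better trade-off between fit and complexity
aic_table <- AIC(m1, m2, m3)
print(aic_table)

# Partial F-test: do the extra terms in Model 1 improve on Model 3?
nested <- anova(m3, m1)
print(nested)
```

A non-significant F-test would indicate that the additional predictors in Model 1 do not explain meaningfully more variation than Model 3 already does.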
Let’s look at the coefficients of the fitted model again.
summary(fit3)$coefficients
##         Estimate  Std. Error   t value     Pr(>|t|)
## disp  0.01202006 0.008891454  1.351866 1.872383e-01
## wt   -4.61279456 1.158173236 -3.982819 4.400086e-04
## qsec  1.70550996 0.127485705 13.378049 1.099649e-13
## am    4.18085430 1.013616073  4.124692 3.005272e-04
The coefficient for am can be interpreted as the change in mean miles per gallon for a manual transmission car relative to an automatic transmission car, holding the other variables constant.
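This interpretation can be made concrete by predicting mpg for two cars that are identical except for transmission. The following is a sketch with two hypothetical cars whose disp, wt, and qsec are set to the sample means; `two_cars`, `pred`, and `gap` are illustrative names.

```r
# Illustrative sketch, not part of the original analysis
data(mtcars)

fit <- lm(mpg ~ disp + wt + qsec + am - 1, data = mtcars)

# Two hypothetical cars: same disp, wt and qsec, different transmission
two_cars <- data.frame(
  disp = mean(mtcars$disp),
  wt   = mean(mtcars$wt),
  qsec = mean(mtcars$qsec),
  am   = c(0, 1) # 0 = automatic, 1 = manual
)

pred <- predict(fit, newdata = two_cars)

# Predicted mpg gap between the manual and the automatic car
gap <- unname(pred[2] - pred[1])
print(gap)

# The gap matches the am coefficient
print(coef(fit)[["am"]])
```

Because the model is linear with no interaction terms, the gap is the same whatever values are chosen for the other three variables.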
Hence, a manual transmission car gets on average 4.18 more miles per gallon than an automatic transmission car, holding the other variables constant. Since the p-value (about 0.0003) is below 0.05, we conclude at the 5% significance level that manual transmission cars have higher miles per gallon than automatic transmission cars, ceteris paribus.
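A confidence interval conveys the uncertainty around the am estimate more fully than the p-value alone. A minimal sketch, assuming a 95% level to match the 5% significance level used above (`ci_am` is an illustrative name):

```r
# Illustrative sketch, not part of the original analysis
data(mtcars)

fit <- lm(mpg ~ disp + wt + qsec + am - 1, data = mtcars)

# 95% confidence interval for the am coefficient
ci_am <- confint(fit, parm = "am", level = 0.95)
print(ci_am)
```

From the estimate (4.18) and standard error (1.01) reported above, with 28 residual degrees of freedom, the interval works out to roughly 2.1 to 6.3 miles per gallon. It excludes zero, which is consistent with the significance test.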