Summary

This assignment explores the impact of transmission type (am) on the miles per gallon (mpg) of a car using the mtcars dataset. The data are loaded and explored with a boxplot, to visualize the relationship between mpg and am, and with a pairs plot, to visualize the relationships among all variables. Three models are then fitted to the data and their diagnostic plots are examined. Model 3 is selected; it includes am as the predictor of interest, with disp (displacement), wt (weight), and qsec (1/4 mile time) as confounding variables.
By interpreting the coefficients, we conclude that manual transmission cars have higher miles per gallon than automatic transmission cars, ceteris paribus.

Load the Data and R Packages

library(ggplot2)
library(GGally)
library(dplyr)
library(MASS)

data(mtcars)
      
mtcars2 <- mtcars # working copy, so the original data frame stays untouched

Exploratory Data Analysis

First, we look at the boxplot of Miles per gallon (mpg) vs Transmission (am).

mtcars2$transmission <- factor(mtcars$am, labels = c("Auto", "Manual"))
ggplot(data = mtcars2, aes(x = transmission, y = mpg, group = transmission, fill = transmission)) +
  geom_boxplot()

The relationship is apparent: manual transmission cars tend to have higher miles per gallon than automatic transmission cars.
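
To put numbers on what the boxplot shows, the group means can be computed directly. This is a minimal sketch in base R; the names `mpg_by_am` and `mean_diff` are introduced here for illustration and are not part of the original analysis.

```r
data(mtcars)

# Mean mpg for each transmission type (am: 0 = automatic, 1 = manual)
mpg_by_am <- aggregate(mpg ~ am, data = mtcars, FUN = mean)
print(mpg_by_am)

# Difference in group means: manual minus automatic
mean_diff <- mpg_by_am$mpg[mpg_by_am$am == 1] - mpg_by_am$mpg[mpg_by_am$am == 0]
print(mean_diff)
```

This raw difference ignores every other variable, which is why the models that follow adjust for possible confounders.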

Next, we draw a pairs plot to look at the relationships between the variables.

ggpairs(data = mtcars)

Based on these plots, we suggest three models.
Model 1: Include all predictors.
Model 2: mpg ~ am + wt + hp - 1 **
Model 3: Use a pre-built stepwise algorithm to determine the model.

**Chosen because wt and hp have relatively low correlation with am, and they have an apparent linear relationship with the outcome mpg.
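
The pairs plot shows these correlations visually, but the exact values are not reported in the text. As a quick numeric check of the reasoning in the footnote, they can be pulled out with `cor()`. A small sketch, where `vars` and `cor_mat` are names chosen here for illustration:

```r
data(mtcars)

# Pairwise Pearson correlations among the outcome and the Model 2 predictors
vars <- c("mpg", "am", "wt", "hp")
cor_mat <- cor(mtcars[, vars])
print(round(cor_mat, 2))
```

Reading down the `am` column shows how strongly each candidate confounder is associated with transmission type, and the `mpg` column shows its association with the outcome.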

Model Selection

Model 1:

fit1 <- lm(formula = mpg ~ .-1, data = mtcars)
summary(fit1)

Call:
lm(formula = mpg ~ . - 1, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7721 -1.6249  0.1699  1.1068  4.4666 

Coefficients:
     Estimate Std. Error t value Pr(>|t|)  
cyl   0.35083    0.76292   0.460   0.6501  
disp  0.01354    0.01762   0.768   0.4504  
hp   -0.02055    0.02144  -0.958   0.3483  
drat  1.24158    1.46277   0.849   0.4051  
wt   -3.82613    1.86238  -2.054   0.0520 .
qsec  1.19140    0.45942   2.593   0.0166 *
vs    0.18972    2.06825   0.092   0.9277  
am    2.83222    1.97513   1.434   0.1656  
gear  1.05426    1.34669   0.783   0.4421  
carb -0.26321    0.81236  -0.324   0.7490  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.616 on 22 degrees of freedom
Multiple R-squared:  0.9893,    Adjusted R-squared:  0.9844 
F-statistic:   203 on 10 and 22 DF,  p-value: < 2.2e-16

Model 2: am, hp, wt

fit2 <- lm(formula = mpg ~ am+hp+wt-1, data = mtcars)
summary(fit2)

Call:
lm(formula = mpg ~ am + hp + wt - 1, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.111  -3.553   1.554   5.723  11.581 

Coefficients:
   Estimate Std. Error t value Pr(>|t|)    
am 16.35773    2.10496   7.771 1.43e-08 ***
hp -0.07726    0.02350  -3.288  0.00265 ** 
wt  7.39707    1.09971   6.726 2.22e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.556 on 29 degrees of freedom
Multiple R-squared:  0.9112,    Adjusted R-squared:  0.9021 
F-statistic: 99.25 on 3 and 29 DF,  p-value: 2.366e-15

Model 3: stepwise

# trace = 0 suppresses the (very long) step-by-step output
step_fit <- stepAIC(fit1, direction = "both", trace = 0)

#stepwise algorithm suggests this formula: mpg ~ disp + wt + qsec + am - 1
fit3 <- lm(formula = mpg ~ disp + wt + qsec + am - 1, data = mtcars)
summary(fit3)

Call:
lm(formula = mpg ~ disp + wt + qsec + am - 1, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7169 -1.4638 -0.5382  1.7825  4.3566 

Coefficients:
      Estimate Std. Error t value Pr(>|t|)    
disp  0.012020   0.008891   1.352 0.187238    
wt   -4.612795   1.158173  -3.983 0.000440 ***
qsec  1.705510   0.127486  13.378  1.1e-13 ***
am    4.180854   1.013616   4.125 0.000301 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.462 on 28 degrees of freedom
Multiple R-squared:  0.9879,    Adjusted R-squared:  0.9862 
F-statistic: 572.1 on 4 and 28 DF,  p-value: < 2.2e-16

The stepwise procedure retains transmission (am) in the model, so its effect can be estimated directly from Model 3.
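
Rather than copying the formula from the printed stepwise output by hand, it can be read from the object that `stepAIC` returns. The following is a sketch under the same setup as above (full no-intercept model as the starting point); `full_fit` and `selected` are illustrative names, not objects from the original analysis.

```r
library(MASS)
data(mtcars)

# Refit the full no-intercept model so this snippet runs on its own
full_fit <- lm(mpg ~ . - 1, data = mtcars)

# trace = 0 keeps stepAIC from printing every intermediate step
selected <- stepAIC(full_fit, direction = "both", trace = 0)

# Formula of the final model, and the sequence of terms dropped or added
print(formula(selected))
print(selected$anova)
```

Because `selected` is itself a fitted `lm` object, `summary(selected)` can be used in place of refitting the model from a hand-typed formula.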

Diagnostic Plots

Model 1:

layout(matrix(c(1,2,3,4),2,2)) # 4 graphs/page 
plot(fit1) 

The residuals vs fitted values plot shows no obvious pattern, which suggests that the included variables explain the variation in the outcome quite well. The normal Q-Q plot looks fine too: most of the standardized residuals lie close to the 45-degree line.

Model 2:

layout(matrix(c(1,2,3,4),2,2)) # 4 graphs/page 
plot(fit2)

The residuals vs fitted values plot seems to show a negative relationship between the residuals and the fitted values. This suggests that the model is missing structure that some of the excluded variables may explain. Hence, Model 2 is unlikely to be the better model.
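
One way to put a number on the pattern seen in the plot is to compute the sample correlation between residuals and fitted values. The sketch below refits Model 2 under the name `fit2_check` (an illustrative name) so that it runs on its own. It also prints the mean residual, on the assumption that this is worth checking: in a model without an intercept, the residuals are not constrained to average zero.

```r
data(mtcars)

# Model 2 as specified in the text (no intercept)
fit2_check <- lm(mpg ~ am + hp + wt - 1, data = mtcars)

res <- resid(fit2_check)
fit_vals <- fitted(fit2_check)

# Without an intercept, residuals are not forced to average zero
print(mean(res))

# Sample correlation between residuals and fitted values;
# a clearly negative value would match the downward trend in the plot
print(cor(fit_vals, res))
```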

Model 3:

layout(matrix(c(1,2,3,4),2,2)) # 4 graphs/page 
plot(fit3)

Both the residuals vs fitted values plot and the normal Q-Q plot look fine.

Therefore, based on the diagnostic plots and the model summaries, Model 3 is selected: it has a higher adjusted R-squared than Model 1 (0.9862 vs 0.9844) and includes fewer variables, which makes it more parsimonious.
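
The comparison above can be collected into a single table. This is a minimal sketch that refits the three models as specified in the text; the names `fits` and `comparison` are introduced here for illustration.

```r
data(mtcars)

# Refit the three candidate models exactly as specified in the text
fits <- list(
  model1 = lm(mpg ~ . - 1, data = mtcars),
  model2 = lm(mpg ~ am + hp + wt - 1, data = mtcars),
  model3 = lm(mpg ~ disp + wt + qsec + am - 1, data = mtcars)
)

# One row per model: number of coefficients, adjusted R-squared, residual SE
comparison <- data.frame(
  n_coef = sapply(fits, function(f) length(coef(f))),
  adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
  sigma  = sapply(fits, function(f) summary(f)$sigma)
)
print(round(comparison, 4))
```

Note that for models without an intercept, R computes R-squared relative to an uncentered total sum of squares, so these values are comparable among the three models here but not with models that include an intercept.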

Conclusion

Let’s look at the coefficients of the fitted model again.

summary(fit3)$coefficients
##         Estimate  Std. Error   t value     Pr(>|t|)
## disp  0.01202006 0.008891454  1.351866 1.872383e-01
## wt   -4.61279456 1.158173236 -3.982819 4.400086e-04
## qsec  1.70550996 0.127485705 13.378049 1.099649e-13
## am    4.18085430 1.013616073  4.124692 3.005272e-04

The coefficient for am can be interpreted as the change in mean miles per gallon for a manual transmission car relative to an automatic transmission car, while keeping the other variables constant.
Hence, a manual transmission car gets about 4.18 more miles per gallon on average than an automatic transmission car, while keeping the other variables constant. Since the p-value (about 0.0003) is less than 5%, we conclude at the 5% significance level that manual transmission cars have higher miles per gallon than automatic transmission cars, ceteris paribus.
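
The "other variables constant" reading can be illustrated by predicting mpg for two cars that differ only in transmission type, and the uncertainty around the estimate can be summarized with a confidence interval. This is a sketch, not part of the original analysis: `fit3_check`, `two_cars`, and `am_ci` are illustrative names, and the choice of median values for the other predictors is an arbitrary assumption.

```r
data(mtcars)

# Model 3 as specified in the text (no intercept)
fit3_check <- lm(mpg ~ disp + wt + qsec + am - 1, data = mtcars)

# Two hypothetical cars, identical except for transmission type
two_cars <- data.frame(
  disp = rep(median(mtcars$disp), 2),
  wt   = rep(median(mtcars$wt), 2),
  qsec = rep(median(mtcars$qsec), 2),
  am   = c(0, 1)
)

# The gap between the two predictions equals the am coefficient
pred <- predict(fit3_check, newdata = two_cars)
print(unname(pred[2] - pred[1]))

# 95% confidence interval for the am coefficient
am_ci <- confint(fit3_check, "am", level = 0.95)
print(am_ci)
```

Because the model is linear with no interaction terms, the predicted gap is the same whatever values are chosen for disp, wt, and qsec.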