Chapter 3 Complex ANOVA and Multiple Linear Regression
In the One-Way ANOVA and simple linear regression models, there was only one variable - categorical for ANOVA and continuous for simple linear regression - to explain and predict our target variable. Rarely do we believe that a single variable will suffice in predicting a variable of interest. In this Chapter we will generalize these models to the \(n\)-Way ANOVA and multiple linear regression models. These models contain multiple variables to explain and predict our target variable.
This Chapter aims to answer the following questions:
- How do we include multiple variables in ANOVA?
- Exploration
- Assumptions
- Predictions
- What is an interaction between two predictor variables?
- Interpretation
- Evaluation
- Within Category Effects
- What is blocking in ANOVA?
- Nuisance Factors
- Differences Between Blocking and Two-Way ANOVA
- How do we include multiple variables in regression?
- Model Structure
- Global & Local Inference
- Assumptions
- Adjusted \(R^2\)
- Categorical Variables in Regression
3.1 Two-Way ANOVA
In the One-Way ANOVA model, we used a single categorical predictor variable with \(k\) levels to predict our continuous target variable. Now we will generalize this model to include \(n\) categorical variables, each with its own number of levels (\(k_1, k_2, ..., k_n\)).
3.1.1 Exploration
Let’s use the basic example of two categorical predictor variables in a Two-Way ANOVA. Previously, we talked about using heating quality as a factor to explain and predict sale price of homes in Ames, Iowa. Now, we also consider whether the home has central air. Although similar in nature, these two factors potentially provide important, unique pieces of information about the home. Similar to previous data science problems, let us first explore our variables and their potential relationships.
Now that we have two variables that we will use to explain and predict sale price, here are some summary statistics (mean, standard deviation, maximum, minimum, and count) for each combination of categories. We will use the group_by
function on both predictor variables of interest to split the data and then the summarise
function to calculate the metrics we are interested in.
train %>%
group_by(Heating_QC, Central_Air) %>%
dplyr::summarise(mean = mean(Sale_Price),
sd = sd(Sale_Price),
max = max(Sale_Price),
min = min(Sale_Price),
n = n())
## `summarise()` has grouped output by 'Heating_QC'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 7
## # Groups: Heating_QC [5]
## Heating_QC Central_Air mean sd max min n
## <ord> <fct> <dbl> <dbl> <int> <int> <int>
## 1 Poor N 50050 52255. 87000 13100 2
## 2 Poor Y 107000 NA 107000 107000 1
## 3 Fair N 84748. 28267. 158000 37900 29
## 4 Fair Y 145165. 38624. 230000 50000 36
## 5 Typical N 103469. 34663. 209500 12789 82
## 6 Typical Y 142003. 39657. 375000 60000 527
## 7 Good N 110811. 38455. 214500 59000 23
## 8 Good Y 160113. 54158. 415000 52000 318
## 9 Excellent N 115062. 33271. 184900 64000 11
## 10 Excellent Y 216401. 88518. 745000 58500 1022
We can already see above that there appear to be some differences in average sale price across the categories overall. Within each grouping of heating quality, homes with central air appear to have a higher sale price than homes without. Also, similar to before, homes with higher heating quality appear to have higher sale prices compared to homes with lower heating quality. Be careful about sample sizes in these situations though!
We also see these relationships in the bar chart in Figure 3.1.
CA_heat <- train %>%
group_by(Heating_QC, Central_Air) %>%
dplyr::summarise(mean = mean(Sale_Price),
sd = sd(Sale_Price),
max = max(Sale_Price),
min = min(Sale_Price),
n = n())
## `summarise()` has grouped output by 'Heating_QC'. You can override using the
## `.groups` argument.
ggplot(data = CA_heat, aes(x = Heating_QC, y = mean/1000, fill = Central_Air)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(y = "Sales Price (Thousands $)", x = "Heating Quality Category") +
scale_fill_brewer(palette = "Paired") +
theme_minimal()
As before, visually inspecting bar charts and mean calculations only goes so far. We need statistical evidence of any differences that exist in average sale price across categories.
3.1.2 Model
We are going to do this with the same approach as in the One-Way ANOVA, just with more variables as shown in the following equation:
\[ Y_{ijk} = \mu + \alpha_i + \beta_j + \varepsilon_{ijk} \] where \(\mu\) is the average baseline sales price of a home in Ames, Iowa, \(\alpha_i\) is the variable representing the impacts of the levels of heating quality, and \(\beta_j\) is the variable representing the impacts of the levels of central air. As mentioned previously, the unexplained error in this model is represented as \(\varepsilon_{ijk}\).
The same F test approach is also used, just for each one of the variables. Each variable’s test has a null hypothesis assuming all categories have the same mean. The alternative for each test is that at least one category’s mean is different.
Let’s view the results of the aov
function.
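The code that produced this output is not echoed here, so the sketch below is a minimal reconstruction; the formula matches the fit reported by the TukeyHSD output later in this section, while the object name ames_aov is our own placeholder.
ames_aov <- aov(Sale_Price ~ Heating_QC + Central_Air, data = train)
summary(ames_aov)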
## Df Sum Sq Mean Sq F value Pr(>F)
## Heating_QC 4 2.891e+12 7.228e+11 147.60 < 2e-16 ***
## Central_Air 1 2.903e+11 2.903e+11 59.28 2.11e-14 ***
## Residuals 2045 1.002e+13 4.897e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the above results, we have low p-values for each variable's F test. Heating quality has 4 degrees of freedom, derived from its 5 categories \((4 = 5-1)\). Similarly, central air's 2 categories produce 1 \((= 2-1)\) degree of freedom. The F values are calculated exactly as described before, with the mean square for each variable divided by the mean square error. Based on these tests, at least one category in each variable is statistically different from the rest.
3.1.3 Post-Hoc Testing
As with the One-Way ANOVA, the next logical question is which of these categories is different. We will use the same post-hoc tests as before with the TukeyHSD
function.
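As a minimal sketch (again assuming the fitted ANOVA object is stored under the placeholder name ames_aov from above):
TukeyHSD(ames_aov)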
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sale_Price ~ Heating_QC + Central_Air, data = train)
##
## $Heating_QC
## diff lwr upr p adj
## Fair-Poor 49176.42 -63650.448 162003.29 0.7571980
## Typical-Poor 67781.01 -42800.320 178362.35 0.4506761
## Good-Poor 87753.89 -23040.253 198548.03 0.1945181
## Excellent-Poor 146288.89 35818.859 256758.92 0.0028361
## Typical-Fair 18604.59 -6326.425 43535.61 0.2484556
## Good-Fair 38577.47 12718.894 64436.04 0.0004622
## Excellent-Fair 97112.47 72679.867 121545.07 0.0000000
## Good-Typical 19972.87 7050.230 32895.52 0.0002470
## Excellent-Typical 78507.88 68746.678 88269.07 0.0000000
## Excellent-Good 58535.00 46602.229 70467.78 0.0000000
##
## $Central_Air
## diff lwr upr p adj
## Y-N 43256.57 31508.27 55004.87 0
Starting with the variable for central air, we can see there is a statistical difference between the two categories. This is the exact same result as the overall F test for the variable since there are only two categories. For the heating quality variable, we can see some categories are different from each other, while others are not. Noticeably, the comparisons of poor with the fair, typical, and good categories are not statistically different. Notice also the different widths of these confidence intervals due to the different combinations of sample sizes for the categories being tested.
3.1.4 Python Code
Two-way ANOVA model
import statsmodels.api as sm
import statsmodels.formula.api as smf

train = r.train
model = smf.ols("Sale_Price ~ C(Heating_QC) + C(Central_Air)", data = train).fit()
model.summary()
Dep. Variable: | Sale_Price | R-squared: | 0.241 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.239 |
Method: | Least Squares | F-statistic: | 129.9 |
Date: | Tue, 18 Jun 2024 | Prob (F-statistic): | 9.06e-120 |
Time: | 15:57:25 | Log-Likelihood: | -25788. |
No. Observations: | 2051 | AIC: | 5.159e+04 |
Df Residuals: | 2045 | BIC: | 5.162e+04 |
Df Model: | 5 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 5.264e+04 | 4.05e+04 | 1.301 | 0.193 | -2.67e+04 | 1.32e+05 |
C(Heating_QC)[T.Fair] | 3.833e+04 | 4.13e+04 | 0.927 | 0.354 | -4.28e+04 | 1.19e+05 |
C(Heating_QC)[T.Typical] | 4.161e+04 | 4.06e+04 | 1.024 | 0.306 | -3.81e+04 | 1.21e+05 |
C(Heating_QC)[T.Good] | 5.828e+04 | 4.08e+04 | 1.430 | 0.153 | -2.17e+04 | 1.38e+05 |
C(Heating_QC)[T.Excellent] | 1.14e+05 | 4.07e+04 | 2.803 | 0.005 | 3.42e+04 | 1.94e+05 |
C(Central_Air)[T.Y] | 4.918e+04 | 6387.850 | 7.700 | 0.000 | 3.67e+04 | 6.17e+04 |
Omnibus: | 806.993 | Durbin-Watson: | 1.995 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 4426.691 |
Skew: | 1.778 | Prob(JB): | 0.00 |
Kurtosis: | 9.258 | Cond. No. | 88.5 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## sum_sq df F PR(>F)
## C(Heating_QC) 2.220229e+12 4.0 113.339043 2.254858e-87
## C(Central_Air) 2.903304e+11 1.0 59.283540 2.110570e-14
## Residual 1.001502e+13 2045.0 NaN NaN
Post-hoc testing
from statsmodels.stats.multicomp import pairwise_tukeyhsd
train['anova_combo'] = train.Heating_QC.astype('string') + " / " + train.Central_Air
mcomp = pairwise_tukeyhsd(endog = train['Sale_Price'], groups = train['anova_combo'], alpha = 0.05)
mcomp.summary()
group1 | group2 | meandiff | p-adj | lower | upper | reject |
---|---|---|---|---|---|---|
Excellent / N | Excellent / Y | 101339.6201 | 0.0001 | 34220.4846 | 168458.7556 | True |
Excellent / N | Fair / N | -30313.4514 | 0.9685 | -108720.0544 | 48093.1515 | False |
Excellent / N | Fair / Y | 30103.1061 | 0.9641 | -46178.4107 | 106384.6228 | False |
Excellent / N | Good / N | -4250.8577 | 1.0 | -85421.1634 | 76919.448 | False |
Excellent / N | Good / Y | 45050.8262 | 0.5268 | -22854.8425 | 112956.4949 | False |
Excellent / N | Poor / N | -65011.7273 | 0.971 | -235219.081 | 105195.6264 | False |
Excellent / N | Poor / Y | -8061.7273 | 1.0 | -239327.9804 | 223204.5259 | False |
Excellent / N | Typical / N | -11592.5078 | 1.0 | -82690.3122 | 59505.2966 | False |
Excellent / N | Typical / Y | 26941.045 | 0.9612 | -40512.9182 | 94395.0082 | False |
Excellent / Y | Fair / N | -131653.0715 | 0.0 | -173349.1196 | -89957.0234 | True |
Excellent / Y | Fair / Y | -71236.514 | 0.0 | -108784.2815 | -33688.7466 | True |
Excellent / Y | Good / N | -105590.4778 | 0.0 | -152276.4883 | -58904.4673 | True |
Excellent / Y | Good / Y | -56288.7939 | 0.0 | -70506.5603 | -42071.0275 | True |
Excellent / Y | Poor / N | -166351.3474 | 0.0272 | -323072.4637 | -9630.2311 | True |
Excellent / Y | Poor / Y | -109401.3474 | 0.8653 | -330930.2278 | 112127.5331 | False |
Excellent / Y | Typical / N | -112932.1278 | 0.0 | -138345.9594 | -87518.2963 | True |
Excellent / Y | Typical / Y | -74398.5751 | 0.0 | -86273.0096 | -62524.1405 | True |
Fair / N | Fair / Y | 60416.5575 | 0.0194 | 5167.5586 | 115665.5563 | True |
Fair / N | Good / N | 26062.5937 | 0.9455 | -35761.3548 | 87886.5422 | False |
Fair / N | Good / Y | 75364.2776 | 0.0 | 32413.5859 | 118314.9693 | True |
Fair / N | Poor / N | -34698.2759 | 0.9996 | -196575.1591 | 127178.6074 | False |
Fair / N | Poor / Y | 22251.7241 | 1.0 | -202954.0971 | 247457.5454 | False |
Fair / N | Typical / N | 18720.9437 | 0.966 | -29117.1151 | 66559.0024 | False |
Fair / N | Typical / Y | 57254.4964 | 0.0008 | 15021.58 | 99487.4129 | True |
Fair / Y | Good / N | -34353.9638 | 0.7089 | -93459.5897 | 24751.6622 | False |
Fair / Y | Good / Y | 14947.7201 | 0.97 | -23988.5911 | 53884.0313 | False |
Fair / Y | Poor / N | -95114.8333 | 0.6877 | -255973.1553 | 65743.4886 | False |
Fair / Y | Poor / Y | -38164.8333 | 0.9999 | -262639.6345 | 186309.9678 | False |
Fair / Y | Typical / N | -41695.6138 | 0.0849 | -85964.7259 | 2573.4982 | False |
Fair / Y | Typical / Y | -3162.061 | 1.0 | -41305.1291 | 34981.0071 | False |
Good / N | Good / Y | 49301.6839 | 0.0369 | 1491.7993 | 97111.5685 | True |
Good / N | Poor / N | -60760.8696 | 0.9755 | -223994.2867 | 102472.5476 | False |
Good / N | Poor / Y | -3810.8696 | 1.0 | -229993.7272 | 222371.988 | False |
Good / N | Typical / N | -7341.6501 | 1.0 | -59586.2959 | 44902.9957 | False |
Good / N | Typical / Y | 31191.9027 | 0.5315 | -15974.212 | 78358.0174 | False |
Good / Y | Poor / N | -110062.5535 | 0.4435 | -267122.1275 | 46997.0205 | False |
Good / Y | Poor / Y | -53112.5535 | 0.9991 | -274881.0056 | 168655.8987 | False |
Good / Y | Typical / N | -56643.3339 | 0.0 | -84067.1251 | -29219.5428 | True |
Good / Y | Typical / Y | -18109.7812 | 0.0101 | -33832.4936 | -2387.0688 | True |
Poor / N | Poor / Y | 56950.0 | 0.9997 | -214233.7196 | 328133.7196 | False |
Poor / N | Typical / N | 53419.2195 | 0.9877 | -105046.6372 | 211885.0762 | False |
Poor / N | Typical / Y | 91952.7723 | 0.6985 | -64912.0329 | 248817.5775 | False |
Poor / Y | Typical / N | -3530.7805 | 1.0 | -226297.3945 | 219235.8336 | False |
Poor / Y | Typical / Y | 35002.7723 | 1.0 | -186627.7845 | 256633.3291 | False |
Typical / N | Typical / Y | 38533.5528 | 0.0002 | 12248.1645 | 64818.941 | True |
3.2 Two-Way ANOVA with Interactions
What if the relationship between a predictor and target variable changed depending on the value of another predictor variable? For our example, we would say that the average difference in sale price between having central air and not having central air changed depending on what level of heating quality the home had. In the bar chart in Figure 3.1, a potential interaction effect is displayed when the difference between the two bars within each heating quality category changes across heating quality categories. If the difference were the same, then there would be no interaction present. In other words, no matter the heating quality rating of the home, the average difference in sale price between having central air and not having central air would be the same.
This interaction model is represented as follows:
\[ Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha \beta)_{ij} + \varepsilon_{ijk} \]
with the interaction effect, \((\alpha \beta)_{ij}\), as the multiplication of the two variables involved in the interaction. Interactions can occur between more than two variables as well. Interactions are good to evaluate as they can mask the effects of individual variables. For example, imagine a hypothetical example as shown in Figure 3.2 where the impact of having central air is opposite depending on which category of heating quality you have.
fake_HQ <- c("Poor", "Poor", "Excellent", "Excellent")
fake_CA <- c("N", "Y", "N", "Y")
fake_mean <- c(100, 150, 150, 100)
fake_df <- as.data.frame(cbind(fake_HQ, fake_CA, fake_mean))
ggplot(data = fake_df, aes(x = fake_HQ, y = as.numeric(fake_mean), fill = fake_CA)) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(y = "Fake Sales Price (Thousands $)", x = "Fake Heating Quality Category") +
scale_fill_brewer(palette = "Paired") +
theme_minimal()
If you were to only look at the average sales price across heating quality, you would notice no difference between the two groups (the average for both heating categories is 125). However, when the interaction is accounted for, you can clearly see from the bar heights that sales price differs depending on the value of central air.
Let’s evaluate the interaction term in our actual data. To do so, we just incorporate the multiplication of the two variables in the model statement by using the formula Sale_Price ~ Heating_QC + Central_Air + Heating_QC:Central_Air
. You could also use the shorthand version of this by using the formula Sale_Price ~ Heating_QC*Central_Air
. The *
will include both main effects (the individual variables) and the interaction between them.
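A minimal sketch of this fit is shown below; the formula matches the fit reported in the TukeyHSD output later in this section, and the object name ames_aov2 is a placeholder of our own.
ames_aov2 <- aov(Sale_Price ~ Heating_QC*Central_Air, data = train)
summary(ames_aov2)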
## Df Sum Sq Mean Sq F value Pr(>F)
## Heating_QC 4 2.891e+12 7.228e+11 147.897 < 2e-16 ***
## Central_Air 1 2.903e+11 2.903e+11 59.403 1.99e-14 ***
## Heating_QC:Central_Air 4 3.972e+10 9.930e+09 2.032 0.0875 .
## Residuals 2041 9.975e+12 4.887e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As seen by the output above, the interaction effect between heating quality and central air is not significant at the 0.05 level. Again, this implies that the average difference in sale price of the home between having central air and not does not differ depending on which category of heating quality the home belongs to. If our interaction was significant (say a 0.02 p-value instead) then we would keep it in our model, but here we would remove the interaction term from our model and rerun the analysis.
3.2.1 Post-Hoc Testing
Post-hoc tests are also available for interaction models as well to determine where the statistical differences exist in all the combinations of possible categories. We evaluate these post-hoc tests using the same TukeyHSD
function and its corresponding plot
element.
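A minimal sketch, again assuming the interaction model object is named ames_aov2 as in the placeholder above:
tukey_int <- TukeyHSD(ames_aov2)
tukey_int        # tables of pairwise differences and adjusted p-values
plot(tukey_int)  # confidence interval plots for each set of comparisons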
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sale_Price ~ Heating_QC * Central_Air, data = train)
##
## $Heating_QC
## diff lwr upr p adj
## Fair-Poor 49176.42 -63536.973 161889.81 0.7565086
## Typical-Poor 67781.01 -42689.104 178251.13 0.4496163
## Good-Poor 87753.89 -22928.822 198436.60 0.1936558
## Excellent-Poor 146288.89 35929.963 256647.82 0.0027979
## Typical-Fair 18604.59 -6301.351 43510.54 0.2474998
## Good-Fair 38577.47 12744.901 64410.03 0.0004543
## Excellent-Fair 97112.47 72704.440 121520.50 0.0000000
## Good-Typical 19972.87 7063.227 32882.52 0.0002425
## Excellent-Typical 78507.88 68756.495 88259.26 0.0000000
## Excellent-Good 58535.00 46614.230 70455.77 0.0000000
##
## $Central_Air
## diff lwr upr p adj
## Y-N 43256.57 31520.09 54993.05 0
##
## $`Heating_QC:Central_Air`
## diff lwr upr p adj
## Fair:N-Poor:N 34698.276 -127178.615 196575.17 0.9996249
## Typical:N-Poor:N 53419.220 -105046.645 211885.08 0.9876643
## Good:N-Poor:N 60760.870 -102472.555 223994.29 0.9755371
## Excellent:N-Poor:N 65011.727 -105195.635 235219.09 0.9709555
## Poor:Y-Poor:N 56950.000 -214233.733 328133.73 0.9996829
## Fair:Y-Poor:N 95114.833 -65743.496 255973.16 0.6876708
## Typical:Y-Poor:N 91952.772 -64912.040 248817.58 0.6985070
## Good:Y-Poor:N 110062.553 -46997.028 267122.13 0.4435353
## Excellent:Y-Poor:N 166351.347 9630.224 323072.47 0.0271785
## Typical:N-Fair:N 18720.944 -29117.117 66559.00 0.9659507
## Good:N-Fair:N 26062.594 -35761.358 87886.55 0.9454988
## Excellent:N-Fair:N 30313.451 -48093.155 108720.06 0.9685428
## Poor:Y-Fair:N 22251.724 -202954.108 247457.56 0.9999995
## Fair:Y-Fair:N 60416.557 5167.556 115665.56 0.0193847
## Typical:Y-Fair:N 57254.496 15021.578 99487.41 0.0007697
## Good:Y-Fair:N 75364.278 32413.584 118314.97 0.0000014
## Excellent:Y-Fair:N 131653.071 89957.021 173349.12 0.0000000
## Good:N-Typical:N 7341.650 -44902.998 59586.30 0.9999894
## Excellent:N-Typical:N 11592.508 -59505.300 82690.32 0.9999620
## Poor:Y-Typical:N 3530.780 -219235.844 226297.41 1.0000000
## Fair:Y-Typical:N 41695.614 -2573.500 85964.73 0.0848888
## Typical:Y-Typical:N 38533.553 12248.163 64818.94 0.0001576
## Good:Y-Typical:N 56643.334 29219.541 84067.13 0.0000000
## Excellent:Y-Typical:N 112932.128 87518.295 138345.96 0.0000000
## Excellent:N-Good:N 4250.858 -76919.452 85421.17 1.0000000
## Poor:Y-Good:N -3810.870 -229993.738 222372.00 1.0000000
## Fair:Y-Good:N 34353.964 -24751.665 93459.59 0.7089196
## Typical:Y-Good:N 31191.903 -15974.214 78358.02 0.5315494
## Good:Y-Good:N 49301.684 1491.797 97111.57 0.0369201
## Excellent:Y-Good:N 105590.478 58904.465 152276.49 0.0000000
## Poor:Y-Excellent:N -8061.727 -239327.991 223204.54 1.0000000
## Fair:Y-Excellent:N 30103.106 -46178.414 106384.63 0.9640522
## Typical:Y-Excellent:N 26941.045 -40512.921 94395.01 0.9611660
## Good:Y-Excellent:N 45050.826 -22854.846 112956.50 0.5267698
## Excellent:Y-Excellent:N 101339.620 34220.481 168458.76 0.0000809
## Fair:Y-Poor:Y 38164.833 -186309.978 262639.65 0.9999458
## Typical:Y-Poor:Y 35002.772 -186627.795 256633.34 0.9999711
## Good:Y-Poor:Y 53112.553 -168655.909 274881.02 0.9990789
## Excellent:Y-Poor:Y 109401.347 -112127.544 330930.24 0.8652766
## Typical:Y-Fair:Y -3162.061 -41305.131 34981.01 0.9999999
## Good:Y-Fair:Y 14947.720 -23988.593 53884.03 0.9699661
## Excellent:Y-Fair:Y 71236.514 33688.745 108784.28 0.0000001
## Good:Y-Typical:Y 18109.781 2387.068 33832.49 0.0101482
## Excellent:Y-Typical:Y 74398.575 62524.140 86273.01 0.0000000
## Excellent:Y-Good:Y 56288.794 42071.027 70506.56 0.0000000
In the large table above, as well as the confidence interval plots, you can inspect which combinations of categories are statistically different.
With interactions present in ANOVA models, post-hoc tests can become overwhelming when trying to find where differences exist. To help guide the exploration of post-hoc tests with interactions, we can use slicing. Slicing is when you perform a One-Way ANOVA on subsets of the data defined by the categories of another variable. Even though the interaction in our model was not significant, let's imagine that it was for the sake of example. To help discover differences in the interaction between central air and heating quality, we could subset the data into two groups - homes with central air and homes without. Within these two groups we perform a One-Way ANOVA across heating quality to find where differences might exist within subgroups.
This can easily be done with the group_by
function to subset the data. The nest
and mutate
functions are also used to apply the aov
function to each subgroup. Here we run a One-Way ANOVA for heating quality within each subset of central air being present or not.
CA_aov <- train %>%
group_by(Central_Air) %>%
nest() %>%
mutate(aov = map(data, ~summary(aov(Sale_Price ~ Heating_QC, data = .x))))
CA_aov
## # A tibble: 2 × 3
## # Groups: Central_Air [2]
## Central_Air data aov
## <fct> <list> <list>
## 1 Y <tibble [1,904 × 81]> <summry.v>
## 2 N <tibble [147 × 81]> <summry.v>
## [[1]]
## Df Sum Sq Mean Sq F value Pr(>F)
## Heating_QC 4 2.242e+12 5.606e+11 108.5 <2e-16 ***
## Residuals 1899 9.809e+12 5.165e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## [[2]]
## Df Sum Sq Mean Sq F value Pr(>F)
## Heating_QC 4 1.774e+10 4.435e+09 3.793 0.00582 **
## Residuals 142 1.660e+11 1.169e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that both of these results are significant at the 0.05 level. This implies that there are statistical differences in average sale price across heating quality within homes that have central air as well as those that don’t have central air.
3.2.2 Assumptions
The assumptions for the \(n\)-Way ANOVA are the same as with the One-Way ANOVA - independence of observations, normality for each category of variable, and equal variances. With the inclusion of two or more variables (\(n\)-Way ANOVA with \(n > 1\)), these assumptions can be harder to evaluate and test. These assumptions are then applied to the residuals of the model.
For equal variances, we can still apply the Levene Test for equal variances using the same leveneTest
function as in Chapter 2.
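A minimal sketch of that call is below, assuming the car package (which provides leveneTest) is loaded; the groups are formed from every combination of heating quality and central air, which matches the 9 group degrees of freedom (10 groups minus 1) in the output that follows.
library(car)
leveneTest(Sale_Price ~ Heating_QC*Central_Air, data = train)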
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 9 24.17 < 2.2e-16 ***
## 2041
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From this test, we can see that we do not meet our assumption of equal variance.
Let's explore the normality assumption. Again, we will assume this normality on the random error component, \(\varepsilon_{ijk}\), of the ANOVA model. More details on diagnostic testing using the error component are found in the Diagnostics chapter. We can check normality using the same approaches of the QQ-plot or statistical testing as in the section on EDA. Here we will use the plot
function on the aov
object. Specifically, we want the second plot which is why we have a 2
in the plot
function option. We then use the shapiro.test
function on the error component. The estimate of the error component is calculated using the residuals
function.
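A minimal sketch of these steps is below. The output that follows names the residual vector ames_res; the fitted aov object name (here ames_aov2, the interaction model placeholder from earlier) is our own assumption.
plot(ames_aov2, 2)                 # second diagnostic plot: QQ-plot of the residuals
ames_res <- residuals(ames_aov2)   # estimate of the error component
shapiro.test(ames_res)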
##
## Shapiro-Wilk normality test
##
## data: ames_res
## W = 0.8838, p-value < 2.2e-16
Neither the normality nor the equal variance assumption appears to be met here. This would be a good scenario for a non-parametric approach. Unfortunately, the Kruskal-Wallis approach is not applicable to the \(n\)-Way ANOVA where \(n > 1\). Situations like this would require non-parametric versions of multiple regression models to account for them.
3.2.3 Python Code
Two-way ANOVA with interactions
train=r.train
import matplotlib.pyplot as plt
import scipy as sp
import pandas as pd
import numpy as np
model = smf.ols("Sale_Price ~ C(Heating_QC)*C(Central_Air)", data = train).fit()
model.summary()
Dep. Variable: | Sale_Price | R-squared: | 0.244 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.241 |
Method: | Least Squares | F-statistic: | 73.24 |
Date: | Tue, 18 Jun 2024 | Prob (F-statistic): | 1.99e-117 |
Time: | 15:57:28 | Log-Likelihood: | -25784. |
No. Observations: | 2051 | AIC: | 5.159e+04 |
Df Residuals: | 2041 | BIC: | 5.164e+04 |
Df Model: | 9 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 5.005e+04 | 4.94e+04 | 1.012 | 0.311 | -4.69e+04 | 1.47e+05 |
C(Heating_QC)[T.Fair] | 3.47e+04 | 5.11e+04 | 0.679 | 0.497 | -6.55e+04 | 1.35e+05 |
C(Heating_QC)[T.Typical] | 5.342e+04 | 5e+04 | 1.068 | 0.286 | -4.47e+04 | 1.52e+05 |
C(Heating_QC)[T.Good] | 6.076e+04 | 5.15e+04 | 1.179 | 0.239 | -4.03e+04 | 1.62e+05 |
C(Heating_QC)[T.Excellent] | 6.501e+04 | 5.37e+04 | 1.210 | 0.227 | -4.04e+04 | 1.7e+05 |
C(Central_Air)[T.Y] | 5.695e+04 | 8.56e+04 | 0.665 | 0.506 | -1.11e+05 | 2.25e+05 |
C(Heating_QC)[T.Fair]:C(Central_Air)[T.Y] | 3466.5575 | 8.74e+04 | 0.040 | 0.968 | -1.68e+05 | 1.75e+05 |
C(Heating_QC)[T.Typical]:C(Central_Air)[T.Y] | -1.842e+04 | 8.6e+04 | -0.214 | 0.831 | -1.87e+05 | 1.5e+05 |
C(Heating_QC)[T.Good]:C(Central_Air)[T.Y] | -7648.3161 | 8.69e+04 | -0.088 | 0.930 | -1.78e+05 | 1.63e+05 |
C(Heating_QC)[T.Excellent]:C(Central_Air)[T.Y] | 4.439e+04 | 8.82e+04 | 0.503 | 0.615 | -1.29e+05 | 2.17e+05 |
Omnibus: | 806.508 | Durbin-Watson: | 1.995 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 4446.618 |
Skew: | 1.774 | Prob(JB): | 0.00 |
Kurtosis: | 9.280 | Cond. No. | 217. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## sum_sq df F PR(>F)
## C(Heating_QC) 2.220229e+12 4.0 113.567764 1.612447e-87
## C(Central_Air) 2.903304e+11 1.0 59.403176 1.991166e-14
## C(Heating_QC):C(Central_Air) 3.971968e+10 4.0 2.031716 8.746590e-02
## Residual 9.975296e+12 2041.0 NaN NaN
Post-hoc testing
model_slice1 = smf.ols("Sale_Price ~ C(Heating_QC)", data = train[train["Central_Air"] == 'Y']).fit()
model_slice1.summary()
Dep. Variable: | Sale_Price | R-squared: | 0.186 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.184 |
Method: | Least Squares | F-statistic: | 108.5 |
Date: | Tue, 18 Jun 2024 | Prob (F-statistic): | 2.29e-83 |
Time: | 15:57:28 | Log-Likelihood: | -23991. |
No. Observations: | 1904 | AIC: | 4.799e+04 |
Df Residuals: | 1899 | BIC: | 4.802e+04 |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 1.07e+05 | 7.19e+04 | 1.489 | 0.137 | -3.4e+04 | 2.48e+05 |
C(Heating_QC)[T.Fair] | 3.816e+04 | 7.29e+04 | 0.524 | 0.600 | -1.05e+05 | 1.81e+05 |
C(Heating_QC)[T.Typical] | 3.5e+04 | 7.19e+04 | 0.487 | 0.627 | -1.06e+05 | 1.76e+05 |
C(Heating_QC)[T.Good] | 5.311e+04 | 7.2e+04 | 0.738 | 0.461 | -8.81e+04 | 1.94e+05 |
C(Heating_QC)[T.Excellent] | 1.094e+05 | 7.19e+04 | 1.521 | 0.128 | -3.16e+04 | 2.5e+05 |
Omnibus: | 730.142 | Durbin-Watson: | 2.004 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 3727.430 |
Skew: | 1.749 | Prob(JB): | 0.00 |
Kurtosis: | 8.895 | Cond. No. | 116. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## sum_sq df F PR(>F)
## C(Heating_QC) 2.242210e+12 4.0 108.518706 2.294916e-83
## Residual 9.809268e+12 1899.0 NaN NaN
model_slice2 = smf.ols("Sale_Price ~ C(Heating_QC)", data = train[train["Central_Air"] == 'N']).fit()
model_slice2.summary()
Dep. Variable: | Sale_Price | R-squared: | 0.097 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.071 |
Method: | Least Squares | F-statistic: | 3.793 |
Date: | Tue, 18 Jun 2024 | Prob (F-statistic): | 0.00582 |
Time: | 15:57:28 | Log-Likelihood: | -1740.7 |
No. Observations: | 147 | AIC: | 3491. |
Df Residuals: | 142 | BIC: | 3506. |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 5.005e+04 | 2.42e+04 | 2.070 | 0.040 | 2253.387 | 9.78e+04 |
C(Heating_QC)[T.Fair] | 3.47e+04 | 2.5e+04 | 1.388 | 0.167 | -1.47e+04 | 8.41e+04 |
C(Heating_QC)[T.Typical] | 5.342e+04 | 2.45e+04 | 2.183 | 0.031 | 5043.233 | 1.02e+05 |
C(Heating_QC)[T.Good] | 6.076e+04 | 2.52e+04 | 2.410 | 0.017 | 1.09e+04 | 1.11e+05 |
C(Heating_QC)[T.Excellent] | 6.501e+04 | 2.63e+04 | 2.473 | 0.015 | 1.31e+04 | 1.17e+05 |
Omnibus: | 10.251 | Durbin-Watson: | 1.957 |
---|---|---|---|
Prob(Omnibus): | 0.006 | Jarque-Bera (JB): | 11.172 |
Skew: | 0.516 | Prob(JB): | 0.00375 |
Kurtosis: | 3.871 | Cond. No. | 23.0 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## sum_sq df F PR(>F)
## C(Heating_QC) 1.773945e+10 4.0 3.793029 0.00582
## Residual 1.660284e+11 142.0 NaN NaN
Assumptions
## ShapiroResult(statistic=0.8838016140166404, pvalue=1.0540317548240295e-36)
3.3 Randomized Block Design
There are two typical types of data analysis studies. The first is the observational/retrospective study, which covers the typical data problems people try to solve. The primary characteristic of these analyses is looking at what has already happened (retrospective) and potentially inferring those results to further data. These studies have little control over other factors contributing to the target of interest because the data is collected after the fact.
The other type of data analysis study is the controlled experiment. In these situations, you often want to look at the outcome measure prospectively. The focus of the controlled experiment is to control for other factors that might contribute to the target variable. By manipulating the factors of interest, one can more reasonably claim causation. We can more reasonably claim causation when random assignment is used to eliminate potential nuisance factors. Nuisance factors are variables that can potentially impact the target variable, but are not of interest in the analysis directly.
3.3.1 Garlic Bulb Weight Example
For this analysis we will use a new dataset. This dataset contains the average garlic bulb weight from different plots of land. We want to compare the effects of fertilizer on average bulb weight. However, different plots of land could have different levels of sun exposure, soil pH, and rain amounts. Since we cannot easily alter the pH of the soil, or control the sun and rain, we can use blocking to account for these nuisance factors. Each of the four fertilizers was randomly applied to one quadrant within each of 8 plots of land. These 8 plots have different values for sun exposure, pH, and rain amount. Therefore, if we only put one fertilizer on each plot, we would not know if the fertilizer was the reason the garlic crop grew or if it was one of the nuisance factors. Since we blocked these 8 plots and applied all four fertilizers in each, we have essentially accounted for (or removed the effect of) the nuisance factors.
Let’s briefly explore this new dataset by looking at all 32 values using the print
function.
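The import code is not echoed here; a minimal sketch is below, with the file URL taken from the Python code at the end of this section and the object name block matching the later TukeyHSD output. The n = 32 argument prints every row of the tibble.
library(tidyverse)
block <- read_csv("https://raw.githubusercontent.com/IAA-Faculty/statistical_foundations/master/garlic_block.csv")
print(block, n = 32)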
## Rows: 32 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Sector, Position, Fertilizer, BulbWt, Cloves, BedId
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 32 × 6
## Sector Position Fertilizer BulbWt Cloves BedId
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 3 0.259 11.6 22961
## 2 1 2 4 0.207 12.6 23884
## 3 1 3 1 0.275 12.1 19642
## 4 1 4 2 0.245 12.1 20384
## 5 2 1 3 0.215 11.6 20303
## 6 2 2 4 0.170 12.7 21004
## 7 2 3 1 0.225 12.0 16117
## 8 2 4 2 0.168 11.9 19686
## 9 3 1 4 0.217 12.4 26527
## 10 3 2 3 0.226 11.7 23574
## # ℹ 22 more rows
How do we account for this blocking in an ANOVA model context? This blocking ANOVA model is the exact same as the Two-Way ANOVA model. The variable that identifies which sector (block) an observation is in serves as another variable in the model. Think about this variable as the variable that accounts for all the nuisance factors in your ANOVA. That means we have two variables in this ANOVA model - fertilizer and sector (the block that accounts for sun exposure, pH level of soil, rain amount, etc.).
For this we can use the same aov
function we described above.
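A minimal sketch of that call is below; the formula matches the fit reported by the TukeyHSD output further down, while the object name block_aov is a placeholder of our own.
block_aov <- aov(BulbWt ~ factor(Fertilizer) + factor(Sector), data = block)
summary(block_aov)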
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(Fertilizer) 3 0.005086 0.0016954 4.307 0.016222 *
## factor(Sector) 7 0.017986 0.0025695 6.527 0.000364 ***
## Residuals 21 0.008267 0.0003937
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using the summary
function we can see that both the sector (block) and fertilizer variables are significant in the model at the 0.05 level. What are the interpretations of this?
First, let's address the blocking variable. Whether it is significant or not, it should always be included in the model. This is due to the fact that the data is structured in that way. It is a construct of the data that should be accounted for regardless of the significance. However, since the blocking variable (sector) was significant, that implies that different plots of land have different impacts on the average bulb weight of garlic. Again, this is most likely due to the differences between the plots of land - namely sun exposure, soil pH, rainfall, etc.
Second, the variable of interest is the fertilizer variable. It is significant, implying that there is a difference in the average bulb weight of garlic for different fertilizers. To examine which fertilizer pairs are statistically different we can use post-hoc testing as described in the previous parts of this chapter using the TukeyHSD
function.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = BulbWt ~ factor(Fertilizer) + factor(Sector), data = block)
##
## $`factor(Fertilizer)`
## diff lwr upr p adj
## 2-1 -0.02509875 -0.05275125 0.002553751 0.0840024
## 3-1 -0.01294875 -0.04060125 0.014703751 0.5698678
## 4-1 -0.03336125 -0.06101375 -0.005708749 0.0144260
## 3-2 0.01215000 -0.01550250 0.039802501 0.6186232
## 4-2 -0.00826250 -0.03591500 0.019390001 0.8382800
## 4-3 -0.02041250 -0.04806500 0.007240001 0.1995492
##
## $`factor(Sector)`
## diff lwr upr p adj
## 2-1 -0.0520675 -0.099126544 -5.008456e-03 0.0234315
## 3-1 -0.0145075 -0.061566544 3.255154e-02 0.9634255
## 4-1 -0.0450550 -0.092114044 2.004044e-03 0.0669646
## 5-1 -0.0616250 -0.108684044 -1.456596e-02 0.0051483
## 6-1 -0.0196650 -0.066724044 2.739404e-02 0.8466335
## 7-1 0.0084950 -0.038564044 5.555404e-02 0.9984089
## 8-1 -0.0393325 -0.086391544 7.726544e-03 0.1469768
## 3-2 0.0375600 -0.009499044 8.461904e-02 0.1841786
## 4-2 0.0070125 -0.040046544 5.407154e-02 0.9995370
## 5-2 -0.0095575 -0.056616544 3.750154e-02 0.9966777
## 6-2 0.0324025 -0.014656544 7.946154e-02 0.3337758
## 7-2 0.0605625 0.013503456 1.076215e-01 0.0061094
## 8-2 0.0127350 -0.034324044 5.979404e-02 0.9819446
## 4-3 -0.0305475 -0.077606544 1.651154e-02 0.4025951
## 5-3 -0.0471175 -0.094176544 -5.845586e-05 0.0495704
## 6-3 -0.0051575 -0.052216544 4.190154e-02 0.9999400
## 7-3 0.0230025 -0.024056544 7.006154e-02 0.7227812
## 8-3 -0.0248250 -0.071884044 2.223404e-02 0.6454690
## 5-4 -0.0165700 -0.063629044 3.048904e-02 0.9286987
## 6-4 0.0253900 -0.021669044 7.244904e-02 0.6208608
## 7-4 0.0535500 0.006490956 1.006090e-01 0.0186102
## 8-4 0.0057225 -0.041336544 5.278154e-02 0.9998793
## 6-5 0.0419600 -0.005099044 8.901904e-02 0.1034664
## 7-5 0.0701200 0.023060956 1.171790e-01 0.0012997
## 8-5 0.0222925 -0.024766544 6.935154e-02 0.7514897
## 7-6 0.0281600 -0.018899044 7.521904e-02 0.5004099
## 8-6 -0.0196675 -0.066726544 2.739154e-02 0.8465530
## 8-7 -0.0478275 -0.094886544 -7.684559e-04 0.0446174
3.3.2 Assumptions
Outside of the typical assumptions for ANOVA that still hold here, there are two additional assumptions to be met:
- Treatments are randomly assigned within each block
- The effects of the treatment variable are constant across the levels of the blocking variable
The first, new assumption of randomness deals with the reliability of the analysis. Randomness is key to removing the impact of the nuisance factors. The second, new assumption implies there is no interaction between the treatment variable and the blocking variable. For example, we are implying that the fertilizers' impacts on garlic bulb weight do not change depending on what block you are in. In other words, fertilizers have the same impact regardless of sun exposure, pH levels, rainfall, etc. We are not saying these nuisance factors do not impact the target variable of garlic bulb weight, just that they do not change the effect of the fertilizer on bulb weight.
3.3.3 Python Code
block=pd.read_csv('https://raw.githubusercontent.com/IAA-Faculty/statistical_foundations/master/garlic_block.csv')
block.head(n = 32)
## Sector Position Fertilizer BulbWt Cloves BedId
## 0 1 1 3 0.25881 11.6322 22961
## 1 1 2 4 0.20746 12.5837 23884
## 2 1 3 1 0.27453 12.0597 19642
## 3 1 4 2 0.24467 12.1001 20384
## 4 2 1 3 0.21454 11.5863 20303
## 5 2 2 4 0.16953 12.7132 21004
## 6 2 3 1 0.22504 12.0470 16117
## 7 2 4 2 0.16809 11.9071 19686
## 8 3 1 4 0.21720 12.3655 26527
## 9 3 2 3 0.22551 11.6864 23574
## 10 3 3 2 0.23536 12.0258 17499
## 11 3 4 1 0.24937 11.6668 16636
## 12 4 1 4 0.20811 12.5996 24834
## 13 4 2 1 0.21138 12.1393 19946
## 14 4 3 2 0.19320 12.0792 21504
## 15 4 4 3 0.19256 11.6464 23181
## 16 5 1 4 0.19851 12.5355 24533
## 17 5 2 1 0.18603 12.3307 15009
## 18 5 3 3 0.19698 11.5608 23845
## 19 5 4 2 0.15745 11.8735 18948
## 20 6 1 2 0.20058 11.9077 20019
## 21 6 2 3 0.25346 11.7294 22228
## 22 6 3 4 0.19838 12.7670 24424
## 23 6 4 1 0.25439 12.0139 13755
## 24 7 1 3 0.26578 11.7448 21087
## 25 7 2 4 0.21678 12.8531 25751
## 26 7 3 2 0.26183 12.3990 20296
## 27 7 4 1 0.27506 11.9383 20038
## 28 8 1 1 0.21420 12.1034 17843
## 29 8 2 3 0.17877 11.5682 21394
## 30 8 3 4 0.20714 12.5213 27191
## 31 8 4 2 0.22803 12.2317 20202
import statsmodels.formula.api as smf
model_b = smf.ols("BulbWt ~ C(Fertilizer) + C(Sector)", data = block).fit()
model_b.summary()
Dep. Variable: | BulbWt | R-squared: | 0.736 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.611 |
Method: | Least Squares | F-statistic: | 5.861 |
Date: | Tue, 18 Jun 2024 | Prob (F-statistic): | 0.000328 |
Time: | 15:57:30 | Log-Likelihood: | 86.773 |
No. Observations: | 32 | AIC: | -151.5 |
Df Residuals: | 21 | BIC: | -135.4 |
Df Model: | 10 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 0.2642 | 0.012 | 22.713 | 0.000 | 0.240 | 0.288 |
C(Fertilizer)[T.2] | -0.0251 | 0.010 | -2.530 | 0.019 | -0.046 | -0.004 |
C(Fertilizer)[T.3] | -0.0129 | 0.010 | -1.305 | 0.206 | -0.034 | 0.008 |
C(Fertilizer)[T.4] | -0.0334 | 0.010 | -3.363 | 0.003 | -0.054 | -0.013 |
C(Sector)[T.2] | -0.0521 | 0.014 | -3.711 | 0.001 | -0.081 | -0.023 |
C(Sector)[T.3] | -0.0145 | 0.014 | -1.034 | 0.313 | -0.044 | 0.015 |
C(Sector)[T.4] | -0.0451 | 0.014 | -3.211 | 0.004 | -0.074 | -0.016 |
C(Sector)[T.5] | -0.0616 | 0.014 | -4.392 | 0.000 | -0.091 | -0.032 |
C(Sector)[T.6] | -0.0197 | 0.014 | -1.402 | 0.176 | -0.049 | 0.010 |
C(Sector)[T.7] | 0.0085 | 0.014 | 0.605 | 0.551 | -0.021 | 0.038 |
C(Sector)[T.8] | -0.0393 | 0.014 | -2.803 | 0.011 | -0.069 | -0.010 |
Omnibus: | 1.984 | Durbin-Watson: | 2.574 |
---|---|---|---|
Prob(Omnibus): | 0.371 | Jarque-Bera (JB): | 1.162 |
Skew: | -0.071 | Prob(JB): | 0.559 |
Kurtosis: | 2.077 | Cond. No. | 9.67 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
## sum_sq df F PR(>F)
## C(Fertilizer) 0.005086 3.0 4.306540 0.016222
## C(Sector) 0.017986 7.0 6.526673 0.000364
## Residual 0.008267 21.0 NaN NaN
3.4 Multiple Linear Regression
Most practical applications of regression modeling involve more complicated models than a simple linear regression with only one predictor variable. Additional variables in a model can lead to better explanations and predictions of the target. Linear regressions with more than one predictor variable are called multiple linear regression models. However, as we will see in this section and the following chapters, with more variables comes much more complication.
3.4.1 Model Structure
Multiple linear regression models have the same structure as simple linear regression models, only with more variables. The multiple linear regression model with \(k\) variables is structured like the following:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \]
This model has the predictor variables \(x_1, x_2, ..., x_k\) trying to either explain or predict the target variable \(y\). The intercept, \(\beta_0\), still gives the expected value of \(y\), when all of the predictor variables take a value of 0. With the addition of multiple predictors, the interpretation of the slope coefficients change slightly. The slopes, \(\beta_1, \beta_2, ..., \beta_k\), give the expected change in \(y\) for a one unit change in the respective predictor variable, holding all other predictor variables constant. The random error term, \(\varepsilon\), is the error between our predicted value, \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k\), and our actual value of \(y\).
Unlike simple linear regression that can be visualized as a line through a 2-dimensional scatterplot of data, a multiple linear regression is better thought of as a multi-dimensional plane through a multi-dimensional scatterplot of data.
Let's visualize an example with two predictor variables - the square footage of the greater living area and the total number of rooms - predicting the sale price of a home. When neither of the variables has any relationship with the target variable, we get a horizontal plane like the one shown below. This is similar in concept to a horizontal line in simple linear regression having a slope of 0, implying that the target variable does not change as the predictor variable changes.
Much like a simple linear regression line with a nonzero slope (indicating a relationship between the predictor and the target variable), a relationship between any of the predictor variables and the target variable shifts and rotates the plane, like the one shown below.
To the naive viewer, the shifted plane might still seem to explain the name of the model, multiple linear regression. However, the linear in linear regression does not refer to the shape of the fitted plane (or line in two dimensions), but instead to the linear combination of variables. A linear combination is an expression constructed from a set of terms by multiplying each term by a constant and adding the results. For example, \(ax + by\) is a linear combination of \(x\) and \(y\). Therefore, the linear model
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \] is a linear combination of predictor variables in their relationship with the target variable \(y\). These predictor variables do not all have to contain linear effects though. For example, let’s look at a linear regression model with four predictor variables:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon \]
One would not be hard pressed to call this model a linear regression. However, what if we defined \(x_3 = x_1^2\) and \(x_4 = x_2^2\)?
This model is still a linear regression model. The structure of the model did not change. The model is still a linear combination of predictor variables related to the target variable. The predictor variables just do not all have a linear effect in terms of their relationship with \(y\). However, mathematically, it is still a linear combination and a linear regression model.
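As a purely hypothetical illustration (simulated data, not the Ames data), the sketch below fits such a model in R; the squared terms enter through I(), but lm() still estimates a linear combination of the (transformed) predictors.
set.seed(1234)
sim <- data.frame(x1 = runif(100), x2 = runif(100))          # hypothetical predictors
sim$y <- 1 + 2*sim$x1 - 3*sim$x2^2 + rnorm(100)              # target with a curved effect in x2
quad_fit <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2), data = sim)  # still a linear model in the coefficients
summary(quad_fit)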
3.4.2 Global & Local Inference
In simple linear regression we could just look at the t-test for our slope parameter estimate to determine the utility of our model. With multiple parameter estimates come multiple t-tests. Instead of looking at every individual parameter estimate initially, there is a way to determine the overall adequacy of the model for predicting the target variable.
The utility of a multiple regression model can be tested with a single test that encompasses all the coefficients from the model. This kind of test is called a global test since it tests all \(\beta\)’s simultaneously. The Global F-Test uses the F-distribution to do just that for multiple linear regression models. The hypotheses for this test are the following:
\[ H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \\ H_A: \text{at least one } \beta \text{ is nonzero} \]
In simpler terms, the null hypothesis is that none of the variables are useful in predicting the target variable. The alternative hypothesis is that at least one of these variables is useful in predicting the target variable.
The F-distribution is a distribution that has the following characteristics:
- Bounded below by 0
- Right-skewed
- Both numerator and denominator degrees of freedom
A plot of a variety of F distributions is shown here:
If the global test is significant, the next step would be to examine the individual t-tests to see which variables are significant and which ones are not. This is similar to post-hoc testing in ANOVA where we explored which of the categories was statistically different when we knew at least one was.
These tests are all available using the summary
function on an lm
function for linear regression. To build a multiple linear regression in R using the lm
function, you just add another variable to the formula element. Here we will predict the sales price (Sale_Price
) based on the square footage of the greater living area of the home (Gr_Liv_Area
) as well as total number of rooms above ground (TotRms_AbvGrd
).
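The call itself is not echoed above; a minimal sketch is below, with the object name ames_lm2 assumed (the later models in this section are named ames_lm3 and ames_lm4).
ames_lm2 <- lm(Sale_Price ~ Gr_Liv_Area + TotRms_AbvGrd, data = train)
summary(ames_lm2)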
##
## Call:
## lm(formula = Sale_Price ~ Gr_Liv_Area + TotRms_AbvGrd, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -528656 -30077 -1230 21427 361465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42562.657 5365.721 7.932 3.51e-15 ***
## Gr_Liv_Area 136.982 4.207 32.558 < 2e-16 ***
## TotRms_AbvGrd -10563.324 1370.007 -7.710 1.94e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56630 on 2048 degrees of freedom
## Multiple R-squared: 0.5024, Adjusted R-squared: 0.5019
## F-statistic: 1034 on 2 and 2048 DF, p-value: < 2.2e-16
At the bottom of the above output is the result of the global F-test. Since the p-value on this test is lower than the significance level of 0.05, we have statistical evidence that at least one of the two variables - Gr_Liv_Area
and TotRms_AbvGrd
- is significant at predicting the sale price of the home. By looking at the individual t-tests in the output above, we can see that both variables are actually significant.
3.4.3 Assumptions
The main assumptions for the multiple linear regression model are the same as with the simple linear regression model:
- The expected value of \(y\) is linear in the \(x\)’s (proper model specification).
- The random errors are independent.
- The random errors are normally distributed.
- The random errors have equal variance (homoskedasticity).
However, with multiple variables there is an additional assumption that people tend to add to multiple linear regression modeling:
- No perfect collinearity (also called multicollinearity)
The new assumption means that no predictor variable is a perfect linear combination of any other predictor variables. Collinearity, also called multicollinearity, occurs when predictor variables are correlated with each other. People often misstate this additional assumption as requiring no collinearity at all. That is too restrictive and basically impossible to meet in a realistic setting. Only when collinearity has a drastic impact on the linear regression do we need to concern ourselves. In fact, linear regression only completely breaks when that collinearity is perfect. Dealing with multicollinearity is discussed later.
Similar to simple linear regression, we can evaluate the assumptions by looking at residual plots. The plot
function on the lm
object provides these.
These will again be covered in much more detail in Diagnostic Chapter.
3.4.4 Multiple Coefficients of Determination
One of the main advantages of multiple linear regression is that the complexity of the model enables us to investigate the relationship among \(y\) and several predictor variables simultaneously. However, this increased complexity makes it more difficult to not only interpret the models, but also ascertain which model is “best.”
One example of this would be the coefficient of determination, \(R^2\), that we discussed earlier. The calculation for \(R^2\) is the exact same:
\[ R^2 = 1 - \frac{SSE}{TSS} \]
However, the problem with the calculation of \(R^2\) in a multiple linear regression is that the addition of any variable (useful or not) will never make the \(R^2\) decrease. In fact, it typically increases even with the addition of a useless variable. The reason is rather intuitive. When adding information to a regression model, your predictions can only get better, not worse. If a new predictor variable has no impact on the target variable, then the predictions cannot get any worse than they already were before the addition of the useless variable. Therefore, the \(SSE\) will never increase, so the \(R^2\) will never decrease.
To account for this problem, there is the adjusted coefficient of determination, \(R^2_a\). The calculation is the following:
\[ R^2_a = 1 - [(\frac{n-1}{n-(k+1)})\times (\frac{SSE}{TSS})] \]
Notice what the calculation is doing. It takes the original ratio on the right hand side of the \(R^2\) equation, \(SSE/TSS\), and penalizes it. It multiplies this ratio by a factor that is always greater than 1 when \(k > 0\). Remember, \(k\) is the number of variables in the model. Therefore, as the number of variables increases, the calculation penalizes the model more and more. However, if the reduction in \(SSE\) from adding a useful variable is large enough to outweigh the additional penalization, the \(R^2_a\) will still increase. If the variable is not a useful addition to the model, the \(R^2_a\) will decrease. The \(R^2_a\) is only one of many ways to select the “best” model for multiple linear regression.
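To make the penalty concrete, the sketch below computes both quantities by hand for the two-variable model above (assuming the fitted object is named ames_lm2, as in the earlier placeholder); the results should match the Multiple R-squared and Adjusted R-squared lines of the summary output.
sse <- sum(residuals(ames_lm2)^2)                          # sum of squared errors
tss <- sum((train$Sale_Price - mean(train$Sale_Price))^2)  # total sum of squares
n <- nrow(train)                                           # number of observations
k <- 2                                                     # number of predictor variables
1 - sse/tss                                                # R-squared
1 - ((n - 1)/(n - (k + 1))) * (sse/tss)                    # adjusted R-squared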
One downside of this new metric is that the \(R^2_a\) loses its interpretation. Unlike \(R^2\), it is no longer bounded below by zero, so it can no longer be interpreted as the proportion of variation in the target variable explained by the model. However, we can easily use \(R^2_a\) to select a model and then interpret that model with \(R^2\). Both of these numbers can be found using the summary
function on the lm
object from the previous model.
##
## Call:
## lm(formula = Sale_Price ~ Gr_Liv_Area + TotRms_AbvGrd, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -528656 -30077 -1230 21427 361465
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42562.657 5365.721 7.932 3.51e-15 ***
## Gr_Liv_Area 136.982 4.207 32.558 < 2e-16 ***
## TotRms_AbvGrd -10563.324 1370.007 -7.710 1.94e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56630 on 2048 degrees of freedom
## Multiple R-squared: 0.5024, Adjusted R-squared: 0.5019
## F-statistic: 1034 on 2 and 2048 DF, p-value: < 2.2e-16
From this output we can say that the combination of Gr_Liv_Area
and TotRms_AbvGrd
accounts for 50.24% of the variation in Sale_Price
. Now let’s add a random variable to the model. This random variable will take random values from a normal distribution with mean of 0 and standard deviation of 1 and has no impact on the target variable.
set.seed(1234)
ames_lm3 <- lm(Sale_Price ~ Gr_Liv_Area + TotRms_AbvGrd + rnorm(length(Sale_Price), 0, 1), data = train)
summary(ames_lm3)
##
## Call:
## lm(formula = Sale_Price ~ Gr_Liv_Area + TotRms_AbvGrd + rnorm(length(Sale_Price),
## 0, 1), data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -527926 -29943 -1298 21427 363925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42589.091 5364.877 7.939 3.34e-15 ***
## Gr_Liv_Area 136.927 4.207 32.548 < 2e-16 ***
## TotRms_AbvGrd -10552.425 1369.808 -7.704 2.05e-14 ***
## rnorm(length(Sale_Price), 0, 1) 1629.854 1259.478 1.294 0.196
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56620 on 2047 degrees of freedom
## Multiple R-squared: 0.5028, Adjusted R-squared: 0.502
## F-statistic: 689.9 on 3 and 2047 DF, p-value: < 2.2e-16
Notice that the \(R^2\) of this model actually increased to 0.5028 from 0.5024. However, the \(R^2_a\) value stayed approximately the same at 0.502 since the addition of this new variable did not provide enough predictive power to outweigh the penalty of adding it.
3.4.5 Categorical Predictor Variables
As mentioned in EDA Section, there are two types of variables typically used in modeling:
- Quantitative (or numeric)
- Qualitative (or categorical)
Categorical variables need to be coded differently because they are not numerical in nature. As mentioned in EDA Section, two common coding techniques for linear regression are reference and effects coding. The interpretation of the coefficients (\(\beta\)’s) of these variables in a regression model depend on the specific coding used. The predictions from the model, however, will remain the same regardless of the specific coding that is used.
Let’s use the example of the Central_Air
variable with 2 categories - Y and N. Using reference coding, the reference coded variable to describe these 2 categories (with N as the reference level) would be the following:
Central Air | X1 |
---|---|
Y | 1 |
N | 0 |
The linear regression equation would be:
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 \]
Let’s see the mathematical interpretation of the coefficient \(\hat{\beta}_1\). To do this, let’s get the average sale price of a home prediction for a home with central air (\(\hat{y}_Y\)) and without central air (\(\hat{y}_N\)):
\[ \hat{y}_Y = \hat{\beta}_0 + \hat{\beta}_1 \cdot 1 = \hat{\beta}_0 + \hat{\beta}_1 \\ \hat{y}_N = \hat{\beta}_0 + \hat{\beta}_1 \cdot 0 = \hat{\beta}_0 \]
By subtracting these two equations (\(\hat{y}_Y - \hat{y}_N = \hat{\beta}_1\)), we can get the prediction for the average difference in price between a home with central air and without central air. This shows that in reference coding, the coefficient on each dummy variable is the average difference between that category and the reference category (the category not represented with its own variable). The math can be extended to as many categories as needed.
Using effects coding, the effects coded variable to describe these 2 categories (with N as the reference level) would be the following:
Central Air | X1 |
---|---|
Y | 1 |
N | -1 |
The linear regression equation would be:
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 \]
Let’s see the mathematical interpretation of the coefficient \(\hat{\beta}_1\). To do this, let’s get the average sale price of a home prediction for a home with central air (\(\hat{y}_Y\)) and without central air (\(\hat{y}_N\)):
\[ \hat{y}_Y = \hat{\beta}_0 + \hat{\beta}_1 \cdot 1 = \hat{\beta}_0 + \hat{\beta}_1 \\ \hat{y}_N = \hat{\beta}_0 + \hat{\beta}_1 \cdot -1 = \hat{\beta}_0 - \hat{\beta}_1 \]
Similar to reference coding, the coefficient \(\hat{\beta}_1\) is the average difference between homes with central air and \(\hat{\beta}_0\). However, what is \(\hat{\beta}_0\)? By taking the average of our two predictions:
\[ \frac{1}{2} \times (\hat{y}_Y + \hat{y}_N) = \frac{1}{2} \times (\hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_0 - \hat{\beta}_1) = \frac{1}{2} \times (2\hat{\beta}_0) = \hat{\beta}_0 \]
From this we see that \(\hat{\beta}_0\) is the average of the category predictions, so the coefficient \(\hat{\beta}_1\) is the average difference in price between a home with central air and that overall average. This shows that in effects coding, the coefficient on each coded variable is the average difference between that category and the average across the categories (here, homes with and without central air). The math can be extended to as many categories as needed.
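A minimal sketch of how the two coding schemes look in R is below (illustrative only; model.matrix generates the coded design columns). Note that contr.sum assigns -1 to the last factor level, so with levels N and Y it codes N as 1 and Y as -1, the reverse of the table above, unless the factor is releveled.
head(model.matrix(~ Central_Air, data = train))   # reference coding: N = 0, Y = 1
head(model.matrix(~ Central_Air, data = train,
                  contrasts.arg = list(Central_Air = "contr.sum")))  # effects (sum-to-zero) coding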
Let’s see an example with Central_Air
as a variable added to our multiple linear regression model as a reference coded variable.
ames_lm4 <- lm(Sale_Price ~ Gr_Liv_Area + TotRms_AbvGrd + Central_Air, data = train)
summary(ames_lm4)
##
## Call:
## lm(formula = Sale_Price ~ Gr_Liv_Area + TotRms_AbvGrd + Central_Air,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -510745 -28984 -2317 20273 356742
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7169.259 6778.879 -1.058 0.29
## Gr_Liv_Area 129.594 4.131 31.374 < 2e-16 ***
## TotRms_AbvGrd -8980.938 1335.669 -6.724 2.29e-11 ***
## Central_AirY 54513.082 4762.926 11.445 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54910 on 2047 degrees of freedom
## Multiple R-squared: 0.5323, Adjusted R-squared: 0.5316
## F-statistic: 776.6 on 3 and 2047 DF, p-value: < 2.2e-16
With these results we estimate the average difference in sales price between homes with central air and without central air to be $54,513.08.
3.4.6 Python Code
Global and Local Inference
model_mlr = smf.ols("Sale_Price ~ Gr_Liv_Area + TotRms_AbvGrd", data = train).fit()
model_mlr.summary()
Dep. Variable: | Sale_Price | R-squared: | 0.502 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.502 |
Method: | Least Squares | F-statistic: | 1034. |
Date: | Tue, 18 Jun 2024 | Prob (F-statistic): | 4.37e-311 |
Time: | 15:57:31 | Log-Likelihood: | -25355. |
No. Observations: | 2051 | AIC: | 5.072e+04 |
Df Residuals: | 2048 | BIC: | 5.073e+04 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 4.256e+04 | 5365.721 | 7.932 | 0.000 | 3.2e+04 | 5.31e+04 |
Gr_Liv_Area | 136.9821 | 4.207 | 32.558 | 0.000 | 128.731 | 145.233 |
TotRms_AbvGrd | -1.056e+04 | 1370.007 | -7.710 | 0.000 | -1.33e+04 | -7876.571 |
Omnibus: | 408.938 | Durbin-Watson: | 2.020 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 6517.477 |
Skew: | 0.473 | Prob(JB): | 0.00 |
Kurtosis: | 11.682 | Cond. No. | 6.86e+03 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.86e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Assumptions
train['pred_mlr'] = model_mlr.predict()
train['resid_mlr'] = model_mlr.resid
train[['Sale_Price', 'pred_mlr', 'resid_mlr']].head(n = 10)
## Sale_Price pred_mlr resid_mlr
## 0 232600 172585.780285 60014.219715
## 1 166000 209070.126180 -43070.126180
## 2 170000 198749.366853 -28749.366853
## 3 252000 215018.840374 36981.159626
## 4 134000 144809.807959 -10809.807959
## 5 164700 160973.699138 3726.300862
## 6 193500 198612.384724 -5112.384724
## 7 118500 145752.982966 -27252.982966
## 8 94000 109742.383030 -15742.383030
## 9 111250 108098.597486 3151.402514
Multiple Coefficient of Determination
Dep. Variable: | Sale_Price | R-squared: | 0.502 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.502 |
Method: | Least Squares | F-statistic: | 1034. |
Date: | Tue, 18 Jun 2024 | Prob (F-statistic): | 4.37e-311 |
Time: | 15:57:32 | Log-Likelihood: | -25355. |
No. Observations: | 2051 | AIC: | 5.072e+04 |
Df Residuals: | 2048 | BIC: | 5.073e+04 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 4.256e+04 | 5365.721 | 7.932 | 0.000 | 3.2e+04 | 5.31e+04 |
Gr_Liv_Area | 136.9821 | 4.207 | 32.558 | 0.000 | 128.731 | 145.233 |
TotRms_AbvGrd | -1.056e+04 | 1370.007 | -7.710 | 0.000 | -1.33e+04 | -7876.571 |
Omnibus: | 408.938 | Durbin-Watson: | 2.020 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 6517.477 |
Skew: | 0.473 | Prob(JB): | 0.00 |
Kurtosis: | 11.682 | Cond. No. | 6.86e+03 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.86e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Categorical Predictor Variables
model_mlr2 = smf.ols("Sale_Price ~ Gr_Liv_Area + TotRms_AbvGrd + C(Central_Air)", data = train).fit()
model_mlr2.summary()
Dep. Variable: | Sale_Price | R-squared: | 0.532 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.532 |
Method: | Least Squares | F-statistic: | 776.6 |
Date: | Tue, 18 Jun 2024 | Prob (F-statistic): | 0.00 |
Time: | 15:57:32 | Log-Likelihood: | -25292. |
No. Observations: | 2051 | AIC: | 5.059e+04 |
Df Residuals: | 2047 | BIC: | 5.061e+04 |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -7169.2593 | 6778.879 | -1.058 | 0.290 | -2.05e+04 | 6124.960 |
C(Central_Air)[T.Y] | 5.451e+04 | 4762.926 | 11.445 | 0.000 | 4.52e+04 | 6.39e+04 |
Gr_Liv_Area | 129.5945 | 4.131 | 31.374 | 0.000 | 121.494 | 137.695 |
TotRms_AbvGrd | -8980.9382 | 1335.669 | -6.724 | 0.000 | -1.16e+04 | -6361.527 |
Omnibus: | 463.227 | Durbin-Watson: | 2.032 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 7273.636 |
Skew: | 0.621 | Prob(JB): | 0.00 |
Kurtosis: | 12.142 | Cond. No. | 9.92e+03 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.92e+03. This might indicate that there are
strong multicollinearity or other numerical problems.