Chapter 4 Model Selection

This section will dive into some basic foundations in model selection, or finding the best model for a data set. In most data sets, there will most likely be variables that are informative and ones that are uninformative in predicting the response. With many explanatory variables, it could be extremely time consuming to try all potential models by hand, and the use of automatic procedures can greatly assist in obtaining subsets of variables in which to focus your attention.

This Chapter aims to answer the following questions:

  • What are the different selection criteria that can be used in model selection?

  • How to perform a stepwise search algorithm.

    • Forward Selection
    • Backward Elimination
    • Stepwise Selection
  • Considerations when looking at p-values.

CAUTION: you should NEVER just use the final model created from an automatic procedure! Always explore your data (both automatically selected and excluded variables), and use domain knowledge, diagnostics, and critical thought to decide on your final model.

We will focus on two techniques for automatic variable selection: stepwise procedures and LASSO. Within the stepwise procedures, we will discuss forward, backward and stepwise searches using several different selection criteria. We will end this section discussing important considerations in use of p-values when dealing with large data sets.

4.1 Selection Criteria

When trying to find the best model, there are many selection criteria at our disposal. For example, you have already been introduced to \(R^{2}\) (the larger the value of \(R^{2}\), the better the model). However, when comparing multivariate models, the adjusted \(R^{2}\) is better due to the fact that \(R^{2}\) can potentially increase even when adding noise. The adjusted \(R^{2}\) can be thought of as \(R^{2}\) with penalty (for every additional variable added to the model, we add a penalty). This allows us to weigh the contribution of adding new variables against the added complexity of more variables in the model. There are other selection criteria that are also used in selecting the “best model” (or variable selection). As you will notice, these selection criteria also have a penalty to take into account the addition of variables. We will use two of the most common ones: AIC and BIC (can also be referred to as SBC).

The AIC, or Akaike Information Criterion, was developed by statistician Hirotugu Akaike in the 1970’s and is defined by \[ AIC = -2log(Likelihood) + 2p. \] In this case, “Likelihood” is the likelihood of the data and \(2p\) is the “penalty”, where \(p\) is the number of parameters in the model. A smaller AIC indicates a better model.

BIC, also known as the Bayesian Information Criterion (also called SBC or Schwarz Bayesian Information) was first developed by Gideon E. Schwarz, also back in the 1970’s and is defined by \[BIC = -2log(Likelihood) + plog(n). \] In this case, “Likelihood” is the likelihood of the data and \(plog(n)\) is the “penalty”, where \(p\) is the number of parameters in the model and \(n\) is the sample size. A smaller BIC indicates a better model. Notice that both penalties have a common term of \(p\) in them (this will be important to remember when we start discussing the R code).

4.2 Stepwise Selection

The three different algorithms in the stepwise selection search are forward, backward and stepwise. Each of these algorithms either add or take away one variable at a time based on a given criterion until this criterion can no longer be met. At which point the algorithm stops.

Forward

For forward selection, we start with a null model (only the intercept) and add one variable at a time until no other variables can be added based on a given criterion. The algorithm is as follows:

  1. Start with a null model, this is the base model (just the intercept)
    1. For each variable not in model, create a linear regression model with the base model plus this one variable
    2. See which linear regression is best (based on criterion)
    3. Is this regression better than the base model?
      1. Yes, then continue on to step 4
      2. No, exit the algorithm with the base model as the chosen model
    4. The base model is now the previous base model plus the variable selected in step 3. Using this as your new base model, go back to step 1 and continue.

To do forward selection in R, you will use the step function. The “empty model” should be used as your initial model (which is just the intercept). For the scope of the model, you need to put the “smallest model” (just the intercept) to the “largest model” (full model). The direction is forward. The penalty can be controlled by defining “\(k\)”. Using a value of 2 for “\(k\)” will use the AIC criterion (remember AIC penalty was \(2p\)) and using \(log(n)\) for “\(k\)” will use the BIC criterion (remember BIC penalty was \(plog(n)\)). If you do not specify anything for “\(k\)”, the default is AIC. You can also define “\(k\)” as the upper \(\alpha\)-quantile of a \(\chi^{2}\) distribution with one degree of freedom, which will use p-value for its selection of variables. In this case, the best “model” is the one with the lowest p-value for the new variable. In order to enter the model, the p-value of this variable will need to be lower than the “cut-off” for the p-values (\(\alpha\)), which means that the criterion to enter the model would be that the variable has a p-value smaller than \(\alpha\). Here we use a trimmed down version of the training dataset that contains variables we want to bring to the model after initial data exploration.

# Create full model and empty model
full.model <- lm(Sale_Price ~ . , data = train_sel)
empty.model <- lm(Sale_Price ~ 1, data = train_sel)

# k = 2 for AIC selection
for.model <- step(empty.model,
                  scope = list(lower = empty.model,
                               upper = full.model),
                  direction = "forward", k = 2) 

Below is edited output with comments to lead you through the forward selection. Notice that the above code has \(k\)=2, which means we are using AIC as our selection criterion. The first information R provides is for the initial model, which is \[Y_{i}=\beta_{0} + \varepsilon_{i}.\] The “empty model” has an AIC of 46,323.64 (in order for ANY variable to be added, that model must be better, or in other words have a smaller AIC than this initial model).

Start:  AIC=46323.64
Sale_Price ~ 1

R then displays all simple linear regressions with their corresponding AIC values. Notice that R puts the information in ascending AIC values (so, the best simple linear regression is 43,817 which is indeed smaller than 46,323.64). You will also see a line with a blank for the “variable”, this is the current “base model”. In this first step, it is just the intercept. Therefore, we will add the variable Overall_Qual.

Df Sum of Sq RSS AIC
+Overall_Qual 9 9.3437e+12 3.8531e+12 43817
+Gr_Liv_Area 1 6.4389e+12 6.7578e+12 44953
+Garage_Area 1 5.3561e+12 7.8407e+12 45258
+First_Flr_SF 1 4.8867e+12 8.3100e+12 45377
+Full_Bath 1 3.7827e+12 9.4141e+12 45633
+TotRms_AbvGrd 1 3.2304e+12 9.9663e+12 45750
+Fireplaces 1 2.9715e+12 1.0225e+13 45802
+Half_Bath 1 1.1209e+12 1.2076e+13 46144
+Roof_Style 5 1.0724e+12 1.2124e+13 46160
+Central_Air 1 9.6147e+11 1.2235e+13 46170
+House_Style 7 1.0245e+12 1.2172e+13 46172
+Second_Flr_SF 1 9.4611e+11 1.2251e+13 46173
+Lot_Area 1 9.0332e+11 1.2293e+13 46180
+Bldg_Type 4 4.6434e+11 1.2732e+13 46258
+Street 1 3.1752e+10 1.3165e+13 46321
1.3197e+13 46324

Now, R shows you the new “base model”, which includes Overall_Qual (and of course the intercept is still included) and the new AIC value to beat which is 43816.66.

Step:  AIC=43816.66
Sale_Price ~ Overall_Qual

Using this as our base model, we now try adding each of the remaining variables in separate regression models, output is shown below:

Df Sum of Sq RSS AIC
+Gr_Liv_Area 1 9.8905e+11 2.8640e+12 43210
+First_Flr_SF 1 5.2665e+11 3.3264e+12 43517
+Garage_Area 1 4.6644e+11 3.3866e+12 43554
+TotRms_AbvGrd 1 4.6123e+11 3.3918e+12 43557
+Full_Bath 1 4.1206e+11 3.4410e+12 43587
+Fireplaces 1 4.0551e+11 3.4476e+12 43591
+Lot_Area 1 3.8148e+11 3.4716e+12 43605
+Bldg_Type 4 2.3715e+11 3.6159e+12 43694
+Second_Flr_SF 1 1.7555e+11 3.6775e+12 43723
+Half_Bath 1 1.3948e+11 3.7136e+12 43743
+Central_Air 1 9.1322e+10 3.7617e+12 43769
+House_Style 7 6.1815e+10 3.7912e+12 43797
+Roof_Style 5 5.1448e+10 3.8016e+12 43799
3.8531e+12 43817
+Street 1 1.9573e+06 3.8531e+12 43819

Best model includes Gr_Liv_Area (in addition to the intercept and Overall_Qual). The AIC is 43210, which beats the previous one of 43817. So, we will add Gr_Liv_Area to the model. Our new base model includes the intercept, Overall_Qual and Gr_Liv_Area with an AIC of 43210.

Step:  AIC=43210.24
Sale_Price ~ Overall_Qual + Gr_Liv_Area

We continue in this fashion until adding a new variable does NOT decrease the AIC. The last step is shown below:

Step:  AIC=42676.1
Sale_Price ~ Overall_Qual + Gr_Liv_Area + House_Style + Garage_Area + Bldg_Type + Fireplaces 
        + Full_Bath + Half_Bath + Lot_Area + Roof_Style + Central_Air + Second_Flr_SF 
        + TotRms_AbvGrd + First_Flr_SF
Df Sum of Sq RSS AIC
2.1542e+12 42676
+Street 1 1.028e+09 2.1532e+12 42677

Notice that the previous step had a model with the intercept, Overall_Qual, Gr_Liv_Area, House_Style, Garage_Area, Bldg_Type, Fireplaces, Full_Bath, Half_Bath, Lot_Area, Roof_Style, Central_Air, Second_Flr_SF, TotRms_AbvGrd, and First_Flr_SF (as illustrated by the above formula). The AIC for this model is 42676.1. There is only one variable left that could be added (Street). However, when we add street to the model, the AIC now becomes 42677 (in other words AIC increases, which is a worse model). The algorithm stops here.

The code below illustrates how to do forward selection with BIC and p-values. In the output from R, it still says “AIC”, but now these values are calculated by using the BIC formula or the corresponding “\(k\)” values for p-values (take a look at the values R has for the “AIC” column and you will see that it has changed!).

# k = log(n) for BIC selection
for.model2 <- step(empty.model,
                   scope = list(lower = empty.model,
                               upper = full.model),
                   direction = "forward", k = log(nrow(train_sel))) # k = qchisq(alpha, 1, lower.tail = FALSE) for p-value with alpha selection
alpha.f=0.05
for.model3 <- step(empty.model,
                   scope = list(lower = empty.model,
                                upper = full.model),
                   direction = "forward", k = qchisq(alpha.f, 1, lower.tail = FALSE)) 

Backward

Backward elimination starts with the “base” model as the full model (i.e. all variables are contained within the model). We remove only one variable and we do this for each variable in the model. We want to see if the any of the models improve over the “base model” (in other words, is the model better with that one variable removed based on a given criterion?). If this new model (with that one variable removed) is better than the previous model, then we remove that variable and this new model now becomes the base model. We are looking for the best improvement, therefore we look at the model that would be the best over the base model. The algorithm continues in this fashion until no other variables can be removed, based on the chosen criterion.

  1. Start with full model with all predictor variables in it, this is the base model and calculate the criterion on this model
    1. Create models such that each model has exactly one predictor variable removed from it and calculate the criterion for each model
    2. In step 1, find the best model based on the criterion
    3. Is this regression model better than the base model?
      1. Yes, then continue on to step 4
      2. No, exit the algorithm with the base model as the chosen model
    4. The base model is now the model with the variable removed. Using this as your new base model, go back to step 1 and continue.

To do backward elimination in R, you will use the step function. The “full model” should be used as your initial model (which is the model with all the predictor variables in it). For the scope of the model, you need to put the “smallest model” (just the intercept) to the “largest model” (full model). The direction is backward. The penalty (which indicates the criteria) can be controlled by defining “\(k\)”. As discussed before, a value of 2 will use the AIC criterion, a value of \(log(n)\) will produce the BIC penalty, and finally the upper \(\alpha\)-quantile of a \(\chi^{2}\) distribution with one degree of freedom will use p-value for its removal of variables.

# Create full model and empty model
full.model <- lm(Sale_Price ~ . , data = train_sel)
empty.model <- lm(Sale_Price ~ 1, data = train_sel)

# k = 2 for AIC selection
back.model <- step(full.model,
                  scope = list(lower = empty.model,
                               upper = full.model),
                  direction = "backward", k = 2) 

Below is edited output with comments to lead you through the backward elimination. Notice that the above code has \(k\)=2, which means we are using AIC as our selection criterion. The first information R provides is for the initial model, which is \[Y_{i}=\beta_{0} + \beta_{1}x_{i,1}+\beta_{2}x_{i,2}+\beta_{3}x_{i,3}+\beta_{4}x_{i,4}+\beta_{5}x_{i,5}+\beta_{6}x_{i,6}+\beta_{i,7}x_{7}+\beta_{8}x_{i,8}+\beta_{9}x_{i,9}+\beta_{10}x_{i,10}+\beta_{11}x_{i,11}+\beta_{12}x_{i,12}+\beta_{13}x_{i,13}+\beta_{14}x_{i,14}+\beta_{15}x_{i,15}+\varepsilon_{i}.\] The “full model” has an AIC of 42,677.12 (in order for ANY variable to be removed, that model must be better, or in other words have a smaller AIC than this initial model).

R then displays linear regression models with one variable removed and their corresponding AIC values. For example, if we remove Street (just that one variable), the AIC of that model will be 42675. Notice that R put the information in ascending AIC values (so, best linear regression is at the top with an AIC value of 42,675 which is indeed smaller than 42,677.12, so Street will be removed). You will also see a line with . This is the “base model”, or in this case the full model.

Df Sum of Sq RSS AIC
- Gr_Liv_Area 1 4.9138e+08 2.1537e+12 42676
- Street 1 1.0280e+09 2.1542e+12 42676
2.1532e+12 42677
- First_Flr_SF 1 3.1548e+09 2.1563e+12 42678
- TotRms_AbvGrd 1 3.4112e+09 2.1566e+12 42678
- Second_Flr_SF 1 6.4939e+09 2.1597e+12 42681
- Central_Air 1 1.6533e+10 2.1697e+12 42691
- Roof_Style 5 2.8786e+10 2.1820e+12 42694
- Half_Bath 1 3.5009e+10 2.1882e+12 42708
- Lot_Area 1 3.5997e+10 2.1892e+12 42709
- Fireplaces 1 3.6853e+10 2.1900e+12 42710
- House_Style 7 7.0980e+10 2.2241e+12 42730
- Garage_Area 1 6.4143e+10 2.2173e+12 42735
- Bldg_Type 4 7.1274e+10 2.2244e+12 42736
- Full_Bath 1 6.8198e+10 2.2214e+12 42739
- Overall_Qual 9 1.7183e+12 3.8715e+12 43862

Once Street is removed, the new “base model” (without this variable) has an AIC value of 42,674.6. This is now the new value to beat. Repeating this process with removing the other variables one at a time, we see the output:

Df Sum of Sq RSS AIC
2.1547e+12 42675
- TotRms_AbvGrd 1 2.9784e+09 2.1577e+12 42675
- Central_Air 1 1.7247e+10 2.1720e+12 42689
- Roof_Style 5 2.8560e+10 2.1833e+12 42692
- Half_Bath 1 3.4751e+10 2.1895e+12 42705
- Lot_Area 1 3.5041e+10 2.1898e+12 42706
- Fireplaces 1 3.6680e+10 2.1914e+12 42707
- House_Style 7 7.3149e+10 2.2279e+12 42729
- Garage_Area 1 6.3520e+10 2.2182e+12 42732
- Bldg_Type 4 7.3044e+10 2.2278e+12 42735
- Full_Bath 1 6.8973e+10 2.2237e+12 42737
- Second_Flr_SF 1 1.2513e+11 2.2798e+12 42788
- First_Flr_SF 1 1.4221e+11 2.2969e+12 42804
- Overall_Qual 9 1.7202e+12 3.8749e+12 43860

Notice that none of the removal of variables is better than the “base model” based on the AIC (best one is TotRms_AbvGrd and that has an AIC of 42675, which is not better than the base model). Since none of the removals improve the model, the algorithm stops here with all variables in the model except Street. A quick note here: notice that backward elimination selected the same model as forward selection. This will not always be the case! Therefore, the final model is:

Step:  AIC=42674.6
Sale_Price ~ Lot_Area + Bldg_Type + House_Style + Overall_Qual + 
    Roof_Style + Central_Air + First_Flr_SF + Second_Flr_SF + 
    Full_Bath + Half_Bath + Fireplaces + Garage_Area + TotRms_AbvGrd

The code below illustrates how to do backward elimination with BIC and p-values.

# k = log(n) for BIC selection
back.model2 <- step(full.model,
                   scope = list(lower = empty.model,
                               upper = full.model),
                   direction = "backward", k = log(nrow(train_sel))) # k = qchisq(alpha, 1, lower.tail = FALSE) for p-value with alpha selection
alpha.f=0.05
back.model3 <- step(full.model,
                   scope = list(lower = empty.model,
                                upper = full.model),
                   direction = "backward", k = qchisq(alpha.f, 1, lower.tail = FALSE)) 

Stepwise

Stepwise selection is a combination of both of these methods. In this algorithm, we start with the “base model” as the empty model (i.e. just the intercept) and will add variables (as in forward selection). However, as we add new variables, we will also check that the variables in the model are still contributing (in other words, after a new variable is added, we check to see if the model would be better if we drop one of the other variables in the model). Keep in mind that this algorithm is similar to forward selection and backward elimination in that only one variable may either enter or be removed at each step. The algorithm stops when no more variables can be added to nor taken away from the model (in other words, the current “base model” is better than adding any single addition of one variable or any single extraction of one variable). The addition and removal of variables is again based upon the criteria specified. See the algorithm below:

  1. Start with empty model with only the intercept in it, this is the base model and calculate the criterion on this model
    1. For each variable not in model, create a linear regression model with the base model plus this variable; create additional models with the base model taking away one variable at a time
    2. See which linear regression is best (based on criterion)
    3. Is this regression better than the base model?
      1. Yes, then continue on to step 4
      2. No, exit the algorithm with the base model as the chosen model
    4. The base model is now the best model selected in step 3. Using this as your new base model, go back to step 1 and continue.

To do stepwise selection in R, you will use the step function. The “empty model” should be used as your initial model. For the scope of the model, you need to put the “smallest model” (just the intercept) to the “largest model” (full model). The direction is both. The penalty (which indicates the criteria) can be controlled by defining “\(k\)”. As discussed before, a value of 2 will use the AIC criterion, a value of \(log(n)\) will produce the BIC penalty, and finally the upper \(\alpha\)-quantile of a \(\chi^{2}\) distribution with one degree of freedom will use p-value for its removal of variables.

# Create full model and empty model
full.model <- lm(Sale_Price ~ . , data = train_sel)
empty.model <- lm(Sale_Price ~ 1, data = train_sel)

# k = 2 for AIC selection
step.model <- step(empty.model,
                  scope = list(lower = empty.model,
                               upper = full.model),
                  direction = "both", k = 2) 

As you can see, the initial base model is just the intercept (AIC=46,323.64). From this initial model, all simple linear regressions are created and the AIC is observed.

Start:  AIC=46323.64
Sale_Price ~ 1
Df Sum of Sq RSS AIC
+ Overall_Qual 9 9.3437e+12 3.8531e+12 43817
+ Gr_Liv_Area 1 6.4389e+12 6.7578e+12 44953
+ Garage_Area 1 5.3561e+12 7.8407e+12 45258
+ First_Flr_SF 1 4.8867e+12 8.3100e+12 45377
+ Full_Bath 1 3.7827e+12 9.4141e+12 45633
+ TotRms_AbvGrd 1 3.2304e+12 9.9663e+12 45750
+ Fireplaces 1 2.9715e+12 1.0225e+13 45802
+ Half_Bath 1 1.1209e+12 1.2076e+13 46144
+ Roof_Style 5 1.0724e+12 1.2124e+13 46160
+ Central_Air 1 9.6147e+11 1.2235e+13 46170
+ House_Style 7 1.0245e+12 1.2172e+13 46172
+ Second_Flr_SF 1 9.4611e+11 1.2251e+13 46173
+ Lot_Area 1 9.0332e+11 1.2293e+13 46180
+ Bldg_Type 4 4.6434e+11 1.2732e+13 46258
+ Street 1 3.1752e+10 1.3165e+13 46321
1.3197e+13 46324

As you can see from the above output, the “best” model is the one containing Overall_Qual. Therefore, the new base model is the one containing just this variable.  

Using this as the new base model, we look at adding each of the individual variables to this model (see output below), as well as taking away the variables in the model (in this case, just Overall_Qual). From the output, you can see a “+” for when a variable is being added and a “-” for when a variable is being taken away.

Step:  AIC=43816.66
Sale_Price ~ Overall_Qual
Df Sum of Sq RSS AIC
+ Gr_Liv_Area 1 9.8905e+11 2.8640e+12 43210
+ First_Flr_SF 1 5.2665e+11 3.3264e+12 43517
+ Garage_Area 1 4.6644e+11 3.3866e+12 43554
+ TotRms_AbvGrd 1 4.6123e+11 3.3918e+12 43557
+ Full_Bath 1 4.1206e+11 3.4410e+12 43587
+ Fireplaces 1 4.0551e+11 3.4476e+12 43591
+ Lot_Area 1 3.8148e+11 3.4716e+12 43605
+ Bldg_Type 4 2.3715e+11 3.6159e+12 43694
+ Second_Flr_SF 1 1.7555e+11 3.6775e+12 43723
+ Half_Bath 1 1.3948e+11 3.7136e+12 43743
+ Central_Air 1 9.1322e+10 3.7617e+12 43769
+ House_Style 7 6.1815e+10 3.7912e+12 43797
+ Roof_Style 5 5.1448e+10 3.8016e+12 43799
3.8531e+12 43817
+ Street 1 1.9573e+06 3.8531e+12 43819
- Overall_Qual 9 9.3437e+12 1.3197e+13 46324

The best model is now the one adding Gr_Liv_Area to the old base model that just included Overall_Qual. This becomes the new base model and the algorithm continues below..

Step:  AIC=43210.24
Sale_Price ~ Overall_Qual + Gr_Liv_Area
Df Sum of Sq RSS AIC
+ House_Style 7 2.5351e+11 2.6105e+12 43034
+ Garage_Area 1 2.1638e+11 2.6476e+12 43051
+ Lot_Area 1 1.3097e+11 2.7330e+12 43116
+ First_Flr_SF 1 1.2210e+11 2.7419e+12 43123
+ Fireplaces 1 1.1069e+11 2.7533e+12 43131
+ Central_Air 1 1.1050e+11 2.7535e+12 43132
+ Second_Flr_SF 1 1.0207e+11 2.7619e+12 43138
+ Bldg_Type 4 1.0299e+11 2.7610e+12 43143
+ Roof_Style 5 6.0726e+10 2.8033e+12 43176
+ Full_Bath 1 3.2970e+10 2.8310e+12 43188
+ TotRms_AbvGrd 1 2.4688e+10 2.8393e+12 43194
2.8640e+12 43210
+ Half_Bath 1 4.0261e+07 2.8640e+12 43212
+ Street 1 2.2632e+07 2.8640e+12 43212
- Gr_Liv_Area 1 9.8905e+11 3.8531e+12 43817
- Overall_Qual 9 3.8938e+12 6.7578e+12 44953

Skipping ahead to the final step, we see that the base model includes the variables Overall_Qual, House_Style, Garage_Area, Bldg_Type, Fireplaces, Full_Bath, Half_Bath, Lot_Area, Roof_Style, Central_Air, Second_Flr_SF, TotRms_AbvGrd and First_Flr_SF. Looking at the output below, this is the “best model” (no other model, either inputting one more variable nor removing one variable can beat it based on this criteria).

Step:  AIC=42674.6
Sale_Price ~ Overall_Qual + House_Style + Garage_Area + Bldg_Type + 
    Fireplaces + Full_Bath + Half_Bath + Lot_Area + Roof_Style + 
    Central_Air + Second_Flr_SF + TotRms_AbvGrd + First_Flr_SF
Df Sum of Sq RSS AIC
2.1547e+12 42675
- TotRms_AbvGrd 1 2.9784e+09 2.1577e+12 42675
+ Street 1 1.0581e+09 2.1537e+12 42676
+ Gr_Liv_Area 1 5.2156e+08 2.1542e+12 42676
- Central_Air 1 1.7247e+10 2.1720e+12 42689
- Roof_Style 5 2.8560e+10 2.1833e+12 42692
- Half_Bath 1 3.4751e+10 2.1895e+12 42705
- Lot_Area 1 3.5041e+10 2.1898e+12 42706
- Fireplaces 1 3.6680e+10 2.1914e+12 42707
- House_Style 7 7.3149e+10 2.2279e+12 42729
- Garage_Area 1 6.3520e+10 2.2182e+12 42732
- Bldg_Type 4 7.3044e+10 2.2278e+12 42735
- Full_Bath 1 6.8973e+10 2.2237e+12 42737
- Second_Flr_SF 1 1.2513e+11 2.2798e+12 42788
- First_Flr_SF 1 1.4221e+11 2.2969e+12 42804
- Overall_Qual 9 1.7202e+12 3.8749e+12 43860

As you can see from the above stepwise procedures of forward, backward and stepwise, the algorithms attempt to find the “best model” by either adding one variable or taking away one variable at at a time. There is no guarantee that these algorithms will find the best model, but they do provide some guidance in terms of potential significance variables and can provide assistance when the number of variables under consideration is very large. As a caution though, you should always do further investigation once you have found models through these algorithms. These algorithms can give the same models or different models as their final selection based on the algorithm selected and the criteria used.

4.3 Significance Levels

If you are going to use the “p-value” method, you need to be aware of some considerations due to sample size. The larger the sample size, the smaller the p-values will be and the more likely it is that you will end up seeing many “significant” p-values due to the relationship between p-values and sample size (not necessarily because that variable was informative). The paper by Raftery on the Moodle page sums up the problem nicely and provides at least some guidance in terms of potential alpha-levels to use for different sample sizes. A summary of the table is provided below.

Evidence 30 50 100 1000
Weak 0.076 0.53 0.032 0.009
Fair 0.028 0.019 0.010 0.003
Strong 0.005 0.003 0.001 0.0003
Very Strong 0.001 0.0005 0.0001 0.00004

You can quickly see that for even sample sizes of 1000, one should consider the value of alpha that is used to determine significance. One thing that most researchers agree on is that considerations need to be taken for larger sample size. However, what those considerations should be and exactly how to do that is not agreed upon. You should consider your sample size when you are determining what level of significance you want to choose for your analysis (also take into account which is worse….a type I error or a type II error). There is no quick answer to this question and each decision will depend on the analysis being conducted.

4.3.1 Python Code

Python does NOT have nice capabilities to do this automatically in statsmodels, scikitlearn, or scipy. All resources I can find involve downloading and installing a package (mlxtend) that is not included by default in anaconda or writing your own function. Scikit learn has something similar but uses the model’s coefficients (!!!) to select, not p-values. Scikit learn can do this by evaluating a metric on cross-validation, but that is not covered until machine learning in Fall 3.