## Introduction

Machine learning methods allow us to automate a lot of modeling work. One common example is using Elastic Net, Ridge Regression, and Lasso for variable selection and shrinkage [Zou]. In this article, we suggest two adjustments that can improve these methods to get better results. We create simple illustrative data and models to show how both of these adjustments would work. The R code for these simulations is in the appendices.

## 1. Historical background

Ridge regression was invented by Arthur Hoerl and Robert Kennard, who were both working at DuPont. They found this method very helpful in dealing with correlated predictors, and it was tractable with the limited computing power that they had at the time. Ridge regression was performed by adding some multiple of the identity matrix to $X'X$^{[1]} when performing ordinary least squares regression. An interesting history of ridge regression is given in [Hoerl].
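In matrix form, this estimate is usually written as below. The symbol $k \ge 0$ for the multiple of the identity matrix $I$ is notation assumed here for illustration; it does not appear in the text above.

$$\hat{\beta}_{ridge} = (X'X + kI)^{-1}X'y$$

Setting $k = 0$ recovers the ordinary least squares estimate.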

Ridge regression left all of the predictors in the model. To address this, Robert Tibshirani invented lasso regression. Lasso regression “enjoys some of the favourable properties of both subset selection and ridge regression,” by removing some variables and then “shrinking” the parameters for the remaining variables [Tibshirani].
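To see how one method can both remove and shrink, consider the special case of orthonormal predictors, where the lasso estimate is a soft-thresholded version of the least-squares coefficient. The Python sketch below covers only that special case; the names `soft_threshold` and `gamma` are illustrative and not taken from any package.

```python
def soft_threshold(z, gamma):
    """Lasso estimate for one coefficient under orthonormal predictors.

    z is the least-squares coefficient and gamma >= 0 is the penalty
    strength (both names are illustrative). Coefficients with |z| <= gamma
    are set to zero (removed); the rest are moved toward zero by gamma.
    """
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


# A large coefficient is shrunk, a small one is dropped entirely.
shrunk = soft_threshold(3.0, 1.0)
dropped = soft_threshold(0.5, 1.0)
```

Ridge regression, by contrast, rescales every coefficient and never sets one exactly to zero.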

Later, Hui Zou and Trevor Hastie blended these methods to create the elastic net, which allows the user more flexibility [Zou]. We will propose an approach that will improve all three methods.

## 2. Selecting the base class for categorical variables

When we have categorical variables, the choice of base class does not matter if we fit without shrinkage. When we use shrinkage, the function we are minimizing includes the parameters themselves, so our choice of base class will influence our estimates. We show this with a simple example. Consider a simple dataset with three observations, one for each of three colors: the response is 10 for white, 15 for black, and 20 for red.

We only have three degrees of freedom, so we can either fit a linear model with no intercept and one coefficient per color, or we can fit a model with an intercept and terms for only two of the three colors. If we use linear regression, these options will give the same predictions because they are equivalent. With shrinkage this is more complicated, as we are trying to minimize a combination of the errors in our predictions and the size of the coefficients. With elastic net [Hastie], we are trying to minimize:

$$\min_{\beta_0,\beta}\ \frac{1}{N}\sum_{i=1}^{N} w_i\, l(y_i,\ \beta_0+\beta^T x_i)+\lambda\left[(1-\alpha)\|\beta\|_2^2/2+\alpha\|\beta\|_1\right]$$

Per their article, “$l(y_i,\eta_i)$ is the negative log-likelihood contribution for observation $i$; e.g., for the Gaussian case it is $\frac{1}{2}(y_i-\eta_i)^2$. The elastic net penalty is controlled by α and bridges the gap between lasso regression (α=1, the default) and ridge regression (α=0). The tuning parameter λ controls the overall strength of the penalty.”
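The quoted objective can be written out directly for the Gaussian case. The following is a minimal Python sketch that only evaluates the formula for given coefficients; the function and argument names are illustrative, and this is not how glmnet itself is implemented.

```python
def elastic_net_objective(beta0, beta, X, y, lam, alpha, weights=None):
    """Evaluate the Gaussian elastic net objective for given coefficients.

    X is a list of rows, y a list of responses, lam the overall penalty
    strength and alpha the lasso/ridge mix (names are illustrative).
    The intercept beta0 is not part of the penalty, as in the formula.
    """
    n = len(y)
    if weights is None:
        weights = [1.0] * n
    loss = 0.0
    for w, row, target in zip(weights, X, y):
        eta = beta0 + sum(b * x for b, x in zip(beta, row))
        loss += w * 0.5 * (target - eta) ** 2
    ridge_part = sum(b * b for b in beta) / 2.0
    lasso_part = sum(abs(b) for b in beta)
    return loss / n + lam * ((1.0 - alpha) * ridge_part + alpha * lasso_part)


# With a perfect fit only the penalty remains.
X_demo = [[1.0, 0.0], [0.0, 1.0]]
y_demo = [1.0, 2.0]
ridge_value = elastic_net_objective(0.0, [1.0, 2.0], X_demo, y_demo, lam=0.5, alpha=0.0)
lasso_value = elastic_net_objective(0.0, [1.0, 2.0], X_demo, y_demo, lam=0.5, alpha=1.0)
```

Note that this function divides the error term by N, as the formula does. The hand calculation in the text applies a factor of 0.2 to both the error term and the penalty, so the two will not produce identical numbers.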

If we consider the model with no intercept, and assume the weights are all 1, λ is 0.2, and α is 0, we are choosing betas to minimize

$$0.2\left[\frac{(\beta_{white}-10)^2}{2}+\frac{(\beta_{black}-15)^2}{2}+\frac{(\beta_{red}-20)^2}{2}\right]+0.2\left[\frac{\beta_{white}^2}{2}+\frac{\beta_{black}^2}{2}+\frac{\beta_{red}^2}{2}\right].$$

For $\beta_{white}$, we are minimizing $0.1(\beta_{white}-10)^2+0.1\beta_{white}^2$, or $0.2\beta_{white}^2-2\beta_{white}+10$. The derivative of this is $0.4\beta_{white}-2$, so this is minimized when $\beta_{white}=5$.

For $\beta_{black}$, we are minimizing $0.1(\beta_{black}-15)^2+0.1\beta_{black}^2$, or $0.2\beta_{black}^2-3\beta_{black}+22.5$. The derivative of this is $0.4\beta_{black}-3$, so this is minimized when $\beta_{black}=7.5$.

For $\beta_{red}$, we are minimizing $0.1(\beta_{red}-20)^2+0.1\beta_{red}^2$, or $0.2\beta_{red}^2-4\beta_{red}+40$. The derivative of this is $0.4\beta_{red}-4$, so this is minimized when $\beta_{red}=10$.

In each case, our $\beta$ is half of what we get with no shrinkage.
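The halving can be checked in general. As a short derivation in notation assumed here (with $y$ standing for the single observed response of a color and $c$ for the common weight on the error and penalty terms, 0.2 in this example), each coefficient solves:

$$\frac{d}{d\beta}\left[\frac{c}{2}(\beta-y)^2+\frac{c}{2}\beta^2\right]=c(\beta-y)+c\beta=0\quad\Longrightarrow\quad\beta=\frac{y}{2}$$

With $y$ equal to 10, 15, and 20 this gives 5, 7.5, and 10. If the two weights differed, say $a$ on the error term and $\lambda$ on the penalty, the same steps would give $\beta=ay/(a+\lambda)$, so the exact halving is specific to equal weights.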

If we include an intercept and make white our base class, we will get a better result. Here, we are minimizing

$$0.2\left[\frac{(\beta_0-10)^2}{2}+\frac{(\beta_0+\beta_{black}-15)^2}{2}+\frac{(\beta_0+\beta_{red}-20)^2}{2}\right]+0.2\left[\frac{\beta_0^2}{2}+\frac{\beta_{black}^2}{2}+\frac{\beta_{red}^2}{2}\right].$$

The derivative with respect to $\beta_0$ is $0.2[(\beta_0-10)+(\beta_0+\beta_{black}-15)+(\beta_0+\beta_{red}-20)]+0.2\beta_0=0.2[4\beta_0+\beta_{black}+\beta_{red}-45]$.

The derivative with respect to $\beta_{black}$ is $0.2(\beta_0+\beta_{black}-15)+0.2\beta_{black}=0.2[\beta_0+2\beta_{black}-15]$.

The derivative with respect to $\beta_{red}$ is $0.2(\beta_0+\beta_{red}-20)+0.2\beta_{red}=0.2[\beta_0+2\beta_{red}-20]$.

These derivatives are all 0 when $\beta_0=9.167$, $\beta_{black}=2.917$, and $\beta_{red}=5.417$.

Making black the base case, we are minimizing

$$0.2\left[\frac{(\beta_0+\beta_{white}-10)^2}{2}+\frac{(\beta_0-15)^2}{2}+\frac{(\beta_0+\beta_{red}-20)^2}{2}\right]+0.2\left[\frac{\beta_0^2}{2}+\frac{\beta_{white}^2}{2}+\frac{\beta_{red}^2}{2}\right].$$

The derivative with respect to $\beta_0$ is $0.2[(\beta_0+\beta_{white}-10)+(\beta_0-15)+(\beta_0+\beta_{red}-20)]+0.2\beta_0=0.2[4\beta_0+\beta_{white}+\beta_{red}-45]$.

The derivative with respect to $\beta_{white}$ is $0.2(\beta_0+\beta_{white}-10)+0.2\beta_{white}=0.2[\beta_0+2\beta_{white}-10]$.

The derivative with respect to $\beta_{red}$ is $0.2(\beta_0+\beta_{red}-20)+0.2\beta_{red}=0.2[\beta_0+2\beta_{red}-20]$.

These derivatives are all 0 when $\beta_0=10$, $\beta_{white}=0$, and $\beta_{red}=5$.

Making red the base case, we are minimizing

$$0.2\left[\frac{(\beta_0+\beta_{white}-10)^2}{2}+\frac{(\beta_0+\beta_{black}-15)^2}{2}+\frac{(\beta_0-20)^2}{2}\right]+0.2\left[\frac{\beta_0^2}{2}+\frac{\beta_{white}^2}{2}+\frac{\beta_{black}^2}{2}\right].$$

The derivative with respect to $\beta_0$ is $0.2[(\beta_0+\beta_{white}-10)+(\beta_0+\beta_{black}-15)+(\beta_0-20)]+0.2\beta_0=0.2[4\beta_0+\beta_{white}+\beta_{black}-45]$.

The derivative with respect to $\beta_{white}$ is $0.2(\beta_0+\beta_{white}-10)+0.2\beta_{white}=0.2[\beta_0+2\beta_{white}-10]$.

The derivative with respect to $\beta_{black}$ is $0.2(\beta_0+\beta_{black}-15)+0.2\beta_{black}=0.2[\beta_0+2\beta_{black}-15]$.

These derivatives are all 0 when $\beta_0=10.833$, $\beta_{white}=-0.417$, and $\beta_{black}=2.083$.

To summarize, we found the following four cases:

Not using an intercept gave us the worst results (the largest mean squared error of our predictions) because this case requires all three coefficients to be large. In the other three cases, we can have a large intercept, which brings all three estimates close to the actual values. The other two parameters can then be smaller than they are in the no-intercept case. Thus, the penalty term is smaller and we see less shrinkage when the intercept term is in the model. (This is true in our case because all three observations have the same sign.) We have the best results when white is the base case. This is because the pre-shrinkage parameters are smaller (after being squared) than in the other two cases.
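The four hand calculations can be reproduced numerically. The Python sketch below follows the objective as written in the text: a weight of 0.2 on both the squared-error term and the squared-coefficient term, with the intercept penalized like any other coefficient (glmnet itself leaves the intercept unpenalized). The helper names `solve`, `penalized_fit`, and `mse` are illustrative.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def penalized_fit(X, y, loss_w=0.2, pen_w=0.2):
    """Minimize loss_w * sum(resid^2)/2 + pen_w * sum(beta^2)/2.

    Every column of X is penalized, including an intercept column of ones.
    Setting the gradient to zero gives a linear system in beta.
    """
    p = len(X[0])
    lhs = [[loss_w * sum(r[i] * r[j] for r in X) + (pen_w if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    rhs = [loss_w * sum(r[i] * t for r, t in zip(X, y)) for i in range(p)]
    return solve(lhs, rhs)


def mse(X, y, beta):
    """Mean squared error of the fitted values."""
    preds = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return sum((t - p) ** 2 for t, p in zip(y, preds)) / len(y)


y = [10.0, 15.0, 20.0]  # responses for white, black, red
designs = {
    # rows are the white, black and red observations
    "no intercept": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],  # white, black, red
    "white base": [[1, 0, 0], [1, 1, 0], [1, 0, 1]],    # intercept, black, red
    "black base": [[1, 1, 0], [1, 0, 0], [1, 0, 1]],    # intercept, white, red
    "red base": [[1, 1, 0], [1, 0, 1], [1, 0, 0]],      # intercept, white, black
}
results = {name: penalized_fit(X, y) for name, X in designs.items()}
errors = {name: mse(designs[name], y, results[name]) for name in designs}
for name in designs:
    print(name, [round(b, 3) for b in results[name]], round(errors[name], 2))
```

Under these assumptions the coefficients agree with the values derived above, and the mean squared error is smallest with white as the base class (about 12.8) and largest with no intercept (about 60.4).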

The hand calculation above is illustrative, but it doesn’t reflect how we actually do things. Glmnet (a popular R package) first normalizes the variables and then applies shrinkage. We use Glmnet for these four cases with lasso, ridge, and a blend. Note that because the variables were normalized, these results don’t match the ones above. The code for this is in Appendix 1.

In these cases, we see the best results (in terms of mean squared error) with white as the base class. In Appendix 2, we show that this is the case even after normalizing the variables.

When categorical variables have the same number of observations for each category, normalizing their respective indicator variables will scale them all by the same amount. After normalizing, glmnet will therefore still give the best results using a case with an effect in the middle as the base class. We suspect this would be the median effect. If some categories are more common than others, the variables will be scaled differently and it’s less clear which effect we will want for the base class. This would be an interesting topic for further study.
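The claim about equal scaling is easy to check. This Python sketch computes the standard deviation of each 0/1 indicator for a balanced and an unbalanced set of labels. The made-up labels and the use of the 1/N form of the variance are assumptions for illustration; the comparison does not depend on that choice.

```python
from math import sqrt


def indicator_sd(labels, level):
    """Standard deviation (1/N form) of the 0/1 indicator for `level`."""
    p = sum(1 for v in labels if v == level) / len(labels)
    return sqrt(p * (1.0 - p))


levels = ("white", "black", "red")

# Equal counts: every indicator has the same spread, so the same scale factor.
balanced = ["white", "black", "red"] * 4
balanced_sds = {c: indicator_sd(balanced, c) for c in levels}

# Unequal counts: rarer categories get a smaller spread and a larger rescaling.
unbalanced = ["white"] * 8 + ["black"] * 3 + ["red"] * 1
unbalanced_sds = {c: indicator_sd(unbalanced, c) for c in levels}
```

With equal counts the three values coincide, so standardizing rescales all indicators identically. With unequal counts they differ, which is why the best base class is less obvious in that setting.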

## 3. Scaling Predictors

With elastic net (and its special cases of ridge regression and lasso), we might have two predictors with the same correlation with our target. The one with a larger variance (and a smaller coefficient in any model) will see a smaller change from shrinkage. The one with a smaller variance (and a larger coefficient in any model) will see more shrinkage. The common approach to avoid this is to standardize all variables so that they each have a variance of 1.0. This section will propose an alternative method.
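The dependence of shrinkage on scale can be seen in the simplest setting. This Python sketch assumes a one-predictor ridge fit with no intercept, minimizing the sum of squared errors over 2 plus λβ²/2, where the fitted slope is the least-squares slope multiplied by Sxx / (Sxx + λ). The function name and the `scale` argument are illustrative.

```python
def ridge_shrinkage_ratio(x, lam, scale=1.0):
    """Fraction of the least-squares slope kept by a one-predictor ridge fit.

    The predictor is multiplied by `scale` before fitting. Expressed on the
    original scale of x, the ridge slope equals this ratio times the
    least-squares slope, so a ratio near 1 means little shrinkage.
    """
    sxx = sum((scale * v) ** 2 for v in x)
    return sxx / (sxx + lam)


x = [1.0, 2.0, 3.0]
kept_original = ridge_shrinkage_ratio(x, lam=14.0)            # 14 / 28
kept_rescaled = ridge_shrinkage_ratio(x, lam=14.0, scale=2.0)  # 56 / 70
```

Doubling the predictor raises the retained fraction from 0.5 to 0.8 for the same λ, which is the effect that standardization, and the alternative proposed in this section, are meant to control.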

We will illustrate our proposal with another simple example. We simulate a set of 200 cars: 100 will be sedans and 100 will be SUVs. Each car will be sold at one of six dealers. The price of each car will be \$30,000 + \$8,000 * SUV + \$10,000 * ε, where ε is *N*(0,1). We will fit a linear model to predict price using SUV and dealer. For our base case, we use the glmnet package in R with the standardize option.
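A data set of this shape can be generated in a few lines. The Python sketch below is not the R code from the appendices; it assumes that each car is assigned to one of the six dealers uniformly at random, which the description above does not specify, and the function and field names are illustrative.

```python
import random


def simulate_cars(seed=0, n_per_type=100, n_dealers=6):
    """Simulate sedans and SUVs with price = 30000 + 8000 * SUV + 10000 * eps.

    Dealer has no effect on price, so the dealer indicators are pure
    nuisance variables. Uniform dealer assignment is an assumption.
    """
    rng = random.Random(seed)
    cars = []
    for suv in (0, 1):
        for _ in range(n_per_type):
            dealer = rng.randrange(n_dealers)
            price = 30000 + 8000 * suv + 10000 * rng.gauss(0.0, 1.0)
            cars.append({"suv": suv, "dealer": dealer, "price": price})
    return cars


cars = simulate_cars(seed=1)
```

Because the dealer plays no role in the price, a good fit should return a coefficient near 8,000 for the SUV indicator and coefficients near zero for every dealer indicator.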

Our proposal is to first fit separate linear regression models^{[2]} with each predictor variable. From the linear regression models, the standard deviation of each parameter estimate will give us a sense of how confident we are in that parameter. For each $x_i$, we will multiply all values of $x_i$ by a factor $a_i$, computed from these standard deviations, where $p$ is the number of parameters in the model^{[3]}. After running the elastic net with the adjusted predictors, we will multiply each of our $\beta$’s by $a_i$ to get parameter estimates for the original predictors.

### 3.1. Intuition

In our SUV example above, we might find that $sd(\beta_j)$ is around 1300 for the SUV indicator and around 2000 for each dealer indicator. Thus, $a_{SUV}$ would be around 1.45 and each $a_{dealer}$ would be on the order of 0.94. Multiplying our indicators by these factors would make the SUV indicator larger and $\beta_{SUV}$ smaller, and the dealer indicators smaller and $\beta_{dealer}$ larger.

Now, elastic net is trying to minimize the following function^{[4]}:

$$\min_{\beta_0,\beta}\ \frac{1}{N}\sum_{i=1}^{N} w_i\, l(y_i,\ \beta_0+\beta^T x_i)+\lambda\left[(1-\alpha)\|\beta\|_2^2/2+\alpha\|\beta\|_1\right]$$

Since our adjustment has made $\beta_{SUV}$ smaller, it will not be as material in the second half of the equation and elastic net will have a smaller effect on it. Similarly, our adjustment has made $\beta_{dealer}$ larger, so it will be more material in the second half of the equation and elastic net will have a larger effect on it.

### 3.2. Multiple Simulations with fixed λ

We repeated the above simulation 10,000 times and saved all of the parameters. Below we show the mean and empirical standard deviation from our simulations for each parameter for the same six scenarios that we considered above. We use a smaller λ with the adjusted predictors because the adjustment reduced the estimates for the dealer variables.

We see here that our adjusted approach is closer to the 8,000 actual value for $\beta_{SUV}$ for elastic net and lasso and as close for ridge regression. We also see that the standard deviations for the nuisance variables are much smaller for the proposed approach. The code for this simulation is in Appendix 3.
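The adjustment itself (single-predictor regressions, then rescaling each predictor) can be sketched in Python as follows. The expression for the multiplier $a_i$ is not reproduced in the text, so the form used here, the average standard error across the $p$ predictors divided by the standard error for predictor $i$, is an assumption. It was chosen because it reproduces the values of about 1.45 and 0.94 quoted in section 3.1 when there is one SUV indicator and five dealer indicators. All function names are illustrative.

```python
from math import sqrt


def slope_standard_error(x, y):
    """Standard error of the slope in a simple regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    sxy = sum((v - mx) * (t - my) for v, t in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((t - intercept - slope * v) ** 2 for v, t in zip(x, y))
    return sqrt(rss / (n - 2) / sxx)


def factors_from_standard_errors(ses):
    """ASSUMED form of the multipliers: mean standard error / own standard error."""
    mean_se = sum(ses) / len(ses)
    return [mean_se / se for se in ses]


def scaling_factors(columns, y):
    """One multiplier per predictor column, from separate simple regressions."""
    return factors_from_standard_errors([slope_standard_error(c, y) for c in columns])


def back_transform(betas, factors):
    """Convert coefficients fitted on a_i * x_i back to the original x_i."""
    return [b * a for b, a in zip(betas, factors)]


# Standard errors quoted in section 3.1: one SUV indicator, five dealer indicators.
example_factors = factors_from_standard_errors([1300.0] + [2000.0] * 5)
```

A predictor whose coefficient is estimated precisely gets a multiplier above 1, which makes its coefficient smaller and therefore less exposed to the penalty; a noisy predictor gets the opposite treatment.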

### 3.3. Multiple Simulations with λ chosen by cross-validation

We repeated the above simulation with the cross-validation option to select the value of λ. We did this 1,000 times because it was slower, and we saved all of the parameters. Below we show the mean and empirical standard deviation from our simulations for each parameter for the same six scenarios that we considered above.

We see here that our adjusted approach is closer to the 8,000 actual value for $\beta_{SUV}$ with a slightly smaller standard deviation, though this could be noise. We also see that the standard deviations for the nuisance variables are much smaller for the proposed approach. This code is in Appendix 4.
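Choosing λ by cross-validation follows a simple pattern: fit on part of the data, score on the held-out part, and keep the λ with the lowest held-out error. The Python sketch below shows that pattern for a one-predictor ridge fit with an unpenalized intercept. It is a simplified stand-in and not glmnet's cross-validation routine; the fold assignment, the grid, and the function names are illustrative.

```python
def ridge_fit_1d(x, y, lam):
    """Ridge fit with one predictor; the intercept is not penalized."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    sxy = sum((v - mx) * (t - my) for v, t in zip(x, y))
    slope = sxy / (sxx + lam)
    return my - slope * mx, slope


def cv_error(x, y, lam, k=5):
    """Mean squared held-out error over k folds (observation i is in fold i % k)."""
    n = len(x)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        b0, b1 = ridge_fit_1d([x[i] for i in train], [y[i] for i in train], lam)
        total += sum((y[i] - b0 - b1 * x[i]) ** 2 for i in held_out)
    return total / n


def choose_lambda(x, y, grid, k=5):
    """Return the grid value with the smallest cross-validated error."""
    return min(grid, key=lambda lam: cv_error(x, y, lam, k))


# With noiseless linear data, any shrinkage only hurts, so 0 is selected.
x = [float(v) for v in range(20)]
y = [2.0 * v + 1.0 for v in x]
best = choose_lambda(x, y, [0.0, 1.0, 10.0])
```

With noisy data the selected λ varies from one simulated data set to the next, which is one reason the cross-validated results are noisier than those with a fixed λ.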

## 4. Conclusion

In section 2, we showed that the choice of the base class can improve our estimates. In section 3, our simulations show that this scaling approach provides better estimates. It is also supported by the intuition that we shared in section 3.1. We are seeing these benefits at a minimal cost: fitting *n* simple models, each having one predictor. In light of this, we recommend the reader try these methods at home.