Introduction
Machine learning methods allow us to automate much of our modeling work. One common example is using Elastic Net, Ridge Regression, and Lasso for variable selection and shrinkage [Zou]. In this article, we suggest two adjustments that can improve the results from these methods. We create simple illustrative data and models to show how both of these adjustments would work. The R code for these simulations is in the appendices.
1. Historical background
Ridge regression was invented by Arthur Hoerl and Robert Kennard, who were both working at DuPont. They found this method very helpful in dealing with correlated predictors, and it was tractable with the limited computing power they had at the time. Ridge regression is performed by adding some multiple of the identity matrix to X’X when performing ordinary least squares regression. [Hoerl] gives an interesting history of ridge regression.
Ridge regression left all of the predictors in the model. To address this, Robert Tibshirani invented lasso regression. Lasso regression “enjoys some of the favourable properties of both subset selection and ridge regression,” by removing some variables and then “shrinking” the parameters for the remaining variables [Tibshirani].
Later, Hui Zou and Trevor Hastie blended these methods to create the elastic net, which allows the user more flexibility [Zou]. We will propose an approach that improves all three methods.
2. Selecting the base class for categorical variables
When we have categorical variables, the choice of which class we make the base class does not matter without shrinkage. When we use shrinkage, the function we are minimizing includes the parameters themselves, so our choice of base class will influence our estimates. We show this with a simple example. Consider this simple dataset:
| Car color | Rental Car Daily Insurance Premium |
|---|---|
| White | $10 |
| Black | $15 |
| Red | $20 |
We only have three degrees of freedom, so we can either fit a linear model with no intercept, Y = ∑ βcolor·1color, or we can fit a model with an intercept and terms for only two of the three colors, Y = β0 + ∑ βcolor·1color. If we use linear regression, these options give the same predictions because the two parameterizations are equivalent.
With shrinkage, this is more complicated, as we are trying to minimize a combination of the prediction errors and the size of the coefficients. With elastic net [Hastie], we are trying to minimize:
\min_{\beta_0,\beta}\ \frac{1}{N}\sum_{i=1}^{N} w_i\, l\left(y_i,\beta_0+\beta^T x_i\right)+\lambda\left[(1-\alpha)\|\beta\|_2^2/2+\alpha\|\beta\|_1\right]
Per their article, “l(yi, ηi) is the negative log-likelihood contribution for observation i; e.g., for the Gaussian case it is ½(yi − ηi)². The elastic net penalty is controlled by α and bridges the gap between lasso regression (α = 1, the default) and ridge regression (α = 0). The tuning parameter λ controls the overall strength of the penalty.”
If we consider the model with no intercept, and assume the weights are all 1, λ is 0.2, and α is 0, we are choosing the β's to minimize
0.2[(βwhite − 10)²/2 + (βblack − 15)²/2 + (βred − 20)²/2] + 0.2[βwhite²/2 + βblack²/2 + βred²/2].
For βwhite, we are minimizing 0.1(βwhite − 10)² + 0.1βwhite², i.e., 0.2βwhite² − 2βwhite + 10. The derivative of this is 0.4βwhite − 2, so this is minimized when βwhite = 5.
For βblack, we are minimizing 0.1(βblack − 15)² + 0.1βblack², i.e., 0.2βblack² − 3βblack + 22.5. The derivative of this is 0.4βblack − 3, so this is minimized when βblack = 7.5.
For βred, we are minimizing 0.1(βred − 20)² + 0.1βred², i.e., 0.2βred² − 4βred + 40. The derivative of this is 0.4βred − 4, so this is minimized when βred = 10.
In each case, our β is half of what we get with no shrinkage.
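As a quick check, we can minimize this illustrative objective numerically in R (this is a sketch of the objective written above, not a call to glmnet):
# numerically minimize 0.2*sum((beta - y)^2/2) + 0.2*sum(beta^2/2)
y <- c(10, 15, 20)
obj <- function(beta) 0.2 * sum((beta - y)^2 / 2) + 0.2 * sum(beta^2 / 2)
round(optim(c(0, 0, 0), obj, method = "BFGS")$par, 3)   # 5.0, 7.5, 10.0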
If we include an intercept and make white our base class, we will get a better result. Here, we are minimizing
0.2[(β0 − 10)²/2 + (β0 + βblack − 15)²/2 + (β0 + βred − 20)²/2] + 0.2[β0²/2 + βblack²/2 + βred²/2].
The derivative with respect to β0 is 0.2[(β0 − 10) + (β0 + βblack − 15) + (β0 + βred − 20)] + 0.2β0 = 0.2[4β0 + βblack + βred − 45].
The derivative with respect to βblack is 0.2(β0 + βblack − 15) + 0.2βblack = 0.2[β0 + 2βblack − 15].
The derivative with respect to βred is 0.2(β0 + βred − 20) + 0.2βred = 0.2[β0 + 2βred − 20].
These derivatives are all 0 when β0 = 9.167, βblack = 2.917, and βred = 5.417.
Making black the base class, we are minimizing
0.2[(β0 + βwhite − 10)²/2 + (β0 − 15)²/2 + (β0 + βred − 20)²/2] + 0.2[β0²/2 + βwhite²/2 + βred²/2].
The derivative with respect to β0 is 0.2[(β0 + βwhite − 10) + (β0 − 15) + (β0 + βred − 20)] + 0.2β0 = 0.2[4β0 + βwhite + βred − 45].
The derivative with respect to βwhite is 0.2(β0 + βwhite − 10) + 0.2βwhite = 0.2[β0 + 2βwhite − 10].
The derivative with respect to βred is 0.2(β0 + βred − 20) + 0.2βred = 0.2[β0 + 2βred − 20].
These derivatives are all 0 when β0 = 10, βwhite = 0, and βred = 5.
Making red the base class, we are minimizing
0.2[(β0 + βwhite − 10)²/2 + (β0 + βblack − 15)²/2 + (β0 − 20)²/2] + 0.2[β0²/2 + βwhite²/2 + βblack²/2].
The derivative with respect to β0 is 0.2[(β0 + βwhite − 10) + (β0 + βblack − 15) + (β0 − 20)] + 0.2β0 = 0.2[4β0 + βwhite + βblack − 45].
The derivative with respect to βwhite is 0.2(β0 + βwhite − 10) + 0.2βwhite = 0.2[β0 + 2βwhite − 10].
The derivative with respect to βblack is 0.2(β0 + βblack − 15) + 0.2βblack = 0.2[β0 + 2βblack − 15].
These derivatives are all 0 when β0 = 10.833, βwhite = −0.417, and βblack = 2.083.
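The three sets of first-order conditions above are linear in the coefficients, so we can solve them directly in R and reproduce the values quoted (a sketch of the hand calculation; solve_case is a helper we define only for this illustration):
# each base-class case gives the system
#   4*b0 + b_a + b_b = sum of the three premiums
#     b0 + 2*b_a     = premium for the first non-base color
#     b0 + 2*b_b     = premium for the second non-base color
# the result is returned in the order (b0, b_a, b_b)
solve_case <- function(y_base, y_a, y_b) {
  A <- rbind(c(4, 1, 1),
             c(1, 2, 0),
             c(1, 0, 2))
  solve(A, c(y_base + y_a + y_b, y_a, y_b))
}
round(solve_case(10, 15, 20), 3)   # white base:  9.167  2.917  5.417
round(solve_case(15, 10, 20), 3)   # black base: 10.000  0.000  5.000
round(solve_case(20, 10, 15), 3)   # red base:   10.833 -0.417  2.083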
To summarize, we found the following four cases:
| Case | β0 (no shrinkage) | βwhite (no shrinkage) | βblack (no shrinkage) | βred (no shrinkage) | β0 (shrinkage) | βwhite (shrinkage) | βblack (shrinkage) | βred (shrinkage) | Pred. White | Pred. Black | Pred. Red |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Actual Values | | | | | | | | | 10 | 15 | 20 |
| No intercept | . | 10 | 15 | 20 | . | 5 | 7.5 | 10 | 5 | 7.5 | 10 |
| White is base class | 10 | . | 5 | 10 | 9.167 | . | 2.917 | 5.417 | 9.167 | 12.083 | 14.583 |
| Black is base class | 15 | -5 | . | 5 | 10 | 0 | . | 5 | 10 | 10 | 15 |
| Red is base class | 20 | -10 | -5 | . | 10.833 | -0.417 | 2.083 | . | 10.417 | 12.917 | 10.833 |
Not using an intercept gave us the worst results (the largest mean squared error of our predictions) because this case requires all three coefficients to be large. In the other three cases, a large intercept brings all three estimates close to the actual values, and the two remaining parameters can then be smaller than they are in the no-intercept case. The penalty term is therefore smaller, and we see less shrinkage when the intercept is in the model. (This is true in our example because all three observations have the same sign.) We get the best results when white is the base class, because the pre-shrinkage parameters (including the intercept, which our formulation also penalizes) are smaller, after being squared, than in the other two cases.
The above is illustrative, but it doesn’t reflect how these models are usually fit. glmnet (a popular R package) first normalizes the variables and then applies shrinkage. We ran glmnet on these four cases with lasso, ridge, and a blend of the two. Note that because the variables were normalized, these results don’t match the ones above. The code for this is in Appendix 1.
| Case | β0 | βwhite | βblack | βred | Pred. White | Pred. Black | Pred. Red | MSE |
|---|---|---|---|---|---|---|---|---|
| Ridge – red base class | 18.36 | -7.18 | -2.88 | . | 11.17 | 15.47 | 18.36 | 4.28 |
| Ridge – black base class | 15.00 | -4.30 | . | 4.30 | 10.70 | 15.00 | 19.30 | 0.98 |
| Ridge – white base class | 11.64 | . | 2.88 | 7.18 | 11.64 | 14.53 | 18.83 | 4.28 |
| Ridge – no intercept | . | 9.59 | 14.38 | 19.18 | 9.59 | 14.38 | 19.18 | 1.22 |
| Lasso – red base class | 17.17 | -5.76 | -0.76 | . | 11.41 | 16.41 | 17.17 | 11.99 |
| Lasso – black base class | 15.00 | -3.59 | . | 3.59 | 11.41 | 15.00 | 18.59 | 3.98 |
| Lasso – white base class | 12.83 | . | 0.76 | 5.76 | 12.83 | 13.59 | 18.59 | 11.99 |
| Lasso – no intercept | . | 8.59 | 13.59 | 18.59 | 8.59 | 13.59 | 18.59 | 5.96 |
| Elastic net – red base class | 17.88 | -6.63 | -2.01 | . | 11.25 | 15.81 | 17.88 | 6.71 |
| Elastic net – black base class | 15.00 | -3.97 | . | 3.97 | 11.03 | 15.00 | 18.97 | 2.12 |
| Elastic net – white base class | 12.12 | . | 2.01 | 6.63 | 12.12 | 14.13 | 18.75 | 6.81 |
| Elastic net – no intercept | . | 9.10 | 13.99 | 18.89 | 9.10 | 13.99 | 18.89 | 3.06 |
In these cases, we see the best results (in terms of mean squared error) with black as the base class. In Appendix 2, we show that, even after normalizing the variables, the pre-shrinkage coefficients are smallest when black is the base class.
When categorical variables have the same number of observations in each category, normalizing their respective indicator variables scales them all by the same amount; so, after normalizing, glmnet will still give the best results when the base class is a category with an effect in the middle. We suspect this would be the median effect. If some categories are more common than others, the indicators will be scaled differently, and it is less clear which category we will want as the base class. This would be an interesting topic for further study.
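As a small illustration of the equal-counts claim (our own sketch, not part of the original simulation), the three 0/1 color indicators in our example each have the same standard deviation, so glmnet's internal standardization rescales them identically:
# one row per color, as in the premium example above
x <- diag(3)
apply(x, 2, sd)   # 0.577 for every indicator, i.e. 1/sqrt(3)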
3. Scaling Predictors
With elastic net (and its special cases, ridge regression and lasso), we might have two predictors with the same correlation with our target. The one with the larger variance (and therefore a smaller coefficient in any model) will see a smaller change from shrinkage, while the one with the smaller variance (and a larger coefficient) will see more shrinkage. The common approach to avoiding this is to standardize all variables so that each has a variance of 1.0. This section proposes an alternative method.
We will illustrate our proposal with another simple example. We simulate a set of 200 cars: 100 sedans and 100 SUVs. Each car is sold at one of six dealers. The price of each car is $30,000 + $8,000 · SUV + $10,000 · ε, where ε is N(0,1). We fit a linear model to predict price using SUV and dealer. For our base case, we use the glmnet package in R with the standardize option.
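Below is a compact sketch of one such simulated dataset and a standardized glmnet fit (the seed and the uniform dealer assignment here are illustrative only; Appendix 3 assigns dealers by binning a normal draw):
# simulate 200 cars and fit an elastic net with glmnet's built-in standardization
library(glmnet)
set.seed(123)                                         # illustrative seed
suv    <- rep(0:1, each = 100)
dealer <- factor(sample(paste0("dealer", 1:6), 200, replace = TRUE))
price  <- 30000 + 8000 * suv + 10000 * rnorm(200)
cars   <- data.frame(price, suv, dealer)
X   <- model.matrix(~ suv + dealer, cars)[, -1]       # drop the intercept column
fit <- glmnet(X, cars$price, alpha = 0.5, standardize = TRUE)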
Our proposal is to first fit a separate linear regression model for each predictor variable. The standard deviation (standard error) of each parameter estimate from these one-predictor regressions gives us a sense of how confident we are in that parameter. For each xi, we multiply all values of xi by a_i = \left(\sum_{j=1}^{p} sd(\beta_j)\right) / \left(p \cdot sd(\beta_i)\right), where p is the number of parameters in the model. After running the elastic net with the adjusted predictors, we multiply each fitted β by the corresponding a_i to get parameter estimates for the original predictors.
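The following is a minimal sketch of this adjustment for a generic numeric predictor matrix X and response vector y (the function name adjust_predictors and the use of sweep are our own illustration; the full simulation code is in Appendix 3):
# fit a one-predictor regression on each column, take the standard error of its
# slope, and scale each column by a_i = mean(se) / se_i
adjust_predictors <- function(X, y) {
  p  <- ncol(X)
  se <- sapply(seq_len(p), function(i) summary(lm(y ~ X[, i]))$coefficients[2, 2])
  a  <- mean(se) / se
  list(X_adj = sweep(X, 2, a, `*`), a = a)
}
# after fitting glmnet on X_adj, multiply each fitted coefficient by the matching
# a_i to recover estimates on the original scale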
3.1. Intuition
In our SUV example above, we might find that sd(βj) is around 1,300 for the SUV indicator and around 2,000 for each dealer indicator. Thus, aSUV would be around 1.45 and each adealer would be around 0.94. Multiplying our indicators by these factors makes the SUV indicator larger (and βSUV smaller) and the dealer indicators smaller (and each βdealer larger).
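Plugging these illustrative standard errors into the formula gives roughly those factors:
# illustrative standard errors from the paragraph above (not simulated values)
se <- c(suv = 1300, dealer1 = 2000, dealer2 = 2000, dealer3 = 2000,
        dealer4 = 2000, dealer5 = 2000)
round(mean(se) / se, 2)   # suv: 1.45, each dealer: 0.94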
Now, elastic net is trying to minimize the following function:
\min _{\beta_0, \beta} \frac{1}{N} \sum_{i=1}^N w_i l\left(y_i, \beta_0+\beta^T x_i\right)+\lambda\left[(1-\alpha)\|\beta\|_2^2 / 2+\alpha\|\beta\|_1\right]
Since our adjustment has made βSUV smaller, it contributes less to the penalty term, and the elastic net will shrink it less. Similarly, our adjustment has made each βdealer larger, so it contributes more to the penalty term and will be shrunk more.
3.2. Multiple Simulations with fixed λ
We repeated the above simulation 10,000 times and saved all of the parameters. Below we show the mean and empirical standard deviation of each parameter across our simulations for the same six scenarios that we considered above. We use a smaller λ with the adjusted predictors because the adjustment reduced the estimates for the dealer variables.
| Method | Parameter | Standardized predictors (λ = 100) | Adjusted predictors (λ = 50) |
|---|---|---|---|
| Ridge regression | βsuv | 7920 +/- 1409 | 7920 +/- 1410 |
| | βdealer1 | 21 +/- 3056 | 14 +/- 2518 |
| | βdealer2 | 24 +/- 2767 | 17 +/- 2282 |
| | βdealer3 | 60 +/- 2979 | 52 +/- 2476 |
| | βdealer4 | -7 +/- 3094 | -14 +/- 2540 |
| | βdealer5 | 44 +/- 3440 | 33 +/- 2668 |
| Lasso | βsuv | 7792 +/- 1415 | 7851 +/- 1417 |
| | βdealer1 | 9 +/- 1973 | -1 +/- 1726 |
| | βdealer2 | 10 +/- 1642 | 3 +/- 1536 |
| | βdealer3 | 41 +/- 1886 | 33 +/- 1709 |
| | βdealer4 | -21 +/- 2022 | -27 +/- 1750 |
| | βdealer5 | 28 +/- 2418 | 15 +/- 1854 |
| Elastic net (α = 0.5) | βsuv | 7856 +/- 1412 | 7885 +/- 1413 |
| | βdealer1 | 15 +/- 2405 | 8 +/- 2046 |
| | βdealer2 | 17 +/- 2095 | 9 +/- 1827 |
| | βdealer3 | 51 +/- 2324 | 42 +/- 2014 |
| | βdealer4 | -14 +/- 2447 | -21 +/- 2068 |
| | βdealer5 | 36 +/- 2817 | 24 +/- 2190 |
We see here that our adjusted approach is closer to the actual value of 8,000 for βsuv for elastic net and lasso, and just as close for ridge regression. We also see that the standard deviations for the nuisance variables are much smaller with the proposed approach. The code for this simulation is in Appendix 3.
3.3. Multiple Simulations with λ chosen by cross-validation
We repeated the above simulation with the cross-validation option to select the value of λ. We did this 1,000 times because it was slower, and we saved all of the parameters. Below we show the mean and empirical standard deviation from our simulations for each parameter for the same six scenarios that we considered above.
| Method | Parameter | Standardized predictors | Adjusted predictors |
|---|---|---|---|
| Ridge regression | βsuv | 6575 +/- 1617 | 6600 +/- 1500 |
| | βdealer1 | 84 +/- 2318 | 35 +/- 1656 |
| | βdealer2 | 43 +/- 1989 | 21 +/- 1595 |
| | βdealer3 | 30 +/- 2233 | 10 +/- 1689 |
| | βdealer4 | -50 +/- 2275 | -53 +/- 1603 |
| | βdealer5 | 90 +/- 2655 | 40 +/- 1715 |
| Lasso | βsuv | 7906 +/- 1427 | 7927 +/- 1420 |
| | βdealer1 | 42 +/- 1511 | -25 +/- 1437 |
| | βdealer2 | 31 +/- 1231 | -42 +/- 1346 |
| | βdealer3 | 7 +/- 1497 | -73 +/- 1426 |
| | βdealer4 | -30 +/- 1443 | -56 +/- 1327 |
| | βdealer5 | 42 +/- 1810 | -10 +/- 1542 |
| Elastic net (α = 0.1) | βsuv | 6578 +/- 1619 | 7902 +/- 1430 |
| | βdealer1 | 84 +/- 2317 | 7 +/- 1441 |
| | βdealer2 | 43 +/- 1988 | -16 +/- 1368 |
| | βdealer3 | 30 +/- 2232 | -37 +/- 1489 |
| | βdealer4 | -50 +/- 2273 | -57 +/- 1388 |
| | βdealer5 | 90 +/- 2654 | 28 +/- 1494 |
We see here that our adjusted approach is closer to the actual value of 8,000 for βsuv, with a slightly smaller standard deviation, though this could be noise. We also see that the standard deviations for the nuisance variables are much smaller with the proposed approach. This code is in Appendix 4.
4. Conclusion
In section 2, we showed that the choice of the base class can improve our estimates. In section 3, our simulations show that the proposed scaling approach provides better estimates; it is also supported by the intuition we shared in section 3.1. These benefits come at a minimal cost: fitting p simple models, each with one predictor. In light of this, we recommend the reader try these methods at home.
Appendices
Appendix 1 – Code for the base case example
# load necessary packages
library(data.table)
library(glmnet)
# create a dataset with indicators for each car color and a variable
# for insurance premium. We are using indicators for each car color
# because glmnet does not support categorical variables
dt <- rbind(c(1,0,0,10),
c(0,1,0,15),
c(0,0,1,20))
dt <- as.data.table(dt)
setnames(dt,names(dt),c('white','black','red','premium'))
# if we fit a linear regression, we will get coefficients that provide
# exact predictions, independent of our selection of the base class.
# we see this with the next lines of code.
lm(premium ~ white + black, data=dt)
lm(premium ~ black + red, data=dt)
# now, we will fit a number of models with elastic net and see how
# the coefficients and predictions vary
# glmnet expects a numeric matrix, so we wrap the predictors in as.matrix()
# first consider ridge regression with red as the base class
coef(glmnet(as.matrix(dt[,1:2]),dt$premium,alpha=0),s=1,
     exact=TRUE,x=as.matrix(dt[,1:2]),y=dt$premium)
# with black as the base class
coef(glmnet(as.matrix(dt[,c(1,3)]),dt$premium,alpha=0),s=1,
     exact=TRUE,x=as.matrix(dt[,c(1,3)]),y=dt$premium)
# with white as the base class
coef(glmnet(as.matrix(dt[,2:3]),dt$premium,alpha=0),s=1,
     exact=TRUE,x=as.matrix(dt[,2:3]),y=dt$premium)
# and without an intercept
coef(glmnet(as.matrix(dt[,1:3]),dt$premium,alpha=0,intercept=FALSE),s=1,
     exact=TRUE,x=as.matrix(dt[,1:3]),y=dt$premium)
# now consider lasso with the same four choices
# first red as the base class
coef(glmnet(as.matrix(dt[,1:2]),dt$premium,alpha=1),s=.1)
# with black as the base class
coef(glmnet(as.matrix(dt[,c(1,3)]),dt$premium,alpha=1),s=.1)
# then white as the base class
coef(glmnet(as.matrix(dt[,2:3]),dt$premium,alpha=1),s=.1)
# then with no intercept
coef(glmnet(as.matrix(dt[,1:3]),dt$premium,alpha=1,intercept=FALSE),s=.1)
# now consider an elastic net that blends ridge and lasso
# first red as the base class
coef(glmnet(as.matrix(dt[,1:2]),dt$premium,alpha=0.5),s=.1)
# with black as the base class
coef(glmnet(as.matrix(dt[,c(1,3)]),dt$premium,alpha=0.5),s=.1)
# then white as the base class
coef(glmnet(as.matrix(dt[,2:3]),dt$premium,alpha=0.5),s=.1)
# then with no intercept
coef(glmnet(as.matrix(dt[,1:3]),dt$premium,alpha=0.5,intercept=FALSE),s=.1)
Appendix 2 – Normalizing variables
We are considering the dataset:
| Car color | Rental Car Daily Insurance Premium |
|---|---|
| White | $10 |
| Black | $15 |
| Red | $20 |
If we use indicators that are 0 or 1 for each color, we can express this as
\left\lbrack \begin{array}{r} 10 \\ 15 \\ 20 \end{array} \right\rbrack = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 1 \end{bmatrix}\left\lbrack \begin{array}{r} \beta_{0} \\ \beta_{white} \\ \beta_{red} \end{array} \right\rbrack
and without shrinkage, our estimate would be β0=15, βwhite=-5, and βred=5.
Alternatively, we can normalize each variable so that the color indicators are \frac{- 1}{\sqrt{3}} when a car is not that color and \frac{2}{\sqrt{3}} when a car is that color. We can also normalize the premiums to be -1, 0, and 1. This gives us:
\left\lbrack \begin{array}{r} - 1 \\ 0 \\ 1 \end{array} \right\rbrack = \begin{bmatrix} 1 & \frac{2}{\sqrt{3}} & \frac{- 1}{\sqrt{3}} \\ 1 & \frac{- 1}{\sqrt{3}} & \frac{- 1}{\sqrt{3}} \\ 1 & \frac{- 1}{\sqrt{3}} & \frac{2}{\sqrt{3}} \end{bmatrix}\left\lbrack \begin{array}{r} \beta_{0} \\ \beta_{white} \\ \beta_{red} \end{array} \right\rbrack
In this case, our estimates without shrinkage will be β0=0, βwhite=\frac{- 1}{\sqrt{3}}, and βred=\frac{1}{\sqrt{3}}.
Alternatively, we could have made white or red the base case. With the normalized variables, we would have had either:
\left\lbrack \begin{array}{r} - 1 \\ 0 \\ 1 \end{array} \right\rbrack = \begin{bmatrix} 1 & \frac{- 1}{\sqrt{3}} & \frac{- 1}{\sqrt{3}} \\ 1 & \frac{2}{\sqrt{3}} & \frac{- 1}{\sqrt{3}} \\ 1 & \frac{- 1}{\sqrt{3}} & \frac{2}{\sqrt{3}} \end{bmatrix}\left\lbrack \begin{array}{r} \beta_{0} \\ \beta_{black} \\ \beta_{red} \end{array} \right\rbrack or \left\lbrack \begin{array}{r} - 1 \\ 0 \\ 1 \end{array} \right\rbrack = \begin{bmatrix} 1 & \frac{2}{\sqrt{3}} & \frac{- 1}{\sqrt{3}} \\ 1 & \frac{- 1}{\sqrt{3}} & \frac{2}{\sqrt{3}} \\ 1 & \frac{- 1}{\sqrt{3}} & \frac{- 1}{\sqrt{3}} \end{bmatrix}\left\lbrack \begin{array}{r} \beta_{0} \\ \beta_{white} \\ \beta_{black} \end{array} \right\rbrack
In these cases, our estimates without shrinkage would be β0=0, βblack=\frac{1}{\sqrt{3}}, and βred=\frac{2}{\sqrt{3}}, or β0=0, βwhite=\frac{- 2}{\sqrt{3}}, and βblack=\frac{- 1}{\sqrt{3}}, respectively. Thus, we see in this example that normalizing the indicator variables only scales the parameter estimates, and we still see the smallest estimates when black (the middle value) is the base class.
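As a quick check of these normalized values (our own sketch), R's scale() function reproduces them:
# normalizing a 0/1 color indicator and the premiums
round(scale(c(1, 0, 0)), 3)     #  1.155 -0.577 -0.577, i.e. 2/sqrt(3) and -1/sqrt(3)
round(scale(c(10, 15, 20)), 3)  # -1 0 1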
Appendix 3 – Code for fixed λ
# This program tests a possible adjustment to elastic net.
# We will simulate data with one real predictor (an SUV
# indicator) and several nuisance predictors (dealer
# indicators). We will then evaluate elastic net with the
# predictors standardized and with our proposed adjustment.
# This program repeats the simulation 10,000 times with a
# fixed lambda so that we can compare the distribution of the
# fitted coefficients under each approach.
# for illustrative purposes, we will use a simple linear
# model. our target will be car prices. they will be
# based on whether the car is an SUV, and we will add the
# dealer as the nuisance variable
# A - load libraries ----------------------------
library("data.table")
library("glmnet")
# B - set parameters ----------------------------
# set seed once
set.seed(12345)
# set the positions in the lambda grid (defined below) that we will read off:
# grid[s_value] = 100 for the standardized fits and grid[a_value] is
# approximately 50 for the adjusted fits
a_value = 64
s_value = 61
# the grid of lambda values that glmnet will fit
grid = 10^seq(8, -1, length = 91)
# B - simulate a dataset -------------------------
# we will simulate a dataset with car type, price, and dealer
# we will then fit models to predict the car price
# we will capture the coefficients over 10,000 simulations
# and see which method gives the closest coefficients
# B1 - set up arrays to capture values -----------
rrs <- NULL # ridge regression with standardized option
las <- NULL # lasso with standardized option
els <- NULL # elastic net with standardized option
rra <- NULL # ridge regression with adjustment and not standardized option
laa <- NULL # lasso with adjustment and not standardized option
ela <- NULL # elastic net with adjustment and not standardized option
# run simulations for each option - we will compare them below
for (i in 1:10000){
# B2 - simulate a dataset ------------------
# create raw variables
# make half the cars sedans (0) and half suvs (1).
suv <- as.data.table(c(rep(0,100),rep(1,100)))
# assign each car to one of six dealers
dealer <- as.data.table(rnorm(200))
dealer1 <- as.data.table(as.numeric(dealer < -1))
dealer2 <- as.data.table(as.numeric(dealer < 0)) -
dealer1
dealer3 <- as.data.table(as.numeric(dealer < 0.5)) -
dealer1 - dealer2
dealer4 <- as.data.table(as.numeric(dealer < 1)) -
dealer1 - dealer2 - dealer3
dealer5 <- as.data.table(as.numeric(dealer < 1.5)) -
dealer1 - dealer2 - dealer3 - dealer4
dealer6 <- as.data.table(as.numeric(dealer >= 1.5))
# put them in a table
noise <- as.data.table(rnorm(200))
cars <- cbind(suv,dealer1,dealer2,dealer3,dealer4,
dealer5,dealer6,noise)
setnames(cars,c("suv","dealer1","dealer2","dealer3",
"dealer4","dealer5","dealer6","noise"))
# and create the target variable
cars[,price := 30000+8000*suv+10000*noise]
# B3 - use elastic net to fit models with standardized predictors --------------
# first try ridge regression
mrr <- glmnet(as.matrix(cars[,1:6]),cars$price,family="gaussian",
alpha=0,lambda=grid,standardize = TRUE,
type.gaussian="covariance",
type.measure="class",
type.multinomial="ungrouped")
rrs <- rbind(rrs,coef(mrr)[,s_value])
# then, try lasso
mla <- glmnet(as.matrix(cars[,1:6]),cars$price,family="gaussian",
alpha=1,lambda=grid,standardize = TRUE,
type.gaussian="covariance",
type.measure="class",
type.multinomial="ungrouped")
las <- rbind(las,coef(mla)[,s_value])
# now, try elastic net
mel <- glmnet(as.matrix(cars[,1:6]),cars$price,family="gaussian",
alpha=0.5,lambda=grid,standardize = TRUE,
type.gaussian="covariance",
type.measure="class",
type.multinomial="ungrouped")
els <- rbind(els,coef(mel)[,s_value])
# B4 - proposed method - adjust predictors -------------------
# scale the predictors
# find the standard deviation for each predictor's estimate
scalers <- rep(0,6)
scalers[1] <- summary(lm(price ~ suv,data = cars))$coefficients[2,2]
scalers[2] <- summary(lm(price ~ dealer1,data = cars))$coefficients[2,2]
scalers[3] <- summary(lm(price ~ dealer2,data = cars))$coefficients[2,2]
scalers[4] <- summary(lm(price ~ dealer3,data = cars))$coefficients[2,2]
scalers[5] <- summary(lm(price ~ dealer4,data = cars))$coefficients[2,2]
scalers[6] <- summary(lm(price ~ dealer5,data = cars))$coefficients[2,2]
# scale them so that a_i = mean(se) / se_i
scalers <- mean(scalers) / scalers
# multiply each predictor by its scaling factor
# (use copy() so that we do not modify cars by reference)
cars2 <- copy(cars)
cars2[,suv := suv * scalers[1]]
cars2[,dealer1 := dealer1 * scalers[2]]
cars2[,dealer2 := dealer2 * scalers[3]]
cars2[,dealer3 := dealer3 * scalers[4]]
cars2[,dealer4 := dealer4 * scalers[5]]
cars2[,dealer5 := dealer5 * scalers[6]]
# first, try ridge regression on the new data
mrr2 <- glmnet(as.matrix(cars2[,1:6]),cars2$price,family="gaussian",
alpha=0,lambda=grid,standardize = FALSE,
type.gaussian="covariance",
type.measure="class",
type.multinomial="ungrouped")
coefs <- coef(mrr2)[,a_value]
coefs[2] <- coefs[2] * scalers[1]
coefs[3] <- coefs[3] * scalers[2]
coefs[4] <- coefs[4] * scalers[3]
coefs[5] <- coefs[5] * scalers[4]
coefs[6] <- coefs[6] * scalers[5]
coefs[7] <- coefs[7] * scalers[6]
rra <- rbind(rra,coefs)
# then, try lasso
mla2 <- glmnet(as.matrix(cars2[,1:6]),cars2$price,family="gaussian",
alpha=1,lambda=grid,standardize = FALSE,
type.gaussian="covariance",
type.measure="class",
type.multinomial="ungrouped")
coefs <- coef(mla2)[,a_value]
coefs[2] <- coefs[2] * scalers[1]
coefs[3] <- coefs[3] * scalers[2]
coefs[4] <- coefs[4] * scalers[3]
coefs[5] <- coefs[5] * scalers[4]
coefs[6] <- coefs[6] * scalers[5]
coefs[7] <- coefs[7] * scalers[6]
laa <- rbind(laa,coefs)
# now, try elastic net
mel2 <- glmnet(as.matrix(cars2[,1:6]),cars2$price,family="gaussian",
alpha=0.5,lambda=grid,standardize = FALSE,
type.gaussian="covariance",
type.measure="class",
type.multinomial="ungrouped")
coefs <- coef(mel2)[,a_value]
coefs[2] <- coefs[2] * scalers[1]
coefs[3] <- coefs[3] * scalers[2]
coefs[4] <- coefs[4] * scalers[3]
coefs[5] <- coefs[5] * scalers[4]
coefs[6] <- coefs[6] * scalers[5]
coefs[7] <- coefs[7] * scalers[6]
ela <- rbind(ela,coefs)
}
# C - calculate the range of estimates so that we can compare them
# first we capture the mean and standard deviation for ridge regression
# we do not divide by the square root of 10,000 because we are not calculating
# a confidence interval. we are calculating how much it can vary in a given
# simulation
for (i in 2:7){
print(round(mean(rrs[,i]),0))
print(round(var(rrs[,i])**0.5,0))
}
for (i in 2:7){
print(round(mean(rra[,i]),0))
print(round(var(rra[,i])**0.5,0))
}
# next is lasso
for (i in 2:7){
print(round(mean(las[,i]),0))
print(round(var(las[,i])**0.5,0))
}
for (i in 2:7){
print(round(mean(laa[,i]),0))
print(round(var(laa[,i])**0.5,0))
}
# and last is elastic net
for (i in 2:7){
print(round(mean(els[,i]),0))
print(round(var(els[,i])**0.5,0))
}
for (i in 2:7){
print(round(mean(ela[,i]),0))
print(round(var(ela[,i])**0.5,0))
}
Appendix 4 – Code for λ chosen by cross-validation
# This program tests a possible adjustment to elastic net.
# We will simulate data with one real predictor (an SUV
# indicator) and several nuisance predictors (dealer
# indicators). We will then evaluate elastic net with the
# predictors standardized and with our proposed adjustment.
# This program repeats the simulation 1,000 times, with the
# lambda for each fit chosen by cross-validation, so that we
# can compare the distribution of the fitted coefficients
# under each approach.
# for illustrative purposes, we will use a simple linear
# model. our target will be car prices. they will be
# based on whether the car is an SUV, and we will add the
# dealer as the nuisance variable
# A - load libraries ----------------------------
library("data.table")
library("glmnet")
# B - set parameters ----------------------------
# set seed once
set.seed(12345)
# B - simulate a dataset -------------------------
# we will simulate a dataset with car type, price, and dealer
# we will then fit models to predict the car price
# we will capture the coefficients over 1,000 simulations
# and see which method gives the closest coefficients
# B1 - set up arrays to capture values -----------
rrs <- NULL # ridge regression with standardized option
las <- NULL # lasso with standardized option
els <- NULL # elastic net with standardized option
rra <- NULL # ridge regression with adjustment and not standardized option
laa <- NULL # lasso with adjustment and not standardized option
ela <- NULL # elastic net with adjustment and not standardized option
# run simulations for each option - we will compare them below
for (i in 1:1000){
# B2 - simulate a dataset ------------------
# create raw variables
# make half the cars sedans (0) and half suvs (1).
suv <- as.data.table(c(rep(0,100),rep(1,100)))
# assign each car to one of six dealers
dealer <- as.data.table(rnorm(200))
dealer1 <- as.data.table(as.numeric(dealer < -1))
dealer2 <- as.data.table(as.numeric(dealer < 0)) -
dealer1
dealer3 <- as.data.table(as.numeric(dealer < 0.5)) -
dealer1 - dealer2
dealer4 <- as.data.table(as.numeric(dealer < 1)) -
dealer1 - dealer2 - dealer3
dealer5 <- as.data.table(as.numeric(dealer < 1.5)) -
dealer1 - dealer2 - dealer3 - dealer4
dealer6 <- as.data.table(as.numeric(dealer >= 1.5))
# put them in a table
noise <- as.data.table(rnorm(200))
cars <- cbind(suv,dealer1,dealer2,dealer3,dealer4,
dealer5,dealer6,noise)
setnames(cars,c("suv","dealer1","dealer2","dealer3",
"dealer4","dealer5","dealer6","noise"))
# and create the target variable
cars[,price := 30000+8000*suv+10000*noise]
# B3 - use elastic net to fit models with standardized predictors --------------
# first try ridge regression
cvfitrs <- cv.glmnet(as.matrix(cars[,1:6]), cars$price, family="gaussian",
alpha = 0, nfolds = 10, alignment = c("lambda", "fraction"),
relax = TRUE, standardize = TRUE)
rrs <- rbind(rrs,t(coef(cvfitrs,s="lambda.min")))
# then, try lasso
cvfitls <- cv.glmnet(as.matrix(cars[,1:6]), cars$price, family="gaussian",
alpha = 1, nfolds = 10, alignment = c("lambda", "fraction"),
relax = TRUE, standardize = TRUE)
las <- rbind(las,t(coef(cvfitls,s="lambda.min")))
# now, try elastic net
cvfites <- cv.glmnet(as.matrix(cars[,1:6]), cars$price, family="gaussian",
alpha = 0.5, nfolds = 10, alignment = c("lambda", "fraction"),
relax = TRUE, standardize = TRUE)
els <- rbind(els,t(coef(cvfites,s="lambda.min")))
# B4 - proposed method - adjust predictors -------------------
# scale the predictors
# find the standard deviation for each predictor's estimate
scalers <- rep(0,6)
scalers[1] <- summary(lm(price ~ suv,data = cars))$coefficients[2,2]
scalers[2] <- summary(lm(price ~ dealer1,data = cars))$coefficients[2,2]
scalers[3] <- summary(lm(price ~ dealer2,data = cars))$coefficients[2,2]
scalers[4] <- summary(lm(price ~ dealer3,data = cars))$coefficients[2,2]
scalers[5] <- summary(lm(price ~ dealer4,data = cars))$coefficients[2,2]
scalers[6] <- summary(lm(price ~ dealer5,data = cars))$coefficients[2,2]
# scale them so that a_i = mean(se) / se_i
scalers <- mean(scalers) / scalers
# multiply each predictor by its scaling factor
# (use copy() so that we do not modify cars by reference)
cars2 <- copy(cars)
cars2[,suv := suv * scalers[1]]
cars2[,dealer1 := dealer1 * scalers[2]]
cars2[,dealer2 := dealer2 * scalers[3]]
cars2[,dealer3 := dealer3 * scalers[4]]
cars2[,dealer4 := dealer4 * scalers[5]]
cars2[,dealer5 := dealer5 * scalers[6]]
# first, try ridge regression on the new data
cvfitra <- cv.glmnet(as.matrix(cars2[,1:6]), cars2$price, family="gaussian",
alpha = 0, nfolds = 10, alignment = c("lambda", "fraction"),
relax = TRUE, standardize = FALSE)
rra <- rbind(rra,t(coef(cvfitra,s="lambda.min")))
rra[i,2] <- rra[i,2] * scalers[1]
rra[i,3] <- rra[i,3] * scalers[2]
rra[i,4] <- rra[i,4] * scalers[3]
rra[i,5] <- rra[i,5] * scalers[4]
rra[i,6] <- rra[i,6] * scalers[5]
rra[i,7] <- rra[i,7] * scalers[6]
# then, try lasso
cvfitla <- cv.glmnet(as.matrix(cars2[,1:6]), cars2$price, family="gaussian",
alpha = 1, nfolds = 10, alignment = c("lambda", "fraction"),
relax = TRUE, standardize = FALSE)
laa <- rbind(laa,t(coef(cvfitla,s="lambda.min")))
laa[i,2] <- laa[i,2] * scalers[1]
laa[i,3] <- laa[i,3] * scalers[2]
laa[i,4] <- laa[i,4] * scalers[3]
laa[i,5] <- laa[i,5] * scalers[4]
laa[i,6] <- laa[i,6] * scalers[5]
laa[i,7] <- laa[i,7] * scalers[6]
# now, try elastic net
cvfitea <- cv.glmnet(as.matrix(cars2[,1:6]), cars2$price, family="gaussian",
alpha = 0.5, nfolds = 10, alignment = c("lambda", "fraction"),
relax = TRUE, standardize = FALSE)
ela <- rbind(ela,t(coef(cvfitea,s="lambda.min")))
ela[i,2] <- ela[i,2] * scalers[1]
ela[i,3] <- ela[i,3] * scalers[2]
ela[i,4] <- ela[i,4] * scalers[3]
ela[i,5] <- ela[i,5] * scalers[4]
ela[i,6] <- ela[i,6] * scalers[5]
ela[i,7] <- ela[i,7] * scalers[6]
}
# C - calculate the range of estimates so that we can compare them
# first we capture the mean and standard deviation for ridge regression
# we do not divide by the square root of 1,000 because we are not calculating
# a confidence interval. we are calculating how much it can vary in a given
# simulation
for (i in 2:7){
print(round(mean(rrs[,i]),0))
print(round(var(rrs[,i])**0.5,0))
}
for (i in 2:7){
print(round(mean(rra[,i]),0))
print(round(var(rra[,i])**0.5,0))
}
# next is lasso
for (i in 2:7){
print(round(mean(las[,i]),0))
print(round(var(las[,i])**0.5,0))
}
for (i in 2:7){
print(round(mean(laa[,i]),0))
print(round(var(laa[,i])**0.5,0))
}
# and last is elastic net
for (i in 2:7){
print(round(mean(els[,i]),0))
print(round(var(els[,i])**0.5,0))
}
for (i in 2:7){
print(round(mean(ela[,i]),0))
print(round(var(ela[,i])**0.5,0))
}