Introduction
Machine learning methods allow us to automate much of our modeling work. One common example is using Elastic Net, Ridge Regression, and Lasso for variable selection and shrinkage [Zou]. In this article, we suggest two adjustments that can improve the results from these methods. We create simple illustrative data and models to show how both of these adjustments would work. The R code for these simulations is in the appendices.
1. Historical background
Ridge regression was invented by Arthur Hoerl and Robert Kennard, who were both working at DuPont. They found this method very helpful in dealing with correlated predictors, and it was tractable with the limited computing power they had at the time. Ridge regression is performed by adding some multiple of the identity matrix to X’X when performing ordinary least squares regression. [Hoerl] gives an interesting history of ridge regression.
Ridge regression left all of the predictors in the model. To address this, Robert Tibshirani invented lasso regression. Lasso regression “enjoys some of the favourable properties of both subset selection and ridge regression,” by removing some variables and then “shrinking” the parameters for the remaining variables [Tibshirani].
Later, Hui Zou and Trevor Hastie blended these methods to create the elastic net, which allows the user more flexibility [Zou]. We will propose an approach that improves all three methods.
2. Selecting the base class for categorical variables
When we have categorical variables, the choice of which class we make the base class does not matter without shrinkage. When we use shrinkage, the function we are minimizing includes the parameters themselves, so our choice of base class will influence our estimates. We show this with a simple example. Consider this simple dataset:
| Car color | Rental Car Daily Insurance Premium |
|---|---|
| White | $10 |
| Black | $15 |
| Red | $20 |
We only have three degrees of freedom, so we can either fit a linear model with no intercept, Y = ∑ βcolor·1color, or we can fit a model with an intercept and terms for only two of the three colors, Y = β0 + ∑ βcolor·1color. If we use linear regression, these options give the same predictions because the two parameterizations are equivalent.
With shrinkage, this is more complicated, as we are trying to minimize a combination of the prediction errors and the size of the coefficients. With elastic net [Hastie], we are trying to minimize:
\min_{\beta_0,\beta}\ \frac{1}{N}\sum_{i=1}^{N} w_i\, l\left(y_i,\beta_0+\beta^T x_i\right)+\lambda\left[(1-\alpha)\|\beta\|_2^2/2+\alpha\|\beta\|_1\right]
Per their article, “l(yi, ηi) is the negative log-likelihood contribution for observation i; e.g., for the Gaussian case it is ½(yi − ηi)². The elastic net penalty is controlled by α and bridges the gap between lasso regression (α = 1, the default) and ridge regression (α = 0). The tuning parameter λ controls the overall strength of the penalty.”
If we consider the model with no intercept, and assume the weights are all 1, λ is 0.2, and α is 0, we are choosing the β's to minimize
0.2[(βwhite − 10)²/2 + (βblack − 15)²/2 + (βred − 20)²/2] + 0.2[βwhite²/2 + βblack²/2 + βred²/2].
For βwhite, we are minimizing 0.1(βwhite − 10)² + 0.1βwhite², i.e., 0.2βwhite² − 2βwhite + 10. The derivative of this is 0.4βwhite − 2, so this is minimized when βwhite = 5.
For βblack, we are minimizing 0.1(βblack − 15)² + 0.1βblack², i.e., 0.2βblack² − 3βblack + 22.5. The derivative of this is 0.4βblack − 3, so this is minimized when βblack = 7.5.
For βred, we are minimizing 0.1(βred − 20)² + 0.1βred², i.e., 0.2βred² − 4βred + 40. The derivative of this is 0.4βred − 4, so this is minimized when βred = 10.
In each case, our β is half of what we get with no shrinkage.
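As a quick check, we can minimize this illustrative objective numerically in R (this is a sketch of the objective written above, not a call to glmnet):
# numerically minimize 0.2*sum((beta - y)^2/2) + 0.2*sum(beta^2/2)
y <- c(10, 15, 20)
obj <- function(beta) 0.2 * sum((beta - y)^2 / 2) + 0.2 * sum(beta^2 / 2)
round(optim(c(0, 0, 0), obj, method = "BFGS")$par, 3)   # 5.0, 7.5, 10.0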
If we include an intercept and make white our base class, we will get a better result. Here, we are minimizing
0.2[(β0 − 10)²/2 + (β0 + βblack − 15)²/2 + (β0 + βred − 20)²/2] + 0.2[β0²/2 + βblack²/2 + βred²/2].
The derivative with respect to β0 is 0.2[(β0 − 10) + (β0 + βblack − 15) + (β0 + βred − 20)] + 0.2β0 = 0.2[4β0 + βblack + βred − 45].
The derivative with respect to βblack is 0.2(β0 + βblack − 15) + 0.2βblack = 0.2[β0 + 2βblack − 15].
The derivative with respect to βred is 0.2(β0 + βred − 20) + 0.2βred = 0.2[β0 + 2βred − 20].
These derivatives are all 0 when β0 = 9.167, βblack = 2.917, and βred = 5.417.
Making black the base class, we are minimizing
0.2[(β0 + βwhite − 10)²/2 + (β0 − 15)²/2 + (β0 + βred − 20)²/2] + 0.2[β0²/2 + βwhite²/2 + βred²/2].
The derivative with respect to β0 is 0.2[(β0 + βwhite − 10) + (β0 − 15) + (β0 + βred − 20)] + 0.2β0 = 0.2[4β0 + βwhite + βred − 45].
The derivative with respect to βwhite is 0.2(β0 + βwhite − 10) + 0.2βwhite = 0.2[β0 + 2βwhite − 10].
The derivative with respect to βred is 0.2(β0 + βred − 20) + 0.2βred = 0.2[β0 + 2βred − 20].
These derivatives are all 0 when β0 = 10, βwhite = 0, and βred = 5.
Making red the base class, we are minimizing
0.2[(β0 + βwhite − 10)²/2 + (β0 + βblack − 15)²/2 + (β0 − 20)²/2] + 0.2[β0²/2 + βwhite²/2 + βblack²/2].
The derivative with respect to β0 is 0.2[(β0 + βwhite − 10) + (β0 + βblack − 15) + (β0 − 20)] + 0.2β0 = 0.2[4β0 + βwhite + βblack − 45].
The derivative with respect to βwhite is 0.2(β0 + βwhite − 10) + 0.2βwhite = 0.2[β0 + 2βwhite − 10].
The derivative with respect to βblack is 0.2(β0 + βblack − 15) + 0.2βblack = 0.2[β0 + 2βblack − 15].
These derivatives are all 0 when β0 = 10.833, βwhite = −0.417, and βblack = 2.083.
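The three sets of first-order conditions above are linear in the coefficients, so we can solve them directly in R and reproduce the values quoted (a sketch of the hand calculation; solve_case is a helper we define only for this illustration):
# each base-class case gives the system
#   4*b0 + b_a + b_b = sum of the three premiums
#     b0 + 2*b_a     = premium for the first non-base color
#     b0 + 2*b_b     = premium for the second non-base color
# the result is returned in the order (b0, b_a, b_b)
solve_case <- function(y_base, y_a, y_b) {
  A <- rbind(c(4, 1, 1),
             c(1, 2, 0),
             c(1, 0, 2))
  solve(A, c(y_base + y_a + y_b, y_a, y_b))
}
round(solve_case(10, 15, 20), 3)   # white base:  9.167  2.917  5.417
round(solve_case(15, 10, 20), 3)   # black base: 10.000  0.000  5.000
round(solve_case(20, 10, 15), 3)   # red base:   10.833 -0.417  2.083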
To summarize, we found the following four cases:
| Case | β0 (no shrinkage) | βwhite (no shrinkage) | βblack (no shrinkage) | βred (no shrinkage) | β0 (shrinkage) | βwhite (shrinkage) | βblack (shrinkage) | βred (shrinkage) | Pred. White | Pred. Black | Pred. Red |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Actual Values | | | | | | | | | 10 | 15 | 20 |
| No intercept | . | 10 | 15 | 20 | . | 5 | 7.5 | 10 | 5 | 7.5 | 10 |
| White is base class | 10 | . | 5 | 10 | 9.167 | . | 2.917 | 5.417 | 9.167 | 12.083 | 14.583 |
| Black is base class | 15 | -5 | . | 5 | 10 | 0 | . | 5 | 10 | 10 | 15 |
| Red is base class | 20 | -10 | -5 | . | 10.833 | -0.417 | 2.083 | . | 10.417 | 12.917 | 10.833 |
Not using an intercept gave us the worst results (the largest mean squared error of our predictions) because this case requires all three coefficients to be large. In the other three cases, a large intercept brings all three estimates close to the actual values, and the two remaining parameters can then be smaller than they are in the no-intercept case. The penalty term is therefore smaller, and we see less shrinkage when the intercept is in the model. (This is true in our example because all three observations have the same sign.) We get the best results when white is the base class, because the pre-shrinkage parameters (including the intercept, which our formulation also penalizes) are smaller, after being squared, than in the other two cases.
The above is illustrative, but it doesn’t reflect how these models are usually fit. glmnet (a popular R package) first normalizes the variables and then applies shrinkage. We ran glmnet on these four cases with lasso, ridge, and a blend of the two. Note that because the variables were normalized, these results don’t match the ones above. The code for this is in Appendix 1.
| Case | β0 | βwhite | βblack | βred | Pred. White | Pred. Black | Pred. Red | MSE |
|---|---|---|---|---|---|---|---|---|
| Ridge – red base class | 18.36 | -7.18 | -2.88 | . | 11.17 | 15.47 | 18.36 | 4.28 |
| Ridge – black base class | 15.00 | -4.30 | . | 4.30 | 10.70 | 15.00 | 19.30 | 0.98 |
| Ridge – white base class | 11.64 | . | 2.88 | 7.18 | 11.64 | 14.53 | 18.83 | 4.28 |
| Ridge – no intercept | . | 9.59 | 14.38 | 19.18 | 9.59 | 14.38 | 19.18 | 1.22 |
| Lasso – red base class | 17.17 | -5.76 | -0.76 | . | 11.41 | 16.41 | 17.17 | 11.99 |
| Lasso – black base class | 15.00 | -3.59 | . | 3.59 | 11.41 | 15.00 | 18.59 | 3.98 |
| Lasso – white base class | 12.83 | . | 0.76 | 5.76 | 12.83 | 13.59 | 18.59 | 11.99 |
| Lasso – no intercept | . | 8.59 | 13.59 | 18.59 | 8.59 | 13.59 | 18.59 | 5.96 |
| Elastic net – red base class | 17.88 | -6.63 | -2.01 | . | 11.25 | 15.81 | 17.88 | 6.71 |
| Elastic net – black base class | 15.00 | -3.97 | . | 3.97 | 11.03 | 15.00 | 18.97 | 2.12 |
| Elastic net – white base class | 12.12 | . | 2.01 | 6.63 | 12.12 | 14.13 | 18.75 | 6.81 |
| Elastic net – no intercept | . | 9.10 | 13.99 | 18.89 | 9.10 | 13.99 | 18.89 | 3.06 |
In these cases, we see the best results (in terms of mean squared error) with black as the base class. In Appendix 2, we show that, even after normalizing the variables, the pre-shrinkage coefficients are smallest when black is the base class.
When categorical variables have the same number of observations in each category, normalizing their respective indicator variables scales them all by the same amount; so, after normalizing, glmnet will still give the best results when the base class is a category with an effect in the middle. We suspect this would be the median effect. If some categories are more common than others, the indicators will be scaled differently, and it is less clear which category we will want as the base class. This would be an interesting topic for further study.
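As a small illustration of the equal-counts claim (our own sketch, not part of the original simulation), the three 0/1 color indicators in our example each have the same standard deviation, so glmnet's internal standardization rescales them identically:
# one row per color, as in the premium example above
x <- diag(3)
apply(x, 2, sd)   # 0.577 for every indicator, i.e. 1/sqrt(3)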
3. Scaling Predictors
With elastic net (and its special cases, ridge regression and lasso), we might have two predictors with the same correlation with our target. The one with the larger variance (and therefore a smaller coefficient in any model) will see a smaller change from shrinkage, while the one with the smaller variance (and a larger coefficient) will see more shrinkage. The common approach to avoiding this is to standardize all variables so that each has a variance of 1.0. This section proposes an alternative method.
We will illustrate our proposal with another simple example. We simulate a set of 200 cars: 100 sedans and 100 SUVs. Each car is sold at one of six dealers. The price of each car is $30,000 + $8,000 · SUV + $10,000 · ε, where ε is N(0,1). We fit a linear model to predict price using SUV and dealer. For our base case, we use the glmnet package in R with the standardize option.
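Below is a compact sketch of one such simulated dataset and a standardized glmnet fit (the seed and the uniform dealer assignment here are illustrative only; Appendix 3 assigns dealers by binning a normal draw):
# simulate 200 cars and fit an elastic net with glmnet's built-in standardization
library(glmnet)
set.seed(123)                                         # illustrative seed
suv    <- rep(0:1, each = 100)
dealer <- factor(sample(paste0("dealer", 1:6), 200, replace = TRUE))
price  <- 30000 + 8000 * suv + 10000 * rnorm(200)
cars   <- data.frame(price, suv, dealer)
X   <- model.matrix(~ suv + dealer, cars)[, -1]       # drop the intercept column
fit <- glmnet(X, cars$price, alpha = 0.5, standardize = TRUE)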
Our proposal is to first fit a separate linear regression model for each predictor variable. The standard deviation (standard error) of each parameter estimate from these one-predictor regressions gives us a sense of how confident we are in that parameter. For each xi, we multiply all values of xi by a_i = \left(\sum_{j=1}^{p} sd(\beta_j)\right) / \left(p \cdot sd(\beta_i)\right), where p is the number of parameters in the model. After running the elastic net with the adjusted predictors, we multiply each fitted β by the corresponding a_i to get parameter estimates for the original predictors.
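The following is a minimal sketch of this adjustment for a generic numeric predictor matrix X and response vector y (the function name adjust_predictors and the use of sweep are our own illustration; the full simulation code is in Appendix 3):
# fit a one-predictor regression on each column, take the standard error of its
# slope, and scale each column by a_i = mean(se) / se_i
adjust_predictors <- function(X, y) {
  p  <- ncol(X)
  se <- sapply(seq_len(p), function(i) summary(lm(y ~ X[, i]))$coefficients[2, 2])
  a  <- mean(se) / se
  list(X_adj = sweep(X, 2, a, `*`), a = a)
}
# after fitting glmnet on X_adj, multiply each fitted coefficient by the matching
# a_i to recover estimates on the original scale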
3.1. Intuition
In our SUV example above, we might find that sd(βj) is around 1,300 for the SUV indicator and around 2,000 for each dealer indicator. Thus, aSUV would be around 1.45 and each adealer would be around 0.94. Multiplying our indicators by these factors makes the SUV indicator larger (and βSUV smaller) and the dealer indicators smaller (and each βdealer larger).
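Plugging these illustrative standard errors into the formula gives roughly those factors:
# illustrative standard errors from the paragraph above (not simulated values)
se <- c(suv = 1300, dealer1 = 2000, dealer2 = 2000, dealer3 = 2000,
        dealer4 = 2000, dealer5 = 2000)
round(mean(se) / se, 2)   # suv: 1.45, each dealer: 0.94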
Now, elastic net is trying to minimize the following function:
\min _{\beta_0, \beta} \frac{1}{N} \sum_{i=1}^N w_i l\left(y_i, \beta_0+\beta^T x_i\right)+\lambda\left[(1-\alpha)\|\beta\|_2^2 / 2+\alpha\|\beta\|_1\right]
Since our adjustment has made βSUV smaller, it contributes less to the penalty term, and the elastic net will shrink it less. Similarly, our adjustment has made each βdealer larger, so it contributes more to the penalty term and will be shrunk more.
3.2. Multiple Simulations with fixed λ
We repeated the above simulation 10,000 times and saved all of the parameters. Below we show the mean and empirical standard deviation of each parameter across our simulations for the same six scenarios that we considered above. We use a smaller λ with the adjusted predictors because the adjustment reduced the estimates for the dealer variables.
| Method | Parameter | Standardized predictors (λ = 100) | Adjusted predictors (λ = 50) |
|---|---|---|---|
| Ridge regression | βsuv | 7920 +/- 1409 | 7920 +/- 1410 |
| | βdealer1 | 21 +/- 3056 | 14 +/- 2518 |
| | βdealer2 | 24 +/- 2767 | 17 +/- 2282 |
| | βdealer3 | 60 +/- 2979 | 52 +/- 2476 |
| | βdealer4 | -7 +/- 3094 | -14 +/- 2540 |
| | βdealer5 | 44 +/- 3440 | 33 +/- 2668 |
| Lasso | βsuv | 7792 +/- 1415 | 7851 +/- 1417 |
| | βdealer1 | 9 +/- 1973 | -1 +/- 1726 |
| | βdealer2 | 10 +/- 1642 | 3 +/- 1536 |
| | βdealer3 | 41 +/- 1886 | 33 +/- 1709 |
| | βdealer4 | -21 +/- 2022 | -27 +/- 1750 |
| | βdealer5 | 28 +/- 2418 | 15 +/- 1854 |
| Elastic net (α = 0.5) | βsuv | 7856 +/- 1412 | 7885 +/- 1413 |
| | βdealer1 | 15 +/- 2405 | 8 +/- 2046 |
| | βdealer2 | 17 +/- 2095 | 9 +/- 1827 |
| | βdealer3 | 51 +/- 2324 | 42 +/- 2014 |
| | βdealer4 | -14 +/- 2447 | -21 +/- 2068 |
| | βdealer5 | 36 +/- 2817 | 24 +/- 2190 |
We see here that our adjusted approach is closer to the actual value of 8,000 for βsuv for elastic net and lasso, and just as close for ridge regression. We also see that the standard deviations for the nuisance variables are much smaller with the proposed approach. The code for this simulation is in Appendix 3.
3.3. Multiple Simulations with λ chosen by cross-validation
We repeated the above simulation with the cross-validation option to select the value of λ. We did this 1,000 times because it was slower, and we saved all of the parameters. Below we show the mean and empirical standard deviation from our simulations for each parameter for the same six scenarios that we considered above.
| Method | Parameter | Standardized predictors | Adjusted predictors |
|---|---|---|---|
| Ridge regression | βsuv | 6575 +/- 1617 | 6600 +/- 1500 |
| | βdealer1 | 84 +/- 2318 | 35 +/- 1656 |
| | βdealer2 | 43 +/- 1989 | 21 +/- 1595 |
| | βdealer3 | 30 +/- 2233 | 10 +/- 1689 |
| | βdealer4 | -50 +/- 2275 | -53 +/- 1603 |
| | βdealer5 | 90 +/- 2655 | 40 +/- 1715 |
| Lasso | βsuv | 7906 +/- 1427 | 7927 +/- 1420 |
| | βdealer1 | 42 +/- 1511 | -25 +/- 1437 |
| | βdealer2 | 31 +/- 1231 | -42 +/- 1346 |
| | βdealer3 | 7 +/- 1497 | -73 +/- 1426 |
| | βdealer4 | -30 +/- 1443 | -56 +/- 1327 |
| | βdealer5 | 42 +/- 1810 | -10 +/- 1542 |
| Elastic net (α = 0.1) | βsuv | 6578 +/- 1619 | 7902 +/- 1430 |
| | βdealer1 | 84 +/- 2317 | 7 +/- 1441 |
| | βdealer2 | 43 +/- 1988 | -16 +/- 1368 |
| | βdealer3 | 30 +/- 2232 | -37 +/- 1489 |
| | βdealer4 | -50 +/- 2273 | -57 +/- 1388 |
| | βdealer5 | 90 +/- 2654 | 28 +/- 1494 |
We see here that our adjusted approach is closer to the actual value of 8,000 for βsuv, with a slightly smaller standard deviation, though this could be noise. We also see that the standard deviations for the nuisance variables are much smaller with the proposed approach. This code is in Appendix 4.
4. Conclusion
In section 2, we showed that the choice of the base class can improve our estimates. In section 3, our simulations show that the proposed scaling approach provides better estimates; it is also supported by the intuition we shared in section 3.1. These benefits come at a minimal cost: fitting p simple models, each with one predictor. In light of this, we recommend the reader try these methods at home.
Appendices
Appendix 1 – Code for the base case example
# load necessary packages
library(data.table)
library(glmnet)
# create a dataset with indicators for each car color and a variable
# for insurance premium. We are using indicators for each car color
# because glmnet does not support categorical variables
dt <- rbind(c(1,0,0,10),
c(0,1,0,15),
c(0,0,1,20))
dt <- as.data.table(dt)
setnames(dt,names(dt),c('white','black','red','premium'))
# if we fit a linear regression, we will get coefficients that provide
# exact predictions, independent of our selection of the base class.
# we see this with the next lines of code.
lm(premium ~ white + black, data=dt)
lm(premium ~ black + red, data=dt)
# now, we will fit a number of models with elastic net and see how
# the coefficients and predictions vary
# glmnet expects a numeric matrix, so we wrap the predictors in as.matrix()
# first consider ridge regression with red as the base class
coef(glmnet(as.matrix(dt[,1:2]),dt$premium,alpha=0),s=1,
     exact=TRUE,x=as.matrix(dt[,1:2]),y=dt$premium)
# with black as the base class
coef(glmnet(as.matrix(dt[,c(1,3)]),dt$premium,alpha=0),s=1,
     exact=TRUE,x=as.matrix(dt[,c(1,3)]),y=dt$premium)
# with white as the base class
coef(glmnet(as.matrix(dt[,2:3]),dt$premium,alpha=0),s=1,
     exact=TRUE,x=as.matrix(dt[,2:3]),y=dt$premium)
# and without an intercept
coef(glmnet(as.matrix(dt[,1:3]),dt$premium,alpha=0,intercept=FALSE),s=1,
     exact=TRUE,x=as.matrix(dt[,1:3]),y=dt$premium)
# now consider lasso with the same four choices
# first red as the base class
coef(glmnet(as.matrix(dt[,1:2]),dt$premium,alpha=1),s=.1)
# with black as the base class
coef(glmnet(as.matrix(dt[,c(1,3)]),dt$premium,alpha=1),s=.1)
# then white as the base class
coef(glmnet(as.matrix(dt[,2:3]),dt$premium,alpha=1),s=.1)
# then with no intercept
coef(glmnet(as.matrix(dt[,1:3]),dt$premium,alpha=1,intercept=FALSE),s=.1)
# now consider an elastic net that blends ridge and lasso
# first red as the base class
coef(glmnet(as.matrix(dt[,1:2]),dt$premium,alpha=0.5),s=.1)
# with black as the base class
coef(glmnet(as.matrix(dt[,c(1,3)]),dt$premium,alpha=0.5),s=.1)
# then white as the base class
coef(glmnet(as.matrix(dt[,2:3]),dt$premium,alpha=0.5),s=.1)
# then with no intercept
coef(glmnet(as.matrix(dt[,1:3]),dt$premium,alpha=0.5,intercept=FALSE),s=.1)
Appendix 2 – Normalizing variables
We are considering the dataset:
| Car color | Rental Car Daily Insurance Premium |
|---|---|
| White | $10 |
| Black | $15 |
| Red | $20 |
If we use indicators that are 0 or 1 for each color, we can express this as
\left\lbrack \begin{array}{r} 10 \\ 15 \\ 20 \end{array} \right\rbrack = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 1 \end{bmatrix}\left\lbrack \begin{array}{r} \beta_{0} \\ \beta_{white} \\ \beta_{red} \end{array} \right\rbrack
and without shrinkage, our estimate would be β0=15, βwhite=-5, and βred=5.
Alternatively, we can normalize each variable so that the color indicators are \frac{- 1}{\sqrt{3}} when a car is not that color and \frac{2}{\sqrt{3}} when a car is that color. We can also normalize the premiums to be -1, 0, and 1. This gives us:
\left\lbrack \begin{array}{r} - 1 \\ 0 \\ 1 \end{array} \right\rbrack = \begin{bmatrix} 1 & \frac{2}{\sqrt{3}} & \frac{- 1}{\sqrt{3}} \\ 1 & \frac{- 1}{\sqrt{3}} & \frac{- 1}{\sqrt{3}} \\ 1 & \frac{- 1}{\sqrt{3}} & \frac{2}{\sqrt{3}} \end{bmatrix}\left\lbrack \begin{array}{r} \beta_{0} \\ \beta_{white} \\ \beta_{red} \end{array} \right\rbrack
In this case, our estimates without shrinkage will be β0=0, βwhite=\frac{- 1}{\sqrt{3}}, and βred=\frac{1}{\sqrt{3}}.
Alternatively, we could have made white or red the base case. With the normalized variables, we would have had either:
\left\lbrack \begin{array}{r} - 1 \\ 0 \\ 1 \end{array} \right\rbrack = \begin{bmatrix} 1 & \frac{- 1}{\sqrt{3}} & \frac{- 1}{\sqrt{3}} \\ 1 & \frac{2}{\sqrt{3}} & \frac{- 1}{\sqrt{3}} \\ 1 & \frac{- 1}{\sqrt{3}} & \frac{2}{\sqrt{3}} \end{bmatrix}\left\lbrack \begin{array}{r} \beta_{0} \\ \beta_{black} \\ \beta_{red} \end{array} \right\rbrack or \left\lbrack \begin{array}{r} - 1 \\ 0 \\ 1 \end{array} \right\rbrack = \begin{bmatrix} 1 & \frac{2}{\sqrt{3}} & \frac{- 1}{\sqrt{3}} \\ 1 & \frac{- 1}{\sqrt{3}} & \frac{2}{\sqrt{3}} \\ 1 & \frac{- 1}{\sqrt{3}} & \frac{- 1}{\sqrt{3}} \end{bmatrix}\left\lbrack \begin{array}{r} \beta_{0} \\ \beta_{white} \\ \beta_{black} \end{array} \right\rbrack
In these cases, our estimates without shrinkage would be β0=0, βblack=\frac{1}{\sqrt{3}}, and βred=\frac{2}{\sqrt{3}}, or β0=0, βwhite=\frac{- 2}{\sqrt{3}}, and βblack=\frac{- 1}{\sqrt{3}}, respectively. Thus, we see in this example that normalizing the indicator variables only scales the parameter estimates, and we still see the smallest estimates when black (the middle value) is the base class.
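As a quick check of these normalized values (our own sketch), R's scale() function reproduces them:
# normalizing a 0/1 color indicator and the premiums
round(scale(c(1, 0, 0)), 3)     #  1.155 -0.577 -0.577, i.e. 2/sqrt(3) and -1/sqrt(3)
round(scale(c(10, 15, 20)), 3)  # -1 0 1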
Appendix 3 – Code for fixed λ
# This program tests a possible adjustment to elastic net.
# We will simulate data with one real predictor (an SUV
# indicator) and several nuisance predictors (dealer
# indicators). We will then evaluate elastic net with the
# predictors standardized and with our proposed adjustment.
# This program repeats the simulation 10,000 times with a
# fixed lambda so that we can compare the distribution of the
# fitted coefficients under each approach.
# for illustrative purposes, we will use a simple linear
# model. our target will be car prices. they will be
# based on whether the car is an SUV, and we will add the
# dealer as the nuisance variable
# A - load libraries ----------------------------
library("data.table")
library("glmnet")
# B - set parameters ----------------------------
# set seed once
set.seed(12345)
# set the positions in the lambda grid (defined below) that we will read off:
# grid[s_value] = 100 for the standardized fits and grid[a_value] is
# approximately 50 for the adjusted fits
a_value = 64
s_value = 61
# the grid of lambda values that glmnet will fit
grid = 10^seq(8, -1, length = 91)
# B - simulate a dataset -------------------------
# we will simulate a dataset with car type, price, and dealer
# we will then fit models to predict the car price
# we will capture the coefficients over 10,000 simulations
# and see which method gives the closest coefficients
# B1 - set up arrays to capture values -----------
rrs <- NULL # ridge regression with standardized option
las <- NULL # lasso with standardized option
els <- NULL # elastic net with standardized option
rra <- NULL # ridge regression with adjustment and not standardized option
laa <- NULL # lasso with adjustment and not standardized option
ela <- NULL # elastic net with adjustment and not standardized option
# run simulations for each option - we will compare them below
for (i in 1:10000){
# B2 - simulate a dataset ------------------
# create raw variables
# make half the cars sedans (0) and half suvs (1).
suv <- as.data.table(c(rep(0,100),rep(1,100)))
# assign each car to one of six dealers
dealer <- as.data.table(rnorm(200))
dealer1 <- as.data.table(as.numeric(dealer < -1))
dealer2 <- as.data.table(as.numeric(dealer < 0)) -
dealer1
dealer3 <- as.data.table(as.numeric(dealer < 0.5)) -
dealer1 - dealer2
dealer4 <- as.data.table(as.numeric(dealer < 1)) -
dealer1 - dealer2 - dealer3
dealer5 <- as.data.table(as.numeric(dealer < 1.5)) -
dealer1 - dealer2 - dealer3 - dealer4
dealer6 <- as.data.table(as.numeric(dealer >= 1.5))
# put them in a table
noise <- as.data.table(rnorm(200))
cars <- cbind(suv,dealer1,dealer2,dealer3,dealer4,
dealer5,dealer6,noise)
setnames(cars,c("suv","dealer1","dealer2","dealer3",
"dealer4","dealer5","dealer6","noise"))
# and create the target variable
cars[,price := 30000+8000*suv+10000*noise]
# B3 - use elastic net to fit models with standardized predictors --------------
# first try ridge regression
mrr <- glmnet(as.matrix(cars[,1:6]),cars$price,family="gaussian",
alpha=0,lambda=grid,standardize = TRUE,
type.gaussian="covariance",
type.measure="class",
type.multinomial="ungrouped")
rrs <- rbind(rrs,coef(mrr)[,s_value])
# then, try lasso
mla <- glmnet(as.matrix(cars[,1:6]),cars$price,family="gaussian",
alpha=1,lambda=grid,standardize = TRUE,
type.gaussian="covariance",
type.measure="class",
type.multinomial="ungrouped")
las <- rbind(las,coef(mla)[,s_value])
# now, try elastic net
mel <- glmnet(as.matrix(cars[,1:6]),cars$price,family="gaussian",
alpha=0.5,lambda=grid,standardize = TRUE,
type.gaussian="covariance",
type.measure="class",
type.multinomial="ungrouped")
els <- rbind(els,coef(mel)[,s_value])
# B4 - proposed method - adjust predictors -------------------
# scale the predictors
# find the standard deviation for each predictor's estimate
scalers <- rep(0,6)
scalers[1] <- summary(lm(price ~ suv,data = cars))$coefficients[2,2]
scalers[2] <- summary(lm(price ~ dealer1,data = cars))$coefficients[2,2]
scalers[3] <- summary(lm(price ~ dealer2,data = cars))$coefficients[2,2]
scalers[4] <- summary(lm(price ~ dealer3,data = cars))$coefficients[2,2]
scalers[5] <- summary(lm(price ~ dealer4,data = cars))$coefficients[2,2]
scalers[6] <- summary(lm(price ~ dealer5,data = cars))$coefficients[2,2]
# scale them so that a_i = mean(se) / se_i
scalers <- mean(scalers) / scalers
# multiply each predictor by its scaling factor
# (use copy() so that we do not modify cars by reference)
cars2 <- copy(cars)
cars2[,suv := suv * scalers[1]]
cars2[,dealer1 := dealer1 * scalers[2]]
cars2[,dealer2 := dealer2 * scalers[3]]
cars2[,dealer3 := dealer3 * scalers[4]]
cars2[,dealer4 := dealer4 * scalers[5]]
cars2[,dealer5 := dealer5 * scalers[6]]
# first, try ridge regression on the new data
mrr2 <- glmnet(as.matrix(cars2[,1:6]),cars2$price,family="gaussian",
alpha=0,lambda=grid,standardize = FALSE,
type.gaussian="covariance",
type.measure="class",
type.multinomial="ungrouped")
coefs <- coef(mrr2)[,a_value]
coefs[2] <- coefs[2] * scalers[1]
coefs[3] <- coefs[3] * scalers[2]
coefs[4] <- coefs[4] * scalers[3]
coefs[5] <- coefs[5] * scalers[4]
coefs[6] <- coefs[6] * scalers[5]
coefs[7] <- coefs[7] * scalers[6]
rra <- rbind(rra,coefs)
# then, try lasso
mla2 <- glmnet(as.matrix(cars2[,1:6]),cars2$price,family="gaussian",
alpha=1,lambda=grid,standardize = FALSE,
type.gaussian="covariance",
type.measure="class",
type.multinomial="ungrouped")
coefs <- coef(mla2)[,a_value]
coefs[2] <- coefs[2] * scalers[1]
coefs[3] <- coefs[3] * scalers[2]
coefs[4] <- coefs[4] * scalers[3]
coefs[5] <- coefs[5] * scalers[4]
coefs[6] <- coefs[6] * scalers[5]
coefs[7] <- coefs[7] * scalers[6]
laa <- rbind(laa,coefs)
# now, try elastic net
mel2 <- glmnet(as.matrix(cars2[,1:6]),cars2$price,family="gaussian",
alpha=0.5,lambda=grid,standardize = FALSE,
type.gaussian="covariance",
type.measure="class",
type.multinomial="ungrouped")
coefs <- coef(mel2)[,a_value]
coefs[2] <- coefs[2] * scalers[1]
coefs[3] <- coefs[3] * scalers[2]
coefs[4] <- coefs[4] * scalers[3]
coefs[5] <- coefs[5] * scalers[4]
coefs[6] <- coefs[6] * scalers[5]
coefs[7] <- coefs[7] * scalers[6]
ela <- rbind(ela,coefs)
}
# C - calculate the range of estimates so that we can compare them
# first we capture the mean and standard deviation for ridge regression
# we do not divide by the square root of 10,000 because we are not calculating
# a confidence interval. we are calculating how much it can vary in a given
# simulation
for (i in 2:7){
print(round(mean(rrs[,i]),0))
print(round(var(rrs[,i])**0.5,0))
}
for (i in 2:7){
print(round(mean(rra[,i]),0))
print(round(var(rra[,i])**0.5,0))
}
# next is lasso
for (i in 2:7){
print(round(mean(las[,i]),0))
print(round(var(las[,i])**0.5,0))
}
for (i in 2:7){
print(round(mean(laa[,i]),0))
print(round(var(laa[,i])**0.5,0))
}
# and last is elastic net
for (i in 2:7){
print(round(mean(els[,i]),0))
print(round(var(els[,i])**0.5,0))
}
for (i in 2:7){
print(round(mean(ela[,i]),0))
print(round(var(ela[,i])**0.5,0))
}
Appendix 4 – Code for λ chosen by cross-validation
# This program tests a possible adjustment to elastic net.
# We will simulate data with one real predictor (an SUV
# indicator) and several nuisance predictors (dealer
# indicators). We will then evaluate elastic net with the
# predictors standardized and with our proposed adjustment.
# This program repeats the simulation 1,000 times, with the
# lambda for each fit chosen by cross-validation, so that we
# can compare the distribution of the fitted coefficients
# under each approach.
# for illustrative purposes, we will use a simple linear
# model. our target will be car prices. they will be
# based on whether the car is an SUV, and we will add the
# dealer as the nuisance variable
# A - load libraries ----------------------------
library("data.table")
library("glmnet")
# B - set parameters ----------------------------
# set seed once
set.seed(12345)
# B - simulate a dataset -------------------------
# we will simulate a dataset with car type, price, and dealer
# we will then fit models to predict the car price
# we will capture the coefficients over 1,000 simulations
# and see which method gives the closest coefficients
# B1 - set up arrays to capture values -----------
rrs <- NULL # ridge regression with standardized option
las <- NULL # lasso with standardized option
els <- NULL # elastic net with standardized option
rra <- NULL # ridge regression with adjustment and not standardized option
laa <- NULL # lasso with adjustment and not standardized option
ela <- NULL # elastic net with adjustment and not standardized option
# run simulations for each option - we will compare them below
for (i in 1:1000){
# B2 - simulate a dataset ------------------
# create raw variables
# make half the cars sedans (0) and half suvs (1).
suv <- as.data.table(c(rep(0,100),rep(1,100)))
# assign each car to one of six dealers
dealer <- as.data.table(rnorm(200))
dealer1 <- as.data.table(as.numeric(dealer < -1))
dealer2 <- as.data.table(as.numeric(dealer < 0)) -
dealer1
dealer3 <- as.data.table(as.numeric(dealer < 0.5)) -
dealer1 - dealer2
dealer4 <- as.data.table(as.numeric(dealer < 1)) -
dealer1 - dealer2 - dealer3
dealer5 <- as.data.table(as.numeric(dealer < 1.5)) -
dealer1 - dealer2 - dealer3 - dealer4
dealer6 <- as.data.table(as.numeric(dealer >= 1.5))
# put them in a table
noise <- as.data.table(rnorm(200))
cars <- cbind(suv,dealer1,dealer2,dealer3,dealer4,
dealer5,dealer6,noise)
setnames(cars,c("suv","dealer1","dealer2","dealer3",
"dealer4","dealer5","dealer6","noise"))
# and create the target variable
cars[,price := 30000+8000*suv+10000*noise]
# B3 - use elastic net to fit models with standardized predictors --------------
# first try ridge regression
cvfitrs <- cv.glmnet(as.matrix(cars[,1:6]), cars$price, family="gaussian",
alpha = 0, nfolds = 10, alignment = c("lambda", "fraction"),
relax = TRUE, standardize = TRUE)
rrs <- rbind(rrs,t(coef(cvfitrs,s="lambda.min")))
# then, try lasso
cvfitls <- cv.glmnet(as.matrix(cars[,1:6]), cars$price, family="gaussian",
alpha = 1, nfolds = 10, alignment = c("lambda", "fraction"),
relax = TRUE, standardize = TRUE)
las <- rbind(las,t(coef(cvfitls,s="lambda.min")))
# now, try elastic net
cvfites <- cv.glmnet(as.matrix(cars[,1:6]), cars$price, family="gaussian",
alpha = 0.5, nfolds = 10, alignment = c("lambda", "fraction"),
relax = TRUE, standardize = TRUE)
els <- rbind(els,t(coef(cvfites,s="lambda.min")))
# B4 - proposed method - adjust predictors -------------------
# scale the predictors
# find the standard deviation for each predictor's estimate
scalers <- rep(0,6)
scalers[1] <- summary(lm(price ~ suv,data = cars))$coefficients[2,2]
scalers[2] <- summary(lm(price ~ dealer1,data = cars))$coefficients[2,2]
scalers[3] <- summary(lm(price ~ dealer2,data = cars))$coefficients[2,2]
scalers[4] <- summary(lm(price ~ dealer3,data = cars))$coefficients[2,2]
scalers[5] <- summary(lm(price ~ dealer4,data = cars))$coefficients[2,2]
scalers[6] <- summary(lm(price ~ dealer5,data = cars))$coefficients[2,2]
# scale them so that a_i = mean(se) / se_i
scalers <- mean(scalers) / scalers
# multiply each predictor by its scaling factor
# (use copy() so that we do not modify cars by reference)
cars2 <- copy(cars)
cars2[,suv := suv * scalers[1]]
cars2[,dealer1 := dealer1 * scalers[2]]
cars2[,dealer2 := dealer2 * scalers[3]]
cars2[,dealer3 := dealer3 * scalers[4]]
cars2[,dealer4 := dealer4 * scalers[5]]
cars2[,dealer5 := dealer5 * scalers[6]]
# first, try ridge regression on the new data
cvfitra <- cv.glmnet(as.matrix(cars2[,1:6]), cars2$price, family="gaussian",
alpha = 0, nfolds = 10, alignment = c("lambda", "fraction"),
relax = TRUE, standardize = FALSE)
rra <- rbind(rra,t(coef(cvfitra,s="lambda.min")))
rra[i,2] <- rra[i,2] * scalers[1]
rra[i,3] <- rra[i,3] * scalers[2]
rra[i,4] <- rra[i,4] * scalers[3]
rra[i,5] <- rra[i,5] * scalers[4]
rra[i,6] <- rra[i,6] * scalers[5]
rra[i,7] <- rra[i,7] * scalers[6]
# then, try lasso
cvfitla <- cv.glmnet(as.matrix(cars2[,1:6]), cars2$price, family="gaussian",
alpha = 1, nfolds = 10, alignment = c("lambda", "fraction"),
relax = TRUE, standardize = FALSE)
laa <- rbind(laa,t(coef(cvfitla,s="lambda.min")))
laa[i,2] <- laa[i,2] * scalers[1]
laa[i,3] <- laa[i,3] * scalers[2]
laa[i,4] <- laa[i,4] * scalers[3]
laa[i,5] <- laa[i,5] * scalers[4]
laa[i,6] <- laa[i,6] * scalers[5]
laa[i,7] <- laa[i,7] * scalers[6]
# now, try elastic net
cvfitea <- cv.glmnet(as.matrix(cars2[,1:6]), cars2$price, family="gaussian",
alpha = 0.5, nfolds = 10, alignment = c("lambda", "fraction"),
relax = TRUE, standardize = FALSE)
ela <- rbind(ela,t(coef(cvfitea,s="lambda.min")))
ela[i,2] <- ela[i,2] * scalers[1]
ela[i,3] <- ela[i,3] * scalers[2]
ela[i,4] <- ela[i,4] * scalers[3]
ela[i,5] <- ela[i,5] * scalers[4]
ela[i,6] <- ela[i,6] * scalers[5]
ela[i,7] <- ela[i,7] * scalers[6]
}
# C - calculate the range of estimates so that we can compare them
# first we capture the mean and standard deviation for ridge regression
# we do not divide by the square root of 1,000 because we are not calculating
# a confidence interval. we are calculating how much it can vary in a given
# simulation
for (i in 2:7){
print(round(mean(rrs[,i]),0))
print(round(var(rrs[,i])**0.5,0))
}
for (i in 2:7){
print(round(mean(rra[,i]),0))
print(round(var(rra[,i])**0.5,0))
}
# next is lasso
for (i in 2:7){
print(round(mean(las[,i]),0))
print(round(var(las[,i])**0.5,0))
}
for (i in 2:7){
print(round(mean(laa[,i]),0))
print(round(var(laa[,i])**0.5,0))
}
# and last is elastic net
for (i in 2:7){
print(round(mean(els[,i]),0))
print(round(var(els[,i])**0.5,0))
}
for (i in 2:7){
print(round(mean(ela[,i]),0))
print(round(var(ela[,i])**0.5,0))
}