Aaron Meyer & Yosuke Tanigawa (based on slides from Rob Tibshirani)
Example
Predicting Drug Response
The Bias-Variance Tradeoff
Estimating \(\boldsymbol{\beta}\)
As usual, we assume the model: \[y=f(\mathbf{z})+\epsilon, \epsilon\sim N(0,\sigma^2I)\]
In regression analysis, our major goal is to come up with some good regression function \[\hat{f}(\mathbf{z}) = \mathbf{z}^\top \hat{\boldsymbol{\beta}}\]
So far, we’ve been dealing with \(\hat{\boldsymbol{\beta}}^{ls}\), or the least squares solution:
\(\hat{\boldsymbol{\beta}}^{ls}\) has well-known properties (e.g., Gauss-Markov, ML)
But can we do better?
Choosing a good regression function
Suppose we have an estimator \[\hat{f}(\mathbf{z}) = \mathbf{z}^\top \hat{\boldsymbol{\beta}}\]
To see if this is a good candidate, we can ask ourselves two questions:
Is \(\hat{\boldsymbol\beta}\) close to the true \(\boldsymbol{\beta}\)?
Will \(\hat{f}(\mathbf{z})\) fit future observations well?
These might have very different outcomes!!
Is \(\hat{\boldsymbol{\beta}}\) close to the true \(\boldsymbol{\beta}\)?
To answer this question, we might consider the mean squared error of our estimate \(\hat{\boldsymbol{\beta}}\):
i.e., consider squared distance of \(\hat{\boldsymbol{\beta}}\) to the true \(\boldsymbol{\beta}\): \[\textrm{MSE}(\hat{\boldsymbol{\beta}}) = \mathop{\mathbb{E}}\left[\lVert \hat{\boldsymbol{\beta}} - \boldsymbol{\beta} \rVert^2\right] = \mathop{\mathbb{E}}[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^\top (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})]\]
Example: In least squares (LS), we know that: \[\mathrm{MSE}(\hat{\boldsymbol{\beta}}^{ls}) = \mathop{\mathbb{E}}[(\hat{\boldsymbol{\beta}}^{ls} - \boldsymbol{\beta})^\top (\hat{\boldsymbol{\beta}}^{ls} - \boldsymbol{\beta})] = \sigma^2 \mathrm{tr}[(\mathbf{Z}^\top \mathbf{Z})^{-1}]\]
\(\sigma^2\): the magnitude of noise in the linear model
\((\mathbf{Z}^\top \mathbf{Z})^{-1}\): Sensitivity to the design matrix. Poorly conditioned or highly correlated predictors will lead to large MSE
\(\mathrm{tr}[(\mathbf{Z}^\top \mathbf{Z})^{-1}]\): Total variance across all coefficients
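As a sanity check, the MSE identity above can be verified by simulation. The design, true coefficients, and noise level below are arbitrary illustrative choices:

```python
import numpy as np

# Verify MSE(beta_ls) = sigma^2 * tr[(Z^T Z)^{-1}] by simulation.
rng = np.random.default_rng(0)
n, p, sigma = 200, 5, 1.0
Z = rng.standard_normal((n, p))       # fixed design
beta = rng.standard_normal(p)         # true coefficients

theory = sigma**2 * np.trace(np.linalg.inv(Z.T @ Z))

# Empirical MSE over repeated noise draws at the same fixed design
errs = []
for _ in range(2000):
    y = Z @ beta + sigma * rng.standard_normal(n)
    beta_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
    errs.append(np.sum((beta_hat - beta) ** 2))
empirical = np.mean(errs)
```

The empirical average squared distance should match the trace formula closely.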
Will \(\hat{f}(\mathbf{z})\) fit future observations well?
Just because \(\hat{f}(\mathbf{z})\) fits our data well does not mean that it will fit new data well
In fact, suppose that we take new measurements \(y_i'\) at the same \(\mathbf{z}_i\)’s: \[(\mathbf{z}_1, y_1'),(\mathbf{z}_2, y_2'), \ldots ,(\mathbf{z}_n, y_n')\]
So if \(\hat{f}(\cdot)\) is a good model, then \(\hat{f}(\mathbf{z}_i)\) should also be close to the new target \(y_i'\)
This motivates prediction error (PE)
Prediction error and the bias-variance tradeoff
Good estimators should, on average, have small prediction errors
Let’s consider the PE at a particular target point \(\mathbf{z}_0\):
\(\mathrm{PE}(\mathbf{z}_0) = \sigma_{\epsilon}^2 + \textrm{Bias}^2(\hat{f}(\mathbf{z}_0)) + \textrm{Var}(\hat{f}(\mathbf{z}_0))\), where \(\sigma_{\epsilon}^2\) is the irreducible noise, \(\textrm{Bias}^2\) is the squared systematic error of the estimator, and \(\textrm{Var}\) is its sampling variability
As model becomes more complex (more terms included), local structure/curvature is picked up
But coefficient estimates suffer from high variance as more terms are included in the model
So introducing a little bias in our estimate for \(\boldsymbol{\beta}\) might lead to a large decrease in variance, and hence a substantial decrease in PE
We’ll cover a few well-known techniques
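To make the tradeoff concrete, here is a small sketch showing that a deliberately shrunken estimator \(c\,\hat{\boldsymbol{\beta}}^{ls}\) with \(c < 1\) can have lower MSE than least squares. The problem sizes are illustrative assumptions:

```python
import numpy as np

# A biased estimator c * beta_hat_ls (0 < c < 1) can beat least squares
# when the variance term dominates. Sizes here are illustrative.
rng = np.random.default_rng(1)
n, p, sigma = 30, 10, 2.0
Z = rng.standard_normal((n, p))
beta = 0.5 * rng.standard_normal(p)

var_ls = sigma**2 * np.trace(np.linalg.inv(Z.T @ Z))  # variance of LS

# For the shrunken estimator c * beta_hat:
#   bias^2 = (1 - c)^2 * ||beta||^2,  variance = c^2 * var_ls
cs = np.linspace(0.0, 1.0, 101)
mse = (1 - cs) ** 2 * np.sum(beta**2) + cs**2 * var_ls
c_best = cs[np.argmin(mse)]           # optimal shrinkage is below 1
```

The minimizing \(c\) is strictly below 1: a little bias buys a larger drop in variance.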
Depicting the bias-variance tradeoff
A graph depicting the bias-variance tradeoff.
Ridge Regression
Ridge regression: \(\ell_2\) penalty as regularization
If the \(\beta_j\)’s are unconstrained…
They can explode
And hence are susceptible to very high variance
To control variance, we might regularize the coefficients
i.e., we might control how large the coefficients grow
Might impose the ridge constraint (both forms are equivalent): \[\mathrm{minimize}\: (\mathbf{y}-\mathbf{Z}\boldsymbol{\beta})^\top (\mathbf{y}-\mathbf{Z}\boldsymbol{\beta})\: \mathrm{s.t.} \sum_{j=1}^p \beta_j^2 \leq t\] \[\textrm{PRSS}(\boldsymbol{\beta})_{\ell_2} = \sum_{i=1}^n (y_i - \mathbf{z}_i^\top \boldsymbol{\beta})^2 + \lambda \sum_{j=1}^p \beta_j^2\]
Its solution may have smaller average PE than \(\hat{\boldsymbol{\beta}}^{ls}\)
\(\textrm{PRSS}(\boldsymbol{\beta})_{\ell_2}\) is convex, and hence has a unique solution
Taking derivatives, we obtain: \[\frac{\partial \textrm{PRSS}(\boldsymbol{\beta})_{\ell_2}}{\partial \boldsymbol{\beta}} = -2\mathbf{Z}^\top (\mathbf{y}-\mathbf{Z}\boldsymbol{\beta})+2\lambda\boldsymbol{\beta}\]
The ridge solutions
Setting the derivative to zero, the minimizer of \(\textrm{PRSS}(\boldsymbol{\beta})_{\ell_2}\) is: \[\hat{\boldsymbol{\beta}}_\lambda^{ridge} = (\mathbf{Z}^\top \mathbf{Z} + \lambda \mathbf{I}_p)^{-1} \mathbf{Z}^\top \mathbf{y}\]
Remember that Z is standardized
y is centered
\(\lambda\) makes the problem non-singular even if \(\mathbf{Z}^\top \mathbf{Z}\) is not invertible
This was the original motivation for ridge regression (Hoerl & Kennard, 1970)
Solution is indexed by the tuning parameter \(\lambda\)
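The closed form above is easy to implement directly. The snippet below (synthetic data, standardized \(\mathbf{Z}\), centered \(\mathbf{y}\)) also checks the two limits of \(\lambda\):

```python
import numpy as np

# Ridge closed form: beta_ridge = (Z^T Z + lambda I)^{-1} Z^T y.
# Synthetic data; Z standardized and y centered, as the slides assume.
rng = np.random.default_rng(2)
n, p = 50, 8
Z = rng.standard_normal((n, p))
Z = (Z - Z.mean(0)) / Z.std(0)           # standardize columns
y = Z @ rng.standard_normal(p) + rng.standard_normal(n)
y = y - y.mean()                          # center the response

def ridge(Z, y, lam):
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

beta_ls = np.linalg.lstsq(Z, y, rcond=None)[0]
beta_small = ridge(Z, y, 1e-8)   # lambda -> 0 recovers least squares
beta_large = ridge(Z, y, 1e8)    # large lambda shrinks toward 0
```

Note that the coefficient norm shrinks monotonically as \(\lambda\) grows.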
Ridge coefficient paths
The \(\lambda\) values trace out a set of ridge solutions, as illustrated below
Ridge coefficient path for the diabetes data set found in the lars library in R.
Tuning parameter \(\lambda\)
Notice that the solution is indexed by the parameter \(\lambda\)
For each value of \(\lambda\), we obtain a solution
Varying \(\lambda\) gives a series of solutions
\(\lambda\) is the shrinkage parameter
\(\lambda\) controls the amount of regularization
As \(\lambda\) decreases to 0, we recover the least squares solution
As \(\lambda\) becomes very large, the coefficients shrink toward 0 (intercept-only model if \(\mathbf{y}\) is centered)
A few notes on ridge regression
Ridge regression decreases the effective degrees of freedom of the model
It can still be fit when \(p > n\), but the effective complexity stays below the unregularized case
This can be shown by examination of the smoother matrix
We won’t do this here; the argument is involved
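One way to see the reduction without the full smoother-matrix argument is via the SVD of \(\mathbf{Z}\): writing \(d_j\) for the singular values, \(\mathrm{df}(\lambda) = \mathrm{tr}[\mathbf{Z}(\mathbf{Z}^\top \mathbf{Z} + \lambda \mathbf{I})^{-1}\mathbf{Z}^\top] = \sum_j d_j^2/(d_j^2+\lambda)\). A quick numerical sketch with a synthetic design:

```python
import numpy as np

# Effective degrees of freedom of ridge via the singular values of Z:
#   df(lambda) = sum_j d_j^2 / (d_j^2 + lambda)
rng = np.random.default_rng(3)
Z = rng.standard_normal((40, 6))
d = np.linalg.svd(Z, compute_uv=False)

def edf(lam):
    return np.sum(d**2 / (d**2 + lam))

df0 = edf(0.0)       # equals p = 6 for a full-rank design
df_mid = edf(10.0)   # strictly between 0 and p
df_big = edf(1e9)    # approaches 0 as lambda grows
```

So \(\mathrm{df}(\lambda)\) interpolates smoothly between \(p\) (least squares) and 0.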
Choosing \(\lambda\)
We need to tune \(\lambda\) to minimize the mean squared error
This is part of the bigger problem of model selection
In their original paper, Hoerl and Kennard introduced ridge traces:
Plot the components of \(\hat{\beta}_\lambda^{ridge}\) against \(\lambda\)
Choose \(\lambda\) for which the coefficients are not rapidly changing and have “sensible” signs
No objective basis; heavily criticized by many
K-fold cross-validation
A common method to determine \(\lambda\) is K-fold cross-validation.
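A minimal sketch of K-fold cross-validation for ridge, assuming a synthetic data set and an illustrative grid of \(\lambda\) values:

```python
import numpy as np

# K-fold cross-validation to choose the ridge parameter lambda.
rng = np.random.default_rng(4)
n, p, K = 100, 10, 5
Z = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = Z @ beta_true + rng.standard_normal(n)

def ridge(Z, y, lam):
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

lambdas = np.logspace(-3, 4, 30)          # illustrative grid
folds = np.array_split(rng.permutation(n), K)
cv_err = np.zeros(len(lambdas))
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    for i, lam in enumerate(lambdas):
        b = ridge(Z[train_idx], y[train_idx], lam)
        resid = y[test_idx] - Z[test_idx] @ b
        cv_err[i] += np.sum(resid**2)      # held-out squared error
cv_err /= n
lam_best = lambdas[np.argmin(cv_err)]      # minimizer of the CV curve
```

Plotting `cv_err` against `lambdas` gives exactly the CV-error curve shown on the next slide.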
Plot of CV errors and standard error bands
Cross validation errors from a ridge regression example on spam data.
The LASSO
The LASSO: \(\ell_1\) penalty
Tibshirani (J. Roy. Stat. Soc. B, 1996) introduced the LASSO: least absolute shrinkage and selection operator
LASSO coefficients are the solutions to the \(\ell_1\) optimization problem: \[\mathrm{minimize}\: (\mathbf{y}-\mathbf{Z}\boldsymbol{\beta})^\top (\mathbf{y}-\mathbf{Z}\boldsymbol{\beta})\: \mathrm{s.t.} \sum_{j=1}^p \lvert \beta_j \rvert \leq t\]
This is equivalent to the penalized loss function: \[\textrm{PRSS}(\boldsymbol{\beta})_{\ell_1} = \sum_{i=1}^n (y_i - \mathbf{z}_i^\top \boldsymbol{\beta})^2 + \lambda \sum_{j=1}^p \lvert \beta_j \rvert\]\[\quad = (\mathbf{y}-\mathbf{Z}\boldsymbol{\beta})^\top (\mathbf{y}-\mathbf{Z}\boldsymbol{\beta}) + \lambda\lVert \boldsymbol{\beta} \rVert_1\]
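The original implementation used quadratic programming, but one simple alternative solver is cyclic coordinate descent with soft-thresholding. A minimal sketch on synthetic data (note the exact zeros in the result):

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator: the univariate lasso solution."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(Z, y, lam, n_iter=500):
    """Minimize ||y - Zb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = Z.shape
    b = np.zeros(p)
    col_sq = np.sum(Z**2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - Z @ b + Z[:, j] * b[j]   # partial residual excluding j
            rho = Z[:, j] @ r_j
            b[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return b

# Illustrative sparse problem: only the first two coefficients are nonzero
rng = np.random.default_rng(5)
n, p = 100, 8
Z = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0])
y = Z @ beta_true + 0.1 * rng.standard_normal(n)

b_lasso = lasso_cd(Z, y, lam=50.0)     # sets several coefficients exactly to 0
b_ls = np.linalg.lstsq(Z, y, rcond=None)[0]
```

With `lam=0` the same routine recovers the least squares fit, while a large enough `lam` zeros out the irrelevant coefficients exactly, performing variable selection.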
\(\lambda\) (or t) as a tuning parameter
Again, we have a tuning parameter \(\lambda\) that controls the amount of regularization
One-to-one correspondence with the threshold t:
recall the constraint: \[\sum_{j=1}^p \lvert \beta_j \rvert \leq t\]
Hence, have a “path” of solutions indexed by \(t\)
If \(t \geq t_0 = \sum_{j=1}^p \lvert \hat{\beta}_j^{ls} \rvert\) (equivalently, \(\lambda = 0\)), there is no shrinkage and we recover the least squares solution
Often, the path of solutions is indexed by the shrinkage factor \(s = t/t_0\)
Sparsity and exact zeros
Often, we believe that many of the \(\beta_j\)’s should be 0
Hence, we seek a set of sparse solutions
Large enough \(\lambda\) (or small enough t) will set some coefficients exactly equal to 0!
So LASSO will perform model selection for us!
Computing the LASSO solution
Unlike ridge regression, \(\hat{\boldsymbol{\beta}}^{lasso}_{\lambda}\) has no simple closed-form expression
Original implementation involves quadratic programming techniques from convex optimization
LARS (least angle regression) computes a closely related path efficiently
Efron et al., Ann. Statist., 2004 proposed LARS
With a small modification, LARS can also be used to compute the LASSO path
Forward stagewise is a closely related method and is straightforward to implement
https://doi.org/10.1214/07-EJS004
Forward stagewise algorithm
As usual, assume Z is standardized and y is centered
Choose a small step size \(\epsilon\). The forward-stagewise algorithm then proceeds as follows:
Start with initial residual \(\mathbf{r}=\mathbf{y}\), and \(\beta_1=\beta_2=\ldots=\beta_p=0\)
Find the predictor \(\mathbf{Z}_j\; (j=1,\ldots,p)\) most correlated with \(\mathbf{r}\)
Update \(\beta_j \leftarrow \beta_j + \delta_j\), where \(\delta_j = \epsilon \cdot \mathrm{sign}(\mathbf{Z}_j^\top \mathbf{r})\)
Update \(\mathbf{r} \leftarrow \mathbf{r} - \delta_j \mathbf{Z}_j\), and repeat from step 2 until no predictor remains correlated with \(\mathbf{r}\)
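The procedure can be sketched in code. The step size, stopping tolerance, iteration cap, and data below are illustrative assumptions:

```python
import numpy as np

# Sketch of forward stagewise regression on synthetic data.
rng = np.random.default_rng(6)
n, p = 100, 6
Z = rng.standard_normal((n, p))
Z = (Z - Z.mean(0)) / Z.std(0)            # standardize Z
beta_true = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])
y = Z @ beta_true + 0.2 * rng.standard_normal(n)
y = y - y.mean()                           # center y

eps, tol = 0.01, 1.0                       # small step size, stop tolerance
beta = np.zeros(p)
r = y.copy()                               # start with r = y, beta = 0
for _ in range(5000):
    corr = Z.T @ r                         # (unnormalized) correlations
    j = int(np.argmax(np.abs(corr)))       # predictor most correlated with r
    if np.abs(corr[j]) < tol:              # stop: nothing left to explain
        break
    delta = eps * np.sign(corr[j])         # nudge beta_j by a small step
    beta[j] += delta
    r -= delta * Z[:, j]                   # update residual and repeat
```

Run long enough with a small step, the coefficients approach the least squares fit; stopped early, they trace out a regularized path much like the LASSO's.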
Review questions
What is the bias-variance tradeoff? Why might we want to introduce bias into a model?
What is regularization? What are some reasons to use it?
What is the difference between ridge regression and LASSO? How should you choose between them?
Are you guaranteed to find the global optimal answer for ridge regression? What about LASSO?
What is variable selection? Which method(s) perform it? What can you say about the answers?
What does it mean when one says ridge regression and LASSO give a series of solutions?
What can we say about the relationship between fitting error and prediction error?
What does regularization do to the variance of a model?
A colleague tells you about a new form of regularization they’ve come up with (e.g., force all parameters to be within the range 1–3). How would this influence the variance of the model? Might this improve the prediction error?
Can you regularize NNLS? If so, how could you implement this?