Fitting & Regression Redux, Regularization
Example
Predicting Drug Response
The Bias-Variance Tradeoff
Estimating β
- As usual, we assume the model: \[y=f(\mathbf{z})+\epsilon, \epsilon\sim \mathcal{N}(0,\sigma^2)\]
- In regression analysis, our major goal is to come up with some good regression function \[\hat{f}(\mathbf{z}) = \mathbf{z}^\intercal \hat{\beta}\]
- So far, we’ve been dealing with \(\hat{\boldsymbol\beta}^{ls}\), the least squares solution: \[\hat{\boldsymbol\beta}^{ls} = (\mathbf{Z}^\intercal \mathbf{Z})^{-1} \mathbf{Z}^\intercal \mathbf{y}\] (a quick numerical sketch follows this list)
- \(\hat{\boldsymbol\beta}^{ls}\) has well-known properties (e.g., it is the best linear unbiased estimator by Gauss-Markov, and the maximum-likelihood estimate under Gaussian noise)
- But can we do better?
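As a quick numerical illustration of the least squares solution above, here is a minimal sketch (Python/NumPy; the data, names, and sizes are made up for illustration):

```python
# Minimal sketch: computing the least squares solution on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
Z = rng.normal(size=(n, p))                       # design matrix
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
y = Z @ beta_true + rng.normal(size=n)            # y = Z beta + noise

# Normal equations: beta_ls = (Z^T Z)^{-1} Z^T y
beta_ls = np.linalg.solve(Z.T @ Z, Z.T @ y)

# Numerically more stable equivalent
beta_lstsq, *_ = np.linalg.lstsq(Z, y, rcond=None)
print(beta_ls.round(2), np.allclose(beta_ls, beta_lstsq))
```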
Choosing a good regression function
- Suppose we have an estimator \[\hat{f}(\mathbf{z}) = \mathbf{z}^\intercal \hat{\beta}\]
- To see if this is a good candidate, we can ask ourselves two questions:
- Is \(\hat{\beta}\) close to the true \(\beta\)?
- Will \(\hat{f}(\mathbf{z})\) fit future observations well?
- These two questions can have very different answers!
Is \(\hat{\boldsymbol\beta}\) close to the true \(\boldsymbol\beta\)?
- To answer this question, we might consider the mean squared error of our estimate \(\hat{\boldsymbol\beta}\):
- i.e., consider squared distance of \(\hat{\boldsymbol\beta}\) to the true \(\boldsymbol\beta\): \[MSE(\hat{\boldsymbol\beta}) = \mathop{\mathbb{E}}\left[\lVert \hat{\boldsymbol\beta} - \boldsymbol\beta \rVert^2\right] = \mathop{\mathbb{E}}[(\hat{\boldsymbol\beta} - \boldsymbol\beta)^\intercal (\hat{\boldsymbol\beta} - \boldsymbol\beta)]\]
- Example: In least squares (LS), we know that: \[\mathop{\mathbb{E}}[(\hat{\boldsymbol\beta}^{ls} - \boldsymbol\beta)^\intercal (\hat{\boldsymbol\beta}^{ls} - \boldsymbol\beta)] = \sigma^2 \mathrm{tr}[(\mathbf{Z}^T \mathbf{Z})^{-1}]\]
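A hedged Monte Carlo check of the identity above (the design, \(\sigma\), and \(\beta\) below are arbitrary illustrative choices, not from the lecture):

```python
# Monte Carlo check: MSE(beta_ls) ~ sigma^2 * tr[(Z^T Z)^{-1}] for a fixed design Z.
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 50, 3, 2.0
Z = rng.normal(size=(n, p))                       # fixed design
beta = np.array([1.0, -2.0, 0.5])

theory = sigma**2 * np.trace(np.linalg.inv(Z.T @ Z))

sq_err = []
for _ in range(20000):                            # repeat the experiment many times
    y = Z @ beta + rng.normal(scale=sigma, size=n)
    beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)
    sq_err.append(np.sum((beta_hat - beta) ** 2))

print(theory, np.mean(sq_err))                    # the two numbers should be close
```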
Will \(\hat{f}(\mathbf{z})\) fit future observations well?
- Just because \(\hat{f}(\mathbf{z})\) fits our data well, this doesn’t mean that it will be a good fit to new data
- In fact, suppose that we take new measurements \(y_i'\) at the same \(\mathbf{z}_i\)’s: \[(\mathbf{z}_1,y_1'),(\mathbf{z}_2,y_2'),\ldots,(\mathbf{z}_n,y_n')\]
- So if \(\hat{f}(\cdot)\) is a good model, then \(\hat{f}(\mathbf{z}_i)\) should also be close to the new target \(y_i'\)
- This is the notion of prediction error (PE)
Have students draw lines for smoking vs. heart attack risk over data. Make two datasets to explain variance.
Prediction error and the bias-variance tradeoff
- So good estimators should, on average, have small prediction errors
- Let’s consider the PE at a particular target point \(\mathbf{z}_0\), i.e., the expected squared error \(\mathop{\mathbb{E}}[(y_0' - \hat{f}(\mathbf{z}_0))^2]\) for a new observation \(y_0'\) at \(\mathbf{z}_0\):
- \(PE(\mathbf{z}_0) = \sigma_{\epsilon}^2 + Bias^2(\hat{f}(\mathbf{z}_0)) + Var(\hat{f}(\mathbf{z}_0))\)
- We won’t derive this here, but it follows directly from the previous definitions
- Such a decomposition is known as the bias-variance tradeoff
- As model becomes more complex (more terms included), local structure/curvature is picked up
- But coefficient estimates suffer from high variance as more terms are included in the model
- So introducing a little bias in our estimate for β might lead to a large decrease in variance, and hence a substantial decrease in PE
- Bias is squared, so a small amount is ok.
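To make the tradeoff concrete, here is an illustrative simulation (the true function, noise level, and target point are invented): polynomials of increasing degree are refit on many training sets, and the bias\(^2\) and variance terms at a single point are estimated.

```python
# Illustrative bias-variance decomposition at a single target point z0.
import numpy as np

rng = np.random.default_rng(2)
f = lambda z: np.sin(2 * z)                       # assumed "true" regression function
sigma, n, z0, reps = 0.3, 30, 0.8, 2000

for degree in (1, 3, 9):                          # increasing model complexity
    preds = np.empty(reps)
    for r in range(reps):
        z = rng.uniform(-1, 1, size=n)
        y = f(z) + rng.normal(scale=sigma, size=n)
        preds[r] = np.polyval(np.polyfit(z, y, degree), z0)
    bias2 = (preds.mean() - f(z0)) ** 2
    var = preds.var()
    print(f"degree {degree}: bias^2={bias2:.4f}  var={var:.4f}  "
          f"PE={sigma**2 + bias2 + var:.4f}")
```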
Depicting the bias-variance tradeoff
- Regularization can take many forms: any constraint or penalty that tames the coefficient estimates
- Today we’ll see two (or three) examples: ridge regression, the LASSO, and briefly the elastic net
Ridge Regression
Ridge regression as regularization
- If the \(\beta_j\)’s are unconstrained…
- They can explode
- And hence are susceptible to very high variance
- To control variance, we might regularize the coefficients
- i.e., we might control how large the coefficients can grow
- Might impose the ridge constraint (both equivalent):
- minimize \(\sum_{i=1}^n (y_i - \boldsymbol\beta^\intercal \mathbf{z}_i)^2\: \mathrm{s.t.} \sum_{j=1}^p \beta_j^2 \leq t\)
- minimize \((\mathbf{y} - \mathbf{Z}\boldsymbol\beta)^\intercal (\mathbf{y} - \mathbf{Z}\boldsymbol\beta)\: \mathrm{s.t.} \sum_{j=1}^p \beta_j^2 \leq t\)
- By convention (very important):
- Z is assumed to be standardized (mean 0, unit variance)
- y is assumed to be centered
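A small sketch of this convention (the helper below is hypothetical, not from the course code): standardize the columns of Z and center y before fitting.

```python
# Sketch of the convention: columns of Z get mean 0 / unit variance, y is centered.
import numpy as np

def standardize(Z, y):
    Z_std = (Z - Z.mean(axis=0)) / Z.std(axis=0)   # mean 0, unit variance per column
    y_ctr = y - y.mean()                           # centered response
    return Z_std, y_ctr

rng = np.random.default_rng(3)
Z = rng.normal(loc=5.0, scale=3.0, size=(40, 4))
y = rng.normal(loc=10.0, size=40)
Z_std, y_ctr = standardize(Z, y)
print(Z_std.mean(axis=0).round(6), Z_std.std(axis=0).round(6), round(y_ctr.mean(), 6))
```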
Ridge regression: \(l_2\)-penalty
- Can write the ridge constraint as the following penalized residual sum of squares (PRSS):
\[ \begin{aligned} PRSS(\beta)_{\ell_2} &= \sum_{i=1}^{n}(y_i - \mathbf{z}_i^{\top}\beta)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 \\ &= (\mathbf{y} - \mathbf{Z}\beta)^{\top}(\mathbf{y} - \mathbf{Z}\beta) + \lambda\|\beta\|_2^2 \end{aligned} \]
- Its solution may have smaller average PE than \(\hat{\boldsymbol\beta}^{ls}\)
- \(PRSS(\boldsymbol\beta)_{l_2}\) is convex, and hence has a unique solution
- Taking derivatives, we obtain: \[\frac{\partial PRSS(\boldsymbol\beta)_{\ell_2}}{\partial \boldsymbol\beta} = -2\mathbf{Z}^\intercal (\mathbf{y}-\mathbf{Z}\boldsymbol\beta)+2\lambda\boldsymbol\beta\]
- So this is just an added term to OLS!
- Ask if folks have seen convex before.
- What happens with \(\lambda\) going to infinity?
- How about \(\lambda\) going to 0?
- Software often asks for \(\log(\lambda)\) or \(1/\lambda\) instead of \(\lambda\) itself. Be sure to check.
The ridge solutions
- Setting this derivative to zero, the minimizer of \(PRSS(\boldsymbol\beta)_{\ell_2}\) is seen to be: \[\hat{\beta}_\lambda^{ridge} = (\mathbf{Z}^\intercal \mathbf{Z} + \lambda \mathbf{I}_p)^{-1} \mathbf{Z}^\intercal \mathbf{y}\]
- Remember that Z is standardized
- y is centered
- Solution is indexed by the tuning parameter λ
- λ makes problem non-singular even if \(\mathbf{Z}^\intercal \mathbf{Z}\) is not invertible
- Was the original motivation for ridge (Hoerl & Kennard, 1970)
- As simple to compute as OLS!
- How do we figure out the tuning parameter?
- What does the invertibility statement mean? We can solve even when p > n!
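A minimal sketch of the closed-form solution above, showing that it is computable even when \(p > n\) (sizes and data below are made up):

```python
# Ridge solution: beta = (Z^T Z + lambda I)^{-1} Z^T y; well defined even if p > n.
import numpy as np

def ridge(Z, y, lam):
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

rng = np.random.default_rng(4)
n, p = 20, 50                                     # p >> n: Z^T Z has rank <= n
Z = rng.normal(size=(n, p))
y = rng.normal(size=n)

print(ridge(Z, y, lam=1.0).shape)                 # (50,), ridge still works
# The plain normal equations would be ill-posed here, since Z^T Z is singular.
```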
Tuning parameter λ
- Notice that the solution is indexed by the parameter λ
- So for each λ, we have a solution
- Hence, the λ’s trace out a path of solutions (see next page)
- λ is the shrinkage parameter
- λ controls the size of the coefficients
- λ controls amount of regularization
- As \(\lambda \to 0\), we recover the least squares solution
- As \(\lambda \to \infty\), we have \(\hat{\beta}_{\lambda}^{ridge} \to 0\) (intercept-only model)
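A quick check of the two limits (an illustrative sketch on simulated, standardized data):

```python
# Ridge path over a grid of lambdas: lambda -> 0 recovers least squares,
# lambda -> infinity shrinks the coefficients toward 0.
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 5
Z = rng.normal(size=(n, p))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)          # standardized
y = Z @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(size=n)
y = y - y.mean()                                  # centered

beta_ls = np.linalg.solve(Z.T @ Z, Z.T @ y)
for lam in (1e-6, 1.0, 10.0, 1e6):
    beta_r = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
    print(f"lambda={lam:g}  ||beta_ridge|| = {np.linalg.norm(beta_r):.4f}")
print("||beta_ls|| =", round(np.linalg.norm(beta_ls), 4))
```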
Ridge coefficient paths
- The λ’s trace out a set of ridge solutions, as illustrated in the coefficient-path figure (produced with the lars library in R)
Choosing λ
- Need disciplined way of selecting λ
- That is, we need to “tune” the value of λ
- In their original paper, Hoerl and Kennard introduced ridge traces:
- Plot the components of \(\hat{\beta}_\lambda^{ridge}\) against λ
- Choose λ for which the coefficients are not rapidly changing and have “sensible” signs
- No objective basis; heavily criticized by many
- Standard practice now is to use cross-validation (next lecture!)
A few notes on ridge regression
- The regularization decreases the effective degrees of freedom of the model
- So you still cannot fit a model with more degrees of freedom than points
- This can be shown by examination of the smoother matrix
- We won’t do this—it’s a complicated argument
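We won’t go through that argument, but the quantity it produces is easy to compute: the effective degrees of freedom \(df(\lambda) = \mathrm{tr}[\mathbf{Z}(\mathbf{Z}^\intercal\mathbf{Z} + \lambda\mathbf{I}_p)^{-1}\mathbf{Z}^\intercal]\), the trace of the smoother matrix. A small sketch on a simulated design (not from the lecture):

```python
# Effective degrees of freedom of the ridge fit: trace of the smoother matrix.
import numpy as np

def ridge_df(Z, lam):
    p = Z.shape[1]
    smoother = Z @ np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T)
    return np.trace(smoother)

rng = np.random.default_rng(6)
Z = rng.normal(size=(100, 8))
for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda={lam:g}  df={ridge_df(Z, lam):.3f}")   # starts at p=8, shrinks with lambda
```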
How do we choose λ?
- We need a disciplined way of choosing λ
- Obviously want to choose λ that minimizes the mean squared error
- Issue is part of the bigger problem of model selection
K-Fold Cross-Validation
- A common method to determine \(\lambda\) is K-fold cross-validation.
- We will discuss this next lecture.
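As a preview (details next lecture), a hedged sketch using scikit-learn's RidgeCV, which selects its penalty (called alpha there) by K-fold cross-validation over a user-supplied grid; the data below are simulated:

```python
# Preview: choosing lambda by K-fold cross-validation with scikit-learn's RidgeCV.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(7)
n, p = 100, 10
Z = rng.normal(size=(n, p))
y = Z @ np.r_[2.0, -1.0, np.zeros(p - 2)] + rng.normal(size=n)

alphas = np.logspace(-3, 3, 25)                    # candidate lambda values
model = RidgeCV(alphas=alphas, cv=5).fit(Z, y)     # 5-fold CV over the grid
print(model.alpha_)                                # the selected lambda
```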
Plot of CV errors and standard error bands
The LASSO
The LASSO: \(l_1\) penalty
- Tibshirani (J of the Royal Stat Soc 1996) introduced the LASSO: least absolute shrinkage and selection operator
- LASSO coefficients are the solutions to the \(l_1\) optimization problem: \[\mathrm{minimize}\: (\mathbf{y}-\mathbf{Z}\boldsymbol\beta)^\intercal (\mathbf{y}-\mathbf{Z}\boldsymbol\beta)\: \mathrm{s.t.} \sum_{j=1}^p \lvert \beta_j \rvert \leq t\]
- This is equivalent to minimizing the penalized loss: \[PRSS(\boldsymbol\beta)_{l_1} = \sum_{i=1}^n (y_i - \mathbf{z}_i^\intercal \boldsymbol\beta)^2 + \lambda \sum_{j=1}^p \lvert \beta_j \rvert\] \[\quad = (\mathbf{y}-\mathbf{Z}\boldsymbol\beta)^\intercal (\mathbf{y}-\mathbf{Z}\boldsymbol\beta) + \lambda\lVert \boldsymbol\beta \rVert_1\]
- The date (1996, versus 1970 for ridge) hints at the difficulty of solving this problem.
- \(l_1\) indicates the sum of absolute values.
- So the \(l_2\) norm is more sensitive to large values; the \(l_1\) norm is equally sensitive to all values.
- What happens if \(\hat{\beta}_j\) doesn’t really affect the fitting quality?
λ (or t) as a tuning parameter
- Again, we have a tuning parameter λ that controls the amount of regularization
- One-to-one correspondence with the threshold \(t\):
- recall the constraint: \[\sum_{j=1}^p \lvert \beta_j \rvert \leq t\]
- Hence, have a “path” of solutions indexed by \(t\)
- If \(t_0 = \sum_{j=1}^p \lvert \hat{\beta}_j^{ls} \rvert\) (equivalently, λ = 0), we obtain no shrinkage (and hence obtain the LS solutions as our solution)
- Often, the path of solutions is indexed by the shrinkage fraction \(s = t/t_0\)
- Software will implement this in different ways.
Sparsity and exact zeros
- Often, we believe that many of the \(\beta_j\)’s should be 0
- Hence, we seek a set of sparse solutions
- Large enough \(\lambda\) (or small enough t) will set some coefficients exactly equal to 0!
- So LASSO will perform model selection for us!
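A minimal sketch of this behavior using scikit-learn's Lasso (its alpha plays the role of λ, up to a scaling convention); the data and settings below are illustrative:

```python
# The LASSO sets some coefficients exactly to zero, i.e. it selects variables.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(8)
n, p = 100, 20
Z = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [4.0, -3.0, 2.0]                   # only three truly relevant predictors
y = Z @ beta_true + rng.normal(size=n)

fit = Lasso(alpha=0.5).fit(Z, y)
print("selected:", np.nonzero(fit.coef_)[0])       # indices with non-zero coefficients
print("exact zeros:", int(np.sum(fit.coef_ == 0.0)))
```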
Computing the LASSO solution
- Unlike ridge regression, \(\hat{\beta}^{lasso}_{\lambda}\) has no closed-form solution
- Original implementation involves quadratic programming techniques from convex optimization
- But Efron et al, Ann Statist, 2004 proposed LARS (least angle regression), which computes the LASSO path efficiently
- An interesting modification is called forward stagewise
- In many cases it is the same as the LASSO solution
- Forward stagewise is easy to implement: https://doi.org/10.1214/07-EJS004
Forward stagewise algorithm
- As usual, assume Z is standardized and y is centered
- Choose a small \(\epsilon\). The forward-stagewise algorithm then proceeds as follows (a code sketch follows this list):
- 1. Start with initial residual \(\mathbf{r}=\mathbf{y}\), and \(\beta_1=\beta_2=\ldots=\beta_p=0\)
- 2. Find the predictor \(\mathbf{Z}_j\) \((j=1,\ldots,p)\) most correlated with \(\mathbf{r}\)
- 3. Update \(\beta_j=\beta_j+\delta_j\), where \(\delta_j = \epsilon\cdot \mathrm{sign} \langle\mathbf{r},\mathbf{Z}_j\rangle = \epsilon\cdot \mathrm{sign}(\mathbf{Z}^\intercal_j \mathbf{r})\)
- 4. Set \(\mathbf{r}=\mathbf{r} - \delta_j \mathbf{Z}_j\)
- 5. Repeat steps 2-4 many times
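A direct sketch of the algorithm above (\(\epsilon\), the number of steps, and the simulated data are arbitrary illustrative choices):

```python
# Forward stagewise: tiny coordinate steps toward the predictor most
# correlated with the current residual.
import numpy as np

def forward_stagewise(Z, y, eps=0.01, n_steps=5000):
    """Assumes columns of Z are standardized and y is centered."""
    beta = np.zeros(Z.shape[1])
    r = y.copy()                                   # start with residual r = y
    for _ in range(n_steps):
        corrs = Z.T @ r                            # inner products <r, Z_j>
        j = int(np.argmax(np.abs(corrs)))          # most correlated predictor
        delta = eps * np.sign(corrs[j])
        beta[j] += delta                           # small update to beta_j
        r -= delta * Z[:, j]                       # update the residual
    return beta

rng = np.random.default_rng(9)
n, p = 200, 10
Z = rng.normal(size=(n, p))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
beta_true = np.r_[3.0, -2.0, 1.5, np.zeros(p - 3)]
y = Z @ beta_true + rng.normal(size=n)
y = y - y.mean()
print(forward_stagewise(Z, y).round(2))            # roughly recovers beta_true
```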
Comparing LS, Ridge, and the LASSO
- Even though \(\mathbf{Z}^{T}\mathbf{Z}\) may not be of full rank, both ridge regression and the LASSO admit solutions
- Ordinary least squares runs into a problem when \(p \gg n\) (more predictor variables than observations)
- But both ridge regression and the LASSO have solutions
- Regularization tends to reduce prediction error
The LASSO, LARS, and Forward Stagewise paths
More comments on variable selection
- Now suppose \(p \gg n\)
- Of course, we would like a parsimonious model (Occam’s Razor)
- Ridge regression produces a coefficient value for each of the \(p\) variables
- But because of its \(l_1\) penalty, the LASSO will set many of the variables exactly equal to 0!
- That is, the LASSO produces sparse solutions
- So LASSO takes care of model selection for us
- And we can even see when variables jump into the model by looking at the LASSO path
Variants
- Zou and Hastie (2005) propose the elastic net, which is a convex combination of ridge and the LASSO
- The paper asserts that the elastic net can improve prediction error over the LASSO
- Still produces sparse solutions
- The penalized error combines all three terms (squared error, \(l_1\), \(l_2\)): \[PRSS(\boldsymbol\beta)_{EN} = (\mathbf{y}-\mathbf{Z}\boldsymbol\beta)^\intercal (\mathbf{y}-\mathbf{Z}\boldsymbol\beta) + \lambda_1 \lVert \boldsymbol\beta \rVert_1 + \lambda_2 \lVert \boldsymbol\beta \rVert_2^2\]
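A hedged sketch using scikit-learn's ElasticNet, which mixes the two penalties via alpha (overall strength) and l1_ratio (the \(l_1\)/\(l_2\) mix); the data below are simulated:

```python
# Elastic net: a combination of the l1 and l2 penalties; still gives sparse fits.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(10)
n, p = 100, 30
Z = rng.normal(size=(n, p))
beta_true = np.r_[3.0, 3.0, -2.0, np.zeros(p - 3)]
y = Z @ beta_true + rng.normal(size=n)

fit = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(Z, y)   # half l1, half l2
print(np.nonzero(fit.coef_)[0])                       # sparse, like the LASSO
```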
High-dimensional data and underdetermined systems
- In many modern data analysis problems, we have \(p \gg n\)
- These comprise “high-dimensional” problems
- When fitting the model \(y = \mathbf{z}^\intercal \beta\), we can have many solutions
- i.e., our system is underdetermined
- Reasonable to suppose that most of the coefficients are exactly equal to 0
But do these methods pick the right/true variables?
- Suppose that only \(S\) elements of \(\beta\) are non-zero
- Now suppose we had an “Oracle” that told us which components of the \(\beta = (\beta_1,\beta_2,\ldots,\beta_p)\) are truly non-zero
- Let \(\beta^{*}\) be the least squares estimate for this “ideal” (oracle) model:
- So \(\beta^{*}\) is 0 in every component that \(\beta\) is 0
- The non-zero elements of \(\beta^{*}\) are computed by regressing \(\mathbf{y}\) on only the \(S\) important covariates
- It turns out we get really close to this cheating solution without cheating!
- Candes & Tao. Ann Statist. 2007.
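A simulation sketch of this comparison (all settings are invented for illustration): generate a sparse \(\beta\) with \(p \gg n\), compute the oracle least squares fit on the true support, and compare it with the LASSO, which does not know the support.

```python
# Oracle comparison: LASSO vs. least squares restricted to the true support.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(11)
n, p, S = 60, 200, 5
Z = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:S] = [5.0, -4.0, 3.0, 2.5, -2.0]             # only S non-zero components
y = Z @ beta + rng.normal(scale=0.5, size=n)

# "Oracle" estimator: regress y on the S truly relevant covariates only
beta_star = np.zeros(p)
coef_S, *_ = np.linalg.lstsq(Z[:, :S], y, rcond=None)
beta_star[:S] = coef_S

# LASSO, which must discover the support itself
beta_lasso = Lasso(alpha=0.1).fit(Z, y).coef_

print("oracle error:", round(float(np.linalg.norm(beta_star - beta)), 3))
print("lasso  error:", round(float(np.linalg.norm(beta_lasso - beta)), 3))
```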
Implementation
- The notebook can be found on the course website.
Further Reading
Review Questions
- What is the bias-variance tradeoff? Why might we want to introduce bias into a model?
- What is regularization? What are some reasons to use it?
- What is the difference between ridge regression and LASSO? How should you choose between them?
- Are you guaranteed to find the global optimal answer for ridge regression? What about LASSO?
- What is variable selection? Which method(s) perform it? What can you say about the answers?
- What does it mean when one says ridge regression and LASSO give a family of answers?
- What can we say about the relationship between fitting error and prediction error?
- What does regularization do to the degrees of freedom (p) of a model? How about the number of measurements (n)?
- A colleague tells you about a new form of regularization they’ve come up with (e.g., remove X% of variables least correlated with Y). How would this influence the variance of the model? Might this improve the prediction error?
- Can you regularize NNLS? If so, how would this be implemented?