As usual, we assume the model: \[y=f(\mathbf{z})+\epsilon, \epsilon\sim \mathcal{N}(0,\sigma^2)\]
In regression analysis, our major goal is to come up with some good regression function \[\hat{f}(\mathbf{z}) = \mathbf{z}^\intercal \hat{\beta}\]
So far, we’ve been dealing with \(\hat{\beta}^{ls}\), the least squares solution: \[\hat{\beta}^{ls} = (\mathbf{Z}^\intercal \mathbf{Z})^{-1} \mathbf{Z}^\intercal \mathbf{y}\]
\(\hat{\beta}^{ls}\) has well-known optimality properties (e.g., it is the best linear unbiased estimator by Gauss-Markov, and the maximum likelihood estimator under the Gaussian error model)
But can we do better?
Choosing a good regression function
Suppose we have an estimator \[\hat{f}(\mathbf{z}) = \mathbf{z}^\intercal \hat{\beta}\]
To see if this is a good candidate, we can ask ourselves two questions:
Is \(\hat{\beta}\) close to the true \(\beta\)?
Will \(\hat{f}(\mathbf{z})\) fit future observations well?
These two questions can have very different answers!
Is \(\hat{\boldsymbol\beta}\) close to the true \(\boldsymbol\beta\)?
To answer this question, we might consider the mean squared error of our estimate \(\hat{\boldsymbol\beta}\):
i.e., consider squared distance of \(\hat{\boldsymbol\beta}\) to the true \(\boldsymbol\beta\): \[MSE(\hat{\boldsymbol\beta}) = \mathop{\mathbb{E}}\left[\lVert \hat{\boldsymbol\beta} - \boldsymbol\beta \rVert^2\right] = \mathop{\mathbb{E}}[(\hat{\boldsymbol\beta} - \boldsymbol\beta)^\intercal (\hat{\boldsymbol\beta} - \boldsymbol\beta)]\]
Example: In least squares (LS), we know that: \[\mathop{\mathbb{E}}[(\hat{\boldsymbol\beta}^{ls} - \boldsymbol\beta)^\intercal (\hat{\boldsymbol\beta}^{ls} - \boldsymbol\beta)] = \sigma^2 \mathrm{tr}[(\mathbf{Z}^T \mathbf{Z})^{-1}]\]
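This identity is easy to check by simulation. Below is a minimal numpy sketch (the design matrix, the “true” \(\boldsymbol\beta\), and \(\sigma\) are made up for illustration) comparing the Monte Carlo average of \(\lVert \hat{\boldsymbol\beta}^{ls} - \boldsymbol\beta \rVert^2\) with \(\sigma^2 \mathrm{tr}[(\mathbf{Z}^T \mathbf{Z})^{-1}]\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 5, 2.0
Z = rng.normal(size=(n, p))        # fixed (illustrative) design matrix
beta = rng.normal(size=p)          # "true" coefficients, made up for this check
ZtZ_inv = np.linalg.inv(Z.T @ Z)

# Monte Carlo estimate of E[ ||beta_hat_ls - beta||^2 ]
sq_errors = []
for _ in range(5000):
    y = Z @ beta + sigma * rng.normal(size=n)
    beta_hat = ZtZ_inv @ Z.T @ y   # least squares solution
    sq_errors.append(np.sum((beta_hat - beta) ** 2))

print("Monte Carlo MSE:        ", np.mean(sq_errors))
print("sigma^2 tr[(Z'Z)^{-1}]: ", sigma**2 * np.trace(ZtZ_inv))
```

The two printed numbers should agree up to Monte Carlo error.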
Will \(\hat{f}(\mathbf{z})\) fit future observations well?
Just because \(\hat{f}(\mathbf{z})\) fits our data well, this doesn’t mean that it will be a good fit to new data
In fact, suppose that we take new measurements \(y_i'\) at the same \(\mathbf{z}_i\)’s: \[(\mathbf{z}_1, y_1'), (\mathbf{z}_2, y_2'), \ldots, (\mathbf{z}_n, y_n')\]
So if \(\hat{f}(\cdot)\) is a good model, then \(\hat{f}(\mathbf{z}_i)\) should also be close to the new target \(y_i'\)
This is the notion of prediction error (PE)
Prediction error and the bias-variance tradeoff
So good estimators should, on average, have small prediction errors
Let’s consider the PE at a particular target point \(\mathbf{z}_0\); it decomposes into irreducible noise, squared bias, and variance: \[PE(\mathbf{z}_0) = \mathop{\mathbb{E}}\left[(Y_0 - \hat{f}(\mathbf{z}_0))^2\right] = \sigma^2 + \mathrm{Bias}^2\big(\hat{f}(\mathbf{z}_0)\big) + \mathrm{Var}\big(\hat{f}(\mathbf{z}_0)\big),\] where \(Y_0 = f(\mathbf{z}_0) + \epsilon\) is a new observation at \(\mathbf{z}_0\)
From here on, \(\mathbf{Z}\) is assumed to be standardized (each column has mean 0 and unit variance)
\(\mathbf{y}\) is assumed to be centered (mean 0)
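For concreteness, this preprocessing can be done as in the minimal sketch below (the function name is ours, not from the lecture):

```python
import numpy as np

def standardize(Z, y):
    """Center and scale each column of Z to mean 0 / unit variance; center y."""
    Z_std = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    y_ctr = y - y.mean()
    return Z_std, y_ctr
```

After this step, no intercept term is needed in the penalized fits that follow.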
Ridge regression: \(l_2\)-penalty
Can write the ridge-constrained problem as the following penalized residual sum of squares (PRSS): \[PRSS(\boldsymbol\beta)_{l_2} = \sum_{i=1}^n (y_i - \mathbf{z}_i^\intercal \boldsymbol\beta)^2 + \lambda \sum_{j=1}^p \beta_j^2 = (\mathbf{y}-\mathbf{Z}\boldsymbol\beta)^\intercal (\mathbf{y}-\mathbf{Z}\boldsymbol\beta) + \lambda \lVert \boldsymbol\beta \rVert_2^2\]
Its solution may have smaller average PE than \(\hat{\boldsymbol\beta}^{ls}\)
\(PRSS(\boldsymbol\beta)_{l_2}\) is convex, and hence has a unique solution
Taking derivatives, we obtain: \[\frac{\partial PRSS(\boldsymbol\beta)_{l_2}}{\partial \boldsymbol\beta} = -2\mathbf{Z}^\intercal (\mathbf{y}-\mathbf{Z}\boldsymbol\beta)+2\lambda\boldsymbol\beta\]
Setting this to zero gives the system \((\mathbf{Z}^\intercal \mathbf{Z} + \lambda \mathbf{I}_p)\boldsymbol\beta = \mathbf{Z}^\intercal \mathbf{y}\)
The ridge solutions
The minimizer of \(PRSS(\boldsymbol\beta)_{l_2}\) is now seen to be: \[\hat{\beta}_\lambda^{ridge} = (\mathbf{Z}^\intercal \mathbf{Z} + \lambda \mathbf{I}_p)^{-1} \mathbf{Z}^\intercal \mathbf{y}\]
Remember that Z is standardized
y is centered
Solution is indexed by the tuning parameter λ (more on this later)
Inclusion of λ makes problem non-singular even if \(\mathbf{Z}^\intercal \mathbf{Z}\) is not invertible
This was the original motivation for ridge regression (Hoerl and Kennard, 1970)
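A minimal numpy sketch of this closed-form solution (assuming \(\mathbf{Z}\) is already standardized and \(\mathbf{y}\) centered; the function name is ours):

```python
import numpy as np

def ridge_solution(Z, y, lam):
    """Compute (Z'Z + lam * I_p)^{-1} Z'y, the ridge estimate for a given lambda."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
```

With lam = 0 (and \(\mathbf{Z}^\intercal \mathbf{Z}\) invertible) this reduces to the least squares solution; for any lam > 0 the linear system is non-singular even when \(\mathbf{Z}^\intercal \mathbf{Z}\) is not invertible.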
Tuning parameter λ
Notice that the solution is indexed by the parameter λ
So for each λ, we have a solution
Hence, the λ’s trace out a path of solutions (see next page)
λ is the shrinkage parameter
λ controls the size of the coefficients
λ controls amount of regularization
As λ decreases to 0, we obtain the least squares solution: \(\hat{\beta}_{\lambda=0}^{ridge} = \hat{\beta}^{ls}\)
As λ increases to ∞, we have \(\hat{\beta}_{\lambda=\infty}^{ridge} = \mathbf{0}\) (intercept-only model)
Ridge coefficient paths
The λ’s trace out a set of ridge solutions, as illustrated below
Ridge coefficient path for the diabetes data set found in the lars library in R.
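The figure itself is not reproduced here, but a path like it can be computed on any standardized data set by solving the ridge problem over a grid of λ values; a minimal sketch (the λ grid and function name are illustrative):

```python
import numpy as np

def ridge_path(Z, y, lambdas):
    """Return an array of shape (len(lambdas), p): one row of ridge coefficients per lambda."""
    p = Z.shape[1]
    return np.array([np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)
                     for lam in lambdas])

# e.g. lambdas = np.logspace(-2, 4, 50); plotting each coefficient (column)
# against log(lambda) gives a coefficient-path plot like the one described above
```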
Choosing λ
Need disciplined way of selecting λ
That is, we need to “tune” the value of λ
In their original paper, Hoerl and Kennard introduced ridge traces:
Plot the components of \(\hat{\beta}_\lambda^{ridge}\) against λ
Choose λ for which the coefficients are not rapidly changing and have “sensible” signs
No objective basis; heavily criticized by many
Standard practice now is to use cross-validation (next lecture!)
A few notes on ridge regression
The regularization decreases the effective degrees of freedom of the model
So you still cannot fit a model with more effective degrees of freedom than data points
This can be shown by examination of the smoother matrix
We won’t do this—it’s a complicated argument
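For reference only: the smoother matrix for ridge is \(\mathbf{H}_\lambda = \mathbf{Z}(\mathbf{Z}^\intercal \mathbf{Z} + \lambda \mathbf{I}_p)^{-1}\mathbf{Z}^\intercal\), and the effective degrees of freedom is its trace, which can be computed from the singular values of \(\mathbf{Z}\). A short sketch of the computation (not the full argument):

```python
import numpy as np

def ridge_df(Z, lam):
    """Effective degrees of freedom of ridge: trace of H = Z (Z'Z + lam I)^{-1} Z',
    which equals sum_j d_j^2 / (d_j^2 + lam) over the singular values d_j of Z."""
    d = np.linalg.svd(Z, compute_uv=False)
    return np.sum(d**2 / (d**2 + lam))
```

For λ > 0 this is strictly less than rank(\(\mathbf{Z}\)) and decreases toward 0 as λ grows.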
How do we choose λ?
We need a disciplined way of choosing λ
Obviously want to choose λ that minimizes the mean squared error
Issue is part of the bigger problem of model selection
K-Fold Cross-Validation
A common method to determine \(\lambda\) is K-fold cross-validation.
We will discuss this next lecture.
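As a preview, a bare-bones sketch of the idea (the fold assignment, K, and λ grid below are illustrative choices; strictly, the standardization of \(\mathbf{Z}\) and centering of \(\mathbf{y}\) should be redone within each training fold, a detail glossed over here):

```python
import numpy as np

def cv_ridge(Z, y, lambdas, K=10, seed=0):
    """Pick lambda by K-fold cross-validation of squared prediction error."""
    n, p = Z.shape
    folds = np.random.default_rng(seed).permutation(n) % K   # random fold labels
    cv_err = []
    for lam in lambdas:
        fold_errs = []
        for k in range(K):
            train, test = folds != k, folds == k
            beta = np.linalg.solve(Z[train].T @ Z[train] + lam * np.eye(p),
                                   Z[train].T @ y[train])
            fold_errs.append(np.mean((y[test] - Z[test] @ beta) ** 2))
        cv_err.append(np.mean(fold_errs))
    best = int(np.argmin(cv_err))
    return lambdas[best], np.array(cv_err)
```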
Plot of CV errors and standard error bands
Cross-validation errors from a ridge regression example on spam data.
The LASSO
The LASSO: \(l_1\) penalty
Tibshirani (Journal of the Royal Statistical Society, Series B, 1996) introduced the LASSO: the least absolute shrinkage and selection operator
LASSO coefficients are the solutions to the \(l_1\)-constrained optimization problem: \[\mathrm{minimize}\; (\mathbf{y}-\mathbf{Z}\boldsymbol\beta)^T (\mathbf{y}-\mathbf{Z}\boldsymbol\beta)\; \mathrm{s.t.}\; \sum_{j=1}^p \lvert \beta_j \rvert \leq t\]
This is equivalent to the penalized (Lagrangian) form: \[PRSS(\boldsymbol\beta)_{l_1} = \sum_{i=1}^n (y_i - \mathbf{z}_i^T \boldsymbol\beta)^2 + \lambda \sum_{j=1}^p \lvert \beta_j \rvert = (\mathbf{y}-\mathbf{Z}\boldsymbol\beta)^T (\mathbf{y}-\mathbf{Z}\boldsymbol\beta) + \lambda\lVert \boldsymbol\beta \rVert_1\]
λ (or t) as a tuning parameter
Again, we have a tuning parameter λ that controls the amount of regularization
One-to-one correspondence with the threshold t:
recall the constraint: \[\sum_{j=1}^p \lvert \beta_j \rvert \leq t\]
Hence, have a “path” of solutions indexed by \(t\)
If \(t \geq t_0 = \sum_{j=1}^p \lvert \hat{\beta}_j^{ls} \rvert\) (equivalently, λ = 0), we obtain no shrinkage (and hence obtain the LS solutions as our solution)
Often, the path of solutions is indexed by the shrinkage fraction \(s = t/t_0\)
Sparsity and exact zeros
Often, we believe that many of the \(\beta_j\)’s should be 0
Hence, we seek a set of sparse solutions
Large enough \(\lambda\) (or small enough t) will set some coefficients exactly equal to 0!
So LASSO will perform model selection for us!
Computing the LASSO solution
Unlike ridge regression, \(\hat{\beta}^{lasso}_{\lambda}\) has no closed-form solution
Original implementation involves quadratic programming techniques from convex optimization
But Efron et al. (Annals of Statistics, 2004) proposed LARS (least angle regression), which computes the LASSO path efficiently
An interesting modification is called forward stagewise
In many cases it is the same as the LASSO solution
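LARS itself is too long for a short sketch, but the penalized form of the LASSO objective can also be minimized by cyclic coordinate descent with soft-thresholding, a standard alternative to the methods above. The toy version below (assuming \(\mathbf{Z}\) standardized and \(\mathbf{y}\) centered, with a fixed iteration count and no convergence check) also illustrates that a large enough λ sets coefficients exactly to 0:

```python
import numpy as np

def soft_threshold(x, t):
    """sign(x) * max(|x| - t, 0)"""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(Z, y, lam, n_iter=200):
    """Minimize (y - Zb)'(y - Zb) + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = Z.shape
    beta = np.zeros(p)
    col_ss = np.sum(Z**2, axis=0)                      # ||Z_j||^2 for each column
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - Z @ beta + Z[:, j] * beta[j]     # partial residual without feature j
            beta[j] = soft_threshold(Z[:, j] @ r_j, lam / 2.0) / col_ss[j]
    return beta
```

With a moderate to large λ, many entries of the returned vector are exactly 0, which is the model-selection behaviour described above; for serious use one would rely on LARS or a tuned coordinate-descent implementation rather than this sketch.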