In this project, you have two options for the general route you can take:
More details in final project guidelines.
The proposal should be less than two pages and describe the following items:
We are available to discuss your ideas whenever you are ready, and you should discuss your idea with us prior to submitting your proposal.
We recommend an early start; the earlier you finalize a proposal, the sooner you can begin the project.
The process of fitting is nothing more than finding the maximum likelihood model for a set of points.
The key factor is how one defines the problem—i.e. how the distribution is described.
Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity. ~George Box, J American Stat Assoc, 1976
Fitting does not happen in a vacuum!
If we have a vector of \(n\) observations \(\mathbf{y}\), our predictions are going to follow the form:
\[ \mathbf{y} = \mathbf{X} \beta + \mathbf{\epsilon} \]
Here \(\mathbf{X}\) is a \(n \times p\) structure matrix, \(\beta\) is a p-dimensional vector with the parameters of our model, and \(\mathbf{\epsilon} = \left( \epsilon_1, \epsilon_2, ... \epsilon_n \right)'\) is the noise present in the model.
\(\mathbf{\epsilon}\) is usually assumed to be a vector of uncorrelated random components with constant variance \(\sigma^2\):
\[ \mathbf{\epsilon} \sim \left( \mathbf{0}, \sigma^2 \mathbf{I} \right) \]
The structure matrix is little more than the data, sometimes transformed, usually with an offset. So, another way to write:
\[ \mathbf{y} = \mathbf{X} \beta + \mathbf{\epsilon} \]
would be:
\[ \mathbf{y} = m_1 \mathbf{x}_1 + m_2 \mathbf{x}_2 + \ldots + b + \mathbf{\epsilon} \]
or, with a single predictor:
\[ \mathbf{y} = m \mathbf{x} + b + \mathbf{\epsilon} \]
The values of \(m\) and \(b\) that minimize the distance from \(\mathbf{y}\) are the optimal estimates, and finding them does not require knowing \(\epsilon\).
Gauss (in the early 1800s) and later Markov showed that the least squares estimate of \(\beta\),
\[ \hat{\mathbf{\beta}} = \arg\min_{\beta}{\left\Vert \mathbf{y} - \mathbf{X} \beta \right\Vert^{2}} \]
can be directly calculated by:
\[ \hat{\beta} = \mathbf{S}^{-1} \mathbf{X}' \mathbf{y} \]
where
\[ \mathbf{S} = \mathbf{X}'\mathbf{X} \]
\(\hat{\beta}\) is the maximum likelihood estimate of \(\beta\), and has covariance matrix \(\sigma^2 \mathbf{S}^{-1}\):
\[\hat{\beta}\sim\left(\beta,\sigma^2 \mathbf{S}^{-1}\right)\]
In the normal case (when our assumptions hold), \(\hat{\beta}\) is an unbiased estimator of \(\beta\). Making these calculations tractable for larger data sets used to be a challenge but is now trivial.
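A minimal numpy sketch of the closed-form solution above, using simulated data for illustration (all names here are illustrative):

import numpy as np

# Simulated data: one predictor plus an offset (intercept) column
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
X = np.column_stack([x, np.ones(n)])       # structure matrix
beta_true = np.array([2.0, 1.0])           # "true" slope and intercept
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Least squares estimate: beta_hat = S^{-1} X' y, with S = X'X
S = X.T @ X
beta_hat = np.linalg.solve(S, X.T @ y)     # solve rather than form the inverse

# Estimated covariance of beta_hat: sigma^2 * S^{-1}
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])
cov_beta_hat = sigma2_hat * np.linalg.inv(S)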
Where does the least squares criterion come from? If the noise is normally distributed, the log-likelihood of the parameters is, up to an additive constant:
\[ \frac{-n}{2} \log(\sigma^{2}) - \frac{1}{2 \sigma^{2}} \sum_{i=1}^{n} (y_{i}-x_{i} \beta)^{2} \]
Therefore, considering only \(\beta\) (the only factor that influences the predictions), maximizing the likelihood amounts to minimizing:
\[ \sum_{i=1}^{n} (y_{i}-x_{i} \beta)^{2} \]
Exactly the criterion we minimize to calculate \(\hat{\beta}\)!
What might be some advantages of a method such as this?
What are some of the assumptions?
What are the implications of these assumptions not holding?
What are some downsides?
By Rdbickel - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=49599354
Kong et al, Nature, 2012
sklearn provides a very basic function for ordinary least squares: sklearn.linear_model.LinearRegression.

- fit_intercept: Should an intercept value be fit?
- normalize: Should the input variables be mean and variance scaled?

Or there's an even more bare function in numpy, numpy.linalg.lstsq, which simply takes a and b.

from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes
from matplotlib import pyplot as plt
lr = LinearRegression()
data = load_diabetes()
X, y = data.data, data.target   # features and measured target
lr.fit(X, y)
predicted = lr.predict(X)
# Plot measured vs. predicted values
fig, ax = plt.subplots()
ax.scatter(y, predicted, edgecolors=(0, 0, 0))
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
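For comparison, a minimal sketch of the same fit with numpy.linalg.lstsq, reusing the data and lr objects from the example above (lstsq does not add an intercept column for you):

import numpy as np

# numpy.linalg.lstsq solves min ||a @ coef - b||^2, so append a column of ones
# to the inputs to estimate an intercept as well
a = np.column_stack([data.data, np.ones(len(data.target))])
b = data.target
coef, residuals, rank, sv = np.linalg.lstsq(a, b, rcond=None)
# coef[:-1] should match lr.coef_, and coef[-1] should match lr.intercept_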
Non-Linear Least Squares makes similar assumptions to ordinary least squares, but for arbitrary functions. Thus, instead of following the form:
\[ \mathbf{y} = \mathbf{X} \beta + \mathbf{\epsilon} \]
Our input-output relationship is:
\[ \mathbf{y} = f(\mathbf{X}, \beta) + \mathbf{\epsilon} \]
for the same construction of \(\mathbf{\epsilon}\).
Non-linear least squares fitting used to be mostly performed by transforming one's data into a linear model.
E.g. taking the ratio of variables, or log-transforming them.
This is now considered bad practice.
Why?
We again need to solve for \(\beta\) to minimize the sum of squared error:
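\[ S = \sum_{i} r_{i}^{2} = \sum_{i} \left( y_{i} - f(x_{i}, \beta) \right)^{2} \]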
One property we can take advantage of is that the gradient of the SSE w.r.t. \(\beta\) at the minimum is zero (\(r_i\) is the residual of the ith point):
\[ {\frac {\partial S}{\partial \beta_j}}=2\sum_i r_i{\frac {\partial r_i}{\partial \beta_j}}=0 \]
SciPy's scipy.optimize.least_squares is a very capable implementation. It takes:

- fun, the function
- x0, an initial guess for the parameter values

fun should return a vector of the residuals.
Let's say we're looking at a protein-protein interaction, such as a ligand binding to its receptor.
We can predict the amount of binding we'd observe from a single-site binding model. SciPy asks for a vector of residuals at each fitting point, so we need to convert the model prediction into residuals:
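A minimal sketch of what an objective function like ls_obj_k1 might look like, assuming a one-parameter single-site binding model with dissociation constant \(K_d\), and assuming X (concentrations) and Y (measured binding) are numpy arrays defined elsewhere (binding_model is an illustrative helper, not part of SciPy):

import scipy as sp
import scipy.optimize   # makes sp.optimize available for the call below

def binding_model(Kd, X):
    # Predicted fraction bound for single-site binding at concentration X
    return X / (Kd + X)

def ls_obj_k1(Kd, X, Y):
    # least_squares expects a vector of residuals, not a summed cost
    return binding_model(Kd, X) - Y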
sp.optimize.least_squares(ls_obj_k1, 1., args=(X,Y))
# --------
active_mask: array([ 0.])
cost: 0.0086776496708916573
fun: array([ 4.79e-05, 9.00e-05, -1.09e-04,
8.04e-04, -9.67e-04, 3.85e-03,
4.61e-03, 2.34e-03, 2.36e-02,
9.64e-03, -2.48e-02, 1.93e-02,
-4.93e-02, 5.54e-02, -3.66e-02,
2.97e-03, 3.39e-02, -8.74e-02])
grad: array([ -9.57228474e-09])
jac: array([[-0.00099809],
[-0.00199235],
[-0.0039695 ],
# ...
[-0.03119024],
[-0.01608763],
[-0.00817133]])
message: '`gtol` termination condition is satisfied.'
nfev: 4
njev: 4
optimality: 9.5722847420895082e-09
status: 1
success: True
x: array([ 0.95864059])