In this project, you have two options for the general route you can take; more details are in the final project guidelines.
Meiosis of a cell with 2n=4 chromosomes.
Kong et al, Nature, 2012
Table. De novo mutations observed with parental origin assigned (Kong et al, Nature, 2012).
| Trio | Father’s age (years) | Mother’s age (years) | Paternal chromosome | Maternal chromosome | Combined |
|---|---|---|---|---|---|
| Trio 1 | 21.8 | 19.3 | 39 | 9 | 48 |
| Trio 2 | 22.7 | 19.8 | 43 | 10 | 53 |
| Trio 3 | 25.0 | 22.1 | 51 | 11 | 62 |
| Trio 4 | 36.2 | 32.2 | 53 | 26 | 79 |
| Trio 5 | 40.0 | 39.1 | 91 | 15 | 106 |
| Mean | 29.1 | 26.5 | 55.4 | 14.2 | 69.6 |
| s.d. | 8.4 | 8.8 | 20.7 | 7.0 | 23.5 |
The process of fitting is choosing parameter values so that a model explains the observed data as well as possible.
The key factor is how one defines the problem: that is, how the distribution of the data is described.
> Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.
>
> ~ George Box, J. American Stat. Assoc., 1976
Tyler Vigen. Spurious correlation.
Fitting does not happen in a vacuum!
If we have a vector of \(n\) observations \(\mathbf{y}\), our predictions are going to follow the form:
\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \]
Here:
\[ \boldsymbol{\epsilon} \sim \left( \mathbf{0}, \sigma^2 \mathbf{I} \right) \]
OLS picks the line, plane, or hyperplane that makes predictions as close as possible to the observed data in total squared distance.
For a single input variable, ordinary least squares is just:
\[ \mathbf{y} = m \mathbf{x} + \beta_0 + \boldsymbol{\epsilon} \]
Here, \(m\) is the slope, \(\beta_0\) is the intercept, and \(\boldsymbol{\epsilon}\) is the noise.
The design matrix \(\mathbf{X}\) is little more than the data, sometimes transformed, usually with an intercept (offset). So, another way to write:
\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \]
would be:
\[ \mathbf{y} = m_1 \mathbf{x}_1 + m_2 \mathbf{x}_2 + \ldots + \beta_0 + \boldsymbol{\epsilon} \]
The optimal values of \(m\) and \(\beta_0\) are those that minimize the sum of squared error (SSE).
Consider including the intercept \(\beta_0\) by augmenting the design matrix \(\mathbf{X}\) and parameter vector \(\boldsymbol{\beta}\):
\[ \mathbf{X} \boldsymbol{\beta} = \left(\mathbf{1} \; \mathbf{x}_1 \; \mathbf{x}_2 \; \cdots \; \mathbf{x}_p \right) \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix} \]
With this augmentation, the standard notation now includes the intercept:
\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \]
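The augmentation above can be sketched in NumPy (the variable names and toy data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=5)
x2 = rng.normal(size=5)

# Prepend a column of ones so that beta_0 acts as the intercept.
X = np.column_stack([np.ones(len(x1)), x1, x2])
print(X.shape)  # (5, 3)
```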
Gauss, in the early 1800s, identified that the least squares estimate of \(\boldsymbol{\beta}\), \(\hat{\boldsymbol{\beta}}\), is obtained by minimizing the total squared error:
\[ \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}}{\left\Vert \mathbf{y} - \mathbf{X} \boldsymbol{\beta} \right\Vert^{2}} \]
When \(\mathbf{X}^\top \mathbf{X}\) is invertible, this can be directly calculated by:
\[ \hat{\boldsymbol{\beta}} = \mathbf{S}^{-1} \mathbf{X}^\top \mathbf{y} \]
where
\[ \mathbf{S} = \mathbf{X}^\top \mathbf{X} \]
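A minimal sketch of this calculation on synthetic data, solving \(\mathbf{S}\boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}\) directly rather than forming \(\mathbf{S}^{-1}\) explicitly (the numerically preferred route):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

S = X.T @ X                             # S = X^T X
beta_hat = np.linalg.solve(S, X.T @ y)  # solves S beta = X^T y

# Cross-check against NumPy's bare-bones least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```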
If we further assume the noise is Gaussian, \(\boldsymbol{\epsilon} \sim N\left( \mathbf{0}, \sigma^2 \mathbf{I} \right)\), the log-likelihood of the observations is:

\[ \log L(\boldsymbol{\beta},\sigma^{2}) = -\frac{n}{2} \log(2\pi\sigma^{2}) - \frac{1}{2 \sigma^{2}} \sum_{i=1}^{n} (y_{i}-\mathbf{x}_{i}^\top \boldsymbol{\beta})^{2} \]
Under the Gaussian noise assumption, maximizing the likelihood is therefore equivalent to minimizing:
\[ \sum_{i=1}^{n} (y_{i}-\mathbf{x}_{i}^\top \boldsymbol{\beta})^{2} \]
So least squares is also the maximum likelihood estimate under this model.
\(\hat{\boldsymbol{\beta}}\) is the maximum likelihood estimate of \(\boldsymbol{\beta}\), and has covariance \(\sigma^2 \mathbf{S}^{-1}\):
\[\hat{\boldsymbol{\beta}}\sim{}N\left(\boldsymbol{\beta},\sigma^2 \mathbf{S}^{-1}\right)\]
In the normal case (when our assumptions hold), \(\hat{\boldsymbol{\beta}}\) is an unbiased estimator of \(\boldsymbol{\beta}\). Making these calculations tractable for larger data sets used to be a challenge but is now trivial.
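A sketch of using this covariance to get standard errors on synthetic data, with \(\sigma^2\) estimated from the residuals (the \(n - p\) denominator gives the usual unbiased estimate, a detail not stated above):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.5])
y = X @ beta_true + 0.3 * rng.normal(size=n)

S = X.T @ X
beta_hat = np.linalg.solve(S, X.T @ y)

# Estimate sigma^2 from the residuals, then plug into Cov = sigma^2 S^-1.
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])
cov_beta = sigma2_hat * np.linalg.inv(S)
se = np.sqrt(np.diag(cov_beta))
print(se)  # standard errors for the intercept and slope
```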
scikit-learn provides a very basic implementation of ordinary least squares: `sklearn.linear_model.LinearRegression`.

- `fit_intercept`: Should an intercept value be fit?

There is an even more bare-bones function in NumPy: `numpy.linalg.lstsq`.
```python
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes
from matplotlib import pyplot as plt

lr = LinearRegression()
data = load_diabetes()
y = data.target
lr.fit(data.data, y)  # X, y
predicted = lr.predict(data.data)

fig, ax = plt.subplots()
ax.scatter(y, predicted, edgecolors=(0, 0, 0))
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
```

This plot shows training-set fit only. It is useful for illustration, but it does not tell us how well the model will generalize to new data.
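One common way to probe generalization is a held-out test split; a sketch using scikit-learn (the 25% split and the `r2_score` metric are choices made here, not from the notes):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0
)

lr = LinearRegression().fit(X_train, y_train)
# Score on data the model never saw during fitting.
score = r2_score(y_test, lr.predict(X_test))
print(score)
```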
Non-Linear Least Squares makes similar assumptions to ordinary least squares, but for arbitrary functions. Thus, instead of following the form:
\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon} \]
Our input-output relationship is:
\[ \mathbf{y} = f(\mathbf{X}, \boldsymbol{\beta}) + \boldsymbol{\epsilon} \]
for the same construction of \(\boldsymbol{\epsilon} \sim \left( \mathbf{0}, \sigma^2 \mathbf{I} \right)\).
If we believe the underlying relationship is non-linear (e.g., curved), it is usually better to fit that curved model directly than to force the data into a linear form by transformation.
We again need to solve for \(\boldsymbol{\beta}\) to minimize the sum of squared error (SSE):
One property we can take advantage of is that the gradient of the SSE, \(S = \sum_i r_i^2\), with respect to \(\boldsymbol{\beta}\) is zero at the minimum (\(r_i\) is the residual of the \(i\)th point):
\[ {\frac {\partial S}{\partial \beta_j}}=2\sum_i r_i{\frac {\partial r_i}{\partial \beta_j}}=0 \]
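For the linear model, \(\partial r_i / \partial \beta_j = -x_{ij}\), so this condition reduces to \(\mathbf{X}^\top \mathbf{r} = \mathbf{0}\); a quick numerical check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=30)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta_hat

# For the linear model dr_i/dbeta_j = -x_ij, so the gradient of the SSE
# is -2 X^T r, which should vanish at the least-squares solution.
grad = -2 * X.T @ r
print(np.allclose(grad, 0))  # True
```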
SciPy’s `scipy.optimize.least_squares` is a very capable implementation. Its key arguments are:

- `fun`, the objective function, which should return a vector of the residuals
- `x0`, an initial guess for the parameter values
Let’s say we’re looking at a protein-protein interaction such as this:
Let \(X\) denote ligand concentration and let \(\beta = k_1\) denote the association constant. We can predict the amount of binding we’d observe from a single-site binding model.
\[ y = f(X, \beta) + \epsilon \]
where,
\[ f(X, \beta) = \frac{X\beta}{1 + X\beta} \]
We can define a custom function: \(f([L], k) = \frac{k[L]}{1 + k[L]}\)

```python
def klotz1(k1, lig):
    """
    Calculates the fractional binding for a single-site receptor model.

    Args:
        k1: The association constant. Higher values indicate stronger
            binding affinity.
        lig: The concentration of the free ligand.

    Returns:
        The predicted fractional occupancy (Y), a value between 0.0
        (no binding) and 1.0 (100% saturation).
    """
    return (k1 * lig) / (1 + k1 * lig)
```

SciPy asks for the residuals at each fitting point, so we need to convert a prediction to that:
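The objective `ls_obj_k1` used in the call below is not defined in these notes; a plausible sketch, consistent with the call signature and using synthetic data in place of the real `X` and `Y`:

```python
import numpy as np
import scipy.optimize

def klotz1(k1, lig):
    """Fractional occupancy for a single-site binding model."""
    return (k1 * lig) / (1 + k1 * lig)

def ls_obj_k1(k1, lig, Yobs):
    # least_squares wants a vector of residuals, not the summed error.
    return Yobs - klotz1(k1, lig)

# Synthetic data standing in for the real X and Y.
X = np.logspace(-2, 2, 18)
Y = klotz1(1.0, X) + 0.01 * np.random.default_rng(3).normal(size=X.size)

res = scipy.optimize.least_squares(ls_obj_k1, 1.0, args=(X, Y))
print(res.x)  # fitted association constant, near 1.0
```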
```python
sp.optimize.least_squares(ls_obj_k1, 1., args=(X, Y))
```

```
 active_mask: array([ 0.])
        cost: 0.0086776496708916573
         fun: array([ 4.79e-05,  9.00e-05, -1.09e-04,  8.04e-04, -9.67e-04,
                      3.85e-03,  4.61e-03,  2.34e-03,  2.36e-02,  9.64e-03,
                     -2.48e-02,  1.93e-02, -4.93e-02,  5.54e-02, -3.66e-02,
                      2.97e-03,  3.39e-02, -8.74e-02])
        grad: array([ -9.57228474e-09])
         jac: array([[-0.00099809],
                     [-0.00199235],
                     [-0.0039695 ],
                     # ...
                     [-0.01608763],
                     [-0.00817133]])
     message: '`gtol` termination condition is satisfied.'
        nfev: 4
        njev: 4
  optimality: 9.5722847420895082e-09
      status: 1
     success: True
           x: array([ 0.95864059])
```

Note that, on a log scale, the single-site binding model is a logistic function of \(\ln(k) + \ln([L])\):

\[ \begin{aligned} y &= \frac{k[L]}{1 + k[L]} = \frac{1}{1 + (k[L])^{-1}} \\ &= \frac{1}{1 + \left( \exp(\ln(k)) \cdot \exp(\ln([L])) \right)^{-1}} \\ &= \frac{1}{1 + \exp \left( -\left( \ln(k) + \ln([L]) \right) \right)} \end{aligned} \]
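As a numerical sanity check of this algebra, the binding model and the logistic form should agree everywhere (the \(k = 0.96\) value here is illustrative):

```python
import numpy as np

def klotz1(k1, lig):
    return (k1 * lig) / (1 + k1 * lig)

def logistic_form(k1, lig):
    # y = 1 / (1 + exp(-(ln k + ln [L])))
    return 1.0 / (1.0 + np.exp(-(np.log(k1) + np.log(lig))))

conc = np.logspace(-3, 3, 50)
print(np.allclose(klotz1(0.96, conc), logistic_form(0.96, conc)))  # True
```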