If we can find a quantitative correlation between the input and output, we can predict new outcomes for measurements we haven’t yet seen.
Challenges with Univariate Relationships
Janes, et al. Science, 2005
The relationship between JNK activation and apoptosis appears to be highly context-dependent
Univariate relationships are often insufficient
Cells respond to an environment with multiple factors present
Notes about Methods Today
Both methods are supervised learning methods; however, they have a number of properties that distinguish them from the other methods we will discuss.
Learning about PLS is more difficult than it should be, partly because papers describing it span areas of chemistry, economics, medicine and statistics, with little agreement on terminology and notation.
These methods will show one example of where the model and algorithm are quite distinct—there are multiple algorithms for calculating a PLSR model.
Multi-Linear Regression (MLR)
In biology we often have multiple signals and multiple responses that were measured: \[ y_1 =a_1x_1 + b_1x_2 + e_1 \]\[ y_2 =a_2x_1 + b_2x_2 + e_2 \]
This can be written more concisely in matrix notation as: \[ Y = XB + E \]
Where \(Y\) is an \(n \times p\) matrix and \(X\) is an \(n \times m\) matrix; minimizing \(E\) and solving for \(B\): \[ B = (X^tX)^{-1}X^tY \]
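The closed-form solution above is easy to verify numerically. A minimal NumPy sketch with synthetic data (the matrix sizes, coefficients, and noise level are illustrative assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n = 50 observations, m = 2 signals, p = 2 responses
n, m, p = 50, 2, 2
X = rng.normal(size=(n, m))
B_true = np.array([[1.0, 2.0],
                   [3.0, -1.0]])          # m x p coefficient matrix
Y = X @ B_true + 0.01 * rng.normal(size=(n, p))

# Closed-form least-squares solution: B = (X^T X)^{-1} X^T Y
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# np.linalg.lstsq solves the same problem and is numerically safer
B_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Both routes recover essentially the same coefficient matrix when \(X^tX\) is well-conditioned.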
Underdetermined Systems
If \(n\) observations and \(m\) variables:
\(m<n\): no exact solution, least-squares solution possible
\(m=n\): one solution
\(m>n\): no unique solution unless we delete variables, since \(X^tX\) cannot be inverted
\(m>n\) is often the case in systems biology!
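The rank deficiency behind the \(m>n\) case is easy to see numerically: \(X^tX\) is \(m \times m\) but its rank cannot exceed \(n\). A quick sketch (the sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# m > n: more measured variables than observations
n, m = 5, 10
X = rng.normal(size=(n, m))

# X^T X is m x m, but its rank is at most n, so it cannot be inverted
XtX = X.T @ X
rank = np.linalg.matrix_rank(XtX)
```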
If a model is underdetermined with multiple solutions, there are two general approaches we can take:
Regularization: We can use other information we know to focus on one answer
Sampling: We can look at all possible models
Regularization
Today we will use regularization.
We will assume that the larger variation in the data is more meaningful, and correspondingly that smaller changes are less important.
This is a modeling choice, and it must be appropriate for the biological question at hand.
Principal Components Regression (PCR)
One solution is to use the concepts from PCA to reduce dimensionality.
Regress Y against the scores (scores describe observations; by using them we link X and Y for each observation)
\[Y = TB + E\]
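PCR can be sketched directly in NumPy: compute the PCA scores \(T\) from the SVD of centered X, then regress y on those scores, fitting only \(k\) coefficients instead of \(m\). The data sizes and choice of \(k\) below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic, underdetermined problem: n = 20 observations, m = 50 variables
n, m = 20, 50
X = rng.normal(size=(n, m))
y = X @ rng.normal(size=m) + 0.1 * rng.normal(size=n)

# Center X and obtain PCA scores T from the SVD
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 3                          # number of components kept (a modeling choice)
T = U[:, :k] * s[:k]           # n x k score matrix

# Regress y on the scores: y ≈ T b  (k unknowns instead of m)
b, *_ = np.linalg.lstsq(T, y - y.mean(), rcond=None)
y_hat = T @ b + y.mean()
```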
Challenge
How might we determine the number of components using prediction accuracy?
Potential Problem
PCs for the X matrix do not necessarily capture the X-variation that is important for Y
The Y-relevant variation may end up in later components, making them more important to the regression than the first few
Example: the first components capture signaling related to another cell fate, while the signals that co-vary with this particular y are buried in later components
How might we handle this differently?
PLSR
PLSR - NIPALs with Scores Exchanged
Components in PLSR and PCA Differ
Determining the Number of Components
Variants of PLSR
Discriminant PLSR
Tensor PLSR
Application
Practical Notes
PCR
sklearn does not implement PCR directly
It can be implemented by chaining sklearn.decomposition.PCA and sklearn.linear_model.LinearRegression
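The chaining can be done with a pipeline; a minimal sketch on synthetic low-rank data (the data and the choice of three components are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(6)

# Synthetic data whose variation lies in a 3-dimensional latent space
n, m = 100, 20
L = rng.normal(size=(n, 3))
X = L @ rng.normal(size=(3, m)) + 0.05 * rng.normal(size=(n, m))
y = L @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.normal(size=n)

# PCR as a pipeline: project onto the first k PCs, then regress
pcr = make_pipeline(PCA(n_components=3), LinearRegression())
pcr.fit(X, y)
r2 = pcr.score(X, y)
```

Wrapping the two steps in a pipeline keeps the projection and the regression fitted together, which also makes cross-validation straightforward.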