Dimensionality Reduction: PCA and NMF

Author

Aaron Meyer & Yosuke Tanigawa

High-dimensional datasets

Dealing with many variables

  • So far we’ve largely concentrated on cases in which we have relatively large numbers of measurements for a few variables
    • This is frequently referred to as \(n > p\)
  • Two other extremes are important
    • Many observations and many variables
    • Many variables but few observations (\(p > n\))
  • Write out a big table of variables for people:
    • Age, HR, BP, Height, Weight, Income, Disposable income, Body fat %, 1 mile running time, Birth date, cholesterol level
  • Ask for data reduction methods

Why many variables are hard

Usually, when we’re dealing with many variables, we don’t have a great understanding of how they relate to each other.

  • E.g. if gene X is high, we can’t be sure that gene Y will be high too
  • If we had these relationships, we could reduce the data
    • E.g. if we had variables to tell us it’s 3 pm in Los Angeles, we don’t need one to say it’s daytime

Dimensionality Reduction

Generate a low-dimensional encoding of a high-dimensional space

Purposes:

  • Data compression / visualization
  • Robustness to noise and uncertainty
  • Potentially easier to interpret

Bonus: Many of the other methods from the class can be applied after dimensionality reduction with little or no adjustment!

  • When we aren’t using prediction, there won’t be a clear benchmark for what is best.
  • We will think about these methods both numerically and geometrically.

Matrix factorization

Many dimensionality reduction methods involve matrix factorization

Basic Idea: Find two (or more) matrices whose product best approximates the original matrix

Low rank approximation to original \(N\times M\) matrix:

\[ \mathbf{X} \approx \mathbf{W} \mathbf{H}^\top \]

where \(\mathbf{W}\) is \(N\times R\), \(\mathbf{H}\) is \(M\times R\), and \(R \ll \min(N, M)\).

  • What is the implicit assumption here?
  • In the rank-1 product below, the \(a\) values determine observation (row) associations, and the \(b\) values determine variable (column) associations.
  • So we can think of these as row-wise and column-wise effects.

\[ \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} \begin{bmatrix} b_1 & b_2 & b_3 \end{bmatrix} = \begin{bmatrix} a_1b_1 & a_1b_2 & \ldots \\ a_2b_1 & & \\ a_3b_1 & \ldots & \end{bmatrix} \]
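The rank-1 product above can be checked numerically; this is a small sketch with made-up values for \(a\) and \(b\):

```python
import numpy as np

# Hypothetical values: a holds row (observation) effects, b holds column (variable) effects
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

X = np.outer(a, b)                    # entry (i, j) is a_i * b_j
print(X)
# [[ 4.  5.  6.]
#  [ 8. 10. 12.]
#  [12. 15. 18.]]
print(np.linalg.matrix_rank(X))       # 1: a single outer product is rank one
```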


Visualizing matrix factorization

Diagram illustrating matrix factorization X approx W H^T, showing the dimensions of the original matrix X and the factor matrices W and H^T.

Generalization of many methods (e.g., SVD, QR, CUR, Truncated SVD, etc.)


How to choose the rank, \(R\)?

\[ \mathbf{X} \approx \mathbf{W} \mathbf{H}^\top \]

where \(\mathbf{W}\) is \(N\times R\), \(\mathbf{H}\) is \(M\times R\), and \(R \ll \min(N, M)\).

  • Reducing \(R\) simplifies the model substantially
  • But it also decreases the fidelity of the reconstruction
  • There is always a trade-off
  • Worst case, let’s say N = M = R
    • \(X \approx W H^T\)
    • \(N^2 \rightarrow N^2 + N^2\)
    • Just doubled the number of values to track!
    • But can reconstruct perfectly. We could just make \(W = I\) and \(H = X^T\)
  • Most reduced:
    • \(R = 1\)
    • \(X \approx w \otimes h\)
    • \(N^2 \rightarrow 2N\)
    • Say \(N=100\), we go from 10,000 to 200 values.
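The storage-versus-fidelity trade-off can be sketched with a truncated SVD. The data here are synthetic (a random matrix of true rank 5), so the exact error values are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
X = rng.standard_normal((N, 5)) @ rng.standard_normal((5, N))   # true rank 5

U, s, Vt = np.linalg.svd(X, full_matrices=False)
for R in (1, 2, 5, 10):
    Xr = (U[:, :R] * s[:R]) @ Vt[:R, :]      # best rank-R approximation
    stored = 2 * N * R                       # values kept: N*R in W plus N*R in H
    err = np.linalg.norm(X - Xr) / np.linalg.norm(X)
    print(R, stored, round(err, 4))          # error hits ~0 once R reaches the true rank
```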

Example: image compression

Original image

Original image of a cat.

https://www.aaronschlegel.com/image-compression-principal-component-analysis/

Rank = 3

Compressed image of the cat using 3 components, showing significant quality loss.

https://www.aaronschlegel.com/image-compression-principal-component-analysis/

Rank = 46

Compressed image of the cat using 46 components, showing better quality reconstruction.

https://www.aaronschlegel.com/image-compression-principal-component-analysis/

  • This is about 1% of the original data size.
  • There is always information lost
  • We just hope for it to be information that doesn’t matter
  • We have a good reason to believe in the relevance of low-dimensional structures
    • White noise has the highest entropy.
    • Adjacent pixels, edges, etc. indicate that the data points are sitting in a low-dimensional subspace.
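The white-noise point can be made concrete: compare the singular values of a pure-noise “image” against a smooth, structured one (both synthetic here). The structured image concentrates nearly all its variance in a few components; the noise does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64
noise = rng.standard_normal((n, n))                    # white-noise "image"
xx, yy = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
smooth = np.sin(3 * xx) + np.cos(2 * yy)               # smooth, structured "image"

def top3_frac(A):
    """Fraction of total variance captured by the top 3 singular values."""
    s = np.linalg.svd(A, compute_uv=False)
    return (s[:3] ** 2).sum() / (s ** 2).sum()

print(round(top3_frac(smooth), 3))    # ~1.0: effectively low rank
print(round(top3_frac(noise), 3))     # small: noise needs all components
```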

Applications

Process control

  • Large bioreactor runs may be recorded in a database, along with a variety of measurements from those runs
  • We may be interested in how those different runs varied, and how each factor relates to one another
  • Plotting a compressed version of that data can indicate when an anomalous change is present

Mutational processes

  • Anytime multiple contributory factors give rise to a phenomenon, matrix factorization can separate them out
  • Will talk about this in greater detail

Cell heterogeneity

  • Enormous interest in understanding how cells are similar or different
  • The answer can be expressed in millions of different ways
  • But cells often follow programs

Principal Components Analysis (PCA)

PCA as matrix factorization

For data table \(\mathbf{X}\) with its singular value decomposition, \(\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^\top\),

  • \(\mathbf{W} = \mathbf{U}\mathbf{D}\) gives the scores, placing observations in a low-dimensional space
  • \(\mathbf{H} = \mathbf{V}\) gives the loadings, relating the features to that low-dimensional space
    • Each principal component (PC) is a linear combination of the original features
    • PCs are uncorrelated with each other
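This correspondence with the SVD can be verified directly in numpy, using synthetic centered data and the scores/loadings names from the bullets above:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))
X = X - X.mean(axis=0)               # PCA assumes column-centered data

U, d, Vt = np.linalg.svd(X, full_matrices=False)
W = U * d                            # scores
H = Vt.T                             # loadings

print(np.allclose(X, W @ H.T))       # True: the factorization reconstructs X
C = np.cov(W, rowvar=False)
print(np.allclose(C, np.diag(np.diag(C))))   # True: PCs are uncorrelated
```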

PCA intuition

  • Ordered in terms of variance
  • \(k\)th PC is orthogonal to all previous PCs
  • Reduce dimensionality while maintaining maximal variance
    • PCA minimizes reconstruction error: \(\left\|\mathbf{X} - \mathbf{W}\mathbf{H}^\top\right\|_F^2\)

Scatter plot of data points with arrows indicating the directions of the first and second principal components (PC1 and PC2).

PCA example: words

  • Consider an example dataset of two variables about words
    • The number of lines of its dictionary definition
    • The length of a word
    • Example from Abdi & Williams, 2010
  • We obtain \(\mathbf{X}\) after applying appropriate centering and normalization

PCA identifies two orthogonal axes of variation

  • The first component (PC1) explains the most variation
  • The second component (PC2) is orthogonal to PC1

One can project the observed data points onto the principal components


Scores for each data point

  • We can read the projected scores for each data point

  • SVD: \(\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^\top\)
  • Scores: \(\mathbf{W} = \mathbf{U}\mathbf{D}\)
  • Loadings: \(\mathbf{H} = \mathbf{V}\)
  • Projection: \(\mathbf{X}\mathbf{H} = \mathbf{U}\mathbf{D}\mathbf{V}^\top\mathbf{V} = \mathbf{U}\mathbf{D} = \mathbf{W}\), since \(\mathbf{V}\) has orthonormal columns
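A quick numerical check of the projection identity, using synthetic centered data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 6))
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
W, H = U * d, Vt.T

# X H = U D V^T V = U D = W, because V has orthonormal columns
print(np.allclose(X @ H, W))         # True
```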

Projecting new data with PCA

  • Once you have characterized the PCA components, you may use the loadings to project new data points.
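In sklearn this projection is `transform()`; a minimal sketch with synthetic data (the new observation here is made up):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.standard_normal((40, 3))
pca = PCA(n_components=2).fit(X)

x_new = np.array([[0.5, -1.0, 2.0]])     # hypothetical new observation
score = pca.transform(x_new)             # centers with the *training* mean, then applies loadings
print(score.shape)                       # (1, 2)
print(np.allclose(score, (x_new - pca.mean_) @ pca.components_.T))   # True
```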

PCA words example: projecting a new word, ‘sur’.

  • sur means ‘on’ in French

Score for the new word, ‘sur’.

  • Go through example
  • Mention normalization
  • Construct scores and loadings plot
  • Walk through interpretation of plots
  • Go through selection of component numbers
  • Talk about plotting higher components
  • Talk about relationship going back to data from PCA

Algorithms: computing PCA

  • All methods are essentially deterministic
  • Iterative computation
    • More robust with high numbers of variables
    • Slower to calculate
  • NIPALS (nonlinear iterative partial least squares)
    • Able to efficiently calculate a few PCs at once
    • Breaks down for high numbers of variables (large p)
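A minimal NIPALS sketch for a single PC (the function name and iteration limits are illustrative): alternate between estimating loadings from scores and scores from loadings until the scores stop changing.

```python
import numpy as np

def nipals_first_pc(X, max_iter=200, tol=1e-10):
    """First principal component of column-centered X by NIPALS (minimal sketch)."""
    t = X[:, 0].copy()                    # initialize scores from a data column
    for _ in range(max_iter):
        p = X.T @ t / (t @ t)             # loadings, given current scores
        p /= np.linalg.norm(p)
        t_new = X @ p                     # scores, given loadings
        if np.linalg.norm(t_new - t) < tol:
            return t_new, p
        t = t_new
    return t, p

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 4))
X = X - X.mean(axis=0)

t, p = nipals_first_pc(X)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
print(abs(p @ Vt[0]) > 0.999)             # True: matches PC1 up to sign
```

For additional PCs, one deflates the matrix (\(X \leftarrow X - tp^\top\)) and repeats.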

PCA in practice

  • Implemented within sklearn.decomposition.PCA
    • PCA.fit_transform(X) fits the model to X, then provides the data in principal component space
    • PCA.components_ provides the “loadings matrix”, or directions of maximum variance
    • PCA.explained_variance_ provides the amount of variance explained by each component

Go over explained variance.

PCA code example

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

X = iris.data
y = iris.target
target_names = iris.target_names

pca = PCA(n_components=2)
X_r = pca.fit_transform(X)

# Print PC1 loadings
print(pca.components_[0, :])

# Print PC1 scores
print(X_r[:, 0])

# Percentage of variance explained for each component
print(pca.explained_variance_ratio_)
# [ 0.92461621  0.05301557]

PCA example: separating flower species

PCA plot of the Iris dataset, showing separation of the three flower species in the first two principal components.

PCA example: genes mirror geography within Europe

PCA plot showing population structure of 1,387 European individuals: a statistical summary of genetic data based on principal component axis one (PC1) and axis two (PC2). Novembre et al. Nature. 2008.

  • Talk about the population structure. Migration.
  • Talk about the continuum of genetic ancestry.

Non-Negative Matrix Factorization (NMF)

What if we have data wherein effects always accumulate?

NMF application: mutational processes in cancer

Diagram from Helleday et al (2014) illustrating mutational processes in cancer.

Helleday et al, Nat Rev Gen, 2014

Diagram from Helleday et al (2014) illustrating mutational processes.

Helleday et al, Nat Rev Gen, 2014

Figure from Alexandrov et al (2013) showing mutational signature A.

Alexandrov et al, Cell Rep, 2013

Figure from Alexandrov et al (2013) showing mutational signature B.

Alexandrov et al, Cell Rep, 2013

  • Also used in interpreting medical health records

Important considerations

  • Like PCA, except the coefficients must be non-negative
  • Forcing positive coefficients implies an additive combination of parts to reconstruct the whole
  • Leads to sparse factors
  • The answer you get will always depend on the error metric, starting point, and search method

Multiplicative updates

  • The update rule is multiplicative instead of additive
  • If the initial values for \(\mathbf{W}\) and \(\mathbf{H}\) are non-negative, then \(\mathbf{W}\) and \(\mathbf{H}\) can never become negative
  • This guarantees a non-negative factorization
  • Will converge to a local minimum
    • Therefore starting point matters

Updating W

\[[W]_{ij} \leftarrow [W]_{ij} \frac{[\color{darkred} X \color{darkblue}{H^T} \color{black}]_{ij}}{[\color{darkred}{WH} \color{darkblue}{H^T} \color{black}]_{ij}}\]

Color indicates the reconstruction of the data and the projection matrix.

Updating H

\[[H]_{ij} \leftarrow [H]_{ij} \frac{[\color{darkblue}{W^T} \color{darkred}X \color{black}]_{ij}}{[\color{darkblue}{W^T} \color{darkred}{WH} \color{black}]_{ij}}\]

Color indicates the reconstruction of the data and the projection matrix.
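The two update rules combine into a tiny alternating loop. This sketch follows the \(X \approx WH\) convention of the updates above, on synthetic non-negative data of known low rank; the iteration count and epsilon guard are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(6)
N, M, R = 30, 20, 3
X = rng.random((N, R)) @ rng.random((R, M))    # non-negative data, true rank R

W = rng.random((N, R))                         # non-negative starting point
H = rng.random((R, M))
eps = 1e-12                                    # guard against division by zero

for _ in range(1000):
    W *= (X @ H.T) / (W @ H @ H.T + eps)       # multiplicative update for W
    H *= (W.T @ X) / (W.T @ W @ H + eps)       # multiplicative update for H

print(round(np.linalg.norm(X - W @ H) / np.linalg.norm(X), 4))   # small relative error
print((W >= 0).all() and (H >= 0).all())       # True: non-negativity is preserved
```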

Coordinate descent

  • Another approach optimizes one coordinate (element or column) of the matrices at a time, using its gradient
  • Not going to go through implementation
  • Will also converge to a local minimum

NMF in practice

  • Implemented within sklearn.decomposition.NMF.
    • n_components: number of components
    • init: how to initialize the search
    • solver: ‘cd’ for coordinate descent, or ‘mu’ for multiplicative update
    • l1_ratio, alpha_H, alpha_W: Can regularize fit
  • Provides:
    • NMF.components_: components x features matrix
    • Returns transformed data through NMF.fit_transform()
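A minimal usage sketch with synthetic non-negative data (the parameter choices here are arbitrary):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(7)
X = rng.random((50, 12))                 # NMF requires non-negative input

nmf = NMF(n_components=3, init="nndsvda", solver="mu",
          max_iter=500, random_state=0)
W = nmf.fit_transform(X)                 # samples x components
H = nmf.components_                      # components x features

print(W.shape, H.shape)                  # (50, 3) (3, 12)
print((W >= 0).all() and (H >= 0).all()) # True
```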

Review

Summary: PCA and NMF

PCA

  • Preserves the covariation within a dataset
  • Therefore mostly preserves axes of maximal variation
  • Number of components will vary in practice

NMF

  • Explains the dataset through two non-negative matrices
  • Much more stable patterns when assumptions are appropriate
  • Will explain less variance for a given number of components
  • Excellent for separating out additive factors

Closing

As always, selection of the appropriate method depends upon the question being asked.

Reading & Resources

Review Questions

  1. What do dimensionality reduction methods reduce? What is the tradeoff?
  2. What are three benefits of dimensionality reduction?
  3. Does matrix factorization have one answer? If not, what are two choices you could make?
  4. What does principal components analysis aim to preserve?
  5. What are the minimum and maximum number of principal components one can have for a dataset of 300 observations and 10 variables?
  6. How can you determine the “right” number of PCs to use?
  7. What is a loading matrix? What would be the dimensions of this matrix for the dataset in Q5 when using three PCs?
  8. What is a scores matrix? What would be the dimensions of this matrix for the dataset in Q5 when using three PCs?
  9. By definition, what is the direction of PC1?
  10. See question 5 on midterm W20. How does movement of the siControl EGF point represent changes in the original data?