Principal component analysis
Introduction
It’s not a real blog unless I give my take on PCA.
There are several other blog posts out there that provide intuition for PCA, but most tend to fall back on the idea of “maximizing variance explained”. This intuition never fully clicked for me. As my research turned more toward graphs, I tried to derive PCA from more of a graph perspective, omitting explicit mention of variance altogether. Here, I’ll briefly share the intuition that did end up clicking: it turns out that variance explained can instead be thought of as the similarity between two graphs.
Problem statement
Consider the data matrix
\[\mathbf{X} \in \mathbb{R}^{n \times g},\]where $n$ is the number of observations and $g$ is the number of features. Given my background in single-cell omics, I tend to think of this as a cell-by-gene matrix – hence the odd choice of the letter $g$. In this case, each row corresponds to a measured cell, each column to a gene measured across all cells, and each entry to how much a given gene is expressed in a given cell. (I’ll continue referring to observations as cells and features as genes just to tie things down to an application.) Viewed as a dimensionality reduction algorithm, the goal of PCA is to discard “unnecessary” information, thereby emphasizing the more prominent patterns in the data. But how do we specify that mathematically?
First, we need to explicitly specify what patterns we’re interested in. The patterns we will focus on are the relationships between cells (although the relationships between genes would lead to the same result). More specifically, we will consider the pairwise relationships of all cells in terms of their genes. To do this for all pairs of cells, we can calculate the matrix product
\[\begin{equation} \label{eq:graphtrue} \mathbf{X} \mathbf{X}^{\top} \in \mathbb{R}^{n \times n}, \end{equation}\]where entry $(i, j)$ describes the similarity between cells $i$ and $j$ in terms of their inner product. Assuming the columns of $\mathbf{X}$ have been mean-centered, this is the Gram matrix of the cells – a close relative of the usual gene-by-gene covariance matrix $\mathbf{X}^{\top} \mathbf{X}$. However, in keeping with a graph perspective, we can instead think of this as a network connecting similar cells to one another. It’s not sparse like most adjacency matrices – in fact it’s dense with connections from each cell to all others. But these connections are weighted by similarity, enabling us to intuitively think of $\mathbf{X} \mathbf{X}^{\top}$ as a graph connecting molecularly similar cells.
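If it helps to see this concretely, here’s a minimal numpy sketch using a made-up toy matrix (random counts standing in for real expression data – just an illustration, not anything from a real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cell-by-gene matrix: n cells (rows), g genes (columns)
n, g = 100, 500
X = rng.poisson(2.0, size=(n, g)).astype(float)

# Mean-center each gene so inner products reflect co-variation
X = X - X.mean(axis=0)

# Dense cell-by-cell "graph": entry (i, j) is the inner-product
# similarity between cells i and j
C = X @ X.T
assert C.shape == (n, n)
```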
Next, we need to specify what “reducing information” means. Let’s think of it as assigning each cell a single value (i.e. a “metagene” summarizing multiple genes) while trying to maintain the original molecular relationships between cells. Quantitatively, these new values correspond to a vector $\mathbf{v} \in \mathbb{R}^n$. Just as the true molecular relationships between cells were given by $\mathbf{X} \mathbf{X}^{\top} \in \mathbb{R}^{n \times n}$, the graph resulting from our new values is given by
\[\begin{equation} \label{eq:graphpred} \mathbf{v} \mathbf{v}^{\top} \in \mathbb{R}^{n \times n}. \end{equation}\]Finally, because we want to maintain the original graph structure as best we can, we need a way to compare the two graphs given by eqs. \eqref{eq:graphtrue} and \eqref{eq:graphpred}. We’ll do it by summing, over all pairs of cells, the products of the corresponding edge weights in each graph – which are just the individual matrix entries. (To be clear, this is a bit meta and entails summing over “similarities between similarities”.) This leads us to our final metric for the similarity between our two graphs:
\[\begin{equation} \label{eq:metric} \lambda = \sum_{ij} (\mathbf{X} \mathbf{X}^{\top})_{ij} (\mathbf{v} \mathbf{v}^{\top})_{ij}. \end{equation}\]
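As a quick sanity check of this metric, here’s a sketch with the same toy data and an arbitrary candidate $\mathbf{v}$ (nothing here is optimized yet – it just evaluates the sum):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy setup as before: a centered cell-by-gene matrix
X = rng.poisson(2.0, size=(100, 500)).astype(float)
X = X - X.mean(axis=0)

# An arbitrary candidate assignment of one value per cell
v = rng.standard_normal(100)

# Graph similarity: sum of entrywise products of the two graphs
lam = np.sum((X @ X.T) * np.outer(v, v))

# This matches the quadratic form derived in the next section
assert np.isclose(lam, v @ (X @ X.T) @ v)
```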
Solution
Now we can ask: how do we maximize this similarity? It turns out we can solve this just by simplifying the equation. Let $\mathbf{C} = \mathbf{X} \mathbf{X}^{\top}$ represent our original graph of molecular relationships between cells. Then we can rearrange our metric into
\[\lambda = \sum_{ij} \mathbf{C}_{ij} v_i v_j.\]This simplifies to the quadratic form
\[\lambda = \mathbf{v}^{\top} \mathbf{C} \mathbf{v}.\]Now consider that we don’t care about the overall magnitude of $\mathbf{v}$. Rather, we care about its relative differences among the cells. Thus, we should normalize our expression by dividing out its squared magnitude, $\mathbf{v}^{\top} \mathbf{v}$:
\[\lambda = \frac{\mathbf{v}^{\top} \mathbf{C} \mathbf{v}}{\mathbf{v}^{\top} \mathbf{v}}.\]Finally, we can rearrange this equation into an eigenvalue problem:
\[\begin{align} \lambda &= \frac{\mathbf{v}^{\top} \mathbf{C} \mathbf{v}}{\mathbf{v}^{\top} \mathbf{v}} \nonumber \\ \Rightarrow \lambda \mathbf{v}^{\top} \mathbf{v} &= \mathbf{v}^{\top} \mathbf{C} \mathbf{v} \nonumber \\ \Rightarrow \lambda \mathbf{v} &= \mathbf{C} \mathbf{v}. \label{eq:eigprob} \end{align}\](Strictly speaking, the last implication comes from setting the gradient of this ratio – a Rayleigh quotient – to zero, so it holds at the stationary points of $\lambda$.) Thus, the $\mathbf{v}$ that maximizes $\lambda$ is just the eigenvector associated with the largest eigenvalue of $\mathbf{C}$. In PCA, this eigenvector is PC1. The remaining eigenvectors, in order of decreasing eigenvalue, are PC2, PC3, etc. The loadings for each PC (i.e. how much each gene contributes to that pattern over the cell graph) can then be calculated by taking the inner product with each gene, i.e. $\mathbf{X}^{\top} \mathbf{v}$.
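Putting the whole recipe into code, here’s an end-to-end sketch (toy data once more) that solves the eigenvalue problem for $\mathbf{C}$ and cross-checks the result against a plain SVD of the centered data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy centered cell-by-gene matrix
X = rng.poisson(2.0, size=(100, 500)).astype(float)
X = X - X.mean(axis=0)

# Eigendecomposition of the cell-cell graph C = X X^T
C = X @ X.T
eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]         # sort descending instead
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = eigvecs[:, 0]                       # one value per cell (PC1, unit norm)
loadings_pc1 = X.T @ pc1                  # how much each gene contributes

# Cross-check: the top eigenvector of X X^T is the first left singular
# vector of X (up to sign), and its eigenvalue is the squared singular value.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
sign = np.sign(pc1 @ U[:, 0])
assert np.allclose(pc1, sign * U[:, 0])
assert np.isclose(eigvals[0], S[0] ** 2)
```

Up to the usual sign ambiguity of eigenvectors, the top eigenvector of the cell graph and the first left singular vector of $\mathbf{X}$ coincide – which is exactly the claim above.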
Conclusion
In summary, we first represented the molecular relationships between cells as a graph with eq. \eqref{eq:graphtrue}. We then sought a reconstruction of that graph from less information, given by eq. \eqref{eq:graphpred}. To find the optimal values that do this, we calculated the similarity between the two graphs as eq. \eqref{eq:metric} and simplified it to eq. \eqref{eq:eigprob} to find that the top eigenvector of the original data graph provides the desired information.
This may seem like quite a bit of legwork, but I think it’s worth it for a few reasons. First, it provides a purely geometric and visual derivation of PCA that may be more intuitive to those who feel comfortable with graphs. Second, it makes the connection between PCA and the Fourier transform very clear, as eigenvectors of a symmetric matrix are just Fourier modes over the associated graph domain. Thus, you can think of PCs as the lowest-frequency Fourier modes over the molecular similarity graph. (This is literally the same math introduced in my previous post on graph signal processing except that it relies on an adjacency matrix rather than a Laplacian matrix).
Hope this is insightful.