The generalized eigenvalue problem

Introduction

When I was just getting started in computational research, I often found myself exploring the scipy.linalg documentation just to get a sense of what tools were available. A critical tool I ended up using frequently was the function for symmetric matrix eigendecomposition, scipy.linalg.eigh, as it solves the eigenvalue problem underlying principal component analysis (PCA) and much of graph signal processing. Briefly, if $\mathbf{A}$ represents a covariance matrix, one can calculate the top PC (the eigenvector $\mathbf{v}$ with the largest eigenvalue $\lambda$) by solving

\[\mathbf{A} \mathbf{v} = \lambda \mathbf{v}.\]
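For instance, here is a minimal sketch with a small random data matrix standing in for real measurements (the shapes and seed are arbitrary); the top PC is the eigenvector paired with the largest eigenvalue returned by scipy.linalg.eigh:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))  # 5 features measured across 100 observations
A = np.cov(X)                      # 5 x 5 covariance matrix (rows are variables)

evals, evecs = eigh(A)             # eigenvalues are returned in ascending order
top_pc = evecs[:, -1]              # eigenvector with the largest eigenvalue
print(evals[-1], top_pc)
```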

However, as I became more familiar with this function, I noticed that it optionally accepted a second matrix as input. Given the two matrices, $\mathbf{A}$ and $\mathbf{B}$, it would apparently solve the generalized eigenvalue problem:

\[\begin{equation} \label{eq:geneigprob} \mathbf{A} \mathbf{v} = \lambda \mathbf{B} \mathbf{v}. \end{equation}\]
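Mechanically, using it is as simple as passing that second matrix. Below is a small sketch with two synthetic positive definite matrices standing in for $\mathbf{A}$ and $\mathbf{B}$:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 50))
N = rng.standard_normal((5, 60))
A = M @ M.T                        # symmetric positive definite
B = N @ N.T                        # symmetric positive definite

evals, evecs = eigh(A, B)          # solves A v = lambda B v
v = evecs[:, -1]                   # generalized eigenvector with the largest eigenvalue
print(np.allclose(A @ v, evals[-1] * (B @ v)))  # True
```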

For a long time, I didn’t have a great intuition for what this equation represented. One day, after watching Arrival on a flight, I was inspired to make great scientific strides. I didn’t do that, but I did end up with the following interpretation of the generalized eigenvalue problem.


Brief recap of PCA

As shown in a previous post, PCA can be seen as reconstructing the graph of similarities between data points with the least amount of information. We’ll slightly change our approach to focus on the features here, but the results are the same whether you consider observations or features. Consider the data matrix

\[\mathbf{X} \in \mathbb{R}^{g \times n},\]

which we will think of as $g$ genes measured across $n$ cells from a patient with a particular disease.

We are interested in a “reduced” representation given by

\[\mathbf{v} \in \mathbb{R}^{g},\]

which can be thought of as a data matrix with only a single column – i.e. a “metagene” or “gene program” – and which we might assume is associated with the patient’s disease. Each representation yields relationships between features that we can think of as graphs (i.e. covariance matrices), which are given by

\[\mathbf{X} \mathbf{X}^{\top}, \mathbf{v} \mathbf{v}^{\top} \in \mathbb{R}^{g \times g}.\]

We can then quantify how well these graphs match by summing the products of corresponding edge weights:

\[\lambda = \sum_{ij} (\mathbf{X} \mathbf{X}^{\top})_{ij} (\mathbf{v} \mathbf{v}^{\top})_{ij},\]

which boils down to the quadratic form

\[\lambda = \mathbf{v}^{\top} \mathbf{A} \mathbf{v},\]

where $\mathbf{A} = \mathbf{X} \mathbf{X}^{\top}$ is the adjacency matrix associated with the true data graph.
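A quick numerical check of this identity, with random values standing in for $\mathbf{X}$ and $\mathbf{v}$ (the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 40))        # g = 6 genes, n = 40 cells
v = rng.standard_normal(6)              # a candidate "metagene"
A = X @ X.T                             # gene-gene graph

edge_sum = np.sum(A * np.outer(v, v))   # sum of products of matching edge weights
quad_form = v @ A @ v                   # the quadratic form
print(np.isclose(edge_sum, quad_form))  # True
```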

Normalizing by the squared magnitude of $\mathbf{v}$, we get

\[\lambda = \frac{\mathbf{v}^{\top} \mathbf{A} \mathbf{v}}{\mathbf{v}^{\top} \mathbf{v}},\]

which is a Rayleigh quotient. Rearranging, and noting that the quotient is maximized when $\mathbf{v}$ is an eigenvector of $\mathbf{A}$, we recover an eigenvalue problem:

\[\begin{align} \Rightarrow \lambda \mathbf{v}^{\top} \mathbf{v} &= \mathbf{v}^{\top} \mathbf{A} \mathbf{v} \nonumber \\ \Rightarrow \lambda \mathbf{v} &= \mathbf{A} \mathbf{v}. \label{eq:eigprob} \end{align}\]

Thus, by finding the eigenvector $\mathbf{v}$ with the largest eigenvalue $\lambda$, we can find a “condensed representation” of the gene expression patterns in $\mathbf{X}$. With this intuition in mind, we can now “generalize” PCA to find relationships present in one graph, $\mathbf{A}$, but not another, $\mathbf{B}$.
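Here is that recipe on synthetic data: the last column returned by scipy.linalg.eigh is the top eigenvector, its Rayleigh quotient recovers the top eigenvalue, and any other direction scores no higher.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 40))            # g = 6 genes, n = 40 cells
A = X @ X.T

evals, evecs = eigh(A)
v = evecs[:, -1]                            # top eigenvector, the "condensed representation"
print(np.isclose((v @ A @ v) / (v @ v), evals[-1]))  # True

u = rng.standard_normal(6)                  # an arbitrary competing direction
print((u @ A @ u) / (u @ u) <= evals[-1])   # True: no direction beats the top eigenvector
```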


Generalization

In the single-cell analysis context mentioned above, the top gene programs (i.e. PCs) would likely just describe critical functions present in both healthy and diseased patients. To deal with this, we need some way of specifying that we want to find patterns present uniquely in the disease data and not the healthy data.

Imagine we collected a second sample from a healthy patient, resulting in a second data matrix,

\[\mathbf{Y} \in \mathbb{R}^{g \times m}.\]

While the number of cells measured in this assay might differ from that of the first (i.e. $n\neq m$), we can still compare the two in terms of their features (which is why we are working with feature covariances in this post). The graph of feature relationships for this healthy sample is given by

\[\mathbf{B} = \mathbf{Y} \mathbf{Y}^{\top} \in \mathbb{R}^{g \times g}.\]

Our goal is now to somehow remove the information in $\mathbf{B}$ from that of $\mathbf{A}$. If we were doing this with scalar values associated with each dataset $a,b \in \mathbb{R}$, we could simply divide out $b$:

\[\frac{a}{b}.\]

But we are trying to do this with matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{g \times g}$, which we can’t divide in the same way. Nevertheless, we can think of division instead as multiplication by an inverse:

\[b^{-1} a.\]

This actually does generalize to matrices, so long as they are invertible. Assuming $\mathbf{A}$ and $\mathbf{B}$ are positive definite (i.e. the data matrices $\mathbf{X}$ and $\mathbf{Y}$ have full row rank), $\mathbf{B}$ is invertible and we can compute the product

\[\mathbf{B}^{-1} \mathbf{A},\]

which represents the information present in $\mathbf{A}$ but not $\mathbf{B}$. Although this product is generally not symmetric itself, it is similar to the symmetric matrix $\mathbf{B}^{-1/2} \mathbf{A} \mathbf{B}^{-1/2}$, so it still has real, non-negative eigenvalues, and we can plug it into an eigenvalue problem to find the dominant patterns, which should now be disease-specific:

\[\mathbf{B}^{-1} \mathbf{A} \mathbf{v} = \lambda \mathbf{v}.\]

And that’s it, there’s our final equation and intuition. All we need to do is left-multiply each side by $\mathbf{B}$ and we arrive back at our original eq. \eqref{eq:geneigprob}.
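As a sanity check on synthetic stand-ins for the disease and healthy samples (random matrices, not meant to be biologically meaningful), the eigenvalues of $\mathbf{B}^{-1}\mathbf{A}$ do match those returned by the generalized solver:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 50))   # "disease" sample: g = 6 genes, n = 50 cells
Y = rng.standard_normal((6, 70))   # "healthy" sample: g = 6 genes, m = 70 cells
A, B = X @ X.T, Y @ Y.T

gen_evals, gen_evecs = eigh(A, B)                        # solve A v = lambda B v directly
prod_evals = np.linalg.eigvals(np.linalg.solve(B, A))    # eigenvalues of B^{-1} A

print(np.allclose(np.sort(prod_evals.real), gen_evals))  # True (up to numerical error)
```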

One last thing I’d like to do is provide intuition in the form of Rayleigh quotients. Just as we flipped between the eigenvalue problem and the Rayleigh quotient in eq. \eqref{eq:eigprob}, we can do the same with the generalized problem:

\[\begin{align} \mathbf{A} \mathbf{v} &= \lambda \mathbf{B} \mathbf{v} \nonumber \\ \Rightarrow \mathbf{v}^{\top} \mathbf{A} \mathbf{v} &= \lambda \mathbf{v}^{\top} \mathbf{B} \mathbf{v} \nonumber \\ \Rightarrow \lambda &= \frac{\mathbf{v}^{\top} \mathbf{A} \mathbf{v}}{\mathbf{v}^{\top} \mathbf{B} \mathbf{v}}. \nonumber \end{align}\]

Here, we can see more clearly that the prominence of our component (i.e. the eigenvalue $\lambda$ of eigenvector $\mathbf{v}$) is governed by how much it can emphasize the relationships in graph $\mathbf{A}$ while deemphasizing the relationships in graph $\mathbf{B}$.
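Numerically, the top generalized eigenvector’s quotient recovers its eigenvalue, again on synthetic stand-ins for the two samples:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 50))   # "disease" sample
Y = rng.standard_normal((6, 70))   # "healthy" sample
A, B = X @ X.T, Y @ Y.T

evals, evecs = eigh(A, B)
v = evecs[:, -1]                                    # top disease-specific "metagene"
gen_rayleigh = (v @ A @ v) / (v @ B @ v)            # generalized Rayleigh quotient
print(np.isclose(gen_rayleigh, evals[-1]))          # True
```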


Conclusion

Altogether, we have shown two different ways in which generalized eigenvalue problems can be interpreted as a sort of comparative PCA. Note that this interpretation is also made clear in linear discriminant analysis. It can also be applied to canonical correlation analysis, as seen here and here. Honestly, I’m surprised it isn’t totally abused in single-cell analysis to find disease-specific features, but maybe that’s partially because scRNA-seq data violates key distributional assumptions.

Hope this was insightful.