Mathematics for Machine Learning: PCA
The last few weeks I’ve been working through the last course in the Mathematics for Machine Learning Specialization on Coursera. This course builds on the linear algebra and multivariate calculus courses, and uses them to derive principal component analysis (PCA).
PCA is a dimensionality reduction technique that is commonly used when we have many features which may be highly correlated. PCA helps us identify where certain variables (or collections of variables) are close to linear combinations of others in our data set, and can determine which features to extract to capture as much of the relevant information as possible using fewer, uncorrelated features. It’s also helpful for visualizing multidimensional data in a simple graph, and can help prevent overfitting.
Since we want to preserve as much information as possible from our original high-dimensional data set, we want to minimize the information lost in the process. We can measure this loss as the “reconstruction error,” or the average squared distance between the original data points and their lower-dimensional reconstructions.
We do this by expressing each data point in a new orthonormal basis chosen so that the first few components capture most of the variance in the data. Essentially, we look for basis vectors spanning a subspace in which the projected data retains maximum variance (in other words, we try to find a new “direction” or axis along which the variance of the projected data points is greatest). This allows us to retain as much of the original information as possible.
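To make this concrete, here is a minimal NumPy sketch on synthetic, hypothetical 2-D data (the variable names are my own): for centered data, sweeping over candidate directions shows that the direction minimizing the reconstruction error is exactly the one maximizing the variance of the projections.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two correlated features, then centered.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])
X = X - X.mean(axis=0)

def reconstruction_error(X, direction):
    """Average squared distance between points and their projections onto a unit direction."""
    d = direction / np.linalg.norm(direction)
    reconstructed = np.outer(X @ d, d)      # points rebuilt from their 1-D projections
    return np.mean(np.sum((X - reconstructed) ** 2, axis=1))

def projected_variance(X, direction):
    """Variance of the data after projecting onto a unit direction."""
    d = direction / np.linalg.norm(direction)
    return np.var(X @ d)

# Sweep candidate directions: the angle minimizing reconstruction error
# is the same angle that maximizes the projected variance.
angles = np.linspace(0, np.pi, 180)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
errors = [reconstruction_error(X, d) for d in directions]
variances = [projected_variance(X, d) for d in directions]

print(angles[np.argmin(errors)], angles[np.argmax(variances)])  # the two agree
```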
Mathematically, this can be done using eigendecomposition or singular value decomposition (SVD). Eigendecomposition is a bit simpler to understand, since to find the PCs all we need to do is the following (sketched in code after the list):
- Center and standardize the data.
- Compute the covariance matrix for the centered data and find its eigenvectors and eigenvalues.
- Sort the eigenvectors by their corresponding eigenvalues in descending order. The eigenvectors are the principal components, and each eigenvalue measures how much variance its PC captures: the larger the eigenvalue, the more important the corresponding PC.
- Project any data point onto the principal subspace that is spanned by the eigenvectors that belong to the largest eigenvalues.
(This blog has a really straightforward walkthrough of the above steps in more detail).
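Here is a rough NumPy sketch of those four steps on synthetic, hypothetical data; the variable names and the choice of keeping two components are my own, not from the course.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))                     # hypothetical data: 500 samples, 5 features

# 1. Center and standardize the data.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data and its eigendecomposition.
cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: covariance matrices are symmetric

# 3. Sort eigenvectors by eigenvalue, largest first; the columns are the PCs.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Project the data onto the subspace spanned by the top-k eigenvectors.
k = 2
X_projected = X_std @ eigenvectors[:, :k]
print(X_projected.shape)                          # (500, 2)
```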
In practice the SVD approach is more common since it works on the data matrix directly and avoids forming the covariance matrix, which is computationally expensive and can be numerically less stable for large, high-dimensional datasets.
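A sketch of the SVD route, under the same assumptions as the previous snippet: we decompose the centered data matrix itself, and the rows of Vt give the same principal directions that the covariance eigendecomposition would.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                  # hypothetical data: 500 samples, 5 features
X_centered = X - X.mean(axis=0)

# SVD of the centered data matrix itself; no covariance matrix is formed.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

principal_components = Vt                      # rows of Vt are the principal directions
explained_variance = S ** 2 / (len(X) - 1)     # equals the covariance eigenvalues

k = 2
X_projected = X_centered @ Vt[:k].T            # same projection as the eigendecomposition route
print(X_projected.shape)                       # (500, 2)
```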
In either case, the resulting principal components are always orthogonal to each other and are ordered by decreasing variance: the first principal component captures the most variance, the second the next most, and so on. In a two-dimensional problem the result is two vectors at 90 degrees to each other, while in three dimensions it is three mutually orthogonal vectors.
There is no hard rule about how many principal components should be selected from the resulting list, but elbow (scree) plots of the explained variance are a commonly-used tool to help determine this. Alternatively, in many implementations of PCA you can simply specify how much of the variance you want the principal components to capture.
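As one illustration, scikit-learn’s PCA supports both approaches: explained_variance_ratio_ gives the values you would plot in an elbow plot, and passing a fraction as n_components keeps just enough components to reach that share of the variance. The data below is synthetic and built with a hypothetical low-rank structure purely so the elbow is visible.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 3))                    # 3 hypothetical underlying factors
W = rng.normal(size=(3, 10))
X = latent @ W + 0.1 * rng.normal(size=(500, 10))     # 10 observed, correlated features

# Fit with all components; these ratios are what an elbow/scree plot displays.
pca_full = PCA().fit(X)
print(pca_full.explained_variance_ratio_)

# Or let PCA keep just enough components to capture 95% of the variance.
pca_95 = PCA(n_components=0.95).fit(X)
print(pca_95.n_components_)                           # likely 3 for this synthetic data
```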