PCA: how is the "minimal amount of information to be preserved" used?

gbonamy · June 16, 2011, 12:52am

Does anyone have any information about how the "minimal amount of information to be preserved" relates to the number of principal components used by the PCA node? Is it done by decreasing the number of principal components and looking at how much the variance changes after the PCA decomposition, or does it use something more clever?

Perhaps it would be nice to have this info in the Node description.

uwe · June 16, 2011, 12:08pm

Short answer: something more clever.

The PCA projection is based on a spectral decomposition of the covariance matrix: the data is projected onto the eigenvectors that belong to the largest eigenvalues.
The fraction of variance preserved by the projection equals the sum of the eigenvalues of the retained eigenvectors divided by the sum of all eigenvalues. For example, with three eigenvalues 2, 1, 0, projecting onto the first eigenvector preserves 2/(2+1+0) = 2/3 of the overall variance, while using the first two eigenvectors preserves 100% of the information (the last eigenvector carries no information, since its eigenvalue is zero).
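
To make the arithmetic concrete, here is a minimal sketch in Python with NumPy (my choice for illustration, not something the PCA node uses). It only reproduces the 2, 1, 0 example; the variable names are made up for this post.

```python
import numpy as np

# Eigenvalues from the example, sorted from largest to smallest.
eigenvalues = np.array([2.0, 1.0, 0.0])

# Share of the total variance carried by each eigenvector.
fractions = eigenvalues / eigenvalues.sum()

# Share preserved when keeping the first 1, 2, 3 eigenvectors.
preserved = np.cumsum(fractions)

for k, share in enumerate(preserved, start=1):
    print(f"first {k} eigenvector(s): {share:.3f} of the variance")
```

The first entry of `preserved` is the 2/3 from the example, and from the second entry on nothing more is gained because the remaining eigenvalue is zero.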

If you set that threshold value, Knime will use as many eigenvectors as necessary to preserve at least this fraction of the variance (i.e. of the overall information).
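
The selection rule can be sketched roughly as follows. This is an illustration in Python with NumPy under the assumption that the rule works exactly as described in this post; it is not the node's actual source code, and `components_for_threshold` and the toy data are hypothetical names and values invented for the example.

```python
import numpy as np

def components_for_threshold(data, threshold):
    """Return (k, preserved): the smallest number of eigenvectors whose
    eigenvalues sum to at least `threshold` of the total variance, plus
    the cumulative preserved fractions. Hypothetical helper."""
    # Covariance matrix with columns as variables, rows as observations.
    cov = np.cov(data, rowvar=False)
    # eigvalsh returns eigenvalues in ascending order; reverse them.
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]
    # Guard against tiny negative values caused by rounding.
    eigenvalues = np.clip(eigenvalues, 0.0, None)
    preserved = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Index of the first entry that reaches the threshold
    # (small tolerance for floating-point rounding).
    k = int(np.argmax(preserved >= threshold - 1e-12)) + 1
    return k, preserved

# Toy data whose covariance eigenvalues are in the ratio 2 : 1 : 0.
data = np.array([
    [ 1.0,  0.0, 0.0],
    [-1.0,  0.0, 0.0],
    [ 1.0,  0.0, 0.0],
    [-1.0,  0.0, 0.0],
    [ 0.0,  1.0, 0.0],
    [ 0.0, -1.0, 0.0],
])

k_low, preserved = components_for_threshold(data, 0.6)
k_high, _ = components_for_threshold(data, 0.9)
print(k_low, k_high)
```

With a threshold of 0.6 a single eigenvector is enough (it preserves 2/3), while a threshold of 0.9 needs two. The sketch assumes the data is not constant, since the total variance would then be zero.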