PCA and information loss

The PCA Compute and PCA Apply nodes deal with dimensionality reduction. One of the parameters for inferring the number of target dimensions is the “information fraction to preserve”. But what exactly does this mean?



I don’t want to go into too much detail here, so I’ll try to keep it simple.

To compute the PCA, you calculate eigenvectors and eigenvalues from your input data (typically from its covariance matrix).
Each eigenvalue tells you how much of the variance is explained by its associated eigenvector.
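As a minimal sketch of that eigendecomposition step (the toy data below is made up for illustration; KNIME does this internally):

```python
import numpy as np

# Hypothetical toy data: 100 rows, 3 columns with different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.5, 0.5])

Xc = X - X.mean(axis=0)                  # center each column
cov = np.cov(Xc, rowvar=False)           # covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
eigvals = eigvals[::-1]                  # sort eigenvalues descending

# Fraction of the total variance each eigenvector explains
print(eigvals / eigvals.sum())
```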

Assume your eigenvalues are (3, 2, 1). Then the first explains 50% (3 / (3 + 2 + 1)), the second 33% (2 / (3 + 2 + 1)), and the third 17% (1 / (3 + 2 + 1)) of the variance. Therefore, if you reduce to one dimension you preserve 50% of your information, and if you reduce to two dimensions you preserve 50% + 33% = 83% (PCA favors the eigenvectors with the largest eigenvalues).
Vice versa, if you want to maintain at least 80% of the information you have to keep 2 dimensions, and in order to keep 90% you have to keep all 3 dimensions.
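The selection rule above can be sketched in a few lines (the helper name `dims_for_fraction` is mine, not part of KNIME):

```python
import numpy as np

eigvals = np.array([3.0, 2.0, 1.0])      # eigenvalues, sorted descending
ratios = eigvals / eigvals.sum()          # variance fraction per component
cum = np.cumsum(ratios)                   # cumulative preserved fraction

def dims_for_fraction(cum, frac):
    # smallest number of components whose cumulative fraction >= frac
    return int(np.searchsorted(cum, frac)) + 1

print(dims_for_fraction(cum, 0.80))  # -> 2 (2 dims keep ~83%)
print(dims_for_fraction(cum, 0.90))  # -> 3 (need all 3 dims)
```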

I hope this is somewhat clear and gives you an intuition of what this option refers to. If you want more details, feel free to ask :).



