PCA Compute doesn't exclude columns

pipetman · January 23, 2014, 5:51pm

I just ran a workflow with the “PCA” and “PCA Compute” nodes. In both nodes it is possible to define columns to include in or exclude from the PCA. But looking at the output (covariance matrix and loadings), it appears that the column I excluded is still part of the calculation.

Here’s a small part of the output with the “Concentration A” column present in the input file, but excluded in the “PCA Compute” node configuration:

eigenvalue     Concentration A   Feature1        Feature2
12.2156918     0.695533692       0.127841119     0.143758568
7.319064878    -0.703948333      0.168538412     0.183373315
4.497250157    0.079419285       -0.015860147    0.076603463
2.500490397    -0.109694686      -0.381236912    -0.195004551
1.482244583    -0.015410615      -0.011739858    -0.262190766

The “Concentration A” column is still present in the output file when it shouldn’t be. In addition, if I put a column filter in front of the “PCA Compute” node and remove this column, it no longer shows up (naturally), but the eigenvalues and loadings are also different, which seems to indicate that the column was taken into account in the calculation in the example above.

eigenvalue     Feature1       Feature2
9.815795857    0.198503037    0.220085087
4.52425556     4.85E-04       0.091242809
2.577484582    0.38346734     0.203967157
1.484091467    0.008478681    -0.255752408

That’s something that shouldn’t happen, unless I’m missing something (the node configuration appears to be pretty straightforward).

BTW: this is using KNIME 2.9.1
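
To make the comparison above reproducible outside KNIME, here is a minimal sketch in Python with NumPy. It runs a plain covariance-based PCA once on all columns and once with one column removed beforehand. The data is synthetic and the names (conc_a, feature1, feature2, pca_eigen) are made up for illustration; this is not the original table, and it is not necessarily how the “PCA Compute” node works internally.

```python
import numpy as np


def pca_eigen(data):
    """Eigen-decompose the covariance matrix of `data` (rows = samples).

    Returns the eigenvalues in descending order and the matching
    eigenvectors as rows (one row of loadings per component).
    """
    cov = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order].T


# Synthetic stand-in for the real table (assumption, not the original data).
rng = np.random.default_rng(0)
n_rows = 1000
conc_a = 3.0 * rng.normal(size=n_rows)
feature1 = 0.5 * conc_a + rng.normal(size=n_rows)  # correlated with conc_a
feature2 = 2.0 * rng.normal(size=n_rows)

full = np.column_stack([conc_a, feature1, feature2])
reduced = full[:, 1:]  # "Concentration A" removed before the PCA

vals_full, vecs_full = pca_eigen(full)
vals_reduced, vecs_reduced = pca_eigen(reduced)

print("eigenvalues, all columns:    ", vals_full)
print("eigenvalues, column excluded:", vals_reduced)
```

If an excluded column really were left out of the calculation, the output should look like the second case: one eigenvalue per remaining column and no loading for the excluded one. Getting as many components as there are input columns, with a non-zero loading for the excluded column, matches the first case instead.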

Aaron_Hart · January 27, 2014, 5:29pm

I had a quick look and, for me at least, it appears to be working as expected. The components are the same whether you filter out the extra columns before or after the PCA computation.

Regards, Aaron

pca_tests.png

pipetman · January 27, 2014, 7:55pm

Thanks for checking!

Could the issue be related to or caused by the size of the data table (1.2M rows, ~70 columns)?