PCA node 'Replace original data columns'


I am new to KNIME and not an expert on PCA either. I have been given a project and am experimenting with PCA to reduce dimensionality.

In the PCA node there is an option called 'Replace original data columns'. The output table it creates, with all the PCA dimensions, is the one I then use as the input to my actual learner and predictor. From what I understand, this is how it should be done: train on the new table. But the accuracy with this table is really poor.

I compared that with the scenario where I did not select 'Replace original data columns'. That table contained the original columns plus the PCA columns. When I used this table, which makes no sense to me, the accuracy was very high.

I cannot understand how this can happen. Can anyone explain the proper use of PCA? What is the purpose of replacing the original data columns?



Hi Furqan, 

There is good news and bad news.  

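The bad news is that PCA does not guarantee that the separability of your data is maintained for subsequent ML tasks. It only preserves variance and never looks at the class labels, so it is quite possible for your algorithms to perform poorly on the transformed data. When you keep the original columns alongside the PCA columns, the learner still has access to whatever information the projection discarded, which would explain the high accuracy you saw. It might be interesting for you to look at LDA (linear discriminant analysis) as well, as this does seek to preserve class separation.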
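
To make that concrete, here is a small sketch in plain Python (not a KNIME workflow) on made-up 2-D data, where one column has a lot of variance but no class information and the other has little variance but separates the classes. The data, the helper names, and the simple midpoint-threshold "classifier" are all assumptions for illustration and have nothing to do with your actual table or learner. Fisher's two-class discriminant stands in for LDA here.

```python
import math
import random

random.seed(0)

# Synthetic data (assumed for illustration): x is noisy and uninformative,
# y is low-variance but carries the class signal.
def make_class(y_center, n=200):
    return [(random.gauss(0.0, 10.0), random.gauss(y_center, 0.5)) for _ in range(n)]

class0 = make_class(-1.0)
class1 = make_class(+1.0)

def mean(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def covariance(pts):
    """Return (var_x, cov_xy, var_y) of a list of 2-D points."""
    mx, my = mean(pts)
    n = len(pts) - 1
    sxx = sum((p[0] - mx) ** 2 for p in pts) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / n
    syy = sum((p[1] - my) ** 2 for p in pts) / n
    return sxx, sxy, syy

def pca_axis(pts):
    """First principal component: leading eigenvector of the 2x2 covariance."""
    sxx, sxy, syy = covariance(pts)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))

def fisher_axis(a, b):
    """Fisher discriminant direction: inverse(within-class scatter) * (mean_b - mean_a)."""
    a_xx, a_xy, a_yy = covariance(a)
    b_xx, b_xy, b_yy = covariance(b)
    sxx, sxy, syy = a_xx + b_xx, a_xy + b_xy, a_yy + b_yy
    det = sxx * syy - sxy * sxy
    (ax, ay), (bx, by) = mean(a), mean(b)
    dx, dy = bx - ax, by - ay
    wx = (syy * dx - sxy * dy) / det
    wy = (-sxy * dx + sxx * dy) / det
    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)

def threshold_accuracy(axis, a, b):
    """Project onto one axis and classify by the midpoint of the projected class means."""
    pa = [p[0] * axis[0] + p[1] * axis[1] for p in a]
    pb = [p[0] * axis[0] + p[1] * axis[1] for p in b]
    cut = (sum(pa) / len(pa) + sum(pb) / len(pb)) / 2.0
    correct = sum(v < cut for v in pa) + sum(v >= cut for v in pb)
    acc = correct / (len(pa) + len(pb))
    return max(acc, 1.0 - acc)  # the sign of the axis is arbitrary

pca_dir = pca_axis(class0 + class1)    # unsupervised: labels are not used
lda_dir = fisher_axis(class0, class1)  # supervised: labels are used

pca_acc = threshold_accuracy(pca_dir, class0, class1)
lda_acc = threshold_accuracy(lda_dir, class0, class1)

print("PCA axis:", pca_dir, "accuracy on 1 component:", pca_acc)
print("LDA axis:", lda_dir, "accuracy on 1 component:", lda_acc)
```

On data built like this, the first principal component lines up with the noisy column, so a single PCA dimension tells the learner almost nothing about the class, while the discriminant direction lines up with the informative column. Real data is rarely this extreme, but the same effect can show up whenever the columns that matter for the label are not the ones with the most variance.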

The good news is that it sounds like your data can be learned from without transformation. Is working with untransformed data an option?