Split and select from a large number of variables?

Hi, I’m new to KNIME and trying to figure out the best way to do this. I’m dealing with a large number of variables (5K+ predictors) that I need to narrow down for modeling. My plan is to use a correlation filter first to remove highly correlated ones and then use something like a genetic algorithm or decision trees to do the rest. The problem is that there are too many variables for the correlation filter to handle. I’m thinking the best way to do this may be to split the data into 10 sets, apply the correlation filter + genetic algorithm to the first set, add the result to the second set, and repeat. Any advice on how to build a workflow like this?
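To make the chunked idea concrete outside KNIME, here is a minimal Python sketch of just the correlation-filter part. It is an illustration under stated assumptions, not how KNIME's Correlation Filter node works internally: the function names, the greedy keep-first rule, and the 0.9 threshold are all hypothetical choices, and the genetic-algorithm step is left out.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def correlation_filter(columns, names, threshold=0.9):
    """Greedy filter: keep a column only if its absolute correlation
    with every column kept so far does not exceed the threshold."""
    kept = []
    for name in names:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def chunked_correlation_filter(columns, n_chunks=10, threshold=0.9):
    """Filter one chunk at a time, carrying the survivors into the next chunk.

    `columns` maps a column name to its list of values.
    """
    names = list(columns)
    size = max(1, math.ceil(len(names) / n_chunks))
    survivors = []
    for start in range(0, len(names), size):
        chunk = names[start:start + size]
        # Survivors go first, so they are retained and only new columns are tested.
        survivors = correlation_filter(columns, survivors + chunk, threshold)
    return survivors
```

Because survivors from earlier chunks are compared against every new column, no pair of retained columns ends up above the threshold. The cost is that later chunks are compared against a growing survivor list, so the saving depends on how many columns are actually redundant.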

You might enjoy reading this; there is a full white paper where we reduced a table with 15K variables: https://www.knime.com/blog/seven-techniques-for-data-dimensionality-reduction

And the accompanying workflow is here


I will also add a link to the latest version of our blog post on this topic from February, which may be useful:

https://www.knime.com/blog/three-new-techniques-for-data-dimensionality-reduction-in-machine-learning


Thank you! This is very useful.

I tried executing the workflow and ran into a few errors:

  1. Most of the reduction branches have an unconnected node for parameter optimization. These were easy to figure out and execute successfully.

  2. Reduction based on LDA: I could not figure out this error: “Linear Discriminant Analysis Apply 4:363:0:343:164 The model is expecting column “PCA dimension 9” which is missing in the input table”

  3. Auto-encoder based reduction: my Python installation could not be determined. I’ll have to read a KNIME guide on this, as I have not connected the two yet.
