correlated features

Hi, I am importing about 100-200 features into KNIME and using the feature selection loop. I was wondering if there is a way to remove correlated features with some sort of node before I start the feature selection loop, so that only uncorrelated features are passed in. I know I can plot a heat map of the features with the Linear Correlation node, but I am looking for something like that node that also removes the correlated features, or keeps one of each correlated pair. I'm not sure if it exists or not, just wondering if it does.

Sure - give the Correlation Filter node a try. :slight_smile:


Hi, I tried that node before. It doesn't get rid of the correlated features; it only shows which features are correlated.

I see the issue: I was confusing the Correlation Filter node with the Linear Correlation node!! The Correlation Filter node does work for this, thank you!!


I would avoid that. Search my other posts for an in-depth answer as to why. Simply put, you need to do it right, e.g. cross-validation for each loop iteration, and even then it's borderline p-hacking and extremely time-consuming for close to zero gain.

For feature selection, yes, removing correlated features is the most important part (after fixing/removing missing values). After that you can remove constant or low-variance columns (Constant Value Column Filter and Low Variance Filter nodes). As an additional step I like random forest feature importance: I tend to scale the values so the most important feature equals 1 and filter based on this "relative importance". The H2O random forest has this output by default; for the KNIME one you need to create it yourself. Besides importance you can then also filter on the "top n" features, e.g. keep only the 50 most important ones. If you have few rows, this can help you stick to the rule of thumb of about 10 rows per feature.
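For anyone who wants to see the idea outside of KNIME, here is a rough Python sketch of the same node chain (low-variance filter, correlation filter, then random-forest "relative importance" with a top-n cut). The thresholds, column names, and data are made up for illustration; the KNIME nodes do the equivalent via their dialogs.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold

# Toy data standing in for the imported feature table
X, y = make_classification(n_samples=200, n_features=50, random_state=0)
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

# 1) Constant / low-variance filter (like the KNIME Low Variance Filter node)
vt = VarianceThreshold(threshold=0.01)
df = df.loc[:, vt.fit(df).get_support()]

# 2) Correlation filter: drop one column of every highly correlated pair
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
df = df.drop(columns=to_drop)

# 3) Random-forest "relative importance": scale so the top feature = 1,
#    then keep only the top-n features
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(df, y)
rel_imp = pd.Series(rf.feature_importances_, index=df.columns)
rel_imp = rel_imp / rel_imp.max()
keep = rel_imp.sort_values(ascending=False).head(20).index
df = df[keep]
print(df.shape)
```

The 0.9 correlation cutoff and the top-20 cut are arbitrary starting points; in practice you would tune them against your own data.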


Hi @beginner
Thank you so much for the insight!! I used the correlated features, constant, and low variance filters and my models did improve. However, I wanted to ask if there's a specific reason you stay away from the feature selection loop. I have a small dataset and a large number of features, so I felt that a loop to select the best 30 or so features would be very helpful. In fact, when I combined the 3 filtering nodes plus the loop, I got some of my best models in terms of consistency on cross-validation and the test set. I was looking through your old posts for the specific reason; could you direct me to a specific post if you mentioned it before? Thank you so much once again!!

The fewer rows and the more features you have, the more likely you are to find some "accidentally" well-performing feature set. You are essentially trying thousands to millions of combinations, which simply leads to a non-zero chance of some of them producing a good-enough model by chance.

Besides that, again, you need to do it right, not like in the examples. You need a cross-validation loop to assess each feature combination; otherwise you are just optimizing for one train/test split, and that split usually has a large impact on model performance by itself. And of course you are only allowed to use the training set for this loop, meaning even fewer rows. You can't make this selection with all rows, or else you are leaking data and your performance statistics are wrong.
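The point above can be sketched in a few lines: feature selection has to happen *inside* each cross-validation fold, using only the training fold, so the held-out fold never influences which features are kept. This is an illustrative scikit-learn sketch, not the KNIME loop itself; the fold count and top-10 cut are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=40, random_state=0)

scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

    # Select features using ONLY the training fold (no leakage)
    selector = RandomForestClassifier(n_estimators=100, random_state=0)
    selector.fit(X_tr, y_tr)
    keep = np.argsort(selector.feature_importances_)[::-1][:10]

    # Fit and score on the reduced feature set
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_tr[:, keep], y_tr)
    scores.append(accuracy_score(y_te, model.predict(X_te[:, keep])))

print(f"mean CV accuracy: {np.mean(scores):.3f}")
```

Running the selection once on the full dataset and then cross-validating would leak test-fold information into the feature choice, which is exactly the mistake described above.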

If you remove correlated features, tree-based models should have no issue ignoring unimportant features. So removing more features helps with faster run-time and can lead to simpler models, but probably not much better ones.

Also, as far as I remember, you had a thread about fingerprints, i.e. working with chemicals. Without knowing the exact context I would be careful there too. Within a limited context it's usually possible to make OK to very good models, but they often don't transfer to the real world, or, said otherwise, to newly made molecules. E.g. you should really apply the model to one series only and only make predictions for that same series.
It's also advisable to try a time-split validation, e.g. make your train/test split based on assay or measurement date. This simulates real-world model performance better, as you always predict only future molecules. Often this will make models fall apart, because the new molecules are just a little too different.
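A time-split is mechanically simple: sort by date and train only on the older portion. A minimal pandas sketch, with made-up column names and a made-up 80/20 cutoff:

```python
import pandas as pd

# Toy table: one measurement per month (columns are hypothetical)
df = pd.DataFrame({
    "measured_on": pd.date_range("2020-01-01", periods=10, freq="MS"),
    "activity": range(10),
})

df = df.sort_values("measured_on")
cutoff = df["measured_on"].quantile(0.8)      # oldest ~80% -> train
train = df[df["measured_on"] <= cutoff]
test = df[df["measured_on"] > cutoff]
print(len(train), len(test))
```

Unlike a random split, every test row is strictly newer than every training row, which is what makes the estimate closer to prospective use.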

Thank you so much!! This is all very helpful, and I'll definitely incorporate your advice into my models. I have been doing a 60/20/20% training/cross-validation/test split and it's working well. I have noticed "accidentally" good models that give incredible results on the 20% cross-validation set but terrible results on the test set. The correlation, low variance, and constant value filters have thus far really helped limit this to some degree. I'll try to see if a time-split validation can apply to the model I am running. Thank you once again!!


You can read more about it here (paywall):

to give the original author credit.

