help with feature selection scoring options

Hi, I’m wondering if there’s a way to do feature selection based on two scoring/optimization options, or to make it easier to look at another metric in addition to the one being optimized.

For example, I’m trying to find a model that minimizes RMSE but also has good R2 values. When I use the feature selection node and minimize RMSE, I am not able to see the R2 in the feature selection filter, so I have to select and evaluate one model at a time until I find one with a good R2. Any advice on an easier way to do this?

My advice would be not to use the feature selection loop at all. I’ve written about this on this forum multiple times, so you can look up those posts for the details on why.

Instead of using that loop, you should do feature selection by eliminating correlated features (linear correlation + correlation filter node) and using the low variance filter (or constant value filter). The low variance filter is tricky, as features should be normalized beforehand, and in my experience it can lead to removing the wrong features or too many of them. So use low variance with care. Constant value is easier.
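To make the two filters concrete outside of KNIME, here is a minimal Python sketch. It is not how the KNIME nodes are implemented: the greedy rule (keep a column only if it is not too correlated with any column already kept), the 0.9 threshold, and the toy data are all assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def drop_constant(columns):
    """Keep only columns that take more than one distinct value."""
    return {name: vals for name, vals in columns.items() if len(set(vals)) > 1}

def drop_correlated(columns, threshold=0.9):
    """Greedy filter: keep a column only if |r| with every kept column is <= threshold."""
    kept = {}
    for name, vals in columns.items():
        if all(abs(pearson(vals, other)) <= threshold for other in kept.values()):
            kept[name] = vals
    return kept

# Toy data (made up for illustration): "b" is 2 * "a", "c" is constant.
data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [7.0, 7.0, 7.0, 7.0, 7.0],
    "d": [5.0, 1.0, 4.0, 2.0, 3.0],
}
filtered = drop_correlated(drop_constant(data), threshold=0.9)
print(sorted(filtered))
```

Note that with a greedy rule like this, which of two correlated columns survives depends on column order, so the result can differ from what the correlation filter node keeps.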
If you then still have too many features, you can use Random Forest feature importance. In fact I tend to use relative feature importance, where the most important feature = importance 1. Then you can define a cut-off, say 0.05 (but it depends on your model and how many features you want to retain), and remove all features that have an importance of less than 5% of your most important feature.
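The relative importance cut-off is just a rescaling followed by a threshold. A minimal Python sketch, assuming you already have raw importances from some random forest implementation; the function name, feature names, and numbers below are hypothetical.

```python
def select_by_relative_importance(importances, cutoff=0.05):
    """Scale importances so the top feature is 1.0 and keep those at or above cutoff."""
    top = max(importances.values())
    if top <= 0:
        return {}
    relative = {name: value / top for name, value in importances.items()}
    return {name: rel for name, rel in relative.items() if rel >= cutoff}

# Hypothetical raw importances, e.g. as exported from a random forest learner.
raw = {"logP": 0.40, "mol_weight": 0.20, "tpsa": 0.10, "n_rings": 0.01}
selected = select_by_relative_importance(raw, cutoff=0.05)
print(selected)
```

Here "n_rings" has a relative importance of 0.025, below the 0.05 cut-off, so it is the only feature dropped.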

Correlated features must absolutely be removed. They negatively impact all types of models. Constant features simply slow down the calculation and by definition have zero value, so they should also always be removed. The rest depends on the type of model you use, the amount of data, and your hardware. Tree-based models are pretty robust to useless features. For other models the curse of dimensionality may apply. With 100s or 1000s of features it can really slow down training massively (exponentially) for all types of models.

A rule of thumb is to have at least 10 times more observations than features.

Off Topic:

Some also advocate that you should use domain knowledge and only add features to your data set that actually are or could be relevant. I’m not sure I fully agree with that notion, beyond the obvious cases such as the number of storks having nothing to do with the birth rate. I don’t agree with it because it adds the modeler’s preconceptions to the model. I mean, we are using these models because we don’t know what affects the outcome.


Thanks @beginner. I actually had about 5K features and reduced them down to a few hundred after removing constant and highly correlated ones. I will definitely go with your advice and try other feature selection methods like random forests.

The reason I started with the feature selection node is that the genetic algorithm there and leave-one-out cross-validation are typically used in the literature for the data I’m working with (QSPR models for property prediction from chemical structure). I’m lucky if I have more than a hundred rows of data. I worry that this type of validation can give a model that looks good but does not have great predictive power, but I’m not sure what else to do with such small data sets.
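For reference, leave-one-out cross-validation can report RMSE and R2 side by side for every candidate, so nothing has to be checked one model at a time. A minimal Python sketch, assuming a one-feature linear model stands in for the real learner; the data and the feature names "good" and "noise" are made up.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx if sxx else 0.0
    return slope, my - slope * mx

def loo_predictions(x, y):
    """Predict each row from a model fitted on all other rows."""
    preds = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + intercept)
    return preds

def rmse(y, preds):
    """Root mean squared error of the held-out predictions."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, preds)) / len(y))

def r2(y, preds):
    """Coefficient of determination against the mean of y."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, preds))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Toy data (made up): "good" tracks the target, "noise" does not.
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
features = {
    "good": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}
report = {}
for name, x in features.items():
    preds = loo_predictions(x, y)
    report[name] = (rmse(y, preds), r2(y, preds))
    print(f"{name}: RMSE={report[name][0]:.3f}  R2={report[name][1]:.3f}")
```

The R2 computed on held-out predictions can be negative when a model predicts worse than the mean, which is worth seeing next to the RMSE.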

Off-topic:
I’m having trouble with the feature selection filter. Manual selection appears to not work all the time. I get different features selected in the output after executing the node. I saw a couple of posts complaining that their selected features are not giving the expected R2 after using this node. My current solution is to use the threshold value and control it with a flow variable.

So basically you’re using Dragon descriptors and/or fingerprints on a very small data set. That’s not going to work unless the data set is a series, and then the model can only be applied to that series, and even then within limits.

Genetic algorithms are a waste of time in my opinion. They don’t really compute meaningfully faster than backwards elimination, which is a problem in itself. Why are these strategies a problem? Because you are simply trying different combinations of features until you find one that somewhat fits. That’s basically the definition of “p-hacking” or “data dredging”.
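A small simulation shows why searching through combinations is risky. In this Python sketch the target and every feature are pure random noise, yet picking the best feature after the fact finds a correlation that looks meaningful. The sizes (100 rows, 1000 features) and the seed are assumptions for illustration, and a single-feature search is a much weaker search than a genetic algorithm over subsets.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_rows, n_features = 100, 1000  # sizes chosen for illustration only

# Target and every feature are independent Gaussian noise.
target = [rng.gauss(0, 1) for _ in range(n_rows)]
features = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]

abs_r = [abs(pearson(f, target)) for f in features]
best = max(abs_r)
typical = sorted(abs_r)[len(abs_r) // 2]  # median |r|
print(f"median |r| = {typical:.2f}, best |r| after searching = {best:.2f}")
```

The gap between the median and the best value comes purely from the search, since there is no real signal in the data.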

Using random forest feature importance and keeping the top 10-20 features could be an option, but I wouldn’t be very hopeful of getting anything usable out of just 100 observations, even though predicting phys-chem properties usually works surprisingly well using descriptors.

