Dear KNIME community,
I have a short question about feature selection using random forest ensemble methods.
I have around 100 features (all numeric) and one target classification variable. My idea is to train a model on these features to predict the target. I would like to rank the features by importance, because 100 features is simply a lot and I need to reduce them. I am looking for something similar to what the scikit-learn package offers for low-cardinality features. Scikit-learn describes it here: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
So my question is: is there a node, or a combination of nodes, that can perform a similar feature ranking for me? Or are there other similar methods (nodes) in KNIME that can help me select the relevant features?
Thank you in advance.
As @mlauber71 mentioned, the H2O Random Forest Learner makes this pretty straightforward - it calculates feature importance for you on a separate output port, which is nice. Check out this example workflow:
Thank you. But this workflow itself has nothing to do with feature selection, because it already does a prediction. My question was how to extract the relevant features BEFORE actually training the forest.
Or do you mean that H2O cross-validation does not require a feature selection step at all? That is, feature selection is already integrated in H2O, and I can throw a 10000-dimensional dataset at it, no matter how many and which features I have?
As far as I know, one must reduce the features first and only then build the prediction model. Please, if I am wrong, explain your H2O logic to me. Thank you.
Well, you could just run it two times. Use the first round to eliminate features, and use the remaining features for one more round of model building. One benefit could be a more stable model and fewer variables that you would have to constantly provide once the model is in production. But this would also depend on your data and your business case (there could be a trade-off between a stable, robust model and one that also uses more exotic features to find small and maybe interesting groups).
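As a rough sketch of the two-round idea outside KNIME, here is how it might look in scikit-learn. The synthetic data (750 rows, 100 numeric features), the cut-off of 20 features, and the forest settings are all assumptions for illustration, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 750 rows, 100 numeric
# features, binary target, only 10 features actually informative.
X, y = make_classification(n_samples=750, n_features=100, n_informative=10,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Round 1: train on all features and rank them by impurity-based importance.
rf_all = RandomForestClassifier(n_estimators=300, random_state=0)
rf_all.fit(X_train, y_train)
ranking = np.argsort(rf_all.feature_importances_)[::-1]

# Keep the top_k features (top_k = 20 is an arbitrary choice for this sketch).
top_k = 20
keep = ranking[:top_k]

# Round 2: retrain using only the retained features.
rf_small = RandomForestClassifier(n_estimators=300, random_state=0)
rf_small.fit(X_train[:, keep], y_train)

# Compare both models on the held-out rows that played no part in the ranking.
acc_all = rf_all.score(X_test, y_test)
acc_small = rf_small.score(X_test[:, keep], y_test)
print(f"all features: {acc_all:.3f}, top {top_k}: {acc_small:.3f}")
```

The ranking is computed on the training split only, so the held-out accuracy of the second model is not flattered by the selection step.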
Other than that, there are several examples on the KNIME Hub of how to reduce features.
One big example is this workflow, which compares several techniques (it seems to have been updated since; some time back I had problems running it):
Other things you could do would involve techniques like PCA or eliminating highly correlated variables, and so on. But you specifically asked for variable importance.
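To make the correlation idea concrete, here is a minimal NumPy sketch of a greedy correlation filter. The function name `drop_correlated`, the 0.9 threshold, and the toy data are assumptions for illustration; it also assumes there are no constant columns, since their correlation is undefined.

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Return indices of columns to keep.

    Walks the columns left to right and keeps a column only if its absolute
    Pearson correlation with every already kept column is <= threshold.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(corr.shape[0]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

# Toy data: column 1 is a near-copy of column 0, column 2 is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = np.column_stack([a, 2.0 * a + rng.normal(scale=0.01, size=500), b])

kept = drop_correlated(X)
print(kept)  # [0, 2]: the near-duplicate column 1 is dropped
```

Which member of a correlated pair survives depends only on column order here; with domain knowledge you would probably choose the more meaningful one instead.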
100 features isn’t all that much, assuming you have a reasonable number (1000+) of rows/observations.
Let me ask back: why do you think it’s too many? Doing something without a rational reason usually isn’t a good idea.
The good thing about Random Forest (and tree-based methods in general) is that they can deal rather well with useless features. The only impact you will see is slower runtime. This is in contrast to other algorithms, which suffer more from the “curse of dimensionality”.
To get the feature importance from Random Forest you need to train a Random Forest model. There is no way around that. But it is very important to note that Random Forest (and other tree-based methods) can be negatively affected by highly correlated features. So before running Random Forest (be it for feature importance or as the final model) you will need to remove such highly correlated features.
On top of that, this needs to be done within a cross-validation loop, on the training set only.
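One way to keep the correlation filter inside the cross-validation loop, sketched here with scikit-learn rather than KNIME nodes, is to wrap it as a transformer and put it in a pipeline, so it is refit on the training part of every fold. The class name `CorrelationFilter`, the 0.9 threshold, the synthetic data, and the choice of ROC AUC are assumptions for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

class CorrelationFilter(TransformerMixin, BaseEstimator):
    """Hypothetical helper: drops columns highly correlated with a kept one."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold

    def fit(self, X, y=None):
        # Correlations are computed only from the data passed to fit,
        # which inside cross-validation is the training part of the fold.
        corr = np.abs(np.corrcoef(np.asarray(X), rowvar=False))
        keep = []
        for j in range(corr.shape[0]):
            if all(corr[j, k] <= self.threshold for k in keep):
                keep.append(j)
        self.keep_ = np.array(keep)
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.keep_]

# Synthetic stand-in (assumption): 750 rows, 100 features, 20 of them
# linear combinations of the 10 informative ones.
X, y = make_classification(n_samples=750, n_features=100, n_informative=10,
                           n_redundant=20, random_state=0)

pipe = Pipeline([
    ("decorrelate", CorrelationFilter(threshold=0.9)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])

# The whole pipeline, filter included, is refit on each training fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"mean ROC AUC over {len(scores)} folds: {scores.mean():.3f}")
```

Because the filter is part of the pipeline, the validation rows of each fold never influence which columns are dropped.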
Thank you for providing further Feature Selection workflows.
I have 100 features and approximately 750 rows (I can’t influence that; this is the actual data I received). Basically, all features are meant to tell me whether a customer buys a product or not, and all of them are numeric. At the end there is the target variable (Boolean 1/0): 1 means yes, the customer buys, and 0 means no. So this is a classic example of supervised learning.
So I guess that in my case it only makes sense to select features based on correlation (meaning: delete highly correlated features) and then simply feed the rest into the Random Forest Learner.
Would you see that as a good solution? And is your workflow (the one you posted above with H2O) a suitable example for that?
Personally I would want to know what the features mean. Domain knowledge matters.
Until you know more about their meaning, yeah, that’s about what you could do as a first sanity check. RF is usually good for a first check because it doesn’t need much tuning. If you don’t get at least some signal, most likely nothing can be done.