Feature Selection Question

The feature selection nodes help identify which features improve the accuracy of the overall model. Is there a way to improve the accuracy for one of the target classes over the other?

Here’s an example of what I mean:

There is a data set called Fruit.
There are 50 features.
The target is Type, and its two classes are Apples and Oranges.
I run the Forward Feature Selection node and it identifies the 3 features that best indicate whether a fruit is an apple or an orange.

However, I only want to know which features improve the accuracy of identifying Apples.

Is there a way to determine this in KNIME?

Welcome @trm318,

I guess you would want to use recall. The Scorer node calculates this quantity for binomial problems. So inside the Feature Selection Loop you can use a Scorer, take the value in the Recall column of the Apple row from its second output port, turn it into a flow variable, and use that to make the decision in the Loop End.
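For anyone unsure what that Recall value means, here is a minimal sketch in plain Python (not KNIME code). The function name and the toy labels are made up for illustration; it only shows that recall is computed per class, so the Apple row and the Orange row can differ.

```python
def recall_for_class(actual, predicted, positive):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy data, invented for this example.
actual = ["Apple", "Apple", "Apple", "Orange", "Orange", "Apple"]
predicted = ["Apple", "Orange", "Apple", "Orange", "Apple", "Apple"]

apple_recall = recall_for_class(actual, predicted, "Apple")    # 3 of 4 apples found
orange_recall = recall_for_class(actual, predicted, "Orange")  # 1 of 2 oranges found
print(apple_recall, orange_recall)
```

Optimizing on the Apple value alone ignores how well Oranges are recognized.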

Cheers,
Misha

Thank you for your prompt response, Misha! I’m still having some trouble. This is the current setup:

The only way I see options to bring recall into the picture is to attach the second port of the scorer to a numeric scorer:

When I attach the first output port of the scorer to the numeric scorer I see the option to select either one of the binary targets:

I’m assuming that I have to connect the Scorer to a Numeric Scorer (which I believe is meant for linear regression). No matter what the configuration is, I do not see an option to select “Recall” in the configuration of the Loop End. What am I missing? Thank you again for your help!

Just a little converting left to do so you can connect it to the loop end as a flow variable.
If you take that second output port and filter it down with the Row Filter and Column Filter nodes so that just one cell is left, the Apple recall, you can use the Table Row to Variable node to turn that statistic into a flow variable that you plug into your Loop End.
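As a rough picture of what those three nodes do to the table, here is a plain-Python sketch (not KNIME code). The table layout and the numbers are assumptions for illustration, not actual Scorer output.

```python
# Hypothetical accuracy-statistics table: one row per class, one column per statistic.
accuracy_stats = {
    "Apple": {"Recall": 0.75, "Precision": 0.75},
    "Orange": {"Recall": 0.50, "Precision": 0.50},
}

# Row Filter: keep only the Apple row.
rows = {row_id: row for row_id, row in accuracy_stats.items() if row_id == "Apple"}

# Column Filter: keep only the Recall column.
cells = {row_id: {"Recall": row["Recall"]} for row_id, row in rows.items()}

# Table Row to Variable: the single remaining row becomes name/value variables.
(single_row,) = cells.values()  # raises if the filters left more than one row
flow_variables = dict(single_row)
print(flow_variables)
```

The unpacking on the second-to-last step fails loudly if the filters did not narrow the table down to exactly one row, which is the situation you want before converting to a variable.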

Hope this works for you!

Screenshot with view of table after filters are applied:

Corey, thanks for the huge help, that worked perfectly!!

Great!
I’d just be careful optimizing for recall. A model that just says everything is an apple would have a perfect score in that regard, so there’s always a risk your optimization loop may select a variable with no predictive power and just say: “everything is an apple, but our recall is perfect, so this is still the best feature set!”

Maybe try some weighted average with another statistic as well to avoid situations like that?
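To make that concrete, here is a small plain-Python sketch. The data, the helper names, and the 50/50 weighting are all assumptions chosen for illustration; precision is just one possible second statistic.

```python
def recall(actual, predicted, positive):
    hits = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    total = sum(1 for a in actual if a == positive)
    return hits / total if total else 0.0


def precision(actual, predicted, positive):
    hits = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    total = sum(1 for p in predicted if p == positive)
    return hits / total if total else 0.0


# Toy data: 2 apples and 8 oranges, and a "model" that calls everything an apple.
actual = ["Apple"] * 2 + ["Orange"] * 8
predicted = ["Apple"] * 10

apple_recall = recall(actual, predicted, "Apple")        # perfect, yet meaningless
apple_precision = precision(actual, predicted, "Apple")  # only 2 of 10 calls are right

weight = 0.5  # hypothetical weighting; tune it to how much each error type costs you
combined = weight * apple_recall + (1 - weight) * apple_precision
print(apple_recall, apple_precision, combined)
```

The combined score drops well below the perfect recall, so a loop that optimizes it is less likely to reward the "everything is an apple" feature set.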

Just something I was thinking about after replying in this thread. Best of luck with your project! Let us know how it goes.

Doing it like this is highly dangerous. I refer to my own post:

Every iteration of the feature elimination loop should contain a cross-validation loop to assess the features’ performance. Doing that on a single split will simply optimize for that single split and will not generalize at all.
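A minimal plain-Python sketch of the idea follows (not a KNIME workflow). The nearest-mean "model", the toy data, and all function names are hypothetical; the point is only that each candidate feature set gets a score averaged over several splits rather than one.

```python
def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k contiguous, near-equal folds."""
    folds, start = [], 0
    for i in range(k):
        size = n_rows // k + (1 if i < n_rows % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_score(rows, labels, k, fit, score):
    """Average the score over k train/test splits instead of trusting one split."""
    results = []
    for test_idx in k_fold_indices(len(rows), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(rows)) if i not in held_out]
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        results.append(score(model, [rows[i] for i in test_idx],
                             [labels[i] for i in test_idx]))
    return sum(results) / len(results)


def fit_nearest_mean(rows, labels):
    """Toy model: remember the mean of the first feature for each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        sums[label] = sums.get(label, 0.0) + row[0]
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def accuracy(model, rows, labels):
    """Predict the class whose stored mean is closest, then count matches."""
    predictions = [min(model, key=lambda c: abs(row[0] - model[c])) for row in rows]
    return sum(p == a for p, a in zip(predictions, labels)) / len(labels)


# Toy data with one candidate feature; real data should be shuffled before folding.
rows = [[1.0], [5.0], [1.2], [5.2], [0.9], [4.8], [1.1], [5.1], [1.0], [4.9]]
labels = ["Apple", "Orange"] * 5

cv_score = cross_validated_score(rows, labels, 5, fit_nearest_mean, accuracy)
print(cv_score)
```

In a feature selection loop, `cv_score` would be computed once per candidate feature set and used as the value to optimize.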

And that doesn’t even include the other problems this tactic has.
