First, I would like to mention that I am brand new to KNIME, so I apologize in advance for my lack of understanding. I am trying to recreate a modeling process I built in R. Part of my process in R was to run a Logistic Regression (Forward, Backward, Stepwise) and compare the variable selections. I am trying to do this in KNIME using the Feature Selection Loop Start (1:1). I started with Backward Elimination, and so far I have run the Feature Selection Loop Start (1:1) - Backward Elimination on my training set and then a Logistic Regression Learner. I am confused about why all of the examples I see then link to a Logistic Regression Predictor attached to the training set. I don’t want to use the training set at this point. In R, when you run forward or backward regression, you end up with both the model and the feature selection. How do I achieve this result in KNIME?
We designed the Feature Selection nodes to be very flexible, which can sometimes lead to confusion.
I am not familiar with R’s feature selection, so I don’t know how it evaluates the performance of a given feature.
In KNIME we typically split the dataset into a training and a validation/testing set and use the training set to build the model. The split can happen before or within the loop, although I think doing it before is the cleaner solution, because each feature should then be evaluated on the same split for a fair comparison. Once we have the model, we evaluate it with e.g. the Scorer node and use one of its metrics (e.g. Accuracy) as the score for the current loop iteration.
Note that the Feature Selection Loop End node accepts a flow variable as input, so you can use whichever metric you see fit for your selection process. (Once again, I don’t know exactly what R does, but I assume it handles the split and evaluation under the hood.)
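If it helps to see the idea outside of KNIME, here is a rough sketch of what such a loop boils down to, written in plain R (the data frame `train`, the 0/1 target column `y`, and the 70/30 split are made-up assumptions; validation accuracy stands in for the Scorer node):

```r
# Rough sketch of backward elimination on a fixed split (not KNIME code).
set.seed(42)
idx    <- sample(nrow(train), floor(0.7 * nrow(train)))
fit_df <- train[idx, ]                           # used to fit candidate models
val_df <- train[-idx, ]                          # used to score candidate models

features <- setdiff(names(train), "y")           # y assumed coded 0/1
while (length(features) > 1) {
  scores <- sapply(features, function(f) {
    cand  <- setdiff(features, f)                # drop one feature at a time
    model <- glm(reformulate(cand, "y"), data = fit_df, family = binomial)
    pred  <- ifelse(predict(model, val_df, type = "response") > 0.5, 1, 0)
    mean(pred == val_df$y)                       # validation accuracy = loop score
  })
  # remove the feature whose removal hurts the score the least
  features <- setdiff(features, names(which.max(scores)))
  cat(length(features), "features left, best accuracy:", max(scores), "\n")
}
```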
I hope this makes things (slightly) clearer =) If not, feel free to ask.
Then you just select the features optimized for this one train-test split. It would in fact be better to use cross-validation, which in turn greatly increases the runtime. And even with cross-validation, the whole backward elimination is in my opinion not a good approach and way too time-consuming.
I’ve posted this multiple times already: it simply removes the feature with the smallest impact. This can mean two things: the feature is irrelevant, OR it is only relevant for a small fraction of the data points, but for those it could be highly relevant.
It’s much better and simpler to do low-variance and correlation filtering (if you really have that many features) and use a robust learning algorithm that detects useless features.
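For what it’s worth, that pre-filtering is only a few lines in R with the caret package (the data frame `X` of numeric predictors and the cutoff are placeholders); in KNIME, nodes like Low Variance Filter and Correlation Filter should do the same job:

```r
# Quick pre-filtering sketch with caret (cutoffs are placeholders to tune).
library(caret)

nzv <- nearZeroVar(X)                              # (near) zero-variance columns
if (length(nzv) > 0) X <- X[, -nzv]

high_cor <- findCorrelation(cor(X), cutoff = 0.9)  # columns with pairwise correlation > 0.9
if (length(high_cor) > 0) X <- X[, -high_cor]
```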
Thank you for your reply, Adrian! My modeling process splits my data into Train/Test/Validation before getting into the Backward Feature Elimination node. Then, within the node, there is an additional partitioning of the training dataset. Is this how Backward Feature Elimination is intended to be used?
First of all: Beginner is right in pointing out the greedy nature of backward feature elimination, and ideally you would use cross-validation within the feature elimination loop, but often this approach is just too time-consuming to be useful during model prototyping.
It makes sense to first reduce your feature space by applying simpler selection approaches, as mentioned by beginner.
@aross That depends on your overall modeling process, but yes, you should not use the test or validation set for feature selection, as this would invalidate them as a means of estimating the final model’s generalization ability. If you use a Partitioning node within the feature elimination loop, you should ensure that its random seed is constant throughout the iterations; otherwise each feature is evaluated on a different split, and you can’t really compare them. However, even if you use the same split for all iterations, you can only argue about the most important features for this particular split (and model family).
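To illustrate why the fixed seed matters, here is a tiny sketch in plain R (nothing KNIME-specific):

```r
# With the same seed the split is reproduced exactly, so every candidate
# feature set is scored on identical rows; with different seeds it is not.
split_rows <- function(n, seed) { set.seed(seed); sample(n, floor(0.7 * n)) }
identical(split_rows(1000, 1), split_rows(1000, 1))  # TRUE  -> scores comparable
identical(split_rows(1000, 1), split_rows(1000, 2))  # FALSE -> scores not comparable
```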
I am sorry for all the ifs, but there is a lot to keep in mind.
That makes sense, Adrian. Thank you for the explanation. I was confused by the training set in the Backward Feature Elimination. Can you tell me if this is an accurate statement: this process eliminates features based on the accuracy of the model when it is applied to the test data? After some research, I believe the step() function in R (which performs stepwise regression) uses AIC comparisons to remove features. I could be wrong, but I don’t think using AIC to remove features requires a test data set. (The reason I mention R is that I am trying to recreate an existing model I have in R.) Thank you in advance for your assistance.
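For reference, the R side I am trying to reproduce looks roughly like this (the data frame and target names are just placeholders):

```r
# Backward stepwise selection in R: step() drops terms as long as the AIC improves.
# The AIC (2k - 2*logLik) is computed on the fitted training data, so no separate
# test set is involved in the selection itself.
full_model    <- glm(y ~ ., data = train, family = binomial)
reduced_model <- step(full_model, direction = "backward")
summary(reduced_model)
```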
Your statement is correct for the examples you can find on our public server, but you could also use something like the AIC as the selection criterion by calculating it, e.g., with the Math Formula node.
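If you go that route, the quantity to rebuild in the Math Formula node is AIC = 2k - 2*ln(L), with k the number of estimated parameters and L the model’s likelihood. A quick R sanity check of that expression (model and column names are hypothetical):

```r
# The manually computed AIC matches R's built-in AIC() for a binomial glm,
# so the same expression can be recreated in a Math Formula node, provided the
# log-likelihood and parameter count are available in your workflow.
model <- glm(y ~ ., data = train, family = binomial)
k     <- length(coef(model))                   # number of estimated parameters
aic   <- 2 * k - 2 * as.numeric(logLik(model))
all.equal(aic, AIC(model))                     # TRUE
```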