Backward Feature Elimination


The Backward Feature Elimination tools are a great way to improve the quality of your predictive models and to find out which columns the model actually needs in order to be predictive.


But besides this, it would also be useful to systematically remove rows of data and rerun the Learner and Predictor to see whether the models become more predictive, because some rows can make a model worse, whether the data in them is spurious or simply causes the model to lose predictive power. Is there any way of doing this in an automated fashion, like the "Feature Elimination" metanode does for columns? Otherwise it has to be done manually, which is very time-consuming. So basically: take certain rows out, see how predictive the model is compared to when those rows were present, and continue doing this systematically until all the rows have been explored.





Try to partition your data into training, validation and test sets.

Train your network with the training data and validate the results with the validation data (make a prediction on the validation data). If you make the prediction on the training data, you may run into overfitting while tuning the parameters of your learner.

In the end, test your model with the test data.

To do the training and validation on different partitions, you can run a cross-validation (there is a meta-node loop for this in KNIME).
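For anyone who wants to see the k-fold idea behind KNIME's cross-validation loop spelled out, here is a minimal sketch in plain Python. The "learner" used here (predict the mean of the training targets) is purely illustrative and not a KNIME API:

```python
def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k disjoint folds."""
    folds = [[] for _ in range(k)]
    for i in range(n_rows):
        folds[i % k].append(i)
    return folds

def cross_validate(xs, ys, k, fit, predict):
    """Average validation error over k train/validation splits."""
    folds = k_fold_indices(len(xs), k)
    errors = []
    for fold in folds:
        hold = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in hold]
        train_y = [y for i, y in enumerate(ys) if i not in hold]
        model = fit(train_x, train_y)
        err = sum(abs(predict(model, xs[i]) - ys[i]) for i in fold) / len(fold)
        errors.append(err)
    return sum(errors) / k

# Toy "learner": the model is just the mean of the training targets.
fit_mean = lambda xs, ys: sum(ys) / len(ys)
predict_mean = lambda model, x: model
```

Each row is held out exactly once, which is the same guarantee the X-Partitioner/X-Aggregator loop gives you.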


I have had a look at the Cross Validation example workflow, which is a useful feature, but if there are bad data points that lead to an inaccurate (or suboptimal) model, it won't tell you which the bad data points are, nor will it give you the best Learner model at the end of the loops. So there is still the problem of trying to get the best possible model from the data.

I would like some kind of feature like the Backward Feature Elimination Filter node, but where that node suggests which columns to keep and which to remove, this new node would suggest which rows to keep to get the most optimal model. You would need the X-Aggregator to have a model outport, and then a Row Feature Elimination Filter node could be developed. This could be quite powerful, I feel.
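No such node exists in KNIME as far as I know, but the greedy search the post describes can be sketched outside KNIME. This is a hypothetical illustration, not a KNIME API: repeatedly drop the single row whose removal most improves a user-supplied score, and stop when no removal helps. The negative-variance score in the example is an arbitrary stand-in for "retrain the model and measure validation accuracy":

```python
def backward_row_elimination(rows, score, min_rows=2):
    """Greedy backward row elimination: repeatedly drop the row whose
    removal most improves score(rows); stop when nothing improves."""
    rows = list(rows)
    best = score(rows)
    improved = True
    while improved and len(rows) > min_rows:
        improved = False
        for i in range(len(rows)):
            candidate = rows[:i] + rows[i + 1:]
            s = score(candidate)
            if s > best:  # higher score = better model
                best, rows, improved = s, candidate, True
                break
    return rows, best

# Example: score = negative variance of the remaining rows,
# so the search removes the value that inflates variance most.
def neg_variance(rows):
    m = sum(rows) / len(rows)
    return -sum((r - m) ** 2 for r in rows) / len(rows)

kept, best = backward_row_elimination([1, 1, 1, 10], neg_variance)
# kept == [1, 1, 1]
```

Note that exactly as with backward feature elimination, this is a greedy heuristic, not an exhaustive search over all row subsets, so it can miss the globally best subset.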



Well, if you want to keep just the "good rows", the model will simply fit to that data. But isn't the question whether these "outliers" are perhaps representative? That is, there will be some of these outliers in the future prediction data as well, and the model has to react to them. Therefore it would be good if your model generalizes across a lot of variants.

You said an "optimal model". Optimal for what? Just the good data? The quality of your prediction model is not only how it predicts the training data. You can run in an overfitting. The Cross-Validation helps you to test your prediction on different inputs.

If you really want to learn just the "good data" (which I think is a mistake, because in the real world there will probably also be other data), you could use outlier detection to find the values that are outliers because there aren't many other instances nearby. There are several algorithms, but none of them are handled in KNIME (as far as I know). There is a local outlier detection in R (google for it) which you can call via the R Snippet node. Another option would be to define some "norms" to compare the data against, and filter out the instances which don't fit the norm.
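To illustrate the idea (this is not the R local-outlier-factor implementation mentioned above, just a much simpler stand-in): a point whose k-th nearest neighbour is unusually far away has few other instances nearby and can be flagged as an outlier. The median-based threshold rule here is an assumption chosen for the sketch:

```python
def kth_neighbour_distance(points, k):
    """For each point, the distance to its k-th nearest other point."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1])
    return scores

def flag_outliers(points, k=3, factor=3.0):
    """Flag points whose k-th neighbour distance exceeds
    `factor` times the median score (a simple heuristic)."""
    scores = kth_neighbour_distance(points, k)
    med = sorted(scores)[len(scores) // 2]
    return [s > factor * med for s in scores]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
print(flag_outliers(pts, k=2))  # only the last point is flagged
```

Proper local outlier factor additionally compares each point's neighbourhood density to its neighbours' densities, which makes it robust when clusters have different densities; this sketch only captures the basic intuition.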

Hope I could help a bit.



Thanks for the detailed description of building models; it's much appreciated. The model I am building is a predictive model for the biological activity of molecules. Unfortunately, biological assays can sometimes give inaccurate data, and therefore what I wanted was a way of detecting which data points do not fit with the rest of the data used to build the model (i.e. outliers). I could then look at these and decide whether or not to keep them in the model. I appreciate there is a risk of overfitting if I leave all the outliers out, but I also don't want to keep them all in, as then the model may not predict the majority of the compounds correctly.

Thanks, Simon.

Hi Simon,

I have just seen this old thread, and I have been dealing with a similar problem myself. Every attempt I have made to optimize the model has failed to produce a satisfactory accuracy level in its predictions. I am curious about your current approach to such issues: have you found a better way to optimize your models?