I often work with data sets that can give me quite different R2/RMSE values depending on how I choose the training set. From what I understand, a good way to deal with this is to do a 5-fold cross-validation and then use the averaged model from the folds as the final prediction model. Would this be the right way to handle this?
In general, a model building workflow should look more or less like the following:
Clean / prepare data
Split data into train/test
Create the model building workflow with train data only
The last point includes feature selection, model type selection, parameter optimization, etc. Each of these needs to be done with cross-validation for the reasons you stated (e.g. you want to optimize for the general use case, not for one specific train/test split).
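The thread is about KNIME, but the same idea can be sketched in Python with scikit-learn (the data set, model type, and parameters here are placeholders, not a recommendation):

```python
# Sketch of the workflow above: prepare data, split off a hold-out
# test set, then do ALL model building (here just a 5-fold CV R2
# estimate) on the training data only.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Ridge

# 1. Clean / prepare data (synthetic stand-in for your own data)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0,
                       random_state=0)

# 2. Split data into train / hold-out test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 3. Model building with train data only: 5-fold CV gives one R2
#    metric per fold; the test set is not touched here at all.
scores = cross_val_score(Ridge(alpha=1.0), X_train, y_train,
                         cv=5, scoring="r2")
print(scores.mean())
```

The important detail is that `X_test`/`y_test` never enter the cross-validation; they stay untouched until the final comparison step.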
Feature selection itself should also be done on each specific split separately, so each split can have different features. Feature selection is part of your model, but you can optimize the selection process itself, like the threshold for a correlation filter, etc. But I digress.
Once the feature selection process, model type and parameters have been determined, I usually do a final cross-validation and look at the metrics. What matters is the actual performance, but also the variability: the more variable the individual splits are, the fewer guarantees you can make about your model's actual performance. In short, high variability between splits is bad.
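A small numeric illustration of why the spread matters (the fold values are made up): two models can have the same average R2 across folds while one is far less trustworthy.

```python
import numpy as np

# Hypothetical per-fold R2 values for two models with the SAME mean.
stable   = np.array([0.80, 0.82, 0.79, 0.81, 0.78])
unstable = np.array([0.95, 0.60, 0.90, 0.65, 0.90])

print(stable.mean(), stable.std())      # mean 0.80, small spread
print(unstable.mean(), unstable.std())  # mean 0.80, large spread
```

Both average to 0.80, but the second model's performance swings heavily with the split, so you can say much less about how it will behave on new data.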
As a next step, the averaged metrics can be compared to your test set (also called the hold-out set). Is the performance comparable? The final CV can usually have slightly better performance, but if it is still in the same ballpark, that tells you something about the generalizability of the model.
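This comparison step could look roughly like the following scikit-learn sketch (again with synthetic stand-in data and an arbitrary model):

```python
# Compare the CV average on the training data against the hold-out set.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=300, n_features=10, noise=15.0,
                       random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=2)

# Average CV metric, computed on training data only
cv_mean = cross_val_score(Ridge(), X_tr, y_tr, cv=5, scoring="r2").mean()

# Hold-out metric: fit on all training data, score on the test set
holdout = Ridge().fit(X_tr, y_tr).score(X_te, y_te)

# If cv_mean and holdout are in the same ballpark, that supports
# the generalizability of the workflow.
print(cv_mean, holdout)
```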
Which model do you select? Simple: you apply the workflow you determined on the training set to the full data, with the same feature selection process, same model type, same model parameters. That is the model you will use for new predictions.
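So the final model is produced by one more fit, not by averaging fold models. A hedged sketch, assuming the selection process and parameters were fixed in the previous steps:

```python
# Final model: re-run the SAME workflow (same selection process, same
# model type, same parameters) on the FULL data set, then deploy that.
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=3)

final_model = Pipeline([
    ("select", SelectKBest(f_regression, k=5)),  # process fixed earlier
    ("model", Ridge(alpha=1.0)),                 # parameters fixed earlier
]).fit(X, y)  # fit on ALL the data, train + test

# final_model.predict(new_X) is what you use for new predictions
predictions = final_model.predict(X)
```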
Cross-validation is there to optimize for a more general training set (rather than one specific split) and to check model variability, not to produce a final model.
Do you know of any workflows on the hub or elsewhere that show this process? I haven’t seen any with KNIME that goes into detail on creating CVs (with train data only), comparing to a hold-out set and selecting the final model.
A few parts were not clear to me, like why the last CV (the last fold in the 5-fold CV, for example?) would have better performance. Maybe I don’t understand how you’re creating the CVs with train data only. Would you also need to create an averaged model from the models of the different folds? KNIME has no option to do that. If there’s an example, that would be great. Thanks again for the detailed response.
You misunderstood me. It’s not the last fold that has better performance, but the CV on the training set compared to the hold-out set.
Also, I’m not sure why you keep coming back to your averaged model. What would that even be? When we talk about averages here, it always means averages of the performance metrics of the model(s), like accuracy or R2 or whatever other metric is relevant for your use case.