Keep track of training set performance within grid search with cross validation

Hello,

I am familiar with machine learning using Python and the scikit-learn package, but I am new to KNIME. I’m investigating the use of KNIME 5.2 as a tool to teach machine learning to students who have no coding background and are apprehensive about coding.

I’ve managed to put together a workflow that implements a grid search to find the value of the minNumberRecordsPerNode parameter of the Decision Tree Learner node that maximizes accuracy on the test data.

In the Parameter Optimization Loop End node I can find the accuracy metric for each value of the minNumberRecordsPerNode parameter.

I would like to somehow also calculate mean accuracy on the training data for each value of the minNumberRecordsPerNode parameter, but cannot figure out how to make it work. The reason I am trying to do this is that I would like to make a line plot with the minNumberRecordsPerNode parameter on the x-axis and accuracy on the y-axis. Two lines would be plotted, one for training accuracy and one for test accuracy. This plot can serve as a reference for a discussion of overfitting and underfitting.
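
For reference, here is a minimal sketch of what I’m after, in scikit-learn terms (assuming the iris data as a stand-in for my dataset, and `min_samples_split` as only a rough analog of minNumberRecordsPerNode):

```python
# Hypothetical scikit-learn analog of the target plot. Assumptions: iris
# stands in for the real dataset, and min_samples_split is only a rough
# stand-in for KNIME's minNumberRecordsPerNode.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_values = [2, 5, 10, 25, 50, 100]
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"min_samples_split": param_values},
    cv=10,                    # 10-fold cross validation (the X-Partitioner loop)
    scoring="accuracy",
    return_train_score=True,  # also record accuracy on the training folds
)
search.fit(X, y)

# For a single-parameter grid, cv_results_ rows follow the grid order
plt.plot(param_values, search.cv_results_["mean_train_score"], label="training")
plt.plot(param_values, search.cv_results_["mean_test_score"], label="test")
plt.xlabel("min_samples_split")
plt.ylabel("mean accuracy")
plt.legend()
plt.show()
```

With `return_train_score=True`, GridSearchCV records accuracy on the training folds alongside the test folds, which is exactly the pair of curves I want to plot.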

I’ve tried connecting a second Decision Tree Predictor node to the training data output port of the X-Partitioner and then connecting both Decision Tree Predictor nodes to a Loop End node with two input ports (replacing the X-Aggregator node). This seems to collect only the data from the last iteration of the parameter optimization loop.

Has anyone found a way to do this or something similar? In essence, I have a nested loop (X-Partitioner loop within Parameter Optimization loop) and I want to collect and concatenate accuracy on the training data for each iteration of the Parameter Optimization loop. I will also need to somehow add the parameter setting as a column in the data, so that I can calculate the training accuracy for each value of the minNumberRecordsPerNode setting.
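
For anyone who thinks better in code, here is a rough sketch of the nested-loop logic I’m describing, again in hypothetical scikit-learn terms:

```python
# Rough sketch of the nested-loop logic above (hypothetical data;
# min_samples_split again stands in for minNumberRecordsPerNode).
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=42)

rows = []
for param in [2, 5, 10, 25, 50, 100]:        # parameter optimization loop
    for train_idx, test_idx in cv.split(X):  # cross-validation loop
        model = DecisionTreeClassifier(min_samples_split=param, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        rows.append({
            "min_samples_split": param,      # parameter setting as a column
            "train_accuracy": accuracy_score(y[train_idx], model.predict(X[train_idx])),
            "test_accuracy": accuracy_score(y[test_idx], model.predict(X[test_idx])),
        })

# mean training and test accuracy per parameter value
print(pd.DataFrame(rows).groupby("min_samples_split").mean())
```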

Thanks
MH


Hi @mhaney

Welcome to KNIME Forum

I think you are almost there; you just need to expand this part of your workflow with a few additional nodes, like below.

After the Scorer node there is a Table Transposer node, to turn the results of the iteration into a single row. With the Table Row to Variable node, you combine the results with the parameters of the iteration at hand.
[workflow screenshot]

The Parameter Optimization Loop End node has two output ports. The lower port gives you an overview of the parameters and the objective function you are optimizing.

Happy KNIMEing,

gr. Hans


Thanks, @HansS!

The Table Transposer and Table Row to Variable nodes look very useful.

I want to keep track of accuracy on both the training and test data for each value of the minNumberRecordsPerNode parameter, so I used a Table Row to Variable loop instead of the Parameter Optimization loop, because the generic Loop End node can take and stack two inputs. For the same reason, I also replaced the X-Aggregator node with a generic Loop End node to close the X-Partitioner loop.

The rough workflow is now working, but needs annotations.

I would also like to ensure that all the parameter candidates are trained and tested on exactly the same splits. I think that setting the random seed on the X-Partitioner node will probably do this. I tried to put the parameter optimization loop inside the cross-validation loop but couldn’t get that to work.
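
In scikit-learn terms, what I’m hoping the seed achieves is something like this sketch (an assumption on my part; I still need to confirm the X-Partitioner behaves the same way):

```python
# Sketch: fixing the folds so every parameter candidate sees identical splits
# (assumes the same hypothetical setup as the earlier sketches).
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

cv = KFold(n_splits=10, shuffle=True, random_state=42)
splits = list(cv.split(X))   # materialize the folds once; reuse them

for param in [2, 5, 10, 25, 50, 100]:
    for train_idx, test_idx in splits:  # identical folds for every candidate
        ...                             # train and score as before
```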

Link to draft workflow:

Best,
MH
