Crossvalidation in Random Forest

Good evening,

I want to train an RF learner for a two-class problem (see attached workflow). However, execution stops at the X-Aggregator node after 10-fold cross-validation. The KNIME console shows the error “Execute failed: Encountered duplicate row ID "Row0"”, even though the preceding Partitioning node separated the training and test sets. What is wrong?

Thank you,

Example.knwf (49.2 KB)

Hi @MarcB -

The problem arises because you have an extra Partitioning node that isn’t needed. In a CV context, the X-Partitioner is intended to replace that node entirely (much like the X-Aggregator replaces a Scorer node).

Remove the Partitioning node and connect the bottom port of the X-Partitioner to your predictor. Then your indices should line up OK.
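
To see why the row IDs collide, here is a toy Python model of the loop. It is only a sketch, not KNIME's implementation: `k_fold_splits` and `aggregate` are hypothetical stand-ins for the X-Partitioner and X-Aggregator, and it assumes the predictor was scoring the same fixed test set from the Partitioning node on every iteration.

```python
def k_fold_splits(row_ids, k):
    """Yield (train_ids, test_ids) pairs; each row lands in exactly one test fold."""
    folds = [row_ids[i::k] for i in range(k)]
    for i, test_ids in enumerate(folds):
        train_ids = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train_ids, test_ids


def aggregate(per_fold_test_ids):
    """Collect the scored rows from all folds, rejecting repeated row IDs."""
    seen = set()
    for test_ids in per_fold_test_ids:
        for row_id in test_ids:
            if row_id in seen:
                raise ValueError(f"Encountered duplicate row ID {row_id!r}")
            seen.add(row_id)
    return seen


row_ids = [f"Row{i}" for i in range(20)]

# Intended wiring: in each iteration the predictor scores that fold's own test rows.
collected = aggregate(test for _, test in k_fold_splits(row_ids, k=10))

# Assumed faulty wiring: the predictor scores the same fixed test set in every fold.
fixed_test = row_ids[:5]
try:
    aggregate(fixed_test for _ in k_fold_splits(row_ids, k=10))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

With the intended wiring every row is scored exactly once, so the collected IDs are unique. With the fixed test set, the second iteration delivers `Row0` again and the collection step fails.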


Thank you Scott, it worked. So, with this workflow, is the performance on my test set (in this case, measured by AUC) derived from the mean AUC on the holdout folds from CV? I expected to obtain a CV AUC on the training set, and then to compare it with the AUC on the final test set (the sample split off with the Partitioning node).

Another question: I expected a single AUC with the contribution from all relevant predictors, and instead I obtain one AUC for each predictor. How do I obtain a single, overall AUC?

Thank you,

Take a look at this example to see how you might use parameter optimization together with cross validation to identify the “best” model. You can then apply those parameters to a model learner via a flow variable, and evaluate the overall performance on the holdout set:
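
Outside KNIME, the same pattern can be sketched in a few lines of plain Python. This is an illustration under loud assumptions, not the linked workflow: the data are made up, a tiny k-nearest-neighbour classifier stands in for the random forest, accuracy stands in for AUC, and all function names are hypothetical. The point is only the order of operations: cross-validate on the training rows to pick a parameter, then score once on the untouched holdout.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, test, k):
    """Fraction of test rows whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def cv_accuracy(rows, k, n_folds=5):
    """Mean accuracy over n_folds folds, using only the rows passed in."""
    folds = [rows[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        fit_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(fit_rows, held_out, k))
    return sum(scores) / n_folds


# Made-up deterministic data: label is 1 when the feature exceeds 0.5,
# with the label flipped on every 11th row to add some noise.
rows = [(i / 50, int(i / 50 > 0.5) ^ int(i % 11 == 0)) for i in range(50)]
test_rows = rows[::5]                                   # fixed holdout, never used for tuning
train_rows = [r for i, r in enumerate(rows) if i % 5]   # everything else

# Step 1: pick the parameter by cross-validation on the training rows only.
candidates = [1, 3, 5, 7]
cv_scores = {k: cv_accuracy(train_rows, k) for k in candidates}
best_k = max(cv_scores, key=cv_scores.get)

# Step 2: refit on all training rows with that parameter, score once on the holdout.
holdout_accuracy = accuracy(train_rows, test_rows, best_k)
```

Here `cv_scores[best_k]` plays the role of the CV estimate on the training data, and `holdout_accuracy` is the separate figure for the final test set, so the two can be compared directly.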

For looking at overall AUC, make sure that you are plotting the probability of the prediction of the target, and not the confidence (or any of your other numeric features). This workflow will show you how you can configure a ROC curve:
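
A small Python sketch shows why the choice of column matters. It assumes that "confidence" means the probability of whichever class was predicted; the labels and scores are made up, and `roc_auc` is a hypothetical helper that computes AUC by pairwise comparison rather than anything KNIME-specific.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A win counts 1, a tie counts 0.5, a loss counts 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0, 0]

# Made-up P(class = 1) for each row: the column that should drive the ROC curve.
p_positive = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

# Assumed confidence column: max(p, 1 - p), i.e. the probability of the predicted class.
confidence = [0.9, 0.8, 0.6, 0.6, 0.8, 0.9]

auc_probability = roc_auc(labels, p_positive)
auc_confidence = roc_auc(labels, confidence)
```

On this toy data the probability column ranks 8 of the 9 positive/negative pairs correctly, while the confidence column carries no ranking information about the target class and lands at 0.5. Each column fed to a ROC curve yields its own AUC, so a single overall AUC comes from selecting only the target-class probability.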

Thank you Scott, I will read these carefully.

Best regards,
