Full Data Set vs Sample Set Random Forest Workflow

Below are two copies of the same H2O Random Forest Regression workflow. The only difference is that the top workflow partitions the data 80/20 between the Learner and the Predictor, while the workflow below it feeds 100% of the data to both the Learner and the Predictor.

Why? It's a small data set of 630 home sales and constitutes the full data set. There's no sample to draw from; this is all there is.

Two questions: 1) Is the second workflow a correct setup (100% of the data to the Learner and Predictor, no partitioning) when working with a complete data set? 2) Are the RMSE, MAPE, and R² scores legitimate when 100% of the data goes to both the Learner and Predictor nodes, or am I creating a self-fulfilling prophecy of scores?
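To make question 2 concrete, here is a minimal sketch of the effect I'm worried about, using Python/scikit-learn with a synthetic regression set standing in for the 630 home sales (the KNIME/H2O workflow itself isn't code, so the data, feature count, and forest settings here are just illustrative assumptions): the same random forest scored on the rows it was trained on reports much better RMSE, MAPE, and R² than when it is scored on rows it never saw.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 630 home-sale records
X, y = make_regression(n_samples=630, n_features=10, noise=25.0, random_state=1)
y = y + 500_000  # shift targets away from zero so MAPE is well defined

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# "100% to Learner and Predictor": fit and score on the exact same rows
rf_all = RandomForestRegressor(n_estimators=300, random_state=1).fit(X, y)
pred_in = rf_all.predict(X)

# "80/20 Learner/Predictor": fit on 80%, score on the held-out 20%
rf_split = RandomForestRegressor(n_estimators=300, random_state=1).fit(X_train, y_train)
pred_out = rf_split.predict(X_test)

def report(tag, y_true, y_pred):
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    print(f"{tag}: RMSE={rmse:,.0f}  "
          f"MAPE={mean_absolute_percentage_error(y_true, y_pred):.3f}  "
          f"R2={r2_score(y_true, y_pred):.3f}")

report("scored on training rows (100%/100%)", y, pred_in)       # optimistic
report("scored on held-out rows (80/20)    ", y_test, pred_out)  # realistic
```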

I realize this may be more data science philosophy than a question of structural KNIME workflow design, but I ask for your patience and kindness.

Thanks

If you use 100% of the data for training, there is a high risk that the model will not perform as well on unknown data. This could be somewhat mitigated by using cross-validation.

Using an 80/20 split is the better way to go.
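As a rough sketch of how cross-validation helps here (Python/scikit-learn standing in for something like KNIME's X-Partitioner/X-Aggregator loop or H2O's built-in nfolds option, with synthetic data rather than your home-sale set): the out-of-fold scores give an honest estimate of how the model handles unseen rows, and the model you actually keep can still be refit on all 630 rows afterwards.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the 630 home-sale records
X, y = make_regression(n_samples=630, n_features=10, noise=25.0, random_state=1)

rf = RandomForestRegressor(n_estimators=300, random_state=1)
cv = KFold(n_splits=5, shuffle=True, random_state=1)

# Each fold is trained on 4/5 of the rows and scored on the remaining 1/5,
# so every score comes from rows the model did not see during training.
scores = cross_validate(
    rf, X, y, cv=cv,
    scoring=("neg_root_mean_squared_error", "r2"),
)
print("CV RMSE:", -scores["test_neg_root_mean_squared_error"].mean())
print("CV R2:  ", scores["test_r2"].mean())

# After the honest CV estimate, the final model can be trained on all rows
final_model = rf.fit(X, y)
```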


@mlauber71 Thanks for the advice and link :grinning:
