I need some help with the Churn Prediction example from KNIME TV.
Why do we need a Partitioning node? What does this node do, besides splitting the data?
And why do we split the data into 80% and 20%?
Thank you for your help.
You basically split the data into a training set and a test set. The training set is used to train your model, in this case the Decision Tree. The test set is then used to measure the quality of the trained model on an independent part of your data that was not used for training. The reason is this: if you measured the quality (or the generalization) of your model on the training data itself (for example, if you used all the data to train the decision tree without splitting it into training and test sets), the measured quality could look very high even though the model has simply memorized the training data. Such a model doesn't generalize, i.e. it performs very badly on any data other than the training data itself.
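To make the memorization problem concrete, here is a minimal sketch in plain Python. It is not how the KNIME nodes work internally. The data is made up (a customer id as the only feature and a random churn/stay label), and `split_train_test`, `MemorizingModel` and `accuracy` are hypothetical helpers written just for this illustration. The "model" is a lookup table, which is the extreme case of a learner that memorizes its training data.

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into a training and a test part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


class MemorizingModel:
    """Remembers every training row; answers with the majority label otherwise."""

    def fit(self, rows):
        self.table = {features: label for features, label in rows}
        labels = [label for _, label in rows]
        self.default = max(sorted(set(labels)), key=labels.count)
        return self

    def predict(self, features):
        return self.table.get(features, self.default)


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model.predict(f) == y for f, y in rows) / len(rows)


# Made-up data: the label is random, so there is nothing real to learn.
rng = random.Random(1)
data = [((customer_id,), rng.choice(["churn", "stay"])) for customer_id in range(100)]

train, test = split_train_test(data)          # 80 rows / 20 rows
model = MemorizingModel().fit(train)

train_accuracy = accuracy(model, train)       # perfect, because the rows were memorized
test_accuracy = accuracy(model, test)         # honest estimate on unseen rows

print(f"accuracy on training data: {train_accuracy:.2f}")
print(f"accuracy on test data:     {test_accuracy:.2f}")
```

The accuracy on the training data is perfect by construction, while the accuracy on the held-out 20% shows how little the model actually learned. That gap is what the test set is there to reveal.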
The 80% / 20% split is a common choice for a training/test split. However, if you split the data randomly, the measured quality of the trained model will of course differ from one split to the next. Therefore a technique called cross-validation (X-Validation in KNIME) can be used to create many of these splits, train one model per split, and then estimate the average model quality across all trained models.
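The idea behind cross-validation can be sketched in plain Python as follows. Again, this is only an illustration under assumptions, not the implementation of the KNIME X-Validation nodes: the data is made up, the "model" is a trivial majority-class predictor, and `k_fold_splits`, `train_majority_model` and `cross_validate` are hypothetical names. With 5 folds, every round trains on 80% of the rows and tests on the remaining 20%, and each row is used for testing exactly once.

```python
import random
from statistics import mean


def k_fold_splits(rows, k=5, seed=0):
    """Yield k (train, test) pairs; each row lands in exactly one test part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


def train_majority_model(train):
    """Stand-in for a real learner: always predicts the most frequent label."""
    labels = [label for _, label in train]
    majority = max(sorted(set(labels)), key=labels.count)
    return lambda features: majority


def cross_validate(rows, k=5):
    """Train and test one model per fold and return the k accuracies."""
    scores = []
    for train, test in k_fold_splits(rows, k):
        predict = train_majority_model(train)
        correct = sum(predict(features) == label for features, label in test)
        scores.append(correct / len(test))
    return scores


# Made-up data with random labels, only to have something to split.
rng = random.Random(1)
data = [((customer_id,), rng.choice(["churn", "stay"])) for customer_id in range(100)]

scores = cross_validate(data, k=5)
average_accuracy = mean(scores)

print("accuracy per fold:", [f"{s:.2f}" for s in scores])
print(f"average accuracy:  {average_accuracy:.2f}")
```

The spread of the per-fold accuracies shows how much a single random split can mislead you, and the average is a more stable estimate than any one 80/20 split.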
Hope this helps,