Hello @Ana_Proskurin
Just a few observations about your current methodology:
- The Random Forest Learner doesn’t need any prior data encoding, since the node takes care of that preprocessing itself. Can you do it anyway? Yes, and it shouldn’t affect your Scorer results as a QC, provided the preprocessing is done correctly.
Challenge 23 - Modeling Churn Predictions - Solution – KNIME Community Hub
Your model can then be much simpler, as it is in this provided example.
- One-to-many encoding to create dummy variables isn’t always the right choice here, because it introduces a multicollinearity effect, and as a result your prediction model underperforms. So you would be applying unneeded preprocessing that degrades your predictive model.
Multicollinear models still work, but they tend to capture the noise of the training dataset, which shows up as overfitting; generalization suffers and the model is penalized with lower Scorer results (see the sketch below).
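As a rough illustration outside KNIME (the churn-like data and column names below are made up, and scikit-learn stands in for the Learner node, with an ordinal encoding playing the role of the “raw” nominal columns), a random forest scores about the same whether the categoricals stay as single columns or are expanded into one-hot dummies:

```python
# Minimal sketch, not the KNIME workflow itself: compare a random forest on
# ordinal-encoded categoricals vs. one-hot (dummy) encoded ones.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "contract": rng.choice(["month-to-month", "one-year", "two-year"], n),
    "payment": rng.choice(["card", "bank", "check"], n),
    "tenure": rng.integers(0, 72, n),
    "monthly_charge": rng.uniform(20, 120, n),
})
# Synthetic churn signal: short tenure and month-to-month contracts churn more.
p = 1 / (1 + np.exp(0.08 * df["tenure"] - 1.5 * (df["contract"] == "month-to-month")))
y = (rng.uniform(size=n) < p).astype(int)

cat_cols = ["contract", "payment"]
num_cols = ["tenure", "monthly_charge"]

def rf_pipeline(encoder):
    pre = ColumnTransformer([("cat", encoder, cat_cols)], remainder="passthrough")
    return Pipeline([("pre", pre),
                     ("rf", RandomForestClassifier(n_estimators=200, random_state=0))])

for name, enc in [("ordinal (raw-like)", OrdinalEncoder()),
                  ("one-hot (dummies)", OneHotEncoder(handle_unknown="ignore"))]:
    scores = cross_val_score(rf_pipeline(enc), df, y, cv=5, scoring="accuracy")
    print(f"{name:20s} accuracy = {scores.mean():.3f} ± {scores.std():.3f}")
```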
I would try feeding the model raw data.
I have this example of how you can deal with dummy variables. You can play around with this data and feed a model with and without the preprocessing (raw), even including the Normalizer. The results predicting $HeartDisease$ will rate very similarly, because the learner doesn’t need it.
Then try comparing model performance with one-to-many encoding…
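Continuing the sketch above, adding a scaler (roughly what the Normalizer node does) barely moves the scores either, because tree splits only depend on the ordering of the values:

```python
# Continuation of the previous sketch (reuses df, y, cat_cols, num_cols and the
# earlier imports): min-max scaling the numeric columns before the random forest
# should leave the cross-validated scores essentially unchanged.
from sklearn.preprocessing import MinMaxScaler

scaled = Pipeline([
    ("pre", ColumnTransformer([("cat", OrdinalEncoder(), cat_cols),
                               ("num", MinMaxScaler(), num_cols)])),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
scores = cross_val_score(scaled, df, y, cv=5, scoring="accuracy")
print(f"ordinal + min-max scaling accuracy = {scores.mean():.3f} ± {scores.std():.3f}")
```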
You can connect this component to your own data and compare its performance against your current preprocessing.
I hope this helps.
BR