Prediction is inconsistent for the obvious NTF (No Trouble Found) class

Hi,

I encountered an issue in KNIME.

I am using a Random Forest predictor to predict the classes A, B, C and NTF (No Trouble Found).

NTF means that the values of all features are 0.

I trained two models:
Model A = 10 features with 89% accuracy. The confusion matrix shows that all NTF cases are detected correctly.
Model B = 28 features with 92% accuracy. The confusion matrix shows that all NTF cases are detected correctly.

When I read these models (using the Model Reader node) and applied them to new data:

Model A detects all NTF cases.
Model B does not detect all NTF cases.

Kindly share some possible reasons for this based on your expertise.

Thank You,
Claudette

Hi @Claudette_Paneza,
You have probably encountered overfitting.

Your Model B has many more parameters than Model A, which increases its ability to fit the data but also the risk of overfitting.
best,
Gabriel

Hi,

I used a Logistic Regression Learner (Gauss setting) to address overfitting and got an accuracy of 94.3%. However, when I deployed the model on an unlabeled data set, it still could not detect the NTF failures.

Another question: do I need to apply the normalization I used during modeling when processing the unlabeled data set? When I did, the result became worse and only one class was predicted.

Please advise.

Regards,
Claudette

Usually you have to apply the same preprocessing steps to your training and your test data. The exception is methods that change the composition of your training set, such as oversampling or stratified sampling. Speaking of such methods, are you using stratified sampling in your Partitioning node? This can help if your classes are not evenly distributed (e.g. one is much rarer than the others).
best,
Gabriel
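
To illustrate the stratified sampling mentioned above, here is a minimal Python sketch written outside KNIME. It is not how the Partitioning node is implemented; it only shows the general idea of splitting each class separately so that a rare class such as NTF keeps its share in both partitions. The function name `stratified_split` and the label counts are made up for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical, imbalanced label column: C and NTF are the rare classes.
labels = ["A"] * 50 + ["B"] * 30 + ["C"] * 10 + ["NTF"] * 10
train_idx, test_idx = stratified_split(labels, train_fraction=0.7)

# Each class contributes about 70% of its rows to training, 30% to testing.
ntf_in_test = sum(labels[i] == "NTF" for i in test_idx)
print(len(train_idx), len(test_idx), ntf_in_test)
```

With a purely random split, a rare class can end up almost entirely in one partition; splitting per class avoids that.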

Hi Gabriel,

Yes, I am using stratified sampling.

Hello Claudette,

I believe this could be a normalization issue.
You should calculate the normalization model on your training data, after you split the data with the Partitioning node.
If you use your model to predict unseen data, you need to normalize that data with the Normalizer (Apply) node, using the model produced by the Normalizer node.

It’s extremely important to use the normalization model to normalize unseen data because the new data could have a different distribution (e.g. unbalanced classes) which will result in a different normalization.

Cheers,

nemad
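
The normalization point above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: it uses min-max normalization, the helper names `fit_min_max` and `apply_min_max` are hypothetical, and the data values are invented. Reusing the parameters learned on the training data loosely corresponds to the Normalizer (Apply) node, while re-fitting on the new data loosely corresponds to running a fresh Normalizer on it.

```python
def fit_min_max(rows):
    """Learn the per-column (min, max) pairs from the given rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, params):
    """Scale rows with previously learned (min, max) pairs.

    Constant columns (max == min) are mapped to 0.0 to avoid division by zero.
    """
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi != lo else 0.0
            for value, (lo, hi) in zip(row, params)
        ])
    return scaled


# Hypothetical training data: the first row is an all-zero (NTF-like) record.
train = [[0.0, 0.0], [5.0, 20.0], [10.0, 40.0]]
# Hypothetical new data that happens to contain no all-zero row.
new = [[5.0, 20.0], [10.0, 40.0]]

params = fit_min_max(train)                    # learned once, on training data only
consistent = apply_min_max(new, params)        # reuse the training parameters
refit = apply_min_max(new, fit_min_max(new))   # parameters recomputed on new data

print(consistent)  # [[0.5, 0.5], [1.0, 1.0]]
print(refit)       # [[0.0, 0.0], [1.0, 1.0]]
```

In this toy case, re-fitting maps the row `[5.0, 20.0]` to all zeros, which is exactly what an NTF record looked like during training. The same raw values therefore mean different things to the model depending on which normalization parameters are used.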

Thank you nemad.
I learned something from your response.
I followed your normalization flow, but when deployed, the model still predicted only one class.


Result of model:

Deployment (read the model and apply it to unlabeled data):

Result of deployment has only one class:

I tried using the Normalizer node instead of the Normalizer (Apply) node, and it gave a better result with more classes. Is this correct? Should I not use the normalization from the model?

This is indeed very strange.
In your first post you mentioned that NTF means that all features are 0.
Do you mean the features your model is trained on?

It would be helpful if you could provide me with an example workflow that allows me to reproduce the issue if that’s possible.