Identify mis-classified records (under supervised mode using PMML)

Hello,

How can I identify mis-classified records when running a new dataset through a trained model?

Training Model:

CSV Reader -> Partition (80/20) -> Naïve Bayes Learner -> Naïve Bayes Predictor -> PMML Writer

Testing Model:

PMML Reader + CSV Reader (new raw data) -> PMML Predictor -> CSV Writer

Sample training data (Category, Description, Classification). The actual data is much more complex; for demo purposes I am using the following examples:

Colour, Red, A

Colour, Blue, A

Colour, Green, B

Colour, Black, B

Situation:

My Testing Model classifies the data correctly, but when I add a new test record, e.g. (Colour, Yellow) or (XXX, YYY), the PMML Predictor still assigns a value from the classification list (A, B).

Questions:

  1. How can I prevent the PMML Predictor from assigning any value when the input was not found in the training data, especially when the Category is not defined or the Description is brand new?
  2. How can I create a CSV output file of all the mis-classified records so that we can re-classify them manually and rebuild the model?

I believe the above scenario is a very common text classification issue. I suspect something is missing from my design, and I would really appreciate it if you could point out what is missing or whether there is a different approach to this situation.

Thanks

Hi rahiml,

1. Since your model is trained on a limited set of data, it can't predict categories it has not seen in training. Therefore, you would have to exclude rows with categories that were not present in the training data from the new data before they reach the predictor. Alternatively, you could of course re-train your model with the new data included.
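
Outside of KNIME, the exclusion step could look roughly like the Python sketch below. It is only an illustration: the helper name split_known_unknown is made up, and checking both the Category and the Description column is an assumption based on the sample data in the question. Rows with any value that never appeared in training are set aside instead of being scored.

```python
# Hypothetical helper (not a KNIME API): separate rows the model can score
# from rows that contain feature values never seen during training.
def split_known_unknown(training_rows, new_rows,
                        feature_columns=("Category", "Description")):
    # Collect the set of values observed per feature column in the training data.
    seen = {col: {row[col] for row in training_rows} for col in feature_columns}
    known, unknown = [], []
    for row in new_rows:
        # A row is "known" only if every feature value appeared in training.
        if all(row[col] in seen[col] for col in feature_columns):
            known.append(row)
        else:
            unknown.append(row)
    return known, unknown


# Sample data taken from the question.
training_rows = [
    {"Category": "Colour", "Description": "Red", "Classification": "A"},
    {"Category": "Colour", "Description": "Blue", "Classification": "A"},
    {"Category": "Colour", "Description": "Green", "Classification": "B"},
    {"Category": "Colour", "Description": "Black", "Classification": "B"},
]
new_rows = [
    {"Category": "Colour", "Description": "Blue"},
    {"Category": "Colour", "Description": "Yellow"},
    {"Category": "XXX", "Description": "YYY"},
]

known, unknown = split_known_unknown(training_rows, new_rows)
```

Only the rows in known would then go to the predictor; the rows in unknown can be sent straight to manual classification.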

2. You can use a Rule Engine node after the Predictor node to flag rows where the prediction is not equal to the actual category. Next, filter out all rows where the prediction was correct. Then use a CSV Writer node to write the remaining mis-classified rows.
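
For illustration, the same compare, filter and write logic can be sketched in Python. This is not what the KNIME nodes do internally; the function name write_misclassified and the column name "Prediction" are assumptions (use whatever name your predictor output column actually has). It also assumes the scored rows still carry the true label in the Classification column, otherwise there is nothing to compare against.

```python
import csv
import io


# Hypothetical helper: keep rows where the predicted label differs from the
# actual label and write them as CSV to an open file-like object.
def write_misclassified(rows, out_file,
                        actual_col="Classification",
                        predicted_col="Prediction"):
    misclassified = [r for r in rows if r[actual_col] != r[predicted_col]]
    if rows:
        # Use the columns of the first row as the CSV header.
        writer = csv.DictWriter(out_file, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(misclassified)
    return misclassified


# Made-up predictor output for demonstration: "Green" was predicted wrongly.
scored_rows = [
    {"Category": "Colour", "Description": "Red",
     "Classification": "A", "Prediction": "A"},
    {"Category": "Colour", "Description": "Green",
     "Classification": "B", "Prediction": "A"},
]

buffer = io.StringIO()  # stands in for open("misclassified.csv", "w", newline="")
misclassified = write_misclassified(scored_rows, buffer)
csv_text = buffer.getvalue()
```

The resulting file contains only the rows that need manual re-classification, which can then be corrected and added to the training data.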

Cheers,

Roland