Hello,
How can I identify misclassified records when running a new dataset through a trained model?
Training Model:
CSV Reader -> Partition (80/20) -> Naïve Bayes Learner -> Naïve Bayes Predictor -> PMML Writer
Testing Model:
PMML Reader + CSV Reader (new raw data) -> PMML Predictor -> CSV Writer
Sample training data (Category, Description, Classification). The actual data is much more complex; for demo purposes I am using the following examples:
Colour, Red, A
Colour, Blue, A
Colour, Green, B
Colour, Black, B
Situation:
My testing workflow classifies known data correctly, but when I add new test data, e.g. (Colour, Yellow) or (XXX, YYY), the PMML Predictor still assigns a value from the classification list (A, B).
Question:
- How can I stop the PMML Predictor from assigning any value when no match is found, especially if the Category is not defined or the Description is brand new?
- How can I create a CSV output file of all the misclassified records, so we can re-classify them manually and rebuild the model?
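To make the two questions concrete, here is a rough Python sketch of the output I am after. It is not Naive Bayes and not a KNIME node: it is only a plain lookup against the sample training rows above, and the function names (split_known_unknown, to_csv) are made up for illustration. Rows whose (Category, Description) pair was never seen in training get an empty Classification and go to a separate review CSV.

```python
import csv
import io

# Sample training rows from the post: (Category, Description, Classification)
TRAINING = [
    ("Colour", "Red", "A"),
    ("Colour", "Blue", "A"),
    ("Colour", "Green", "B"),
    ("Colour", "Black", "B"),
]


def split_known_unknown(training, new_rows):
    """Separate new rows into those seen in training and those needing review.

    This is a plain lookup used only to illustrate the desired behaviour;
    a real model would generalise beyond exact matches.
    """
    known = {(cat, desc): label for cat, desc, label in training}
    classified, review = [], []
    for category, description in new_rows:
        key = (category, description)
        if key in known:
            classified.append((category, description, known[key]))
        else:
            # Unseen category or description: leave the classification empty
            review.append((category, description, ""))
    return classified, review


def to_csv(rows):
    """Render rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Category", "Description", "Classification"])
    writer.writerows(rows)
    return buf.getvalue()


# New test data from the post: one known row and two unseen ones
new_data = [("Colour", "Red"), ("Colour", "Yellow"), ("XXX", "YYY")]
classified, review = split_known_unknown(TRAINING, new_data)
review_csv = to_csv(review)
print(review_csv)
```

In the real workflow the review file would be what we re-classify by hand and feed back into training.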
This scenario seems like a very common text classification issue, so I believe something is missing from my design. I would really appreciate it if you could point out what is missing, or suggest a different approach to this situation.
Thanks