What are the best metrics to evaluate a multiclass classifier so that its performance can be compared with other classifiers? I found that some people mention balanced accuracy, but I am not sure how this can be correct.
Hi @zizoo -
Sorry for the delayed response here. One method you’ll find mentioned a lot for evaluating the results of multi-class classifiers is the F-measure (or F1), both micro- and macro-averaged. There’s a brief explanation and the associated calculation in this thread on CrossValidated. Scikit-learn has a function that will calculate it for you. Incidentally, macro-averaged F1 is what Amazon ML uses.
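To make the difference between the two averages concrete, here is a minimal standard-library sketch of the underlying formulas (in practice you would just call scikit-learn’s `f1_score` with `average="macro"` or `average="micro"`; the function names and toy labels below are made up for illustration). Macro-F1 averages the per-class F1 scores, while micro-F1 pools the counts globally:

```python
def per_class_f1(y_true, y_pred, label):
    # Treat `label` as the positive class, everything else as negative.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(y_true, y_pred):
    # Unweighted mean of the per-class F1 scores: every class counts equally,
    # so rare classes influence the result as much as frequent ones.
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)

def micro_f1(y_true, y_pred):
    # For single-label multi-class problems, micro-averaged F1 reduces to
    # plain accuracy: every false positive for one class is a false
    # negative for another, so pooled precision equals pooled recall.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
print(macro_f1(y_true, y_pred))
print(micro_f1(y_true, y_pred))
```

Note how micro-F1 matches the share of correct predictions, while macro-F1 is pulled down by the classes that were predicted poorly.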
There are also some methods to calculate multi-class ROC curves using pairwise comparison, where you look at the results of one class versus all the other classes combined. Here’s a different thread on CrossValidated that has links to some R packages for that.
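The one-versus-rest idea above can be sketched in a few lines of plain Python (the helper names and toy scores are made up for illustration; libraries like scikit-learn’s `roc_auc_score` with `multi_class="ovr"` do this for you). You binarize the labels for one class against all the others, then compute the AUC via the rank-based (Mann–Whitney) formulation:

```python
def ovr_labels(y_true, positive_class):
    # One-vs-rest binarization: the chosen class becomes 1, all others 0.
    return [1 if y == positive_class else 0 for y in y_true]

def auc(labels, scores):
    # Mann-Whitney formulation of ROC AUC: the probability that a randomly
    # chosen positive example is scored above a randomly chosen negative
    # one (ties count as half a win).
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: scores for class "a" from some hypothetical classifier.
y_true = ["a", "b", "c", "a", "b"]
scores_for_a = [0.9, 0.2, 0.1, 0.3, 0.4]
labels = ovr_labels(y_true, "a")
print(auc(labels, scores_for_a))
```

Repeating this for each class gives one ROC/AUC per class, which you can then average (again macro- or weighted) or inspect individually.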
Finally, here’s a Coursera video that highlights general concepts of multi-class classification using Python.
Hope all of this helps!
I used this example
to turn it into a workflow that predicts multiple classes. I also added the accuracy statistics: one F1 calculation in Python, and also the KNIME Scorer, which reports a “micro” F1 as “Accuracy” - so the Python part is mostly there to test the logic.
I am, though, a little reluctant to just use such multiple classifications as targets. Maybe someone with more experience in that field could weigh in. And of course you should be careful and validate your results with real-life data (and new data, for that matter).
kn_example_multiple_classification.knwf (460.2 KB)