Kelleher, Mac Namee & D'Arcy discuss two types of average class accuracy in their book 'Fundamentals of Machine Learning for Predictive Data Analytics' (2015, MIT Press). The formulas for these statistics are included in the attached file. They call the first one the 'arithmetic mean' and the second one the 'harmonic mean'. They strongly recommend using the harmonic mean (p. 419 of the book).
I used the Scorer node to investigate the performance of my model (Random Forest), but I'm not sure where to find the harmonic mean. Is the Accuracy column in the accuracy statistics table what Kelleher and colleagues call the harmonic mean of the average class accuracy? Or do I need to look elsewhere?
The Scorer node calculates what Kelleher et al. refer to as the arithmetic mean of the average class accuracy. At present, there is no option to have the Scorer node compute the harmonic mean.
However, you can use the Table Column to Variable and Math Formula (Variable) nodes to do this; see the attached example workflow.
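For reference, the calculation itself is small. The following is a minimal Python sketch, not the contents of the attached workflow: it assumes the per-class values are the recall of each class, as in Kelleher et al.'s definition of average class accuracy, and the numbers and the names `class_recall`, `arithmetic` and `harmonic` are made up for illustration.

```python
from statistics import fmean, harmonic_mean

# Hypothetical per-class recall values (assumption: not taken from any real
# Scorer output). The keys mimic RowIDs of an accuracy statistics table.
class_recall = {"Cluster_0": 0.90, "Cluster_1": 0.60}
values = list(class_recall.values())

# Arithmetic mean of the class recalls: (1/n) * sum(recall_l)
arithmetic = fmean(values)

# Harmonic mean of the class recalls: n / sum(1 / recall_l)
harmonic = harmonic_mean(values)

# The same harmonic mean written out by hand, to make the formula explicit.
harmonic_by_hand = len(values) / sum(1.0 / v for v in values)

print(f"arithmetic mean: {arithmetic:.4f}")
print(f"harmonic mean:   {harmonic:.4f}")
```

The harmonic mean is pulled towards the weaker class, which is why it is the more pessimistic of the two summaries when the classes are recognised unequally well. Note that it is undefined when a class has a recall of exactly 0, because of the division by that value.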
Thank you for the answer! I tried your workflow, but experienced some problems. More precisely, I receive the error:
Formula (Variable) 0:77:78 Node can't be executed - Node "Math Formula (Variable)" not available from extension "KNIME Math Expression (JEP)" (provided by "KNIME GmbH, Konstanz, Germany"; plugin "org.knime.ext.jep" is installed)
Strange, but I installed KNIME with all extensions and have never experienced this error before. Am I doing something wrong or is this something that can be fixed?
failed to apply settings: Unknown flow variable "${DCluster"
So I conclude the expression needs to be adapted to the name of the flow variables in my analysis. I tried to change the expression, but am uncertain about how to get it right. Do I change it into:
The two examples above lead to drastically different harmonic means (i.e. 0.77432 and 775.87), which makes me think the first one could be correct (since it is somewhere in between the two values for accuracy in my model), but is this interpretation correct?
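A quick way to sanity-check a result like this: the harmonic mean of a set of positive values can never fall outside the range spanned by those values, so a result far above 1 cannot be a valid mean of accuracies. The sketch below illustrates that check in Python; the function name `is_plausible_mean` and the two class values are hypothetical and chosen only for illustration.

```python
def is_plausible_mean(candidate, class_values, tol=1e-9):
    """Return True if candidate lies between the smallest and largest class value.

    Any mean (arithmetic or harmonic) of positive numbers must satisfy this,
    so a value outside the range points to a mistake in the expression.
    """
    return min(class_values) - tol <= candidate <= max(class_values) + tol


# Hypothetical per-class values (assumption: not the poster's actual numbers).
class_values = [0.70, 0.85]

for candidate in (0.77432, 775.87):
    verdict = "plausible" if is_plausible_mean(candidate, class_values) else "not a valid mean"
    print(candidate, verdict)
```

A value that fails this check usually means the expression is not dividing by the reciprocals as intended, for example because of missing parentheses or a wrong variable reference.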
In the Math Formula node, you need to use the flow variables named after the RowIDs in your accuracy table (from the Scorer node). These names correspond to the class labels in your target column.
In the workflow I posted, those were Cluster_0 and Cluster_1. You will have to replace those in the Math Formula node.
If you are unsure how to format the expression, simply delete "$${DCluster_0}$$" (or "$${DCluster_1}$$") and double-click the correct entry in the Flow Variable List on the left. This will insert the name in the correct format.
I've (finally) reached a value for my average class accuracy harmonic mean (which is, as could be expected, quite close to the arithmetic mean for all my models) and can start comparing!