Accuracy statistics

Wim · December 30, 2016, 5:06pm

Hi,

Kelleher, Mac Namee & D'Arcy discuss two types of average class accuracy in their book 'Fundamentals of Machine Learning for Predictive Data Analytics' (2015, MIT Press). The formulas for these statistics are included in the attached file. They call the first one the 'arithmetic mean' and the second one the 'harmonic mean', and they strongly recommend using the harmonic mean (p. 419 of the book).

I used the Scorer node to investigate the performance of my model (Random Forest), but I'm not sure where to find the harmonic mean. Is the Accuracy column in the accuracy statistics table what Kelleher and colleagues call the harmonic mean of the average class accuracy? Or do I need to look elsewhere?

Cheers,

Wim

overall_accuracy_formulas.png

RolandBurger · January 4, 2017, 10:51am

Hi Wim,

The Scorer node calculates what Kelleher et al. refer to as the arithmetic mean of the average class accuracy. At present, there is no option to have the Scorer node compute the harmonic mean.

However, you can use the Table Column to Variable and Math Formula (Variable) nodes to do this; see the attached example workflow.
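
For anyone who wants to check the numbers outside KNIME, the same calculation can be sketched in a few lines of Python. This is only an illustration and not the contents of the attached workflow: it assumes the per-class values are the recall values from the accuracy statistics table, and the function names, class labels and numbers are made up.

```python
def arithmetic_mean(values):
    """Plain average of the per-class values."""
    values = list(values)
    return sum(values) / len(values)


def harmonic_mean(values):
    """Harmonic mean: n divided by the sum of reciprocals.

    Only defined here for strictly positive values.
    """
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs at least one value, all > 0")
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical per-class recall values, keyed by class label.
recall_per_class = {"Cluster_0": 0.9, "Cluster_1": 0.6}

arith = arithmetic_mean(recall_per_class.values())
harm = harmonic_mean(recall_per_class.values())
print(f"arithmetic mean: {arith:.4f}")
print(f"harmonic mean:   {harm:.4f}")
```

The harmonic mean is pulled towards the weaker class, so it drops below the arithmetic mean as soon as the per-class values differ.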

Cheers,

Roland

harmonic_mean.knwf

Wim · January 5, 2017, 1:06pm

Hi Roland,

Thank you for the answer! I tried your workflow, but experienced some problems. More precisely, I receive the error:

Formula (Variable) 0:77:78 Node can't be executed - Node "Math Formula (Variable)" not available from extension "KNIME Math Expression (JEP)" (provided by "KNIME GmbH, Konstanz, Germany"; plugin "org.knime.ext.jep" is installed)

This is strange, because I installed KNIME with all extensions and have never seen this error before. Am I doing something wrong, or is this something that can be fixed?

Cheers,

Wim

Iris · January 5, 2017, 1:29pm

Hi Wim,

The node has only been available since KNIME 3.3. I guess you need to update first.

Best, Iris

Wim · January 5, 2017, 3:45pm

Hi Iris/Roland,

Thanks for your help! I updated my version of KNIME (thanks for the reminder!) and this brings me one step closer to the end.

I'm not there yet, though.

First, I left the workflow as I received it from Roland, that is, with the expression

1/((1/2)*((1/$${DCluster_0}$$)+(1/$${DCluster_1}$$)))

and received the following warning:

failed to apply settings: Unknown flow variable "${DCluster"

So I conclude that the expression needs to be adapted to the names of the flow variables in my analysis. I tried to change the expression, but I am uncertain how to get it right. Do I change it into:

1/((1/2)*((1/$${DAccuracy}$$)+(1/$${DAccuracy}$$)))

or into:

1/((1/2)*((1/$${I#False}$$)+(1/$${I#Correct}$$)))

Or is there another configuration?

The two examples above lead to drastically different harmonic means (i.e. 0.77432 and 775.87), which makes me think the first one could be correct (since it is somewhere in between the two values for accuracy in my model), but is this interpretation correct?

Thank you in advance for your help!

Cheers,

Wim

RolandBurger · January 5, 2017, 5:33pm

Hi Wim,

You are almost there! :-)

In the Math Formula (Variable) node, you need to use the flow variables named after the RowIDs in your accuracy table (from the Scorer node). These names correspond to the class labels in your target column.

In the workflow I posted, those were Cluster_0 and Cluster_1. You will have to replace those in the Math Formula (Variable) node.

If you are unsure how to format the code, simply delete "$${DCluster_0}$$" (or "$${DCluster_1}$$") and double-click the correct entry in the Flow Variable List on the left. This will insert the name in the correct format.
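
As a quick sanity check on the shape of the expression, here is a small Python sketch (not KNIME syntax, with made-up numbers and a hypothetical function name). It shows that putting the same variable in both slots simply returns that value, while two different per-class values give a result that lies between them and below their arithmetic mean.

```python
def two_class_harmonic_mean(a, b):
    """Same shape as the workflow expression: 1/((1/2)*((1/a)+(1/b)))."""
    return 1 / ((1 / 2) * ((1 / a) + (1 / b)))


# Hypothetical per-class values, chosen only for illustration.
class_a, class_b = 0.9, 0.6

# Same variable twice: the expression collapses to the input value.
same_twice = two_class_harmonic_mean(class_a, class_a)

# One variable per class: the result lies between the two inputs.
per_class = two_class_harmonic_mean(class_a, class_b)

print("same variable twice:", round(same_twice, 4))
print("one variable per class:", round(per_class, 4))
print("arithmetic mean:", round((class_a + class_b) / 2, 4))
```

The same arithmetic applies to any pair of positive inputs, including counts, which is why feeding counts into the expression produces a value on the scale of the counts rather than an accuracy between 0 and 1.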

Cheers,

Roland

Wim · January 9, 2017, 1:16pm

Hi Roland,

I've (finally) reached a value for my average class accuracy harmonic mean (which is, as could be expected, quite close to the arithmetic mean for all my models) and can start comparing!

Thank you for helping me out!

Wim