Problem with Numeric Scorer node

Hi guys,

I found a problem using the Numeric Scorer node. This node gives inter alia the coefficient of determination R^2.

In the attached example workflow I have 2 vectors of 129 elements: one being the reference number and the other the predicted one. In this case the coefficient of determination R^2 should simply be the square of the correlation coefficient.

If I use the Linear Correlation node and square the result, I obtain an R^2 = 0.184. If I use the 2D/3D Scatterplot node to plot the points, I also obtain an R^2 = 0.184. However, the Numeric Scorer node gives me an R^2 = 0.062.

Does anybody know what the problem is and why this happens? Am I misunderstanding something?

Thanks in advance,

Gio

Hi,

Well, in your case:

R-squared is 0.1837

R-squared adjusted is 0.1773

and the standard deviation is 0.0616

So it sounds like a mistake.

To obtain these results, use a Linear Regression Learner node.

Best regards.

Thank you Fabienc,

So you would say that this is a bug in the Numeric Scorer node?

Please, can any KNIME developer confirm this?

Thanks,

Gio

What formula do you expect to be the definition of R2? The one in Numeric Scorer seems pretty much the same as the one on Wikipedia (the most general definition; main article linked from the node) and, for example, on Khan Academy.

I guess your definition is this one: R2 as the squared correlation coefficient.
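For reference, the two definitions being contrasted here can be written out as below. This is only a sketch: the symbols are assumptions of this note (y_i for the observed values, \hat{y}_i for the predicted values, bars for means), not notation taken from the node documentation.

```latex
% General definition (1 minus residual over total sum of squares)
R^2_{\mathrm{general}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

% Squared Pearson correlation between observed and predicted values
R^2_{\mathrm{corr}} = \frac{\left[\sum_i (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})\right]^2}
                           {\sum_i (y_i - \bar{y})^2 \, \sum_i (\hat{y}_i - \bar{\hat{y}})^2}
```

The first form compares each prediction directly with its observed value; the second only measures how strongly the two columns are linearly related.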

MathWorks Matlab seems to use the definition implemented by Numeric Scorer. R has packages for both (though the quasi-standard lm computes the squared correlation).

It is confusing to have different measures with the same name, but probably the most popular one is the squared correlation coefficient. Should KNIME provide other measures too?

Cheers, gabor (the guy who implemented the Numeric Scorer)

Hi Gabor and gcincilla,

I did the Wikipedia calculation on the example given by gcincilla and the result is 0.1837. I explain it in the attached xlsx file. Perhaps gcincilla misunderstood how to use this node. The node is intended to do a calculation between the observed y and the predicted y. I think gcincilla used it between x and the observed y. If you use it after a regression predictor, between the observed and predicted y, you obtain the 0.184 we need, which is the non-adjusted R2 and is appropriate in the case of two variables.

 

Best regards

Fabienc and Aborg,

Thank you so much for your answers and help! I'll try to explain my misunderstanding.

I am indeed using the node to calculate the coefficient of determination (R^2) between an observed and a predicted value, which in my case are the first column (called “reference”) and the second one (called “prediction”), respectively. In fact, I'm using it after a regression predictor node.

If I use Fabienc's spreadsheet, putting my “reference” in column B of the sheet and my “prediction” in column C, I do obtain an R^2 = 0.062, as the node gives. So the node correctly applies the formula appearing in this Wikipedia definition.

Unlike the Numeric Scorer node, the 2D/3D Scatterplot node implements an R^2 calculation based on this other Wikipedia definition, which gives an R^2 = 0.184, about 3 times larger than the other! Apart from the 2D/3D Scatterplot node, this value can also be obtained by squaring the Pearson correlation coefficient between the observed and the predicted values, obtained through the Linear Correlation node.

I'm not a statistician and I'm missing the deeper meaning of this difference. Likewise, I cannot tell which implementation is the better measure of how well observed outcomes are replicated by the model. Anyway, as a KNIME user I found it confusing to get different R^2 values from different nodes (even if the 2D/3D Scatterplot node is in a community package).
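One way to see what the difference means is to compute both quantities on predictions that follow the right trend but are systematically off. The sketch below uses made-up numbers (not the data attached to this thread), and the function names are hypothetical; it is not a description of how either KNIME node is implemented.

```python
def r2_residual(y_obs, y_pred):
    """1 - SS_res / SS_tot: penalises any deviation of y_pred from y_obs."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


def r2_pearson(y_obs, y_pred):
    """Squared Pearson correlation: only measures linear association."""
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    mean_pred = sum(y_pred) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(y_obs, y_pred))
    var_obs = sum((o - mean_obs) ** 2 for o in y_obs)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    return cov ** 2 / (var_obs * var_pred)


# Made-up illustrative values.
y_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_compressed = [2.0, 2.5, 3.0, 3.5, 4.0]     # right trend, wrong slope
y_shifted = [11.0, 12.0, 13.0, 14.0, 15.0]   # right trend, constant offset of +10

print(r2_pearson(y_obs, y_compressed), r2_residual(y_obs, y_compressed))  # 1.0 0.75
print(r2_pearson(y_obs, y_shifted), r2_residual(y_obs, y_shifted))        # 1.0 -49.0
```

The squared correlation stays at 1.0 in both cases because it ignores any offset or rescaling of the predictions, while the residual-based value drops (and can even go negative) as soon as the predictions are systematically off. For any pair of columns the residual-based value is never larger than the squared correlation.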

Am I missing something else, or is it just a matter of how you define R^2?

Best regards,

Gio

Hi,

The Scorer node uses the same standard definition as the 3D/2D scatterplot. As I explained, in the Scorer the first column must be the observed y and the second one the calculated y, not x and the observed y as you did. To use the Scorer node, you first have to calculate a predicted y with a model. In your case that will be the linear regression.

There are two main R² values: the simple one, as in the Scorer and the 3D/2D viewer, and the adjusted one (which can be negative), which compensates for the fact that the simple R² increases with the number of variables included in the model (you can find the adjusted one in KNIME's Linear Regression Learner node). The differences you observed are only due to the fact that you entered the wrong columns in your Scorer node.
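As a quick check of how the simple and adjusted values relate, here is a minimal sketch of the usual adjustment formula. The function name is hypothetical, and the single predictor (p = 1) is an assumption; the sample size of 129 and the R² of 0.1837 are the figures quoted earlier in this thread.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# 129 rows and R^2 = 0.1837 come from the thread; one predictor is assumed.
adj = adjusted_r2(0.1837, n_samples=129, n_predictors=1)
print(round(adj, 4))  # 0.1773
```

Under that assumption the result matches the adjusted value of 0.1773 reported above.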

Best regards

Fabien

Hello again Fabienc,

Thank you so much for your reply, but I have to say that I'm still lost.

I want to remark that I didn't use 2 observed variables as you supposed. I used one observed (the column I called “reference”) and one predicted (the column I called “prediction”). The predicted column is already a result from a regression node! (In this case it is not a linear regression but a non-linear regression tool of which I want to measure the performance through the Numeric Scorer node).

Anyway, my problem persists: using the 2 columns attached to my first message, the Numeric Scorer node gives an R^2 = 0.062 and the 3D/2D scatterplot an R^2 = 0.184.

Gio

Hi,

The fact is that when I try to predict your prediction from your reference, I obtain an R² = 0.184 with the Scorer after a linear regression. I will test it again tomorrow with some data of mine.

OK, thank you fabienc.

Hi gcincilla,

I'm sorry, but I tested it with my data sets, and the Numeric Scorer and the Linear Regression Learner gave me exactly the same R². Can you check your process?

Best regards

Fabien

 

Hi Fabien,

Thank you for following this thread. I just tried with your dataset, using the y column as reference and the y predicted column as predicted. Indeed, both nodes (Numeric Scorer and 2D/3D Scatterplot) give exactly the same R^2, as you stated. This is good, but I still don't understand why they don't give the same R^2 with my data: reference column as reference and prediction column as predicted. They should give the same R^2 in this case too, right?
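A possible explanation, sketched with made-up numbers: the two definitions coincide when the predictions come from a least-squares linear fit (with intercept) evaluated on the same data it was fitted on, and they generally differ for any other set of predictions. The helper names below are hypothetical, `y_other` is just a stand-in for the output of some other model, and nothing here describes the internals of either KNIME node.

```python
def mean(values):
    return sum(values) / len(values)


def r2_residual(y_obs, y_pred):
    """1 - SS_res / SS_tot."""
    m = mean(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - m) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


def r2_pearson(y_obs, y_pred):
    """Squared Pearson correlation between the two columns."""
    mo, mp = mean(y_obs), mean(y_pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(y_obs, y_pred))
    var_obs = sum((o - mo) ** 2 for o in y_obs)
    var_pred = sum((p - mp) ** 2 for p in y_pred)
    return cov ** 2 / (var_obs * var_pred)


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return slope, my - slope * mx


# Made-up illustrative values.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_obs = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]

slope, intercept = fit_line(x, y_obs)
y_linear = [slope * a + intercept for a in x]   # least-squares predictions
y_other = [2.0, 2.4, 3.0, 3.3, 4.0, 4.2]        # predictions from "some other model"

gap_linear = r2_pearson(y_obs, y_linear) - r2_residual(y_obs, y_linear)
gap_other = r2_pearson(y_obs, y_other) - r2_residual(y_obs, y_other)
print(gap_linear)  # essentially zero (floating-point rounding only)
print(gap_other)   # clearly positive
```

If the predicted column comes from a non-linear regression tool, as mentioned earlier in the thread, a gap between the two values would be expected rather than a sign that one node is wrong.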

Best regards,

Gio

 

Hi gcincilla, so in your post we have the observed y and the calculated y. It would be easier to test if we had all the variables.

Best regards.
 

Hi Fabien,

Probably there was a misunderstanding: my y observed and y calculated were those I uploaded in my first post. Anyway I re-attach them here with the columns named as y observed and y predicted.

Thanks,

Gio

Hi, not at all. What I meant is: where are the x variables?

Hi Fabien,

In my opinion the problem here is independent of the x variables. That is, whatever the x variables are, the 2 nodes should give the same R^2 based on the 2 given arrays, y observed and y predicted, regardless of how the y predicted values were obtained. Right?

Anyway, I also attach the x variables here.

Regards,

Gio

 

Listening to Gabor....

I made two predictions and calculated the R² with both definitions to compare them with the Numeric Scorer node.

If you read my last post, I made a mistake in it. Please take the latest one; I have deleted my previous workflow.

OK, let's close the discussion with what Gabor said above: different R^2 definitions exist, and it can seem "confusing to have different measures with the same name". But that's how it is. So the most important thing is simply that users are aware of it.

Thank you both for your help!

Gio