logistic regression probabilities - what do they mean?

deanabb · October 23, 2013, 3:21am

I was checking the probabilities coming out of the Logistic Regression Predictor, and they do not match my own manual calculation of the probabilities.

The logistic regression model was created in KNIME and saved to PMML. That PMML code was loaded into KNIME and connected to the Logistic Regression Predictor. Data (just the inputs) flowed into that same predictor.

The mean model score (prob) output from the Predictor is 0.1634935742703903.

My mean calculated prob is 0.20210305482423618.

The mean absolute difference is 0.038609480553845395.

This is really puzzling. Has anyone encountered a problem like this? The target variable has 3 levels, but I'm only computing the probability for one of them. As you can see, the numbers are close, but not close enough for the difference to be mere roundoff error (unless the 16 digits of precision in the PMML code aren't enough).

Details of my calculation (note: for the numbers above, I only used records where VAR2 through VAR10 are 0, so only the constant and VAR1 contribute to the sum):

sum =
-1.74578494870595 +
3.09082475949971E-05*VAR1 +
0.548481855888782*VAR2 +
-0.759391199922522*VAR3 +
-1.9802495148082*VAR4 +
-1.31116296357029*VAR5 +
-0.621782784131766*VAR6 +
1.2800395070313*VAR7 +
-0.499764624175758*VAR8 +
0.107928096379516*VAR9 +
0.0969355450835457*VAR10

My calculation of the probability is 1 / (1 + exp(-sum)).

workflow_logistic_regression_problem.png

hofer · December 4, 2013, 2:06pm

I am trying to investigate the difference between your calculation and what KNIME does. Could you provide the PMML model and the table spec of the logistic regression input?

hofer · December 5, 2013, 11:05am

Hi,

I found the reason for the difference. As you stated, the target has three categories, but the formula 1 / (1 + exp(-sum)) is only applicable to targets with two categories. For your case, you can find the formulas on Wikipedia (see multinomial logistic regression).
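To illustrate the point, here is a minimal Python sketch comparing the two-category formula with the multinomial one. It assumes the usual reference-category parameterization, in which one category has all coefficients fixed at 0 and each of the other categories has its own linear sum. The intercept and VAR1 coefficient for the first class are the ones quoted above; the coefficients for the second class and the value of `var1` do not appear in this thread and are made up purely for illustration.

```python
import math

# Coefficients for the one class quoted in the thread (intercept and VAR1
# only, matching the records where VAR2 through VAR10 are 0).
INTERCEPT_A = -1.74578494870595
COEF_VAR1_A = 3.09082475949971e-05

# HYPOTHETICAL coefficients for the second non-reference class. The thread
# does not show them, so these values are invented for illustration.
INTERCEPT_B = -0.5
COEF_VAR1_B = 1.0e-05


def binary_probability(score):
    """Two-category formula: 1 / (1 + exp(-score))."""
    return 1.0 / (1.0 + math.exp(-score))


def multinomial_probabilities(scores):
    """Probabilities for K categories given K-1 linear scores.

    The reference category has an implicit score of 0. Returns the
    probabilities of the non-reference categories, followed by the
    probability of the reference category.
    """
    exps = [math.exp(s) for s in scores]
    denom = 1.0 + sum(exps)
    return [e / denom for e in exps] + [1.0 / denom]


var1 = 1000.0  # hypothetical input value
score_a = INTERCEPT_A + COEF_VAR1_A * var1
score_b = INTERCEPT_B + COEF_VAR1_B * var1

p_binary = binary_probability(score_a)
p_a, p_b, p_ref = multinomial_probabilities([score_a, score_b])

print("two-category formula:  ", p_binary)
print("three-category formula:", p_a)
```

Because the multinomial denominator contains an extra positive term for the second class, the two-category formula always gives a larger value for the same sum. That is the same direction as the discrepancy reported above (manual mean about 0.202 versus about 0.163 from the Predictor).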

Thank you for sending me the workflow. I double-checked the output of KNIME against R and got the same results in both.

I am going to edit the title of the post so that other people won't get a false impression.