Incorrect means when using GroupBy or Pivot nodes

Hi,

I have just noticed some weird behavior of the GroupBy and Pivot nodes when calculating a mean. I have attached the table I was summarizing: I group by compound and take the mean of the concentrations. To my surprise, some of the averages are incorrect (with Full Precision enabled) and I cannot explain this behavior. For some compounds the result is correct, for others it is not. Any help on this would be appreciated.

Thanks,

Marc

Hi,

this might be due to representation issues of numeric values in computers. We compute the mean using an incremental formula to prevent buffer overflows when summing up large numbers. The formula is

mean = mean * (count / (count + 1)) + d * (1 / (count + 1));

Here mean is initialized with 0, count is the number of elements seen so far, and d is the new value to add.

This might result in some minor computational errors, since not all numbers, especially smaller numbers, can be represented exactly by a Java double. You can also compute the average manually by computing the sum and count per group. Doing so, I get a difference for some groups at the 14th digit (see the attached workflow that uses your csv file). Is this the difference you mean?
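
For illustration, here is a minimal sketch of the two approaches (not the actual GroupBy code; the values below are made up, not taken from your table):

double[] values = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7}; // made-up concentrations

// incremental mean, as in the formula above
double mean = 0.0;
long count = 0;
for (double d : values) {
    mean = mean * ((double) count / (count + 1)) + d * (1.0 / (count + 1));
    count++;
}

// straightforward sum and count
double sum = 0.0;
for (double d : values) {
    sum += d;
}
double simpleMean = sum / values.length;

System.out.println(mean - simpleMean); // may be non-zero in the last digits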

Thanks,

Tobias

Hi Tobias,

Thank you for your answer, but I doubt your explanation is correct. If you count the number of occurrences, you'll find the imprecision both at lower and at higher counts than where the mean comes out correct. Once an imprecision has crept in it should propagate, but this is not the case, so I think the reason must lie somewhere else.

I am also surprised by the worry about buffer overload. For calculating a mean, you just keep a running sum and a count, and at the end you divide one by the other. Where is the overflow? Actually, your method is computationally more burdensome, since more calculations are performed and you still keep track of two numbers (running mean and count).

Bye,

Marc

Hi Marc,

I think "buffer overload" is the wrong term here. The summation loses precision unless an algorithm like the one Tobias mentioned is used. (This Wikipedia page has a somewhat artificial example, and a different solution too.)

Unfortunately, not even summation is as simple as one might think when we use floating-point numbers. BigDecimal could provide better precision, but it would be much slower.
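
As a small illustration (the numbers are artificial, and the compensated (Kahan) summation shown here is just one possible fix, not what the GroupBy node does):

// naive summation: the small terms are swallowed by the large starting value
double naive = 1e16;
for (int i = 0; i < 1000; i++) {
    naive += 1.0;
}
System.out.println(naive - 1e16); // 0.0 instead of 1000.0

// compensated (Kahan) summation keeps a running correction term
double sum = 1e16, c = 0.0;
for (int i = 0; i < 1000; i++) {
    double y = 1.0 - c;   // correct the next term
    double t = sum + y;   // add it (low-order bits may be lost here)
    c = (t - sum) - y;    // recover what was lost
    sum = t;
}
System.out.println(sum - 1e16); // 1000.0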

Cheers, gabor

Hi Marc,

we use the incremental computation method to avoid buffer overflows, which can occur when you sum up a lot of big double values, since a double can only represent values up to a certain maximum before it overflows.

The differences between the two methods of computing the mean are the result of the inability of a double to represent infinitely many real numbers within its finite 64 bits.
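
Two artificial examples of what I mean (nothing to do with your actual data):

// summing large doubles can overflow to Infinity
double sum = 0.0;
for (int i = 0; i < 10; i++) {
    sum += Double.MAX_VALUE / 4;
}
System.out.println(sum); // Infinity

// and 64 bits cannot represent every decimal fraction exactly
System.out.println(0.1 + 0.2); // 0.30000000000000004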

Bye,

Tobias

Hi Tobias,

what about this implementation:
 

// x holds the values to average, N is their count (x.length)
double M = 0;
for (int i = 0; i < N; ++i) {
    M += (x[i] - M) / (i + 1);
}
// M is now the mean of x[0..N-1]

(http://stackoverflow.com/questions/1346824/is-there-any-way-to-find-arithmetic-mean-better-than-sum-n)

At least in R it does not result in precision errors. It would also be safe with respect to big doubles, and it is not slower than other methods.
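
For example, with values near the double limit (made-up numbers), this update never overflows, whereas a plain running sum would:

double[] x = {Double.MAX_VALUE / 2, Double.MAX_VALUE / 2, Double.MAX_VALUE / 2};
int N = x.length;

double M = 0;
for (int i = 0; i < N; ++i) {
    M += (x[i] - M) / (i + 1);
}
System.out.println(M);       // about Double.MAX_VALUE / 2, no overflow

double sum = 0;
for (int i = 0; i < N; ++i) {
    sum += x[i];
}
System.out.println(sum / N); // Infinity: the intermediate sum overflows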

Antje

Hi Tobias,

Just to confirm: KNIME does not intend to fix this issue in the GroupBy node's aggregation methods, but will rather advise users on how to work around it?

Best,

Cullen