GroupBy "Missing Value" behavior

Dear Knimers,

The “Sum” aggregation method in the GroupBy node does not allow me to include missing values:

Suppose a hotel owner has a problem with a tarantula invasion. He sends his employees to check all hotel rooms and count the number of tarantulas there:
image

After discovering 8 tarantulas in room 104, the employees refused to check room 105.

The hotel owner now wants to know the total number of tarantulas in his hotel by using the GroupBy node:
image

Result:
image

…except that this is probably not the total number of tarantulas, because of the missing information on room 105. It would have been better if the result would have been a missing value… This would have alerted the hotel owner of the missing value in the starting data. But this option is unfortunately not available.

Best
Aswin

3 Likes

@Aswin , I would recommend to use Row Filter to check on missing values before Group by.
The logic of processing in case missing values re identified could be different for different scenarios. That is why it is hardly reasonable to include this processing to Group by with possible lost of speed.

2 Likes

Dear @izaychik63

to imagine how serious this could be, replace “Tarantulas” by “VotesABCPartyPerCommunity”, and you want to know the total number of votes cast for this party for the whole country. In this case the sum is very misleading if missing values are ignored when calculating the sum. In this case the best solution would be if the missing value is “carried over” to the final result. Also, the loss of speed would only occur if the user consciously selects “include missing values”.

I do not see how a simple row filter would help me here. I came up with this workaround:
image

Math formula node config:
image

The 0/0 results in the desired missing value.

By the way, I use Pivoting instead of Transpose for the final node to avoid a problem that I mentioned on this forum 2 years ago:
https://forum.knime.com/t/bug-report-transpose-forgets-column-type-for-missings/19749

Best
Aswin

1 Like

Easier:
image

with
image

This yields:
image

and
image

gives the desired result
image

…but I always feel a bit disappointed when I have to resort to lists and java snippets in order to achieve a simple task without creating a node jungle…

hello @Aswin try to use the option “missing value count” in group by node > Manual Agregation

I hope that’s what you’re looking for

Thank you @natanaeldgsantos but I was already using that aggregation option in the workaround that I described above:
image

Hello @Aswin,

how about this workaround?

Workaround
2021_10_12_GroupByMissingValue.knwf (26.0 KB)

Know I have already said it but just loving your creative descriptions :smile:

Br,
Ivan

2 Likes

Dear @ipazin,

thank you, clever solution! :smiley:

But I still think it would be a useful feature to have the “Missing Value” checkbox available and active for the Sum (and Mean etc) aggregation option, as it is shown in the UI already.

General opinion: correct missing value handling is often extremely important, but I have the feeling that in many parts of Knime they are treated as an afterthought or nuisance. The Rule Engine is capable of checking for missing values, but it has no option to insert missing values. The Constant Value Column has no option to insert missing values. To create missings, I often return 0/0 with a Math Formula node or “return null” in a Java Snippet nodes… these feel like hacks.

Also the handling of missing values is inconsistent. As mentioned, the “Sum” aggregation option in the GroupBy node simply ignores the missing values. As does the COL_SUM function in the Math Formula node, even though the “help” text of the Math Formula node clearly says “When any of the used columns contains a missing value, the result is missing”. For example, when we start with

image

and add all tarantulas together using Math Formula’s COL_SUM:

image

Result:

image

However, when I add 2 columns together using the same Math Formula node…

image

I get

image

As described in the help text. It turns out that “COL_SUM” ignores missing values, but “+” does not…

Another way of taking the sum of the two columns is by using the Column Aggregator…

What we learn here is that missing values are ignored, they are not assumed to be zero. Having only missing values results in a missing value.

How about the Column Expression node?

image

Interesting! When doing simple arithmetic in the Column Expressions node missing values are assumed to be zero. But this is not true everywhere! For Column Expressions functions like “average()”, where the missing values are again ignored or yield NaN in the absence of regular numbers. For trigonometric functions missing values yield missing values. Within the same node, there are 3 different ways of handling missing values depending on which function is used.

Funnily enough, the Column expression node has the “isMissing” and “Missing” functions. That means I could do something like this:

if(or(isMissing(column("Tarantulas")),isMissing(column("Scorpions")))) {missingValue();} else {column("Tarantulas")+column("Scorpions");}

How about the Java Snippet (simple) node?

image

Of course, this node does not ignore missings, because Java does not ignore nulls.

Writing this was as much fun as I had with my base-0/base-1 indexing post. It is also similarly inconsistent and it is often hard to predict what will happen with missing values in advance. In case of the Math Formula node and Column Expressions node, the node itself seems to be inconsistent.

Best
Aswin
KNIME_project20.knwf (13.5 KB)

4 Likes

A post was split to a new topic: Missing values in Moving Aggregation vs Moving Average

Hello @Aswin,

tnx for summarizing it. Have created ticket for GroupBy node so let’s see. (AP-17637). Upon news someone will update topic.

Your last reply/question moved to another topic to get proper attention.

Br,
Ivan

2 Likes

@natanaeldgsantos necesito de su ayuda como puedo hacer para contactarlo.

Hello @Honduras22,

please open up a new topic, describe your issue/question as best as possible and someone will assist you. Don’t reply in random and unrelated topics. Here is the category you can post in:

Welcome to KNIME Community!

Thanks,
Ivan

2 Likes