GroupBy "Missing Value" behavior

Dear @ipazin,

thank you, clever solution! :smiley:

But I still think it would be a useful feature to have the “Missing Value” checkbox available and active for the Sum (and Mean etc) aggregation option, as it is shown in the UI already.

General opinion: correct missing value handling is often extremely important, but I have the feeling that in many parts of Knime they are treated as an afterthought or nuisance. The Rule Engine is capable of checking for missing values, but it has no option to insert missing values. The Constant Value Column has no option to insert missing values. To create missings, I often return 0/0 with a Math Formula node or “return null” in a Java Snippet nodes… these feel like hacks.

Also the handling of missing values is inconsistent. As mentioned, the “Sum” aggregation option in the GroupBy node simply ignores the missing values. As does the COL_SUM function in the Math Formula node, even though the “help” text of the Math Formula node clearly says “When any of the used columns contains a missing value, the result is missing”. For example, when we start with

image

and add all tarantulas together using Math Formula’s COL_SUM:

image

Result:

image

However, when I add 2 columns together using the same Math Formula node…

image

I get

image

As described in the help text. It turns out that “COL_SUM” ignores missing values, but “+” does not…

Another way of taking the sum of the two columns is by using the Column Aggregator…

What we learn here is that missing values are ignored, they are not assumed to be zero. Having only missing values results in a missing value.

How about the Column Expression node?

image

Interesting! When doing simple arithmetic in the Column Expressions node missing values are assumed to be zero. But this is not true everywhere! For Column Expressions functions like “average()”, where the missing values are again ignored or yield NaN in the absence of regular numbers. For trigonometric functions missing values yield missing values. Within the same node, there are 3 different ways of handling missing values depending on which function is used.

Funnily enough, the Column expression node has the “isMissing” and “Missing” functions. That means I could do something like this:

if(or(isMissing(column("Tarantulas")),isMissing(column("Scorpions")))) {missingValue();} else {column("Tarantulas")+column("Scorpions");}

How about the Java Snippet (simple) node?

image

Of course, this node does not ignore missings, because Java does not ignore nulls.

Writing this was as much fun as I had with my base-0/base-1 indexing post. It is also similarly inconsistent and it is often hard to predict what will happen with missing values in advance. In case of the Math Formula node and Column Expressions node, the node itself seems to be inconsistent.

Best
Aswin
KNIME_project20.knwf (13.5 KB)

4 Likes