Bug: GroupBy Node: Sum function broken / weird error

The GroupBy node can't sum numbers whenever the result exceeds what INT allows for.

Replicate scenario:

a) put the following values in a CSV

CATEGORY;ANUMBER
catA;1000000000
catB;500000000
catA;1000000000
catB;500000000
catA;500000000

(Notice that the numbers in the second col are all in the range of an INT or INT4)

 

b) Make a Workflow:

CSV Reader -> Column Rename -> GroupBy

 

c) in Column Rename, convert the ANUMBER column either to LONG or DOUBLE, check that the output corresponds to the selected type (it does).

 

d) In GroupBy, group by column CATEGORY in tab Groups. Change to tab Options and add column ANUMBER, then select function SUM from the dropdown in the function field.

 

Run the workflow or node. Node returns the following error:

WARN      GroupBy     Skipped group(s): ANUMBER groups: "catA" cause: "Sum > maximum int value. Convert column to long."

 

The problem is that SUM converts the summands back to INTEGER. The other aggregation functions seem to work.

Weirdness: This happens ONLY when the numbers contained in the CSV are within the range of an INT and the column is interpreted by the CSV Reader node as INT. If you add the following row to the file

catB;3000000000

 

then the second column will be interpreted as Double and the GroupBy suddenly has no problem summing what is there, and correctly choses Double as output format.

Hi,

the problem is that the Column Rename node does not return a proper LongCell but a non native type that implements the Long as well as the Int interface. That is why the GroupBy node still thinks that the column is an IntCell and tries to return an IntCell.

A possible workaround is to convert the Int column to a Long column using the Java Snippet (simple) node. Select in the dialog "Replace column : ANUMBER" and Long as return Type. In the Method body write:

return (long)$ANUMBER$;

The java snippet node will than replace the Int column with name ANUMBER to a Long column.

Bye,

Tobias

Hi,

the problem is that the Column Rename node does not return a proper LongCell but a non-native type that implements the Long as well as the Int interface. That is why the GroupBy node still thinks that the column is an IntCell and tries to return an IntCell.

A possible workaround is to convert the Int column to a Long column using the Java Snippet (simple) node. Select in the dialog "Replace column: ANUMBER" and Long as return Type. In the Method body write:

return (long)$ANUMBER$;

The java snippet node will than replace the Int column with name ANUMBER to a Long column.

Bye,

Tobias

 

This is a common problem with the column replace node. I have had (and others) had the same problem downstream in a workflow. Why can this node not be altered to convert columns properly into the type the user selects. If its not possible, can we have a typecast node instead of needing to rely on java snippets.

thanks,

simon.

The string manipulation node has a long function in it.

Will the String manipulation work the same as a java snippet node?

Yes, this should also work. The String Manipulation node returns a proper LongCell when using the toLong() method.

Thanks for that hint.

This is a great blog, as an IT student I have many information get from you . Thank you for sharing your article I hope you can post more about this.

 

Purchase Twitter Services