Large numbers performance issue

Ergonomist · November 4, 2011, 5:11pm

Dear KNIMErs,

Just a quick heads-up that I'm not sure where to file: I am processing a large file list with paths, sizes, dates, owners etc. generated by FileList. Five of the 650k files are so large that their byte size won't fit in an integer, so I went ahead and tossed them at a Java snippet node, which converted them to LONG.
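For readers unfamiliar with the limit mentioned above, here is a minimal Java sketch of why such sizes need a long. It is not the actual Java snippet node code from this workflow; the class name, method names and the 5 GiB sample value are assumptions made up for illustration.

```java
public class FileSizeParsing {
    // Parses a byte count given as text. A long is needed because
    // sizes above Integer.MAX_VALUE (2,147,483,647) do not fit in an int.
    public static long parseSize(String text) {
        return Long.parseLong(text.trim());
    }

    // Returns true if the value would still fit into a 32-bit int column.
    public static boolean fitsInInt(long size) {
        return size >= Integer.MIN_VALUE && size <= Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long small = parseSize("1048576");     // 1 MiB
        long large = parseSize("5368709120");  // 5 GiB, hypothetical sample value
        System.out.println(fitsInInt(small));  // true
        System.out.println(fitsInInt(large));  // false
    }
}
```

Calling `Integer.parseInt("5368709120")` instead would throw a `NumberFormatException`, which is the plain-Java counterpart of the value not fitting in an integer column.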

So far, so good, but when I went ahead and "grouped by", summing byte sizes, I realised that GroupBy would convert the results to Double. As Double is easier to read thanks to thousands separators, and as it's generally better supported, I went back and converted the sizes to "Double" with the standard "String to number" node instead. But guess what? "GroupBy" would now take ages to complete, if it completed at all, because memory usage was much higher and closer to my heap space limit.
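The aggregation described above (sum of byte sizes per group) can be sketched in plain Java so that the totals stay long throughout. This is only an illustration of the idea, not how KNIME's GroupBy node is implemented; the `FileEntry` class, the grouping by owner and the sample values are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class GroupBySum {
    // Hypothetical row type: one file with its owner and size in bytes.
    public static final class FileEntry {
        final String owner;
        final long sizeBytes;

        public FileEntry(String owner, long sizeBytes) {
            this.owner = owner;
            this.sizeBytes = sizeBytes;
        }

        String owner() { return owner; }
        long sizeBytes() { return sizeBytes; }
    }

    // Sums sizes per owner; summingLong keeps the totals as long,
    // so no conversion to double takes place.
    public static Map<String, Long> sumByOwner(List<FileEntry> files) {
        return files.stream().collect(Collectors.groupingBy(
                FileEntry::owner, TreeMap::new,
                Collectors.summingLong(FileEntry::sizeBytes)));
    }

    public static void main(String[] args) {
        List<FileEntry> files = List.of(
                new FileEntry("alice", 3_000_000_000L),
                new FileEntry("alice", 3_000_000_000L),
                new FileEntry("bob", 10L));
        System.out.println(sumByOwner(files)); // {alice=6000000000, bob=10}
    }
}
```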

How come? Size-wise, both types take 8 bytes per value, but I guess it's an "int vs. float" performance issue? This is standard x86 hardware under Windows.

Having discovered that, wouldn't it be beneficial to improve "long" support in KNIME, if only by providing "long" as a target format in "String to Number"?

Thanks for your comments,
E

P.S.: Subsequent aggregations now always need a double-to-long conversion to run at acceptable speed, so "long" support in the "GroupBy" node would be appreciated, too, I guess. :-)

Ellert_van_Koperen · November 15, 2011, 3:54pm

Note that grouping and comparing on doubles is generally not a good idea because of rounding problems.
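A small Java sketch of the rounding problems mentioned above, with hypothetical class and method names. It shows two standard IEEE 754 double effects: fractional sums that do not compare equal to the expected value, and whole numbers above 2^53 that a double can no longer hold exactly.

```java
public class DoubleRounding {
    // Two doubles land in the same group only if they are exactly equal.
    public static boolean sameKey(double a, double b) {
        return Double.compare(a, b) == 0;
    }

    // True if the long survives a round trip through double unchanged.
    public static boolean roundTripsExactly(long value) {
        return (long) (double) value == value;
    }

    public static void main(String[] args) {
        // 0.1 + 0.2 evaluates to 0.30000000000000004, not 0.3.
        System.out.println(sameKey(0.1 + 0.2, 0.3));            // false

        // Doubles have a 53-bit significand: 2^53 is exact, 2^53 + 1 is not.
        System.out.println(roundTripsExactly(1L << 53));        // true
        System.out.println(roundTripsExactly((1L << 53) + 1));  // false
    }
}
```

Byte sizes are whole numbers, so they only run into the second effect once a value or sum exceeds 2^53 (about 9 * 10^15).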

Ergonomist · November 17, 2011, 1:47pm

Thanks Ellert, this is certainly true for "real" doubles. In my case the GroupBy node forces me to move from INT to DOUBLE, entailing additional disadvantages.

Regarding the proposed "to long" option: I've just discovered the great capabilities of the "Rename" node for re-typing columns. If its interface were a bit easier to use for multiple changes (just like in GroupBy), it would be a perfect replacement for "String to number". Similar concerns apply to "String to date", by the way, which does not offer any multi-column selector right now.

Cheers
E

tobias.koetter · February 17, 2012, 12:07pm

Hi Ergonomist,

With the latest release of KNIME, the sum operation now supports int and long columns as output formats and no longer converts them to double.

http://tech.knime.org/changelog-v252

Bye,

Tobias