Hi,
I am getting an out of memory exception when using the Pivot node. Is the pivot node designed to work with out-of-memory data? Does it require an in-memory representation of the "pivoted" data?
Thanks,
Jay
Hi Jay,
Yes, the node should work with large data sets, though this depends strongly on the number of unique values in the pivot and group columns. For example, if you try to pivot a numeric column with mostly distinct values, you will very likely end up with an out-of-memory error when processing huge data sets with a low memory limit.
By the way, are you using KNIME's default settings as specified in the knime.ini (Windows) or .knime.ini (Linux)? To give KNIME the chance to grab more memory, you could change the option -Xmx256M to something like -Xmx1024M or more, depending on your system's resources.
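For reference, the memory-related lines in the file look something like the following; the surrounding entries vary by installation, and only the -Xmx value needs editing (1024m here is just an example):

    -vmargs
    -Xms256m
    -Xmx1024m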
Regards,
Thomas
Hi Thomas,
Thanks for the reply. The pivot field has ~360 unique values. My Java settings are 256 & 1536. I don't think I can do any better with the JVM on Windows?
Best regards,
Jay
Hi,
The time period I'm working with actually has only 287 unique values in that string field, and the input data has 16.8 million rows. The group-by field has 1.5 million unique values.
I tried running it on only a fraction of the data (100,000 rows), which had 156 unique values in the string field. The pivot node ran in seconds.
Any ideas?
Jay
Hi Jay,
Right, the Pivot node should be able to process any amount of data. The current implementation of this node keeps a map of *all* group values, together with *all* values found in the pivot column, to one aggregation value (count, mean, ...). The advantage is that the columns need not be sorted in advance, and the map can be written out once the entire data set has been processed. That's also why the amount of data is limited by the main memory. But if the data were sorted in an additional step just before pivoting, we could immediately write each pivoted row to the output table as soon as we see a new group value. I hope I can fix this before our KNIME 2.0 release.
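To sketch the idea (illustrative Java only, not the actual node's code; the Row class and emit method are hypothetical stand-ins for the real table API):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SortedPivotSketch {
        // Hypothetical row shape: group key, pivot key, numeric value.
        static final class Row {
            final String group, pivot;
            final double value;
            Row(String group, String pivot, double value) {
                this.group = group; this.pivot = pivot; this.value = value;
            }
        }

        // rows must arrive pre-sorted by group; pivotValues are the output columns.
        // Memory held at any time is one group's map (~287 entries in your case),
        // not all groups at once (~1.5 million x 287).
        static void pivot(Iterable<Row> rows, List<String> pivotValues) {
            String current = null;
            Map<String, Double> agg = new HashMap<>();
            for (Row r : rows) {
                if (current != null && !current.equals(r.group)) {
                    emit(current, agg, pivotValues); // group complete: write and forget
                    agg.clear();
                }
                current = r.group;
                agg.merge(r.pivot, r.value, Double::sum); // e.g. sum aggregation
            }
            if (current != null) {
                emit(current, agg, pivotValues); // flush the last group
            }
        }

        // Stand-in for writing one row of the pivoted output table.
        static void emit(String group, Map<String, Double> agg, List<String> pivotValues) {
            StringBuilder line = new StringBuilder(group);
            for (String p : pivotValues) {
                line.append('\t').append(agg.getOrDefault(p, 0.0));
            }
            System.out.println(line);
        }
    }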
Cheers, Thomas
PS: To check your example, I generated an artificial data set with 16.8 million rows (1.5 million unique group values and 287 unique pivot values). It works, but needs ~4GB of memory, which is presumably too much for a standard desktop PC. :shock:
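A rough back-of-the-envelope for why it gets that big (the exact per-entry overhead depends on the map implementation, so take the second line as an estimate):

    1.5 million groups x 287 pivot values ≈ 430 million aggregation cells
    430 million cells x ~10 bytes each ≈ 4.3 GB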