Geospatial extensions - controlling Python's use of RAM?

Hello

My workflow uses Geospatial Analytics nodes (Spatial Join, Dissolve, Overlay) to analyse the overlap between two polygon layers, and it is failing, I think because of memory problems.

Because of this, I have been running the operation on a freshly booted laptop with no other programmes running.

As a workaround I have already chunked the operation into 30 batches using a Group Loop (the batches are not equal-sized). The small batches run fine; the problem occurs only on a few of the largest batches. I guess that if there is no other solution I will have to sub-divide the operation further.
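To show what I mean, this is roughly what my Group Loop batching does, expressed as a geopandas sketch (the file names and the “region” batch column are placeholders, not my actual data):

```python
# Sketch only, assuming geopandas. "zones.gpkg", "parcels.gpkg" and the
# "region" batch column are hypothetical placeholders for my real layers.
import geopandas as gpd
import pandas as pd

left = gpd.read_file("zones.gpkg")
right = gpd.read_file("parcels.gpkg")

results = []
for region, batch in left.groupby("region"):  # analogue of the Group Loop
    # Restrict the right-hand layer to polygons that can intersect this
    # batch, so only a small working set is in memory per iteration.
    candidates = right[right.intersects(batch.unary_union)]
    results.append(gpd.overlay(batch, candidates, how="intersection"))

merged = gpd.GeoDataFrame(pd.concat(results, ignore_index=True), crs=left.crs)
```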

The laptop has 16 GB of RAM and I have allocated 8 GB to KNIME in knime.ini, but what I see when watching the resource view in Task Manager (Windows) is that behind the scenes Python is using all available RAM (13+ GB). On the toughest batches it treads this line of up to 97% RAM use for several minutes but eventually fails.

Sometimes it fails gracefully, with a message in the console, e.g. ERROR Dissolve 3:501 Execute failed: Error while sending a command. Once, this was accompanied by a Windows message: “The process failed because it could not allocate additional memory”. Other times the Windows GUI crashes and I am left with a black screen and a mouse cursor. And other times I get a blue screen of death with messages about the video driver (possibly because the laptop uses part of normal RAM as video RAM).
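In case anyone wants to reproduce the measurement outside Task Manager, a small psutil sketch like this should show the same picture (the process-name filter is a guess on my part):

```python
# Rough sketch, assuming psutil is installed; the process-name filter is
# a guess and may need adjusting for the exact KNIME Python executable.
import time
import psutil

while True:
    total = sum(
        p.info["memory_info"].rss
        for p in psutil.process_iter(["name", "memory_info"])
        if p.info["name"] and p.info["memory_info"]
        and "python" in p.info["name"].lower()
    )
    print(f"Python processes: {total / 1024**3:.2f} GiB resident")
    time.sleep(5)
```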

I understand from this post (Python script node error when executing in the workflow only with 38M rows, but runs fine at a few million. - #4 by carstenhaubold) that it might not be possible to control Python’s use of memory.

Just wondering if this is still the case? If there was a setting in KNIME to stop Python using all of the memory, that would be great. I can see that it is trying to contain itself, as it dances around 96-97% for several minutes on the largest batches (unless the node manages to successfully reach 100% during this time). But after a few minutes stuck in the high 90s, it always seems to trip off the cliff!

Mick

Hey Mick,

Sorry to hear that you’re having memory trouble, but thanks for analyzing the memory usage in such detail while running the Geospatial Analytics nodes!

They are indeed based on Python, and Python processes do not respect the memory limit that is set for KNIME: there is no way to limit Python’s memory from within KNIME. As with basically any other process that is not running in the Java Virtual Machine, it relies on the operating system to handle resources.
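To illustrate what “rely on the operating system” means: on Linux, a process can cap its own address space via setrlimit. This is a sketch of the mechanism only, not a setting that KNIME exposes:

```python
# Sketch of the OS-level mechanism (Linux; not a KNIME setting). A process
# can ask the kernel to cap its own virtual address space; allocations
# beyond the cap then raise MemoryError instead of exhausting the machine.
import resource

CAP = 8 * 1024**3  # hypothetical 8 GiB cap
resource.setrlimit(resource.RLIMIT_AS, (CAP, resource.RLIM_INFINITY))

try:
    blob = bytearray(16 * 1024**3)  # deliberately exceeds the cap
except MemoryError:
    print("Allocation refused by the OS; the process keeps running.")
```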

Batching your data beforehand is a great idea!

The fact that batching helps suggests that the nodes were not implemented in the most memory-efficient way: the nodes could do that batching themselves, since the data is batched into ~64 MB blocks anyway, and those blocks could be processed individually if the algorithm allows for it (see the sketch below).
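Just to sketch what processing those blocks individually could look like inside a Python Script node, assuming the batch API of our Python integration (the per-batch transform and the “area” column are placeholders):

```python
# Hedged sketch, assuming the batch API of the Python Script node
# (knime.scripting.io); the per-batch transform and the "area" column
# are placeholders.
import knime.scripting.io as knio

out = knio.BatchOutputTable.create()
for batch in knio.input_tables[0].batches():
    df = batch.to_pandas()             # only one ~64 MB block in RAM at a time
    df["area_km2"] = df["area"] / 1e6  # placeholder per-batch work
    out.append(df)

knio.output_tables[0] = out
```

Of course this only works for operations that are independent per block; a Dissolve that merges polygons across the whole table needs cross-batch state, which is exactly the “if the algorithm allows for that” caveat.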

“If there was a setting in KNIME to stop Python using all of the memory, that would be great.”

What we could do is let KNIME kill Python processes if they use too much memory, so that you at least get a better error message than a crash. We already do that when KNIME AP runs as an executor in a Hub setting (that machinery is currently Linux-only). But unfortunately there is no way to tell Python that less RAM is available than the machine actually has so that it tries to cope with that amount; instead, the node implementation needs to be improved.
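For the curious, such a watchdog looks roughly like this (a psutil sketch with a made-up threshold and polling interval, not the actual Hub-executor code):

```python
# Rough sketch of such a watchdog, assuming psutil; the 10 GiB threshold
# and 2 s polling interval are made up. Not the actual Hub-executor code.
import time
import psutil

LIMIT_BYTES = 10 * 1024**3

def watch(pid: int, interval: float = 2.0) -> None:
    proc = psutil.Process(pid)
    while proc.is_running():
        if proc.memory_info().rss > LIMIT_BYTES:
            proc.kill()  # fail fast with a clear error instead of a crash
            raise MemoryError(f"Python process {pid} exceeded the memory limit")
        time.sleep(interval)
```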

Hope that sheds some light on the issue…
Carsten