Hi @RVC2023,
Sorry to hear you’re running into trouble with large data and the Python Script node. No, the amount of data should not be a problem. And – unfortunately – the knime.ini configuration Xmx and the columnar off-heap setting do not have an impact on the amount of memory Python uses: Python runs as a separate process, and there is no direct way to limit its memory usage.
The error complains about varying batch sizes. (This is, btw., something we are working on allowing for the 5.3 release.) Could you share your Python script, or at least an anonymized version? Are you using pyarrow to create your tables and the BatchOutputTable, or does that happen when you use pandas?
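For reference, here is a minimal sketch of how uniformly sized batches can be written from the Python Script node via knime.scripting.io – the 100,000-row batch size is just an example value, and this assumes the input fits in memory (otherwise you could iterate over the input table's batches instead):

```python
import knime.scripting.io as knio
import pyarrow as pa

BATCH_SIZE = 100_000  # example value; choose something that fits your memory budget

# Read the whole input as a pyarrow Table.
table = knio.input_tables[0].to_pyarrow()

out = knio.BatchOutputTable.create()
for start in range(0, table.num_rows, BATCH_SIZE):
    # Every slice has the same length (only the last one may be shorter).
    batch = table.slice(start, BATCH_SIZE)
    # ... per-batch processing would go here ...
    out.append(batch)

knio.output_tables[0] = out
```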
Side note @mlauber71: splitting the data into Apache Parquet files should not have any benefit over the Apache Arrow storage that we use when getting data from KNIME to Python (and in the Columnar Backend) – unless, for some reason, you really need to split the data into separate files. But the 4GB file-size limitation should be long gone, I hope.
Best,
Carsten