Python Source: Very Slow Execution

Hi there,

I have to read an 80 MB binary data file. Because this is easy to implement in Python, I want to use the Python Source node. Everything works fine, but the execution time is horrible. In Python (Anaconda) it takes about one second to load the data; in KNIME 4.2.1 it takes more than a minute!

I googled this issue and found some older threads in this forum, but no solution was presented. It seemed to be a bug at that time (2016-2018). Does anyone have a hint or solution for me?

Andreas

Hi Andreas,

KNIME has to transfer the data from Python into its internal data storage, which takes some time (though the increase in execution time you have observed really feels extreme). We are actively working on reducing this overhead at the moment.

For the time being, you could try to improve the way the data is transferred by selecting a different serialization library under File > Preferences > KNIME > Python > Serialization library. I would suggest giving the Apache Arrow serializer a try. This will however require installing pyarrow in version 0.11 in the Python environment used by KNIME.
Depending on how big your output table is in terms of the number of rows, increasing the Rows per chunk option on the Options tab of the node’s configuration dialog could also help.
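To check whether the Python environment used by KNIME already has pyarrow, and in which version, something like the following could be run in that environment. This is only a minimal sketch; the function name is made up for illustration.

```python
def installed_pyarrow_version():
    """Return the installed pyarrow version string, or None if pyarrow is missing."""
    try:
        import pyarrow
    except ImportError:
        return None
    return pyarrow.__version__


version = installed_pyarrow_version()
if version is None:
    print("pyarrow is not installed in this environment")
else:
    print("pyarrow version:", version)
```

If it reports that pyarrow is missing or a different version, the package would need to be installed into that same environment before switching the serialization library.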

Marcel

Depending on the nature of your data, you could try storing it in an appropriate format from within Python (Parquet has already been suggested) and reading it back into KNIME with a reader node.

Here is an example using R packages to store data in several file formats and one local database KNIME would be able to read.

And here is an example with python

Please note: there have been some reports of strange behaviour with the Parquet implementation in KNIME. They might have been fixed by now.

Hi,
the imported table has a size of 4e6 x 30 (some string columns, some integer, and mostly double precision).

Increasing the Rows per chunk setting doesn’t have any effect on the performance. I’ll give pyarrow a try, but yeah, maybe I’ll do the whole task in Python.

Andreas

That’s the perfect solution! Thanks @MarcelW !!

Apache Arrow makes a huge difference in terms of performance!

Best
