Hi there,
since I upgraded from KNIME 4.5 to 4.5.1 I have had to deal with very slow data transfer from Python Source to KNIME. The Python environment was created by KNIME and serialization is set to Apache Arrow (default setting).
It takes more than 1 minute to get a table with 100,000 rows and 21 columns, and I cancelled the execution after 10 minutes for a table of 1e6 rows. Last week, before I upgraded to 4.5.1, everything went okay.
Actually there isn’t much inside of my Python Source:
@ActionAndi I think in order to take full advantage of the new columnar storage you might have to switch your Python code to the new nodes, which are still in labs status:
What I sometimes do in the meantime is save the data from within the Python node as Parquet and then read that file back into KNIME (yes not the most sophisticated solution, but it should work) - cf. old and new Python nodes.
Could you say where and when that error occurs? To narrow down the problem, it might also be an option to use Parquet to transfer the data without the input_table and output_table.
I still hope KNIME will be able to ‘stabilize’ the Python data transfer.
Hi Daniel,
thanks for your suggestion. Yes, I could connect with standard SQL nodes. But as my company uses some Kerberos-related security features, the route via Python Source is the most convenient for me. (In the past I had some problems connecting to Kerberos with KNIME.)
The problem of slow data transfer occurs also on Python Script nodes and is a general thing I guess…
Andreas
Thanks for giving the Python (Labs) node a try. The LZ4 library should be present actually. Could you please check whether you have the KNIME Python Scripting (Labs) extension as well as the KNIME Columnar Table Backend installed? We have seen issues like that before in 4.5, but thought we fixed them in 4.5.1. Are you working on a Mac by any chance? Did you download and install a fresh KNIME 4.5.1 build or did you upgrade your 4.5.0 installation?
Your performance issues are puzzling me, as we did not really change anything regarding the “non-Labs” Python nodes in 4.5.1. Let’s try to narrow down what might be the cause:
Are you sure that the time is not spent in the database call? Did you put a timer around the piece of code retrieving the data?
Did you create a new conda environment with KNIME 4.5.1? If so, could you try with the environment you used with KNIME 4.5.0 as well? If there was a change in some numerical package (e.g. the BLAS library used by NumPy and SciPy changed from Intel’s MKL to OpenBLAS), or if you used some package with GPU support that is now missing, that could also affect performance.
Is your workflow configured to use the columnar backend? (Right click on the workflow in the workflow explorer → Configure → Table Backend)
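For the timer question above, a minimal sketch using only the standard library. fetch_from_database is a placeholder for whatever call retrieves the data in your script, and the helper name timed is made up for this example.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Measure the wall-clock time of the enclosed block and store it in results."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start
        print(f"{label}: {results[label]:.3f} s")


def fetch_from_database():
    # Placeholder for the real query; replace with your own DB call.
    time.sleep(0.05)
    return [(i, i * 2) for i in range(1000)]


timings = {}
with timed("database fetch", timings):
    rows = fetch_from_database()
```

If the fetch only accounts for a few seconds of the node's total runtime, the rest is most likely spent handing the table over to KNIME.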
I work on Win10 machines only and upgraded from KNIME 4.5.0 (so no fresh installation).
LZ4-Error: The KNIME Columnar Backend was missing. Thank you!
I checked the data transfer within the Python Script Editor by timing the corresponding line. It took about 3 s to download the data. I also checked the dimensions and did some math on the data, so I think that the data transfer from the DB to Python was good and correct.
When I run the Python Source node, the progress bar jumps to 70% within 3 to 5 seconds and stays there for the remaining 50 seconds (table size 10k rows).
The Table Backend was set to “default”. I changed it now to “columnar” but no big change.
BUT:
I then tried the Python Script (Labs) node… and received an error regarding a column with “timestamp” datatype.
ValueError: Data type 'timestamp[ns]' in column 'time' is not supported in KNIME Python. Please use a different data type
When I remove this time column, both the KNIME “Python Script (Labs)” node and the “Python Source” node work fine! It seems that the latter struggles with this datatype and crashes.
About the error in the Python Script Labs node:
As the error mentions, the data type timestamp[ns] is not (yet) supported in Python (Labs). That is because Pandas’ Timestamp is a datatype that is different from Python’s own datetime / timestamp. To fix that, you can e.g. convert the data to a Python datetime object using pandas.Timestamp.to_pydatetime (see the pandas 1.4.0 documentation).
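A minimal sketch of that conversion for a whole column, assuming the column is called time as in the error message. The helper name and the sample values are made up, missing values (NaT) are not handled, and whether the Labs node accepts the resulting object column has not been verified here.

```python
from datetime import datetime

import pandas as pd


def timestamps_to_pydatetime(series: pd.Series) -> pd.Series:
    """Return an object-dtype Series holding plain datetime.datetime values."""
    converted = [ts.to_pydatetime() for ts in series]
    # dtype=object stops pandas from turning the values back into datetime64[ns].
    return pd.Series(converted, index=series.index, dtype=object)


# Stand-in for the table fetched from the database.
df = pd.DataFrame(
    {
        "time": pd.to_datetime(["2022-03-01 12:00:00", "2022-03-02 08:30:00"]),
        "value": [1.0, 2.0],
    }
)
df["time"] = timestamps_to_pydatetime(df["time"])
```

Note that Python's datetime only goes down to microseconds, so any nanosecond part of the original timestamps is lost in the conversion.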
We have reproduced the extremely slow data transfer with the Python Source node and your script and have opened a ticket to investigate and fix the problem. We’ll get back to you once we know more.
That is true, we still have better Timestamp support for the Python (Labs) node on our agenda but did not get to add that for KNIME 4.5.2.
As for the slow data transfer with the Python Source node (non-Labs): I have tried it with KNIME versions back to 4.3 and it was also slow there, so I am curious what might have changed. Maybe the Pandas version is different? Can you tell us which Python/Pandas/NumPy versions you are using with the older KNIME installation?
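To collect those version numbers, something like the following could be run in each environment. It is only a sketch and assumes Python 3.8 or newer for importlib.metadata; on older interpreters, pandas.__version__ and numpy.__version__ give the same information. The helper name package_versions is made up.

```python
import platform
from importlib import metadata


def package_versions(names):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {"python": platform.python_version()}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


versions = package_versions(["pandas", "numpy", "pyarrow"])
for name, version in versions.items():
    print(f"{name}: {version}")
```

Running it once in the old and once in the new environment makes it easy to compare the two side by side.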