Python Source: Extremely Slow Data Transfer to KNIME

Hi there,
since I upgraded from KNIME 4.5 to 4.5.1, I have had to deal with very slow data transfer from the Python Source node to KNIME. The Python environment was created by KNIME and serialization is set to Apache Arrow (the default setting).

It takes more than 1 minute to get a table with 100,000 rows and 21 columns, and I cancelled execution after 10 minutes for a table of 1e6 rows. Last week, before I upgraded to 4.5.1, everything worked fine.

Actually there isn’t much inside of my Python Source:

import pandas as pd
import pyodbc

engine = pyodbc.connect(DSN="MyServer", autocommit=True)

cmd = "SELECT * FROM myScheme.myTable LIMIT 100000"

df = pd.read_sql(cmd, engine)
'''
Some Math
'''
output_table = df

Andreas

@ActionAndi I think that, in order to take full advantage of the new columnar storage, you might have to switch your Python code to the new nodes, which are still in Labs status.

What I sometimes do in the meantime is save the data from within the Python node as Parquet and then read that file back into KNIME (yes not the most sophisticated solution, but it should work) - cf. old and new Python nodes.

Thanks, but the “Labs” nodes don’t work for me: I get a cryptic lz4 error and don’t know what’s wrong there.

Interestingly with version 4.5 the speed was fairly okay…

Have you already tried connecting to the SQL database from KNIME directly and then using a Python snippet for your transformations?

Could you say where and when that error occurs? To narrow down the problem, it might also be an option to use Parquet to transfer the data, bypassing input_table and output_table.

I still hope KNIME will be able to ‘stabilize’ the Python data transfer.

Hi Daniel,
thanks for your suggestion. Yes, I could connect with the standard SQL nodes. But as my company uses some Kerberos-related security features, the route via Python Source is the most convenient for me. (In the past I had some problems connecting to Kerberos with KNIME.)
The slow data transfer also occurs with Python Script nodes, so I guess it is a general problem…
Andreas

Hi,

I receive this error message when I try to configure the “Python Script (Labs)” node:
“Could not initialize class org.bytedeco.lz4.global.lz4”

Andreas

Hi Andreas,

Thanks for giving the Python (Labs) node a try. The LZ4 library should be present actually. Could you please check whether you have the KNIME Python Scripting (Labs) extension as well as the KNIME Columnar Table Backend installed? We have seen issues like that before in 4.5, but thought we fixed them in 4.5.1. Are you working on a Mac by any chance? Did you download and install a fresh KNIME 4.5.1 build or did you upgrade your 4.5.0 installation?

Your performance issues are puzzling me, as we did not really change anything regarding the “non-Labs” Python nodes in 4.5.1. Let’s try to narrow down what might be the cause:

  • Are you sure that the time is not spent in the database call? Did you put a timer around the piece of code retrieving the data?
  • Did you create a new conda environment with KNIME 4.5.1? If so, could you try the environment you used with KNIME 4.5.0 as well? A change in some numerical package (e.g. the BLAS library used by NumPy and SciPy switching from Intel’s MKL to OpenBLAS), or a package that now lacks the GPU support it had before, could also affect performance.
  • Is your workflow configured to use the columnar backend? (Right click on the workflow in the workflow explorer → Configure → Table Backend)
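
The first check in the list can be sketched as follows. To keep the example self-contained it uses an in-memory SQLite table as a stand-in for the real database (an assumption; in the actual node the connection would be the pyodbc one). Timing the query on its own shows whether the time is spent in the database call or afterwards, in the transfer to KNIME.

```python
import sqlite3
import time

import pandas as pd

# Stand-in for the real database (assumption): an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, val REAL)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(i, i * 0.5) for i in range(1000)],
)
conn.commit()

# Time only the retrieval step.
start = time.perf_counter()
df = pd.read_sql("SELECT * FROM my_table", conn)
query_seconds = time.perf_counter() - start

print(f"query took {query_seconds:.3f} s for {df.shape[0]} rows x {df.shape[1]} columns")

# Whatever time the node needs beyond query_seconds is spent outside the query.
output_table = df
```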

Best,
Carsten

Hi,

I work on Win10 machines only and upgraded from Knime 4.5.0 (so no fresh installation)

  • LZ4 error: The KNIME Columnar Table Backend was missing. Thank you!
  • I checked the data transfer within the Python Script editor by timing the corresponding line. It took about 3 s to download the data. I also checked the dimensions and did some math on the data, so I think the data transfer from the DB to Python was good and correct.
  • When I run the Python Source node, the progress bar jumps to 70% within 3 to 5 seconds and stays there for the remaining 50 seconds (table size 10k rows).
  • The table backend was set to “default”. I have now changed it to “columnar”, but there is no big change.

BUT:
I then tried the Python Script (Labs) node… and received an error regarding a column with the “timestamp” data type.

ValueError: Data type 'timestamp[ns]' in column 'time' is not supported in KNIME Python. Please use a different data type

When I remove this time column, both the KNIME “Python Script (Labs)” node and the “Python Source” node work well! It seems that the latter struggles with this data type and crashes.

I think the “datetime64[ns]” data type is the root cause of the problem.
With this code snippet you can reproduce the error.

import pandas as pd
import numpy as np
# Create table
rng = pd.date_range('2015-02-24', periods=int(5e5), freq='s')
df = pd.DataFrame({'Date': rng, 'Val': np.random.randn(len(rng))})

print(df.shape)

output_table = df

In KNIME 4.3.4 the Python Source node ran for ~40 s; in KNIME 4.5.1 it crashes…

Thanks a lot for your investigations!

About the error in the Python Script Labs node:
As the error mentions, the data type timestamp[ns] is not (yet) supported in Python (Labs). That is because pandas’ Timestamp is a data type that is different from Python’s own datetime / timestamp. To work around that, you can, for example, convert the data to a Python datetime object using pandas.Timestamp.to_pydatetime (see the pandas 1.4.0 documentation).
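
A small sketch of what that conversion does, using only pandas and the standard library. The column name load_date and the id column are made up for illustration, and whether the Labs node accepts the converted values is not confirmed here.

```python
from datetime import datetime

import pandas as pd

ts = pd.Timestamp("2020-03-14T15:32:52.192548")

# to_pydatetime() returns a new object; it does not modify ts in place.
py_dt = ts.to_pydatetime()
print(type(py_dt))

# Assigning the scalar to a DataFrame column lets pandas infer a datetime64
# dtype again, so the column does not hold plain datetime objects.
df = pd.DataFrame({"id": [1, 2, 3]})
df["load_date"] = py_dt
print(df["load_date"].dtype)
```

So the conversion has to be applied to the value that is actually used, and a column built from it may still end up with a pandas datetime dtype.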

We have reproduced the extremely slow data transfer with the Python Source node and your script and have opened a ticket to investigate and fix the problem. We’ll get back to you once we know more.

Unfortunately, this behaviour wasn’t fixed in the latest release, 4.5.2 (2022-03-23).

That is true, we still have better Timestamp support for the Python (Labs) node on our agenda but did not get to add that for KNIME 4.5.2.

As for the slow data transfer with the Python Source node (non-Labs): I have tried it with KNIME versions back to 4.3 and it was also slow there, so I am curious what might have changed. Maybe the Pandas version is different? Can you tell us which Python/Pandas/NumPy versions you are using with the older KNIME installation?

Hi,
as far as I remember, the issue also occurred with the Python (Labs) node.
I use
Python = 3.9.7
NumPy = 1.22.1
pandas = 1.3.5

The Python (Labs) node gives the following error:

ValueError: Data type 'timestamp[ns]' in column 'Date' is not supported in KNIME Python. Please use a different data type.

Hi @ActionAndi,

the KNIME 4.6 release coming in summer will add support for pandas.Timestamp (= timestamp[ns]) in the Python Script (Labs) node :)

Cheers,
Carsten

How GREAT is THAT!!!

Thank you!

Sorry, what is the explicit work-around for setting a date? This throws the same error.

from datetime import datetime
import pandas as pd
import knime_io as knio

tran = knio.input_tables[0].to_pandas()

ts = pd.Timestamp('2020-03-14T15:32:52.192548')
ts.to_pydatetime()

tran["load_date"] = ts

knio.output_tables[0] = knio.write_table(tran)

One work-around is to convert the datetime columns to string type before the “write_table” statement.
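
A minimal sketch of that work-around for a whole table, assuming plain pandas. The helper name datetimes_to_strings and the format string are made up for illustration; the format can be anything the KNIME side is able to parse.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def datetimes_to_strings(df, fmt="%Y-%m-%d %H:%M:%S.%f"):
    """Return a copy of df with every datetime column formatted as text."""
    out = df.copy()
    for col in out.columns:
        if is_datetime64_any_dtype(out[col]):
            out[col] = out[col].dt.strftime(fmt)
    return out


df = pd.DataFrame({
    "Date": pd.date_range("2015-02-24", periods=3, freq="s"),
    "Val": [1.0, 2.0, 3.0],
})
converted = datetimes_to_strings(df)
print(converted.dtypes)
# In the Labs node the converted table would then be passed to knio.write_table.
```

On the KNIME side, a String to Date&Time node can turn the text back into a date, provided the format matches.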

Works - thanks

tran["load_date"] = datetime.now().strftime("%m/%d/%Y %H:%M:%S.%f")[:-3]