I am trying to use a Python script in a flow, but it seems very slow to load the data into the Python node. The flow is 2 nodes long: a CSV Reader node followed by a Python Script node.
This takes over a minute for about 1800 columns X 1700 rows. When I run the same Python code in a shell it loads and executes almost instantaneously. My impression is that the problem is during the loading of the data from the KNIME CSV reader into the Python script (i.e. before the actual Python code executes).
The script looks like this:
# select a subset of the variables
dat1 = input_table.iloc[:, :1780]
# select another subset of the variables
dat2 = input_table.iloc[:, 1855:]
# output the rows where varX == 1
output_table = dat1[dat2.varX == 1]
Any thoughts on how I can speed this up? Is it something to do with the loading of the data into Python from KNIME? Or something else?
I am using KNIME 3.1 on Windows with Python 2.7 installed with Anaconda 2. I see the same issue on a Mac.
From your code, I assume you are using the KNIME Python scripting nodes and not the community Python scripting nodes. Is that correct?
I can only tell you that, in the case of the community Python nodes, the data has to be written to a temporary file before Python can load the data and run the script. I don't know the mechanisms of the KNIME Python nodes, but if they work similarly, this would explain the problem...
If you have the source as CSV anyway, it might be a good idea to load the path of that file into KNIME (the List Files node will help) and just pass the file path to Python. Then Python can read in the data directly and you would save one read/write pass over the whole data.
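A rough sketch of that idea, in case it helps. It assumes the file path arrives in the node as a flow variable; the variable name "csv_path" and the helper function filter_from_csv are made up for illustration, and I have not checked how the KNIME Python node exposes flow variables in every version. The column positions and the varX column are taken from the script in the question.

```python
import pandas as pd


def filter_from_csv(csv_path, n_keep=1780, flag_start=1855, flag_col="varX"):
    """Read the CSV directly with pandas and apply the filter from the question.

    n_keep, flag_start and flag_col mirror the original script; adjust them
    to the real column layout.
    """
    table = pd.read_csv(csv_path)
    # select a subset of the variables
    dat1 = table.iloc[:, :n_keep]
    # select another subset of the variables
    dat2 = table.iloc[:, flag_start:]
    # keep the rows where the flag column equals 1
    return dat1[dat2[flag_col] == 1]


# Inside the Python node this might look like the following, assuming a flow
# variable called "csv_path" (hypothetical name) is available to the script:
# output_table = filter_from_csv(flow_variables["csv_path"])
```

This way only the path goes through KNIME and pandas does the single read of the file, so the table is not serialised a second time on its way into the node.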
Thanks for the reply.
You are correct, I am using the KNIME Python scripting node. I have just realised that maybe this should have gone in the general forum rather than the Community one. I was using a CSV for testing; usually the data will come from a database. I was just surprised how long it seemed to take for the data to load into a Python structure before the Python code actually started.
I will resubmit the question in the KNIME general forum as I don't think my problem is specific to the Community Python node.
It doesn't **seem** slow, it actually is!
Apparently something strange changed in version 3, as it was not an issue in v2 (I just upgraded).
It indeed takes forever to load data into Python and the execution is as fast as usual.
If anyone has an idea...
I have a Python script that generates a table with 400,000 elements. Run stand-alone, the script takes 126 s (including dataframe.to_csv), but as a KNIME node it takes over 2 hours.
BTW, it seems really slow from [R] as well.