I have a simple test case passing a 10,000 x 8 table into and out of a Python scripting node. This takes about 7 seconds.
Are there any tips that could help speed up the I/O between the Python node and KNIME?
The reason for asking is that I need to process about 1 million rows several hundred times daily. The timing seems to scale with the number of rows, which would imply 700 seconds per run, so it would take longer than the time available in a day just for I/O.
I am running on a 24 GB machine, with 18 GB allocated to KNIME.
I have attached the test I used (without the data) to illustrate.
The timer node recorded 3 seconds to read, 7.9 seconds for the Python node, and 0.04 seconds for a Java snippet going through the same table.
The Python node itself spent almost no time inside (clicking Execute script in the configuration dialog is instantaneous).
The transfer of the data to and from Python is unfortunately slow with the current solution. Do you perhaps see a way of working around this issue by loading / saving the data directly in Python?
Thanks for the suggestion. We'll give disk access a go.
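For anyone else trying this, here is a minimal sketch of the disk-based workaround, assuming the node hands the table over as a pandas DataFrame. The variable names and the CSV Reader pairing are illustrative; the actual variable KNIME provides (e.g. `input_table`) depends on the node version. The idea is to write the result to a file inside the Python script and have a downstream reader node pick it up, instead of pushing the table back through the node's output port:

```python
import os
import tempfile

import pandas as pd

# Stand-in for the node's input table; in the actual scripting node this
# would arrive in a variable provided by KNIME (node-version dependent).
df = pd.DataFrame({"price": range(10_000), "ticker": ["X"] * 10_000})

# Write the result to disk inside the Python script...
path = os.path.join(tempfile.gettempdir(), "py_node_output.csv")
df.to_csv(path, index=False)

# ...and let a KNIME reader node (e.g. CSV Reader) pick the file up.
# Here we just read it back to check the round trip.
restored = pd.read_csv(path)
print(restored.shape)  # (10000, 2)
```

A binary format (pickle, HDF5) would likely round-trip faster than CSV for a million rows, at the cost of needing a matching reader on the KNIME side.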
If there are any changes please do let me know.
Is there a difference between the way data is loaded when the Python script is run inside the configuration dialog and when the node's 'Execute' is performed? As David wrote, "The python node itself spent almost no time inside (clicking Execute script in the configuration is instantaneous)." Loading data into Python plus script execution time (measured via 'Configure') is a small fraction of the node's 'Execute' time.
What I notice is that the time seems to be spent on the output side.
As Wieslaw suspects, there is a difference when it is run as part of the workflow. Using the configuration screen to execute the script and look through the data is fine. The only explanation I can therefore think of is inefficiency on the Java side in converting the data from a Python DataFrame to Java storage.
Anyway, I was using Python to access tick-by-tick price data from Bloomberg. This is very high volume, hence the issue. Fortunately Bloomberg also has a Java API, and I've managed to achieve what I need by converting to it.
We still use Python to connect to a lot of analytics packages, so if there is a chance this can be resolved it would be very helpful.
There is a workaround for processing tasks where each row can be processed independently: in such cases it is possible to use Parallel Chunk Start/End. However, if processing any row depends on other rows it cannot be applied, as we have no control over which rows are assigned to each chunk (it is only possible to select the number of processing streams/chunks).
Still, it can significantly help in some situations.
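To illustrate (outside KNIME) why the Parallel Chunk approach only works for independent rows, here is a minimal sketch in plain Python: each chunk is handed to a separate worker and never sees the rows assigned to other chunks. The striped chunk assignment below is arbitrary; in KNIME the assignment is decided by the Parallel Chunk Start node, not by the user.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stands in for the workflow branch between Parallel Chunk Start and
    # End; it only ever sees its own rows, so the per-row computation
    # must not depend on rows in any other chunk.
    return [x * 2 for x in chunk]

rows = list(range(1_000))
n_chunks = 4
# We can pick how many chunks there are, but not which row lands where.
chunks = [rows[i::n_chunks] for i in range(n_chunks)]

with ThreadPoolExecutor(max_workers=n_chunks) as ex:
    processed = [row for chunk_result in ex.map(process_chunk, chunks)
                 for row in chunk_result]

print(sorted(processed) == [x * 2 for x in rows])  # True
```

Note that the results come back grouped by chunk, not in the original row order, which is another reason order-dependent processing does not fit this pattern.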