Is there a way to speed up passing data to/from Python nodes?

David_Ko · July 8, 2016, 6:10pm

I have a simple test case that passes a 10,000 x 8 table into and out of a Python scripting node. This takes about 7 seconds.

Are there any tips that can help speed up the I/O between the Python node and KNIME?

The reason for asking is that I need to process about 1 million rows several hundred times daily. The timing seems to scale linearly with the number of rows, which would imply about 700 seconds per run, so the I/O alone would take longer than the time available in a day.
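As a rough check of that estimate, here is the arithmetic written out. It assumes the transfer time scales linearly with the row count, and it uses 300 runs per day purely as a stand-in for "several hundred" (that figure is an assumption, not a measurement):

```python
# Back-of-the-envelope estimate based on the timing of the small test case.
test_rows = 10_000
test_seconds = 7.0            # measured for the 10,000 x 8 test table
target_rows = 1_000_000
runs_per_day = 300            # assumption: stands in for "several hundred"

# Linear scaling: time per run grows in proportion to the number of rows.
seconds_per_run = test_seconds * target_rows / test_rows
seconds_per_day = seconds_per_run * runs_per_day
seconds_available = 24 * 60 * 60

print(seconds_per_run)                      # 700.0
print(seconds_per_day)                      # 210000.0
print(seconds_per_day > seconds_available)  # True
```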

I am running on a machine with 24 GB of RAM, with 18 GB allocated to KNIME.

I have attached the test I used (without the data) to illustrate.

The timer node recorded 3 seconds to read, 7.9 seconds for Python, and 0.04 seconds for a Java snippet to go through the same table.

The Python node itself spends almost no time in the script (clicking Execute script in the configuration is instantaneous).

Help?

David

python_io_timing.zip

winter · July 19, 2016, 11:33am

Hi David,

Unfortunately, the transfer of data to and from Python is slow with the current solution. Could you perhaps work around this issue by loading and saving the data directly in Python?
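A minimal sketch of what that could look like, assuming the data can be exchanged as CSV files. The function name, the file paths, and the row-sum "processing" step are hypothetical placeholders, and how the paths reach the script is left open. With pandas installed, `pandas.read_csv` and `DataFrame.to_csv` would be the more usual choice; the standard-library `csv` module is used here only to keep the example self-contained:

```python
import csv

def transform_on_disk(in_path, out_path):
    """Read rows from a CSV file, process them, and write the result to disk.

    Returns the number of rows written, so only a single small value
    (rather than the full table) has to travel back to the workflow.
    """
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        columns = list(reader.fieldnames)
        writer = csv.DictWriter(dst, fieldnames=columns + ["total"])
        writer.writeheader()
        count = 0
        for row in reader:
            # Placeholder processing step: sum the numeric columns of the row.
            row["total"] = sum(float(row[name]) for name in columns)
            writer.writerow(row)
            count += 1
    return count

# Hypothetical usage; both paths are placeholders:
# n_rows = transform_on_disk("input.csv", "output.csv")
```

A downstream reader node could then pick up the output file, which keeps the large table out of the node-to-Python transfer.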

Cheers,

Patrick

David_Ko · July 19, 2016, 2:57pm

Thanks for the suggestion. We'll give disk access a go.

If there are any changes please do let me know.

Best

David

Wieslaw_Pietruszkiewicz · August 17, 2016, 9:57pm

Is there a difference between how data is loaded when the Python script is executed inside the configuration dialog and how it is done when 'Execute' is performed on the node? As David wrote, "The python node itself spent almost no time inside (clicking Execute script in the configuration is instantaneous." Loading the data into Python and running the script (done via 'Configure') takes a small fraction of the node's 'Execute' time.

Cheers,

Wieslaw

David_Ko · August 18, 2016, 8:43am

What I notice is that the time seems to be spent on the output.

As Wieslaw suspects, there is a difference when it is run as part of the workflow. Using the configuration screen to execute and look through the data is fine. The only reason I can think of is therefore an inefficiency on the Java side in converting the data from a Python dataframe to Java storage.

Anyway, I was using Python to access tick-by-tick price data from Bloomberg. This is very high volume, hence the issue. Fortunately, Bloomberg also has a Java API, and I've managed to achieve what I need by switching to it.

We still use Python to connect to a lot of analytics packages, so if there is a chance this can be resolved, it would be very helpful.

Best

David

Wieslaw_Pietruszkiewicz · August 24, 2016, 10:42am

There is a workaround for processing tasks where each row can be processed independently. In such cases it is possible to use Parallel Chunk Start/End. However, if the processing of any row depends on other rows, it cannot be applied, as we have no control over what is assigned to each chunk (it is only possible to select the number of processing streams/chunks).

Still it can significantly help in some situations.
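To illustrate the row-independence condition, here is a small sketch in plain Python. It only imitates splitting a table into chunks and processing them one by one. It is not how the KNIME nodes are implemented, and all function names are made up for the example:

```python
def split_into_chunks(rows, n_chunks):
    """Split rows into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks

def per_row(rows):
    """Row-independent: each output depends only on its own input row."""
    return [x * 2 for x in rows]

def running_total(rows):
    """Row-dependent: each output depends on all preceding rows."""
    total, out = 0, []
    for x in rows:
        total += x
        out.append(total)
    return out

def apply_chunked(func, rows, n_chunks):
    """Apply func to each chunk separately and concatenate the results."""
    result = []
    for chunk in split_into_chunks(rows, n_chunks):
        result.extend(func(chunk))
    return result

rows = [1, 2, 3, 4, 5, 6]
print(apply_chunked(per_row, rows, 3) == per_row(rows))              # True
print(apply_chunked(running_total, rows, 3) == running_total(rows))  # False
```

The per-row function gives the same result however the rows are split, while the running total restarts in every chunk and so comes out wrong.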

Cheers,

Wieslaw
