Some time ago, when I tried working with Python nodes in KNIME, I noticed that passing data to Python was very slow. I hoped it had been fixed by now, but no, it still seems to be the case (v4.4.1). My table is some 700k rows x 90 columns, and 50 minutes after passing it to the Python Script node, Python still has not received the whole dataset.
I am working around this by first writing the data to CSV and then reading it in with pandas; however, it would save a lot of hassle if data went into the Python node smoothly.
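For reference, the Python side of that CSV workaround could look roughly like the sketch below. The function name `read_exported_csv` and the sample columns are made up for illustration, and the in-memory buffer only stands in for a file exported from KNIME.

```python
import io

import pandas as pd


def read_exported_csv(path_or_buffer, string_columns=()):
    """Load a CSV export into a DataFrame.

    Columns listed in string_columns are read as text so that values
    such as "007" keep their leading zeros.
    """
    dtype = {name: str for name in string_columns}
    return pd.read_csv(path_or_buffer, dtype=dtype or None)


# Hypothetical sample data standing in for a file exported from KNIME.
sample = io.StringIO("id,name\n1,007\n2,042\n")
df = read_exported_csv(sample, string_columns=["name"])
print(df.dtypes)
```

Forcing the type of string columns is optional, but it avoids surprises when pandas guesses a numeric type for a column that holds codes.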
Did you try the new Python Scripting (Labs) node, released with 4.5.0 just two weeks ago? Ideally in conjunction with the Columnar Table Backend (see KNIME Python Integration Guide)?
@Experimenter two things about that. With version 4.5, KNIME launched new Python file handling support that should greatly increase the speed.
You could try a sample here:
You will have to activate the Apache/Parquet based columnar storage to take full advantage:
Since the new integration is still in Labs status, you could also try using Parquet to bring data in and out of KNIME and Python. Maybe not as fast, but it might be worth a shot:
This is still possible with the new integration too, as is using SQLite to keep a local database that both Python and KNIME can use:
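A minimal sketch of the SQLite idea, using only Python's standard `sqlite3` module. The table name `knime_data`, its columns, and the file name are hypothetical, and `write_table` merely stands in for whatever writes the data on the KNIME side.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def write_table(db_path, rows):
    """Stand-in for the KNIME side: store rows in a shared SQLite file."""
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS knime_data (id INTEGER, label TEXT)"
            )
            conn.executemany("INSERT INTO knime_data VALUES (?, ?)", rows)


def read_table(db_path):
    """Python side: read the shared table back as a list of tuples."""
    with closing(sqlite3.connect(db_path)) as conn:
        cursor = conn.execute("SELECT id, label FROM knime_data ORDER BY id")
        return cursor.fetchall()


# Round trip through a temporary database file.
with tempfile.TemporaryDirectory() as tmp:
    db_file = os.path.join(tmp, "exchange.sqlite")
    write_table(db_file, [(1, "a"), (2, "b")])
    result = read_table(db_file)

print(result)
```

With pandas installed, the same connection could be passed to `pandas.read_sql_query` to get a DataFrame instead of a list of tuples.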
Not yet, in progress on getting the latest version. Will verify and get back on this.
Just for clarification: It’s Apache Arrow based (not Parquet). The confusing part here is that (two?) years ago we introduced a “Table Backend” based on Parquet and/or ORC. However, the new “Columnar Backend” is really a complete rewrite of basically everything, based on Apache Arrow (other implementations are possible, though). The Parquet/ORC backend will be deprecated soon.
Two notes on this.
- KNIME’s default FlatBuffers serialization [for passing data to the Python node] does not work if the Python installation has the flatbuffers module updated to v2.0. (Python Script node execution fails with the error: Execute failed: Builder.EndVector() takes 1 positional argument but 2 were given). Downgrading flatbuffers to 1.12.0 solves the problem.
- Tried Apache Arrow as the serialization option. Had to install pyarrow on the Python side. Solves the problem of slow data passing to the Python node, yay!
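To see quickly what a given Python environment has installed, a small check along these lines may help. It is only a sketch: the helper names are made up, and the rule "anything below 2.0 is fine" is an assumption based solely on the report above that 1.12.0 works and 2.0 fails.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def flatbuffers_ok(version):
    """Assumption from this thread: 1.12.0 works, 2.0 fails, so require major < 2."""
    return int(version.split(".")[0]) < 2


for package in ("flatbuffers", "pyarrow"):
    print(package, installed_version(package) or "not installed")

fb_version = installed_version("flatbuffers")
if fb_version is not None and not flatbuffers_ok(fb_version):
    print("flatbuffers", fb_version, "may break FlatBuffers serialization")
```

Run it with the same interpreter that KNIME is configured to use, otherwise it reports the packages of a different environment.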
Did you use the new Python (Labs) node? KNIME Python Integration Guide
I don’t think so. As far as I can tell, I used the regular Python integration Script node.
If you have time, try the labs one for actual performance improvements (in combination with the columnar backend). We’re planning to move the node out of labs with the summer release and are happy to hear your feedback.
It does not want to cooperate. First it wanted me to add the py4j module (which I did), now it says:
ERROR Python Script (Labs) 0:30 Execute failed: ‘java.lang.String org.knime.python2.kernel.PythonKernel.executeAndCheckOutputs(java.lang.String, org.knime.python2.kernel.PythonCancelable)’
I am running the node with the default code inside, the input being a 4M-row table with 4 columns: one integer, three string.
@Experimenter you might want to consider doing a clean Python installation with a KNIME-recommended YAML file for Python 3.7 or 3.8:
Most of the environments have:
You could use the Conda Environment Propagation node to make sure your KNIME Python node uses the correct environment.
If you still encounter problems you might want to share the specific environment you use:
Currently, integrating KNIME and Python involves matching compatible packages (which is always a challenge with Python in general). You can look forward to a deeper integration of Python/Anaconda and KNIME in the future.
I have Python 3.10. No Conda (or Anaconda).
Not sure why you are mentioning Flatbuffers, but I specifically downgraded it from 2.0 for the KNIME Python Script node to work.
As far as I can see, KNIME does not have a recommendation for Python 3.10 yet. In my experience, and depending on what you do, there can be compatibility issues, so giving Python 3.9 a try might be an idea.
The other idea further up the thread was to use Parquet in a generic way.
In a generic way, meaning what?
In general, it is not my intention to spend a whole week on this; all I wish for is KNIME data being passed to a Python script reasonably quickly.
And if KNIME has issues with a Python version, it would be nice to get a message like “3.10 not supported, use at your own risk” or similar. Now, with the Labs Python node, I get some mysterious [aforementioned] Java error which I don’t know how to deal with at all.
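The kind of up-front check wished for here could be sketched as below. This is not something KNIME provides; the function name is made up, and the upper bound of 3.9 is an assumption taken from what is said about the guide in this thread.

```python
import sys

# Assumption: 3.9 is the newest supported version, as stated in this thread.
MAX_SUPPORTED = (3, 9)


def version_warning(version_info, max_supported=MAX_SUPPORTED):
    """Return a warning string if the interpreter is newer than max_supported, else None."""
    current = tuple(version_info[:2])
    if current > max_supported:
        return (
            "Python {}.{} is not supported (newest supported: {}.{}), "
            "use at your own risk".format(*current, *max_supported)
        )
    return None


# Check the interpreter that is running this script.
message = version_warning(sys.version_info)
if message:
    print(message)
```

Comparing `(major, minor)` tuples rather than strings matters here, because as strings "3.10" would sort before "3.9".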
The guide says that Python is supported up to version 3.9. Hence my recommendation to try with that version.
The other idea was to just use Parquet to transfer the data.
Ok. Thanks. I am being careless.
Wondering how I would notice the benefits of columnar storage. I turned it on, but with a simple application (albeit on a multi-million-row dataset) I did not notice a difference. It does not help with the subject of this thread anyway, but we’ll check whether it improves the stability of larger workflows.
Answering my own question. I did a simple experiment: reading a 4M-row table, filtering 200k rows out of it, then joining them back to the original table using one join condition. Starting KNIME afresh, running it 3 times, and checking memory consumption after each run.
Default backend: 4.5G, 6.5G, 9G.
Columnar backend: 3.5G, 3.5G, 4.2G.
Update: Tested on another machine with a smaller dataset. There the columnar backend consumed more memory, but it stopped at ~4.2G and never went beyond that (which suggests some threshold on RAM consumption?).
Anyway, this is off-topic already.
This looks like your KNIME installation got corrupted in some way (in particular, the versions of the different KNIME Python extensions, i.e. non-labs vs labs, do not seem to match). Did you upgrade to v4.5 from a previous version or is this a new installation of KNIME? Perhaps a fresh installation or uninstalling and reinstalling all Python extensions in one go could fix this problem.
Actually this is a fresh PC; KNIME is 4.5.0 and was installed on a fresh system. The extensions were installed from the default web update site(s).
Reinstalling all Python extensions in one go, uh, I have many of them installed…
What versions should the Labs and non-Labs Python extensions be, and how do the non-Labs ones (which work) affect the Labs one (which raises an error)?