thanks for your reply!
I am using KNIME 4.2. One of the crashing nodes is “Python Script 1=>1” with Python 3.
I use it as a column extractor.
In that node I take the input_table, extract some columns, concatenate the columns, and assign the result to output_table. The input_table size is ~650,000 rows × 4,000 columns. So not too large ;).
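For context, the script in the node does roughly the following (a minimal sketch — the column names and values here are made up for illustration; in the real node, input_table and output_table are the pandas DataFrames provided by KNIME):

```python
import pandas as pd

# Stand-in for the table the node receives from KNIME;
# the real input_table has ~650,000 rows and ~4,000 columns.
input_table = pd.DataFrame({
    "signal_a": [1, 2],
    "signal_b": [3, 4],
    "signal_c": [5, 6],
    "other":    [7, 8],
})

# Hypothetical names -- pick the columns of interest ...
cols = ["signal_a", "signal_b", "signal_c"]

# ... and concatenate them side by side into the node's output.
output_table = pd.concat([input_table[c] for c in cols], axis=1)
```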
I executed the node with a small portion of the dataset, and that works.
I then removed all code from the node, and it still worked with the small dataset.
But it crashes with the large dataset.
The only thing left for it to do is loading the input table, I guess.
The “Python Source” node loads all the data fine, and the “Partitioning” node works with the large dataset.
So might this be a bug in the “Python Script 1=>1” node?
Could you try to decrease the Rows per chunk option on the Options tab of the node’s configuration dialog? Its default value is not ideal for wide tables.
The error you experience (“java.lang.AssertionError: FlatBuffers: cannot grow buffer beyond 2 gigabytes”) suggests that too much data is being transferred between KNIME and Python at once. This is a limitation of the transport format we use (FlatBuffers), independent of the total available memory.
thanks for your reply!!
That might fix the problem… I am currently checking.
So the next problem comes up ;)… loading the data in the node takes forever.
The “Splitter” takes some minutes, but the “Python Script 1=>1” node that gets 20 percent of the dataset takes… a veeeeery long time loading it (the “DL Python Network Learner” seems to have the same issue).
My current estimate is that the “Python Script 1=>1” node getting the 80% split of the dataset might finish loading the data on Monday.
Why are those nodes so slow at loading the data, and what can I do about it?
P.S.: And yes, we could change the topic of the thread to “handling data of a size worthy of AI with KNIME” … or something like that … whatever the admin favors?
I have seen this in another case too, with 89 signals, each of length 200,000 … there the “DL Python Network Learner” spent 6 hours loading the data (while the “Splitter” node took just a few minutes) and then crashed because a wrong batch size overloaded the GPU memory O_x …
So first of all: let me assure you that we are aware of the performance issues of our Python nodes and are actively working on improving them. At the moment, the main problem is that all of the input data to a Python node needs to be copied and transferred from KNIME to the external Python process. Likewise, all of the output data needs to be transferred from Python back to KNIME. So roughly speaking, the input data is read twice (first in KNIME, then in Python – and copied once in between) and the output data is written twice (first in Python, then in KNIME – and, again, copied once in between). This overhead will be reduced in the future, but unfortunately, I cannot make any predictions as to when these improvements will be available.
In the meantime, you could try changing the library used for copying and transferring the data to Apache Arrow via File > Preferences > KNIME > Python > Serialization library. Note that you may need to install an additional package – pyarrow, version 0.11 – in your Python environment for this to work. If your Python environment is a Conda environment that has been created via KNIME, then this package will already be installed.
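If the package is missing from your environment, it can also be installed by hand; a sketch of the two common cases (the exact channel and patch version may differ in your setup):

```shell
# In a Conda environment (the 0.11 version the Python integration expects):
conda install -c conda-forge pyarrow=0.11

# Alternatively, in a pip-based environment:
pip install pyarrow==0.11.1
```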
But as mentioned above, the underlying problem of having to copy and transfer the data still persists. So changing the serializer will only improve things up to some point.
This is a similar issue. Here, the entire input data is first transferred from KNIME to Python and only after this is done, the script begins to execute and transfer the data to the GPU. What KNIME would need to do here, instead, is to stream the individual training batches from KNIME via Python to the GPU. Note that this is exactly what the Keras Network Learner node does. So if your training routine does not require any custom scripting, you could also try to use that node.
@niko314159 you could try using either ORC or Parquet to store the data from KNIME and then read it back from within the Python node without using the transfer. Maybe not the most elegant of solutions, but it might still work.
thank you very much for your replies and sorry for the late answer!
I have finally had some time to test the Apache Arrow approach.
For a tiny case I can report that it speeds up the node by a factor of approx. 20, which is great!
BUT testing on the large case: it blows up the machine’s memory of 280 GB.
I could not benchmark the change in memory consumption yet.
Is that behavior expected?
Saving to disk and reading it again seems to be quite a brutal workaround.
I will keep that in mind for further research, THX!!
Since most of our “big” data is on network drives, it might lead to some other complications. Maybe I could read it once initially and then save it to a local SSD for that purpose.
Obviously it requires changing all of my workflows O_x…
@MarcelW is there a way to get in touch with you directly ?
Just to make sure: this did not occur when using the default serialization library (Flatbuffers), right? I am not aware of any differences between the two serialization libraries in terms of memory consumption. I will need to look that up.
Yes, definitely. As @mlauber71 mentioned, the upcoming 4.3 release of the KNIME Analytics Platform will introduce a new (optional) format for how KNIME stores its data tables, which is based upon Apache Arrow. And while we have not yet adapted the Python nodes to make direct use of this new format (that is, they will still need to copy data), their performance could already benefit from that. We are planning to get rid of the copy step entirely and make the nodes interface with KNIME’s table format directly in future updates of the Python integration.
The Python nodes copy the data to the temp. directory configured in the KNIME Preferences for the transfer to/from Python (so by default to the local disk), so this should not be a problem specific to the Python nodes, right? Or did you redirect the directory to a remote drive? (Maybe I do not fully understand your infrastructure setup.)
Sure, I will send you a direct message via the forum.