My workflow component reads ECG data from files into a list of arrays using a specific Python library. 1 file = 1 array.
Every node works exactly as intended, as long as I don’t try to read more than a certain number of files.
The “valid rows” Python script is the main reader; the consistency check is basically irrelevant, since I get identical outcomes with any Python script given these inputs.
My problem appears once I read in too much data, pickle the resulting object, and pass it to another Python script as input.
@botichello have you thought about saving the pickled object as a ZIP file and reading it back, like in this example? Could you say how large this object then is?
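Something like this, as a minimal sketch (`arrays` and the file names are placeholders for your own objects):

```python
import pickle
import zipfile

import numpy as np

# placeholder for the list of per-file arrays from the reader node
arrays = [np.random.rand(7_000_000) for _ in range(2)]

# write: pickle the object and store it inside a compressed ZIP archive
with zipfile.ZipFile("ecg_arrays.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("ecg_arrays.pkl", pickle.dumps(arrays, protocol=pickle.HIGHEST_PROTOCOL))

# read it back in the downstream Python script
with zipfile.ZipFile("ecg_arrays.zip") as zf:
    arrays_back = pickle.loads(zf.read("ecg_arrays.pkl"))
```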
If it is ‘just’ data (a Pandas DataFrame) you could try to save it as a Parquet file, or use the new Columnar backend integration that should improve the data exchange between KNIME and Python.
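For example (a sketch only; the column layout here is an assumption about how the ECG values could be arranged):

```python
import numpy as np
import pandas as pd

# assumed layout: all files stacked into one long table,
# with a file_id column so each recording can be separated again
df = pd.DataFrame({
    "file_id": np.repeat([0, 1], 5),
    "ecg": np.random.rand(10),
})

df.to_parquet("ecg.parquet")             # needs pyarrow or fastparquet installed
df_back = pd.read_parquet("ecg.parquet")
```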
@mlauber71 Thanks for your answer!
The main problem with this approach is the way I want to process the data. I would like to keep the data in memory, and work with it in multiple different components afterwards.
I can only guess how much slower it would be to write to a ZIP and read it back every time there is a Python-script-to-Python-script connection.
Each file contains ~7 million float values (some are much smaller as well), and the breakpoint seems to be at the 18th file.
For context, 7 million floats corresponds to a two-hour ECG recording.
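A rough back-of-envelope, assuming standard 64-bit floats: 7,000,000 values × 8 bytes ≈ 56 MB per file, so 18 files ≈ 1 GB of raw values in memory, before any pickling overhead. So the breakpoint appears to sit somewhere around 1 GB of transferred data.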
@botichello maybe give it a try. How large would the ZIP file be?
Can you save the data as a ‘standard’ Pandas DataFrame? Then you might be able to benefit from the new Columnar storage integration mentioned above, which would speed up the transfer of data between KNIME and Python.
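A minimal sketch, assuming the `knime.scripting.io` API of the newer Python Script node (the processing step in the middle is a placeholder):

```python
import knime.scripting.io as knio

# read the incoming KNIME table as a plain pandas DataFrame
df = knio.input_tables[0].to_pandas()

# ... your ECG processing here ...

# return a 'standard' DataFrame to KNIME; with the Columnar backend
# this transfer is Arrow-based instead of going through pickle
knio.output_tables[0] = knio.Table.from_pandas(df)
```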
Then you could try to optimise things within the Python code itself. You could delete unnecessary objects, or run garbage collection from within the Python environment.
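For instance (a sketch; the large intermediate here is just a stand-in):

```python
import gc

import numpy as np

# stand-in for a large intermediate that is only needed briefly
raw = [np.random.rand(7_000_000) for _ in range(3)]
means = [a.mean() for a in raw]  # keep only the small result

del raw        # drop the reference to the big list
gc.collect()   # ask the interpreter to reclaim the memory right away
```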
One last thing you could try, if the large individual job would otherwise run: build a loop in KNIME, put a Conda Environment Propagation node into it, and restart the Python environment in every iteration in order to free memory.