Python Learner error: buffersize during serialization exceeds the maximum buffer size

Hi there,

Python Learner failed with the following error while training a RandomForest regression model (sklearn).

ERROR Python Learner
Execute failed: The requested buffersize during serialization exceeds the maximum buffer size. Please consider decreasing the ‘Rows per chunk’ parameter in the ‘Options’ tab of the configuration dialog.

PythonLearner_Err.txt (5.0 KB)

So I tried decreasing the ‘Rows per chunk’ parameter from 500000 to 1000, but it still failed with the same error.

One thing worth noting is that the data contains about 1.5 million rows. Should I do row sampling to solve this error?
However, I don’t get this error when I run the same training in a Jupyter notebook.

Thanks in advance.

Hi @qianyi,
how many columns does your data have? And what types of data do you have in there? It’s pretty unlikely that 1000 rows of anything in KNIME exceed the limit of 2GB, though. Can you share the workflow with us or is it confidential?
Kind regards
Alexander


Hi, @AlexanderFillbrunn

Thanks for your reply.
The training data has 7 numerical columns, and it’s open data, so of course I can share my workflow. However, I have to upload it without the data (the exported workflow with data is about 300 MB, which is too big…).

The data is easy to get from Kaggle; only the “train.tsv” file is needed for this test workflow.

PythonLearner_Test.knwf (15.4 KB)

Thanks!

Regards,
Qianyi

Hi,
I think the issue is the row with ID 1328010: its item_description has a length of 35079696 characters, which seems a bit excessive. Maybe you can fix the file manually and then try loading it again.
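For example, something along these lines could help spot and drop such rows before loading the file (just a sketch, assuming you use pandas and that the columns are named train_id and item_description as in the Kaggle file; the length threshold is arbitrary):

import pandas as pd

# Load the Kaggle file; adjust the path if necessary.
df = pd.read_csv("train.tsv", sep="\t")

# Flag rows whose item_description is suspiciously long.
too_long = df["item_description"].fillna("").str.len() > 100_000
print(df.loc[too_long, "train_id"].tolist())

# Write a cleaned copy without those rows.
df[~too_long].to_csv("train_clean.tsv", sep="\t", index=False)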
Kind regards
Alexander

Hi @qianyi,

The error message you get is incorrect. The problem seems to be caused by the output model being too large (around 2.2 GB when pickled), not the input data.
Currently, models are handled as single, indivisible entities when transferring them from Python to KNIME; therefore, decreasing the number of rows per chunk unfortunately does not help here.

As a workaround, you could manually pickle the model to disk by adding the following line at the top of your script:

import pickle

and these lines at the bottom:

with open("path-to-model/model.pkl", "wb") as f:
	pickle.dump(output_model, f)
output_model = None

path-to-model should be replaced by a sensible directory path. This could simply be your user directory (C:\Users\…). However, I would recommend writing the file into the workflow directory to keep the workflow portable across different users and machines. You can get the path to the workflow directory via the Extract Context Properties node (property context.workflow.absolute-path) and feed it into the Python node as a flow variable; a sketch of this is shown below.
Note that the last line in the snippet above makes sure that the Python Learner node outputs an “empty” model instead of trying (and failing) to transfer the large one again.
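If you go the flow-variable route, the pickling part could look something like this (just a sketch, assuming the legacy Python scripting nodes where flow variables are exposed as the flow_variables dictionary, and that the variable keeps the name context.workflow.absolute-path from the Extract Context Properties node):

import os
import pickle

# Workflow directory passed in as a flow variable.
workflow_dir = flow_variables["context.workflow.absolute-path"]

# Pickle the trained model into the workflow directory.
with open(os.path.join(workflow_dir, "model.pkl"), "wb") as f:
	pickle.dump(output_model, f)

# Hand an "empty" model to KNIME so the node does not try to transfer the large one.
output_model = None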

Using the model in other nodes can then be done by unpickling it via:

import pickle
with open("path-to-model/model.pkl", "rb") as f:
	model = pickle.load(f)

I will add a ticket to our issue tracker to make sure that large models will be supported by future versions of the KNIME Analytics Platform (and also to correct the error message you got). I hope this workaround works for you for the time being.

Marcel


Hi @MarcelW,

The workaround worked for the Python Learner! I’ll continue with the Predictor then.
Also, many thanks for the information about the Extract Context Properties node. It’s new to me, but it seems very convenient to use.

BTW, does this error only occur when transferring a model from Python to KNIME?
Anyway, I think Python users would appreciate it if large models could be supported. Thanks for your help.

Regards,
Qianyi


Hi Qianyi,

Great, glad to hear that it worked!

The error should occur in both directions, Python to KNIME as well as KNIME to Python, but really only for models. In the case of ordinary data tables, the amount of data that can be transferred should only be limited by the amount of RAM that is available to the Python process and the ‘Rows per chunk’ option.
I totally agree that large models should be supported. We hope to add this capability to KNIME as soon as possible.

Marcel



It’s been a while, but here’s a quick follow-up on this one: our Python nodes now support models larger than 2 GB, beginning with KNIME Analytics Platform 4.5.0, which was released two days ago.
