Error when trying to convert PyArrow table to KNIME table with Python Script (Labs)

Hello,

I’m trying to manage memory usage when converting data from the Python Script (Labs) node to a KNIME table. The conversion from a pandas DataFrame to PyArrow was causing memory issues, so I’m skipping the DataFrame step altogether.

I’ve tried many ways of outputting the PyArrow table, but I always get the following error:

Executing the Python script failed: Traceback (most recent call last):
  File "", line 190, in
  File "/home/ubuntu/knime/configuration/org.eclipse.osgi/673/0/.cp/src/main/python/knime_arrow_table.py", line 344, in append
    batch = ArrowBatch(data, sentinel)
  File "/home/ubuntu/knime/configuration/org.eclipse.osgi/673/0/.cp/src/main/python/knime_arrow_table.py", line 109, in __init__
    raise ValueError("Can only create a Batch with data")
ValueError: Can only create a Batch with data

The technique that appears most efficient is to create a list of arrays of column data and apply a schema. I’ve also tried appending the data as batches, as chunked arrays, and as a dict, but I get the same error.

I have included a minimal version of the Python Script (Labs) node. It is much more minimal than the original, but I have kept parts of the consume() function that may seem unnecessary; they may be useful in case the solution lies in how the HTTP response data is returned (maybe it can stay formatted as PyArrow rather than being converted from PyArrow to a Python dict and back to PyArrow?).

You will see in the script (line 190) that outputting the data by converting to a pandas DataFrame (which is converted to PyArrow behind the scenes) and then to a KNIME table works just fine, so I’m going wrong somewhere with PyArrow.

Thanks in advance.

Pyarrow_Batch_Error_Min_Workflow.knar.knwf (9.8 KB)

@Nancyjay the code looks complicated. Here is my impression: at the top you could convert the incoming data to a pandas DataFrame and then extract PyArrow batches from it

df = knio.input_tables[0].to_pandas()
data = pa.Table.from_pandas(df).to_batches()

That would allow your loop further down to look something like this:

# Then process the batches one by one.
for batch in data:
    # input_batch = batch.to_pyarrow()  # no longer needed
    input_batch = batch  # each item is already a pyarrow.RecordBatch

This at least starts to run until it hits an error at line 165 involving a URL and an integer/string mismatch. Maybe you can figure that one out.

In the end, the syntax to export the results back might look like this. If you already have a PyArrow table, you may be able to output it directly:

knio.output_tables[0] = knio.write_table(output_table)

These are just my first impressions; I am far from an expert with this code. In case someone wants to try, I have included a Conda Environment Propagation node that creates a simple environment with the necessary packages.
