When creating a table in a KNIME Python extension node, there is unexpected behaviour in how KNIME handles batching internally that, as far as I can tell, makes it impossible to predict which batch sizes KNIME will accept, and that effectively forces you to avoid pyarrow's batching altogether. This may look like a user error, but given how the API is set up I think it is actually an oversight in the KNIME Python extension.
Let me explain with a simplified snippet; here is the problematic execute method:
import knime.api.table as kt
import knime.extension as knext
import pyarrow as pa
....
def execute(self, exec_context: knext.ExecutionContext, table: kt._TabularView):
    def page_steps(length, page_size):
        # Yield (start, stop) pairs that split `length` rows into pages of
        # `page_size`; only the last page may be smaller.
        start, end = 0, 0
        for _ in range(0, length + 1, page_size):
            start, end = end, min(end + page_size, length)
            yield start, end
            if end == length:
                return

    # 1.5M rows of a single string column, partitioned into pages of 1M rows.
    data = ["A" * 76 for _ in range(1_500_000)]
    names = ["A"]
    pa_chunks = [
        [data[start:stop]]
        for start, stop in page_steps(len(data), 1_000_000)
    ]
    pa_batches = [
        pa.record_batch([
            pa.array(c) for c in chunk
        ], names=names)
        for chunk in pa_chunks
    ]
    pa_table = pa.Table.from_batches(pa_batches)
    kn_table = knext.Table.from_pyarrow(pa_table)
    return kn_table
When KNIME executes this, from_pyarrow throws:
org.knime.python3.nodes.PythonNodeRuntimeException:
Tried writing a batch after a batch with a different size than the first batch. Only the last batch of a table can have a different size than the first batch.
However, the error message does not match what was actually passed in: in this specific case exactly two batches are handed to KNIME, of 1,000,000 and 500,000 rows, so the data does satisfy the criterion KNIME says it expects (only the last batch having a different size).
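For what it's worth, this is easy to verify by inspecting the pyarrow table right before it is handed to from_pyarrow (a minimal check, nothing KNIME-specific):

# Confirms the layout of the table passed to from_pyarrow: two record
# batches, and only the last one has a different size.
print([len(b) for b in pa_table.to_batches()])  # prints [1000000, 500000]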
After investigating further, I have concluded that this happens because KNIME internally tries to re-batch the input, and the 1,000,000-row batch presumably leaves a remainder because it does not partition evenly into KNIME's internal batch size. For example, if I change the page size passed to page_steps in the snippet above from 1_000_000 to 100_000, it works. That alone would not be a problem, but I also found that the internal batch size depends on the input data itself: if I make the string length 3 instead of 76, the original page size of 1_000_000 works. On top of that, having more than one column also seems to influence the batch size. So even if this is not considered a bug (which IMHO it is), there is no way for me to know how to partition my data so that KNIME will accept it.
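To make the combinations I tried concrete, here is a sketch of the variations, reusing page_steps from the snippet above (build_table is a hypothetical helper, only meant to illustrate the parameters I varied):

# Hypothetical helper: same pipeline as the execute snippet above,
# parameterised by row count, string length and page size.
def build_table(n_rows, str_len, page_size):
    data = ["A" * str_len for _ in range(n_rows)]
    batches = [
        pa.record_batch([pa.array(data[start:stop])], names=["A"])
        for start, stop in page_steps(len(data), page_size)
    ]
    return knext.Table.from_pyarrow(pa.Table.from_batches(batches))

# build_table(1_500_000, 76, 1_000_000)  # fails with the batch-size error
# build_table(1_500_000, 76, 100_000)    # works
# build_table(1_500_000, 3, 1_000_000)   # also works, same page size as the failing call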
That leaves only two ways to work around this: not batching at all, which is memory demanding, or using very small batches, which drastically increases execution time. Neither option is acceptable.
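For completeness, here is roughly what those two workarounds look like, reusing the names from the snippet above (a sketch only; the 10_000 page size is just an illustrative guess, since I have no way to know what a safe value would be):

# Workaround 1: no batching -- collapse everything into a single chunk
# before handing it to KNIME (keeps the whole table in memory at once).
kn_table = knext.Table.from_pyarrow(pa_table.combine_chunks())

# Workaround 2: very small batches -- re-partition with a tiny page size
# (much slower for large tables).
small_batches = [
    pa.record_batch([pa.array(data[start:stop])], names=names)
    for start, stop in page_steps(len(data), 10_000)
]
kn_table = knext.Table.from_pyarrow(pa.Table.from_batches(small_batches))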