When creating a table in a KNIME Python extension node, there is unexpected behaviour in how KNIME handles batching internally that, as far as I can tell, makes it impossible to predict which batch sizes KNIME will accept, and that effectively forces you to avoid pyarrow's batching altogether. This may look like a user error, but given how the API is set up I think it is actually an oversight in the KNIME Python extension.
Let me explain with a simplified snippet; here is the problematic execute method:
import knime.api.table as kt
import knime.extension as knext
import pyarrow as pa
....
def execute(self, exec_context: knext.ExecutionContext, table: kt._TabularView):
    def page_steps(length, page_size):
        # Yield (start, stop) pairs that split `length` rows into pages of
        # `page_size`; only the last page may be smaller.
        start, end = 0, 0
        for _ in range(0, length + 1, page_size):
            start, end = end, min(end + page_size, length)
            yield start, end
            if end == length:
                return

    # 1.5M rows of a single string column, partitioned into pages of 1M rows.
    data = ["A" * 76 for _ in range(1_500_000)]
    names = ["A"]
    pa_chunks = [
        [data[start:stop]]
        for start, stop in page_steps(len(data), 1_000_000)
    ]
    pa_batches = [
        pa.record_batch([
            pa.array(c) for c in chunk
        ], names=names)
        for chunk in pa_chunks
    ]
    pa_table = pa.Table.from_batches(pa_batches)
    kn_table = knext.Table.from_pyarrow(pa_table)
    return kn_table
When KNIME executes this, from_pyarrow throws:
org.knime.python3.nodes.PythonNodeRuntimeException:
Tried writing a batch after a batch with a different size than the first batch. Only the last batch of a table can have a different size than the first batch.
However, the error message does not match what was actually passed in: in this specific case exactly two batches are handed to KNIME, of 1,000,000 and 500,000 rows, so the data does satisfy the criterion KNIME says it expects (only the last batch having a different size).
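For what it's worth, this is easy to verify by inspecting the pyarrow table right before it is handed to from_pyarrow (a minimal check, nothing KNIME-specific):

# Confirms the layout of the table passed to from_pyarrow: two record
# batches, and only the last one has a different size.
print([len(b) for b in pa_table.to_batches()])  # prints [1000000, 500000]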
After investigating further, I have concluded that this happens because KNIME internally tries to re-batch the input, and the 1,000,000-row batch presumably leaves a remainder because it does not partition evenly into KNIME's internal batch size. For example, if I change the page size passed to page_steps in the snippet above from 1_000_000 to 100_000, it works. That alone would not be a problem, but I also found that the internal batch size depends on the input data itself: if I make the string length 3 instead of 76, the original page size of 1_000_000 works. On top of that, having more than one column also seems to influence the batch size. So even if this is not considered a bug (which IMHO it is), there is no way for me to know how to partition my data so that KNIME will accept it.
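To make the combinations I tried concrete, here is a sketch of the variations, reusing page_steps from the snippet above (build_table is a hypothetical helper, only meant to illustrate the parameters I varied):

# Hypothetical helper: same pipeline as the execute snippet above,
# parameterised by row count, string length and page size.
def build_table(n_rows, str_len, page_size):
    data = ["A" * str_len for _ in range(n_rows)]
    batches = [
        pa.record_batch([pa.array(data[start:stop])], names=["A"])
        for start, stop in page_steps(len(data), page_size)
    ]
    return knext.Table.from_pyarrow(pa.Table.from_batches(batches))

# build_table(1_500_000, 76, 1_000_000)  # fails with the batch-size error
# build_table(1_500_000, 76, 100_000)    # works
# build_table(1_500_000, 3, 1_000_000)   # also works, same page size as the failing call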
That leaves only two ways to work around this: not batching at all, which is memory demanding, or using very small batches, which drastically increases execution time. Neither option is acceptable.
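For completeness, here is roughly what those two workarounds look like, reusing the names from the snippet above (a sketch only; the 10_000 page size is just an illustrative guess, since I have no way to know what a safe value would be):

# Workaround 1: no batching -- collapse everything into a single chunk
# before handing it to KNIME (keeps the whole table in memory at once).
kn_table = knext.Table.from_pyarrow(pa_table.combine_chunks())

# Workaround 2: very small batches -- re-partition with a tiny page size
# (much slower for large tables).
small_batches = [
    pa.record_batch([pa.array(data[start:stop])], names=names)
    for start, stop in page_steps(len(data), 10_000)
]
kn_table = knext.Table.from_pyarrow(pa.Table.from_batches(small_batches))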