Misleading Label - Technically wrong

Most nodes that mention "Concurrent" are not actually processing in concurrent mode (processing n rows at any given time) but in batch mode (processing sets of n rows before starting the next set).

Hi @fe145f9fb2a1f6b,

thank you for the effort to submit feedback!

In this instance (REST Request nodes), the requests are done concurrently (if configured and possible).

The node builds (GitHub) what we call a ColumnRearranger. This rearranger leads to the creation of a RearrangeColumnsTable (GitHub), which is implemented via the MultiThreadWorker (GitHub), which in turn offers “processing n rows at any given time” (with a queue of “done” rows). Depending on the concurrency setting of the node and the KNIME worker pool load, requests are sent in parallel or not.
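As an illustration of "up to n requests in flight at once", here is a minimal sketch using a plain Python thread pool. This is my own toy model, not the actual MultiThreadWorker (which additionally keeps a bounded queue of finished rows to preserve output order): as soon as any request returns, the freed worker picks up the next row, so no fixed batches are formed.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_request(row_id, latency):
    """Stand-in for an HTTP request: just sleep for `latency` seconds."""
    time.sleep(latency)
    return row_id

def run(latencies, concurrency):
    """Process one row per latency, with at most `concurrency` in flight.

    Returns the row ids in the order their requests finished.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(fake_request, i, lat)
                   for i, lat in enumerate(latencies)]
        return [f.result() for f in as_completed(futures)]

# Row 0 is slow; with concurrency 2 the second worker keeps churning
# through rows 1-6 while row 0 is still running, so row 0 finishes last.
order = run([0.5] + [0.05] * 6, concurrency=2)
```

Under strict batching, rows 3 to 6 could not even have started before row 0 returned; here they all finish before it.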

Maybe it was an unfortunate pick for an example… so do you have other nodes in mind where you encountered this “concurrency” vs. “batch” problem?

Best,
Manuel


I can confidently say that this node (up to at least KNIME 5.4) does batching.

If you have a list of GET requests and some of them are long-running (1 min+ before returning a status code) while others respond in seconds, the node will idle and wait for the current batch to complete before starting another batch.

The code from the 5.4 releases is more or less the same (modulo the changes to enhanced concurrency while blocking on I/O).

The node guarantees that the input row order is the output row order, so any intermediate long-running request blocks all following rows once the queue of "done" tasks is full. The "first requests" are a bit special, since they are used to determine the output spec (more specifically, the first non-failure response does), but in general I would say my point still holds.

There is no definition of "batch" in the code ("sets of n rows", as you wrote), so imho "Batch size" would not be an accurate label.

So tl;dr: if the number of rows is smaller than the "concurrency" setting n, it's concurrent.
But if the number of rows is larger, the node will take the first n rows, finish processing all of them, and then start the next n rows. That is literally the definition of batching.

Proper concurrent processing wouldn't let one long-running task block all subsequent tasks just because there is a limit on parallel processes.

Currently, once you are past the first batch, the node will always just trigger n requests, wait for all of them to complete, and then trigger the next n.

When you use the REST nodes, there is no instruction to unconditionally wait until a set of rows has finished processing; i.e., the set of n rows to be processed is not defined a priori (whereas a "batch" would be well defined by the set of rows it contains). So in my understanding of batch processing, this does not fall under the term. Maybe we have different definitions of "batch processing"?

The MultiThreadWorker uses a bounded, ordered finished-task queue (GitHub), so if n is lower than the number of input rows, there can be scenarios where rows finish as if they had been processed in batches. But in general, the latency until row k is finished does not depend on row j, where k < j, even if they are fewer than n rows apart.
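The interplay of the bounded, ordered queue and per-row latency can be made concrete with a small event-time model (my own simplification of the behavior described here, not KNIME's code): row k may only start once row k − n has been emitted, and rows are emitted strictly in input order.

```python
def simulate_window(latencies, n):
    """Sliding-window schedule with an ordered, bounded done-queue of size n."""
    start, finish, emit = [], [], []
    for k, lat in enumerate(latencies):
        s = 0.0 if k < n else emit[k - n]          # backpressure: wait for a queue slot
        f = s + lat                                # request runs for `lat` time units
        e = f if k == 0 else max(f, emit[k - 1])   # strictly in-order emission
        start.append(s)
        finish.append(f)
        emit.append(e)
    return start, finish, emit

# A long-running row 0 pins the window: rows 1 and 2 finish at t = 1,
# but rows 3-5 cannot start before row 0 is emitted at t = 10.
start, finish, emit = simulate_window([10, 1, 1, 1, 1, 1], n=3)
```

Conversely, the finish time of row k never depends on any later row j > k: with latencies [1, 10, 1, 1] and n = 3, row 3 starts at t = 1, because the fast row 0 frees its slot immediately even while row 1 is still running.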

My point is that we don’t define a batch a-priori, so it does not constitute batch processing.

I agree that the label as it is now can be improved, though. For example as “Maximum concurrency – Defines the maximum number of concurrent requests.”.


Still, this is not concurrent processing if a single job can block/clog the whole queue and prevent any job further down the queue from being picked up.

If you have 100 rows and throw them in, with concurrent executions currently restricted to 3, you will always be able to observe (starting at row 4 to ignore the first "set"):

  • start processing of rows 4, 5 and 6 at the same time or with the delay configured
  • wait for completion of rows 4, 5 and 6
  • only now start of rows 7, 8 and 9, again at the same time or with the delay configured

You may not define batching in the code, but the result acts like batching or chunking whenever n is smaller than the total number of rows.
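For contrast, here is strict batch scheduling in the same toy style (again my own model, not anything in the KNIME code base): rows are partitioned into fixed groups of n up front, and group i + 1 only starts after every row of group i has finished.

```python
def simulate_batches(latencies, n):
    """A-priori batches of size n with a barrier between consecutive batches."""
    start, finish = [], []
    t = 0.0
    for i in range(0, len(latencies), n):
        for lat in latencies[i:i + n]:
            start.append(t)
            finish.append(t + lat)
        t = max(finish[i:i + n])   # barrier: wait for the whole batch
    return start, finish

# With uniform latencies, rows 3-5 start only after rows 0-2 all finished,
# which is indistinguishable from what a sliding-window scheduler produces.
start, finish = simulate_batches([1, 1, 1, 1, 1, 1], n=3)
```

With a skewed latency inside a batch the two models diverge: [1, 10, 1, 1] with n = 3 delays row 3 until t = 10 here, while a scheduler that frees a slot per finished row would start it at t = 1. That difference is exactly what the two sides of this thread are arguing about.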

I disagree that backpressure from the bounded queue disqualifies the processing from being "concurrent".

The MultiThreadWorker does not wait for the completion of row 6 to output row 4. The next row to be output gets picked up from the done queue, in your example row 4, as soon as this row has a response. It makes room in the queue so row 7 can start. Rows 5 and 6 may still wait for their response, but neither putting row 4 into the output table nor starting the request of row 7 got blocked by them.
If you define the workings of the node as you outlined, then it is batching, but that is not what the code says.

You can induce batched responses by slowing down every n-th response until the n − 1 responses after it are done. But this scenario is not actively created by the MultiThreadWorker; it is created by the specifically skewed task latency. So in my opinion, labeling this "batch processing" is not accurate, since, as I outlined above, it can never be the case that an earlier row waits for the completion of a later row, which is what would happen if you defined a batch of rows to be processed together.

Or you simply happen to have real-life processes with varying run or response times.

Hence, if a long runner at queue position x "pins" the queue and blocks anything at a position further down than x + n, then calling it concurrency seems misleading.

I think the point of the setting is to give users the possibility to not bombard a single endpoint with too many parallel requests, but allow some level of parallelization if possible (or in the extreme case to not exhaust local ports).

“Maximum concurrency – Defines the maximum number of concurrent requests.”

Do you agree that this would be less misleading?

No.

What you are describing is "limiting parallelism", not concurrency. This is a nuance, but a very relevant one from a technical point of view (and likely not obvious if you aren't a native English speaker or deep into comp-sci topics).