GroupBy: Serious Performance Regression

Hi,

while KNIME was crunching the data set from the current Just KNIME It! data challenge (21k rows, 21 columns), I kept facing a serious performance regression with this simple configuration that simply counts values (total and unique) per id:

More than one minute later, all cores are still quite busy:


The benchmark ran three times … all the same.

Since this has never been an issue before (I have crunched hundreds of thousands to millions of rows, with many more columns, in less time), I believe this is a bug.

I run the most recent version of KNIME and all extensions on Windows 11.

Best
Mike

Hi Mike,

(I assume you are referring to this data set, not the one linked above)

I tried KNIME 5.2.4. Execution time of the GroupBy node, as per the “Timer Info” node, is <0.2s. The workflow is here:

Please clarify.

Hi @wiswedel,

apologies, yes, I was referring to the data set from challenge 3. Odd that it is working as quickly as expected for you. Here is my solution; maybe you can spot something:

The issue still persists for me, though.

Best
Mike

Hi Mike,

it’s as quick as we would expect it to be:
[screenshot: executed workflow]

Do you mind extracting a “thread dump” for me while the GroupBy node is running? (FAQ)
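In case the FAQ is hard to find: the JDK’s jstack tool pointed at KNIME’s process ID produces such a dump. Purely as an illustration (this is not the KNIME-specific route the FAQ describes), a JVM can also dump its own threads through the standard java.lang.management API:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

public final class SelfThreadDump {
    public static void main(String[] args) {
        // Dump name, state and full stack trace of every live thread
        // in the current JVM to stdout.
        for (ThreadInfo ti : ManagementFactory.getThreadMXBean()
                .dumpAllThreads(true, true)) {
            System.out.printf("\"%s\" state=%s%n",
                    ti.getThreadName(), ti.getThreadState());
            for (StackTraceElement frame : ti.getStackTrace()) {
                System.out.println("\tat " + frame);
            }
            System.out.println();
        }
    }
}
```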

– Bernd

Hi @wiswedel,

here you go. I closed and restarted KNIME, reset the workflow up to that GroupBy node, and started execution, experiencing the same regression. I then immediately took a thread dump, one more around halfway through, and another after processing finished.

Here is the zip, disguised as a txt.

240606 Knime topic 80009.zip.txt (29.0 KB)

Best
Mike

Thank you.

… I have an explanation. :slight_smile:

Your node is configured to allow group sizes of up to 10M elements:

[screenshot: GroupBy node configuration dialog]

The default is 10k, which is why my workflow above doesn’t show the problem.

Why did I not notice this earlier when trying your workflow? … because I was using a nightly build (5.3.0 Nightly), which uses a different, more efficient grouping implementation.

I don’t think this is a performance regression. With these settings (many groups, 21k of them, each allowed up to 10M elements) it has always been “inefficient”.
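To make that concrete, here is a minimal, purely hypothetical sketch (not KNIME’s actual code) of a grouping implementation that sizes each group’s set of seen values for the configured maximum up front. With 21k groups and a 10M hint that becomes very expensive, while the 10k default stays cheap:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public final class NaiveUniqueCount {
    public static void main(String[] args) {
        // Hypothetical: the configured "maximum unique values per group".
        final int maxUniquePerGroup = 10_000_000;

        Map<String, Set<String>> uniquesPerGroup = new HashMap<>();
        String[][] rows = {{"id1", "a"}, {"id1", "b"}, {"id2", "a"}};

        for (String[] row : rows) {
            // Sizing each group's set for the worst case is the costly part:
            // java.util.HashSet allocates its table at the capacity hint on
            // the first insert, so 21k groups x a 10M hint means gigabytes of
            // mostly empty tables, while the 10k default stays harmless.
            uniquesPerGroup
                .computeIfAbsent(row[0], k -> new HashSet<>(maxUniquePerGroup))
                .add(row[1]);
        }

        uniquesPerGroup.forEach(
            (id, vals) -> System.out.println(id + " unique: " + vals.size()));
    }
}
```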

To fix your problem, either set a smaller group size parameter or use the 5.3.0 nightly (the release is planned for mid-July).

Thanks for reporting it.

– Bernd


Thanks Bernd! It seems my understanding of the threshold is not accurate. I thought I had run into compute constraints because the threshold of unique values was exceeded, causing missing values to be entered. Hence I increased the number, as I have done many times before in other workflows.

Maybe it’s an option to make the skip voluntary, as this one setting (please correct my explanation if it’s wrong) does two things: it divides the ingress data into groups of 10k unique values each, and it inserts missing values in case of “overflow”.
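To make my (possibly wrong) mental model concrete, here is a hypothetical sketch; the method and names are made up for illustration, this is not KNIME code:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ThresholdMentalModel {
    // Hypothetical sketch: how I pictured the
    // "maximum unique values per group" setting behaving.
    static Integer countUnique(List<String> groupValues, int maxUnique) {
        Set<String> seen = new HashSet<>();
        for (String value : groupValues) {
            seen.add(value);
            if (seen.size() > maxUnique) {
                // "Overflow": the count is abandoned and a missing
                // value is written to the output instead.
                return null;
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countUnique(List.of("a", "b", "c"), 2)); // null
        System.out.println(countUnique(List.of("a", "b"), 2));      // 2
    }
}
```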

I still believe I am not getting the concept right, am I?

PS: The same applies to the Pivot node too.

Cheers
Mike
