Ways to speed this up? 10M+ rows

Hi @carstenhaubold

Thanks so much for replying.

Agree with you, I’m not expecting KNIME to match Alteryx AMP, but this performance does seem out of kilter to me.

I definitely remember this being faster around 2 to 2.5 years ago, when I last did a digital analytics project in KNIME with this data volume.

What I am preparing at the moment is a set of tools to allow digital analytics users to easily query, download and process GA4 BigQuery data (which is a significant pain point at the moment) without needing to write complex SQL in the Google BigQuery console.
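
To give an idea of the kind of query these tools are meant to wrap, here is a minimal sketch using the google-cloud-bigquery Python client against the GA4 export schema; the project, dataset and date range are placeholders, not the real setup:

```python
# Minimal sketch: pull GA4 export events from BigQuery and flatten one
# event parameter, so analysts don't have to write the UNNEST logic
# themselves. "my-project" and "analytics_123456789" are placeholders
# for the real GCP project and GA4 export dataset.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
SELECT
  event_date,
  event_name,
  user_pseudo_id,
  (SELECT value.string_value
   FROM UNNEST(event_params)
   WHERE key = 'page_location') AS page_location
FROM `my-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240107'
"""

# to_dataframe() needs pandas (and db-dtypes) installed alongside the client.
df = client.query(sql).to_dataframe()
print(df.head())
```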

I think @mwiegand has a good example, as he reports the same issue.
I will prepare a sanitized version of the data I am working with and send you excerpts of the processes I am running so you can have a look.

Thanks.

Gavin


Hi @Gavin_Attard

Thanks for getting back to us.

I definitely remember this being faster around 2 to 2.5 years ago, when I last did a digital analytics project in KNIME with this data volume.

That is very interesting! Do you remember whether that was already using the Columnar Backend (which we released with KNIME 4.3 ~2.5 years ago) or the default row-based backend? We didn’t change much under the hood for the row-based backend; what we did do is update to Java 17, which could have had an impact on performance.

I think @mwiegand has a good example, as he reports the same issue.
I will prepare a sanitized version of the data I am working with and send you excerpts of the processes I am running so you can have a look.

Awesome, thank you! We’ll start investigating @mwiegand’s example until we have your use case.

Cheers,
Carsten


Chiming in about the approximate time when the performance was (presumably) much better. I checked the file metadata in my workflows and can confirm, which also correlates with what I remember, that around April 2021 performance was much better. This relates to workflows processing vast amounts (>10 million rows) of firewall requests with >100 columns containing not just strings but also semi-complex data like lists or JSON.

Back then I used the default backend, which I assume is not the new columnar one. Some other workflows which crunched huge amounts of data, like thousands of XML files with massive parallelism, date back to the third quarter of 2019. I used to run these XML-crunching workflows until around March 2022 and did not experience a significant performance regression.

The WAF logs, though, contained substantially more rows but used significantly less parallelism. So somewhere between April 2021 and 2022, something might have caused a performance regression.


Thanks so much for the detailed investigation, @mwiegand! I assume you were always on the “most recent” KNIME version, updating as soon as an update was offered inside KNIME AP?

Back then I used the default backend, which I assume is not the new columnar one.

Exactly, the default backend is not the columnar backend.

I used to run these XML-crunching workflows until around March 2022 and did not experience a significant performance regression.

I just checked: we have used the Java 17 JVM since KNIME AP 4.6, which was released in June 2022. That might coincide with the drop in performance… Very suspicious.


You are most welcome, and yes, I am always at the forefront when it comes to updates. Maybe one minor detail: up until recently I had an Apple MacBook Pro (2016) in use. Java, I roughly recall, was kind of different on OS X, as Apple does things “differently” … pun intended, but not necessarily for the better :thinking:

Hi @carstenhaubold

What is the best way to send you the data?

It is still being prepared; so far it has been running for 12 hours… (still going).

I am hashing values. The hash per column is quick; the bottleneck is the Loop End.
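
For context, the sanitization is conceptually just a salted hash of each string column. A minimal sketch of doing it in a single pass, for example in a Python Script node, rather than per column in a loop; the salt and the demo table are placeholders, not the actual data:

```python
# Minimal sketch: salted SHA-256 hash of every string column in one pass.
# SALT is a placeholder; swap in whatever secret is used for sanitizing.
import hashlib
import pandas as pd

SALT = b"replace-me"

def hash_value(v):
    # Keep missing values missing, hash everything else as UTF-8 text.
    if pd.isna(v):
        return None
    return hashlib.sha256(SALT + str(v).encode("utf-8")).hexdigest()

def hash_string_columns(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].map(hash_value)
    return out

# Tiny demo table; in a KNIME Python Script node the real input table
# would arrive on the node's input port instead.
demo = pd.DataFrame({"user_id": ["a", "b", None], "page": ["/home", "/cart", None]})
print(hash_string_columns(demo))
```

Doing all columns in one node would sidestep the Loop End having to collect and concatenate an intermediate table per iteration, which is where the time currently goes.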

I also tried capturing a table write process for you.

Let me know how I can best send you the file.

kr Gavin


Hi @carstenhaubold

I’ve had to give up sanitizing the data, as the Loop End brought everything to a halt: 18 hours in and not even halfway.

@Gavin_Attard I modified @mwiegand’s example and ran it with the standard backend on my Mac M1 with KNIME 4.7.4, with fixed RowIDs and the Cache node set to write to disk. Maybe you can give this a try. RAM was set to 12 GB.

Parallelization did not really work. The rest I will have to continue checking.
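
For reference, a 12 GB heap like the one used here is typically set via the -Xmx entry in knime.ini; a minimal excerpt under the stated assumption that only the JVM arguments after -vmargs are touched and the rest of the file stays as it is:

```
-vmargs
-Xmx12g
```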


We have received the data from @Gavin_Attard and are looking into it, as well as into the workflows you have kindly provided. Thank you all for your cooperation!


Just wanted to share a contrary performance experience: while working on some image deduplication tasks, the disk read throughput peaked beyond 800 MB/s.

@mlauber71 I have not back-tested your adjusted version of my workflow (yet). Can you help me understand where exactly you managed to squeeze out more performance?

Cheers
Mike

I am not quite certain if this is related, but for me it falls into the “performance” category. In two other posts I noticed that KNIME becomes quite unstable in the presence of a binary object column. I then noticed it is not exclusive to binary objects either.

The node configuration dialog opens with about a 30-second delay, or sometimes doesn’t open at all. The output preview throws an error when sorting, and scrolling to the binary column freezes the preview so that it cannot even be closed.

I believe it might be related to this, as the performance regression becomes quite discernible under the described circumstances. Maybe this helps track down the perceived regression here too?

Best
Mike

Hi all,

I’d like to circle back to this topic, as I recently read about a regression caused by Windows which was presumably fixed recently. However, Microsoft wasn’t quite specific, so the exact circumstances remain elusive.

Nevertheless, it might be a possible explanation for, or contribution to, the experienced regression. Maybe someone else has more background information or some “good news”, as Professor Hubert J. Farnsworth would say …

Best
Mike
