Agree with you, I’m not expecting KNIME to match Alteryx AMP, but this performance does seem out of kilter to me.
I definitely remember this being faster around 2 - 2.5 years ago, when I last did a digital analytics project on KNIME with this data volume.
What I am preparing at the moment is a set of tools to allow Digital Analytics users to easily query, download and process GA4 BigQuery data (which is a significant problem at the moment) without the need to write complex SQL in the Google BQ console.
I think @mwiegand has a good example as he reports the same issue.
I will prepare a sanitized version of the data I am working with and send you excerpts of the processes I am running for you guys to have a look at.
I definitely remember this being faster around 2 - 2.5 years ago, when I last did a digital analytics project on KNIME with this data volume.
That is very interesting! Do you remember whether that was already using the Columnar Backend (which we released with KNIME 4.3 ~2.5 years ago) or the default (row-based) backend? We didn’t change much under the hood for the row-based backend; what we did do, however, is update to Java 17, which could have had an impact on performance.
I think @mwiegand has a good example as he reports the same issue.
I will prepare a sanitized version of the data I am working with and send you excerpts of the processes I am running for you guys to have a look at.
Awesome, thank you! We’ll start to investigate @mwiegand’s example then until we have your use case.
Chiming in about the approximate time when the performance was (presumably) much better. I checked the file metadata in my workflows and can confirm, which also correlates with what I remember, that around April 2021 performance was much better. This relates to workflows processing vast amounts (>10 million) of firewall requests with >100 columns containing not just strings but also semi-complex data like lists or JSON.
Back then I used the default backend, which I assume is not the new columnar one. Some other workflows which crunched huge amounts of data, like thousands of XML files with massive parallelism, date back to the 3rd quarter of 2019. I used to run these XML crunching workflows until around March 2022 and did not experience a significant performance regression.
The WAF logs, though, contained substantially more rows but used significantly less parallelism. So somewhere between April 2021 and 2022 something might have caused a potential performance regression.
Thanks so much for the detailed investigation @mwiegand! I assume you were always using the “most recent” KNIME version by updating as soon as the update was offered inside KNIME AP?
Back then I used the default backend, which I assume is not the new columnar one.
Exactly, the default backend is not the columnar backend.
I used to run these XML crunching workflows until around March 2022 and did not experience a significant performance regression.
I just checked. We have used the Java 17 JVM since KNIME AP 4.6, which was released in June 2022. That might coincide with the drop in performance… Very suspicious.
You are most welcome, and yes, I am always at the forefront when it comes to updates. Maybe one minor detail: up until recently I had an Apple MacBook Pro (2016) in use. Java, I roughly recall, was kind of different on OSX, as Apple does things “differently” … pun intended, and not necessarily for the better.
@Gavin_Attard I modified the example of @mwiegand and ran it with the standard backend on my Mac M1 with KNIME 4.7.4, with fixed RowIDs and the Cache node written to disk. Maybe you can give this a try. RAM was set to 12 GB.
Parallelization would not really work. The rest I will have to continue to check.
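For anyone who wants to double-check the memory setting before comparing timings: the heap limit is normally set through the `-Xmx` line in `knime.ini` (for 12 GB that would be `-Xmx12g`). Below is a minimal Python sketch that reads such a value from the text of an ini file. The function name `parse_max_heap_bytes` and the sample text are made up for illustration, and treating the last `-Xmx` line as the effective one is an assumption.

```python
import re
from typing import Optional

# Multipliers for the size suffixes accepted after -Xmx (no suffix = bytes).
_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_max_heap_bytes(ini_text: str) -> Optional[int]:
    """Return the -Xmx value in bytes from knime.ini-style text, or None."""
    result = None
    for line in ini_text.splitlines():
        match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", line.strip())
        if match:
            number, unit = match.groups()
            # Assumption: if several -Xmx lines exist, the last one counts.
            result = int(number) * _UNITS[unit.lower()]
    return result


if __name__ == "__main__":
    # Hypothetical excerpt; a real knime.ini contains many more lines.
    sample = "-vmargs\n-Xmx12g\n-Dsome.option=true\n"
    heap = parse_max_heap_bytes(sample)
    print("no -Xmx line found" if heap is None else f"{heap / 1024**3:.1f} GiB")
```

To check a real installation, read the contents of your own `knime.ini` and pass the text to the function; the file's location depends on where KNIME is installed.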
We have received the data from @Gavin_Attard and are looking into that and into the workflows you have kindly provided. Thank you all for your cooperation!
@mlauber71 I have not back-tested your adjusted version of my workflow (yet). Can you help me understand where exactly you managed to squeeze out more performance?
I am not quite certain if that’s related, but for me it falls into the “performance” category. In two other posts I noticed that KNIME becomes quite unstable in the presence of a binary object column. I then noticed it is not exclusive to binary objects either.
Node configuration opens with about a 30 sec. delay or sometimes doesn’t open at all. Sorting in the output preview throws an error, and scrolling to the binary column freezes the preview so that it cannot even be closed.
I believe it might be related to this, as the performance regression becomes quite discernible under the described circumstances. Maybe this helps in tracking down the perceived regression here too?
I’d like to circle back to this topic, as I recently read about a regression caused by Windows which was presumably fixed recently. However, Microsoft wasn’t very specific, so the exact circumstances are elusive.
Nevertheless, it might be a possible explanation for, or contribution to, the regression experienced here. Maybe someone else has more background information or got some “good news”, as Professor Hubert J. Farnsworth would say …