Should it be taking this long?

knightyboy · June 22, 2020, 3:12pm

Apologies for the semi-repeat post however I have read through many posts (many on old versions) and applied what I have found to little effect.

Scenario:

4 main data sources - 2 excel files, 2 tables on sql server
Large workflow - Just over 1,000 nodes
End dataset - 350k rows, 56 columns
Total Time to process - >4 hours
Longest node to run - 87 minutes (Joiner: Left outer join)

Hardware:

Windows 10 VM on Azure
Intel® Xeon® Platinum 8272CL
32GB RAM

Have experimented with the Knime.ini configuration and is currently set to use 18GB and 1m cellsinmemory

KNIME has been updated to the latest release

The data is all transactional sales data from 4 different systems which need to be transformed before being combined with additional joins from other smaller ‘dictionaries’ such as FX rates etc.

The longest node to run is a left outer join with the left table being the sales data (350k rows), and the right table being all the UK postcode prefixes (3k rows, 3 columns). This is done in order to validate which prefixes in the sales data are valid (most of the data comes from 2 of the systems without built in validation, also need to add cities to all of the data).

Additional long running nodes are all joins, again with the 350k left and anywhere from 300 to 18k on the right. These include joins from a pre-downloaded dataset from Companies House (only records that we are matching to) and SIC lookups from the codes from Companies House.

My thoughts:

Do I really not have enough computing power to handle these joins quickly? I never seem to fully max out on RAM usage.
I could re-work my workflow to apply the joins to the subsets of data before they are concatenated, but is this that much more efficient if they are potentially running at the same time anyway?
Is there a method more efficient than using Joiner for this?

Truly appreciate any help.

Many thanks,
Charlie

AnotherFraudUser · June 22, 2020, 8:15pm

Hi @knightyboy,

from your data I cannot really tell you if it is normal or not.
Can you give an example workflow for a step you think is particular slow (with dummy data https://nodepit.com/node/org.knime.base.node.util.sampledata.SampleDataNodeFactory )?
Maybe then it is easier to tell you what could be changed

mlauber71 · June 22, 2020, 8:54pm

@knightyboy welcome to the KNIME forum

These things come to my mind:

joins are costly so it makes sense to plan ahead
KNIME seems to be reworking the join nodes: i cannot find the post right now maybe @ScottF can help
then it makes sense to try and do one join/step at a time
i tried to use the cache node before a join on both sides in order to bring the data together and avoid additional internal copying (Execute failed: Cannot read file "knime_container_20200301_1905559478659870280.bin.snappy" - #19 by mlauber71)
then you could try and force the join node to do everything in memory and see if this is getting any faster
then if a workflow becomes very large it tends to use more RAM. It might help to split it into several workflows and maybe call them from a parent workflow (no guarantee there - and of course the effort to rework the whole thing) - you could try with you two largest join files to see
use the garbage collector to free memory
you mentioned Azure: is this some virtual server? I sometimes had the impression that virtual or shared drives over networks make for poor KNIME performance. Local SSD fare much better. If this is the case caching again might be an option
you would have to see if caching is better to be done on disk or in memory. Memory might be faster - but there is always a price to pay
then you could try and change the internal storage from zip to parquet (cf. Execute failed: Cannot read file "knime_container_20200301_1905559478659870280.bin.snappy" - #2 by mlauber71). Although you might be careful with that

Then concerning your data

make sure you do not join large numbers of missing (or dummy IDs or something) against each other. That would also take more time
complex strings could also be a problem
I have thought about but never tried to build a reference file with numbers to my string IDs. Then join the numbers to each file and then join by numbers (that might be more costly than just living with the longer time) but combined with other approaches it might be a measure of last resort

And well if everything else fails. You could try and employ a larger system or a database. Or you could try and see if the SQL server could handle some joins.

You asked for speed but one more experiment could be to use Hive and the local big data environment. In theory Hive should be able to at least finish the job. But that would be more of an issue if your job would not finish at all.

OK since we are at it here a historical collection of entries about performance and KNIME

knightyboy · June 23, 2020, 12:47pm

That’s incredibly helpful, thank you.

I’ve started to split my data before joins to remove the missing values from going through the join, and then concatenating back afterwards, this seems to help a lot (would be nice if the join had it’s own built in logic to do this, I guess I assumed it would skip those rows, although I guess it works much like an Excel lookup)

Once I’ve completed that process, I will look at seeing if I can simplify some of the steps and break the workflow up into subworkflows and adding in some of your other recommendations.

Regarding the garbage collector, how am I meant to utilise that?

I assume this is the link - https://hub.knime.com/Vernalis/extensions/com.vernalis.knime.feature/latest/com.vernalis.knime.misc.gc.node.RunGarbageCollectorNodeFactory

However I don’t understand how to insert that into my workflow, as it contains input/output ports that won’t connect to any of my existing nodes.

Many thanks

mlauber71 · June 23, 2020, 12:56pm

Glad you like the information. If you consider using a call workflow scenario that comes with some quirks. I have tried to set up such a scenario. You might have look:

In general you will have to do some trial and error to get a feeling what is best for your task and plan ahead with the reduction of workload for the system in mind. From my experience the combination of Cache and join in memory (wahtever combination fits best) could help a lot.

You could just use the flow variable. There alse seems to be a ‘heavy garbage collector’ - also some configuration is possible. Maybe you read about that in the description.

mlauber71 · June 25, 2020, 5:29am

Here is a way to have garbage collection without additional installation

system · July 2, 2020, 5:29am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.