Optimize a join

Hi,

I have a join that takes a very long time, and I have some questions.

1 - What are the criteria for choosing the "Maximum number of open files" setting? The default is 200; is 1000 better? Is 10000000000 better or worse?

2 - If I split the rows between 4 Joiner nodes, can I get a better execution time? (4 Java Snippet + 4 Joiner + 1 Concatenate node)

3 - If I sort the tables on the join columns before the Joiner node, can I get a better execution time?

4 - Does the length of the values in the join columns affect the execution time? Is "00000000100" handled the same as "100"?

5 - Can the presence of missing values in the join columns affect the execution time?

Thanks in advance

Hi darbon,

this question will almost certainly require an "official" answer from one of the KNIME core developers, since they know all the gory details of the Joiner node.

Based on my (limited) knowledge, I can say that the Joiner itself splits large input tables into smaller chunks and sorts these chunks internally. AFAIK there is no detection of pre-sorted tables - the code from the Sorter node is reused for this purpose - so I would not expect a benefit from your point (3), and no big benefit from approach (2) either, unless you have a very special distribution of keys that you can exploit somehow.
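
To make the sort-then-merge idea concrete, here is a minimal in-memory sketch of a sort-merge join on a String key. This is not KNIME's actual Joiner code, just an illustration of why input order does not matter when the node sorts unconditionally anyway:

    import java.util.*;

    public class SortMergeJoinSketch {

        // Inner join of two (key, value) tables on the String key in
        // column 0. Both inputs are sorted unconditionally, mirroring
        // the behaviour described above: pre-sorted input gives no
        // advantage because the sort happens anyway.
        static List<String[]> join(List<String[]> left, List<String[]> right) {
            Comparator<String[]> byKey = Comparator.comparing(r -> r[0]);
            left.sort(byKey);
            right.sort(byKey);
            List<String[]> out = new ArrayList<>();
            int i = 0, j = 0;
            while (i < left.size() && j < right.size()) {
                int c = left.get(i)[0].compareTo(right.get(j)[0]);
                if (c < 0) {
                    i++;
                } else if (c > 0) {
                    j++;
                } else {
                    // Emit every pairing of rows that share this key.
                    String key = left.get(i)[0];
                    int firstRight = j;
                    while (i < left.size() && key.equals(left.get(i)[0])) {
                        for (j = firstRight; j < right.size() && key.equals(right.get(j)[0]); j++) {
                            out.add(new String[]{key, left.get(i)[1], right.get(j)[1]});
                        }
                        i++;
                    }
                }
            }
            return out;
        }

        public static void main(String[] args) {
            List<String[]> left = new ArrayList<>(Arrays.asList(
                    new String[]{"200", "b"}, new String[]{"100", "a"}));
            List<String[]> right = new ArrayList<>(Arrays.asList(
                    new String[]{"100", "x"}, new String[]{"300", "y"}));
            for (String[] row : join(left, right)) {
                System.out.println(Arrays.toString(row)); // [100, a, x]
            }
        }
    }

The real node additionally spills sorted chunks to temporary files on disk and merges them; AFAIK the "Maximum number of open files" setting from your question (1) bounds how many of these temporary files may be open at once during merging, so raising it can help up to whatever limit the operating system imposes.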

The runtime of string comparisons certainly depends on the length of the strings, so (4) might help - but I would not expect a dramatic effect.
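
If you want to see the effect of key length in isolation, a tiny benchmark sketch like the following (plain Java, not KNIME code; absolute timings vary by JVM) shows that comparing equal long keys costs more than comparing equal short keys:

    public class KeyLengthSketch {
        public static void main(String[] args) {
            // Equal keys in two lengths. new String(...) avoids interning,
            // so equals() cannot take its same-reference shortcut and has
            // to compare the characters.
            String shortA = "100";
            String shortB = new String("100");
            String longA  = "0000000000000000100";
            String longB  = new String("0000000000000000100");

            final int n = 50_000_000;
            boolean sink = false;
            long t0 = System.nanoTime();
            for (int i = 0; i < n; i++) sink ^= shortA.equals(shortB);
            long t1 = System.nanoTime();
            for (int i = 0; i < n; i++) sink ^= longA.equals(longB);
            long t2 = System.nanoTime();

            System.out.printf("short keys: %.1f ms%n", (t1 - t0) / 1e6);
            System.out.printf("long keys:  %.1f ms%n", (t2 - t1) / 1e6);
            System.out.println(sink); // keeps the JIT from discarding the loops
        }
    }

Note that as strings, "00000000100" and "100" are simply not equal and would not match at all; the length mostly matters for keys that do match, or that share long common prefixes during sorting.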

I have seen significant speedups when I added special handling for rows with missing values - however, those were tables that contained a significant number of missing values, and it was also a couple of KNIME versions ago.
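
A minimal sketch of that kind of special handling, assuming rows whose join key is missing can simply be set aside (for an inner join they can never produce a match anyway):

    import java.util.*;
    import java.util.stream.Collectors;

    public class MissingKeySplitSketch {
        public static void main(String[] args) {
            // (key, value) rows; a null key stands for a missing value
            // in the join column.
            List<String[]> rows = Arrays.asList(
                    new String[]{"100", "a"},
                    new String[]{null,  "b"},
                    new String[]{"200", "c"});

            // Set the missing-key rows aside before the join: they can
            // never match, so there is no point in sorting and
            // comparing them.
            Map<Boolean, List<String[]>> parts = rows.stream()
                    .collect(Collectors.partitioningBy(r -> r[0] == null));

            List<String[]> joinable = parts.get(false);
            List<String[]> setAside = parts.get(true);
            System.out.println("rows to join: " + joinable.size()); // 2
            System.out.println("set aside:    " + setAside.size()); // 1
        }
    }

In a workflow you can get the same effect by putting something like a Row Splitter on the join columns in front of the Joiner and concatenating the missing-key rows back afterwards if you still need them.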

Just to mention it: if there is any way for you to make more memory available to KNIME, this would almost certainly help.
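
For the desktop application the heap limit is the standard JVM -Xmx setting in the knime.ini file in your KNIME installation directory; the value below is only an example, pick whatever your machine can spare:

    -Xmx4096m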

Nils