If I apply a filter before the join, e.g. (key_first_table=1 and key_second_table=1), the output table has only the first row (which is right). Could anyone explain this to me?
I'm testing the join node with full outer join and I think there is a problem.
I can build a full outer join from 3 inner joins (screenshot); this method works perfectly but takes a long time.
Now I am trying to do the same thing with the join node in full outer join mode (two tables with 5 million rows each). The number of rows in the output is higher than with the previous method. If I group by the join key, I can see a lot of duplicated keys in the output.
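The check described above (comparing the output row count and looking for duplicated keys) can be sketched outside KNIME as well. The following is only a minimal illustration in plain Python, assuming a single join column and ignoring missing values; the function names are hypothetical and this is not how the KNIME node is implemented.

```python
from collections import Counter

def expected_full_outer_rows(left_keys, right_keys):
    """Row count a full outer join on a single key column should produce."""
    left, right = Counter(left_keys), Counter(right_keys)
    total = 0
    for key in left.keys() | right.keys():
        l, r = left[key], right[key]
        # Keys present on both sides pair up (l * r rows);
        # rows without a partner pass through exactly once.
        total += l * r if l and r else l + r
    return total

def duplicated_keys(joined_keys):
    """Keys that occur more than once in the join output."""
    return {k: n for k, n in Counter(joined_keys).items() if n > 1}

left_keys = [1, 2, 3, 4]
right_keys = [3, 4, 5]
print(expected_full_outer_rows(left_keys, right_keys))  # 5
print(duplicated_keys([1, 2, 3, 4, 5, 5]))              # {5: 2}
```

If the keys are unique in both inputs, any key reported by duplicated_keys points to surplus rows in the join output.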
My previous tests were done with KNIME 2.9.4 (because that's still my default). I just repeated them with 2.10.1 - but got the same results. Also, I increased the number of rows to 10M and 20M - the final join still returned the correct number of 20M rows.
I am working under Linux and have 24GB of memory - but I guess that should not make a difference.
Unfortunately, this project is too large to attach it here.
It may be worth attaching your test workflow, exported with the data in it and in the executed state; that may help the KNIME team to diagnose the problem.
Ok weskamp, first of all thank you very much for your attention.
Have you tried with a 'normal' PC? =) 24GB is really a huge amount of RAM...
I think it is quite a serious problem because, in my case, I get a wrong result without any warnings or alarms; it is dangerous for me to use the node, since I cannot rely on it.
I can confirm the problem. It has something to do with the amount of available memory. Your example workflow works fine with 1GB of heap space, but produces the wrong result with only 768MB. We will look into it. For now, you may try increasing the memory available to KNIME.
The Joiner was constructed to handle low-memory conditions by itself. However, there was a bug in the implementation that caused duplicate rows for right and full outer joins. This is fixed in 2.10.2, which will be released today.
I have seen that with the new release 2.10.2 the node works perfectly well.
The result I get is now correct, and the low-memory condition handling also works perfectly fine: the node may run more slowly, but when it finishes the result is correct.
I have 2 tables, one with about 3000 rows and the other with 8000 rows.
When I use a full outer join for these tables (on a specific column), it matters which table I connect to which input port of the Joiner. If I connect the smaller table to the left (top) port, the output has the same number of rows as the smaller table, and if I use the larger table as the left table, the output has the same number of rows as the larger table.
Is this normal? I expect to get at least as many rows as the larger table in both cases, not only when I use it as the left table. Why do the results differ when I swap the input ports in a full outer join?
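For reference, the expectation in this question can be illustrated with a small sketch of the usual full outer join semantics. This is plain Python on made-up data, not the KNIME Joiner itself, and the function name is hypothetical: every row of both inputs appears at least once, so the row count does not depend on which table is "left" and is never smaller than the larger input.

```python
def full_outer_join(left, right):
    """Full outer join of two lists of (key, value) pairs on the key."""
    rows = []
    matched_right = set()
    for lk, lv in left:
        hit = False
        for i, (rk, rv) in enumerate(right):
            if lk == rk:
                rows.append((lk, lv, rv))
                matched_right.add(i)
                hit = True
        if not hit:
            rows.append((lk, lv, None))  # left row without a partner
    for i, (rk, rv) in enumerate(right):
        if i not in matched_right:
            rows.append((rk, None, rv))  # right row without a partner
    return rows

small = [(k, "s") for k in range(3)]     # keys 0, 1, 2
large = [(k, "l") for k in range(1, 6)]  # keys 1, 2, 3, 4, 5

a = full_outer_join(small, large)
b = full_outer_join(large, small)
print(len(a), len(b))  # 6 6
```

Swapping the inputs only changes the column order of the result, not the number of rows.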