I have problem with understanding JOINER NODE and its 2 options: Left and Right Outer Join.
I have 2 datasets - A and B. A is 1000 rows and B 1000000. When I am trying to join A with B using Left Outer Join and 2 columns (in this case A is on the top - left side of the joiner and B is connected to the lower input - right side in the joiner).
As an output I get 1000 rows from A extended (joined) by columns from B. That is ok.
col1 col2 valA
X B V
Y C 7
T D 8
col1 col2 valB
X B G
Y C 8
I H 9
Output: A Left Outer Join B using col1 and col2 and match all of the following:
X B V G
Y C 7 8
T D 8 ?
Case 2 is weirder: I am doing the same, but A this time is connected to lower input and B to upper input - changing option to Right Outer Join - here the output should be the same - 1000 rows. Instead I am getting 1002 - 2 additional rows for every matched row between A and B.
G V B X
8 7 C Y
? V B X -> additional row with nulls (notice that X B V are from table A and above they have match in table B
? 7 C Y -> additional row with nulls (the same Y C 7 had match but also have null values which is weird)
Using logic these two outputs should be the same (and they are for small datasets, but for my generated 2 - 1000 and 1000000 i am getting weird outputs).
If someone is interested here is the example pictures that could be more clear then description above (result of a Right Outer Join):
h t t p : //postimage . org/ image /fkxiruei1/
Please notice how it is possible for 2 rows from table A to have match in table B and in the same moment to have null values. - Only when using Right Outer Join and big dataset.
Any help appreciated. Thanks!
Hi odin, thanks a lot for your detailed report. I agree that this behaviour is not intended, I need to check that. However, I am trying to reproduce it but I am failing with the second test case where my result has a different column order, valB, col1, col2, valA and the ouput of row 3 looks different, too: ?, T, D, 8. Can you please help to reproduce your behaviour, I have attached my test flow created with KNIME 2.6.3 (Windows 7, 64 bit).
This only works for a big datasets. Here is a workflow for You ( i have used automatic generated data). There is a description in the workflow. Hope it is helpful.
Briefly speaking: Dataset A has 20000 rows, Dataset B - 1 000 000.
Using left outer join I am getting as an output - 20000 rows (all from dataset A) + related rows from B, but when using Right Outer Join (after changing inputs) i am getting 20002 rows - 2 extra that should not be there.
After changing inputs and left to right outer join it should be the same (logically).
BTW - You will have to run it , as I can not export full workflow with data (it is to large).
Thanks a lot for providing this workflow. I had a look into it and it seems that the type StringListCollection screws up the Joiner behavior. This is hard to explain, but the DataCell-collection des not implement the interface compareTo and equals which may lead to strange problems when using those nodes on collections. I hope in your original example you also use collections? However, I tried to overcome the problem and used the Java Snippet to compute a hashCode() as int and used this to run the joining. This works and I hope this would be an solution for you?
Thanks, good point about hashCode() - btw I am using only left outer join now.
Glad that I could help.
BTW - is it going to be fixed in the next release, and what collection should I use (what do You mean by "use collections")?