JOINER NODE - Weird behavior -possible severe error

odin · November 10, 2012, 1:26pm

Hello,

I have problem with understanding JOINER NODE and its 2 options: Left and Right Outer Join.

Description:
I have 2 datasets - A and B. A is 1000 rows and B 1000000. When I am trying to join A with B using Left Outer Join and 2 columns (in this case A is on the top - left side of the joiner and B is connected to the lower input - right side in the joiner).

As an output I get 1000 rows from A extended (joined) by columns from B. That is ok.

A

col1 col2 valA

X     B      V

Y     C      7

T     D      8

B

col1 col2 valB

X     B      G

Y     C      8

I      H     9

Output: A Left Outer Join B using col1 and col2 and match all of the following:

X     B      V     G

Y     C      7     8

T     D      8     ?

Case 2 is weirder: I am doing the same, but A this time is connected to lower input and B to upper input - changing option to Right Outer Join - here the output should be the same - 1000 rows. Instead I am getting 1002 - 2 additional rows for every matched row between A and B.

G    V    B    X

8    7    C    Y

?    V    B    X     -> additional row with nulls (notice that X B V are from table A and    above they have match in table B

?    7    C    Y     -> additional row with nulls (the same Y C 7 had match but also have null values which is weird)

Using logic these two outputs should be the same (and they are for small datasets, but for my generated 2 - 1000 and 1000000 i am getting weird outputs).

If someone is interested here is the example pictures that could be more clear then description above (result of a Right Outer Join):

h t t p : //postimage . org/ image /fkxiruei1/

Please notice how it is possible for 2 rows from table A to have match in table B and in the same moment to have null values. - Only when using Right Outer Join and big dataset.

Any help appreciated. Thanks!

gabriel · November 11, 2012, 8:09pm

Hi odin, thanks a lot for your detailed report. I agree that this behaviour is not intended, I need to check that. However, I am trying to reproduce it but I am failing with the second test case where my result has a different column order, valB, col1, col2, valA and the ouput of row 3 looks different, too: ?, T, D, 8. Can you please help to reproduce your behaviour, I have attached my test flow created with KNIME 2.6.3 (Windows 7, 64 bit).

joinerbug.zip

odin · November 12, 2012, 12:13pm

Hello Gabriel,

This only works for a big datasets. Here is a workflow for You ( i have used automatic generated data). There is a description in the workflow. Hope it is helpful.

Briefly speaking: Dataset A has 20000 rows, Dataset B - 1 000 000.
Using left outer join I am getting as an output - 20000 rows (all from dataset A) + related rows from B, but when using Right Outer Join (after changing inputs) i am getting 20002 rows - 2 extra that should not be there.

After changing inputs and left to right outer join it should be the same (logically).

BTW - You will have to run it , as I can not export full workflow with data (it is to large).

Thanks

joinerbugexample.zip

gabriel · November 12, 2012, 6:17pm

Thanks a lot for providing this workflow. I had a look into it and it seems that the type StringListCollection screws up the Joiner behavior. This is hard to explain, but the DataCell-collection des not implement the interface compareTo and equals which may lead to strange problems when using those nodes on collections. I hope in your original example you also use collections? However, I tried to overcome the problem and used the Java Snippet to compute a hashCode() as int and used this to run the joining. This works and I hope this would be an solution for you?

odin · November 13, 2012, 9:25am

Thanks, good point about hashCode() - btw I am using only left outer join now.

Glad that I could help.

odin · November 21, 2012, 2:43pm

BTW - is it going to be fixed in the next release, and what collection should I use (what do You mean by "use collections")?