full outer join error

Hi,

I have two tables with many records (>5 million each) and a Joiner node in full outer join mode.

The output table is like:

row_key | key_first_table | key_second_table | value_first_table | value_second_table
row1_1  |        1        |        1         |         1         |         5
row2_?  |        2        |        ?         |         3         |         ?
row?_1  |        ?        |        1         |         ?         |         5

Why does the node extract the third row?

If I apply a filter before the join (key_first_table = 1 and key_second_table = 1), the output table has only the first row, which is correct. Could anyone explain this to me?

(The key values contain no blank characters.)
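For what it's worth, this is what I would expect a full outer join to produce on that data. A minimal pandas sketch with made-up values mirroring the table above (not the KNIME node itself):

import pandas as pd

# Hypothetical data mirroring the table above
first = pd.DataFrame({"key_first_table": [1, 2], "value_first_table": [1, 3]})
second = pd.DataFrame({"key_second_table": [1], "value_second_table": [5]})

# A full outer join keeps matched rows plus unmatched rows from each side
out = first.merge(second, left_on="key_first_table",
                  right_on="key_second_table", how="outer")
print(out)  # two rows: (1, 1, 1, 5) and (2, 3, ?, ?)

Key 1 matches exactly once, so it should appear in exactly one output row; the third row above looks like a duplicate of the already matched right-hand row.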

Thanks

Hi Darbon, 

Any chance we can see an example workflow?  Not sure I am following the problem, sorry.

I'm testing the Joiner node with full outer join and I think there is a problem.

I can build a full outer join out of three inner joins (screenshot); this method works perfectly but takes a long time. The idea is sketched below.
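Roughly, in pandas (an illustrative sketch with tiny made-up tables, not my actual KNIME workflow):

import pandas as pd

left = pd.DataFrame({"key": [1, 2], "v1": [1, 3]})
right = pd.DataFrame({"key": [1, 9], "v2": [5, 7]})

# Piece 1: the matched rows
inner = left.merge(right, on="key", how="inner")

# Pieces 2 and 3: rows whose key has no partner on the other side
left_only = left[~left["key"].isin(right["key"])]
right_only = right[~right["key"].isin(left["key"])]

# Concatenated, these pieces should equal a single full outer join
manual = pd.concat([inner, left_only, right_only], ignore_index=True)
outer = left.merge(right, on="key", how="outer")
assert len(manual) == len(outer)  # both 3 rows here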

Now I'm trying to do the same thing with the Joiner node in full outer join mode (two tables with 5 million rows each). The number of output rows is higher than with the previous method, and if I group by the join key I can see a lot of duplicated keys in the output.

The same node works fine with a small amount of input data.

I have had the same problem with the JOINER node set to 'FULL OUTER JOIN'.

The error is easily reproducible:

one 'Empty Table Creator' node to generate 1,000,000 rows (from 0, skip 1), plus one RowID node to convert the row ID into a column;

another 'Empty Table Creator' node to generate 2,000,000 rows (from 0, skip 1), plus another RowID node to do the same;

and then I join the two tables using a Joiner node set to 'FULL OUTER JOIN'.

I execute the flow and get 3,000,000 rows instead of the 2,000,000 expected (see the sketch below).

When I set the 'Empty Table Creator' nodes to generate fewer rows, the node works perfectly fine.
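To make the expected count concrete, here is the same join in pandas (a sketch of the setup above with integer keys, not the KNIME node itself):

import pandas as pd

small = pd.DataFrame({"key": range(1_000_000)})   # first table's row IDs
big = pd.DataFrame({"key": range(2_000_000)})     # second table's row IDs

# Every key of 'small' also exists in 'big', so a full outer join yields
# 1,000,000 matched rows + 1,000,000 unmatched rows from 'big'
out = small.merge(big, on="key", how="outer")
print(len(out))  # 2000000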

(Attached image)

Thanks

Hi,

has anybody else had problems with the Joiner node set to 'FULL OUTER JOIN'? In my experience it works properly only with small amounts of data, and I can't trust it!


Hi Nik,

I just tried to reproduce your issue - but in my case, I got 2,000,000 rows. Please see attached workflow.

Nils

Hi Weskamp,

thanks a lot.

I imported your project and ran it in my environment (see below). The KNIME version is 2.10, and the Joiner node still returns 3M rows.

Weskamp, have you tried running your project with many more rows?

This is my PC:

Windows 7 Professional, 32-bit

Intel i5 3.10 GHz (4 CPUs)

4 GB RAM

This is my knime.ini:

-startup
plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.win32.win32.x86_1.1.100.v20110502
-vmargs
-XX:MaxPermSize=256m
-server
-Dsun.java2d.d3d=false
-Dosgi.classloader.lock=classname
-XX:+UnlockDiagnosticVMOptions
-XX:+UnsyncloadClass
-XX:-UseSplitVerifier
-Dknime.enable.fastload=true
-Xmx512m

Hi Nik,

my previous tests were done with KNIME 2.9.4 (because that's still my default); I just repeated them with 2.10.1 and got the same results. I also increased the number of rows to 10M and 20M - the final join still returned the correct number of 20M rows.

I am working under Linux and have 24 GB of memory, but I guess that should not make a difference.

Unfortunately, this project is too large to attach it here.

Nils

Hi Nik,

It may be worth exporting your test workflow with the data in it, in the executed state, and attaching it here - that may help the KNIME team diagnose the problem.

Simon.

OK Weskamp, first of all, thank you very much for your attention.

Have you tried with a 'normal' PC? =) 24 GB is a huge amount of RAM...

I think this is quite a serious problem because, in my case, I get a wrong result without any warning or alarm; using the node this way is dangerous, and I cannot rely on it.


Sure, the workflow has been attached.

Thanks all


I can confirm the problem. It has something to do with the amount of available memory. Your example workflow works fine with 1GB of heap space, but produces the wrong result with only 768MB. We will look into it. So far you may try to increase the memory available to KNIME.
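Based on that observation, one workaround (assuming your setup matches the knime.ini posted above) is to raise the heap limit on the last line of knime.ini, for example from

-Xmx512m

to

-Xmx1024m

and then restart KNIME.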

OK Thor, running short of heap memory is always possible, but what's important for us is to be informed by a warning/alarm so that we can handle the error.

Please let us know,

thanks a lot


The Joiner node was designed to handle low-memory conditions by itself. However, there was a bug in the implementation that caused duplicate rows for right and full outer joins. This is fixed in 2.10.2, which will be released today.

Hi thor,

I have seen that with the new release 2.10.2 the node works perfectly well.

The result I get is now correct, and the low-memory handling also works fine: the node may slow down, but when it finishes the result is correct.

Thank you all


Hi All,

good to hear that the issue is fixed - thanks a lot. I guess the Joiner node is still among the most heavily used nodes.

Nils

I have two tables, one with about 3,000 rows and the other with about 8,000 rows.

When I use a full outer join on these tables (on a specific column), it matters which table is connected to which input port of the Joiner. If I connect the smaller table to the left (top) port, the output has the same number of rows as the smaller table; if I use the larger table as the left table, the output again has the same number of rows as the larger table.

Is this normal? I would expect at least as many rows as the larger table in both cases, not only when I use it as the left table. Why do the results differ when I swap the input ports in a full outer join?
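For comparison, this is how I understand a full outer join should behave. A pandas sketch with made-up tables (not my data):

import pandas as pd

a = pd.DataFrame({"k": [1, 2, 3], "va": [10, 20, 30]})     # 'small' table
b = pd.DataFrame({"k": [2, 3, 4, 5], "vb": [1, 2, 3, 4]})  # 'large' table

# A genuine full outer join has the same row count regardless of port order
print(len(a.merge(b, on="k", how="outer")))  # 5
print(len(b.merge(a, on="k", how="outer")))  # 5

# Getting exactly len(left table) instead is what a LEFT outer join produces
print(len(a.merge(b, on="k", how="left")))   # 3
print(len(b.merge(a, on="k", how="left")))   # 4

Since my row count always equals the left input, it behaves like a left outer join - am I misconfiguring something in the Joiner dialog?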