Join of 2 big tables on one column - no reliable merge-join algorithm in KNIME??

I have 2 big tables - 10 million rows each - and I would like to do an inner join on one column. If I run it in KNIME it works, but after 8 hours of runtime I gave up ... it goes quickly to 30% and then slower and slower. After 7 hours it was stuck at 69%, after 8 hours at 70% ... no patience to wait any longer ...

I remember when we were implementing such big joins in SAP - in the Analysis Process Designer in SAP BW (very similar to what KNIME does) 13 years back - we used a "merge join" algorithm based on merge-sort logic: assuming you pre-sort both inputs, you can do reliable inner/outer joins that do not fit into memory. Truth be told, it used a lot of DB engine power to accommodate this (the pre-sort part), but it was possible and it was only a question of time to get the result. Apart from the pre-sorting of the input tables, the join itself was a more or less linear problem: you have two sorted inputs of lengths M and N and you need to emit the records that match the join definition - roughly M+N complexity.
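Just to make that concrete, here is a minimal Java sketch of the sorted-merge inner join idea (purely illustrative - the Row type, field names and in-memory lists are assumptions for the example; a real out-of-core implementation would stream the two sorted inputs from disk instead of holding them in lists):

    import java.util.*;

    // Minimal sorted-merge inner join sketch. Assumes both inputs are already
    // sorted ascending by the join key; duplicate keys are allowed and produce
    // the cross product of the two matching groups, as an inner join must.
    class MergeJoinSketch {
        record Row(String key, String payload) {}

        static List<String[]> innerJoin(List<Row> left, List<Row> right) {
            List<String[]> out = new ArrayList<>();
            int i = 0, j = 0;
            while (i < left.size() && j < right.size()) {
                int cmp = left.get(i).key().compareTo(right.get(j).key());
                if (cmp < 0) {
                    i++;                      // left key is smaller, advance left
                } else if (cmp > 0) {
                    j++;                      // right key is smaller, advance right
                } else {
                    // Collect the run of equal keys on both sides ...
                    String key = left.get(i).key();
                    int iEnd = i, jEnd = j;
                    while (iEnd < left.size() && left.get(iEnd).key().equals(key)) iEnd++;
                    while (jEnd < right.size() && right.get(jEnd).key().equals(key)) jEnd++;
                    // ... and emit their cross product.
                    for (int a = i; a < iEnd; a++)
                        for (int b = j; b < jEnd; b++)
                            out.add(new String[]{key, left.get(a).payload(), right.get(b).payload()});
                    i = iEnd;
                    j = jEnd;
                }
            }
            return out;
        }
    }

Apart from reading the two sorted inputs, the work is a single pass over both tables plus whatever output the duplicate keys force you to emit.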

I was hoping to find something similar in KNIME. The standard Joiner obviously doesn't have it. Then I found NGS->Tools->JoinSorted, which looked promising - but I am not sure what it really does.

- I added a pre-sort node to each join branch - those executed in KNIME in very reasonable time (approx. 20 mins for both of my 2x10 million row tables, while running some other ML models in parallel :)) - that's not bad ...

- but then I realized that NGS->Tools->JoinSorted (http://tech.knime.org/community/next-generationsequencing) doesn't see all the columns of my pre-sorted input tables (it actually sees only one column out of many - maybe it is a BUG??). I did execute it with that other column -> and it seemed to work, though I couldn't validate the result. The overall runtime of the whole thing (read the 2x10 million row CSVs, pre-sort each and do the JoinSorted) was under 1 hour! BUT I couldn't use it, as it joined on a different column than I needed ...

Has anyone heard of a reliable implementation of merge-join logic in KNIME? Or could the standard KNIME Joiner node use the knowledge that the source tables of the join are already pre-sorted, so that the join logic can be optimized (using merge-join logic)?

Thanx, Richard.

OK, I am not sure how well this will work for what you need.

You have the Column Appender node, which will join two pre-sorted tables, but they must contain the same number of rows.

 

If your use case doesn't have the same number of rows, you could use the Reference Row Filter node first to ensure each table contains the same key values before joining with the Column Appender node.
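As a rough illustration of what that node combination amounts to (a hedged Java sketch, not how the nodes are actually implemented - the Row type and field names are made up): keep only the rows whose key also exists in the other table, sort both sides by the key, then glue the tables together row by row. Note that this is only equivalent to an inner join when the join key is unique in both tables.

    import java.util.*;
    import java.util.stream.*;

    // Sketch of the Reference Row Filter + Column Appender workaround:
    // filter each table to the keys present in the other, sort, then zip rows.
    class AppendJoinSketch {
        record Row(String key, String payload) {}

        static List<String[]> appendJoin(List<Row> left, List<Row> right) {
            Set<String> leftKeys  = left.stream().map(Row::key).collect(Collectors.toSet());
            Set<String> rightKeys = right.stream().map(Row::key).collect(Collectors.toSet());
            List<Row> l = left.stream().filter(r -> rightKeys.contains(r.key()))
                              .sorted(Comparator.comparing(Row::key)).toList();
            List<Row> r = right.stream().filter(x -> leftKeys.contains(x.key()))
                               .sorted(Comparator.comparing(Row::key)).toList();
            List<String[]> out = new ArrayList<>();
            for (int i = 0; i < l.size(); i++)     // both lists have equal length if keys are unique
                out.add(new String[]{l.get(i).key(), l.get(i).payload(), r.get(i).payload()});
            return out;
        }
    }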

I don't have big data like this, so I cannot say whether this will be quicker for you.

simon.

A second option, since your tables are pre-sorted, is to take one of the tables as it is and connect it to the top port of the Joiner node.

Then, for the second table, pass it through a Chunk Loop Start node set to, say, 50,000 rows, connect that to the second port of the Joiner node, and finish with a Loop End node.

Hopefully that would reduce memory use.
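As a rough Java sketch of the chunking idea (illustrative only - the Row type, field names and the per-chunk hash lookup are assumptions; the Joiner node may work differently internally): the left table stays available in full, while the right table is processed in bounded slices so that only one slice plus its join output is in flight at a time.

    import java.util.*;

    // Sketch of the Chunk Loop approach: join the right table slice by slice
    // against the left table and concatenate the partial results.
    class ChunkedJoinSketch {
        record Row(String key, String payload) {}

        static List<String[]> chunkedJoin(List<Row> left, List<Row> right, int chunkSize) {
            // Index the left table by key once (here: a simple hash index).
            Map<String, List<Row>> leftIndex = new HashMap<>();
            for (Row r : left) leftIndex.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);

            List<String[]> out = new ArrayList<>();
            for (int start = 0; start < right.size(); start += chunkSize) {
                List<Row> chunk = right.subList(start, Math.min(start + chunkSize, right.size()));
                for (Row rr : chunk)
                    for (Row lr : leftIndex.getOrDefault(rr.key(), List.of()))
                        out.add(new String[]{rr.key(), lr.payload(), rr.payload()});
            }
            return out;
        }
    }

Note that the left table is still held in full (just as it feeds the top Joiner port directly in the workflow), so the saving is mainly on the right-hand side and on the intermediate results per iteration.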

simon

 

Oh, and in your initial workflow, in the Joiner node, go to the memory policy and set it to write tables to disk.

This may prevent it from grinding to a crawl.

Also, in the node you could tinker with the number-of-open-files setting. Reducing it may also save memory.

Finally, there is the option to increase the amount of heap space, i.e. the memory KNIME is allowed to use. I think the default is 1 GB, so increasing that could help. Instructions on how to do it are in the FAQ.
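For reference, the heap limit is the -Xmx entry in the knime.ini file in the KNIME installation folder; for example, allowing roughly 4 GB would look something like

    -Xmx4096m

(the exact default value and the file location can differ between KNIME versions, so check the FAQ for your installation).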

simon.

Hi Richard,

I'm pretty sure that the Joiner node actually uses a merge-join type of algorithm. Nevertheless, memory management within Java has its limitations and KNIME suffers from this - you have just observed an example. Maybe one of the people from KNIME.COM can have a look at your dataset and check what happens.

I agree with Simon: You should check whether KNIME is actually using all available memory on your machine (which it does not by default), play around with the memory policies of the respective nodes and also check whether you actually need all the columns in your input tables.

If you have access to a database, it might also be a good idea to move the two tables there and do the join in the DB.

I don't think that the NGS nodes will help you since they are intended for specific Bioinformatics applications and not necessarily for general purposes.

Regards,

Nils

Pretty sure that it is actually the writing of data to disk that is making the whole thing grind to a halt.

I had a similar problem with a pivot operation on 70M rows, until I found that there was some data about halfway through that generated extra columns. These extra columns then had to be added to all previous rows - on disk - an operation that is not in any way efficient.

For me, then, the solution was to make sure that all data was filtered and sanitized, nothing was null, and in the first row all (dummy!) strings were of a length that could accommodate everything that was to come.

@Richards99:

1) Column Appender node - this is an odd solution for a very special case ... plus the extra work in case the tables are not the same length. But in general it could be a useful workaround, and I tested the 10 million rows with the Column Appender - it worked fine, linear complexity, so no issues.

2) I did try the Chunk Loop Start - Loop End on the 2nd table of the join - it ran and then failed due to some memory issues ... I might retry with Xmx4GB in knime.ini ... we will see ...

3) I did try setting it to write everything to disk -> it ran sooo long (10+ hours) that I didn't have the patience to wait.

@weskamp: If it uses merge-join, then it is a very inefficient implementation ... how else to explain that I can sort both 10 million row tables in a matter of minutes (using regular Sorter nodes), and then the join (inner, btw, on one key) cannot be processed in under 8 hours ...

It looks like there is a "memory leak" of some kind which gets worse when doing many iterations ... causing the GC to start "thrashing" after many iterations ... as observed as well by @Ellert van Koperen.

Now running the Joiner node in default mode with Xmx=4GB - we will see. The next step is to give it all my memory (16GB), and the 3rd step is to sit down and rewrite the Joiner node :)

I do understand that in general I can further optimize the dataflow by removing unnecessary data being processed - but that is not the scenario here: I need all the columns, and the point is not to optimize a particular use case but to see whether a big join is possible at all.

@All - anyway, I appreciate all your comments! Good discussion.

One additional point just came to my mind: Do you have (many) missing values in your tables - particularly in the joined columns? I have the feeling that missing values get processed via exception handling, which can also slow things down dramatically.

No missing values in the join column. No missing values in general.

Update:

2) I did try the Chunk Loop Start - Loop End on the 2nd table of the join - with Xmx12GB in knime.ini ...

It finished successfully - with a runtime of approx. 20 hours (not a fair assessment - I was suspending my notebook during that time and also running other intensive calculations in KNIME and outside KNIME in parallel). But my guess -> you won't get it under 10 hours on my HW/SW configuration using this approach.

Input: 2 tables - 10 million rows each, CSV size 2x 2.3 GB

Output: 280 million rows (yes, I have some duplicate keys which produce additional join records - a key occurring m times in one table and n times in the other emits m*n output rows).

Well, if your output is significantly larger than your input, this would at least explain the differences between Sort and Join. Would be interesting to see how (say) MySQL would perform the same task on the same hardware.

There is one huge difference between a database join and KNIME's Joiner node: KNIME keeps the order of the rows, which databases usually don't (depending on which join operator is used internally by the query optimizer). Not having to preserve row order allows for much faster join operations, especially on large datasets.

Having said that, I agree that >8 hours for 10 million rows is not acceptable. We will check if this can be optimized somehow.

Update 2:

- the regular Joiner node with Xmx12GB set in knime.ini finished overnight (using 1-2 cores of my CPU). I cannot get an exact runtime: I started at noon and found it finished at 7:30am, so the worst-case estimate would be 19-20 hours - which is similar to the Chunk Loop join solution (chunk size = 100,000 rows, both with default memory policy settings). That would be a slight disappointment for me - I would have expected the Chunk Loop to be significantly slower but able to handle larger datasets. But as I cannot get a close estimate of the runtime, besides the fact that it finished (both with the same number of rows :) ) I can't conclude anything else.

Is there any way to check the exact runtime in the logs? I tried View->Open KNIME log but didn't find any info (though I have the KNIME file log level set to ERROR - that might be why).

@Thor: I agree that a DB join and the KNIME join are different - but I am not trying to compare KNIME to a DB. I wonder whether, after 2x SORT of 10 million rows (runtime up to 20 mins), the merge-join logic emitting 280 million join-definition-compatible records really needs to consume so much memory and (maybe because of this memory inefficiency) take so long. I do understand that the issue might not be in the KNIME join algorithm itself but maybe in some inefficiency in the file access logic (as mentioned by Ellert van Koperen) - that could explain a few things.