Huge Joiner node problem!

ImNotGoodSry · June 8, 2012, 12:43pm

I've got a question regarding the Joiner node.

I created a workflow opening two CSV files. One CSV file has got 3 columns and 1270 rows, the other has got 6 columns and 0.5 million rows. I want to join them by one column (inner join). No matter what I do, the Joiner node takes like half a minute for the first 60% and maybe a year for the rest.

The system's processor has 8 cores (8 x 2.0 GHz) and 16 GB of memory. I tried all the tricks from the KNIME FAQ (changing the knime.ini file), but nothing changes.

Can you help me?

wiswedel · June 8, 2012, 10:00pm

Hi YoureNotGoodSry,

For such small data sets a join should be no problem (even if you give it less than 1G RAM). Can you give more information on the data (or the data itself)? What version of KNIME are you using (there was a memory leak in a previous release)? How does the data overlap?

I can create an FTP account for you in case the files are too large for a forum/mail attachment. Just let me know.

Thanks,
Bernd

PS: Also have a look at the "Cell Replacer" node, which can be used instead of the Joiner if one of the tables is only a dictionary table. (But please help us reproduce the Joiner problem even if the Cell Replacer works for you.)
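
For readers unfamiliar with the idea, a dictionary-style replacement boils down to one hash lookup per row of the large table instead of a full join. The following is only a rough Python sketch of that idea, not how the Cell Replacer node is actually implemented; the function name `replace_cells`, the `default` parameter, and the sample data are all made up for illustration.

```python
def replace_cells(rows, column, dictionary, default=None):
    """Return copies of `rows` where `column` is replaced via `dictionary`.

    Hypothetical helper for illustration only; values without an entry
    in the dictionary are replaced by `default`.
    """
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input table stays untouched
        new_row[column] = dictionary.get(row[column], default)
        out.append(new_row)
    return out


# Small "dictionary" table: key -> replacement value (invented data).
lookup = {"A": "alpha", "B": "beta"}

# Stand-in for the large table (invented data).
big_table = [
    {"id": 1, "code": "A"},
    {"id": 2, "code": "B"},
    {"id": 3, "code": "C"},
]

result = replace_cells(big_table, "code", lookup)
```

Note that, unlike an inner join, this sketch keeps every row of the large table and only marks unmatched values with the default.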

richards99 · June 8, 2012, 11:03pm

The other temporary measure is changing the memory control settings to write to disk in the Joiner node settings under the memory tab. See if this helps.
Simon.

andrewma · June 11, 2012, 2:03pm

I have also had trouble with workflows hanging, especially the Joiner node after I have done a bunch of processing. Quitting KNIME and restarting appears to free up memory so it works again.

ImNotGoodSry · June 13, 2012, 9:24am

Hey,

After updating to KNIME version 2.5.4, the node runs quickly to the 60% mark and then slowly moves forward. It takes maybe 30 minutes until the 100% mark is reached. This is a huge step forward!

I don't know exactly which version I was using when starting this thread. I think it was 2.5.0!

wiswedel · June 13, 2012, 1:13pm

Great. I just looked it up. We fixed a memory leak in v2.5.2 (first item in the changelog, see http://tech.knime.org/changelog-v252).

wiswedel · June 13, 2012, 3:29pm

One more question: have you changed the memory settings in the knime.ini file? Half an hour seems long for such small data. The Joiner (and GroupBy, Pivot, ... everything that requires partial sorting of the data) really benefits from more memory assigned to the Java process. The default is 512M, which is little given that you have 16G in your machine.

ImNotGoodSry · June 14, 2012, 9:04am

I changed the settings as follows:

-Xmx2048m
-XX:MaxPermSize=1024m

Half an hour might be an exaggeration. I just wanted to point out that it is in no way comparable with the first 60%.

Thank you for your help!

andyg · June 15, 2012, 12:21am

I have found that I would sometimes get a similar problem, ImNotGoodSry. The first 50% would go really fast, then the last 50% really slow. I found that often this was caused by having a lot of "null" fields in the join. Usually blanking the string would get around this problem.

Cheers

wiswedel · June 15, 2012, 1:10pm

Yes, that could be because null values (= missing values) don't match anything and there is additional bookkeeping required (so a missing value also does not match a missing value).
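
To make the matching rule concrete, here is a small Python sketch of an inner join in which a missing key never matches anything, not even another missing key. It is only an illustration of the semantics described above, assuming a simple in-memory hash join; it says nothing about how the Joiner node is implemented internally, and the function name `inner_join` and the sample rows are invented.

```python
from collections import defaultdict


def inner_join(left, right, key):
    """Inner join of two lists of dicts on `key`.

    Rows whose key is None (standing in for a missing value) are
    skipped on both sides, so missing never matches missing.
    """
    # Build a hash index over the (smaller) left table.
    index = defaultdict(list)
    for row in left:
        if row[key] is not None:
            index[row[key]].append(row)

    joined = []
    for r in right:
        k = r[key]
        if k is None:
            continue  # a missing key matches nothing
        for l in index.get(k, []):
            joined.append({**l, **r})
    return joined


# Invented sample data: both tables contain one row with a missing key.
small = [{"k": "a", "x": 1}, {"k": None, "x": 2}]
large = [{"k": "a", "y": 10}, {"k": None, "y": 20}, {"k": "b", "y": 30}]

rows = inner_join(small, large, "k")
```

In this sketch only the rows with key "a" are joined; the two rows with missing keys drop out of the inner join instead of pairing with each other.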