I’m trying to join some stuff. One of the Joins is 2 ~40k rows tables with multiple columns. I have an Intel i5-2400 and 24Gb of RAM at 1333mhz, but it still stops at 55% and stays there (I’ve never seen it complete, just tried yesterday, no errors in console so far). There has been like 1 hour and it’s still processing.
I need it way more speedy and I can’t filter more stuff.
Knime seems to use about 9Gb of the 16Gb I allowed in heap space. CPU is running between 50% and 100%, mostly towards 50%. So, where is the bottleneck really?
I have another Workstation with two Xeon X5687 and 32Gb RAM at 1333mhz. It has currently no OS, so I wonder if It’s worth it to try to install Knime on it or if I wouldn’t notice much difference.
what is the selectivity of your join? In the absolute worst case you are doing a cross join and end up with around 1.6 billion rows. As far as I know KNIME does a simple nested loop join, and in that case it will try out all of the 1.6 billion options. That takes a while and obviously taxes the CPU. Depending on the selectivity of the join, RAM might also become an issue. Could you partition your tables in a way that you can get the same result with two or more joins in parallel? If you could split each table into 2 parts, each joiner would only have to do 20.000x20.000=400.000.000 loops, so 0.8 billion in total, half of what you had before. But of course you have to be careful with the partitioning. If you split table A into A1 and A2 and table B in B1 and B2, then join A1 with B1 and A2 with B2, the partitioning must guarantee that there are no matches between A1 and B2 and A2 and B1.
Mmm, I don’t think I can partition my tables more.
In any case, to make it faster, more cores or higher clock speed?
More power should always help. You could try to write everything to the disc.
Then you could try a sorted join where you would have to pre-sort you data and then just align it (would have to be careful if matches are not exact, might have to check for errors and completion)
Another possibility would be to use the local big data environment and Hive. Hive is known to be slow at joins because it would do everything on hard drive. Might be slow but could work in the end; since this is only a local installation I am not sure if it is suited for such task but it might be worth a try.
In my experience, I’ve found additional RAM, not CPU core count or speed makes the biggest difference in joining very large tables together. I do wish KNIME was faster in this regard.
Well what are you actually joining? Have you read and understood AlexanderFillbrunns comment especially regarding the possible cross join worst case scenario? if that is the case, you can’t do anything but fix our data or workflow or wait. And given how long it takes, it probably is doing a cross join because else it’s never that slow even with much more rows.
This what it looks like:
Yesterday it didn’t finish for several hours. It instead filled my HD with temp files.
please check your right table if Name is unique and not empty. Normally a inner join on a table with 40k rows does not make trouble. Therefore i assume inconsistency in your table.
Name table has no uniques, the same string is repeated over several rows. Name is just an username column where the Joiner has to match and aggreate a column with User IDs (Integers).
I’d say it’s pretty staightforward. Yet something happens and cant finish.
As assumed you do a n:m joining which increases your result table and slowing down your joiner node.
Please give it a try then you first aggregate your right table with groupby node to get unique records an then join the aggregated result table to your base table and see what happens.
You can try Cross Joiner and Rule-based Row Filter node in streaming mode
It’s not only slowing down the node, it’s just halting without error and filling the hard drive with temp files.
I’l just drop Knime for now, I spend more time debugging Knime that the time I’ll spend writing it in python and sql. I just want a simple Join operation.
It’s understandable. If you could do join in DB, do it in DB KNIME still could be useful for statistics, fuzzy joins or such.
When you want to perform a job between two tables, make sure at least one of them has unique keys on which you join. Otherwise you multiply each non-unique record in the first table by the number of matches in the second. Or if t1.name1 is repeated 3 times and t2.name2 is repeated 4 times, you will receive 12 records.
Make a sample, trace 5-10 records and see what happens with them in the process.
This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.