I have a problem with a Joiner node.
Table1 consists of 847k rows and 47 columns
Table2 consists of 1200 rows and 9 columns
The task is a simple left outer join that adds a column from Table2 to Table1. Table2 has already been cleaned of duplicates.
The join takes far too long: after 3 days the node had only reached 52%. So far I have never had problems, even with larger joins. As an experiment I removed all columns from Table1 except the joining column, but unfortunately that did not help either. Changing the “Maximum number of open files” setting also brought no noticeable performance improvement.
Does anyone have an idea how I can speed up the whole thing?
Welcome to the KNIME forum!
Strange. Did you try the new Joiner (Labs)? It should be faster.
Besides this, could you please tell us what you are using as the joining column? For instance, the column type (number, string, or another special type such as molecule, protein, image, document, etc.)? Perhaps share some data in a minimal workflow? Thanks.
Hope this helps.
The Joiner (Labs) works as expected; I did not realize there were such differences. Thank you very much, I can continue working with it.
But apart from that, I would still be interested to know why the previous Joiner does not work here.
The joining column of Table1 looks like this:
String values (10745, 10P45, 0001A)
The joining column of Table2:
String values (10745, 1045P, 0001A)
Column to join from Table2:
String values (Tire, Engine, Rest)
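For readers less familiar with joins, the intended operation can be sketched in pandas. This is a minimal illustration only, with hypothetical column names (`PartCode`, `Category`) standing in for the real ones; note that `10P45` and `1045P` are different strings, so that row would come back with a missing value:

```python
import pandas as pd

# Hypothetical stand-ins for the two KNIME tables; column names are assumed
table1 = pd.DataFrame({"PartCode": ["10745", "10P45", "0001A"]})
table2 = pd.DataFrame({"PartCode": ["10745", "1045P", "0001A"],
                       "Category": ["Tire", "Engine", "Rest"]})

# Left outer join: keep every row of table1, add Category where the keys match
result = table1.merge(table2, on="PartCode", how="left")
# "10P45" has no partner in table2 ("1045P" differs), so its Category is missing
```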
Welcome to the KNIME Community, and glad the new Joiner works better.
It’s hard to tell without seeing the data and the workflow itself. Do you maybe have long RowIDs? I remember seeing this as an issue when joining. And if you can share your workflow, someone can check it.
The longest RowID is Row847135. If I find time, I will try to anonymize the data to share the WF.
But thanks for your help. Great forum!
Then it’s not the RowID issue. OK.
Glad it helped, and thanks for validating the answer!
@ipazin’s hint is one example of why the Joiner node can sometimes be slow. I would also mention, for instance:
- Joining on non-standard column types, such as chemo- or bio-informatics data.
- Joining on data that is by nature too big to serve as a key, images for instance.
- Joining on non-integer numerical data (i.e. with decimals), because floating-point values may not match exactly.
- Joining without first checking that the result is tractable in memory, for instance producing M x N rows, where M and N are the row counts of two huge tables.
- Joining when missing values are present in the joining columns.
Just to cite a few. Having said this, there are workarounds that solve some of these problems, or at least detect them before attempting the join.
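The M x N blow-up in particular can be estimated before running the join. A minimal sketch, assuming plain string keys and toy data: matched keys contribute the product of their per-side counts, while in a left outer join unmatched left keys still appear once each.

```python
import pandas as pd

# Toy tables with the key "A" duplicated on both sides
table1 = pd.DataFrame({"key": ["A", "A", "B", "C"]})
table2 = pd.DataFrame({"key": ["A", "A", "B"], "value": [1, 2, 3]})

left_counts = table1["key"].value_counts()
right_counts = table2["key"].value_counts()

# Matched keys contribute count_left * count_right rows (duplicates multiply);
# unmatched left keys still appear once each in a left outer join.
inner_rows = int((left_counts * right_counts).dropna().sum())
unmatched_left = int(left_counts[~left_counts.index.isin(right_counts.index)].sum())
predicted_rows = inner_rows + unmatched_left

actual_rows = len(table1.merge(table2, on="key", how="left"))
# predicted_rows and actual_rows agree: 2*2 + 1*1 matched rows, plus 1 for "C"
```

If the predicted row count is far larger than either input, the join key is not as unique as expected, which is worth fixing before pressing Execute.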
Looking forward to the anonymized WF so we can help further if possible.
All the best,
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.