Joiner with many rows

Hello darlings!

I wanted to share a recurring problem in KNIME, and ask how you think about solving it. Example: when I join two tables that have two columns in common, the number of rows increases significantly. A simple Duplicate Row Filter is not ideal for this kind of treatment because it eliminates many rows I need to keep. I know that a GroupBy can often be viable; what else do you suggest or use in this case?

Thanks!

@Cairo first you might want to investigate why you have so many (unwanted?) duplicate rows. More often than not the question is not one of technology but of concept:

https://hub.knime.com/-/spaces/-/latest/~kyA_KJ2QUUgI7g61/


Hi @Cairo , I would agree with @mlauber71 's comments about duplication in general.

Additionally, there is no “generic” solution to the problem, unless you can tell us what outcome you are expecting.

If I have duplication of “code” in two tables and I want to join on “code”, then obviously the duplication is going to multiply up the rows, as the resultant row count is the product of the occurrences of “code” in the two tables. I.e. if the same “code” appears twice in one table and three times in the other, then the resultant join will be 6 rows (2 × 3).
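A minimal sketch in plain Python (outside KNIME, hypothetical column names, just to illustrate the multiplication) — an inner join combines every matching pair, so duplicated keys multiply:

```python
# "A" appears twice on the left and three times on the right.
left  = [{"code": "A", "left_val": v}  for v in ("L1", "L2")]
right = [{"code": "A", "right_val": v} for v in ("R1", "R2", "R3")]

# A naive inner join: every left row pairs with every matching right row.
joined = [
    {**l, **r}
    for l in left
    for r in right
    if l["code"] == r["code"]
]

print(len(joined))  # 2 matches x 3 matches = 6 rows
```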

If you don’t want this “multiplication through duplication”, then you have to decide what you need to do to avoid it.

Now it could be that the duplicates are erroneous, in which case you need to correct the data at source, or you need to de-duplicate.

If the duplicates are not erroneous, and there is other data associated with the different rows that you wish to keep, then this suggests that you are missing a crucial additional “key” that would make each row of your data unique (the primary key). Only when joining values that are unique in at least one of the two joining tables can you expect your join not to increase the resultant row count.

So, your choices are generally the following:

  1. de-duplicate or somehow filter your input data to ensure that a “unique” subset of at least one of your tables is presented to the join.
  2. find additional key columns in one of your tables that, when combined, will uniquely identify a row (the “primary key”), and use all of those columns in the join. If one or more of those columns is not present in the other table, then the data in that table is deficient: you need to take steps to add the additional “key” columns to the other table, which may mean going back to the source of the data to get it corrected, or else you have to return to option 1.
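Both options can be sketched in plain Python (hypothetical column names, just to illustrate the idea — in KNIME the equivalents would be the Duplicate Row Filter for option 1 and adding extra join columns in the Joiner for option 2):

```python
# Option 1: de-duplicate one side so the key is unique in at least one
# table, and the join can no longer multiply rows.
left = [
    {"code": "A", "left_val": "L1"},
    {"code": "A", "left_val": "L2"},  # duplicate key "A"
    {"code": "B", "left_val": "L3"},
]
right = [{"code": "A", "right_val": "R1"}, {"code": "B", "right_val": "R2"}]

# Keep only the first row per key ("keep first" semantics).
seen = {}
for row in left:
    seen.setdefault(row["code"], row)
deduped_left = list(seen.values())

joined = [{**l, **r} for l in deduped_left for r in right if l["code"] == r["code"]]
print(len(joined))  # one row per key: 2

# Option 2: join on a composite key ("code" + "seq") so the combination
# is unique, and each row matches exactly one row on the other side.
left2 = [
    {"code": "A", "seq": 1, "left_val": "L1"},
    {"code": "A", "seq": 2, "left_val": "L2"},
]
right2 = [
    {"code": "A", "seq": 1, "right_val": "R1"},
    {"code": "A", "seq": 2, "right_val": "R2"},
]
joined2 = [
    {**l, **r}
    for l in left2
    for r in right2
    if (l["code"], l["seq"]) == (r["code"], r["seq"])
]
print(len(joined2))  # 2 rows, not 4
```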

Ultimately though, as said at the beginning, you need to decide what result you want. KNIME cannot decide that for you. You cannot just say “I don’t want to remove duplicates but I don’t want to see the duplicates either”, which is currently the situation you appear to find yourself in.


@mlauber71 and @takbb have already pointed out the most important things. The only thing I'd like to add is that, besides the GroupBy node, in certain situations a join might not even be required: a Sorter followed by a Column Appender can give the same result. (Please be aware that this highly depends on your use case, which I don't know.)
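A rough plain-Python sketch of the Sorter + Column Appender idea, assuming both tables have exactly one row per key and the same row count (that assumption is exactly why it is use-case dependent):

```python
# Sort both tables the same way, then append columns by position
# instead of joining on a key.
left  = [{"code": "B", "left_val": "L2"}, {"code": "A", "left_val": "L1"}]
right = [{"code": "A", "right_val": "R1"}, {"code": "B", "right_val": "R2"}]

left_sorted  = sorted(left,  key=lambda r: r["code"])
right_sorted = sorted(right, key=lambda r: r["code"])

# Positional append: pair the i-th row of each sorted table.
appended = [{**l, **r} for l, r in zip(left_sorted, right_sorted)]
print([row["code"] for row in appended])  # ['A', 'B']
```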
br

Absolutely, thanks for all the answers. It's a recurring problem, but one with several possible solutions.

Great text, Lauber! Thanks for helping the community


I would add that the columns you bring into the group are also fundamental here. For example, if you aggregated a column into a list and then had to use an Ungroup node to get back to individual rows, the GroupBy might have been pointless. Bringing in the right columns can definitely solve this problem.

Hi guys,

I saw that this topic was solved, but if I can add a tip or 2… lol

In the Joiner node you can specify one or more matching fields… yep, distinct clauses are great and you can GroupBy beforehand too. BUT you can also choose whether rows must match on all of the columns or on any of them, pass one or more pairs of matching columns, use expressions… some interesting options that can add insights too. I also thought about adding a flag column to both tables to control the matching… well, try, try, and try harder to solve some problems… lol again…

That's it, folks,

BR, Denis

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.