@Cairo first you might want to investigate why you have so many (unwanted?) duplicate rows. More often than not the question is not one of technology but of concept:
Dealing with duplicates is a constant theme for data scientists, and a lot of things can go wrong. The easiest way to deal with them is SQL’s GROUP BY or DISTINCT: just get rid of them and be done. But as this example might demonstrate, that is not always the best option. Even if your data provider swears your combined IDs are unique, there might still be some muddy duplicates lurking, especially in Big Data scenarios, and you should still be able to deal with them.
https://hub.knime.com/-/spaces/-/latest/~kyA_KJ2QUUgI7g61/
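To make that concrete, here is a minimal SQL sketch (the table "orders" and its columns are made up for illustration): first check whether your supposedly unique key really is unique, then decide how to collapse the duplicates instead of dropping rows blindly.

```sql
-- Hypothetical table "orders" with a supposedly unique combined key (customer_id, order_id)

-- 1) Check: which key combinations occur more than once?
SELECT customer_id, order_id, COUNT(*) AS n_rows
FROM   orders
GROUP  BY customer_id, order_id
HAVING COUNT(*) > 1;

-- 2a) Quick fix: keep one row per fully identical record
SELECT DISTINCT *
FROM   orders;

-- 2b) Controlled deduplication: decide per column how to aggregate
--     (e.g. latest date, summed amount) instead of silently losing information
SELECT customer_id,
       order_id,
       MAX(order_date) AS order_date,
       SUM(amount)     AS amount
FROM   orders
GROUP  BY customer_id, order_id;
```

The KNIME GroupBy node does essentially the same thing as 2b: you pick the key columns in the Groups tab and choose an aggregation method (First, Maximum, Sum, …) for every remaining column in the aggregation settings.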
I know it has been asked before and the reply is the GroupBy node, but I have no idea how to do it. Could anyone please share a workflow or at least tell me the settings in the configuration?