i have some problem with 2 tables and a JOIN.
in table A (Left), I have different lines like this: (all are uniques)
In a table B (right) i have lines like this: (lines are not uniques, sometimes are duplicated)
When i join together the lines of table A increase for how much rows find the record in a table B.
I fix this with “Duplicate row filter” before to join the table B, but i this i should have a different solutino, right?
How can i join only with only 1 row of table B?
let me try to help.
What if you use a group by node on table B before the join, so that you won’t have repetitions in B?
With group by you can aggregate column 2B using sum, average, etc. or just selecting the first value.
The join should result as you asked.
Thanks for you suggest!
“Group By” is an alternative solution of “remove duplicate”.
I didn’t know this node (thanks!), but i think is not the best solution.
In my mind JOIN node should have the option i think is missing: match only 1 time
If you think it can be an useful feature, you can suggest it in this part of the forum:
I don’t think that it is currently available.
The joiner node performs the standard “relational” joins and it’s not common to have a
join-on-first option but I can understand there may be an occasional need to do this.
Whilst there is no direct option in the joiner to do it, you can replicate this behaviour without using the duplication removal by “ranking” the rows within the keys that you wish to join on.
For the table that is the “lookup table” (Table A in your example), you just assign every row a “rank” of 1.
For the other table, you can assign each row a rank grouped within “key”
I added an extra column to your Table B containing to demonstrate this.
After the ranking node, each row is ranked as follows:
In the joiner, you include the “rank” column in the join. Since your lookup table only has 1 as its ranks, only the first row will match for each key. In the joiner we elect to remove the rank nodes as part of the join. Here I have elected to include the (right) unmatched rows too:
and so you then get match-on-first
or if you only include matched rows you get the same output as if you’d removed duplicates:
Join on first match only.knwf (13.7 KB)
Personally I would generally do duplicate removal as you already suggested. I know it’s still extra steps, but I hope that helps if duplication removal is for some reason otherwise causing problems.
Thanks a lot for your reply and approach!
I could use your solution waiting for a join-on-first (or last) option in Join Node.
Have a nice day!
@gcas with a lot of databases you can use Window functions like RANK to have more control over which record gets selected at the end. You might have to have a criterion to determine which row would get choosen. I have discussed options here
Dealing with duplicates is a constant theme with data scientist. And a lot of things can go wrong. The easiest ways to deal with them is SQL’s GROUP BY or DISTINCT. Just get rid of them and be done. But as this examples might demonstrate this might not always be the best option. Even if your data provider swears your combined IDs are unique especially in Big Data scenarios there might still be lurking some muddy duplicates and you should still be able to deal with them.
And you should be able t…
An example whith H2 database is here
@chanoufi_marwa one way to do it is to use a Hive environment and a RANK function. Other SQL databases should also support RANK like H2 ( Window Functions).
Hint: the KNIME implementation of H2 currently has the version 1.4.196. Unfortunately with H2 the RANK etc functions are only supported with version 1.4.198 or later … @ScottF might be an interesting case to update that version
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.