@gcas with a lot of databases you can use window functions like RANK to get more control over which record is selected in the end. You may need a criterion that determines which row gets chosen. I have discussed options here
Dealing with duplicates is a constant theme for data scientists, and a lot of things can go wrong. The easiest way to deal with them is SQL's GROUP BY or DISTINCT: just get rid of them and be done. But as these examples demonstrate, this might not always be the best option. Even if your data provider swears your combined IDs are unique, especially in Big Data scenarios there might still be some muddy duplicates lurking, and you should still be able to deal with them.
An example with an H2 database is here
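To illustrate the point about DISTINCT not always being enough, here is a minimal sketch using Python's built-in `sqlite3` module instead of H2 or Hive (that substitution is my assumption, purely so the example runs anywhere). The `customers` table and its columns are hypothetical. DISTINCT removes the exact duplicate for ID 1, but ID 2 still appears twice because the rows differ in content.

```python
import sqlite3

# Hypothetical example table; names and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Alice", "2020-01-01"),
        (1, "Alice", "2020-01-01"),   # exact duplicate: DISTINCT removes it
        (2, "Bob", "2020-01-01"),
        (2, "Bob B.", "2020-03-01"),  # same ID, different content: DISTINCT keeps both
        (3, "Carol", "2020-02-01"),
    ],
)

# DISTINCT only collapses rows that are identical in every selected column.
distinct_rows = conn.execute(
    "SELECT DISTINCT id, name, updated FROM customers ORDER BY id, updated"
).fetchall()

# IDs that are still not unique after DISTINCT.
duplicate_ids = conn.execute(
    """
    SELECT id, COUNT(*) AS n
    FROM (SELECT DISTINCT id, name, updated FROM customers) AS d
    GROUP BY id
    HAVING COUNT(*) > 1
    """
).fetchall()

print(distinct_rows)
print(duplicate_ids)
```

A check like the `GROUP BY ... HAVING COUNT(*) > 1` query is a cheap way to verify a provider's claim that the IDs are unique before relying on it.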
@chanoufi_marwa one way to do it is to use a Hive environment and the RANK function. Other SQL databases such as H2 should also support RANK (Window Functions).
Hint: the KNIME implementation of H2 is currently at version 1.4.196. Unfortunately, H2 only supports RANK and similar functions from version 1.4.198 onward … @ScottF this might be an interesting case for updating that version
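A rough sketch of the RANK approach, again run through Python's `sqlite3` rather than Hive or H2 (my assumption, for portability; window functions need SQLite 3.25 or later). The table, the columns, and the "keep the most recently updated row per ID" criterion are all hypothetical; the same SQL pattern should carry over to other databases that support window functions.

```python
import sqlite3

# Hypothetical example table; the "latest row wins" rule is an assumed criterion.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Alice", "2020-01-01"),
        (2, "Bob", "2020-01-01"),
        (2, "Bob B.", "2020-03-01"),
        (3, "Carol", "2020-02-01"),
        (3, "Carol C.", "2020-01-15"),
    ],
)

# Rank the rows within each ID, newest first, then keep only rank 1.
latest_rows = conn.execute(
    """
    SELECT id, name, updated
    FROM (
        SELECT id, name, updated,
               RANK() OVER (PARTITION BY id ORDER BY updated DESC) AS rnk
        FROM customers
    ) AS ranked
    WHERE rnk = 1
    ORDER BY id
    """
).fetchall()

print(latest_rows)
```

Note that RANK assigns the same rank to ties, so if two rows for one ID share the same `updated` value, both survive. In that case, add a tie-breaking column to the ORDER BY or use ROW_NUMBER instead.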