Replace/Simplify row values to eliminate duplicates

mlauber71 · May 27, 2020, 9:56am

I see these points:

in deduplication you can aggregate towards a key. That means you will loose any other key that is not been aggregated to - if you want to preserve the information you will have to do something like
** aggregation functions like min, max, avg
** llists (list all the values in one cell) the group by function has such things
** separate variables that contain the information like
*** preserved_id_1, preserved_id_2
then you can choose to keep one line based on several criteria (that would contain the original structure and row integrity - like discussed in the links below). You would loose the information from the other lines
** if you want to preserve them you will have to aggregate them separately and then re-join via a preserved ID (preferably your key value)

The reason why I (somtimes) like the solution with the row_number forcing the reduction to a single unique line is that especially in very large datasets with a cascaded group by (over several fields) some duplicates might be lurking even if the providing database sweared that would not be possible (very common in Big Data environments). So if you want to be sure just eliminate a lone duplicate.

But that very much depends on your business case. If you cannot loose any information at all you will have to resort to one of the methods mentioned above.