School of duplicates - and how to deal with them

mlauber71 · May 28, 2020, 9:14am

As an addition you could use the FIRST_VALUE and LAST_VALUE function to get additional information in one step. In this demonstration the SQL server is Hive (KNIME Local Big Data) since neither H2 nor SQLite seemingly would provide such a feature.

@webstar initiated this question - thanks.

SELECT 			`t2`.* 

-- mark the Top row per ID ad the chosen one ...
, CASE WHEN 	`t2`.`rank_id` =1 THEN 'chosen'
		ELSE 'duplicate' 
		END AS `duplicate-type-classifier`
FROM 

-- start t2
( SELECT * 

-- also you could derive a FIRST_VALUE or LAST_VALUE within your group and add that to the table
, FIRST_VALUE(`t1`.`last_contact`) OVER (PARTITION BY `t1`.`ID`  ORDER BY `t1`.`last_contact` DESC) as very_last_contact

, ROW_NUMBER() OVER (
-- set which unique ID should be kept - all others will be deleted 
-- this makes sure even if there is a *complete* duplicate only one row will remain (and be intact)
	PARTITION BY `t1`.`ID` 
-- set rules which ID should remain on Top in this case the one with the latest enry and then the last contact
	ORDER BY 	`t1`.`entry_date` 	DESC
		, 		`t1`.`last_contact` DESC
	) AS `rank_id`

FROM #table# AS `t1`
) `t2`
-- end t2

ORDER BY 	`t2`.`ID` 
		, 	`t2`.`rank_id`