I know it has been asked before and the reply is GroupBy node but I have not idea how to do it. Could anyone please share a workflow or at least tell me the settings in the configuration?
I set up a workflow to demonstrate how this could be done
- use group by to calculate how many duplicates there are (note: KNIME should introduce a generic COUNT(*) function - I had to use a variable)
- if the count is larger then 1 it is a duplicate
- left join it back to the original data
- sort the data by ID and other variables if you want to keep one of the duplicates
- use the LAG column to identify which line is a 2nd, 3rd occurrence of a duplicate
- make a rule to keep just a single line of each ID
- alternative: just remove all duplicates
kn_example_duplicates.knwf (44.9 KB)
This is a quite impressive solution but I think in most cases the GroupBy node should be able to all do this in one step. If you want to keep only one of the duplicate rows, just add all columns to the Group columns option. Then you will only get one row per “Group” which is defined as all rows that have the exact same values in the selected columns.
Depending on your needs you can then adjust the Aggregation settings to e.g. get the number of duplicate rows that have been removed, this requires an additional column (created with the Constant Value
Column node which is used in the aggregation:
GroupBy Demo.knwf (9.2 KB)
GroupBy is one of the most powerful nodes in KNIME, but it takes a while to learn how to use it efficiently.
I would like to go into my reasoning for deliberately focussing on an ID. Yes you can just remove duplicates by Group by or SELECT DISTINCT.
In a lot of cases you have something like customer IDs or phone numbers. With DISTINCT what would happen in such a case
phone_no | adress | joined
555-1234 | main street 1 | 2008
555-1234 | main street 1C | 2009
555-1234 | Main Street 1 | 2008
you would still end up with 3 distinct cases and you would call/mail the customer three times, and what if your data is 99% unique after the ID but there are a few lines you are missing and would not see in a sample or by just looking at them.
Of course it depends on your use case in the end
You can always just use the ID as grouping value and use aggregation on the other values, so in your example, where you might need some special handling of the address column you could choose the Set aggregation method, so you can handle those rows in a dedicated Address deduplication branch of your workflow.
Fantastic, thank you all!
Yes set, list, sorted, first are all great functions and I do like the Group By node very much (if only I could find a generic count without a constant value ).
Question in my example would be: I want to preserve the whole set of features (phone, address) from the one ID I choose so not to mix them. First and Last would do that and Set or List would give me all there is in one line (great functions, thank you for reminding me) but I would have less control about which line to choose. In my scenario it could be necessary to choose the latest address with the longest street name or something. But again: it depends on the use case. I just would encourage people to take into account everything there is to their data and then make a deliberate decision what should happen to the data.
Amen to that
Very good! A complete solution for what I was searching for! Many thanks!
I usually use RowID (with ensure uniqueness and at the same time extracting the rowid as a column), followed by a string manipulation node (searching for the # and appending a boolean column) to identify duplicates.
I used 3 step flow First Lag column, Sorter, then a Rule based filter with the condición $ROW1$ = $ROW1(-1)$ => TRUE and exclude true
with our KNIME 4.0 Release we published a Duplicate Filter node.
You can find more information about it here:
Thanks for a great thread!
In my case, I’m looking to remove duplicates by a date column. I have an ID Number, Identifier (string), and a date edited. I need to keep the duplicate with the latest edited date, and if the dates are identical, then take the one with the higher ID number.
So, I’ve been contemplating a workflow that gets kinda complicated until I saw the Duplicate Row Filter node, which seems to be doing just fine. I don’t see in the documentation how this node handles things when the row filter chosen has identical values though, but based on a quick test, it appears that it takes the first occurrence in the original data set, so depending on how that dataset is sorted going into the node, which I can work with.
But I LOVED reading all the solutions and seeing the different thought processes here.
Actually you have already discovered the solution. Yes, by default it takes the first instance of the duplicate values, but you can change this behavior in the advanced tab of the node’s configuration window. You have more options to choose from in the “Row selection” section of the tab.
@serendipitytech I can add one more idea to remove duplicates and ensure unique IDs using SQL (Hive in this case) and row_number() if you have to deal with a data base - or you just like using SQL script.
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.