Deleting all instances of duplicate row values

The_Intern · January 10, 2017, 10:11am

Hi,

Here is the deal: I have a large table (15k rows) with a summary of "production parameters", they are identified by a Production_ID. These production parameters are coupled to some measurement statistics, these measurement statistics are identified by a Measurement_ID. Most of these Production_ID's and Measurement_ID's have a correct 1:1 mapping, but for a variety of reasons, a number of Production_ID's have 2 Measurement_ID's assigned to them. Sometimes the first measurement is the correctly assigned one, sometimes the second one is, but I have no way of knowing this unless checking them manually. Since this is not really an option timewise, considering the size of the table, my solution is to drop both the Measurement_ID's and their corresponding Production_ID.

Translated to KNIME, I want to delete all rows with duplicate Production_ID's. So not only the duplicates, but also the 'original'. As far as I know, the Groupby node does not offer this functionality. What I am trying to implement now is 2 Lag Columns, one with the value of the previous Production_ID and one with the value of the next Production_ID (table is sorted on Production_ID) and if any of those two is the same as the current Production_ID, set that Production_ID to null and run it through a Missing Values node. I think this will eventually work, but I can't help but wonder if there would be a simpler and more elegant solution.

Thanks in advance, J

ImNotGoodSry · January 10, 2017, 2:27pm

Hey,

you can use the GroupBy node to count the number of rows/entries for each Production_ID. Then use the Row Filter node to keep only these rows with Count > 1. Now use this node's output for the Reference Row Filter node to exclude these Production_IDs from your initial table.

Best,
Marc

The_Intern · January 10, 2017, 3:15pm

Hi Marc, thanks for your quick reply.

I understand what you're suggesting, but I can't get the GroupBy node to provide the functionality as you state in your answer, what settings should I apply within the node?

PS: I did fix it the way I suggested in my first post, but since that takes 7 nodes and your solution uses 3 nodes, I'm trying to implement it your way.

s.roughley · January 10, 2017, 6:14pm

In the Groups tab, select the 'Production_ID' column

In the Manual Aggregation tab, select the 'Measurement_ID' column, and then under 'Aggregation (click to change)' select 'Count'

Steve

The_Intern · January 11, 2017, 7:56am

Ah thanks I've got it! I've only used the GroupBy node for filtering duplicates so far, was not aware of its more advanced functionalities, but this is very nice to know!

Also thanks to Marc for the solution, it works now and looks way more elegant than my own solution :)

ImNotGoodSry · January 11, 2017, 9:48am

You're welcome!