Detect and delete duplicate rows based on two variable identifiers (date and lot number), rather than on row ID

This is what we would like to do in Knime:
1) Append two datasets based on two variables, "Lot Number" and "In Date", detect the duplicated rows of data, and then delete them. We have a master dataset and we want to append a second dataset that may contain the exact same data the master dataset contains, plus new data that was added some time later (or this second dataset may contain only new data, in which case no duplicate rows would be found). However, because of how the data is imported, both datasets will have largely the same row IDs.
For this step we tried the Concatenate node, but it seems to drop rows whose row IDs are duplicated. Because both of our datasets have the same row IDs, it automatically deletes all the rows from the second dataset. In addition, we could not find an option that would allow us to detect duplicated rows by their content (because they contain the same information) rather than by their row ID.
We also tried the Joiner node (Data Manipulation > Column > Split & Combine > Joiner), but it keeps giving us a warning about duplicate column handling.
 
We do not necessarily need to detect duplicates based on those two variables, but we do need to be able to detect duplicated rows in the appended table so the duplicates can be deleted.
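To make the goal concrete, here is a rough sketch in plain Python (outside KNIME) of the behaviour described above: append two tables and drop rows whose content in the key columns duplicates a row already present, ignoring row IDs entirely. The column names are taken from the question; the sample values are invented for illustration.

```python
# Two small sample tables; the values are made up for this sketch.
master = [
    {"Lot Number": "A1", "In Date": "2015-01-02", "Value": 10},
    {"Lot Number": "B7", "In Date": "2015-01-05", "Value": 12},
]
second = [
    {"Lot Number": "B7", "In Date": "2015-01-05", "Value": 12},  # duplicate content
    {"Lot Number": "C3", "In Date": "2015-02-11", "Value": 99},  # new data
]

seen = set()
combined = []
for row in master + second:          # "append" the two tables
    key = (row["Lot Number"], row["In Date"])
    if key not in seen:              # keep only the first occurrence of each key
        seen.add(key)
        combined.append(row)

print(len(combined))  # 3 unique rows remain
```

The point is that the comparison is done on the *content* of the key columns, so identical row IDs in the two tables are irrelevant.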
 
We will appreciate any insights on this matter.

Thank you very much,

Natalia.

Hi Natalia,

Unfortunately I am not 100% clear on the problem, but as far as I can tell, after you concatenate the rows together you have some rows with duplicate entries. In the Concatenate node options, be sure to choose "Union" and "Append suffix" for duplicate RowID handling; this way no rows should be filtered out.

Now in your combined table, as you mention, there will be duplicates, and these can be removed by using the GroupBy node.

In the GroupBy node, select the column(s) you want to check for duplicates. I am assuming here you would choose "Lot Number" and "In Date". Then, in the Options tab, add all your data columns and choose the "Last" aggregation. Doing this means that for any duplicate entries, only the data from the last occurrence will be kept (i.e. the data from the bottom concatenated table, which is the one you mention may contain newer data).
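A minimal sketch in plain Python (not KNIME itself) of what grouping with a "Last" aggregation does: group by the key columns and keep the data from the last row seen in each group, so newer duplicates overwrite older ones. The sample data is invented.

```python
rows = [
    {"Lot Number": "A1", "In Date": "2015-01-02", "Value": 10},  # older entry
    {"Lot Number": "A1", "In Date": "2015-01-02", "Value": 11},  # newer duplicate
    {"Lot Number": "C3", "In Date": "2015-02-11", "Value": 99},
]

last_per_group = {}
for row in rows:
    key = (row["Lot Number"], row["In Date"])
    last_per_group[key] = row        # later rows overwrite earlier duplicates

deduplicated = list(last_per_group.values())
# A1's newer entry (Value 11) wins; C3 is kept as-is
```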

Hope this helps. If I've completely missed the point, then feel free to elaborate on the problem some more.

Simon.

Dear Natalia

It looks like we encountered similar issues. If I understand correctly, you would like to keep the so-called first occurrence in a series of concatenated rows, based on unique IDs that may be repeated several times during the concatenation process. Indeed, I would use the GroupBy node as Simon suggested, but then I would also use a Joiner to keep the original rows, avoiding the aggregation setup that needs to be reconfigured every time the number of columns changes. Please find attached an example that will hopefully help you. In your case you may want to sort the rows appropriately so that the rows you want to keep come first.
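A rough sketch in plain Python of the GroupBy + Joiner idea described above: first determine, per key, the row ID of the first occurrence, then join those IDs back against the full table to recover the original rows unchanged, regardless of how many data columns there are. The sample data and column names beyond "Lot Number" and "In Date" are invented.

```python
rows = [
    {"RowID": "Row0", "Lot Number": "A1", "In Date": "2015-01-02", "Value": 10},
    {"RowID": "Row1", "Lot Number": "A1", "In Date": "2015-01-02", "Value": 11},
    {"RowID": "Row2", "Lot Number": "C3", "In Date": "2015-02-11", "Value": 99},
]

# "GroupBy": record the first RowID seen per (Lot Number, In Date) group
first_id = {}
for row in rows:
    key = (row["Lot Number"], row["In Date"])
    first_id.setdefault(key, row["RowID"])

# "Joiner": keep only the original rows whose RowID is a group's first occurrence
keep = set(first_id.values())
result = [row for row in rows if row["RowID"] in keep]
```

The advantage over per-column aggregation is that `result` contains the original rows intact, so nothing needs reconfiguring when columns are added or removed.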

Frederico

Thank you very much Simon for such a quick reply. I will try what you suggested.

Natalia

Dear Frederico:

I appreciate your explanation and the attached file. I will look into it and try. It looks like you are right on.

Thank you very much,

Natalia

DuplicateFilesDeleter helped me in this case.

Hi all, I don't just want to remove the duplicates; I also want to see which rows were removed, and add a frequency or count of how many times each one was repeated. Any suggestions? Thanks

Here is a solution based on the split-apply-combine strategy:

  1. after concatenation of all tables as specified by Simon, open a loop with Group Loop Start on the column(s) that allow you to detect duplicate rows;
  2. apply a Math node to create a new variable (e.g. nbrows) which counts the number of cases per group by specifying the simple formula ROWCOUNT ;
  3. apply a Math node to create a variable which identifies the first row of each group by specifying the formula ROWINDEX == 0 (mind the double equal sign);
  4. apply a Math node to create a variable which tells you whether or not there are any duplicates by specifying the formula ROWCOUNT > 1 ;
  5. close the loop with Loop End.

Split (step 1), Apply (steps 2-4), Combine (step 5). Feel free to tailor the indicators in the Apply step to your needs. After the loop you can do anything with the duplicates you like (split, filter, analyze, identify, etc.).
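The loop above can be sketched in plain Python (standing in for the KNIME nodes): per group, count the rows (nbrows, i.e. ROWCOUNT), flag the first row of the group (ROWINDEX == 0), and flag whether the group contains duplicates at all (ROWCOUNT > 1). The sample data is invented.

```python
from itertools import groupby

rows = [
    {"Lot Number": "A1", "In Date": "2015-01-02"},
    {"Lot Number": "A1", "In Date": "2015-01-02"},
    {"Lot Number": "C3", "In Date": "2015-02-11"},
]

key = lambda r: (r["Lot Number"], r["In Date"])
out = []
for _, grp in groupby(sorted(rows, key=key), key=key):   # Group Loop Start
    grp = list(grp)
    for i, row in enumerate(grp):                        # Apply (the Math nodes)
        out.append(dict(row,
                        nbrows=len(grp),                 # ROWCOUNT
                        first=(i == 0),                  # ROWINDEX == 0
                        has_dup=(len(grp) > 1)))         # ROWCOUNT > 1
# out is the combined table (Loop End)
```

After this, filtering on `first` keeps one row per group while `nbrows` carries the repetition count, which also answers the frequency question above.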

Hi, could you share an example of what was done? Thank you

Dear Simon, 

Would you please take a look at my question at the link below.

https://tech.knime.org/forum/knime-reporting/passing-a-column-values-to-a-database-connector-and-running-a-query-on-the-lis

Many thanks