Remove Duplicate Values from Multiple Files

soon_98 · July 28, 2021, 7:38am

Hi all, I am working with multiple files, and I need to first remove the duplicate values from each file and then merge them into a single CSV file. How should I do this? Thanks.

gonhaddock · July 28, 2021, 7:57am

Hi @soon_98 and welcome to the KNIME community

You haven't given many details about your problem, and the range of possible cases is broad, for example:

I have repeated files in the same folder (mounting path) with different names
I have repeated file names in different folders (mounting paths)
I have repeated files with different names in different folders
…
The right approach may be completely different in each case.

For the simplest case, the same files in one folder, use a ‘String Manipulation’ node to remove the trailing repetition brackets, (1), (2)…, and then a ‘Group By’ node will do the job for you.
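Outside KNIME, the same idea (strip the copy marker, then group by the cleaned name) can be sketched in Python. This is only an illustration: the helper names `canonical_name` and `group_copies` are made up for this example, and it assumes the copy marker is a number in round brackets right before the file extension.

```python
import re
from collections import defaultdict
from pathlib import PurePath

# Assumption: a copy marker looks like " (1)" or "(2)" at the end of the file stem.
_COPY_SUFFIX = re.compile(r"\s*\(\d+\)$")


def canonical_name(path):
    """Return the file name with any trailing copy marker removed."""
    p = PurePath(path)
    stem = _COPY_SUFFIX.sub("", p.stem)
    return stem + p.suffix


def group_copies(paths):
    """Group paths by canonical file name, keeping the input order in each group."""
    groups = defaultdict(list)
    for path in paths:
        groups[canonical_name(path)].append(path)
    return dict(groups)
```

Each key of the returned dictionary is one original file name, and its value lists every path that looks like a copy of it.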

For more details, please provide a dummy file list with paths that represents your problem, and we will be able to give more accurate help.

Regards

soon_98 · July 28, 2021, 8:09am

Hi,
Actually, there are duplicate values in each file, and I have more than 40,000 files to work with, so it would be a waste of time to remove the duplicate values from each file one by one before merging them. Besides, the files are too big, so it would be a problem if I had to combine all the data first before removing the duplicate values. Is there any other solution?
Thanks and Regards.

gonhaddock · July 28, 2021, 8:41am

Well, this is my view: the point of using KNIME is to bring the data into KNIME for analysis, so the basic workflow would be to read all the files in a loop and aggregate the transformed data by collecting everything at the end of the loop…

If the problem you are describing doesn’t fit or scale with this type of approach, I would rather code it in R (Py… optional), where you can access or bring in the information without opening the files. Then, if you still want to run some processing in KNIME, you can do either: import your processed data as a single file, or run the R script from KNIME.
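As a rough sketch of the scripted route in Python, the function below removes duplicates within each file and appends the remaining rows to one output file, so only one file's rows are held in memory at a time. It rests on assumptions that are not stated in the thread: every file is a UTF-8 CSV with the same header row, a "duplicate" means an identical whole row, and the name `merge_deduplicated` is invented for this example.

```python
import csv
from pathlib import Path


def merge_deduplicated(input_dir, output_file, pattern="*.csv"):
    """Drop duplicate rows within each file, then append the rest to one CSV.

    Returns the number of data rows written (the header is not counted).
    """
    output_path = Path(output_file).resolve()
    header_written = False
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in sorted(Path(input_dir).glob(pattern)):
            if path.resolve() == output_path:
                continue  # never read the output file back in
            seen = set()  # reset per file: duplicates are judged within one file
            with open(path, newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(header)  # assumption: all headers are identical
                    header_written = True
                for row in reader:
                    key = tuple(row)
                    if key not in seen:
                        seen.add(key)
                        writer.writerow(row)
                        rows_written += 1
    return rows_written
```

Rows that repeat across different files are kept here, because the `seen` set is reset for each file. Moving `seen = set()` above the loop would remove duplicates across all files, at the cost of keeping every distinct row in memory.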

andrejz · July 28, 2021, 9:14am

Hi,

You can read all the CSV files with the “CSV Reader” node and then use the “Duplicate Row Filter” node.

regards
Andrej

SamirAbida · July 28, 2021, 10:17am

Hello Soon_98,

Do all of your files have the same columns?

If so, you could just:

load your CSV files from a folder with the “CSV Reader” node, which will compile them into one table
remove duplicates with the “Duplicate Row Filter” node
write your compiled file with the “CSV Writer” node (see screenshot below)
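For readers who want to see the same read, filter, write sequence as code, here is a minimal Python sketch. It is not how the KNIME nodes are implemented. The function names are invented for this example, and it assumes UTF-8 CSV files that share one header row, with all rows fitting in memory.

```python
import csv
from pathlib import Path


def read_folder(folder, pattern="*.csv"):
    """Read every matching CSV file into one list of rows with a shared header."""
    header, rows = None, []
    for path in sorted(Path(folder).glob(pattern)):
        with open(path, newline="", encoding="utf-8") as src:
            reader = csv.reader(src)
            file_header = next(reader, None)
            if file_header is None:
                continue  # skip empty files
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError(f"{path.name} has different columns")
            rows.extend(tuple(r) for r in reader)
    return header, rows


def drop_duplicate_rows(rows):
    """Keep the first occurrence of each identical row, preserving order."""
    return list(dict.fromkeys(rows))


def write_csv(path, header, rows):
    """Write the header (if any) and the rows to a single CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)
```

A typical call would be `header, rows = read_folder("input")` followed by `write_csv("merged.csv", header, drop_duplicate_rows(rows))`, where both paths are placeholders.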

Best Regards,

Samir

soon_98 · July 28, 2021, 12:59pm

Hi, it works well. Thank you so much for your help.

system · August 4, 2021, 1:00pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.