Duplicate Image finder

Hello there,

My parents-in-law save their files on several external hard drives. For safety, they store each picture on more than one drive. Over time, many pictures have ended up saved more than once on the same drive.

Now I want to try the image processing feature to find duplicate pictures on a single hard drive. Do you have any suggestions on how to do this?

I hope I was able to explain the use case properly.

Greetings
Torben

Hi @Feltor, the only way I can think of doing this is to compute an MD5 checksum for each picture and then apply a duplicate row filter (or do a count via a GroupBy and filter where count > 1).

I put something together for you.

[Screenshots not preserved: input file list, results with unique files, results with duplicate files]

And here’s the workflow: Compare image files.knwf (2.8 MB)
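For anyone who wants to see the idea outside of KNIME, here is a minimal Python sketch of the same logic: hash every file, group by checksum, and keep the groups with a count > 1. It is not what the workflow does internally, and the function names (`md5_of_file`, `group_by_checksum`, `find_duplicates`) are made up for illustration. Note that this only catches byte-identical files; a resized or re-saved copy of a picture gets a different checksum.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_checksum(paths):
    """Map each MD5 digest to the list of paths that share it."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of_file(path)].append(Path(path))
    return groups


def find_duplicates(paths):
    """Keep only the checksum groups that contain more than one file."""
    return {
        digest: files
        for digest, files in group_by_checksum(paths).items()
        if len(files) > 1
    }
```

To try it on a folder, something like `find_duplicates(Path("some_folder").rglob("*.jpg"))` should work, where the folder name and file pattern are placeholders.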


Thank you very much :slight_smile:

Hi @Feltor,

Just a thought, but if you have a very large number of image files, you might be able to optimise this a little with some additional processing. If you use one of the File Meta Info nodes (which one depends on the KNIME version you are using), you can first collect the sizes (in bytes) of all the files to be processed.

Once you have that, discard all of those that have unique sizes and then pass only those files with duplicated sizes to the process that @bruno29a has provided. Even if the MD5 calculation itself is fast, this would potentially greatly reduce the number of image files to actually be read and processed.
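As a rough illustration of that pre-filter outside of KNIME, here is a small Python sketch. It is only an assumption about how one might write it, not the File Meta Info node's implementation, and `candidates_by_size` is a hypothetical name. `os.path.getsize` reads the size from the file system metadata, so the file contents are not opened at this stage.

```python
import os
from collections import defaultdict


def candidates_by_size(paths):
    """Return only the paths whose size in bytes is shared with another file."""
    by_size = defaultdict(list)
    for path in paths:
        # Size comes from file metadata; the file itself is not read.
        by_size[os.path.getsize(path)].append(path)
    # A file with a unique size cannot be byte-identical to any other file.
    return [path for group in by_size.values() if len(group) > 1 for path in group]
```

Only the paths returned here would then need a checksum calculated, since two files with different sizes can never have identical contents.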


Hi @takbb, that was my original plan, but I didn’t know how to get the file size, let alone how to get it without reading the file. Reading the image file, rather than the MD5 calculation, is where the performance hit is. Thanks for pointing out which nodes can do that :slight_smile:

@Feltor , I integrated the filtering on size before reading the files, and the workflow looks like this now:

And here’s the new workflow: Compare image files.knwf (2.8 MB)


Nice one @bruno29a! Yes, I agree with you on where the performance hit is. Anything to avoid physically reading all the images :wink:

Thanks for the ideas on using MD5. I’ve been doing something similar to this recently to tidy up my own photo library, and I will incorporate your ideas into it when I next get back to looking at this :slight_smile:

