Similarity Check within a column

Juliane · July 31, 2024, 12:08pm

Hi, I would like to find similar URLs within a column and mark them as duplicates (e.g. by creating a new column with “Duplicate”). I have tried String Similarity, but it compares two columns. I would like to run the check within a single column.

A classic example would be two URLs that differ only by a / at the end. Or the URL path was changed minimally, for example by adding a pronoun, which created a duplicate.

Does anyone have any idea how I can solve this?

qqilihq · July 31, 2024, 1:00pm

Hi Juliane,

If you have them within a single column, you’ll need to do a cross join first to build pairs, which you can then compare.
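To illustrate the pairing idea outside of KNIME, here is a minimal Python sketch. The URLs are made-up sample values, not data from this thread. A full cross join of a column with itself also produces self-pairs and both orderings of every pair, so for a symmetric comparison it is usually enough to keep each unordered pair once.

```python
from itertools import combinations

# Hypothetical sample column; replace with your own URL list.
urls = [
    "https://example.com/shop",
    "https://example.com/shop/",
    "https://example.com/about",
]

# Full cross join: every row combined with every row (n * n pairs),
# including (a, a) and both (a, b) and (b, a).
cross = [(a, b) for a in urls for b in urls]

# Each unordered pair exactly once: n * (n - 1) / 2 pairs.
pairs = list(combinations(urls, 2))

print(len(cross), len(pairs))
```

For a list of 1000 URLs this comes to 499,500 unordered pairs, which is still small enough to compare directly.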

-Philipp

mlauber71 · July 31, 2024, 1:32pm

@Juliane welcome to the KNIME forum. You could try this example without a ground truth. You will have to see if it works for URLs.

Juliane · August 1, 2024, 7:45am

@qqilihq thanks for the hint. I already tried something like this, but what would you do next (I tried Similarity Search, but then I can’t filter them out of the list…).

Juliane · August 1, 2024, 8:10am

Hi, I have downloaded the workflow, but unfortunately I still don’t understand how to apply it to my case… I have a list with over 1000 URLs and sometimes up to 3 or 4 similar URLs that I would like to filter out or possibly merge… I don’t quite understand which part of your workflow can help me.

qqilihq · August 1, 2024, 9:01am

I would do the following:

1. Build pairs as described above
2. Use e.g. String Similarity or Column Distance node
3. Determine suitable distance / similarity metric
4. Apply a threshold above which pairs are considered “duplicates”
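The steps above can be sketched in plain Python as follows. This is only an illustration, not the KNIME workflow itself: the similarity measure (`difflib.SequenceMatcher` from the standard library), the threshold of 0.9, the helper names and the sample URLs are all assumptions, and a suitable metric and threshold would need to be tuned on real data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def mark_duplicates(urls, threshold=0.9):
    """Return a new column: "Duplicate" for later near-copies, "" otherwise."""
    flagged = set()
    # Compare each unordered pair of rows once.
    for i, j in combinations(range(len(urls)), 2):
        if similarity(urls[i], urls[j]) >= threshold:
            flagged.add(j)  # keep the first occurrence, flag the later one
    return ["Duplicate" if k in flagged else "" for k in range(len(urls))]


# Hypothetical sample column.
urls = [
    "https://example.com/shop",
    "https://example.com/shop/",
    "https://example.com/about-us",
    "https://example.org/contact",
]

for url, mark in zip(urls, mark_duplicates(urls)):
    print(url, mark)
```

Because URLs on the same site share a long common prefix, their similarity scores tend to be high even when the pages differ, so the threshold usually has to be fairly strict.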

tomljh · August 1, 2024, 9:11am

You can try a clustering algorithm such as DBSCAN, using the edit distance (Levenshtein distance) as the distance measure.
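As a rough illustration of this idea, the sketch below groups URLs by Levenshtein distance in plain Python. It is a simplified stand-in, not the KNIME DBSCAN node: it behaves like DBSCAN with a minimum cluster size of 1, where every URL is a core point and a cluster is any chain of URLs whose neighbours are within `eps` edits. The function names, the value `eps=2` and the sample URLs are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cluster(urls, eps=2):
    """Assign a cluster label to each URL; chained neighbours share a label."""
    labels = [-1] * len(urls)
    next_label = 0
    for start in range(len(urls)):
        if labels[start] != -1:
            continue  # already assigned to a cluster
        labels[start] = next_label
        stack = [start]
        while stack:  # expand the cluster through all reachable neighbours
            i = stack.pop()
            for j in range(len(urls)):
                if labels[j] == -1 and levenshtein(urls[i], urls[j]) <= eps:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Hypothetical sample column.
urls = [
    "https://example.com/shop",
    "https://example.com/shop/",
    "https://example.com/shops/",
    "https://example.com/about",
]
print(cluster(urls))
```

URLs that end up with the same label can then be treated as one group of duplicates, for example by keeping only the first URL per label.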

mlauber71 · August 1, 2024, 4:15pm

@Juliane the example allows for deduplication without a ground truth, so it will group similar items within one list. I thought this might be similar to your case.
