Similarity Check within a column

Hi, I would like to find similar URLs within a column and mark them as duplicates (e.g. by creating a new column with “Duplicate”). I have tried String Similarity, but that compares two columns with each other. I would like to run the check within a single column.

A classic example would be the same URL once with and once without a / at the end. Another is a URL path that was changed slightly, e.g. with an additional pronoun, so that a duplicate was created.

Does anyone have any idea how I can solve this?

Hi Juliane,

If you have them within a single column, you’ll need to do a cross join first to build pairs, which you can then compare.

-Philipp


@Juliane welcome to the KNIME forum. You could try this example without a ground truth. You will have to see if it works for URLs.


@qqilihq thanks for the hint. I already tried something like this, but what would you do next (I tried Similarity Search, but then I can’t filter them out of the list…).

Hi, I have downloaded the workflow, but unfortunately I still don’t understand how to apply it to my case… I have a list with over 1000 URLs and sometimes up to 3 or 4 similar URLs that I would like to filter out or possibly merge… I don’t quite understand which part of your workflow can help me.

I would do the following:

  1. Build pairs as described above
  2. Use e.g. String Similarity or Column Distance node
  3. Determine suitable distance / similarity metric
  4. Apply a threshold above which pairs are considered “duplicates”
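Outside of KNIME, the four steps above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the workflow itself: the function name `mark_duplicates`, the sample URLs, the choice of `difflib.SequenceMatcher` as the similarity metric and the threshold of 0.9 are all made up for the example and would need tuning on real data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mark_duplicates(urls, threshold=0.9):
    """Return one label per URL: "Duplicate" or "" (hypothetical helper)."""
    labels = [""] * len(urls)
    # Step 1: build all unordered pairs within the single column.
    for i, j in combinations(range(len(urls)), 2):
        # Steps 2 and 3: compute a similarity score in [0, 1] for the pair.
        score = SequenceMatcher(None, urls[i], urls[j]).ratio()
        # Step 4: pairs at or above the threshold count as duplicates.
        # The earlier row is kept, the later one is flagged.
        if score >= threshold:
            labels[j] = "Duplicate"
    return labels


# Made-up sample data: the second URL only differs by the trailing slash.
urls = [
    "https://example.com/products/shoes",
    "https://example.com/products/shoes/",
    "https://example.com/about",
]
print(mark_duplicates(urls))
```

With around 1000 URLs this compares roughly 500,000 pairs, which is still manageable. The threshold is the part that needs experimenting: too low and unrelated pages on the same domain get flagged, too high and only trailing-slash cases are caught.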

You can also try a clustering algorithm such as DBSCAN. As the distance for the clustering you can use the edit distance (Levenshtein distance).
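To show how the clustering idea works, here is a rough, self-contained Python sketch of DBSCAN on top of the Levenshtein distance. It is a minimal textbook-style version written for illustration, not what any KNIME node does internally. The function names, the sample URLs and the parameters `eps=3` and `min_samples=2` are assumptions chosen for the example.

```python
def levenshtein(a, b):
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def dbscan(items, eps, min_samples, dist):
    """Minimal DBSCAN: returns a cluster id >= 0 per item, or -1 for noise."""
    n = len(items)
    # Neighbourhood of each point (the point itself is included).
    neighbours = [[j for j in range(n) if dist(items[i], items[j]) <= eps]
                  for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # not a core point; noise unless claimed later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point previously marked noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point: expand
    return labels


# Made-up sample data for illustration.
urls = [
    "https://example.com/shoes",
    "https://example.com/shoes/",
    "https://example.com/my-shoes",
    "https://example.com/contact",
]
labels = dbscan(urls, eps=3, min_samples=2, dist=levenshtein)
print(labels)
```

URLs that end up with the same cluster id are candidates for being duplicates of each other, and a label of -1 means no similar URL was found. Compared with a plain pairwise threshold, this directly yields groups of 3 or 4 similar URLs, but `eps` plays the same role as the threshold and needs the same tuning.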

@Juliane it allows for a deduplication without a ground truth so it will group similar items within one list. I thought this might be similar to your case.
