String similarity of URLs

Hi there,

I’m new to KNIME, so this exercise is not easy for me. I have a dataset of URLs (as strings), and I want to compare new URLs (from another dataset) with my collected URLs in terms of similarity. The String Similarity node (with Jaro-Winkler set in the configuration) seems promising for this task.

My problem is that the String Similarity node compares the entry of one column with the entry of another column in the same row. However, I want to check the similarity of one URL against every URL in my dataset.
One inefficient approach that works so far: using the Rule Engine node, I created a second column in my big URL dataset that holds one URL (the same in every row). The String Similarity node (Jaro-Winkler) then compares the two columns row by row.

Does anyone know how I could scale this String Similarity comparison in an efficient way? Looping over the rows of the URL column and comparing each entry would be nice, but I have no idea how to do that in KNIME.

Greetings,
Jon

Hi Jon,

Does anyone know how I could scale this String Similarity comparison in an efficient way? Looping over the rows of the URL column and comparing each entry would be nice, but I have no idea how to do that in KNIME.

This depends a bit on your goal.

The most “brute force” solution (which you’re suggesting) would be to simply cross-join the table with itself, so that you can determine the similarity for any pair. You can do this with the Cross Joiner node.

Simply connect the input table to both input ports and you’ll end up with a table of all “combinations”.
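To make the size of that result concrete, here is a minimal sketch of the same idea in plain Python, outside KNIME. It is only an illustration: the example URLs and the function names are made up, and `difflib.SequenceMatcher` is a stand-in metric from the standard library, not the Jaro-Winkler measure used by the String Similarity node.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a: str, b: str) -> float:
    # Stand-in metric in [0, 1]; NOT Jaro-Winkler.
    return SequenceMatcher(None, a, b).ratio()


def cross_join_similarity(urls):
    # One output row per ordered pair, like joining the table with itself.
    return [(a, b, similarity(a, b)) for a, b in product(urls, repeat=2)]


# Hypothetical example data.
urls = [
    "https://example.com/login",
    "https://example.com/logout",
    "https://example.org/about",
]

pairs = cross_join_similarity(urls)
print(len(pairs))  # 3 URLs give 3 * 3 = 9 rows
```

The row count grows quadratically, so 10,000 URLs would already produce 100 million pairs.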

Alternatively, you could run a loop where you pick one reference row in each iteration, determine its similarity to every other row, and keep only those rows where the similarity is greater than a given threshold (this avoids creating a table of size n x n).
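The loop-and-filter idea can be sketched in plain Python as follows. This is not KNIME code and not the node's implementation: the Jaro-Winkler functions are a textbook version written for illustration (prefix weight 0.1, at most 4 prefix characters), and the names `similar_pairs`, `known_urls` and `new_urls`, the example URLs, and the threshold of 0.9 are all assumptions.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    # Boost the score for strings that share a common prefix.
    return score + prefix * p * (1 - score)


def similar_pairs(new_urls, known_urls, threshold=0.9):
    # Yield only the pairs above the threshold; the full n x m table is never stored.
    for new_url in new_urls:
        for known_url in known_urls:
            score = jaro_winkler(new_url, known_url)
            if score > threshold:
                yield new_url, known_url, score


# Hypothetical example data.
known_urls = ["https://example.com/login", "https://example.org/about"]
new_urls = ["https://example.com/login", "https://example.com/log1n"]

for new_url, known_url, score in similar_pairs(new_urls, known_urls):
    print(new_url, known_url, round(score, 3))
```

One thing to keep in mind with this metric: URLs that share a scheme such as "https://" always get the full prefix bonus, so it may be worth stripping the scheme before comparing.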

KNIME also has a “Distance Matrix Calculate” node (and companions, which you can find in the “Distance Matrix” category) for building up a “distance vector”.

I have to admit that it never really “clicked” for me (probably I’m too stupid and the idea of the “distance vector” data type is too hard to grasp for me).
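For what it’s worth, the general idea of a pairwise distance matrix can be sketched in plain Python. This is not a description of how KNIME stores its distance vector; it only assumes that a distance is 1 minus a similarity and that each unordered pair is stored once. `difflib.SequenceMatcher` is again a stand-in for Jaro-Winkler, and the URLs and function names are made up.

```python
from difflib import SequenceMatcher
from itertools import combinations


def distance(a: str, b: str) -> float:
    # Assumed convention: distance = 1 - similarity, so 0.0 means identical.
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def pairwise_distances(items):
    # Store each unordered pair (i, j) with i < j exactly once.
    return {
        (i, j): distance(items[i], items[j])
        for i, j in combinations(range(len(items)), 2)
    }


# Hypothetical example data.
urls = [
    "https://example.com/login",
    "https://example.com/logout",
    "https://example.org/about",
]

distances = pairwise_distances(urls)
print(len(distances))  # 3 URLs give 3 * 2 / 2 = 3 pairs
```

Storing each pair once roughly halves the work compared with the full cross join, but the growth is still quadratic.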

Anyways, these are some potential options. Bonne chance!

– Philipp


Welcome to KNIME. You can get some ideas from this example:
https://www.knime.com/blog/address-deduplication

