Yet another newbie: similar strings within a column

lug · January 28, 2016, 4:03pm

Hello,

I have read many posts hoping to find an answer to my simple question: I would like to create a column of distance calculations, but within a single column.

I did it successfully, based on the address example, using Similarity Search. But Similarity Search needs my table and a reference table as inputs. I would like to do the same, but without a reference table.

Example: if I have in my column:

java
java(JRE)
java(JDK)

I would like to identify the similarities among these 3 rows and assign a distance estimate to them. But I'm not able to create a reference table of strings. The Index and Search node may help, but I haven't understood how to use it.

Regards

Ema · January 29, 2016, 10:26am

Hi lug,

if you need a distance estimation in a single string column, without any reference table, you could first use the node "String Distances", where you can select a particular type of string distance to obtain a pairwise similarity measure, e.g. Jaro Winkler. (The first row is the base, so the first string has a distance of 0 with itself.) Then use the node "Distance Matrix Calculate", which will append a new column with the distance estimation, and finally a "Distance Matrix Pair Extractor".
With this last node you can easily find pairs of similar strings based on the previous distance calculation and a threshold of your choice (the lower the threshold, the greater the similarity, i.e. the smaller the distance).
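
For readers who think better in code, the idea described above (compute pairwise string distances, then keep the pairs under a threshold) can be sketched in Python. This is only an illustration and not what the KNIME nodes do internally: it assumes a normalized Levenshtein distance instead of Jaro Winkler, and the names levenshtein, normalized_distance and similar_pairs, as well as the 0.3 threshold, are made up for this example.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Scale the edit distance to [0, 1]; 0 means identical strings."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def similar_pairs(values, threshold):
    """Return (a, b, distance) for every pair at or below the threshold."""
    pairs = []
    for a, b in combinations(values, 2):
        d = normalized_distance(a, b)
        if d <= threshold:
            pairs.append((a, b, d))
    return pairs


# Hypothetical input: the three strings from the question, one column.
column = ["java", "java(JRE)", "java(JDK)"]
pairs = similar_pairs(column, threshold=0.3)
for a, b, d in pairs:
    print(f"{a} <-> {b}: {d:.3f}")
```

With these three strings, only java(JRE) and java(JDK) fall under the 0.3 threshold (2 edits over 9 characters), while plain java is 5 edits away from each of them. A different distance function or threshold would of course give different pairs.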

I attach an example workflow (KNIME 3.1)...
Hope it helps (I don't know, maybe there are other, better solutions and suggestions from other Knimers...)

regards
Ema

string_distance_similarity_test.zip

lug · January 30, 2016, 2:46pm

Thank you Ema. This is exactly what I expected.

Geo · February 2, 2016, 11:04pm

Probably much easier: the reference table for Similarity Search needn't be another table; simply use the very same table for both input ports. You can also connect the same table to the String Distances node and feed its output to Similarity Search, which gives you string distance functions beyond those already available in the Similarity Search node.
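
As a rough sketch of the "same table on both ports" idea, here is the equivalent in plain Python: each string is used as a query against the very column it came from. Everything here is an assumption made for illustration: the distance is 1 minus the ratio from Python's difflib.SequenceMatcher (not one of the KNIME distance functions), nearest_in_same_table is a made-up name, and the sketch deliberately skips the query row itself so that a string is not reported as its own nearest neighbour. How the Similarity Search node treats the identical row is not something this sketch tries to mirror.

```python
from difflib import SequenceMatcher


def distance(a: str, b: str) -> float:
    """Stand-in string distance: 0.0 for identical strings, up to 1.0."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def nearest_in_same_table(values):
    """For each row, find the closest *other* row in the same list.

    Returns a list of (query, best_match, distance) tuples, one per row
    that has at least one other row to compare against.
    """
    results = []
    for i, query in enumerate(values):
        # Query table and reference table are the same list; skip row i.
        candidates = [(distance(query, other), other)
                      for j, other in enumerate(values) if j != i]
        if candidates:
            best_distance, best_match = min(candidates, key=lambda c: c[0])
            results.append((query, best_match, best_distance))
    return results


# Hypothetical input: the three strings from the original question.
column = ["java", "java(JRE)", "java(JDK)"]
neighbours = nearest_in_same_table(column)
for query, match, d in neighbours:
    print(f"{query} -> {match} ({d:.3f})")
```

The design point is simply that no separate reference list has to be built: the column serves as both the query side and the reference side.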