Levenshtein distance in a row

Dear community,

I am trying to compare two strings using the Levenshtein distance. The two strings are in the same row, in separate columns.

However, I cannot find a solution. The String Matcher node and the Similarity Search node are a bit too sophisticated for this job, as they compare the string to all values of a column, whereas I only need the comparison within a single row (or I did not understand how to set up the nodes properly).

Do you have any advice for me?

Thank you in advance.

Hi,

this should help: https://www.knime.com/forum/knime-users/distance-between-two-columns

-- Philipp

Thank you for your fast response - I will have a look at the Palladian nodes package.

In the meantime I played around with the Java Snippet node together with the Guava and Simmetrics Java libraries. I am posting the code in case anybody is looking for a solution to this problem in the future:


// system imports
import org.knime.base.node.jsnippet.expression.AbstractJSnippet;
import org.knime.base.node.jsnippet.expression.Abort;
import org.knime.base.node.jsnippet.expression.Cell;
import org.knime.base.node.jsnippet.expression.ColumnException;
import org.knime.base.node.jsnippet.expression.TypeException;
import static org.knime.base.node.jsnippet.expression.Type.*;
import java.util.Date;
import java.util.Calendar;
import org.w3c.dom.Document;

// Your custom imports:
import static com.google.common.base.Predicates.in;
import static org.simmetrics.builders.StringMetricBuilder.with;

import java.util.Set;

import org.simmetrics.StringMetric;
import org.simmetrics.metrics.CosineSimilarity;
import org.simmetrics.metrics.Levenshtein;
import org.simmetrics.simplifiers.Simplifiers;
import org.simmetrics.tokenizers.Tokenizers;

import com.google.common.base.Function;
import com.google.common.base.Predicate;
import com.google.common.base.Predicates;
import com.google.common.collect.Multiset;
import com.google.common.collect.Sets;

// system variables
public class JSnippet extends AbstractJSnippet {
// Fields for input columns
/** Input column: "VALUE_OLD" */
public String c_VALUE_OLD;
/** Input column: "VALUE_NEW" */
public String c_VALUE_NEW;

// Fields for output columns
/** Output column: “cVALUE_OLD_NEW_LevDistance” */
public Double out_cVALUE_OLD_NEW_LevDistance;

// Your custom variables:

// expression start
public void snippet() throws TypeException, ColumnException, Abort {
// Enter your code here:

StringMetric metric = with(new Levenshtein())
.simplify(Simplifiers.removeDiacritics())
.simplify(Simplifiers.toLowerCase())
.build();

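// Note: as far as I can tell, metric.compare(...) returns a normalized similarity
// between 0 and 1 (1 = identical), not a raw edit count.
// Multiplying and dividing by 100.0 rounds the value to two decimal places.
double distance = Math.round( metric.compare(c_VALUE_OLD, c_VALUE_NEW ) * 100.0) / 100.0;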

out_cVALUE_OLD_NEW_LevDistance = distance;

// expression end
}
}
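
For anyone who cannot add the Simmetrics jar, here is a minimal sketch of the same idea in plain Java with no extra libraries. The class name LevenshteinSketch and its two methods are hypothetical names chosen for illustration; they are not part of KNIME or Simmetrics. The similarity method normalizes by the length of the longer string, which is an assumption about what a normalized Levenshtein score usually means and may differ in detail from what the library computes.

```java
// Hypothetical helper class for illustration only; not part of KNIME or Simmetrics.
final class LevenshteinSketch {

    /** Raw Levenshtein distance: number of single-character insertions, deletions and substitutions. */
    static int distance(String a, String b) {
        // Two rolling rows of the dynamic-programming table.
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1], assuming normalization by the longer string's length. */
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) distance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));   // raw edit distance
        System.out.println(similarity("kitten", "sitting")); // normalized similarity
    }
}
```

Inside a Java Snippet, the two methods could be pasted into the "Your custom variables" section and called with c_VALUE_OLD and c_VALUE_NEW. Missing cells would probably arrive as null, so a null check before the call seems advisable.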

The Similarity Search node seems to do the job. It has Levenshtein and Jaro-Winkler distance options.
