Hi everybody, I am trying to calculate the Jaccard string similarity between the values of two string columns ("NAMES ONE" and "NAMES TWO") in the Java Snippet node, using this library, which I added in the Additional Libraries tab.
By calculating the Jaccard index, I mean adding a new column ("Jaccard") that holds the index for each row.
I am aware of the string similarity node that comes with the Palladian package, but I still want to use the Jaccard index.
Below is the code that I am using in the node, which is slightly modified from this GitHub code.
I would be grateful if someone could help me get this code running.
// Your custom imports:
import info.debatty.java.stringsimilarity.interfaces.MetricStringDistance;
import info.debatty.java.stringsimilarity.interfaces.NormalizedStringSimilarity;
import info.debatty.java.stringsimilarity.interfaces.NormalizedStringDistance;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import net.jcip.annotations.Immutable;

// system variables
public class JSnippet extends AbstractJSnippet {
    // Fields for input columns
    /** Input column: "NAMES TWO" */
    public String c_NAMESTWO;
    /** Input column: "NAMES ONE" */
    public String c_NAMESONE;

    // Fields for output columns
    /** Output column: "Jaccard" */
    public Double out_Jaccard;

    // Your custom variables:

    // expression start
    public void snippet() throws TypeException, ColumnException, Abort {
        // Enter your code here:

        public class Jaccard extends ShingleBased implements
                MetricStringDistance, NormalizedStringDistance,
                NormalizedStringSimilarity {

            /**
             * The strings are first transformed into sets of k-shingles (sequences of k
             * characters), then Jaccard index is computed as |A inter B| / |A union B|.
             * The default value of k is 3.
             *
             * @param k
             */
            public Jaccard(final int k) {
                super(k);
            }

            /**
             * The strings are first transformed into sets of k-shingles (sequences of k
             * characters), then Jaccard index is computed as |A inter B| / |A union B|.
             * The default value of k is 3.
             */
            public Jaccard() {
                super();
            }

            /**
             * Compute jaccard index: |A inter B| / |A union B|.
             * @param s1
             * @param s2
             * @return
             */
            public final double similarity(final String c_NAMESTWO, final String c_NAMESONE) {
                Map<String, Integer> profile1 = getProfile(c_NAMESTWO);
                Map<String, Integer> profile2 = getProfile(c_NAMESONE);

                Set<String> union = new HashSet<String>();
                union.addAll(profile1.keySet());
                union.addAll(profile2.keySet());

                int inter = 0;
                for (String key : union) {
                    if (profile1.containsKey(key) && profile2.containsKey(key)) {
                        inter++;
                    }
                }

                return 1.0 * inter / union.size();
            }

            /**
             * Distance is computed as 1 - similarity.
             * @param s1
             * @param s2
             * @return
             */
            public final double distance(final String c_NAMESTWO, final String c_NAMESONE) {
                out_Jaccard = return 1.0 - similarity(c_NAMESTWO, c_NAMESONE);
            }
        }
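For reference, the computation described in the Javadoc comments above (split each string into k-character shingles, then take |A inter B| / |A union B|) can be sketched in plain Java without the library. This is only a minimal sketch and not the library's implementation: the class name JaccardSketch, the method names, and the handling of identical strings and of strings shorter than k are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, not part of the library or of KNIME.
class JaccardSketch {

    // Collect the distinct k-character substrings (shingles) of s.
    static Set<String> shingles(String s, int k) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= s.length(); i++) {
            result.add(s.substring(i, i + k));
        }
        return result;
    }

    // Jaccard index: |A inter B| / |A union B| over the shingle sets.
    static double similarity(String s1, String s2, int k) {
        // Assumption: identical strings are treated as fully similar.
        if (s1.equals(s2)) {
            return 1.0;
        }
        Set<String> a = shingles(s1, k);
        Set<String> b = shingles(s2, k);

        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        // Assumption: if neither string yields a shingle, return 0
        // instead of dividing by zero.
        if (union.isEmpty()) {
            return 0.0;
        }

        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        return (double) inter.size() / union.size();
    }

    // Distance is 1 - similarity, as in the Javadoc above.
    static double distance(String s1, String s2, int k) {
        return 1.0 - similarity(s1, s2, k);
    }
}
```

With a helper like this available to the snippet (for example packaged in a jar added under Additional Libraries, which is an assumption about the setup), the snippet body would reduce to a single assignment such as out_Jaccard = JaccardSketch.similarity(c_NAMESONE, c_NAMESTWO, 3);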
I am also attaching the workflow.
Best Regards