Jaccard string similarity Java Snippet

mauuuuu5 · October 24, 2016, 4:58am

Hi everybody, I am trying to calculate the Jaccard string similarity between the values of two name columns using the Java Snippet node. I am using this library, which I added in the Additional Libraries tab.

By calculating the Jaccard index, I mean adding it to the table as a new column.

I am aware of the string similarity node that comes with the Palladian package, but I still want to use the Jaccard index.

This is the code I am using in the node, which is slightly modified from this GitHub code:

I would be grateful if someone could help me get this code running.

// Your custom imports:

import info.debatty.java.stringsimilarity.interfaces.MetricStringDistance;
import info.debatty.java.stringsimilarity.interfaces.NormalizedStringSimilarity;
import info.debatty.java.stringsimilarity.interfaces.NormalizedStringDistance;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import net.jcip.annotations.Immutable;

// system variables
public class JSnippet extends AbstractJSnippet {
  // Fields for input columns
/** Input column: "NAMES TWO" */
  public String c_NAMESTWO;
/** Input column: "NAMES ONE" */
  public String c_NAMESONE;

  // Fields for output columns
/** Output column: "Jaccard" */
  public Double out_Jaccard;

// Your custom variables:

// expression start
    public void snippet() throws TypeException, ColumnException, Abort {
// Enter your code here:

public class Jaccard extends ShingleBased implements
        MetricStringDistance, NormalizedStringDistance,
        NormalizedStringSimilarity {

    /**
     * The strings are first transformed into sets of k-shingles (sequences of k
     * characters), then Jaccard index is computed as |A inter B| / |A union B|.
     * The default value of k is 3.
     *
     * @param k
     */
    public Jaccard(final int k) {
        super(k);
    }

    /**
     * The strings are first transformed into sets of k-shingles (sequences of k
     * characters), then Jaccard index is computed as |A inter B| / |A union B|.
     * The default value of k is 3.
     */
    public Jaccard() {
        super();
    }

    /**
     * Compute jaccard index: |A inter B| / |A union B|.
     * @param s1
     * @param s2
     * @return
     */
    public final double similarity(final String c_NAMESTWO, final String c_NAMESONE) {
        Map<String, Integer> profile1 = getProfile(c_NAMESTWO);
        Map<String, Integer> profile2 = getProfile(c_NAMESONE);

        Set<String> union = new HashSet<String>();
        union.addAll(profile1.keySet());
        union.addAll(profile2.keySet());

        int inter = 0;

        for (String key : union) {
            if (profile1.containsKey(key) && profile2.containsKey(key)) {
                inter++;
            }
        }

        return 1.0 * inter / union.size();
    }


    /**
     * Distance is computed as 1 - similarity.
     * @param s1
     * @param s2
     * @return
     */
    public final double distance(final String c_NAMESTWO, final String c_NAMESONE) {
        out_Jaccard = return 1.0 - similarity(c_NAMESTWO, c_NAMESONE);
    }
}

I am also attaching the workflow.

Best Regards

jaccard_forum.knwf

marco_ghislanzoni · October 25, 2016, 11:46am

Hi,

you had a number of issues in your workflow, from importing the wrong library (the latest version is 0.19) to not using it properly within your Java Snippet.

Anyway, the attached version does what you were asking for and works as expected. You need to get the proper jar library from here: https://github.com/tdebatty/java-string-similarity/releases
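
For readers who cannot open the attachment, the computation itself can be sketched without any dependency. This is only a minimal sketch, not the library's code and not the contents of the fixed workflow: the class name `JaccardSketch` and its method names are made up for illustration, and returning 1.0 when neither string yields a shingle is an assumption.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of java-string-similarity or KNIME.
final class JaccardSketch {

    // Collect the distinct k-character substrings (k-shingles) of s.
    static Set<String> shingles(String s, int k) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= s.length(); i++) {
            result.add(s.substring(i, i + k));
        }
        return result;
    }

    // Jaccard index: |A inter B| / |A union B| over the two shingle sets.
    static double similarity(String s1, String s2, int k) {
        Set<String> a = shingles(s1, k);
        Set<String> b = shingles(s2, k);

        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            // Assumption: two strings shorter than k are treated as identical.
            return 1.0;
        }

        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);

        return (double) inter.size() / union.size();
    }
}
```

With these definitions, `JaccardSketch.similarity("ABCDE", "ABCDF", 3)` gives 0.5 (2 shared shingles out of 4 distinct ones). Inside the Java Snippet, with the jar added under Additional Libraries, the body should reduce to a single call on the library's own `info.debatty.java.stringsimilarity.Jaccard` class, along the lines of `out_Jaccard = new Jaccard(3).similarity(c_NAMESONE, c_NAMESTWO);`, rather than a pasted copy of the class definition.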

Cheers,
Marco.

jaccard_forum_fixed.knwf

mauuuuu5 · October 25, 2016, 11:02pm

Thank you Marco, I managed to calculate the Jaccard index with your fix.

Best Regards