Java Snippet iterate row calculation with String Similarity

mauuuuu5 · July 16, 2024, 3:33pm

Hi everyone,

Some of you have already helped me in a previous post regarding the usage of the Java library called java-string-similarity, which can be downloaded here.

The previous post helped me select the maximum between two matching text similarity scores. However, the columns “c_NAMESONE” and “c_NAMESTWO” must be organized so that each name in “c_NAMESONE” is compared with each row in “c_NAMESTWO”.

To achieve this, these columns must be organized to include all possible combinations for cross-referencing the data. I currently achieve this using the “Cross Joiner” node, but this method significantly increases the number of records.

Therefore, I am looking to code a more resource-efficient approach to achieve the following:

1. Calculate matching scores for the first row of the “c_NAMESONE” table against the entire “c_NAMESTWO” table.
2. Store the results in a temporary table, keeping only scores greater than 80% (“out_Max” > 80%).
3. Repeat this process for each subsequent row in the “c_NAMESONE” table, appending results to the same temporary table.

I have attempted to implement the above solution, but I am encountering some errors. I would greatly appreciate any assistance.

Interation among rows.knwf (23.2 KB)

Thank you

mauuuuu5 · July 21, 2024, 5:07pm

Hi everyone, when reviewing the code I see a few errors (noted in the comments below).

If anyone can help me, I would appreciate it.

Cheers

> // system imports
> import org.knime.base.node.jsnippet.expression.AbstractJSnippet;
> import org.knime.base.node.jsnippet.expression.Abort;
> import org.knime.base.node.jsnippet.expression.Cell;
> import org.knime.base.node.jsnippet.expression.ColumnException;
> import org.knime.base.node.jsnippet.expression.TypeException;
> import static org.knime.base.node.jsnippet.expression.Type.*;
> import java.util.Date;
> import java.util.Calendar;
> import org.w3c.dom.Document;
> 
> 
> // Your custom imports:
> 
> import info.debatty.java.stringsimilarity.*;
> //import org.knime.core.data.*;
> //import org.knime.core.data.def.*;
> //import java.util.*;
> 
> 
> // system variables
> public class JSnippet extends AbstractJSnippet {
>   // Fields for input columns
>   /** Input column: "NAMES TWO" */
>   public String c_NAMESTWO;
>   /** Input column: "NAMES ONE" */
>   public String c_NAMESONE;
>   // Fields for input flow variables
>   /** Input flow variable: "Number Rows" */
>   public Integer v_NumberRows;
> 
>   // Fields for output columns
>   /** Output column: "Jaccard" */
>   public Double out_Jaccard;
>   /** Output column: "RatcliffObershelp" */
>   public Double out_RatcliffObershelp;
>   /** Output column: "Max" */
>   public Double out_Max;
> 
> // Your custom variables:
> 
> Jaccard jInstance = new Jaccard(2);
> RatcliffObershelp roInstance = new RatcliffObershelp();
> 
> 
> // expression start
>     public void snippet() throws TypeException, ColumnException, Abort {
> // Enter your code here:
> 
> 
> // Temporary table to store results
> List<Row> tempResults = new ArrayList<>(); // List|Row|ArrayList cannot resolved to a type
> 
> 
> // Iterate over each row of c_NAMESONE
> for (int i = 0; i < v_NumberRows; i++) {
>     String nameOne = c_NAMESONE.get(i); // The method get(int) is undefined for the type java.lang.String
>     
>     // Iterate over each row of c_NAMESTWO
>     for (int j = 0; j < v_NumberRows; j++) {
>         String nameTwo = c_NAMESTWO.get(j); // The method get(int) is undefined for the type java.lang.String
>         
>         // Calculate similarities
>         out_Jaccard = jInstance.similarity(c_NAMESONE, c_NAMESTWO);
>         out_RatcliffObershelp = roInstance.similarity(c_NAMESONE, c_NAMESTWO);
>         out_Max = Math.max(out_Jaccard, out_RatcliffObershelp);
>         
>         // Save to temporary table if scoreMax is greater than 80%
>         if (out_Max > 0.80) {
>             Row resultRow = new Row(); // Row cannot resolved to a type
>             resultRow.setValue("NAMES_ONE", nameOne);
>             resultRow.setValue("NAMES_TWO", nameTwo);
>             resultRow.setValue("Jaccard", out_Jaccard);
>             resultRow.setValue("RatcliffObershelp", out_RatcliffObershelp);
>             resultRow.setValue("Max", out_Max);
>             tempResults.add(resultRow);
>         }
>     }
> }
> 
> // Assign the filtered results to the output
> for (int k = 0; k < tempResults.size(); k++) {
>     Row tempRow = tempResults.get(k); // Row cannot resolved to a type
>     c_NAMESONE.set(k, tempRow.getValue("NAMES_ONE"));
>     c_NAMESTWO.set(k, tempRow.getValue("NAMES_TWO"));
>     out_Jaccard.set(k, tempRow.getValue("Jaccard"));
>     out_RatcliffObershelp.set(k, tempRow.getValue("RatcliffObershelp"));
>     out_Max.set(k, tempRow.getValue("Max"));
> }

takbb · July 22, 2024, 12:44pm

Hi @mauuuuu5 ,

I took a look at your java code, and what I realised is you are misunderstanding how the java snippet works.

You are assuming that the Java Snippet is called once and that you can then iterate over all the rows in the table. Unfortunately, it doesn’t behave that way.

The java snippet code in the section entitled “// Your custom variables:” is invoked once when the node is executed but has no direct access to the table data.

The snippet code in the section entitled “// Enter your code here:” is invoked once for each row in the data table. On each invocation it has access ONLY to the one row for which it is being invoked, plus any Java variables that have been defined in the “// Your custom variables:” section.

It is therefore not possible to write a piece of code to “loop through the row values” like you are doing.
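That execution model can be sketched in plain Java. This is only an illustration of the shape, not the KNIME API: `RowSnippet`, `runOverTable`, `out_Score` and `bigramJaccard` are hypothetical names, and the hand-written bigram Jaccard is a stand-in for the library call so that the example runs without the jar.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a Java Snippet: NOT the KNIME API, just the same shape.
class RowSnippet {
    // Play the role of input column fields: filled in by the caller before each call.
    String c_NAMESONE;
    String c_NAMESTWO;
    // Plays the role of an output column field: read by the caller after each call.
    Double out_Score;

    // Equivalent of "Your custom variables": initialised once, survives across rows.
    int rowsSeen = 0;

    // Equivalent of "Enter your code here": sees only the current row's values.
    void snippet() {
        rowsSeen++;
        out_Score = bigramJaccard(c_NAMESONE, c_NAMESTWO);
    }

    // Jaccard index over character bigrams (illustrative stand-in for the library).
    static double bigramJaccard(String a, String b) {
        Set<String> sa = bigrams(a);
        Set<String> sb = bigrams(b);
        if (sa.isEmpty() && sb.isEmpty()) {
            return 1.0;
        }
        Set<String> inter = new HashSet<>(sa);
        inter.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) inter.size() / union.size();
    }

    static Set<String> bigrams(String s) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            out.add(s.substring(i, i + 2));
        }
        return out;
    }

    // Driver that mimics the node: one snippet() call per row, never the whole table.
    static double[] runOverTable(String[][] rows) {
        RowSnippet s = new RowSnippet();
        double[] scores = new double[rows.length];
        for (int r = 0; r < rows.length; r++) {
            s.c_NAMESONE = rows[r][0];
            s.c_NAMESTWO = rows[r][1];
            s.snippet();
            scores[r] = s.out_Score;
        }
        return scores;
    }
}
```

The point of the sketch is that `snippet()` never receives an index or a list: a call such as `c_NAMESONE.get(i)` has nothing to index into, because the field only ever holds the current row's single String.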

In terms of specific compilation errors:

The first error, about the List and ArrayList classes, arises because in Java these are contained in the java.util package. For the Java Snippet to recognise them, you’ll need to include the following import statement:

import java.util.*;

You have commented this out, so you’ll need to uncomment it.

You also have lines of code referring to a “Row” class, which is undefined.

If you were able to do the processing that you need in the java snippet, then you’d probably want to change this to one of the java Map classes such as HashMap, so the line would instead be:

List<HashMap<String,Object>> tempResults = new ArrayList<HashMap<String,Object>>();

But that’s kind of irrelevant, since the Java Snippet cannot work the way you want it to here.
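For reference only, here is a minimal standalone sketch of how a list of maps could stand in for the undefined “Row” class in ordinary Java (outside the Java Snippet). The class name `TempResults` and the method `keepAboveThreshold` are hypothetical; the map keys are the ones used in the posted code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

// Hypothetical helper: collects scored pairs as maps instead of "Row" objects.
class TempResults {
    // Keeps only the entries whose max score is strictly above the threshold.
    static List<HashMap<String, Object>> keepAboveThreshold(
            String[] namesOne, String[] namesTwo, double[] max, double threshold) {
        List<HashMap<String, Object>> tempResults = new ArrayList<HashMap<String, Object>>();
        for (int i = 0; i < max.length; i++) {
            if (max[i] > threshold) {
                HashMap<String, Object> row = new HashMap<String, Object>();
                row.put("NAMES_ONE", namesOne[i]);
                row.put("NAMES_TWO", namesTwo[i]);
                row.put("Max", max[i]); // autoboxed to Double
                tempResults.add(row);
            }
        }
        return tempResults;
    }
}
```

Values come back as `Object`, so a caller would cast them, for example `(Double) row.get("Max")`.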

Now, in terms of your algorithm: yes, a Cross Joiner will significantly increase the number of records (its output is, after all, the product of the row counts of the two tables being cross-joined). But your algorithm would be doing (almost) the same thing, since it ultimately compares every row with every other row, although the memory footprint may be smaller.

If you are trying to keep the memory footprint down, one approach would be to keep your original java code from your referenced post, and adapt the workflow around it.

Let’s assume you then have a workflow something like this.

I have used a Column Splitter to divide your demo table into two tables. One table has NAMES ONE and the other has NAMES TWO.

The Cross Joiner matches all rows to all rows, then the Java Snippet calculates a score for each row, and finally a Duplicate Row Filter can be used to find the best scoring row for each NAMES ONE value, for example.
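The last step, keeping the best-scoring row per NAMES ONE value, amounts to the following logic. This is a plain-Java sketch of the idea, not what the Duplicate Row Filter node does internally; `BestMatch`, `Scored` and `bestPerNameOne` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "best scoring row for each NAMES ONE value".
class BestMatch {
    // One scored pair, as it would look after the cross join and the scoring step.
    static final class Scored {
        final String one;
        final String two;
        final double max;

        Scored(String one, String two, double max) {
            this.one = one;
            this.two = two;
            this.max = max;
        }
    }

    // For each distinct "one" value, keep the row with the highest max score.
    // On ties, the row encountered first is kept.
    static Map<String, Scored> bestPerNameOne(List<Scored> rows) {
        Map<String, Scored> best = new LinkedHashMap<>();
        for (Scored r : rows) {
            Scored current = best.get(r.one);
            if (current == null || r.max > current.max) {
                best.put(r.one, r);
            }
        }
        return best;
    }
}
```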

For a small data set this workflow is fine, but if you had 1000 rows in each table, the Cross Joiner would produce a dataset of 1000000 rows, which is where you are running into a problem.

An alternative, then, is to put the Cross Joiner inside a chunk loop. The chunk loop processes one row at a time from NAMES ONE and cross joins that row to NAMES TWO. So at any one time, you run the Java Snippet against 1000 rows, rather than 1000000.

Of course, the loop then iterates 1000 times, so for smaller data sets there is a performance penalty for looping. For larger data sets, though, looping may be more efficient than the additional resource-swapping that a 1000000-row table may require.
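The trade-off can be illustrated with a small plain-Java simulation of the chunked approach. It is a sketch under assumptions, not the workflow itself: `ChunkedMatch` and its fields are hypothetical, and `score` is a deliberately simple positional comparison standing in for the real similarity measures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of "cross join inside a chunk loop".
class ChunkedMatch {
    int comparisons = 0;                           // total similarity calls made
    int peakRows = 0;                              // most joined rows held at any one time
    final List<String[]> kept = new ArrayList<>(); // pairs scoring above the threshold

    // Stand-in score: share of positions holding the same character.
    static double score(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        int same = 0;
        for (int i = 0; i < Math.min(a.length(), b.length()); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return (double) same / longest;
    }

    void run(List<String> namesOne, List<String> namesTwo, double threshold) {
        for (String one : namesOne) {                 // chunk loop: one row per iteration
            List<String[]> chunk = new ArrayList<>(); // that row cross joined with all of namesTwo
            for (String two : namesTwo) {
                chunk.add(new String[]{one, two});
            }
            peakRows = Math.max(peakRows, chunk.size());
            for (String[] pair : chunk) {             // per-row scoring, as the snippet would do
                comparisons++;
                if (score(pair[0], pair[1]) > threshold) {
                    kept.add(pair);                   // append survivors to the running result
                }
            }
        }
    }
}
```

With 1000 names on each side, `comparisons` would still reach 1000000, but `peakRows` would stay at 1000: the work is the same, only the number of joined rows held at once shrinks.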

I have uploaded a demo workflow for the above.

Interation among rows - takbb.knwf (158.1 KB)

You’ll see that I’ve also included my component “Download JAR file” to make your workflow more portable. This downloads the jar file from the web address you mentioned and places it in the workflow’s ‘data’ folder. Within the Java Snippet you can then reference the additional library as:
knime://knime.workflow/data/java-string-similarity-2.0.0.jar

Incidentally in your flow, the Row Filter and Table Row to Variable nodes are not required:

The Extract Table Dimension node already returns flow variables as well as rows. You are not the first to do this!

mauuuuu5 · July 23, 2024, 1:03pm

Hi,

I appreciate your help and the time and effort taken to solve this issue. Certainly, the approach you proposed is better than coding anything else in the Java snippet node.

Thank you again.
