Fuzzy matching strings: compute comparison scores and show only rows with a match of 97% or more

mshahn02 · May 16, 2025, 7:29pm

Hi Team,
I have a requirement to calculate a fuzzy match as described below.

I have a list of vendor names in one column.
A record ID is generated for each vendor name.
Each vendor name is then compared with every other vendor name, and two more columns are created: MatchScore and MatchScore_vendor_name.
If two vendor names match exactly, MatchScore and MatchScore_vendor_name are 100%.
Once the match scores are generated, the data should be filtered to show only rows with a MatchScore of 97% or more.

rfeigel · May 16, 2025, 9:26pm

It would be a lot easier to help if you provide some sample data.

fe145f9fb2a1f6b · May 17, 2025, 1:32pm

Use the Similarity Search node.

mshahn02 · May 17, 2025, 1:57pm

Fuzzy_logic_data.xlsx (9.9 KB)

mshahn02 · May 17, 2025, 2:02pm

I tried Similarity Search but it didn't work out… Please find attached sample data.
Fuzzy_logic_data.xlsx (9.9 KB)

Any reference implementation would be appreciated… I have provided the input data and also the expected output we are supposed to get.

rfeigel · May 17, 2025, 3:11pm

Try this. Uses String Similarity node with Levenshtein similarity measure. To use any of the similarity nodes you have to compare two strings. I accomplished this by cross joining the vendor columns. You should study how Levenshtein similarity works to make sure it meets your needs.
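
For readers who want to see what "cross join plus Levenshtein similarity" amounts to outside KNIME, here is a minimal Python sketch. The vendor names are made up, and dividing by the length of the longer string is an assumed normalization; the String Similarity node may normalize differently.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical vendor names, not taken from the attached spreadsheet.
vendors = ["Acme Corp", "Acme Corp.", "Globex Ltd"]

# Cross join: every name paired with every name, including itself.
pairs = [(left, right, similarity(left, right))
         for left, right in product(vendors, repeat=2)]

for left, right, score in pairs:
    print(f"{left!r} vs {right!r}: {score:.2f}")
```

Because the cross join pairs every name with itself and lists each pair in both orders, the result contains self-matches and mirrored duplicates that usually need to be filtered out afterwards.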

mshahn02 · May 17, 2025, 3:32pm

Thank you for the quick reply, but the only input we have is Vendor Name, which you can use in your Excel Reader; the remaining columns are derived from this input. Also, I am on KNIME version 5.4.2, which doesn't offer the String Similarity node for installation.

rfeigel · May 17, 2025, 4:12pm

You need to install the Palladian extension. I don’t understand the rest of your message.

mshahn02 · May 17, 2025, 4:41pm

Hi rfeigel,

Please find attached the input file we need to use, along with the expected output.
Fuzzy-output.xlsx (9.1 KB)
Fuzzy-Input.xlsx (9.2 KB)

rfeigel · May 18, 2025, 12:30am

Try this. It appears that you’re trying to compare the first and last eleven rows although you didn’t explain that. If not, I’m totally lost. I also can’t understand how you expect to produce the output file other than the similarities. There’s not enough information to produce the other columns. Finally, you’ll need to install the Palladian extension to use the String Similarity node.

mshahn02 · May 18, 2025, 12:54am

I am sorry for the confusion. Let me try this one. And yes, I am comparing vendor names with one another to see which ones match very closely.

rfeigel · May 18, 2025, 12:58am

One more time: are you comparing the first eleven rows to the last eleven rows, or do you want to compare every vendor name to every other vendor name? If the former, my second workflow should work. If the latter, my first workflow does that.

mlauber71 · May 18, 2025, 6:01am

@mshahn02 if you want to group similar names from a single column you could try to adapt this example where you would not compare to a ground truth.

You also might want to formulate your request in a more detailed way, as @rfeigel has suggested. If you do not want to do this in English, you could write it in a language you are familiar with and then translate it with a current LLM like ChatGPT or with DeepL.

mlauber71 · May 19, 2025, 6:27pm

@mshahn02 I built this workflow, which tries to group the vendor names without a ground truth.

Maybe you can check if this suits your needs. You can set the match threshold to any value between 0 and 100.
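
As a rough illustration of grouping names without a ground truth, the Python sketch below links any two names whose score reaches a threshold and then collects the linked names into groups. It is not a description of the workflow above: the vendor names are hypothetical, and `difflib.SequenceMatcher` from the standard library is used as a stand-in similarity measure.

```python
from difflib import SequenceMatcher
from itertools import combinations


def score(a: str, b: str) -> float:
    """Similarity between 0 and 100 (stand-in measure, case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100


def group_names(names: list[str], threshold: float) -> list[list[str]]:
    """Put names in one group when a chain of matches >= threshold links them."""
    parent = list(range(len(names)))

    def find(i: int) -> int:
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if score(names[i], names[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two groups

    groups: dict[int, list[str]] = {}
    for index, name in enumerate(names):
        groups.setdefault(find(index), []).append(name)
    return list(groups.values())


# Hypothetical vendor names.
vendors = ["Acme Corp", "ACME Corp.", "Globex Ltd", "Globex Ltd.", "Initech"]
groups = group_names(vendors, threshold=90)
for group in groups:
    print(group)
```

Lowering the threshold merges more names into the same group, and raising it splits them apart, so it is worth trying a few values against names you already know belong together.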

Ahmad_Vh · August 21, 2025, 10:05am

You can do this with the Approximate String Matcher node from the Exorbyte MatchMaker extension. It computes edit distance (or similarity) between names and lets you score & filter pairs.

How the example works

1. Clean names (optional: lowercase, trim, strip punctuation).
2. Self-join the list into left/right and run Approximate String Matcher:
   - Keys: Vendor Name ↔ Vendor Name
   - Algorithm: Levenshtein (or Positional/LCS)
   - Output: distance (or similarity) + carry IDs/names from both sides
3. Convert distance → percentage:

   Similarity% = (1 - distance / max(len(name1), len(name2))) * 100

4. Filter Similarity% ≥ 97.
5. Remove self-matches and deduplicate symmetric pairs (A–B vs B–A) by grouping on (min(ID1, ID2), max(ID1, ID2)).
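
The same steps can be sketched in plain Python to make the arithmetic concrete. This is only an illustration under assumptions: the record IDs and vendor names are invented, the cleaning rules are one possible choice, and the Levenshtein function is a textbook implementation rather than the one used by the KNIME node.

```python
import re
from itertools import product


def clean(name: str) -> str:
    """Optional cleaning: lowercase, drop punctuation, collapse whitespace."""
    name = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", name).strip()


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity_pct(a: str, b: str) -> float:
    """Similarity% = (1 - distance / max(len(a), len(b))) * 100."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else (1 - levenshtein(a, b) / longest) * 100


# Hypothetical records: (record ID, vendor name).
vendors = [(1, "Acme Corp"), (2, "ACME Corp."),
           (3, "Globex Ltd"), (4, "Globex Limited")]

matches = {}
for (id1, name1), (id2, name2) in product(vendors, repeat=2):  # self-join
    if id1 == id2:
        continue  # remove self-matches
    pct = similarity_pct(clean(name1), clean(name2))
    if pct >= 97:
        # A-B and B-A collapse onto the same (min ID, max ID) key.
        matches[(min(id1, id2), max(id1, id2))] = pct

for (id1, id2), pct in sorted(matches.items()):
    print(id1, id2, f"{pct:.1f}")
```

With this formula a single-character difference only reaches 97% when the longer name has at least 34 characters, so for short vendor names the cleaning step largely decides whether a pair passes the filter.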

I’ve shared a ready-to-run workflow here:
Vendor Name Similarity Scoring Workflow [forum-post-88014] – KNIME Community Hub