Approximate String Matcher – Powerful Fuzzy Matching in KNIME

Hi everyone,

We’ve recently published a new community node on the KNIME Community Hub:

Approximate String Matcher.
This node allows you to calculate string similarity scores between a reference table and a comparison table, with flexible matching algorithms and filtering options.

Key features:

  • Choose from algorithms such as Levenshtein Edit Distance, Longest Common Subsequence, or Positional Matching.
  • Apply a user-defined threshold to decide which rows are considered a match.
  • Match logic: every column in the comparison table is checked against every row in the reference table in a logical OR fashion, and the best-scoring match decides whether a row counts as matched.
  • Flexible output: return matching rows, non-matching rows, or all rows.
  • Optional extra columns:
    • Numeric match value
    • Best reference match
    • Alignment string showing modifications to align comparison with reference
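To make the scoring and match logic above more concrete, here is a minimal Python sketch of threshold-based fuzzy matching. This is an illustration only, not the node's internal implementation: the `similarity` normalization and the function names are our own assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            # min of deletion, insertion, substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance to a 0..1 similarity score (hypothetical scheme)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def best_match(value: str, references: list[str], threshold: float):
    """OR over all reference rows: the best-scoring reference is decisive.

    Returns (best_reference, score) if the score clears the threshold,
    otherwise (None, score) so the row can be routed to the non-matching output.
    """
    best_ref, best_score = max(
        ((ref, similarity(value, ref)) for ref in references),
        key=lambda pair: pair[1],
    )
    return (best_ref, best_score) if best_score >= threshold else (None, best_score)

# A misspelled city name still clears a 0.8 threshold against its reference:
print(best_match("Stutgart", ["Stuttgart", "Hamburg"], threshold=0.8))
```

The node additionally offers Longest Common Subsequence and Positional Matching as alternative algorithms; only the distance function would change in a sketch like this.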

Example workflow
We’ve created an overview example workflow that demonstrates how the node works and how to configure it:
:link: Approximate String Matcher Overview Examples

Contact us
:globe_with_meridians: Website: https://www.exorbyte.com
:e-mail: Email: consulting@exorbyte.com

:yellow_heart: We’d love to hear your feedback, use cases, and ideas for improvement.
Feel free to try it out and let us know how it works in your workflows!


exorbyte Team


Very nice, thank you!


Thanks for your comment, Nick.
Let us know what you’re building with it.

Very nice piece of work. Congratulations!

I am struggling with the Frequency-Aware Anomaly Detection example workflow. It works perfectly with the example data set, but if you add several duplicates of a misspelled city name, it analyzes them as correct. The logic for this is pretty obvious: I don’t think the Rule Engine filter you have can accommodate the case where correct and incorrect spellings share the same count. I haven’t studied the Approximate String Matcher node carefully, so there may be a way around this. I realize the algorithms you’re employing involve no AI (not meant as a criticism).

tbh I’m not building anything with it for the near future. I don’t do string matching that often. But it’s good to know you’ve made this and it’s there if I need it.

I work with public transport data so if we do string matching usually it’s two sets of station IDs and descriptions.


Hi @rfiegel,

Thanks for trying out our node and for your great feedback!
We are glad to hear you found the example useful. :hugs:

The idea of the Frequency-Aware Anomaly Detection use case is to identify potential data entry errors by comparing the least frequent values in a column to the most frequent ones in the same dataset, using fuzzy string matching.

In your example, when you add "Stutgart" several times, it becomes just as frequent as "Stuttgart".
Since the workflow splits the dataset into “high-frequency” and “low-frequency” sets, values with equal frequency end up in the high-frequency group and are assumed to be correct.
That is why "Stutgart" is classified as “Correct”: it no longer meets the “rare” condition, so it is never compared against the most frequent set.

In real-world data, typos usually appear as outliers with a much lower frequency than the correct value.
That is why the frequency-based split works well in most scenarios: it flags rare, high-similarity values for review without overwhelming you with common, valid entries.

To catch cases like your example, you could:

  • Adjust the frequency threshold logic so that ties are handled differently (for example, require a minimum count difference before a value is moved to the high-frequency group).
  • Or run the Approximate String Matcher across the entire dataset without frequency filtering if you want to find close matches regardless of frequency.
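The first suggestion above (requiring a minimum count gap before trusting a value) can be sketched in a few lines of Python. This is a hypothetical illustration of the idea, not the example workflow itself; the function names, thresholds, and the use of `difflib` for similarity are our own assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher

def flag_suspected_typos(values, sim_threshold=0.85, min_count_gap=2):
    """Flag rare values that closely resemble a clearly more frequent value.

    A value is only compared against references that are at least
    `min_count_gap` counts more frequent, so exact frequency ties
    (like five "Stutgart" vs. five "Stuttgart") are left unflagged
    rather than silently trusted.
    """
    counts = Counter(values)
    flagged = {}
    for value, count in counts.items():
        for ref, ref_count in counts.items():
            if value == ref or ref_count - count < min_count_gap:
                continue  # reference is not clearly more frequent -> skip
            if SequenceMatcher(None, value, ref).ratio() >= sim_threshold:
                flagged[value] = ref  # likely a typo of the frequent value
                break
    return flagged

data = ["Stuttgart"] * 5 + ["Stutgart"] * 2 + ["Hamburg"] * 4
print(flag_suspected_typos(data))  # "Stutgart" flagged against "Stuttgart"
```

With a frequency tie (equal counts for the correct and misspelled variants), nothing clears the `min_count_gap`, which is exactly the edge case discussed above; raising the gap makes the check stricter, lowering it toward zero reproduces the original tie behavior.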

Hope this clarifies what is happening behind the scenes.

We really appreciate you experimenting with the workflow and sharing your thoughts.

__
Ahmad Varasteh

