Introducing the exorbyte M|BOX Partner Extension for KNIME

Approximate Matching & Index-based Retrieval in KNIME (exorbyte Nodes)

Exact joins are a bottleneck in many KNIME workflows.

As soon as data becomes even slightly inconsistent (typos, different encodings, missing fields), classical joins and rule-based pipelines start to break down.

We’ve released a set of KNIME nodes that address exactly this layer:

👉 exorbyte matchmaker toolbox (M|BOX)

What it actually does (technical view):

| Builds an in-memory index over structured or semi-structured data
| Executes approximate queries against the entire index
| Supports multi-attribute matching with configurable weighting

Returns:

| best match
| similarity score
| optional alignment information
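To make "multi-attribute matching with configurable weighting" concrete, here is a minimal Python sketch. It is not how M|BOX works internally: the function names, the field names, and the use of `difflib.SequenceMatcher` as the per-field similarity are all assumptions chosen for illustration. It only shows how a weighted score over several fields can yield a best match plus a similarity score.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from the standard library (an assumed stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_score(query: dict, record: dict, weights: dict) -> float:
    """Weighted average of per-field similarities; weights are normalized by their sum."""
    total = sum(weights.values())
    return sum(
        weight * field_similarity(query.get(field, ""), record.get(field, ""))
        for field, weight in weights.items()
    ) / total


def best_match(query: dict, records: list, weights: dict):
    """Score every record (no blocking, no preselection) and return (record, score)."""
    scored = [(weighted_score(query, record, weights), record) for record in records]
    score, record = max(scored, key=lambda pair: pair[0])
    return record, score


# Hypothetical example data
records = [
    {"name": "Jon Smith", "city": "Berlin"},
    {"name": "Jane Doe", "city": "Munich"},
]
weights = {"name": 0.7, "city": 0.3}
match, score = best_match({"name": "John Smith", "city": "Berlin"}, records, weights)
print(match["name"], round(score, 3))
```

The linear scan in `best_match` is only there to keep the sketch short; it says nothing about how an index would answer the same query.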

𝗡𝗼 𝗯𝗹𝗼𝗰𝗸𝗶𝗻𝗴, 𝗻𝗼 𝗰𝗮𝗻𝗱𝗶𝗱𝗮𝘁𝗲 𝗽𝗿𝗲𝘀𝗲𝗹𝗲𝗰𝘁𝗶𝗼𝗻, 𝗻𝗼 𝗽𝗿𝘂𝗻𝗶𝗻𝗴.

Core nodes:

Table Indexer
→ builds a multi-field index (e.g. name, address, id fragments)
Table Index Matcher
→ queries the index with fuzzy logic across all fields
Approximate String Matcher
→ pairwise similarity (Levenshtein, LCS, positional methods)
Character Mapper
→ normalization layer (diacritics, variants, encoding issues)
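For readers unfamiliar with the terms above, the following Python sketch illustrates two of the ideas in generic form: a normalization step that strips diacritics, and a Levenshtein-based pairwise similarity. Both functions are textbook versions written for this example; they are assumptions for illustration and not the implementation behind the Character Mapper or Approximate String Matcher nodes.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters and drop combining marks (e.g. 'ü' -> 'u').

    Letters without a Unicode decomposition, such as 'ß' or 'ø', are left as-is.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping only two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(strip_diacritics("Müller"))        # Muller
print(levenshtein("kitten", "sitting"))  # 3
```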

What’s different compared to typical KNIME approaches:

Not a join → index-based retrieval problem
Not ML → deterministic, explainable scoring
Not preprocessing-heavy → works on dirty data directly

Where this becomes relevant:

Identity resolution across heterogeneous sources
KYC / sanctions screening pipelines
OCR / ICR post-processing (error-tolerant lookup)
Product or entity matching without stable identifiers

Typical workflow pattern:

Normalize input (optional)
Build index once
Query repeatedly
Post-process matches (thresholding, routing, enrichment)
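The four steps above can be sketched in a few lines of Python. Everything here is hypothetical: `ToyIndex` is a stand-in that merely stores normalized keys and scans them, the thresholds in `route` are arbitrary, and `difflib.SequenceMatcher` is an assumed similarity measure. The point is the shape of the pipeline (normalize, build once, query repeatedly, route by score), not the matching engine.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Step 1 (optional): lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


class ToyIndex:
    """Hypothetical stand-in for an index object, built once and queried many times."""

    def __init__(self, values):
        # Step 2: build once
        self.entries = [(normalize(value), value) for value in values]

    def query(self, text: str):
        # Step 3: return (score, original value) of the best-scoring entry
        key = normalize(text)
        return max(
            ((SequenceMatcher(None, key, k).ratio(), original)
             for k, original in self.entries),
            key=lambda pair: pair[0],
        )


def route(score: float, accept: float = 0.9, review: float = 0.7) -> str:
    """Step 4: threshold the score into a routing decision (thresholds are assumptions)."""
    if score >= accept:
        return "accept"
    if score >= review:
        return "review"
    return "reject"


index = ToyIndex(["Acme Corporation", "Globex GmbH", "Initech Ltd"])
for raw in ["acme  corporatoin", "Globex", "Unrelated Name"]:
    score, match = index.query(raw)
    print(raw, "->", match, route(score))
```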

If you’re working on anything where:

joins are failing
rules are exploding
or preprocessing becomes the main workload

this might be a useful addition to your KNIME setup.

:backhand_index_pointing_right: Extension + example workflows:

https://lnkd.in/eqk8giDa


you may want to

  • fix your font
  • fix your list icons (|)
  • explain the advantage of your paid nodes compared to using String Similarity, String Cleaner and Value Lookup directly. The only thing I see straight away is that you enable processing of more than one column(-pair), which should not be much more than a loop with assigned weights per column(-pair).

Can’t drag and drop the extension. Also, how is this different from the extension you previously posted (which did install)?

@fe145f9fb2a1f6b
Hi,

thanks a lot for the feedback. We really appreciate you taking the time.

Regarding the Hub page: we’re currently in the process of updating the extension. The landing page on the KNIME Community Hub is unfortunately not fully under our control. We’ve already reached out to the KNIME team about the font and formatting issues, and they’re working on improving it.

On your main point (comparison to String Similarity + Cleaner + Value Lookup):

You’re right that you can replicate parts of this with loops and multiple nodes, but that approach doesn’t scale well in terms of performance, architecture, or maintainability.

What we’re doing differently is shifting from pairwise comparison logic to index-based retrieval:

  • Instead of comparing every row against every candidate (looping), we build an in-memory Index Object once and query it efficiently
  • This removes the need for candidate generation, blocking, and manual pruning entirely
  • Matching is executed against the full dataset with sublinear lookup behavior, not quadratic joins
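As a generic illustration of "build an index once, then query it" versus comparing a query against every row, here is a small BK-tree in Python. This is a well-known textbook structure for edit-distance lookups and is used here purely as an assumption for illustration; nothing in this thread states that M|BOX uses a BK-tree. Note that a BK-tree does skip subtrees internally via the triangle inequality, but it does so without losing any match within the tolerance.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


class BKTree:
    """Each node is (word, children), with children keyed by distance to the node's word."""

    def __init__(self, words):
        self.root = None
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # duplicate, already indexed
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query: str, max_dist: int):
        """Return (sorted matches, number of distance evaluations performed)."""
        results, evaluations = [], 0
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = levenshtein(query, word)
            evaluations += 1
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: only these edges can lead to matches
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results), evaluations


words = ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]
tree = BKTree(words)                      # build once
results, evaluations = tree.search("bok", 1)  # query repeatedly
print(results, f"{evaluations} of {len(words)} entries compared")
```

How many comparisons are saved depends heavily on the data and the tolerance, so the printed count is only indicative.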

Also preprocessing (cleaning, normalization) is optional; the matching is designed to be fault-tolerant on raw data.

I would like to invite you to have a look at this workflow, where we compare the performance of the KNIME String Matcher with the exorbyte Term Indexer and Matcher:

String Indexing and Matching in KNIME – KNIME Community Hub

@rfeigel

Thanks for the feedback!

We’ve just released a new version (1.2.4) of the extension. This update significantly expands the previous version.

We’ve added new capabilities including table-level matching and alias handling, and now cover three levels of matching:

Term Matching – edit-distance-based matching on single tokens
Phrase Matching – subword-aware matching for multi-word text
Table Matching – multi-field matching for real-world entity resolution

You can already access the latest update directly via the KNIME Partner Update Site inside KNIME Analytics Platform.

We’re currently waiting for the KNIME team to finalize updates on the Hub page. Once that’s done, we’ll follow up with a proper announcement and fully aligned documentation.

In the meantime, you can explore the example workflows here:

Table Matching

Phrase Matching

Term Matching

Alias Handling

Thank you again,
Ahmad from exorbyte

I tried to install from the update site and get this. 1.2.4 doesn’t seem to be available.


Comparing the String Matcher against the Term Index Matcher while excluding the Term Indexer is obviously not a proper comparison.
But indeed, many KNIME nodes use rather old or inefficient methods: easy to understand, but not scaling well.

Thank you for pointing this out! @rfeigel

Regarding version 1.2.4, we’re currently coordinating with the KNIME team to have it published on the update site as soon as possible.

At the moment, the latest available version should be 1.2.3. So if you’re seeing 1.2.2, something is wrong.

Could you please try uninstalling and reinstalling the extension, or refreshing your update sites? That should resolve the issue.

You’re right, in terms of node count, the comparison isn’t entirely fair. But that’s exactly the point: this isn’t a node-to-node replacement, it’s a different execution model where the index is the central component.

What we introduce is:

  • Configurable Indexing as a first-class step
  • Clear separation between index build and query execution
  • Deterministic retrieval instead of repeated pairwise comparisons

Once the index is built, queries run efficiently against it, which is where the real scaling advantage comes in, especially for repeated lookups and larger, dirty datasets.

You shouldn’t announce a new extension before it’s available. It’s very frustrating.

We apologize for the frustration. The problem is now solved, and you can access the latest version (v1.2.4) via the KNIME update site for both macOS and Windows.

Thanks for your patience, and let us know if you run into any issues.

Any specific reason you, according to your statement, do not provide this for Linux?

I installed 1.2.4 on Windows 11 KAP 5.11.0. When I open one of the example workflows all of the nodes are locked. It also locks all of the nodes in any other workflow I have open.