Fuzzy lookup/joiner (Regex?)

Sjoerd · February 3, 2022, 9:49pm

Hi everyone,

So I have a challenge in KNIME where I have two datasets that I want to compare/join based on a keyfield. So far so good, you can use the joiner. However…this is not a regular join.

I have wildcards in my data. Here are some random examples:

Data column A (for example):
123456
ABC123
000000-A

Column B (to join):
1234X6
ABCXX3
XXXXXX-A

The X represent a wildcard number, meaning they can be numbers from 0 to 9. Now what I want to achieve is to cmpare column A and B. However like this they will not match of course. I am looking for a method to tell KNIME that the X are wildcards so that eventually KNIME will match these.

eg. the last entry from column B should match with column A, because it meets the criteria of the wildcard.

I hope this makes sense? Has anyone already encountered such a problem and how did you solve it? Looking forward to your ideas!

gonhaddock · February 3, 2022, 10:29pm

Hello @Sjoerd
You can concatenate all your key-fields with the OR separator:
Column A:
123456|ABC123|000000-A

Yu can also negate them to include ALL but NOT
Column A:
(?!.*123456|ABC123|000000-A).*

BR

Sjoerd · February 3, 2022, 11:38pm

Hi thanks for you comment. Maybe I did not explain my problem clear enough, as I think we are talking about different things here.

So I have two streams of data, one which is my masterdata and another one with that has data I want to lookup (through a joiner (usually):

My master data has a column we’ll call Column A containing thousands of lines with partnumbers.
My other stream of data has a column we’ll call Column B, containg thousands of lines with partnumbers but also some that have wildcards. There is also an additional column which is the data that I want to retrieve from it to join to my masterfile.

However if I do the Joiner node (keys: Column A & Column B) it will of course not match the partnumbers with wildcards in them. So I am looking for a method to account for wildcards, represented with an “X”. X being any possibility from 0 to 9.

I hope this makes sense, otherwise I’ll make a screenshot in KNIME tomorrow

aworker · February 4, 2022, 8:45am

Hi @Sjoerd

Maybe the following solution does the job:

20220204 Pikairos Fuzzy lookup joiner.knwf (32.3 KB)

Be aware that a -Cross Joiner- is quite memory greedy so for a very big joining, I would add extra tricks. Please let us know if this is the case and we will suggest further improvement.

Hope it helps.

Best

Ael

Sjoerd · February 4, 2022, 9:09am

Good idea! I will try this out

aworker · February 4, 2022, 9:35am

Great

Please be aware that my -Rule-based Row Splitter- comment in previous image was inverted. It should be

$Reference$ LIKE $Target$ => TRUE
TRUE => FALSE

Best

Ael

system · February 11, 2022, 9:35am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.