Approach - fuzzy match or supervised learning?

Hi,

I have a question on how I can solve this using KNIME.

Document name | Campaign name
APAC_SP-BusTran_EMM_16Q3_SD-WAN_Wave 1_AW_TY_ASEAN_Product | APAC_SP-BusTran_EMM_16Q3_SD-WAN_Wave 1_AW_EN_ASEAN
EMEA_ECT-SmrtCldInt_EVT_17Q4_SummitParis-Invite07.11 | EMEA_ECT-SmrtCldInt_EVT_17Q4_Self-Driving-Network-Summits
EMEA_DC-Cross_EVT_16Q4_Hilversum-Attendee-Reminder | EMEA_DC_EVT_2016_Summit-NLv2
Q316 EMEA Tech Summit 12-14.07 - Lady’s cocktail party | 2016-EMEA-Events-Partner-E10v2
2017Q3 Open Lab Sept12 TY edm | 2017Q3 Open Lab Sept12 TY edm - Batch
EMEA_DC-Auto_EMM_16Q2_Gartner Newsletter Issue 2 eDM | 2016-EMEA-Newsletters-Customer-E10-Multi
2016Q3 APAC Free Your Future 26 August | 2016Q3 APAC Free Your Future 26 August

There are two columns here: document name and campaign name. The table above shows a sample of the mapping. I now have many documents that are not mapped to a campaign, with a list of unmapped documents on one side and a list of possible campaigns on the other.

We can do the (fuzzy) matching manually using clues in the names. The typical things I look for are the year and quarter, the region (APAC, AMER, EMEA), some of the company narratives, and the type of campaign (EVT, for example, is an event).
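To make those clues concrete outside KNIME, here is a minimal Python sketch that pulls the region and the year/quarter out of a name. The regular expressions and the `extract_clues` helper are my own assumptions, based only on the formats visible in the sample table (`16Q3`, `2017Q3`, `Q316`); names that carry only a year, such as `EMEA_DC_EVT_2016_Summit-NLv2`, return no quarter.

```python
import re

# Region codes mentioned in the post; must not be glued to other letters.
REGION = re.compile(r"(?<![A-Za-z])(APAC|AMER|EMEA)(?![A-Za-z])")
# Year-first quarter tokens such as "16Q3" or "2017Q3".
YEAR_FIRST = re.compile(r"(?<!\d)(?:20)?(\d{2})Q([1-4])(?!\d)")
# Quarter-first tokens such as "Q316".
QUARTER_FIRST = re.compile(r"(?<![A-Za-z0-9])Q([1-4])(\d{2})(?!\d)")


def extract_clues(name):
    """Return the region and (year, quarter) found in a name, or None for each."""
    region_match = REGION.search(name)
    region = region_match.group(1) if region_match else None

    quarter = None
    match = YEAR_FIRST.search(name)
    if match:
        quarter = (2000 + int(match.group(1)), int(match.group(2)))
    else:
        match = QUARTER_FIRST.search(name)
        if match:
            quarter = (2000 + int(match.group(2)), int(match.group(1)))
    return {"region": region, "quarter": quarter}


print(extract_clues("Q316 EMEA Tech Summit 12-14.07"))
print(extract_clues("APAC_SP-BusTran_EMM_16Q3_SD-WAN_Wave 1_AW_EN_ASEAN"))
```

Clues like these could be used to restrict the candidate campaigns (same region, same quarter) before any string distance is computed.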

How can I do this fuzzy matching in KNIME in a scalable way, with a method that also gives me an accuracy score? I have used the manipulation nodes in KNIME but haven’t had a chance to play with the text processing or fuzzy match nodes yet.

I’m not sure whether this is more of a supervised learning problem (I can build a training set of about 5000 entries) or a fuzzy matching one. Any guidance will be appreciated.

I have KNIME version 3.3.2 (not the latest), so I may need nodes compatible with this version. I can’t upgrade because we use KNIME Server with this version.

Hi shalinirs,

You can use the String Matcher or Similarity Search nodes to compute the distance between each document and each campaign. Both nodes will output the category value with the minimum distance.
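For readers who want to see the idea outside KNIME, here is a minimal Python sketch of "compute the distance to every campaign and keep the minimum". It uses a plain Levenshtein distance; the function names are hypothetical, and this is not a description of how the KNIME nodes are implemented.

```python
def levenshtein(a, b):
    """Edit distance counting single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def closest_campaign(document, campaigns):
    """Return (campaign, distance) for the campaign nearest to the document name."""
    scored = ((campaign, levenshtein(document, campaign)) for campaign in campaigns)
    return min(scored, key=lambda pair: pair[1])


campaigns = [
    "2017Q3 Open Lab Sept12 TY edm - Batch",
    "2016Q3 APAC Free Your Future 26 August",
]
print(closest_campaign("2017Q3 Open Lab Sept12 TY edm", campaigns))
```

Comparing every document with every campaign grows with the product of the two list sizes, so for large lists it may help to narrow the candidates first (for example by region).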

Please find attached a sample workflow for your data. You might also want to check the knime://EXAMPLES/08_Other_Analytics_Types/01_Text_Processing/09_Fuzzy_String_Matching workflow on the EXAMPLES server.

Best,
Anna

FuzzyStringMatching.knwf (15.4 KB)


Hi Anna,

Thank you for the response. I tried out the workflow and the String Matcher node. At a quick look, it seemed to identify the correct matches.
How do I verify whether the matches are actually correct, and how do I choose a threshold value?
I took a dataset of documents already correctly matched to campaign names and looked at the distances between them. They ranged from 0 to 50, and 80% of the pairs had a Levenshtein distance of 20 or below.
Is there a better way to do these checks using other KNIME nodes?

There is also the String Similarity node. I usually compare names by 3-gram overlap and treat a similarity above 0.8 as a match.
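As a rough illustration of the 3-gram idea, here is a minimal Python sketch. It assumes the overlap is measured as the Jaccard ratio between the sets of character 3-grams of the two names; the String Similarity node may define its n-gram measure differently, so scores will not necessarily agree with KNIME.

```python
def char_ngrams(text, n=3):
    """Set of lowercase character n-grams in text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-grams: 0.0 (disjoint) to 1.0 (same n-gram set)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        # Both strings are shorter than n: fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def is_match(a, b, threshold=0.8):
    """Treat a pair as matched when the similarity exceeds the threshold."""
    return ngram_similarity(a, b) > threshold


print(ngram_similarity("2017Q3 Open Lab Sept12 TY edm",
                       "2017Q3 Open Lab Sept12 TY edm - Batch"))
```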

Hi shalinirs,

You can use the Optimisation Loop nodes to vary the threshold and choose the one which yields the best accuracy (or other classification performance measures).
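The same threshold search can be sketched in a few lines of Python. This is only an illustration of the idea, not what the loop nodes do internally: the `scored_pairs` values below are made up for the example, and each pair is assumed to be a distance plus a flag saying whether that pairing is really correct.

```python
def accuracy_at(threshold, scored_pairs):
    """Share of pairs classified correctly when 'distance <= threshold' means match."""
    correct = sum((distance <= threshold) == is_correct
                  for distance, is_correct in scored_pairs)
    return correct / len(scored_pairs)


def best_threshold(scored_pairs, candidates):
    """Candidate threshold with the highest accuracy (the first one on ties)."""
    return max(candidates, key=lambda t: accuracy_at(t, scored_pairs))


# Made-up example data: (distance, is this pairing truly correct?)
scored_pairs = [
    (0, True), (8, True), (15, True), (22, True),
    (18, False), (30, False), (41, False), (47, False),
]
candidates = range(0, 51, 5)

threshold = best_threshold(scored_pairs, candidates)
print(threshold, accuracy_at(threshold, scored_pairs))
```

Note that the labelled set needs wrong pairings as well as correct ones; with correct matches only, the largest threshold always looks best. It is also safer to pick the threshold on one part of the labelled data and report accuracy on another part.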

Best,
Anna

Hello guys,
I know this is an old thread, but I am stuck with the same problem: within each row, I need to tell how close the match between two columns is.
I looked at the attached example and outside of the original table with two columns to compare, I am lost. Why do we need to create random rows, why do we need to look for matches in the whole table? I am not sure I understand. All I want is to tell if the value A is 100% or x% close to value B on the same row.
Suggestions?

See here


I am sorry, I am really not understanding how this works. I understand which nodes you are pointing me to, but I don’t understand how to make them work and make sense. If I have one table with two columns and need to answer whether, in Row 1, Column 1 is 25% close to Column 2, what do I need to do?
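To show what a row-wise comparison means independently of any KNIME node, here is a minimal Python sketch. It scores each row's two values with `difflib.SequenceMatcher` from the standard library and scales the result to a percentage; this measure is an assumption chosen for illustration and will not necessarily give the same numbers as a KNIME similarity node.

```python
from difflib import SequenceMatcher


def percent_similarity(a, b):
    """Similarity of two strings as a percentage (100.0 means identical)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


# Each tuple is one row: (column 1, column 2). No other rows are involved.
rows = [
    ("2016Q3 APAC Free Your Future 26 August", "2016Q3 APAC Free Your Future 26 August"),
    ("2017Q3 Open Lab Sept12 TY edm", "2017Q3 Open Lab Sept12 TY edm - Batch"),
]
for left, right in rows:
    print(f"{percent_similarity(left, right):6.1f}%  {left!r} vs {right!r}")
```

The point is that each score depends only on the two values in the same row, so there is no need to search the whole table for the best match.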

And it worked! I was missing the Palladian nodes. Thank you very much!
