There are two columns here: document name and campaign name. The above table provides a sample of the mapping. I now have many documents that are not mapped to a campaign, with a list of unmapped documents on one side and a list of possible campaigns on the other.
We can do the (fuzzy) matching manually by using clues such as similarity in the names. The typical things I look for are the year and quarter, the region (APAC, AMER, EMEA), some of the company narratives, and the type of campaign (EVT, for example, means event).
How can I do this fuzzy matching in KNIME in a scalable way, using a method that also gives me an accuracy score? I have used the manipulation nodes in KNIME but haven’t had a chance to play with the text processing or fuzzy match nodes yet.
I am not sure whether this is better treated as a supervised learning problem (I can build a training set of about 5000 entries) or with a fuzzy matching approach. Any guidance will be appreciated.
I have KNIME version 3.3.2 (not the most recent), so I may need nodes compatible with this version. I can’t upgrade because we use KNIME Server with this version.
You can use the String Matcher or Similarity Search nodes to compute the distance between each document and each campaign. Both nodes will output the category value with the minimum distance.
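To illustrate the "minimum distance" idea outside KNIME, here is a minimal Python sketch. It assumes plain Levenshtein (edit) distance on lower-cased names; the actual nodes may use different options or a different distance function. The function names and the sample strings are made up for illustration.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def closest_campaign(document: str, campaigns: List[str]) -> Tuple[str, int]:
    """Return the campaign name with the smallest distance, and that distance."""
    scored = [(levenshtein(document.lower(), c.lower()), c) for c in campaigns]
    distance, name = min(scored)
    return name, distance


# Hypothetical names, only to show the call.
campaigns = ["FY17Q1 APAC EVT Summit", "FY16Q4 EMEA Webinar"]
print(closest_campaign("FY17Q1_APAC_EVT_Summit", campaigns))
```

Every document is compared with every campaign, so the cost grows with the product of the two list sizes.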
Please find attached a sample workflow for your data. You might also want to check the knime://EXAMPLES/08_Other_Analytics_Types/01_Text_Processing/09_Fuzzy_String_Matching workflow on the EXAMPLES server.
Thank you for the response. I tried out the workflow and the String Matcher node. At a quick look, it seemed to identify the correct matches.
How do I actually verify whether the matches are correct? How do I choose a threshold value?
I took a dataset of documents already correctly matched to campaign names and looked at the distances between them. They ranged from 0 to 50, and 80% of the pairs had a Levenshtein distance of 20 or below.
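The percentile check described above can be sketched in Python as follows. This is only an illustration: the distances are invented, the nearest-rank percentile is one of several possible definitions, and the function names are hypothetical. Note that a threshold derived from correct pairs alone only says how many true matches would be kept; judging false matches would also need distances for pairs known to be wrong.

```python
import math
from typing import List


def percentile_threshold(distances: List[int], pct: int) -> int:
    """Nearest-rank percentile: smallest value covering pct% of the distances."""
    ordered = sorted(distances)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def share_accepted(distances: List[int], threshold: int) -> float:
    """Fraction of known-correct pairs whose distance is within the threshold."""
    return sum(d <= threshold for d in distances) / len(distances)


# Invented distances for pairs that are known to be correct matches.
known_good = [0, 5, 10, 15, 20, 25, 30, 40, 45, 50]
threshold = percentile_threshold(known_good, 80)
print(threshold, share_accepted(known_good, threshold))
```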
Is there a better way to do the checks using other KNIME nodes?
I know this is an old thread, but I am stuck with the same problem: within each row, I need to tell how close the match is between two columns.
I looked at the attached example and outside of the original table with two columns to compare, I am lost. Why do we need to create random rows, why do we need to look for matches in the whole table? I am not sure I understand. All I want is to tell if the value A is 100% or x% close to value B on the same row.
I am sorry, I am really not understanding how this works. I understand which nodes you are pointing me to, but not how to make them work and make sense. If I have one table with two columns and I need to know whether, on Row 1, Column 1 is 25% close to Column 2, what do I need to do?
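For what it is worth, the row-by-row comparison asked about here can be sketched in a few lines of Python. This is not a KNIME node configuration, just an illustration of the idea: it uses `difflib.SequenceMatcher` from the standard library, whose ratio is a different measure from Levenshtein distance, and the function name and sample rows are made up.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def row_similarity_percent(a: str, b: str) -> float:
    """Similarity of two strings as a percentage (100.0 means identical)."""
    return round(100 * SequenceMatcher(None, a, b).ratio(), 1)


def score_rows(rows: List[Tuple[str, str]]) -> List[float]:
    """Compare Column 1 with Column 2 on the same row only, one score per row."""
    return [row_similarity_percent(col1, col2) for col1, col2 in rows]


# Invented rows: each tuple is (Column 1, Column 2) of one table row.
rows = [
    ("FY17Q1 APAC Summit", "FY17Q1 APAC Summit"),
    ("FY17Q1 APAC Summit", "FY16Q4 EMEA Webinar"),
]
print(score_rows(rows))
```

No cross-join or extra rows are needed for this case, because each value is only compared with the value next to it.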