Customer Name Matching (internal name to external name)

Hello,
I work with a list of thousands of customer names. The data is from our many distributors that details who (what customer) bought what item. It is point of sale data (POS).
The customer’s name entered at the distributor for a purchase is not standardized. Yet, the customer’s name we use in our internal data (SalesForce as an example) is standardized.
We need to match the two data items (POS data to SalesForce). Thus, this is a matter related to customer name normalization but goes deeper than that (machine learning???). This is not a simple matter of removing punctuation or standardizing upper/lower case etc.
Please see example below. Any thoughts on a KNIME process are appreciated. I am somewhat new to KNIME, but I can get around. I am looking for direction on which components or which nodes to use or input from someone who has worked with a project like this one. The data is in Excel.
I am open to RapidMiner suggestions as well.
I also use Tableau Desktop and Tableau Prep Builder.
Thank you.
image

Hi @wethernt , you probably need to have a mapper/dictionary for this kind of matching, in addition to removing punctuation and standardizing case (upper/lower case).

Looking at your sample list coming from the POS, I think a mapping for the following should be enough:
image

I used the same sample input data as you have:
image

I prepare the data with some string manipulation before doing the mapping:
replaceChars(regexReplace(upperCase($column1$), " LP$", ""), ".,", "")

This is basically removing any " LP" at the end, and also removing punctuations such as dots (.) and commas(,), and changing everything to uppercase to match the case of the mapper.

After the manipulation, the prepared data looks like this:
image

And after mapping:
image

Alternatively, even the Mfg ones could be removed from the mapper, and convert the MFG to MANUFACTURING during the data preparation process before being mapped. But without seeing the rest of the data, I can’t tell if all Mfg of the should be converted or not.

And that is applicable for any of the real data that you have. That is, you will have to build the mapper according to the data that you have and also according to how much or what you want to manipulate before mapping.

The workflow looks like this:
image

Here’s the workflow: Customer Name Matching internal to external name.knwf (11.6 KB)

1 Like

Thank you @bruno29a this makes sense to me. I appreciate your detail and quick reply.

1 Like

No problem @wethernt , happy to help

Quick followup… @bruno29a
see image below. Thanks much.

Hi @wethernt , yes I manually created the mapper. There is no much option for this unfortunately, but the idea is to look at patterns that can be regrouped and also be regrouped after the data is “prepared” so that you can reduce that list.

You might not get the correct list at first attempt, but you just run the workflow with your first list, and then based on the results, see what you can tweak or what can be regrouped further.

EDIT: An additional note is that I used a LEFT JOIN also, so that those that did not match do not get discarded. So, after running your workflow, you will be able to see what did not match, and see how they can be regrouped, or “prepared” to be regrouped.

1 Like

Sorry @bruno29a, Need syntax help. In the string manipulation step, if I want to add other matters to clean, what is the syntax. For example, I want to remove anything that says “INC”. Do I need more parenthesis? Do I just add before final parenthesis?


I tried this and get error:
image

Hi @wethernt , so if you want to remove INC at the end, then you can use “INC$” as a regex.

You can either add it as another regexReplace like this:
replaceChars(regexReplace(regexReplace(upperCase($column1$), " LP$", ""), " INC$", ""), ".,", "")

Or within the same regexReplace like this:
replaceChars(regexReplace(upperCase($column1$), " LP$| INC$", ""), ".,", "")

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.