Find common Words in Strings

I have two tables containing names and I want to find which entries are the same (or nearly the same). Unfortunately, the data in the second table is quite messy. Therefore I came up with the idea of matching names if they contain the same words.

Here is an example:
Table 1 might contain:
Doe, John

Table 2 might contain something like:

Doe John
John Doe
Doe, John James
Mr. John Doe, Knime Specialist

It might also contain similar entries which should not be matched, because they do not contain both “John” and “Doe”:

James Doe
Doo, John

I would like to obtain some kind of matching score. I tried the string similarity nodes, but all of them rank “Doo, John” higher than “Mr. John Doe, Knime Specialist”. This is bad for my application because I know that spelling mistakes are not very common, but permutations of words and additions (like “Knime Specialist”) are.
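
To make the problem concrete, here is a small Python sketch of the effect. It does not use the KNIME string similarity nodes; Python's standard `difflib.SequenceMatcher` is only a stand-in for a character-based similarity measure, and the helper name `similarity` is made up for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-based similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


reference = "Doe, John"
candidates = [
    "Doo, John",                       # one-letter typo, should NOT match
    "Mr. John Doe, Knime Specialist",  # same person, extra words
]

# Sort candidates from most to least similar to the reference.
ranked = sorted(candidates, key=lambda c: similarity(reference, c), reverse=True)
for c in ranked:
    print(f"{similarity(reference, c):.2f}  {c}")
```

With this measure, “Doo, John” comes out ahead of the longer entry, because a single changed character costs far less than the reordered and added words.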

Hi,
This is going to be rather difficult, as the computer has no concept of names per se, so to it there might just as well be a person with first name Knime, last name Specialist (wish that was my name, haha). This means that it cannot easily know if “Mr. John Doe, Knime Specialist” or “Mr. Arnold Schwarzenegger, Knime Specialist” are the same person or not by just looking at the text value. You could maybe improve your results by coming up with some rules, like that the name always comes first and has at most 3 words, or you could remove commas, but it is going to be a difficult task.
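
A rough Python sketch of such rules, in case it helps. Everything here is an assumption for illustration: the helper names `tokenize_name` and `leading_words`, the list of titles to drop, and the limit of 3 words. It could be adapted to a scripting node or rebuilt with string manipulation nodes.

```python
import re

# Assumed list of titles to ignore; extend it for your own data.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}


def tokenize_name(text: str) -> list[str]:
    """Lowercase the text, drop commas and other punctuation, remove titles."""
    # [^\W\d_]+ matches runs of letters (including non-ASCII letters).
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if w not in TITLES]


def leading_words(text: str, max_words: int = 3) -> list[str]:
    """Rule: the name comes first and has at most `max_words` words."""
    return tokenize_name(text)[:max_words]


print(tokenize_name("Mr. John Doe, Knime Specialist"))
print(leading_words("Mr. John Doe, Knime Specialist"))
```

Note that this simple tokenizer splits hyphenated names into separate words, and the 3-word rule still keeps “knime” in the example, which shows why rules alone only get you part of the way.
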
You can try to use our new LLM integration to solve it. I tried with GPT4all, which is nice because it is offline and does not send your data anywhere, but results are a bit inconsistent. When I use a prompt like this:

From the following text, extract only first and last name and remove the rest. The output format should be \<firstname> \<lastname>.
Text:
Doe John

I sometimes get

FirstName: Doe
LastName: John

but sometimes also

Please find below the extracted names from the given text.

First Name: John
Last Name: Doe

Maybe you can find a better prompt that works well. Otherwise, you could also use OpenAI instead of GPT4All. That should yield much better results but may cost you and sends data into the cloud.
Kind regards,
Alexander

GPT4All.knwf (79.8 KB)


KNIME supports NLP with NER, right? Via spaCy or other means. So maybe we could extract the person names first and then do the similarity search afterwards.
br

Hi, and thank you for your reply.
I agree that the computer has a hard time extracting the first and last names from the second table.
In my case, I know that the first table has clean data, which is either
last name(s)
or
last name(s), first name(s)
(plural if the person has multiple first or last names).
Therefore I want to know only the entries from the second table where each word from the first table appears. But how can I do this?
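
One possible way to express this check, as a minimal Python sketch. The function names `word_set` and `contains_all_words` are made up for this example, and splitting on anything that is not a letter is an assumption about the data.

```python
import re


def word_set(text: str) -> set[str]:
    """Set of lowercase words; commas, dots and spaces act as separators."""
    return set(re.findall(r"[^\W\d_]+", text.lower()))


def contains_all_words(clean_name: str, messy_entry: str) -> bool:
    """True if every word of the clean name appears in the messy entry."""
    needed = word_set(clean_name)
    # An empty clean name should not match everything.
    return bool(needed) and needed <= word_set(messy_entry)


table2 = [
    "Doe John",
    "John Doe",
    "Doe, John James",
    "Mr. John Doe, Knime Specialist",
    "James Doe",
    "Doo, John",
]

matches = [entry for entry in table2 if contains_all_words("Doe, John", entry)]
print(matches)
```

On the example data this keeps the first four entries and drops “James Doe” and “Doo, John”. Word order and extra words do not matter, but any spelling difference in a required word prevents a match.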

A score could be obtained by removing the matched words and looking at how long the remaining string is.
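
A sketch of that scoring idea in Python, under the assumption that the score is simply the number of letters left over after the matched words are removed (0 means nothing is left, larger means more extra text). The name `leftover_score` and the choice to return `None` for non-matches are illustrative only.

```python
import re
from typing import Optional


def word_list(text: str) -> list[str]:
    """Lowercase words in order; punctuation and spaces are separators."""
    return re.findall(r"[^\W\d_]+", text.lower())


def leftover_score(clean_name: str, messy_entry: str) -> Optional[int]:
    """Letters remaining in the messy entry after removing the clean name's words.

    Returns None if the clean name is empty or one of its words is missing.
    """
    needed = word_list(clean_name)
    remaining = word_list(messy_entry)
    if not needed:
        return None
    for word in needed:
        if word not in remaining:
            return None          # not a match at all
        remaining.remove(word)   # remove one occurrence of the matched word
    return sum(len(word) for word in remaining)


for entry in ["Doe John", "Doe, John James", "Mr. John Doe, Knime Specialist", "Doo, John"]:
    print(entry, "->", leftover_score("Doe, John", entry))
```

Lower scores indicate closer matches, so the results can be sorted by score and cut off at whatever threshold suits the data.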
