FUZZY MATCH ALTERNATIVES

jmanuelml21 · January 18, 2023, 6:30pm

Good day, forum users,

I have the following problem; hopefully, as you’re the forum’s analytics gods, you will be merciful with this poor guy’s problem.
I have a very big list of string inputs (1000k+). These inputs should have been standardized and entered only by choosing from a list of 1k options, but things did not go as planned and the data is now a mess: each input is written with different characters, random digits, missing letters, words in a different order, etc.

I need to narrow the inputs down to the original 1k option list.
At first I thought fuzzy matching would be my solution, but for that I need the “dictionary” (the 1k list) to match the different inputs against.
Now I think I need to use string manipulation and rebuild the 1k option list manually, checking the inputs one by one. So far I’ve tried string manipulation to remove digits and specific words, plus a GroupBy node, but my selection is still way too big to go through manually.
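For readers following along outside KNIME, the "remove digits and specific words, then group" step described above can be sketched in Python. This is only an illustration under assumptions: the `NOISE_WORDS` set, the function names, and the sample strings are hypothetical, and the regular expression keeps ASCII letters only, so accented characters would need extra handling.

```python
import re
from collections import defaultdict

# Hypothetical noise words; replace with the words you actually strip.
NOISE_WORDS = {"ltd", "inc"}

def normalize(text):
    """Lowercase, drop digits and punctuation, remove noise words, sort tokens."""
    text = text.lower()
    # Anything that is not an ASCII letter or whitespace becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [t for t in text.split() if t not in NOISE_WORDS]
    # Sorting the tokens makes the key independent of word order.
    return " ".join(sorted(tokens))

def group_by_key(inputs):
    """Map each normalized key to the raw inputs that collapse onto it."""
    groups = defaultdict(list)
    for raw in inputs:
        groups[normalize(raw)].append(raw)
    return dict(groups)

# Made-up sample data, for illustration only.
sample = ["Acme Tools 123", "tools ACME", "acme-tools ltd", "Bolt Supply", "supply bolt 9"]
groups = group_by_key(sample)
for key, members in groups.items():
    print(key, "<-", members)
```

Sorting the tokens handles the "words in different order" case; missing letters still need a fuzzy step on the much smaller set of distinct keys.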

COULD YOU PLEASE HELP ME!?!?
How would you approach this problem?

Many thanks in advance, guys!

goodvirus · January 18, 2023, 6:57pm

Hi, have you tried the String Similarity node?

mlauber71 · January 18, 2023, 7:01pm

@jmanuelml21 here is an example of how you could identify duplicates and bring them together without having a ‘ground truth’. The workflow will try to determine the groups for itself.
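To make the idea of grouping without a "ground truth" concrete, here is a minimal Python sketch. It is not the linked workflow: it assumes a simple greedy strategy built on the standard library's `difflib.SequenceMatcher`, and the `threshold` of 0.8, the function names, and the sample values are all hypothetical choices.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()

def greedy_groups(values, threshold=0.8):
    """Put each value into the first group whose representative is similar enough.

    The first value seen in a group acts as its representative.
    """
    groups = []  # list of (representative, members) tuples
    for value in values:
        for rep, members in groups:
            if similarity(value.lower(), rep.lower()) >= threshold:
                members.append(value)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append((value, [value]))
    return groups

# Made-up sample data, for illustration only.
sample = ["paracetamol", "paracetamo", "parcetamol", "ibuprofen", "ibuprofeno"]
result = greedy_groups(sample)
for rep, members in result:
    print(rep, "<-", members)
```

Each value is compared against every group representative, so on a list of this size it is worth running it on deduplicated, pre-cleaned strings rather than on all raw inputs.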

Then you can do more address matching and deduplication with these examples:

izaychik63 · January 18, 2023, 7:22pm

@jmanuelml21 , look at this trail:

The approach is somewhat simpler. If you prefer a different metric, just connect a String Distance node to the Similarity Search node.

You can connect the workflow from the trail above.
The idea is to compare the list against itself.
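As a rough illustration of comparing a list against itself, the Python sketch below finds, for every entry, its closest other entry in the same list. It assumes Levenshtein edit distance as the metric, which is only one possible choice; the function names and sample values are hypothetical, and this is not a reproduction of the KNIME nodes.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def nearest_neighbours(values):
    """For each value, return (closest other value, distance) from the same list.

    Assumes the list holds at least two entries.
    """
    result = {}
    for i, value in enumerate(values):
        others = [other for j, other in enumerate(values) if j != i]
        best = min(others, key=lambda other: levenshtein(value, other))
        result[value] = (best, levenshtein(value, best))
    return result

# Made-up sample data, for illustration only.
sample = ["madrid", "madird", "barcelona", "barcelna"]
pairs = nearest_neighbours(sample)
for value, (match, distance) in pairs.items():
    print(value, "->", match, distance)
```

This brute-force version compares every pair, so it only scales to a few thousand distinct strings; entries whose nearest neighbour is very close are candidates for the same original option.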

jmanuelml21 · January 18, 2023, 9:04pm

You mean “similarity search”?
Please let me know.

Thanks!!

izaychik63 · January 18, 2023, 9:16pm

See here

jmanuelml21 · January 19, 2023, 5:27am

Many thanks for your input. This is greatly appreciated. I’m giving your workflow a try, but it looks like it’s not as simple as I thought. I will have to go through it calmly and understand each step as I go. There are many resources which I’m not familiar with. I’ll let you know how it goes.
Again, thanks!

system · April 19, 2023, 5:27am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.