String manipulation and similarity comparing

ahortonmilsig · May 20, 2022, 7:40pm

Hi,

I have to different tables of data.
Table 1 is human input
Table 1

Table 2 is REAL name

I am wanting to try and pull the closest matches from Table 2 to Table 1. There will likely be multiple for each which is absolutely fine!
I was thinking I need to have each word in the cell enclosed in “” so that I can do string similarity but I am not sure if that is the best route or not either.

Any guidance would be much appreciated.

duristef · May 21, 2022, 9:57am

hi @ahortonmilsig
My suggested workflow

A - remove groups of chars from both columns in order to maximize (potential) similarity. Give the resulting columns the same name. For example, in Table 1 you could remove chars between parentheses and “inc.”, “co”, “llc”,… using the String Manipulation node

strip(
	regexReplace(
		regexReplace(
			$Epic_Name_10004$, "^(.*?)\\(.*\\).*$", "$1"
		),
		"^(.+?) (co\\.*|llc\\.*|inc\\.*)(\\b| )*$", "$1"
	)
)

In Table 2 you should remove everything after “-”

strip(regexReplace($soldto_txtsh$, "^(.*?)\\-.*$", "$1"))

B - Assuming you have now 2 columns both named “new_text”, you can use them in a Similarity Search node configured for example like this

and this is the resulting table

system · August 19, 2022, 9:57am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.