Data normalization for a huge CSV file

ini_scott · January 8, 2019, 5:15pm

Hi Guys,

I have a list of search terms that users enter on our site. I am trying to find a way to “normalise” the data and group similar terms together, e.g. if one user searches for “segretary” and another for “seggretary” or something similar, I would like to be able to say they are part of the same family.

The difficulty is that the search terms are vast and varied, so it is difficult to aggregate them manually. Are there any pros out there who can help me find a way to do this?

Thank you in advance.

deicide_bg · January 8, 2019, 6:36pm

You need to classify those terms and map each group of variants to one term. There is no workaround for that. If the difference is just one letter, you can easily apply a mask for searching purposes. But for normalization, you need to map each unique term to a number that identifies its group.
Of course, you can try word2vec embeddings, but I can’t say how precise that would be. GL!
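
A minimal sketch of the mapping step, in case it helps. It is not a definitive method: it assumes the terms are already loaded into a Python list, uses the standard library's difflib.SequenceMatcher as the similarity measure (a swapped-in technique, not word2vec), and treats the most frequent spelling in each group as the canonical one. The function names and the 0.8 threshold are illustrative assumptions that would need tuning on real data.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def group_terms(terms, threshold=0.8):
    """Map every distinct term to a canonical term (hypothetical helper).

    Terms are visited from most to least frequent, so the most common
    spelling of a group becomes its representative. The threshold is an
    assumption and should be tuned on real search logs.
    """
    # Lowercase and trim so trivial variants collapse before comparison.
    counts = Counter(t.strip().lower() for t in terms if t.strip())
    representatives = []  # canonical terms found so far
    mapping = {}          # distinct term -> canonical term
    for term, _ in counts.most_common():
        for rep in representatives:
            if similarity(term, rep) >= threshold:
                mapping[term] = rep
                break
        else:
            # No existing group is close enough: start a new one.
            representatives.append(term)
            mapping[term] = term
    return mapping


searches = [
    "secretary", "secretary", "secretary", "segretary", "seggretary",
    "accountant", "accountant", "acountant",
]
for term, canonical in group_terms(searches).items():
    print(f"{term} -> {canonical}")
```

Each distinct term is compared against every existing group representative, so on a very large file this gets slow. Deduplicating and counting the terms first (as above) usually shrinks the problem a lot; beyond that, some form of bucketing, such as only comparing terms of similar length, would probably be needed.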