How to fix noisy data

Hi guys, I'm trying to replace text values with numbers using the Cell Replacer node. The replacement works, but somehow many values end up missing…

Like this: [image]

Reason for the missing values:
Example:
I declared “Internet” = 10.
The system is supposed to replace “Internet” with the number 10.

But some entries are spelled differently: Intranet, ethernet, internett, etc.
The system cannot match those values and replace them with 10, hence the missing data.
*This applies not only to “Internet” but also to pairs like “stove” and “stoves”, “TV” and “television”, and many more. “Internet” is just an example.

There are about 10k columns and rows in total. Is there any way to clean the data?

I will be waiting here and appreciate any suggestions! Thank you so much!

Look at example below.

Hi, if I understand correctly, the original data had some misspellings like:

Internett which should be: Internet
stoves which should be: stove
TV which should be: television

Then you replaced Internet with 10, stove with 5 and television with 4, so of course the misspelled values end up empty.
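
To make the cause concrete, here is a rough sketch in plain Python (not a KNIME node). The codes are the ones mentioned above; the sample list is made up for illustration. An exact-match lookup simply has no entry for the variant spellings:

```python
# Codes taken from the thread; the sample values are hypothetical.
codes = {"Internet": 10, "stove": 5, "television": 4}
raw_values = ["Internet", "Internett", "stove", "stoves", "television", "TV"]

# dict.get returns None for any spelling that is not an exact key,
# which is the equivalent of a missing value in the output table.
replaced = [codes.get(value) for value in raw_values]

print(replaced)  # [10, None, 5, None, 4, None]
```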

My recommendation is to clean the values before doing the replacement: you can use the GroupBy node to find the misspellings and then the Rule Engine node to replace them.
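
The same idea, sketched in plain Python in case it helps to see the logic outside of KNIME. This is only an analogue of the two nodes, not how they are implemented, and the sample column and the `corrections` rules are hypothetical:

```python
from collections import Counter

# Hypothetical sample column; the real values would come from your table.
raw_values = ["Internet", "Internett", "stove", "stoves", "TV", "television", "Internet"]

# Step 1 (GroupBy analogue): list each distinct spelling with its count,
# so the variants can be reviewed by hand.
value_counts = Counter(raw_values)

# Step 2 (Rule Engine analogue): hand-written rules that map each variant
# to its canonical spelling; values without a rule are kept unchanged.
corrections = {"Internett": "Internet", "stoves": "stove", "TV": "television"}
cleaned = [corrections.get(value, value) for value in raw_values]

# Step 3: replace the cleaned values with their numeric codes.
codes = {"Internet": 10, "stove": 5, "television": 4}
encoded = [codes.get(value) for value in cleaned]

print(encoded)  # [10, 10, 5, 5, 4, 4, 10]
```

The manual part is writing the `corrections` rules once per distinct variant, not once per row, which is why listing the unique values first keeps the effort manageable.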

Cheers

Hi, so does that mean I need to do it one by one manually? What configuration is needed? Thank you!

Hi, I built this basic workflow that helps to identify the unique values, correct them manually and assign the numeric values.

Let me know if you need more help

Cheers

Info Quality.knwf (23.9 KB)

For problems like this, a little manual work may be required, but you can combine the work with automation by using nodes like Similarity Search:

Similarity Search.knar.knwf (15.0 KB)

Notice that television won’t be captured correctly, so you may need to adjust your data or your strategy to find a better approach.

You could create a dictionary using a GroupBy strategy like the one @mauuuuu5 proposed.
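
For anyone who wants to see the principle in code, below is a minimal Python sketch of similarity matching using the standard library's `difflib`. It is an assumption-laden stand-in: the KNIME Similarity Search node may use a different distance measure, and the `suggest` helper and the `0.8` cutoff are hypothetical choices for illustration.

```python
import difflib

# Canonical spellings, e.g. taken from a reviewed list of unique values.
canonical = ["Internet", "stove", "television"]

def suggest(value, cutoff=0.8):
    """Return the closest canonical spelling, or None if nothing is similar enough."""
    # Compare case-insensitively, but return the canonical casing.
    lowered = {name.lower(): name for name in canonical}
    matches = difflib.get_close_matches(value.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

for raw in ["internett", "stoves", "TV"]:
    print(raw, "->", suggest(raw))
```

Near-misspellings such as internett and stoves are matched, while TV shares too few characters with television to pass the cutoff. Abbreviations like that still need an explicit dictionary entry, and the suggestions themselves are worth reviewing before applying them, since a genuinely different word with a similar spelling can also pass the cutoff.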

I recommend getting rid of Television in the way shown below.
