clean_Salary.knwf (94.1 KB)
Hello All,
I would like to clean a Salary Survey dataset. I’ve managed to clean some of the columns (Country, Gender, Education, Salary) and I have also dealt with the missing values. I will use this dataset for practicing classification.
I would like to hear your ideas about these three columns: Industry, Job title and Race!
Some of the job titles have a very long description and I don’t know how to approach that. It is the same with Race.
I couldn’t upload the dataset, because it is too big, but here are some pictures:
Thank you for your help in advance!
Could you explain in more detail how you’d like to “clean” these three columns?
Hello,
I would like to fix the typos and make the entries more uniform. The earlier pictures were a bad example, sorry! Here is another picture for Industry:
There are several entries that are similar but not identical. The Industry column has 995 distinct values. I would like to reduce the number of distinct values while losing as few rows as possible.
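To make the goal concrete, here is a minimal Python sketch of one way to cut down the number of distinct values without dropping any rows: normalize case and whitespace first, then lump values that are still rare into a single "other" bucket. The function names, the `min_count` threshold and the sample values are made up for illustration and are not taken from the actual dataset.

```python
from collections import Counter


def normalize(value: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return " ".join(value.strip().lower().split())


def lump_rare(values, min_count=2, other="other"):
    """Normalize every value, then replace values seen fewer than
    min_count times with a single bucket. No rows are removed."""
    normalized = [normalize(v) for v in values]
    counts = Counter(normalized)
    return [v if counts[v] >= min_count else other for v in normalized]


# Hypothetical sample, not real survey data
industries = [
    "Education", "education ", "EDUCATION",
    "Tech", "tech",
    "Underwater basket weaving",
]
cleaned = lump_rare(industries, min_count=2)
```

The row count stays the same, and the threshold controls how aggressively rare categories are merged. Typos such as "Educaton" would still need a separate fuzzy-matching step.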
For the Country column I used several String Replacer nodes, with regex where possible for the typos, and also a very long expression in a String Manipulation node (replacing mistakes and different names for the same country). I would like to know if there is a different approach for this!
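As a point of comparison with the chain of replacements, here is a rough Python sketch of a lookup-table approach: one dictionary maps known spellings to a canonical name, and `difflib.get_close_matches` from the standard library catches near-miss typos. The `CANONICAL` table, the `clean_country` helper and the `cutoff` value are assumptions for illustration only. The same logic could probably run in a KNIME Python Script node, assuming the Python integration is installed.

```python
import difflib

# Hypothetical lookup table: normalized spelling -> canonical name
CANONICAL = {
    "united states": "United States",
    "usa": "United States",
    "us": "United States",
    "united kingdom": "United Kingdom",
    "uk": "United Kingdom",
    "canada": "Canada",
}


def clean_country(raw, cutoff=0.85):
    """Return the canonical country name, or None if nothing is close enough."""
    # Normalize: trim, lowercase, drop periods, collapse whitespace
    key = " ".join(raw.strip().lower().replace(".", "").split())
    if key in CANONICAL:
        return CANONICAL[key]
    # Fall back to fuzzy matching against the known spellings
    match = difflib.get_close_matches(key, list(CANONICAL), n=1, cutoff=cutoff)
    return CANONICAL[match[0]] if match else None


print(clean_country("U.S.A."))
print(clean_country("Untied States"))
print(clean_country("Narnia"))
```

Values that come back as `None` can be reviewed by hand and added to the table, so the table grows instead of the regex chain. A high cutoff keeps false matches rare, at the cost of leaving more values for manual review.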