@Karim_Amarouche these things come to mind:
- There is actually not that much data to predict a precise salary amount. You might be better off rounding the numbers and maybe using them in groups, depending on what your goal is.
- You also rely heavily on one-to-many transformations for the categorical data. There might be other options *1). Some data can also be interpreted as numeric, like age or years of experience. A model might benefit from having a real number (maybe an ordinal rank) instead of an unordered category: 10 years of experience is significantly more than 2, and that order carries meaning. Maybe use the midpoint of a ‘categorical’ range (2-10 years can be 6 or so).
- There might be more information in the descriptions of the roles. Maybe try to extract topics or industries from there, or extract a set of keywords that you can standardize and assign to each case.
- From what I saw you also left out the additional benefits. Especially for managers they might form a relevant part of the compensation, so leaving them out might mislead the model into thinking that someone with 20+ years of experience in a senior position earns less, when the rest is in the extras in this dataset (edit: just saw you used the whole number)
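
To make the first three points a bit more concrete, here is a minimal Python sketch (in KNIME it could sit in a Python Script node, or be rebuilt with native nodes). It is not taken from your workflow: the function names, the label formats like "2-10 years", the band width of 10,000 and the keyword vocabulary are all assumptions for illustration, so adjust them to what is actually in your data.

```python
import re


def range_midpoint(label):
    """Turn a range label such as '2-10 years' into its midpoint (6.0).

    Assumption: labels contain one or two plain numbers. Labels with a
    single number (e.g. '20+ years') fall back to that number, and labels
    without any number return None.
    """
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", label)]
    if not numbers:
        return None
    if len(numbers) == 1:
        return numbers[0]
    return (numbers[0] + numbers[1]) / 2


def salary_band(salary, width=10_000):
    """Round a salary down to the lower edge of its band (width is a guess)."""
    return int(salary // width) * width


def find_keywords(description, vocabulary):
    """Return the vocabulary terms that occur as whole words in a description."""
    text = description.lower()
    return sorted(
        term for term in vocabulary
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
    )


# Hypothetical example values, not rows from the actual dataset.
experience = range_midpoint("2-10 years")
band = salary_band(73_500)
topics = find_keywords(
    "Senior Data Engineer for a fintech startup",
    {"data", "fintech", "retail"},
)
```

The midpoint gives the model an ordered number instead of one column per category, and the band turns the target into something coarser that a small dataset can support. Whether you then treat the bands as classes or still run a regression on them depends on your goal.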
More examples of how to deal with regression models here:
*1) Some more advanced data preparation can be done, for example, with vtreat. I have code and an article about that:
If you want to learn more about machine learning, there are some great KNIME resources out there: