Regression Model for Salary Prediction

Hey everyone!

Explanation
I’ve just started to get into KNIME and found an interesting dataset on salary data.
(Source: A real-world, messy dataset to practice on | R-bloggers)
Now I’m trying to train a regression model to predict salary based on different nominal and ordinal parameters (e.g. Industry, Age-Class, US-State, etc.).

I’ve tried a bunch of different regression types such as Linear, Polynomial, Tree Ensemble, Gradient Boosted and Random Forest, as well as feature selection loops on all of them.
But I can’t get an R^2 value above 0.3 or a mean absolute percentage error below 35%.
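
For anyone who wants to double-check these two metrics outside of KNIME, here is a minimal sketch in plain Python. The salary numbers are made up for illustration and are not taken from the dataset, and the function names are my own.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Made-up example values, not from the salary dataset.
actual = [50000.0, 60000.0, 80000.0, 120000.0]
predicted = [55000.0, 58000.0, 90000.0, 100000.0]

print(round(r_squared(actual, predicted), 3))
print(round(mape(actual, predicted), 3))
```

Note that MAPE weights errors on small salaries more heavily than the same absolute error on large salaries, so the two metrics can move in different directions.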

Question
So I wanted to ask if maybe someone more experienced could take a look at my workflow and see if there’s anything I could improve? The dataset is probably less than ideal, considering it has no numeric data except for salary and doesn’t include (usable) data on job titles, but maybe there is something I could do to improve my results?

Here’s the link to my workflow :slight_smile:

I’m looking forward to your feedback!

Best regards
Karim

PS: I’m still super new so please excuse any rookie mistakes :')

@Karim_Amarouche these things come to mind:

  • Actually, there is not that much data to predict a precise salary amount. Depending on what your goal is, you might be better off rounding the numbers and maybe using them in groups.
  • You also rely heavily on one-to-many transformations for the categorical data. There might be other options *1). Also, some data can be interpreted as numeric, like age or years of experience. A model might benefit from having a real number (an ordinal rank, maybe) instead of a fixed label. 10 years of experience is significantly more than 2, and that difference carries meaning. Maybe use the midpoint of a ‘categorical’ column (2-10 years can be 6 or so).
  • There might be more information in the descriptions of the roles. Maybe try to extract topics or industries from there, or extract a set of keywords that you might be able to standardize and assign to each case.
  • Also, from what I saw, you left out the additional benefits. Especially for managers, they might form a relevant part of their compensation, so leaving them out might mislead the model into thinking that someone with 20+ years of experience in a senior position earns less, when the rest is in the extras in this dataset (edit: just saw you used the whole amount).
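
To make the first three points a bit more concrete, here is a rough sketch in plain Python (it could live in a KNIME Python Script node, but that is just one option). The labels, the band width, and the keyword list are assumptions for illustration; the real dataset's category labels and job titles will need their own handling.

```python
import re


def range_midpoint(label):
    """Turn a range label such as '2 - 4 years' into a number (its midpoint).

    Labels with a single number ('41 years or more') return that number;
    labels without any number return None.
    """
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", label)]
    if not numbers:
        return None
    if len(numbers) == 1:
        return numbers[0]
    return (numbers[0] + numbers[1]) / 2


def salary_band(salary, width=10000):
    """Group a salary into a band of the given width, e.g. '60000-69999'."""
    lower = int(salary // width) * width
    return f"{lower}-{lower + width - 1}"


# Hypothetical keyword list; extend and standardize it for the real job titles.
TITLE_KEYWORDS = ("senior", "manager", "director", "junior", "intern")


def title_keywords(title):
    """Return the standardized keywords that occur as whole words in a job title."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    return [keyword for keyword in TITLE_KEYWORDS if keyword in words]


print(range_midpoint("2 - 4 years"))
print(salary_band(67500))
print(title_keywords("Senior Project Manager"))
```

Whether the midpoint or a simple rank (1, 2, 3, ...) works better is something to test; tree-based models mostly care about the ordering, while linear models are sensitive to the spacing.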

More examples of how to deal with regression models here:

*1) Some more advanced data preparation can be done, for example, with vtreat. I have code and an article about that:

If you want to learn more about machine learning, there are some great KNIME resources out there:

Thanks a lot!!

I’ll go through your suggestions and see if I can improve anything :slight_smile:
Especially your point regarding the categorical variables referring to years sounds really promising!

Best regards
Karim
