Explanation
I’ve just started to get into KNIME and found an interesting dataset on salary data.
(Source: A real-world, messy dataset to practice on | R-bloggers)
Now I’m trying to train a regression model to predict salary based on different nominal and ordinal parameters (e.g. Industry, Age-Class, US-State, etc.).
I’ve tried a bunch of different regression types, such as Linear, Polynomial, Tree Ensemble, Gradient Boosted and Random Forest, as well as feature selection loops on all of them.
But I can’t get an R^2 value over 0.3 or a mean absolute percentage error below 35%.
Question
So I wanted to ask if someone more experienced could take a look at my workflow and see if there’s anything I could improve. The dataset is probably less than ideal, considering it has no numeric data except for salary and doesn’t include (usable) data on job titles, but maybe there is something I could do to improve my results?
Here’s the link to my workflow
I’m looking forward to your feedback!
Best regards
Karim
PS: I’m still super new so please excuse any rookie mistakes :')
Actually, there is not that much data here to predict a precise salary amount. You might be better off rounding the numbers, or maybe grouping them into bands, depending on what your goal is.
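The rounding and grouping idea can be sketched in a few lines of Python (for example inside a KNIME Python Script node). This is only a minimal sketch: the band edges, the labels and the 5,000 rounding step are assumptions chosen for illustration, not values taken from the dataset or the workflow.

```python
import bisect

# Hypothetical band edges (in salary units); choose edges that fit the goal of the analysis.
BAND_EDGES = [40_000, 60_000, 80_000, 100_000, 150_000]
BAND_LABELS = ["<40k", "40-60k", "60-80k", "80-100k", "100-150k", "150k+"]


def salary_band(salary):
    """Return the label of the band that contains the salary."""
    # bisect_right counts how many edges are <= salary, which is the band index.
    return BAND_LABELS[bisect.bisect_right(BAND_EDGES, salary)]


def round_salary(salary, step=5_000):
    """Round a salary to the nearest multiple of step (step is an assumed value)."""
    return int(round(salary / step)) * step


if __name__ == "__main__":
    for s in (38_500, 62_400, 175_000):
        print(s, round_salary(s), salary_band(s))
```

With bands as the target the task becomes classification instead of regression, so accuracy or a confusion matrix would replace R^2 as the metric.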
Also, you rely heavily on One to Many (one-hot) transformations for the categorical data. There might be other options *1). Some of the data can also be interpreted as numeric, like age or years of experience. A model might benefit from having a real number (an ordinal rank, maybe) instead of a set of unrelated categories: 10 years of experience is significantly more than 2, and that ordering carries meaning. Maybe use the midpoint of each ‘categorical’ range (2-10 years can become 6 or so).
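To make the midpoint idea concrete, here is a minimal Python sketch that turns a range label into one number. The label formats ("2 - 4 years", "1 year or less", "41 years or more") are assumed for illustration and should be checked against the actual column values; the function name and the `open_ended_width` parameter are hypothetical as well.

```python
import re


def range_to_number(label, open_ended_width=10.0):
    """Map a range label to a single representative number.

    'a - b'            -> midpoint of a and b
    'x or less/under'  -> x / 2 (assumes the range starts at 0)
    'x or more/over'   -> x + open_ended_width / 2 (assumed width of the open top range)
    """
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", label)]
    if not numbers:
        raise ValueError(f"no number found in {label!r}")
    if len(numbers) >= 2:
        return (numbers[0] + numbers[1]) / 2
    lowered = label.strip().lower()
    if lowered.startswith("under") or "or less" in lowered:
        return numbers[0] / 2
    return numbers[0] + open_ended_width / 2


if __name__ == "__main__":
    for label in ("1 year or less", "2 - 4 years", "2-10 years", "41 years or more"):
        print(label, "->", range_to_number(label))
```

The resulting numeric column can replace the one-hot columns for that variable, which lets linear and tree-based learners use the ordering directly.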
There might be more information in the descriptions of the roles. Maybe try to extract topics or industries from there, or extract a set of keywords that you can standardize and assign to each case.
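One simple way to try the keyword idea is a hand-made lookup that maps raw tokens to standardized keywords and then emits one 0/1 flag per keyword. The sketch below assumes a free-text role description per row; the `KEYWORDS` dictionary and the `kw_` column prefix are hypothetical and would need to be built from the terms that actually occur in the data.

```python
import re

# Hypothetical mapping from raw tokens to standardized keywords.
KEYWORDS = {
    "sr": "senior", "senior": "senior",
    "jr": "junior", "junior": "junior",
    "manager": "manager", "mgr": "manager",
    "director": "director",
    "engineer": "engineer", "developer": "engineer",
    "analyst": "analyst",
}


def title_keywords(title):
    """Return the set of standardized keywords found in a free-text title."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return {KEYWORDS[t] for t in tokens if t in KEYWORDS}


def keyword_flags(title):
    """Return one 0/1 feature per standardized keyword, in a stable order."""
    found = title_keywords(title)
    return {f"kw_{k}": int(k in found) for k in sorted(set(KEYWORDS.values()))}


if __name__ == "__main__":
    print(title_keywords("Sr. Software Engineer"))
    print(keyword_flags("Data Analyst"))
```

Because the flags are few and standardized, they add far fewer columns than one-hot encoding every distinct title would.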
Also, from what I saw, you left out the additional benefits. Especially for managers, they might form a relevant part of the compensation, so leaving them out might mislead the model into thinking that someone with 20+ years of experience in a senior position earns less, when in this dataset the rest of the money is in the extras (edit: just saw you used the whole number).
More examples of how to deal with regression models here:
*1) Some more advanced data preparation can be done with vtreat, for example. I have code and an article about that:
If you want to learn more about machine learning, there are some great KNIME resources out there:
I’ll go through your suggestions and see if I can improve anything
Especially your point regarding the categorical variables that refer to years sounds really promising!