Did you check that workflow on the previous data sets?
I’m not sure whether such a prediction can be based on historical data alone. IMHO the main question will be whether the algorithm takes all relevant parameters into account.
Such a parameter could be a tweet from, or an action by, Donald T. (like yesterday).
Or some (to me unknown) effects of Brexit on the UK economy and thus the employment rate.
I think you have to take a step back, think about what you want to do, and maybe read a thing or two about building predictive models. Your Decision Tree tries to predict “Area”, which is a region in your data, i.e. whether a line is from Wales or some other region. Not sure what to make of this.
The Target is taken from the 2017 “Emp UK percent” column. If you want Unemployment, you would have to change that. The years are transformed into columns: the first year of the data becomes _0, the year before that _1, and so on.
Then I construct a Test dataset using the data of the past 9 years, 2009 to 2017, while 2018 provides the Target, so we have the same structure in Train and Test.
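In plain Python/pandas, the reshaping above could be sketched roughly like this (the numbers, the `make_window` helper, and the convention that `_0` is the most recent year are my assumptions for illustration, not the actual KNIME workflow):

```python
import pandas as pd

# Hypothetical wide table: one row per area, one column per year
# (made-up rates, not the real ONS figures)
data = {"Area": ["England", "Wales", "Scotland", "Northern Ireland"]}
for y in range(2008, 2019):
    data[str(y)] = [70.0 + 0.2 * (y - 2008) + i for i in range(4)]
df = pd.DataFrame(data)

def make_window(df, feature_years, target_year):
    """Turn the chosen years into lag columns _0 (most recent) .. _n
    and add the target year as the Target column."""
    out = df[["Area"]].copy()
    for i, y in enumerate(sorted(feature_years, reverse=True)):
        out[f"_{i}"] = df[str(y)]
    out["Target"] = df[str(target_year)]
    return out

# Train: features 2008-2016, Target 2017; Test: features 2009-2017, Target 2018
train = make_window(df, range(2008, 2017), 2017)
test = make_window(df, range(2009, 2018), 2018)
print(list(train.columns))  # same structure in Train and Test
```

The point is only that both datasets end up with identical column names, so a model trained on one can be applied to the other.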
Then we use several regression models to predict the Target and see how each model is doing on our test data. Focussing on the RMSE (lowest is best), we see that H2O GLM is best.
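RMSE itself is just the square root of the mean squared error. A minimal sketch of the comparison, with made-up predictions (H2O GLM is a KNIME/H2O node, so a naive “repeat last year’s value” baseline stands in for a second model here):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_test = [75.0, 74.0, 72.5, 68.0]      # hypothetical 2018 Targets for the 4 areas
pred_model = [74.6, 74.3, 72.0, 68.5]  # hypothetical predictions of one model
pred_naive = [74.0, 73.0, 73.5, 69.0]  # baseline: repeat last year's value

print(rmse(y_test, pred_model))  # ~0.43
print(rmse(y_test, pred_naive))  # 1.0 -> the model with the lowest RMSE wins
```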
The next step would be to use the data from 2010-2018 and do the prediction like it was done on the Test data; then you have your prediction for the 4 Areas for 2019.
Of course the question is whether such a figure can be derived from just these numbers, but the example is there to demonstrate how to do it. Here we use quite complicated algorithms; you might also try just a linear model on the time series alone.
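A linear model on a single time series could be as simple as fitting a straight line and extrapolating one year ahead; a sketch with made-up numbers (not the real data):

```python
import numpy as np

years = np.arange(2009, 2019)  # hypothetical single-area series, 2009-2018
emp = np.array([70.1, 70.4, 70.2, 70.8, 71.3, 71.9, 72.4, 73.0, 73.6, 74.1])

slope, intercept = np.polyfit(years, emp, 1)  # least-squares straight line
pred_2019 = slope * 2019 + intercept          # extrapolate one year ahead
print(round(pred_2019, 2))
```

Such a trend line is a useful sanity check against the more complicated models.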
And since a figure like unemployment could depend on very many factors, I would not bet the farm on having found a magic system to predict unemployment.
The scorer node for numerical targets is already there and provided us with the RMSE statistic to evaluate the model.
You could try to read more about the prediction of numerical targets in this collection of entries:
Also if your assignment is from a university or other educational institution you might want to read the provided literature or ask your professor or tutor which statistic is the preferred/expected one. RMSE is widely used and also often the decisive statistic in Kaggle competitions.
Another note: the models compared in this example are used in their most basic configuration. All the good stuff like normalisation, feature engineering and hyperparameter tuning is not used (yet). But the question is whether, with so few data points, you would get better results.
Also, you could legitimately try to predict each area with a time series of its own. But since all parts of the UK are economically connected, the model might benefit from having them together in one dataset, and we left the area information in for the model to use. Another approach could be to leave the area out.
But all this is an addition. You should first concentrate on getting your setting and ‘business question’ right.
thank you so much
I tried to understand; maybe the approach I am taking is just not right, maybe I have to understand this better.
I thought I could just predict the labor rate with my data set, it is not just rendering.
Just out of curiosity I took a look at the current ‘real’ numbers. You have to be careful interpreting them, since there are seasonal adjustments and I am not sure what status your original data had (maybe load it again from the original source).
For 2018, your numbers for Wales and Scotland match; for England and Northern Ireland they are different (cf. Figure 1).
Regional labour market statistics in the UK: October 2019
Yes in the data preparation just select the unemployment rate as “Target”.
I added a few algorithms to the build data node.
Please be aware that with so few data points the effect and stability of such models is limited. If you check the feature importances, features from 6 years before are prominent. That could mean there is a cycle of economic development, or it could be random and prone to external influences that are not covered by the data.
Thank you so much
When I change the prediction (target) from employment to unemployment, it still gives me results close to the employment statistics. Comparing the two, shouldn’t the unemployment rate be much lower?
That is good. The next thing you could try is to create a list of time-series windows to get more data, if you assume there are ‘seasonal’ cycles of the economy.
2018 -> 2017, 2016, 2015
2017 -> 2016, 2015, 2014
And so on. But again: modelling such complex developments as unemployment is not so easy. I would not be surprised if a simple average were also not that far off. And also we are working with absolute numbers of employed people and so on; for a more robust model it might be necessary to normalize or index them.
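The window idea above, sketched in Python (one made-up series; `window=3` matches the three lag years listed, and the `sliding_windows` helper is my own illustration):

```python
import pandas as pd

# One hypothetical yearly series; each resulting row uses `window` past years
# as features (_0 = most recent) and the following year as the Target.
series = {2012: 71.0, 2013: 71.5, 2014: 72.1, 2015: 72.4,
          2016: 73.0, 2017: 73.6, 2018: 74.0}

def sliding_windows(series, window=3):
    """Build one training row per target year from the preceding years."""
    years = sorted(series)
    rows = []
    for i in range(window, len(years)):
        row = {f"_{j}": series[years[i - 1 - j]] for j in range(window)}
        row["Target"] = series[years[i]]
        row["target_year"] = years[i]
        rows.append(row)
    return pd.DataFrame(rows)

windows = sliding_windows(series)
print(windows)  # 2018 -> 2017, 2016, 2015; 2017 -> 2016, 2015, 2014; ...
```

This way a single series of 7 years yields 4 training rows instead of 1, which helps a little with the scarce data.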