Hey all - newbie here so please forgive what may seem like an odd question.
I have a single-class stream of nominal data - 5 potential values in the data, each of which occurs with a different probability throughout the dataset.
Basically, the values are 0, 1, 2, 3, 4.
Zero has the highest probability (59%), and the other values have 27%, 10%, 2% and 1% respectively.
I never see anything in the predictor output other than 0 or 1, although there are instances of 2,3,4 in the source training data.
Any idea why the predictor is only showing 0 or 1?
I’m trying to build a model that will give me a 30d forecast based on these probabilities.
Naive Bayes operates on an assumption of independence for the input variables. Are your inputs truly independent? Are you handling correlation beforehand?
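Is it possible that you have feature values in your test dataset that are not present in your training dataset? This is the so-called Zero Frequency problem: a value that was never seen in training gets a probability of zero, which wipes out the whole product of probabilities for that class.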
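To illustrate the Zero Frequency problem, here is a minimal Python sketch (not KNIME's implementation). The function name `conditional_prob` and the training values are made up for the example. It compares the raw frequency estimate with Laplace (add-one) smoothing, which is a common way to avoid zero probabilities.

```python
from collections import Counter

def conditional_prob(feature_values, value, alpha=0.0, n_categories=None):
    """Estimate P(feature == value | class) from the values seen for one class.

    alpha=0 gives the raw frequency estimate; alpha=1 is Laplace smoothing.
    """
    counts = Counter(feature_values)
    if n_categories is None:
        n_categories = len(counts)
    return (counts[value] + alpha) / (len(feature_values) + alpha * n_categories)

# Hypothetical training values of one nominal feature for a single class.
train_values = ["low", "low", "medium", "low", "medium"]

# "high" never occurs in training, so the raw estimate is exactly zero and
# would zero out the whole product of probabilities for this class.
p_raw = conditional_prob(train_values, "high", alpha=0.0, n_categories=3)

# With add-one smoothing the unseen value gets a small non-zero probability.
p_smoothed = conditional_prob(train_values, "high", alpha=1.0, n_categories=3)

print(p_raw, p_smoothed)
```

With smoothing, an unseen value makes a class less likely instead of ruling it out completely.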
Apart from this, is there a particular reason you want to use NB as a classifier? Other algorithms often perform better, with less restrictive assumptions, for a negligible calculation tradeoff.
OK so firstly, the training dataset is the test dataset. They are one and the same. I am looking for a way to forecast the next 30 periods as accurately as possible. I was intending to use back-testing to validate the method.
The NB nodes seem to give me a useful probability distribution, but it’s just that the predictor does not seem to match the result.
What other algorithms would work here? Pls bear in mind I’m new to this kind of data science and also new to Knime.
Be careful of using probabilities calculated from NB - they are notoriously inaccurate. NB can be an effective classifier, but that’s often in spite of the probability estimates. Also, if you have numeric features in your model, NB assumes they are Gaussian-distributed, so you may need to transform those features ahead of time.
As for other algorithms, you might consider Decision Trees or Random Forests here instead. Both can handle multi-class problems.
But before you think about that - this sounds like a time series problem. Time Series has its own peculiarities, since each observation in your data is correlated with the ones that came before. So you need to deal with that too. There should be some forum posts about that floating around - I’m not a TS expert so I will defer to others that know better than I do there (e.g. @Corey).
I agree that a Random Forest would be an interesting model to try here.
You can twist it into a forecaster by lagging your column to use it as an input and by recursively deploying the model to predict the next day’s class until you get to the 30-day mark.
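As a rough sketch of the lag-and-recurse idea in plain Python (outside KNIME): the series below is invented, and `fit_successor_table` is a deliberately simple stand-in for the Random Forest, so treat it as an illustration of the loop rather than a real model. Any classifier that maps a window of lagged values to the next class could be plugged in instead.

```python
from collections import Counter, defaultdict

def make_lagged(series, n_lags):
    """Turn a series into (lag_window, next_value) training pairs."""
    return [(tuple(series[i - n_lags:i]), series[i])
            for i in range(n_lags, len(series))]

def fit_successor_table(pairs):
    """Stand-in model: remember the most common next value per lag window."""
    table = defaultdict(Counter)
    for window, target in pairs:
        table[window][target] += 1
    # Fallback for windows never seen in training: the overall majority class.
    overall = Counter(target for _, target in pairs).most_common(1)[0][0]

    def predict(window):
        seen = table.get(tuple(window))
        return seen.most_common(1)[0][0] if seen else overall

    return predict

def recursive_forecast(series, predict, n_lags, horizon):
    """Predict one step, append it to the history, and repeat."""
    history = list(series)
    forecast = []
    for _ in range(horizon):
        next_value = predict(history[-n_lags:])
        forecast.append(next_value)
        history.append(next_value)  # the prediction becomes a lagged input
    return forecast

# Invented example series of daily classes.
series = [0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 0]
n_lags = 2
predict = fit_successor_table(make_lagged(series, n_lags))
forecast = recursive_forecast(series, predict, n_lags, horizon=30)
print(forecast)
```

Note that each predicted value is fed back in as an input, so any error made early on carries forward through the remaining steps.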
That being said, it can be difficult to build forecasting models, sometimes impossible without supplementary data.
Is it possible the 2, 3, 4 values are occurring at random or due to the presence of something your model couldn’t know from only seeing the series itself? That combined with their rarity would cause the model to never actually predict them.
An example of that would be trying to build a model to forecast ice cream sales. It probably wouldn’t be accurate unless it could “see” things like the weather.
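To make the point about rarity concrete, here is a small Python sketch using the class frequencies from the original question as priors. The likelihood values are invented for illustration. A classifier that outputs the most probable class only picks class 4 when the evidence favours it by more than 0.59 / 0.01 = 59 times over class 0.

```python
# Class frequencies from the question, used as prior probabilities.
priors = {0: 0.59, 1: 0.27, 2: 0.10, 3: 0.02, 4: 0.01}

def posterior(priors, likelihoods):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalised = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(unnormalised.values())
    return {c: v / total for c, v in unnormalised.items()}

def predict(priors, likelihoods):
    """Return the class with the highest posterior probability."""
    post = posterior(priors, likelihoods)
    return max(post, key=post.get)

# Invented likelihoods: how well the inputs fit each class.
uninformative = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
mild_evidence = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 10.0}
strong_evidence = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 60.0}

print(predict(priors, uninformative))    # inputs say nothing: majority class
print(predict(priors, mild_evidence))    # 10x in favour of class 4: not enough
print(predict(priors, strong_evidence))  # 60x in favour of class 4
```

If the inputs carry little information about when a 2, 3 or 4 occurs, the evidence never gets that strong, and the predicted class stays at 0 or 1.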
Just to add: since you are new to data science and to KNIME itself, check out the Learning page KNIME provides. Among other useful things, you will find a (free) Data Science E-Learning Course, which seems to me would be a good start for you.