So I have a dataset that contains absolutely no missing values (not even a space character). However, after training a decision tree, its predictor outputs some missing values (blanks, actually). I have no clue what I've done wrong. Thanks in advance for your kindness!
I have the exact same problem. No missing values in the dataset, but the decision tree node predicts ? for 92% of the test data! I have used the partitioning node to divide the data into a training set (10% = 150K rows) and a test set (90% = 1350K rows). The interesting thing is that for the rows where it actually predicts a class (it's a binary problem) I get 75% accuracy, but when I set the "No True Child Strategy" to "returnLastPrediction" and get predictions for all rows, the accuracy decreases to 61%! Naturally I would like to know why the missing predictions occur, since the tree by itself seems to be a powerful predictor. When inspecting the tree and the data manually, I don't see any reason for the null predictions!
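To make the "no true child" behavior concrete, here is a minimal hypothetical sketch (plain Python, not KNIME code; the categories and class labels are invented). A split on a nominal attribute creates one branch per category seen during training; a test row whose category matches no branch falls through the node and, depending on the strategy, either gets a missing prediction or falls back to the parent node's majority class:

```python
# Branches a tree node learned for a nominal attribute during training.
branches = {"red": "classA", "blue": "classB"}

def predict(category, parent_majority="classA", return_last=False):
    """Return the branch prediction, or handle the no-true-child case."""
    if category in branches:
        return branches[category]
    # No true child matched: either emit a missing value (None),
    # or fall back to the parent node's majority class.
    return parent_majority if return_last else None

print(predict("blue"))                     # matches a branch -> "classB"
print(predict("green"))                    # unseen category -> None (missing)
print(predict("green", return_last=True))  # fallback -> "classA"
```

This also illustrates why accuracy can drop with the fallback enabled: the fallback rows are exactly the ones the tree knows least about, so their predictions are weaker than those of rows that matched a real branch.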
From your description I believe the problem may actually lie in the data. How do you make the split between the training and the test set? Is it possible that your data are sorted in some way, so that the split generates a training set that is not representative of the rest of your data set?
I would simply plot the training and test set on a scatter plot (you can choose the most relevant dimensions or use a pairs plot), marking them with distinct colors to see how they look graphically. Do they fully overlap, or do you see distinct clusters?
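The sorting concern can be demonstrated with a small hypothetical sketch (plain Python; the category names are invented). If the rows are ordered by some attribute, a sequential 10% split may never see whole categories, whereas shuffling before splitting tends to keep the partitions representative:

```python
import random

# 200 rows sorted by category: all "catA" rows come before all "catB" rows.
rows = [("catA", i) for i in range(100)] + [("catB", i) for i in range(100)]

# Sequential 10% split on the sorted data: the training set sees only "catA".
train_seq = rows[:20]
cats_seq = {cat for cat, _ in train_seq}

# Shuffled split: both categories are likely to appear in the training set.
random.seed(0)
shuffled = rows[:]
random.shuffle(shuffled)
train_shuf = shuffled[:20]
cats_shuf = {cat for cat, _ in train_shuf}

print(cats_seq)   # only one category survives the sequential split
print(cats_shuf)
```

Any category absent from the training set cannot get a branch in the tree, which is one way the "no true child" situation arises at prediction time.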
It is in any case normal that, under such conditions, forcing the predictor to return the last prediction (based on the score of the parent node) when no true child exists decreases the accuracy.
Hi, thanks for your reply. I have experimented with several setups of the training and test sets without any success. All partitions seem to be representative in all attributes. However, it turns out that the problem is connected to a nominal attribute with seven string categories. When this attribute is recoded to int (0-6), the number of rows with missing values drops from more than 1000K to 30. The categories contain space, / and : characters, and I guess there is some kind of bug in how the decision tree learner handles string values..? However, I still get 30 missing values, which is of course insignificant in this context but annoying nonetheless. I also haven't understood exactly why the missing values occur, i.e. which category / character causes them, since the result is rather random and happens for all categories.
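The recoding workaround can be sketched like this (plain Python, not KNIME; the category strings are invented stand-ins containing the space, / and : characters mentioned above). Building an explicit category-to-integer map makes the recoding deterministic and easy to audit:

```python
# Seven invented string categories containing space, / and : characters.
categories = ["a b", "a/b", "a:b", "plain", "x y", "y/z", "z:w"]

# Stable mapping: sort the category strings, then assign codes 0-6.
code = {cat: i for i, cat in enumerate(sorted(categories))}

# Recode some example rows through the map.
rows = ["a/b", "plain", "a b"]
recoded = [code[r] for r in rows]
print(recoded)  # -> [1, 3, 0]
```

One advantage of an explicit map over ad-hoc string handling: a test row whose category is not a key raises a KeyError immediately, so a category mismatch (e.g. stray whitespace) surfaces as a loud error instead of a silent missing prediction.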