So I have a dataset that contains absolutely no missing values (not even a space character). However, after training a decision tree, its predictor outputs some missing values (blanks, actually). I have no clue what I've done wrong. Thanks in advance for your kindness!
I have the exact same problem. No missing values in the dataset, but the decision tree node predicts ? for 92% of the test data! I have used the partitioning node to divide the data into a training set (10% = 150K rows) and a test set (90% = 1350K rows). The interesting thing is that for the rows where it actually predicts a class (it's a binary problem) I get 75% accuracy, but when I set the "No True Child Strategy" to "returnLastPrediction" and get predictions for all rows, the accuracy decreases to 61%! Naturally I would like to know why the missing predictions occur, since the tree by itself seems to be a powerful predictor. When inspecting the tree and the data manually, I don't see any reason for the null predictions!
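To make the "no true child" behavior concrete, here is a minimal hypothetical sketch (plain Python, not KNIME code; the categories and class labels are invented). A split on a nominal attribute creates one branch per category seen during training; a test row whose category matches no branch falls through the node and, depending on the strategy, either gets a missing prediction or falls back to the parent node's majority class:

```python
# Branches a tree node learned for a nominal attribute during training.
branches = {"red": "classA", "blue": "classB"}

def predict(category, parent_majority="classA", return_last=False):
    """Return the branch prediction, or handle the no-true-child case."""
    if category in branches:
        return branches[category]
    # No true child matched: either emit a missing value (None),
    # or fall back to the parent node's majority class.
    return parent_majority if return_last else None

print(predict("blue"))                     # matches a branch -> "classB"
print(predict("green"))                    # unseen category -> None (missing)
print(predict("green", return_last=True))  # fallback -> "classA"
```

This also illustrates why accuracy can drop with the fallback enabled: the fallback rows are exactly the ones the tree knows least about, so their predictions are weaker than those of rows that matched a real branch.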
From your description I believe the problem may actually lie in the data. How do you make the split between the training and the test set? Is it possible that your data are sorted in some way, so that the split generates a training set that is not representative of the rest of your data set?
I would simply plot the training and test set on a scatter plot (you can choose the most relevant dimensions or use a pairs plot), marking them with distinct colors to see how they look graphically. Do they fully overlap, or do you see distinct clusters?
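The sorting concern can be demonstrated with a small hypothetical sketch (plain Python; the category names are invented). If the rows are ordered by some attribute, a sequential 10% split may never see whole categories, whereas shuffling before splitting tends to keep the partitions representative:

```python
import random

# 200 rows sorted by category: all "catA" rows come before all "catB" rows.
rows = [("catA", i) for i in range(100)] + [("catB", i) for i in range(100)]

# Sequential 10% split on the sorted data: the training set sees only "catA".
train_seq = rows[:20]
cats_seq = {cat for cat, _ in train_seq}

# Shuffled split: both categories are likely to appear in the training set.
random.seed(0)
shuffled = rows[:]
random.shuffle(shuffled)
train_shuf = shuffled[:20]
cats_shuf = {cat for cat, _ in train_shuf}

print(cats_seq)   # only one category survives the sequential split
print(cats_shuf)
```

Any category absent from the training set cannot get a branch in the tree, which is one way the "no true child" situation arises at prediction time.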
It is in any case normal that, under such conditions, forcing the predictor to return the last prediction (based on the score of the parent node) when no true child exists decreases the accuracy.
Hi, thanks for your reply. I have experimented with several setups of the training and test sets without any success. All partitions seem to be representative in all attributes. However, it turns out that the problem is connected to a nominal attribute with seven string categories. When this attribute is recoded to int (0-6), the number of rows with missing values drops from more than 1000K to 30. The categories contain space, / and : characters, and I guess there is some kind of bug in how the decision tree learner handles string values..? However, I still get 30 missing values, which is of course insignificant in this context but annoying nonetheless. I also haven't understood exactly why the missing values occur, i.e. which category / character causes them, since the result is rather random and happens for all categories.
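The recoding workaround can be sketched like this (plain Python, not KNIME; the category strings are invented stand-ins containing the space, / and : characters mentioned above). Building an explicit category-to-integer map makes the recoding deterministic and easy to audit:

```python
# Seven invented string categories containing space, / and : characters.
categories = ["a b", "a/b", "a:b", "plain", "x y", "y/z", "z:w"]

# Stable mapping: sort the category strings, then assign codes 0-6.
code = {cat: i for i, cat in enumerate(sorted(categories))}

# Recode some example rows through the map.
rows = ["a/b", "plain", "a b"]
recoded = [code[r] for r in rows]
print(recoded)  # -> [1, 3, 0]
```

One advantage of an explicit map over ad-hoc string handling: a test row whose category is not a key raises a KeyError immediately, so a category mismatch (e.g. stray whitespace) surfaces as a loud error instead of a silent missing prediction.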