[SOLVED] Bug: incorrect sampling proportion in boosting learner metanode (Prediction error too big. Finishing.)

Hi guys,

I'm using the boosting learner metanode in combination with the Decision Tree Learner and Decision Tree Predictor nodes for prediction. Regardless of the parameters chosen in the Decision Tree Learner node, the boosting learner metanode performs only one iteration, because at the second iteration the following warning is triggered:

WARN Boosting Learner Loop End Prediction error too big. Finishing.

I don't know whether this is connected to the problem, but I noticed that the second-iteration training set is 3 times as large as the first-iteration set and, in particular, that 90.8% of it consists of a single entry (curiously, the last one of the training set).

Do you have any suggestions?

Gio

The Boosting Loop Start always oversamples the input data by a factor of three, so the dataset size is as expected. The oversampled dataset will contain wrongly classified examples more often. The method applied is AdaBoost.SAMME. With boosting it happens quite often that after a few iterations the prediction error is too big. But that depends on the dataset and the algorithm parameters.
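Roughly, the mechanism is the following (a simplified sketch in Python with NumPy, not the node's actual code; the weights are made up for illustration): the loop draws three times the original number of rows with replacement, with probability proportional to the boosting weights, so rows that were misclassified and therefore carry a larger weight show up much more often.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical boosting weights after one iteration: most rows keep a small
# weight, the two misclassified ones carry a much larger weight.
weights = np.array([0.02] * 8 + [0.42, 0.42])   # 10 rows, last 2 misclassified
weights /= weights.sum()

n = len(weights)
factor = 3                                      # oversampling factor used by the loop

# Draw 3*n rows with replacement, with probability proportional to the weights:
# the heavily weighted (previously misclassified) rows appear far more often.
resampled = rng.choice(n, size=factor * n, replace=True, p=weights)
print(np.bincount(resampled, minlength=n))      # counts per row in the new training set
```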

OK Thor, thank you for your feedback. Much appreciated.

I understand that the boosted dataset is oversampled and enriched with wrongly classified items. Still, does it seem reasonable to you that 90% of the boosted dataset is represented by a single entry? I have no experience with the AdaBoost.SAMME method, so I would appreciate your advice on this issue.

Thanks again.

If you have very few wrongly classified examples in the first iteration, then those get a much higher probability of appearing in the next iteration's training set. If you have only one wrong example, then it will occur very frequently in the next set.

The node description contains the reference to the AdaBoost.SAMME publication if you are interested in more details.
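As a rough illustration (assuming the textbook AdaBoost.SAMME update, which may differ in detail from the node's implementation): with N examples at uniform weight 1/N and a single misclassified one, the weighted error is eps = 1/N, the wrong example's weight is multiplied by exp(alpha) with alpha = ln((1 - eps)/eps) + ln(K - 1), and after renormalization that single row carries half of the total weight for K = 2 classes.

```python
import numpy as np

def samme_updated_weights(n, wrong_idx, n_classes=2):
    """Textbook SAMME weight update for one boosting round (illustrative only)."""
    w = np.full(n, 1.0 / n)                      # uniform starting weights
    eps = w[wrong_idx].sum()                     # weighted error of the classifier
    alpha = np.log((1 - eps) / eps) + np.log(n_classes - 1)
    w[wrong_idx] *= np.exp(alpha)                # inflate the misclassified rows
    return w / w.sum()                           # renormalize to a distribution

# One wrong example out of 50: after the update it holds half of the total weight,
# so it is expected in roughly half of the rows of the resampled training set.
w = samme_updated_weights(50, wrong_idx=[49])
print(w[49])                                     # 0.5
print(w[0])                                      # ~0.0102 for each correct row
```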

Perfect.

Thank you so much Thor!

In iteration #0 I have 20 wrongly classified items, and nearly all of them are contained in the iteration #1 training set, as expected. The strange behavior is that the entry representing 90% of the iteration #1 boosted dataset is correctly predicted in iteration #0. In my opinion this doesn't make much sense, as that entry should have a low probability of being there.

Anyway I'll read the original publication to see if I'm missing something.

Thanks

Hm, this sounds strange. Any chance you can share the workflow? Then I can have a look at it.

OK Thor, thanks for taking a look at it. I appreciate it.

I prepared a workflow with a small subset on which we can focus. The dataset has 40 items and 2 response classes. In the boosting learner iteration #0 only 2 items are wrongly predicted: row6 and row8.

As you said, the boosting learner iteration #1 contains 3 times the items of iteration #0 (i.e. 120 items), and the 2 wrongly predicted items are among them, as expected. However, the last item of iteration #0 (row39), which is correctly predicted in iteration #0, appears 104 times in the iteration #1 training set. This corresponds to roughly 87% of the whole training set, which is strange to me.
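For what it's worth, if I apply the textbook AdaBoost.SAMME update by hand (which may not be exactly what the node does): with 40 rows at uniform weight and 2 misclassified, eps = 0.05 and each wrong row ends up with weight 0.25, so in a 120-row resample I would expect row6 and row8 about 30 times each and every correctly classified row only about 1.6 times. That is what makes 104 copies of row39 look so anomalous to me. A quick check in Python (NumPy just for convenience):

```python
import numpy as np

# Expected per-row counts under a textbook SAMME update (K = 2 classes),
# for comparison against what the metanode actually produced.
n, wrong = 40, [6, 8]
w = np.full(n, 1.0 / n)
eps = w[wrong].sum()                             # 0.05
alpha = np.log((1 - eps) / eps)                  # ln(19); ln(K - 1) = 0 for K = 2
w[wrong] *= np.exp(alpha)
w /= w.sum()

expected = 120 * w                               # iteration #1 has 3 * 40 = 120 rows
print(expected[6], expected[8])                  # ~30 each for the misclassified rows
print(expected[39])                              # ~1.6 for a correctly classified row
```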

Please, could you give me your opinion on this?

Thank you in advance,

Gio

Oh sh... This was a very nasty (and stupid) error in the boosting implementation. It will be fixed in 2.11. Thanks for pointing it out!

Very good, Thor. As I always say: "every bug found and fixed is one potentially harmful bug fewer."

Thank you.

Gio

Thor,

I just verified that the bug found in the boosting metanode has been fixed in KNIME 2.11.

Thank you very much,

Gio