XGBOOST versus RANDOM FOREST

aworker · April 23, 2020, 2:39pm

Dear Knimers,

I’m currently comparing the performance of XGBoost to Random Forest in their different KNIME versions (classification / regression based on the KNIME “RF” & “Tree Ensemble” (TE) nodes). Something that I like a lot in the KNIME RF & TE nodes is the information given by the column called “Solubility(logS0) (Out-of-bag) (Prediction Variance)” coming out of the training & predictor nodes.

Here are my questions :

Is there any way of getting this information from the XGBoost nodes? If yes, how?
If not, is this because of KNIME's implementation of the XGBoost library or because of the library itself?
Is KNIME using the XGBoost4J library (https://xgboost.readthedocs.io) in the background? If so, could you please point me to where this information is addressed and handled in the XGBoost4J documentation?

Many thanks in advance for your time and answers.

Best regards

Ael

AlexanderFillbrunn · April 27, 2020, 7:45am

Hi @aworker,
KNIME currently uses xgboost4j version 0.72 under the hood. This library does not seem to support returning the variance of the out-of-bag predictions, but that may also be because it does not make much sense for gradient boosting. In stochastic gradient boosting, just like in Random Forests, a datapoint may be out-of-bag in one tree but not in another. But since in gradient boosting every tree depends on all the trees that came before it (it is learned on their residuals), you cannot really create a prediction with only the subset of trees that did not “see” the datapoint during training. If you summed up the individual predictions of the trees in your ensemble that have not seen the datapoint, you would totally ignore that the trees early in the chain have a much bigger influence on the outcome than the later ones.

Let’s assume that you have trained a stochastic gradient boosting model with 3 trees. Now you have a datapoint that was used to train trees 2 and 3. For your out-of-bag prediction you would only use tree 1, and that might give you an okay estimate. Another datapoint that was used to train trees 1 and 2, though, would be less helpful: for your out-of-bag prediction you could only use tree 3, but that tree only predicts the residuals of the previous trees, so its output on its own is quite useless.
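
To make this concrete, here is a minimal sketch in plain Python (it is not the XGBoost or KNIME implementation). It assumes a toy 1-D dataset, one-split regression stumps as the "trees", a learning rate of 1 and no row subsampling; the helper names `fit_stump` and `fit_boosted_stumps` are made up for illustration. It compares what tree 1 alone, tree 3 alone and the full ensemble predict for one datapoint.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression stump that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def fit_boosted_stumps(xs, ys, n_trees=3):
    """Fit each stump on the residuals left by the stumps before it."""
    trees = []
    residuals = list(ys)
    for _ in range(n_trees):
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        residuals = [r - tree(x) for x, r in zip(xs, residuals)]
    return trees


# Toy data (an assumption for illustration): y = 10 + 2x on x = 0..9.
xs = [float(i) for i in range(10)]
ys = [10.0 + 2.0 * x for x in xs]

trees = fit_boosted_stumps(xs, ys, n_trees=3)

x_query, y_true = 9.0, 28.0
per_tree = [tree(x_query) for tree in trees]
full_prediction = sum(per_tree)

print("tree 1 alone :", per_tree[0])
print("tree 3 alone :", per_tree[2])
print("all 3 trees  :", full_prediction)
print("true value   :", y_true)
```

On this toy data, tree 1 alone lands near the target, while tree 3 alone returns only a small residual correction, which is the effect described above.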

Kind regards,
Alexander

aworker · April 27, 2020, 9:07am

Hi @AlexanderFillbrunn

Many thanks for your clear explanation and your time. It makes full sense to me now. I was expecting, though, some kind of prediction quality estimator, such as the out-of-bag prediction, which is really useful in Tree Ensembles in general. It may be more complicated to figure out a good prediction quality estimator in the case of gradient boosting ensembles. If you know of any way to achieve this, or of a reference explaining how to implement it, please let me know.

Many thanks again & all the best,

Ael

AlexanderFillbrunn · April 27, 2020, 9:11am

Hi,
why not just go with cross validation? That can easily be done with KNIME (Cross validation example).
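
Outside of KNIME, the same idea can be sketched in a few lines of plain Python. This is only an illustration under assumptions: `k_fold_splits`, `cross_validated_predictions` and the trivial `fit_mean` model are hypothetical helpers, and any learner (for example a boosted model) could be passed in place of `fit_mean`. Every row gets exactly one prediction from a model that never saw it, which plays a role similar to the out-of-bag prediction.

```python
import math
import random


def k_fold_splits(n_rows, k, seed=0):
    """Yield (train_indices, test_indices) for k shuffled folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validated_predictions(xs, ys, fit, k=5, seed=0):
    """Predict every row with a model trained without that row's fold."""
    preds = [None] * len(xs)
    for train_idx, test_idx in k_fold_splits(len(xs), k, seed):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in test_idx:
            preds[i] = model(xs[i])
    return preds


def fit_mean(xs, ys):
    """Placeholder learner (assumption): always predicts the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


# Toy data, made up for illustration.
xs = [float(i) for i in range(20)]
ys = [10.0 + 2.0 * x for x in xs]

cv_preds = cross_validated_predictions(xs, ys, fit_mean, k=5)
cv_rmse = math.sqrt(sum((p - y) ** 2 for p, y in zip(cv_preds, ys)) / len(ys))
print("cross-validated RMSE:", cv_rmse)
```
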
Kind regards,
Alexander

aworker · April 27, 2020, 9:19am

Hi,

Yes, sure. Out-of-bag is already a kind of smarter cross validation that is integrated into the building of the Tree Ensemble, which makes the workflow a lot simpler and more efficient. Statistically speaking, it is better too, but this would be more elaborate to develop. Thanks Alexander.
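
For reference, here is a rough sketch in plain Python of how an out-of-bag prediction and its variance can fall out of bagging for free. It is not the KNIME code: it assumes the "Prediction Variance" column is the variance of the individual tree predictions (which I have not verified), uses one-split stumps instead of full trees, and the names `fit_stump` and `bagged_oob_stats` are made up.

```python
import random
from statistics import mean, pvariance


def fit_stump(xs, ys):
    """One-split regression stump; falls back to a constant if no split exists."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        constant = mean(ys)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def bagged_oob_stats(xs, ys, n_trees=50, seed=0):
    """Return (oob_mean, oob_variance) per row from independently bagged stumps."""
    rng = random.Random(seed)
    n = len(xs)
    oob_preds = [[] for _ in range(n)]
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        tree = fit_stump([xs[i] for i in sample], [ys[i] for i in sample])
        in_bag = set(sample)
        for i in range(n):
            if i not in in_bag:  # only trees that never saw row i vote
                oob_preds[i].append(tree(xs[i]))
    return [(mean(p), pvariance(p)) if p else (None, None) for p in oob_preds]


# Toy data, made up for illustration.
xs = [float(i) for i in range(12)]
ys = [10.0 + 2.0 * x for x in xs]

oob_stats = bagged_oob_stats(xs, ys)
for x, (oob_mean, oob_var) in zip(xs, oob_stats):
    print(x, oob_mean, oob_var)
```

This works because the bagged trees are independent of each other, so any subset of them is still a valid (smaller) ensemble. That is exactly what does not hold for boosted trees.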

Kind regards,

Ael
