Hi @aworker,
KNIME currently uses xgboost4j version 0.72 under the hood. This library does not seem to support returning the variance of the out-of-bag predictions, but that may also be because out-of-bag predictions do not make much sense for gradient boosting. In stochastic gradient boosting, just as in Random Forests, a datapoint may be out-of-bag for one tree but not for another. However, since in gradient boosting every tree depends on all the trees that came before it (each one is trained on their residuals), you cannot really form a prediction from only the subset of trees that never “saw” the datapoint during training. If you simply summed the individual predictions of those trees, you would ignore the fact that trees early in the chain have a much larger influence on the output than later ones.
Let’s assume that you have trained a stochastic gradient boosting model with 3 trees. Now you have a datapoint that was used to train trees 2 and 3. For its out-of-bag prediction you would only use tree 1, and that might give you an okay estimate. Another datapoint that was used to train trees 1 and 2, though, is more problematic: for its out-of-bag prediction you could only use tree 3, but tree 3 only predicts the residuals of the previous trees, so its output on its own is quite useless as a prediction.
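To make this concrete, here is a toy Python sketch (not KNIME or xgboost4j code) of the 3-tree example above. The “trees” are simplified to depth-0 stumps that predict the mean residual of their subsample, and the targets and subsample indices are made up for illustration:

```python
# Toy stochastic gradient boosting with 3 "trees" (depth-0 stumps).
# Each tree is fit on the residuals left by the previous trees,
# using only its own bootstrap subsample of the data.

y = [10.0, 12.0, 9.0, 11.0, 10.5, 11.5]  # regression targets

# Indices used to train each tree (made up for illustration):
# datapoint 0 is out-of-bag for tree 1 only,
# datapoint 3 is out-of-bag for tree 3 only.
subsamples = [
    [1, 2, 3, 4, 5],  # tree 1
    [0, 2, 3, 4, 5],  # tree 2
    [0, 1, 2, 4, 5],  # tree 3
]

def fit_stump(residuals, idx):
    """Depth-0 'tree': predict the mean residual of the subsample."""
    return sum(residuals[i] for i in idx) / len(idx)

preds = [0.0] * len(y)   # current ensemble prediction per datapoint
tree_outputs = []        # what each tree contributes
for idx in subsamples:
    residuals = [y[i] - preds[i] for i in range(len(y))]
    out = fit_stump(residuals, idx)
    tree_outputs.append(out)
    preds = [p + out for p in preds]

# tree_outputs is roughly [10.8, -0.4, 0.2]: tree 1 predicts on the
# scale of y itself, trees 2 and 3 only predict small corrections.

# Datapoint 0 (out-of-bag only for tree 1): its "OOB prediction"
# is tree 1's output alone, which is at least in the right range.
oob_pred_point0 = tree_outputs[0]   # ~10.8 vs. true value 10.0

# Datapoint 3 (out-of-bag only for tree 3): its "OOB prediction"
# is tree 3's output alone, a tiny residual correction that is
# useless as a standalone prediction.
oob_pred_point3 = tree_outputs[2]   # ~0.2 vs. true value 11.0
```

The point is not the exact numbers, but that a later tree’s output only makes sense when added on top of all the earlier trees, which is why per-tree out-of-bag predictions (and hence their variance) break down for boosting.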
Kind regards,
Alexander