Differences between WEKA Random Forest and KNIME 3's Random Forest Learner

Hi there,

I would like to ask you about new functionalities in KNIME 3.2.1.

We moved our Random Forest workflow for data analysis from KNIME 2 to KNIME 3. We had been using Weka's Random Forest node and were getting good results. After moving to KNIME 3 we decided to try the new Random Forest Learner/Predictor nodes for our classification task. Not only is your node much faster in our workflow (we only exchanged the RF nodes), but the classification also seems to work better than with Weka's RF node. We are now trying to pinpoint the reason for this.

We are using information gain for splitting (although we get the same results with Gini and information gain ratio). We also use the same number of trees (500) and default numbers of features in both nodes.
Theoretically, using information gain should be equivalent to what the Weka node was doing, right? However, our data has a lot of missing values, ~30% in some rows. Weka was using surrogate splitting to handle missing values, while you point to XGBoost's GitHub in your node description. Does that mean you are using exactly XGBoost's approach to missing data instead of surrogate splitting?
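For reference, here is how we understand the two impurity measures involved (a toy Python sketch on made-up labels, not either node's actual code):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a class distribution (the basis of information gain)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity, the other common split criterion."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

labels = ["yes", "yes", "no", "no"]
print(entropy(labels))  # 1.0 for a 50/50 class split
print(gini(labels))     # 0.5 for a 50/50 class split
```

Both measures are maximal for a 50/50 split and zero for a pure node, which may explain why they select very similar splits on our data.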

Best regards,
Piotr

Hi Piotr,

First of all, I am very happy to hear that the effort we have spent on the random forests over the past year is paying off in some way.

Now, regarding your question:

Unfortunately I do not know the Weka implementation of random forests so anything I say (or write) is more an informed guess than a definitive answer.

Nonetheless, if you have many missing values, the missing value handling will probably have a big impact on the differences you are experiencing. Theoretically it would also be possible to use surrogate missing value handling in our random forests (this is possible for gradient boosting and the simple regression tree), but we decided against it because it is quite slow and conflicts with the feature sampling that is typically used in random forests.

The approach proposed by the XGBoost authors is much faster, and in our tests we also found that the results are on average no worse than those of other strategies such as surrogates.
The idea is to send the missing values to each child in turn and finally assign them to the child for which the resulting impurity reduction is maximal. If there are no missing values in the training data, then we send missing values to the child that receives the most training rows. That is a very crude measure, but without any missing values in the training data it is the best we can do.

I hope this gives you some insight; if not, please feel free to ask.

Cheers,

nemad


Thanks a lot for your response.

Please correct me if I didn't quite get it. For the XGBoost approach during training, let's assume we have rows 1, 2, 3, 4, 5 and features A, B, C in a given node, and row 5 has no value for feature C. The algorithm will iterate over possible splits, i.e. check the impurity reduction for a split based on A for all 5 rows, and the same for B. However, when it checks C, it will only consider rows 1-4 for the impurity reduction, and once it is done, it will send row 5 to the child that had the best impurity reduction? Do I understand this approach correctly?

I am trying to understand the mechanics a bit more because we observe a difference in our data scoring between the same workflow with the Weka node and with your RF node. Weka tends to assign RF scores close to 0.5 to entries with the highest numbers of missing values, which could be interpreted simply as a lack of power to make an "extreme" decision such as a score of 0.0 or 1.0. Your node, however, displays a strong correlation between the RF score and the number of features available (see the plots below). We are not convinced that this behavior fits our assumptions about the system (we are studying proteins), but we naturally want to know whether the missing value treatment is behind it.

Weka output:

Note that "protein intensity" is a feature that was NOT used for training in either case, and that it is strongly correlated with the actual number of non-missing features (the lower the intensity, the more missing values).

Your RF node:

As you can see in the second plot, the RF score is biased by the number of available features. In both cases the maximal number of features is almost 300 and the minimum is 20, meaning there are rows with only ~7% of the possible values present. Unfortunately, this is a typical scenario in mass spectrometry data and we have to live with it ;)

Do you have a comment on that? Just as a side note: we never work with the 0.5 threshold, i.e. we don't divide the data into "yes/no" groups but work with the actual scores throughout our analyses.

Regarding the first part of your post:

Not exactly. Let's suppose C is a numerical feature and we consider splitting it at the value 5.
For rows with missing values we cannot say to which child they should go, so we calculate, for each child, what gain we would achieve if we sent the missing values to it. Then we pick the child for which that gain is greatest.

This approach implicitly assumes that there may be meaning behind a missing value, which works better for some datasets and worse for others. However, missing value handling is always tricky, and doing it efficiently usually requires some kind of heuristic.

To get a better understanding of the kind of impact the XGBoost missing value handling has on the RF, you could place the Missing Value node before the RF Learner node and fill in the missing values with the median/mean (or the mode for nominal features). This is what most other RF implementations usually do if they encounter missing values (if they support missing values at all).
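As a rough illustration of what the median strategy does to a single column (plain Python on made-up numbers, not the Missing Value node itself):

```python
from statistics import median

def impute_median(column):
    """Replace None entries with the median of the observed values,
    mimicking a median-imputation preprocessing step."""
    observed = [v for v in column if v is not None]
    m = median(observed)
    return [m if v is None else v for v in column]

print(impute_median([1.0, None, 3.0, None, 5.0]))  # [1.0, 3.0, 3.0, 3.0, 5.0]
```

If the scores stop correlating with the number of available features after such imputation, that would support the idea that the XGBoost-style routing is behind the behavior you observed.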

Regarding your other questions, I will have to think some more and look at the Weka implementation.
I will come back to you once I can provide an answer backed by some valid arguments.

Kind regards,

nemad