Random forest learner provides different result vs Random Forest Classifier from sklearn in Python

Jyotendra · August 18, 2019, 4:39pm

Hi Hans,

Thanks for looking into it. For the sake of discussion, let me provide my perspective on your points.

I forgot to add this line at the start of the code that I shared: "np.random.seed(123)". Please add it and run the Python code again; you will see that the results no longer change between runs and the sensitivity is about 0.90. However, this opens up a new question: what is so special about the random seed value 123? If you replace it with any other value, the sensitivity of the model goes down and hovers between 0.8 and 0.9. What could be the reason, and how do we know that it is the 'best' random seed to use?
I have fixed the random seed of my Partitioning in KNIME at 10, so my results do not change. If you look at the Python code, I have used the same random seed value when splitting the data in the function "train_test_split". So, in an ideal world, both platforms should produce the same partitioning and the same train and test datasets.
I completely agree with you that the parameters of the RF in sklearn might not match those in KNIME. However, if you look closely at the default parameter values of the RF classifier, they are as follows.

RandomForestClassifier(n_estimators='warn', criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)

Some of these parameters are available for tweaking in KNIME and some are not, so we don't know what values KNIME uses in its RF algorithm. I have tried to match the settings wherever a parameter exists in both places, in the hope of getting similar results, but couldn't.
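
Coming back to the seed question from my first point, one way to get a feel for how much the sensitivity can move from seed to seed is a small simulation. This is only a minimal sketch with made-up numbers: the true sensitivity of 0.85 and the 60 positive test cases are hypothetical values chosen for illustration, not figures from my dataset, and the function name is my own.

```python
import random

# Hypothetical values, chosen only for illustration (not from the real data).
TRUE_SENSITIVITY = 0.85   # assumed long-run sensitivity of the model
N_POSITIVES = 60          # assumed number of positive cases in the test set


def measured_sensitivity(seed):
    """Simulate the sensitivity measured on one random test set."""
    rng = random.Random(seed)
    # Each positive case is detected with probability TRUE_SENSITIVITY.
    hits = sum(rng.random() < TRUE_SENSITIVITY for _ in range(N_POSITIVES))
    return hits / N_POSITIVES


seeds = [1, 10, 42, 123, 2019]
scores = {seed: measured_sensitivity(seed) for seed in seeds}

for seed, score in scores.items():
    print(f"seed={seed:>4}  sensitivity={score:.3f}")

spread = max(scores.values()) - min(scores.values())
print(f"spread (max - min): {spread:.3f}")
```

If the spread printed here is of the same order as the 0.8 to 0.9 range I see, then a single seed that gives 0.90 may simply be a favourable draw and not a 'best' seed in any meaningful sense.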

Don't you think that if the data is the same, the random seeds are the same, and the models are the same, the results should be the same irrespective of the technology used?
This leads to the question: how can I trust my KNIME results when I can see that I am getting a better prediction model in Python? Or, what else could I do in my KNIME classifier to make my model as good as or better than my Python result?
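
Whether the same seed really implies the same partition can be checked directly. Below is a minimal sketch that splits 20 row indices with seed 10 using two different random generators: Python's built-in random.Random and a toy linear congruential generator. The toy generator and both function names are my own illustration; I am not claiming that either KNIME or sklearn works this way internally.

```python
import random


def split_with_stdlib(n_rows, seed, test_fraction=0.3):
    """Shuffle row indices with Python's built-in generator and take a test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(n_rows * test_fraction)
    return sorted(indices[:n_test])


def split_with_toy_lcg(n_rows, seed, test_fraction=0.3):
    """Same idea, driven by a toy linear congruential generator (illustrative only)."""
    state = seed
    indices = list(range(n_rows))
    # Fisher-Yates shuffle using the toy generator as the source of randomness.
    for i in range(n_rows - 1, 0, -1):
        state = (1103515245 * state + 12345) % (2 ** 31)
        j = state % (i + 1)
        indices[i], indices[j] = indices[j], indices[i]
    n_test = int(n_rows * test_fraction)
    return sorted(indices[:n_test])


test_a = split_with_stdlib(20, seed=10)
test_b = split_with_toy_lcg(20, seed=10)

print("built-in generator, seed 10:", test_a)
print("toy generator, seed 10:     ", test_b)
print("rows in both test sets:     ", sorted(set(test_a) & set(test_b)))
```

If the two lists differ, then the seed value alone does not fix the partition; the generator and the shuffling routine matter as well. A more reliable comparison would be to export the train and test rows from one tool and load exactly those rows in the other.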

Let me know what you think.