KNIME Parameters in Python

Hello,

When we use the same

  • Features
  • Data
  • Training / Test set ratio
  • KNIME settings

to recreate our KNIME models in Python, there is a significant performance gap between the KNIME regression models and their Python counterparts. In terms of R2, KNIME performed 5% to 9% better than Python across the board. We expected the models to perform nearly identically.

For the following KNIME nodes, are there any default parameters that the user cannot see or access?

· XGBoost Regression Learner
· Random Forest Regression Learner
· Gradient Boosted Trees Learner
· Tree Ensemble Learner

Thanks,

Alex

Do you also use the same random seed?
It could certainly be the case that sklearn uses different default parameters than KNIME (if so, I would assume the information is given in the description/documentation).
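
One way to check the sklearn side is to print every hyperparameter an estimator will actually use, including defaults that were never set explicitly, and compare that list against the KNIME node dialog. A minimal sketch, assuming scikit-learn's `RandomForestRegressor` is the Python counterpart (the thread does not say which Python library or estimator was used):

```python
from sklearn.ensemble import RandomForestRegressor

# Assumption: scikit-learn is the Python library being compared with KNIME.
# Fix the seed explicitly; sklearn's default random_state is None (unseeded).
model = RandomForestRegressor(random_state=42)

# get_params() returns every constructor argument, including defaults
# that were never set explicitly.
params = model.get_params()
for name in sorted(params):
    print(f"{name} = {params[name]!r}")
```

The same `get_params()` call works on any scikit-learn estimator, so the listing can be repeated for each model type being compared.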
br


Hello @TE499OP,

I would second what @Daniel_Weikert said and add that many of the learner nodes link to the documentation or articles on which the node's implementation is based. A good example is "A Scalable Parallel Classifier for Data Mining", which I found in the description of the Decision Tree Learner node.

Regards,
Ryan


Hello,

A random seed is used in the KNIME regression learner workflows.

I have been through the documentation for all the regression learner nodes. In many cases, the parameter descriptions in the KNIME node documentation do not line up completely with the parameter names in Python. Some of the KNIME parameters can be matched by intuition to their Python doppelgangers.

Circling back to my question: are there any default parameters in the KNIME regression learners that the user cannot see or access?

Thanks,

Alex

@TE499OP

Everything that can be configured about how the node or model runs is available in the configuration window. If you are curious how the model is built or whether there are any default parameters, the Analytics Platform is completely open source, so you could head over to our GitHub and look at the source code of these nodes.

As for the performance differences, these could depend on a number of factors, such as how your Python environment and your KNIME environment are configured, as well as the fact that the Analytics Platform is built and run using Java.
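
To judge whether a gap of this size is larger than run-to-run noise, one option is to refit the Python model under several seeds and look at the spread of R2. A minimal sketch, assuming scikit-learn and synthetic data from `make_regression` (stand-ins, not the data or models from this thread):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with the table exported from KNIME.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

scores = []
for seed in range(5):
    # The seed controls both the train/test split and the forest itself.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(r2_score(y_test, model.predict(X_test)))

print("R2 per seed:", [round(s, 3) for s in scores])
print("spread:", round(max(scores) - min(scores), 3))
```

If the spread across seeds is comparable to the KNIME/Python gap, the split and the seed are the first things to align. Exporting the exact training and test rows from KNIME and reusing them in Python removes that source of variation.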

Hope this is helpful,
Ryan
