Gradient Boosted Trees Predictor (Regression)

Hi,

I am using a Gradient Boosted Trees Learner (XGBoost) followed by a Gradient Boosted Trees Predictor to perform feature selection based on the R^2 score. The blocks are preceded by a Domain Calculator inside a Feature Selection loop. I often get errors like these:

ERROR Gradient Boosted Trees Predictor (Regression) 0:20:23 Unable to clone input data at port 1 (Model): Zero length BigInteger

ERROR Gradient Boosted Trees Predictor (Regression) 0:20:23 Unable to clone input data at port 1 (Model): null

The errors depend on the parameters of the Regression Tree Learner. I have already tried varying the parameters of the mentioned nodes with no success. What is the root cause of the error?

Kind regards,

Leo

Hello Leo,

Without further details it is hard to tell what caused the error.
Does your data contain categorical columns with many possible values?

In order to find the problem, it would be very beneficial if you could provide us with a small reproducible example workflow where you encounter the issue.

Kind regards,

nemad

Hello nemad.

Thanks for your quick response.

I am using a model that worked on a particular dataset, but when I switched it to a new dataset, the regression stopped working with several errors.

I tried to clean the data, which has categorical columns mixed with numeric ones, used the Domain Calculator to exclude the target column, and set it to allow up to 3000 possible categorical values.

The model runs for a long time and then crashes with the mentioned errors. I tried to reduce the iteration step, to no avail.

KNIME is a black box for me; I am just using it to validate an idea. For the real implementation I would rather write my own code in Python. I am still wondering if KNIME could be of some use in this case.
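For reference, the standalone Python version I have in mind would look roughly like this (a minimal sketch using xgboost and scikit-learn; the file and column names are just placeholders, not from the actual dataset):

```python
# Minimal sketch: gradient boosted regression scored by R^2,
# roughly what the KNIME Learner/Predictor pair does.
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

df = pd.read_excel("data.xlsx")                     # placeholder file name
y = df["target"]                                    # placeholder target column
X = pd.get_dummies(df.drop(columns=["target"]))     # one-hot encode categoricals

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1)
model.fit(X_train, y_train)
print("R^2:", r2_score(y_test, model.predict(X_test)))
```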

Rgds,

Leo

P.S. I can send you an excerpt of the model with data if needed… :slight_smile:

Hello Leo,

There is a known bug that can occur for categorical columns with very many possible values (3000 certainly qualifies).
Your data also seems to contain a lot of ID columns (judging from their names), which shouldn't be used for learning a model since they don't generalize to new data.
Please correct me if these assumptions are wrong; in that case a workflow with data would greatly help the further investigation.
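As a quick sanity check outside KNIME (a minimal sketch in Python with pandas; the file name is a placeholder and the 90% threshold is just an illustrative heuristic), you could inspect the cardinality of the non-numeric columns and drop the ID-like ones before learning:

```python
# Minimal sketch: inspect categorical cardinality and drop ID-like columns.
import pandas as pd

df = pd.read_excel("data.xlsx")  # placeholder file name

# Number of distinct values per non-numeric column
cardinality = df.select_dtypes(exclude="number").nunique().sort_values(ascending=False)
print(cardinality)

# Columns whose cardinality is close to the row count behave like IDs
# and carry no generalizable signal, so drop them before training.
id_like = cardinality[cardinality > 0.9 * len(df)].index
df = df.drop(columns=id_like)
```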

Cheers,
nemad

1 Like

Hi nemad.

Thanks again for your help. Really appreciated!

The following link contains the .knwf workflow and the dataset in .xlsx format.

https://drive.google.com/file/d/1uAcKjkztSkVf9vgBD5QpcvErwwS577Ue/view?usp=sharing

Rgds,

Leo

I get the same error messages. Handling missing values, removing special characters from variable names, and other things did not help. This looks like a bug, maybe caused by too many categorical values or something else.

So I set up the whole thing with my current favourite model package from H2O.ai, using the Gradient Boosting Machine (for regression). This does work. The question is whether it would be a sufficient replacement for your task.

I did not run all 4,000+ loop iterations, only a few, to see whether this would be stable. The H2O nodes are all set to keep all data in memory. If your final data is too big or you run into problems, you might have to revert that setting.

And of course a lot of models and preparations can also be done in Python, but KNIME has all the nice nodes and flows :slight_smile:
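If you do end up moving this to Python, the same H2O GBM is available there as well. A minimal sketch with the h2o Python package (file and column names are placeholders):

```python
# Minimal sketch: H2O Gradient Boosting Machine for regression in Python.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
frame = h2o.import_file("data.csv")                       # placeholder file name
train, test = frame.split_frame(ratios=[0.8], seed=42)

predictors = [c for c in train.columns if c != "target"]  # placeholder target column
gbm = H2OGradientBoostingEstimator(ntrees=100, max_depth=5, seed=42)
gbm.train(x=predictors, y="target", training_frame=train)

print("R^2 on test:", gbm.model_performance(test).r2())
```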

kn_elpa_test.knar (52.9 KB)

1 Like

Slightly off-topic, but I wanted to point out that the workflow you show is, in my opinion, flawed anyway. A single partitioning is simply not going to provide a result that generalizes. Put differently, the difference in model performance will most likely be affected more by the choice of observations (e.g. the partitioning) than by a single feature.
With this setup you are optimizing for this one specific partition (assuming the seed is fixed; if the seed isn't fixed, then you are comparing apples to oranges with the scores).

So to actually make this meaningful, IMHO you would need to cross-validate each feature combination, which only makes it more computationally expensive. Hence I'm not really a fan of feature selection this way, especially for trees: trees should be relatively immune to unimportant features. For feature selection I would simply go with a linear correlation filter and a low variance filter, as in the sketch below.
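For illustration, here is roughly what that would look like in code (a minimal sketch in Python with scikit-learn; file name, column names, and thresholds are placeholders): a cross-validated R^2 score for one feature set, plus the kind of low-variance and correlation filters I mean.

```python
# Minimal sketch: cross-validated R^2 for one feature set, plus a simple
# low-variance filter and a pairwise correlation filter.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import cross_val_score

df = pd.read_excel("data.xlsx")                     # placeholder file name
y = df["target"]                                    # placeholder target column
X = pd.get_dummies(df.drop(columns=["target"]))     # one-hot encode categoricals

# Cross-validated R^2 instead of a single train/test split
scores = cross_val_score(GradientBoostingRegressor(), X, y, cv=5, scoring="r2")
print("mean R^2:", scores.mean())

# Low-variance filter: drop near-constant columns
vt = VarianceThreshold(threshold=0.01).fit(X)
X_var = X.loc[:, vt.get_support()]

# Correlation filter: drop one column from each highly correlated pair
corr = X_var.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X_filtered = X_var.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])
```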

2 Likes