ArrayIndexOutOfBoundsException using Tree Ensemble Learner

Hi,

It seems to have nothing to do with the length of strings. I'm still struggling with this issue... I converted all string features except the target column to int features, but I still get the ArrayIndexOutOfBoundsException. My target column has around 250 possible values... I'm still experimenting with this strange behaviour; next I'll limit the number of values in my target column. The previously long strings I converted to numbers, which I then converted back to strings, so the length shouldn't be a problem anymore.
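For reference, this is roughly the kind of conversion I mean (a rough pandas sketch with hypothetical file and column names, not my actual workflow):

```python
# Rough sketch (hypothetical file and column names): map every string feature
# except the target column to integer codes, leaving the target untouched.
import pandas as pd

df = pd.read_csv("data.csv")  # assumed export of my training table
for col in df.select_dtypes(include="object").columns:
    if col != "target":                     # keep the (string) target column as-is
        df[col] = pd.factorize(df[col])[0]  # string values -> integer codes
```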

Regards,

Ramon

Hi Ramon,

your last tip was really helpful, as I am now able to reproduce the problem.
250 classes are quite a lot, and this also explains why your learner is so slow: it has to build one tree per class for every boosting iteration, which in your case amounts to 25,000 trees.

With fewer classes, things should work out, and training will also be much faster.
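If it helps to see the scaling, here is a small sketch with scikit-learn as a stand-in for the KNIME learner (not the node's actual implementation): multiclass gradient boosting fits one regression tree per class in every boosting iteration, so the total tree count is iterations × classes.

```python
# Sketch with scikit-learn as a stand-in (not the KNIME node itself):
# multiclass gradient boosting fits one tree per class per boosting iteration.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_classes=5, n_informative=6,
                           n_clusters_per_class=1, random_state=0)

gbt = GradientBoostingClassifier(n_estimators=100).fit(X, y)
print(gbt.estimators_.shape)  # (100, 5): 100 iterations x 5 classes = 500 trees
```

With 250 classes, the same 100 iterations would already mean 25,000 trees.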

Cheers,

nemad

Received the following error: ERROR Gradient Boosted Trees Predictor 0:21:10 Execute failed: (“ArrayIndexOutOfBoundsException”): null on the Gradient Boosted Trees Predictor while I was running it in a feature selection loop (as part of a forward feature elimination metanode). The weird part is that the feature selection loop completed one round before it decided to throw this error. I cannot share the dataset here, but my data has 500k rows and 380 columns, from which I was trying to select the top 10 features. I have reset and run the loop a couple of times, and the error is thrown at different iterations, which is again mind-boggling!

Hello anujloomba,

yes, mind-boggling is a pretty accurate description for this bug.
Just to be sure your issue is similar to the other ones in this thread, how many classes does your dataset have?

I hope that, with your help, we will finally be able to track down and fix this nasty bug for good.
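If you are unsure about the class count, a GroupBy or Value Counter node on the target column will tell you, or something like this on an exported table (hypothetical file and column names):

```python
# Quick check of the number of distinct classes (hypothetical names).
import pandas as pd

df = pd.read_csv("training_data.csv")        # assumed export of the training table
print(df["target"].nunique())                # number of distinct classes
print(df["target"].value_counts().head())    # the most frequent classes
```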

Cheers,
nemad


I have the same problem. In my case I split a table (around 57k rows) into 10 parts (approx. 6k rows each); some work and some don't, even though the configuration of each predictor is the same. I am using the Gradient Boosted Trees Predictor.

Hello @manugor,

can you quickly check how many different classes your dataset contains?
While this is not an excuse for this issue, you should consider whether Gradient Boosted Trees are the right model if you have a lot of classes because the number of trees (and therefore the training time and final model size) grows linearly with the number of classes.
A random forest might be a better fit for this kind of problem because the model size (and to some degree the training time) is independent of the number of classes.
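A small illustration of that size argument, again with scikit-learn as a stand-in for the KNIME nodes:

```python
# Sketch (scikit-learn as a stand-in): a random forest keeps n_estimators trees
# regardless of the number of classes, while multiclass gradient boosting keeps
# n_estimators * n_classes trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=1000, n_classes=10, n_informative=8,
                           n_clusters_per_class=1, random_state=0)

rf = RandomForestClassifier(n_estimators=100).fit(X, y)
gbt = GradientBoostingClassifier(n_estimators=100).fit(X, y)

print(len(rf.estimators_))   # 100 trees, independent of the 10 classes
print(gbt.estimators_.size)  # 100 * 10 = 1000 trees
```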

Kind regards,

Adrian

The predicted variable has 37 classes. I use 2 numeric variables to build the model.
I have tried increasing the number of rows in the partition used by the prediction model, and it works (moving from 80% training data to a smaller fixed number of rows).
Nevertheless, it behaves differently for different sets.

I have the same error, using 4 features: 1 numerical (0-1) and 3 strings (a couple of classes each, ca. 3-15). The target is a 5-class string.

In the default configuration it crashes with the error above (tree depth limited to 4, learning rate 0.1, and 100 tree models).
Switching to 50 tree models works fine.

I have tried with a different model, and the error reproduces when the sample size used by the learner is bigger than 5000.
The learner works fine… but the predictor crashes.
I don't know whether it has to do with a particular value or not; I am using random sampling.

Quick update: This issue should be resolved with the next bugfix release (4.0.1).

Cheers,

Adrian


I am having the same issue.

Hello @asenkron,

are you running KNIME Analytics Platform 4.0.1?
If so, could you provide me with a workflow that reproduces the problem or at least answer some questions about your setting?
How many rows and columns does your table have?
How many classes does it contain?

@nemad,

I can consistently reproduce this on v4.1.0. I’m not willing to post the workflow publicly because of intellectual property concerns, but I’ll answer any questions I can.

Rows: 4
Columns: 2029, 1 string (the Target), 1 Local Date, the rest are all Number (double).

Note: the model is trained on only 93 rows of data with 57 columns (1 string, 56 doubles). The prediction itself, where the error occurs, is on the 4-row / 2029-column data set.

Hello @cybrkup,

I don’t understand your setup completely.
Your training set has only 57 columns, while the test set has 2029?
Are the 57 columns you trained on included in the 2029?
Another important detail for debugging this is the number of classes you have.

Best,

Adrian

After dimensionality reduction, the 2029 columns are reduced down to 57. I just haven’t gotten around to rationalizing the columns pre- and post- dimensionality reduction, hence the discrepancy; since the extra columns are just ignored, it wasn’t a pressing concern.

Three classes.

Could you provide me with the stacktrace of the exception?
You can find it under View->Open KNIME Log (or in the console depending on your configuration).
I can’t reproduce the issue without the data. I have tried dummy data but couldn’t get the node to crash.

Best,

Adrian