I'm using KNIME 2.7.0. The Learner gets an input table of ~6.1k rows and 100 columns. For the settings I use the ones described in the link above for RandomForest, with the number of models set to 1k.
I get the following warning and error:
WARN RearrangeColumnsTable$ConcurrentNewColCalculator Unhandled exception in processFinished.
ERROR Tree Ensemble Learner Execute failed: java.util.concurrent.ExecutionException: java.lang.ArrayIndexOutOfBoundsException: -1
I tried to change the settings a bit and to filter some columns from the input table, but that didn't help.
Thanks for reporting this problem, which is very likely a bug in the node configuration. To locate the problem we need the full stack trace, which is available in the KNIME Console view. Please switch on DEBUG logging under File > Preferences > KNIME GUI. Thanks.
I had the same problem with my Tree Ensemble Learner. Oddly enough, all I did was remove some #N/A strings that I had after exporting my data from Excel.
Just came across this post. If ymiladi or anyone else runs into the problem, it would be extremely useful if you could attach an example flow or data. I'm still not able to reproduce it :-( and N/A shouldn't be a big deal, as this node doesn't accept missing values anyway (it will abort with a reasonable error message).
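For anyone who wants to check their exported data for the same thing before feeding it to the learner, here is a minimal sketch in plain Python (outside KNIME). It only scans rows for the standard Excel error strings such as #N/A. The function names and the list-of-lists row format are assumptions made for illustration, and it is not confirmed in this thread that such cells actually cause the exception.

```python
# Standard Excel error strings that can end up as plain text in an export.
EXCEL_ERROR_MARKERS = {
    "#N/A", "#DIV/0!", "#VALUE!", "#REF!", "#NAME?", "#NUM!", "#NULL!",
}


def find_excel_error_cells(rows):
    """Return (row_index, column_index, value) for each cell holding an Excel error string.

    `rows` is assumed to be a list of rows, each row a list of cell values.
    """
    hits = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if isinstance(value, str) and value.strip() in EXCEL_ERROR_MARKERS:
                hits.append((r, c, value))
    return hits


def drop_rows_with_error_cells(rows):
    """Return only the rows that contain no Excel error string."""
    bad_rows = {r for r, _, _ in find_excel_error_cells(rows)}
    return [row for i, row in enumerate(rows) if i not in bad_rows]
```

Listing the hits first, rather than dropping rows right away, makes it easier to see which columns are affected.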
Sorry, the Gradient Boosted Trees Predictor error is inconsistent and most likely has to do with the size of the input and/or the parameters. When the input to the predictor included only the minimally necessary variables (i.e., those used for training) and I unchecked Append individual class probabilities, the problem disappeared. Adding the large input variables back or checking the probabilities option seems to cause the problem. Sorry, I cannot share my data. The workflow is simple:
Create the GBT and save it using Model Writer. Read it again using Model Reader, and provide an input data set. I trained using only numerical (double) variables and removed all categorical independent variables. The Target column was of course a String categorical variable. The input data file is actually the same for both training and predicting. I randomly sample based on Target (kMeans Clusters), and predict again for another random sample. Essentially I am testing whether GBT can predict the Clusters based on the other variables.
Hope that helps a bit. If I find another situation where I could share more, I shall post again. Thanks!
I'm consistently able to reproduce the GBT predictor error. The steps are: Read data, GBT Learner, Model Writer, and in a separate workflow Read Data, Model Reader, GBT Predictor with Append individual class probabilities checked, and the error occurs. With the option unchecked it works. Thanks.
DEBUG Gradient Boosted Trees Predictor 0:116 reset
DEBUG String Manipulation 0:131 String Manipulation 0:131 doBeforePostExecution
ERROR Gradient Boosted Trees Predictor 0:116 Execute failed: ("ArrayIndexOutOfBoundsException"): -1
DEBUG String Manipulation 0:131 String Manipulation 0:131 has new state: POSTEXECUTE
DEBUG Gradient Boosted Trees Predictor 0:116 Execute failed: ("ArrayIndexOutOfBoundsException"): -1
java.lang.ArrayIndexOutOfBoundsException: -1
at org.knime.base.node.mine.treeensemble2.model.MultiClassGradientBoostedTreesModel.getClassLabel(MultiClassGradientBoostedTreesModel.java:139)
at org.knime.base.node.mine.treeensemble2.node.gradientboosting.predictor.classification.LKGradientBoostingPredictorCellFactory.getCells(LKGradientBoostingPredictorCellFactory.java:165)
at org.knime.core.data.container.RearrangeColumnsTable.calcNewCellsForRow(RearrangeColumnsTable.java:503)
at org.knime.core.data.container.RearrangeColumnsTable$ConcurrentNewColCalculator.compute(RearrangeColumnsTable.java:732)
at org.knime.core.data.container.RearrangeColumnsTable$ConcurrentNewColCalculator.compute(RearrangeColumnsTable.java:1)
at org.knime.core.util.MultiThreadWorker$ComputationTask$1.call(MultiThreadWorker.java:442)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.knime.core.util.ThreadUtils$RunnableWithContextImpl.runWithContext(ThreadUtils.java:328)
at org.knime.core.util.ThreadUtils$RunnableWithContext.run(ThreadUtils.java:204)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.knime.core.util.ThreadPool$MyFuture.run(ThreadPool.java:123)
at org.knime.core.util.ThreadPool$Worker.run(ThreadPool.java:246)
DEBUG String Manipulation 0:131 String Manipulation 0:131 doAfterExecute - success
DEBUG String Manipulation 0:131 String Manipulation 0:131 has new state: EXECUTED
DEBUG String Manipulation 0:131 Column Filter 0:133 has new state: CONFIGURED_QUEUED
DEBUG Gradient Boosted Trees Predictor 0:116 Gradient Boosted Trees Predictor 0:116 doBeforePostExecution
DEBUG Gradient Boosted Trees Predictor 0:116 Gradient Boosted Trees Predictor 0:116 has new state: POSTEXECUTE
Do you also write the model out and read it in a different workflow?
This is important to know because your log indicates that the problem is a different one (based on the line where the exception is thrown).
I have an intuition about what the problem might be, but reproducing it is kind of tricky (you would have to provoke a numerical overflow). Can you provide an example workflow where this problem exists? That would be immensely helpful to confirm my suspicion.
But from what you wrote I get the feeling that there is no minimal example. But can you give some more information on your dataset? Number of features, what kind of features (numerical/nominal) and the number of classes/categories. With this information I might be able to reproduce the problem in my setup.
The data I use isn't that complicated; I made a minimal data setup that contains 5 columns:
1st column is a start date in unix time as Double, example 1480055077889
2nd column is a webpage identifier as String, example start.homePage
Columns 3-5 are Integer columns in the range 0 - 600
I use around 10k samples for training the GBT. Training the learner with the 2nd column (webpage) as target is OK, but the GBT predictor fails immediately with the ArrayIndexOutOfBoundsException.
Some new info: it seems to work with 1k samples and fails with 2k samples of training data.
I may have found a clue with some good old trial and error. It seems that, for me, the length of the string used as the target for training the GBT (in my case the webpage) makes a difference. With a maximum of 53 characters all seems to go well; with 54 characters the predictor crashes. Maybe this helps you reproduce the problem?
To be honest, your last posts make me really scratch my head.
I literally can't imagine what I could have messed up to make the prediction depend on the length of the class names.
However, maybe there is a possible explanation. Do you limit the length of the class names by converting the names, i.e. keeping all rows but limiting the class name to 53 characters? Or do you filter out all rows which have a class that exceeds the 53 characters?
In the first case I would suggest enumerating all classes and using the number of the class as the new target for learning the model and predicting with it. Afterwards you can just replace the number with the actual class name again.
In the second case it would be really interesting to know how many different classes you have in total. A really high number of classes would also explain why the GBT scales so badly in your use case, because I never experienced such problems with datasets that are considerably larger and contain more features.
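The enumeration idea can be sketched in plain Python as follows. This is only an illustration of the mapping, not something taken from the node itself. The function names are made up, and the codes are short strings such as "c0" rather than raw integers, on the assumption that the learner wants a nominal (String) target column.

```python
def build_class_index(labels):
    """Map each distinct class name to a short code, in order of first appearance."""
    index = {}
    for label in labels:
        if label not in index:
            # Short string code (assumption: the target column must stay nominal).
            index[label] = f"c{len(index)}"
    return index


def encode(labels, index):
    """Replace each class name with its short code before training."""
    return [index[label] for label in labels]


def decode(codes, index):
    """Replace each predicted code with the original class name."""
    reverse = {code: label for label, code in index.items()}
    return [reverse[code] for code in codes]
```

The same index has to be used for training and prediction, so it should be stored alongside the model.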
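To answer both questions at once, something like the following sketch (plain Python, hypothetical function name) could be run on the target column. It counts the distinct classes, reports how many names exceed a length limit, and shows how many distinct classes would remain if names were simply truncated. The default limit of 53 is just the value reported above.

```python
from collections import Counter


def summarize_classes(labels, length_limit=53):
    """Summarize class count and class-name lengths for a target column."""
    counts = Counter(labels)
    too_long = sorted(name for name in counts if len(name) > length_limit)
    return {
        # Number of different class names in the column.
        "distinct_classes": len(counts),
        # Length of the longest class name (0 for an empty column).
        "max_name_length": max((len(name) for name in counts), default=0),
        # Class names longer than the limit, and how many rows carry them.
        "names_over_limit": too_long,
        "rows_over_limit": sum(counts[name] for name in too_long),
        # Truncating names can merge classes that differ only after the limit.
        "distinct_after_truncation": len({name[:length_limit] for name in counts}),
    }
```

If "distinct_after_truncation" is smaller than "distinct_classes", truncating changes the learning problem itself, not just the names.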
To be honest I hope that the second option is the case because this issue literally deprives me of my sleep =D
Thanks for your help in figuring out this problem,