H2O Random Forest Learner error

Hi,

I created an H2O Sparkling Water Context over a Spark Livy Context using only KNIME nodes on GCE Dataproc and I have this error at the H2O Random Forest Learner:

“ERROR : KNIME-Worker-29-H2O Random Forest Learner 3:2688 : : Node : H2O Random Forest Learner : 3:2688 : Execute failed: Job crashed unexpected. Cause: org.knime.bigdata.spark.core.exception.KNIMESparkException: error cannot be computed: too many classes (UnsupportedOperationException) See log for details.
org.knime.ext.h2o.exception.H2OJobCrashedException: Job crashed unexpected. Cause: org.knime.bigdata.spark.core.exception.KNIMESparkException: error cannot be computed: too many classes (UnsupportedOperationException) See log for details.
at org.knime.ext.h2o.jobs.DefaultH2OJobFuture.get(DefaultH2OJobFuture.java:98)
at org.knime.ext.h2o.jobs.AbstractH2OExecutionContext.submit(AbstractH2OExecutionContext.java:76)
at org.knime.ext.h2o.context.DefaultH2OSession.futureOf(DefaultH2OSession.java:213)
at org.knime.ext.h2o.context.DefaultH2OSession.run(DefaultH2OSession.java:369)
at org.knime.ext.h2o.nodes.learner.drf.H2ODRFNodeModel3.run(H2ODRFNodeModel3.java:86)
at org.knime.ext.h2o.nodes.learner.drf.H2ODRFNodeModel3.run(H2ODRFNodeModel3.java:1)
at org.knime.ext.h2o.nodes.AbstractH2OSupervisedNodeModel.execute(AbstractH2OSupervisedNodeModel.java:157)
at org.knime.core.node.NodeModel.executeModel(NodeModel.java:576)
at org.knime.core.node.Node.invokeFullyNodeModelExecute(Node.java:1236)
at org.knime.core.node.Node.execute(Node.java:1016)
at org.knime.core.node.workflow.NativeNodeContainer.performExecuteNode(NativeNodeContainer.java:558)
at org.knime.core.node.exec.LocalNodeExecutionJob.mainExecute(LocalNodeExecutionJob.java:95)
at org.knime.core.node.workflow.NodeExecutionJob.internalRun(NodeExecutionJob.java:201)
at org.knime.core.node.workflow.NodeExecutionJob.run(NodeExecutionJob.java:117)
at org.knime.core.util.ThreadUtils$RunnableWithContextImpl.runWithContext(ThreadUtils.java:334)
at org.knime.core.util.ThreadUtils$RunnableWithContext.run(ThreadUtils.java:210)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.knime.core.util.ThreadPool$MyFuture.run(ThreadPool.java:123)
at org.knime.core.util.ThreadPool$Worker.run(ThreadPool.java:246)
Caused by: java.util.concurrent.ExecutionException: org.knime.bigdata.spark.core.exception.KNIMESparkException: error cannot be computed: too many classes (UnsupportedOperationException)
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.knime.ext.h2o.spark.H2OSparkJobFactory$H2OSimpleSparkJob$1.checkException(H2OSparkJobFactory.java:221)
at org.knime.ext.h2o.spark.H2OSparkJobFactory$H2OSimpleSparkJob$1.(H2OSparkJobFactory.java:175)
at org.knime.ext.h2o.spark.H2OSparkJobFactory$H2OSimpleSparkJob.getStatus(H2OSparkJobFactory.java:173)
at org.knime.ext.h2o.jobs.DefaultH2OJobFuture.get(DefaultH2OJobFuture.java:87)
… 19 more
Caused by: org.knime.bigdata.spark.core.exception.KNIMESparkException: error cannot be computed: too many classes (UnsupportedOperationException)
at org.knime.bigdata.spark2_4.base.LivySparkJob.call(LivySparkJob.java:106)
at org.knime.bigdata.spark2_4.base.LivySparkJob.call(LivySparkJob.java:1)
at org.apache.livy.rsc.driver.BypassJob.call(BypassJob.java:40)
at org.apache.livy.rsc.driver.BypassJob.call(BypassJob.java:27)
at org.apache.livy.rsc.driver.JobWrapper.call(JobWrapper.java:64)
at org.apache.livy.rsc.driver.BypassJobWrapper.call(BypassJobWrapper.java:45)
at org.apache.livy.rsc.driver.BypassJobWrapper.call(BypassJobWrapper.java:27)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)”

My input data has 8360 columns and 3841 categories. Do you have any suggestions?

Thank you,
Mihai

Unfortunately, the H2O nodes in KNIME seem to have problems reproducing the string handling they offer when used together with Spark.

Try label encoding your categorical data if you expect the relationship to be a stable one, or try another method to convert your strings into numbers (one-hot encoding could be a challenge with this many categories).
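For intuition, label encoding simply replaces each distinct category string with an integer code. A minimal plain-Python sketch of the idea behind the Spark label encoder linked below (the function name and example data are made up for illustration):

```python
# Minimal label-encoding sketch: map each distinct category to an integer.
# This only illustrates the concept; in practice the linked Spark workflow
# (or a node like Spark Category To Number) does this on the cluster.

def label_encode(values):
    """Return (encoded_values, mapping) for a list of category strings.

    The mapping must be kept so the same codes can be applied
    consistently to new data at prediction time.
    """
    mapping = {}
    encoded = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)  # assign the next unused integer code
        encoded.append(mapping[v])
    return encoded, mapping

codes, mapping = label_encode(["red", "green", "red", "blue"])
# codes -> [0, 1, 0, 2]; mapping -> {"red": 0, "green": 1, "blue": 2}
```

Note that label encoding imposes an arbitrary order on the categories, which is why it only makes sense if you expect the category-to-number relationship to stay stable.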

https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_bigdata_h2o_automl_spark/s_401_spark_label_encoder

The workflow addresses other issues I have encountered with Spark and H2O. There is also a presentation about it; it is in German, though the charts are in English.

https://hub.knime.com/mlauber71/spaces/Public/latest/kn_example_bigdata_h2o_automl_spark/s_400_spark_h2o_automl_about_this_collection

If you use R, you might want to try vtreat; the problem is that it would be challenging to bring that to a big data environment.

https://hub.knime.com/mlauber71/spaces/Public/latest/automl/kn_automl_h2o_classification_r_vtreat


Hi @mlauber71,

Sorry for the delayed answer. I tried using the Spark Category To Number node to label-encode the input data, but the H2O Random Forest Learner node does not accept a numerical column as the 'Target Column'. My initial data has only a single column of type String; all the other columns are numerical.

Thank you,
Mihai

You should try the one for regression targets.
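For context on why the learner behaves this way: a string target is treated as a classification problem and a numeric target as a regression problem, and the "error cannot be computed: too many classes" message in the stack trace suggests that per-class error metrics (such as a confusion matrix) become infeasible with thousands of target categories. A rough plain-Python sketch of those two checks; the class limit here (1000) is a made-up placeholder, not H2O's actual value:

```python
# Illustrative sketch only; function names and the class limit are
# assumptions for this example, not H2O internals.

def problem_type(target_values):
    """Guess classification vs. regression from the target's value type:
    a string target implies classification, a numeric one regression."""
    if all(isinstance(v, str) for v in target_values):
        return "classification"
    return "regression"

def too_many_classes(target_values, limit=1000):
    """True if the target has more distinct classes than per-class
    metrics (e.g. a confusion matrix) can reasonably handle."""
    return len(set(target_values)) > limit

# A target shaped like the one in this thread: 3841 distinct categories.
target = ["class_%d" % (i % 3841) for i in range(5000)]
kind = problem_type(target)       # -> "classification"
too_many = too_many_classes(target)  # -> True, matching the error above
```

This is why a label-encoded (numeric) target is rejected at 'Target Column' for classification, and why the regression-target variant of the encoder is the one to try.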
