Spark Predictor (Classification) error: Execute failed: empty collection (UnsupportedOperationException)

Hi,

I am trying to run the Spark Predictor (Classification) node with a model created on GCP Dataproc, and I get this error:
"2019-12-31 10:41:54,991 : ERROR : KNIME-Worker-29-Spark Predictor (Classification) 2:2678 : : Node : Spark Predictor (Classification) : 2:2678 : Execute failed: empty collection (UnsupportedOperationException)
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1380)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.first(RDD.scala:1377)
at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:615)
at org.apache.spark.ml.util.DefaultParamsReader$.loadParamsInstance(ReadWrite.scala:650)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:274)
at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$4.apply(Pipeline.scala:272)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:272)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:348)
at org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:342)
at org.apache.spark.ml.util.MLReadable$class.load(ReadWrite.scala:380)
at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:332)
at org.apache.spark.ml.PipelineModel.load(Pipeline.scala)
at org.knime.bigdata.spark2_4.jobs.namedmodels.NamedModelUploaderJob.runJob(NamedModelUploaderJob.java:55)
at org.knime.bigdata.spark2_4.jobs.namedmodels.NamedModelUploaderJob.runJob(NamedModelUploaderJob.java:1)
at org.knime.bigdata.spark.local.wrapper.LocalSparkWrapperImpl.runJob(LocalSparkWrapperImpl.java:123)
at org.knime.bigdata.spark.local.context.LocalSparkJobController.lambda$1(LocalSparkJobController.java:92)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)"

I used the Spark Random Forest Learner node on GCP to create the model, saved it locally, and then used a Model Reader node to read the model and plugged it into the Spark Predictor (Classification) node. I tried running the Predictor node on a local context and on a GCP context, with the same result.
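
For reference, the failing call in the stack trace comes from Spark restoring the pipeline via PipelineModel.load, which reads the model's metadata file as a text RDD and calls first() on it. A minimal sketch of that save/load round trip (Scala, Spark 2.4; the data, paths, and column names are placeholders, not from my workflow):

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("model-roundtrip").getOrCreate()
import spark.implicits._

// Tiny placeholder training set with the columns the learner expects.
val train = Seq((0.0, 1.0, 0.0), (1.0, 0.0, 1.0)).toDF("f1", "f2", "label")
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2")).setOutputCol("features")
val rf = new RandomForestClassifier()
  .setLabelCol("label").setFeaturesCol("features")
val model = new Pipeline().setStages(Array(assembler, rf)).fit(train)

// Saving writes a directory with metadata/ and stages/ subdirectories.
model.write.overwrite().save("/tmp/rf-model")  // placeholder path

// Loading reads metadata/part-* via sc.textFile(...).first(); if that file
// is missing or empty (e.g. after an incomplete copy), first() throws
// java.lang.UnsupportedOperationException: empty collection.
val restored = PipelineModel.load("/tmp/rf-model")
```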

Thank you,
Mihai

Hi @mihais1,
It looks like first() is run on an empty RDD. Maybe some filters are applied that result in an empty RDD at this point. Does the input data match the training data you used?
How do you load the input data?
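
For example, a quick emptiness check on the Spark side would look roughly like this (a minimal sketch; the input path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Placeholder input path; head(1) avoids counting the whole dataset.
val inputDf = spark.read.parquet("/path/to/input")
if (inputDf.head(1).isEmpty) {
  println("Input is empty at this point - check upstream filters")
}
```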

Best,
Mareike

Hi @mareike.hoeger,

I found the issue: the RDD is not empty, but the data in it has different types than the data the model was trained on.
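
For anyone who runs into the same thing, a minimal sketch of comparing the two schemas and casting a mismatched column (paths and the column name are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

// Placeholder paths; compare the prediction input schema against training.
val trainDf = spark.read.parquet("/path/to/training")
val inputDf = spark.read.parquet("/path/to/input")
trainDf.printSchema()
inputDf.printSchema()

// Cast a mismatched column to the type the model was trained on
// (placeholder column name).
val fixed = inputDf.withColumn("f1", col("f1").cast("double"))
```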

Thank you,
Mihai
