AWS EMR - Spark Scorer - java.lang.OutOfMemoryError

Hello there,

I’m using KNIME 4.0.2 and AWS EMR 5.23.0, and I have a problem passing a 4.5 TB Spark input to the Spark Scorer node:
Spark Random Forest Learner -> Spark Predictor (Classification) -> Spark Scorer
I tried 300 GB on the executors and 700 GB on the master node, but no configuration works for the Spark Scorer.

2019-10-29 22:10:26,847 : ERROR : KNIME-Worker-30 : : Node : Spark Scorer : 0:2670 : Execute failed: An error occured. For details see View > Open KNIME log.
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:400)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2056)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
at org.apache.spark.rdd.RDD$$anonfun$aggregate$1.apply(RDD.scala:1124)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.aggregate(RDD.scala:1117)
at org.apache.spark.api.java.JavaRDDLike$class.aggregate(JavaRDDLike.scala:426)
at org.apache.spark.api.java.AbstractJavaRDDLike.aggregate(JavaRDDLike.scala:45)
at org.knime.bigdata.spark2_4.api.RDDUtilsInJava.aggregatePairs(RDDUtilsInJava.java:429)
at org.knime.bigdata.spark2_4.jobs.scorer.AccuracyScorerJob.doScoring(AccuracyScorerJob.java:56)
at org.knime.bigdata.spark2_4.jobs.scorer.AbstractScorerJob.runJob(AbstractScorerJob.java:56)
at org.knime.bigdata.spark2_4.jobs.scorer.AbstractScorerJob.runJob(AbstractScorerJob.java:1)
at org.knime.bigdata.spark2_4.base.LivySparkJob.call(LivySparkJob.java:90)
at org.knime.bigdata.spark2_4.base.LivySparkJob.call(LivySparkJob.java:1)
at org.apache.livy.rsc.driver.BypassJob.call(BypassJob.java:40)
at org.apache.livy.rsc.driver.BypassJob.call(BypassJob.java:27)
at org.apache.livy.rsc.driver.JobWrapper.call(JobWrapper.java:57)
at org.apache.livy.rsc.driver.BypassJobWrapper.call(BypassJobWrapper.java:42)
at org.apache.livy.rsc.driver.BypassJobWrapper.call(BypassJobWrapper.java:27)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Thanks,

Hi @sirev,

this sounds like this question:

Did you tune the driver settings to use more memory?
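For reference, with a Livy-based Spark context the driver memory is usually raised through custom Spark settings on the context, not through knime.ini (the values below are illustrative assumptions, not recommendations):

```
# Example custom Spark settings (e.g. in the Create Spark Context (Livy) node):
spark.driver.memory          32g
spark.driver.maxResultSize   16g   # results of aggregates are returned through the driver
spark.driver.memoryOverhead  4g
```

Note that -Xmx in knime.ini only sizes the local KNIME JVM; it has no effect on the Spark driver running on the cluster.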


Yes, that didn’t work. I used the YARN settings to allocate as much memory as I had, and I also increased the memory on both the executor and the master side:

master - 16 GB / 700 GB
executor - 16 GB / 300 GB / 700 GB

The Spark Scorer starts to execute, then stops with that error after 2-3 seconds. Everything looks fine on the cluster side, but the Spark Scorer can’t finish because of some limit. In knime.ini I also use
-Xmx848g

In the Spark UI I can see that the Spark Scorer runs an aggregate over the RDDs, which collects all the data in one place, whether that is the driver or a core node; in YARN mode it runs on the CORE side, which has the most resources allocated. Which setting do I need to increase to proceed?
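The stack trace fits that picture: AccuracyScorerJob.doScoring aggregates per-(actual, predicted) counts across the RDD and serializes the result back through the driver, so the collected structure grows with the square of the number of distinct labels, not with the input size. A minimal plain-Python sketch of that kind of aggregation (no Spark; `confusion_counts` is a hypothetical name):

```python
from collections import Counter

def confusion_counts(pairs):
    """Aggregate (actual, predicted) pairs into confusion-matrix counts,
    mirroring what an accuracy-scorer aggregate collects on the driver."""
    return Counter(pairs)

pairs = [("a", "a"), ("a", "b"), ("b", "b"), ("b", "b")]
counts = confusion_counts(pairs)
accuracy = (sum(n for (actual, pred), n in counts.items() if actual == pred)
            / sum(counts.values()))
# With L distinct labels the aggregated structure can hold up to L*L entries,
# so a label column with very many distinct values can exhaust driver memory
# no matter how much executor memory is configured.
```

With a handful of class labels this structure stays tiny; scoring against a near-continuous column, however, yields a huge number of distinct (actual, predicted) pairs.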

Hi @sirev,

the Spark Scorer can run into trouble if your data has many distinct labels. As a workaround, you can try Spark’s MulticlassMetrics in a Spark DataFrame Snippet:

ScorerDataset20191121.knwf (27.3 KB)

The snippets in the workflow expect a DataFrame with the columns label and prediction as input. Do the snippets work? What does the accuracy snippet produce?
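As a sanity check on what an accuracy snippet computes: accuracy only needs a fixed-size (correct, total) state, which is why a metrics-style aggregation can scale where collecting a full confusion matrix may not. A plain-Python sketch of the seqOp/combOp pattern that RDD.aggregate uses (the names `acc_seq` and `acc_comb` are mine):

```python
from functools import reduce

def acc_seq(state, row):
    """seqOp: fold one (label, prediction) row into a (correct, total) state."""
    correct, total = state
    label, prediction = row
    return (correct + (label == prediction), total + 1)

def acc_comb(a, b):
    """combOp: merge the (correct, total) states of two partitions."""
    return (a[0] + b[0], a[1] + b[1])

# Two mock "partitions" of (label, prediction) rows.
partitions = [[(1, 1), (1, 0)], [(0, 0), (0, 0), (1, 1)]]
correct, total = reduce(acc_comb,
                        (reduce(acc_seq, part, (0, 0)) for part in partitions))
accuracy = correct / total  # state size is constant regardless of label count
```

The state stays at two numbers per partition, so the result serialized back to the driver is tiny even for billions of rows.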

