Big data: why is it required to reset Spark Context?

peleitor · January 16, 2020, 10:38am

I have a workflow which uses the local big data environment. The workflow was executed and saved.

But when I open the saved workflow to re-execute it, it complains about the (lost?) Spark context:

ERROR Spark to Table 0:15 Execute failed: Spark context ‘sparkLocal://knimeSparkContext’ does not exist in the cluster. Please create a context first.

You can fix this just by resetting and re-executing the “Create Local Big Data Environment” node, but of course that takes a lot of time, because all the downstream nodes have to be re-executed as well.

Why is this? Can it be avoided?

Thanks!

sascha.wolke · January 16, 2020, 5:00pm

Hi @peleitor,

the Create Local Big Data Environment node has an option called “Action to perform on dispose”. This option controls whether the Spark context is destroyed when the workflow is closed, but it does not help if you restart KNIME. The Spark session can’t be persisted to disk; only the data transferred down to KNIME can be persisted. That’s why you need to launch a new Spark session after restarting KNIME, and all nodes depending on the Spark session must be re-executed too.
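
To picture why the saved workflow fails after a restart, here is a toy model in plain Python. It is not KNIME or Spark code: `ContextRegistry`, `save_workflow`, `load_workflow` and the JSON layout are made up for illustration. The assumption it encodes is the one described above: the saved workflow keeps only the context ID and the node status on disk, while the context itself lives in process memory.

```python
import json
import os
import tempfile


class ContextRegistry:
    """Toy stand-in for the Spark contexts alive in one running process."""

    def __init__(self):
        self._contexts = {}  # in memory only, gone when the process ends

    def create(self, context_id):
        self._contexts[context_id] = {"cached_rows": [1, 2, 3]}

    def get(self, context_id):
        if context_id not in self._contexts:
            raise LookupError(
                f"Spark context '{context_id}' does not exist. "
                "Please create a context first."
            )
        return self._contexts[context_id]


def save_workflow(path, context_id):
    # Only the ID and the node status reach the disk, not the context itself.
    with open(path, "w") as f:
        json.dump({"context_id": context_id, "status": "executed"}, f)


def load_workflow(path):
    with open(path) as f:
        return json.load(f)


workflow_file = os.path.join(tempfile.mkdtemp(), "workflow.json")

# First session: create the context, execute the nodes, save the workflow.
session_1 = ContextRegistry()
session_1.create("sparkLocal://knimeSparkContext")
save_workflow(workflow_file, "sparkLocal://knimeSparkContext")

# "Restart": a fresh process starts with an empty registry.
session_2 = ContextRegistry()
saved = load_workflow(workflow_file)

try:
    session_2.get(saved["context_id"])
    lookup_failed = False
    message = ""
except LookupError as err:
    lookup_failed = True
    message = str(err)
```

In this sketch the saved file still says `executed`, but the lookup in the second session fails, which mirrors the situation in the question: the node status survives on disk, the context it points to does not.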

mlauber71 · January 16, 2020, 9:07pm

This is due to the way Spark works: it operates mainly in the system’s memory, which makes it fast but also temporary.

It could make sense to familiarise yourself with some key concepts like lazy evaluation, since they heavily influence how Spark works and what you will encounter once you start using it with KNIME, not least persist and unpersist.
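
As a rough illustration of lazy evaluation and persist/unpersist, here is a toy model in plain Python. It is only an analogy, not Spark’s API: the `LazyDataset` class and its `compute_count` counter are invented for this sketch. Transformations only record a step, an action triggers the computation, and persisting keeps the result in memory until it is released.

```python
class LazyDataset:
    """Toy model of a lazily evaluated dataset (an analogy, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)
        self._cache = None
        self._persisted = False
        self.compute_count = 0  # how often the full pipeline actually ran

    def map(self, fn):
        # Transformation: only records the step, computes nothing yet.
        return LazyDataset(self._source, self._steps + (fn,))

    def persist(self):
        # Ask for the next computed result to be kept in memory.
        self._persisted = True
        return self

    def unpersist(self):
        # Drop the in-memory copy; later actions recompute from the source.
        self._persisted = False
        self._cache = None
        return self

    def collect(self):
        # Action: runs the recorded steps unless a cached result exists.
        if self._cache is not None:
            return list(self._cache)
        self.compute_count += 1
        rows = list(self._source)
        for fn in self._steps:
            rows = [fn(row) for row in rows]
        if self._persisted:
            self._cache = rows
        return list(rows)


doubled = LazyDataset(range(5)).map(lambda x: x * 2)
count_before_action = doubled.compute_count   # nothing computed yet

first = doubled.collect()                     # computes
second = doubled.collect()                    # computes again, no cache
count_without_persist = doubled.compute_count

doubled.persist()
doubled.collect()                             # computes once and caches
doubled.collect()                             # served from memory
count_with_persist = doubled.compute_count

doubled.unpersist()
doubled.collect()                             # cache gone, computes again
count_after_unpersist = doubled.compute_count
```

The cached result in this sketch exists only as an attribute of a live object, which is the same reason a persisted Spark dataset does not survive the end of the session that holds it.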

mlauber71 · January 17, 2020, 5:15am

…I have no idea why the preview from Medium gives us such a large picture like the one above. I hope it does not detract from the content.

ipazin · January 17, 2020, 10:16am

Never mind. It’s a nice picture
Ivan

peleitor · January 20, 2020, 10:07am

IMHO if KNIME does offer you the chance of a local Spark context, then it should either:

automatically reset the node if the Spark context is destroyed upon closing KNIME (instead of marking subsequent Spark nodes as executed), or
give you the chance of persisting Spark memory into storage, just like caching KNIME tables

This matter is separate from characteristics of the Spark paradigm such as laziness or parallelization.

Cheers,
Fernando
