Spark context: How to cache the intermediate data and not write it back while doing a series of transformations on the data

Hi,

Once you have used the Spark context to create a connection to the Spark cluster, how can you cache the intermediate data and avoid writing it back while doing a series of transformations on the data (e.g. if we use a series of KNIME Spark/PySpark nodes)?

Hi @vipul,
I am not sure what you mean with caching and writing back in that context.
All data between Spark nodes stays in your Spark cluster. Only if you view intermediate results or use the Spark2Table node is data transferred between the cluster and your local machine. Actually, due to the lazy evaluation in Spark, the intermediate data might not even be produced once a node is executed, but only when the data is needed for an operation (e.g. model learning).
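For illustration, here is a minimal PySpark sketch of what lazy evaluation means outside of KNIME (the input path and column names are made up): the filter and withColumn calls only build an execution plan, and nothing is actually computed until an action such as count() runs.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

# Hypothetical input path and columns, for illustration only
df = spark.read.parquet("/data/events.parquet")

# Transformations: these only build an execution plan, nothing runs yet
filtered = df.filter(F.col("status") == "ok")
enriched = filtered.withColumn("year", F.year("timestamp"))

# Only an action (count, write, collect, ...) triggers the actual computation
print(enriched.count())
```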

best Mareike


Hi @mareike.hoeger

Thank you for your reply.

Sorry if I have confused you. What I actually meant here is, for example: you query the data you want to work with, then as a next step you do some pre-processing, and then as a further step you do some calculations. If you use a series of KNIME Spark/PySpark nodes to do this, does the result at each step get written back to Spark, or can we maintain/cache intermediate results, and if yes, how can we do that in KNIME? This intermediate data would be maintained/cached solely for the purpose of saving us from multiple disk reads and writes.

One possibility is to persist and unpersist intermediate results. This can be done in memory or on disk. If you want to use some results in a loop or fork them, Spark then does not have to do everything all over again.

But there is no free lunch. Everything needs RAM, time, and space.

You could unpersist the persisted results after you are done with them.
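As a rough PySpark sketch of this persist/unpersist pattern (the DataFrame, input path, and storage level are assumptions for illustration, not KNIME-specific):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark import StorageLevel

spark = SparkSession.builder.appName("persist-demo").getOrCreate()
df = spark.read.parquet("/data/events.parquet")  # hypothetical input

# Expensive pre-processing we want to reuse in several branches or loops
prepped = df.filter(F.col("status") == "ok").groupBy("user_id").count()

# Keep the intermediate result in memory, spilling to disk if it does not fit
prepped.persist(StorageLevel.MEMORY_AND_DISK)

# Both of these actions reuse the cached result instead of recomputing it
top_users = prepped.orderBy(F.col("count").desc()).limit(10).collect()
total_users = prepped.count()

# Release the cached blocks once you are done with them
prepped.unpersist()
```

In KNIME the same idea applies between Spark nodes; the persist step just keeps the intermediate DataFrame materialized so downstream branches do not recompute the whole lineage.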
