KNIME Databricks - PySpark integration

Hi all,

I have a question about the KNIME PySpark integration. I'm following these steps:

  • I’m using the “Create Databricks Environment” node to create the Spark context.
  • Then I use the “PySpark Script Source” node to run some Python code; the output is a resulting DataFrame (see the sketch after this list).
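For context, here is a minimal sketch of what my script node body looks like. This assumes the node exposes a SparkSession as `spark` and expects the output in a variable named `resultDataFrame1`, as in the node's default template; adjust the names if your version differs:

```python
# Minimal PySpark Script Source body (sketch): build a small DataFrame
# and hand it back to KNIME via the node's output variable.
# `spark` and `resultDataFrame1` are assumed to match the node template.
from pyspark.sql import Row

rows = [Row(id=1, value="a"), Row(id=2, value="b")]
resultDataFrame1 = spark.createDataFrame(rows)
```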

Questions:

  1. Where is the output DataFrame stored in Spark?
  2. Will it persist as long as the cluster persists?
  3. How can we delete the result DataFrame's files once the results are stored in DBFS/Blob storage?

Also, it would be great if you could link to any relevant reference documentation in your answers.

Thank you

Hello Vipul,

Thank you for contacting KNIME regarding this issue.

Can you please confirm what version of KNIME Analytics Platform (AP) you are using?

Regards,
Nickolaus

Hi @vipul,

the KNIME Spark nodes create regular Spark DataFrames. DataFrames are not persisted by default; they are computed on the fly and only held in cluster memory. This means they are lost if you stop or restart the cluster. You can use nodes like Spark to Parquet to persist your DataFrame to e.g. DBFS or S3; a rough PySpark equivalent is sketched below.
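As a sketch of what that persist-then-delete flow looks like in plain PySpark (covering question 3): the path and the variable name `df` are placeholders, and `dbutils` is only available on Databricks clusters (e.g. from a Databricks notebook), not in plain Spark:

```python
# Rough PySpark equivalent of persisting the result and cleaning up
# afterwards. `df` stands in for the DataFrame produced by the script node.
out_path = "dbfs:/tmp/knime/result_parquet"  # hypothetical output location

df.write.mode("overwrite").parquet(out_path)  # materialize to DBFS as Parquet

# ... downstream steps consume the Parquet files ...

dbutils.fs.rm(out_path, True)  # recursively delete the directory when done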

Cheers
