PySpark Script speed/performance question

Hello KNIME Support Team,

I’m currently using the Create Local Big Data Environment node for Spark-related tests as a temporary solution, since Livy is not deployed in my client’s environment.

We are using the PySpark Script node in version 4.7.7, and there is a big difference in how long the same code takes to complete when executed in a Jupyter Notebook versus in the PySpark Script node in KNIME. (KNIME takes longer; the code is ML/analytics related.)

I’ve asked about the Create Local Big Data Environment node several times on the forum before, and was told that it is outdated and should only be used to get a feel for Spark.

Is the difference in speed also due to the fact that the node is outdated?

Any answers would be appreciated.

Hi @JaeHwanChoi,

> We are using the PySpark Script node in version 4.7.7, and there is a big difference in how long the same code takes to complete when executed in a Jupyter Notebook versus in the PySpark Script node in KNIME. (KNIME takes longer; the code is ML/analytics related.)

Can you post some example code? And does “big difference” mean seconds, minutes, or hours?

There is always some overhead when running scripts in Spark; this is acceptable if you run a complex computation on a large cluster. You can find the Spark documentation here: https://spark.apache.org/
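To see how much of the gap is fixed Spark overhead rather than your ML code, you could time the two parts separately. A minimal sketch, assuming a plain local pyspark installation (the app name and data are made up for illustration):

```python
import time

from pyspark.sql import SparkSession

# Fixed startup cost: launching the JVM and creating the session/context.
t0 = time.time()
spark = (
    SparkSession.builder
    .master("local[*]")  # single-machine mode, similar to the local environment
    .appName("overhead-demo")
    .getOrCreate()
)
print(f"Session startup: {time.time() - t0:.1f}s")

# A trivial job: the computation itself is negligible, so the elapsed
# time is almost entirely scheduling and task-launch overhead.
t0 = time.time()
spark.range(1_000_000).count()
print(f"Trivial count job: {time.time() - t0:.1f}s")

spark.stop()
```

If both numbers are small compared to your total runtime, the slowdown is in the ML code itself rather than in Spark’s fixed overhead.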

If you would like to run Python scripts without a cluster, the Python nodes are a better choice: Getting started with KNIME's Python integration (KNIME Community Hub)
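As a rough sketch of what that looks like (assuming the knime.scripting.io API of the current Python Script node; details may vary by version):

```python
import knime.scripting.io as knio

# Read the node's first input table into pandas -- everything below
# runs in a single local Python process, with no Spark session or
# task-scheduling overhead.
df = knio.input_tables[0].to_pandas()

# ... run your ML / analytics code on df here ...

# Hand the result back to KNIME as the node's first output table.
knio.output_tables[0] = knio.Table.from_pandas(df)
```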

> I’ve asked about the Create Local Big Data Environment node several times on the forum before, and was told that it is outdated and should only be used to get a feel for Spark.
>
> Is the difference in speed also due to the fact that the node is outdated?

No, the Local Big Data Environment still gets updated occasionally, so this should not be related. Please note that the Local Big Data Environment is not a compute cluster and should only be used for testing.

Cheers,
Sascha


Is this ‘pure’ Python code, or is it also accessing a Spark environment? The point @sascha.wolke is also making is that Spark, by its very design, will always need some initialization and setup, since it is meant to work on distributed systems. This is the price to be paid for being able to scale up to very large amounts of data, while ‘normal’ Python operations (like pandas or NumPy) typically just ‘live’ on one machine.
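To make that concrete, here is a small sketch (illustrative data, assuming pandas and pyspark are installed) of the same aggregation in both worlds; pandas executes eagerly in-process, while Spark first has to build a session and plan distributed execution, even when the data is tiny:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

pdf = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})

# pandas: runs immediately in this Python process, no setup cost.
print(pdf.groupby("group")["value"].mean())

# PySpark: pays for JVM/session startup, query planning, and task
# scheduling first -- a cost that only pays off on a real cluster
# with data too large for one machine.
spark = SparkSession.builder.master("local[2]").appName("compare").getOrCreate()
sdf = spark.createDataFrame(pdf)
sdf.groupBy("group").agg(F.mean("value").alias("mean_value")).show()
spark.stop()
```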

