Questions about the Output Port in PySpark

Hello, I have a question about the PySpark node in KNIME version 4.7.1.

The Python Script node can export trained ML/DL models through its output port as an "Output object (pickled)."

However, the PySpark node does not appear to offer this kind of output port.

So how can we export ML/DL models trained in the PySpark node as .pkl files, other than doing it in code?

The export target is the Local space or KNIME Server.

Hi @JaeHwanChoi,

The PySpark nodes don’t have the pickled output port.

What ML/DL models would you like to export? Spark models have a save method to export them the Spark way; they should not be exported as pickles.



Hi @sascha.wolke,

Thank you for your answer

I want to train TabNet and Prophet, DL/forecasting algorithms developed by Google and Facebook, in the PySpark node, and then save the trained models to Object Storage as .pkl files.

In the Python Script node, this process works fine using the Model Writer node.

  1. If so, how should I proceed in the PySpark node?

  2. Is the “save method” a code-based way of storing models within PySpark? If so, could you share the documentation on how to use it?

  3. Is there any way to export to an output port other than saving via code?

Hi @JaeHwanChoi,

I suggest using the Python nodes in KNIME to get started with Prophet.

Do you run a Spark cluster with multiple nodes? Otherwise, there might be no benefit to running the code with PySpark.

To exchange the data, use Spark data frames instead of pickles. They are what the PySpark node in KNIME exposes as an output port, and what Spark uses to distribute the data in the cluster. Spark data frames can be converted to pandas data frames and used e.g. with Prophet, but doing this at scale might become way more difficult and require some custom Python code, as Prophet does not support Spark out of the box.

A nice blog post about this: Scalable Time-Series Forecasting with Spark and Prophet | by Young Yoon | Medium

Keep in mind that running this with PySpark only helps if you run a Spark cluster with multiple executors.



Thank you for your answer, @sascha.wolke .

Well, applying various analysis models in PySpark seems to be limited.

If so, the guide for using the Python Script node in KNIME seems to be available via “KNIME Python API — KNIME Python API documentation #knime-python-script-labs-api”.

Then, is there any guide to using PySpark in KNIME? Examples of code used inside the PySpark node, like in the document above, would be helpful.

Hi @JaeHwanChoi,

Just to confirm, are you using an Apache Spark or Databricks cluster?

Well, applying various analysis models in PySpark seems to be limited.

Yes, that’s true. PySpark can’t be used as a 1:1 replacement for existing Python code, as it has to run distributed across a cluster, and this requires a lot more effort to organize the execution.

A great starting point with PySpark is the Apache Spark documentation: Getting Started — PySpark 3.2.0 documentation

Then, is there any guide to using PySpark in KNIME?

Not right now. Compared to the other Python nodes in KNIME, the PySpark nodes are limited and do not support the KNIME Python API.

