How to use Workflow Executor for Apache Spark

Hi guys,

I would use the KNIME Workflow Executor for Apache Spark. Is there a example workflow or a description for this usage?

I’ve installed the KNIME Workflow Executor for Apache Spark (Preview). For our hadoop cluster I used the Create Spark Context (Livy) Node, Spark Version 2.3, and HDFS Connection Node, HDFS 3.1.1.
I’ve build a small workflow with Non-Spark-Nodes inside a wrapped Metanode.
Spark Data Ports (in and out).

Inside the Metanode a Spark to Table Node and a Table to Spark Node and some Non_Spark-Nodes.

I’ve connected the Metanode with a Table to Spark Node outside the Metanode to ingest the table data.


I’ve got the Error: KNIME Workflow Executor for Apache Spark requires spark.knime.knosp.knimeDir to be set in the Spark settings.

But there is no key spark.knime.knosp.knimeDir to set custom Spark settings (Create Spark Context (Livy)).

Any suggestions?


Hi @Andrew_Steel

we are working on documentation for that extension, my apologies for the unclarities. The extension is currently still a preview.

It is rather intentional that spark.knime.knosp.knimeDir is not configurable on a per-node basis but should be done with cluster-wide spark settings. The background is that KNIME Workflow Executor for Apache Spark requires a KNIME installation on every single one of your cluster nodes. The Spark conf setting spark.knime.knosp.knimeDir needs to be pointed to the KNIME folder on your cluster nodes (and it needs to be the same on all nodes).

If you are on Cloudera CDH, we provide a parcel and CSD for Cloudera Manager in order to fully automate the installation (for Hortonworks HDP, we currently do not provide any automation.)

The CSD for Cloudera can be downloaded here.

If your cluster does not have internet access you also have to download the correct parcel file. The RHEL 7 (for example) parcel and corresponding sha file can be downloaded here:

By the way: If you just want to try out the extension, you can also use the local big data environment, it also works there. For rapid prototyping of workflows this might be the best approach anyway.