Unable to connect to Spark using Create Spark Context node

Hi @gujodm

Have I missed something? The PDF with the installation instructions only covers how to set up the Spark Jobserver, not the previous steps.

Cloudera CDH and Hortonworks HDP are both Hadoop distributions that simplify the installation, configuration, and administration of Hadoop clusters. Both CDH and HDP are free of charge; however, both vendors charge for some enterprise features and support. Our PDF guide explains how to install Spark Jobserver on CDH/HDP clusters.

I mean, shouldn’t it be the same thing to install a compatible Hadoop version directly from the Hadoop site?

In principle you can do that, it’s just a lot harder to operate that way. It can be useful for the learning experience, but for a proper deployment you would use one of the Hadoop distributions like HDP or CDH.

And I also have another question… what’s the main difference between installing Spark with these installation steps (for example with Cloudera) and just using the new Create Local Big Data Environment node? I would like to understand whether the final result is the same or whether there are big differences in configuration and practical use.

The Create Local Big Data Environment node is completely local, i.e. there is no cluster behind it, and you do not have to install Spark Jobserver. However, you are limited by the power of the machine that KNIME runs on (see the sketch after the list below).

The node is mostly useful for three use cases:

  • Learning how to use the KNIME big data nodes (on small/medium data).
  • Rapid prototyping of big data workflows in KNIME on a local subset of the real (large) data.
  • If you have a single big machine (lots of CPU cores and RAM), you can use Spark on “medium”-sized data there.
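
To make the difference concrete, here is a minimal sketch in plain PySpark (not KNIME nodes) of roughly what a purely local Spark environment amounts to: Spark runs inside a single process on your machine, with no cluster and no Spark Jobserver involved. The app name and memory value are illustrative assumptions, not settings the node actually uses.

```python
# Minimal sketch of a purely local Spark session, roughly analogous to
# what the Create Local Big Data Environment node provides. The app
# name and memory setting below are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                    # run on this machine only, using all cores
    .appName("local-big-data-sketch")      # hypothetical app name
    .config("spark.driver.memory", "4g")   # capped by the RAM of the KNIME machine
    .getOrCreate()
)

df = spark.range(1_000_000)  # a small demo dataset
print(df.count())

spark.stop()
```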

If you are thinking about solving real-world big data use cases, e.g. learning models on giga- or terabytes of data, then you need an actual Hadoop cluster with Spark Jobserver.
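
For context on what that setup looks like from KNIME’s side: the Create Spark Context node talks to Spark Jobserver over its REST interface. The sketch below shows the rough idea, assuming the public spark-jobserver /contexts endpoint; the host name, context name, and resource parameters are made-up examples, not values from this thread.

```python
# Hedged sketch: creating a remote Spark context through the Spark
# Jobserver REST API. The /contexts endpoint is part of the public
# spark-jobserver API; host, context name, and parameter values below
# are illustrative assumptions.
import requests

JOBSERVER = "http://cluster-edge-node:8090"  # hypothetical URL; 8090 is the default Jobserver port

# Ask the Jobserver to start a new Spark context on the cluster.
resp = requests.post(
    f"{JOBSERVER}/contexts/knimeSparkContext",
    params={"num-cpu-cores": "4", "memory-per-node": "2g"},
)
print(resp.status_code, resp.text)

# List the contexts the Jobserver currently manages.
print(requests.get(f"{JOBSERVER}/contexts").text)
```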

Björn