Unable to connect to Spark using Create Spark Context node

gujodm · March 13, 2018, 4:45pm

Hi, I’m trying to set up the Spark context connection.
I have already changed the connection settings (both in Preferences > KNIME > Big Data > Spark and within the node) as follows:

Job server url: http://localhost:4040
Spark Version: 2.2.1

When I run the “spark-shell” command on the command line, everything works fine and I get the UI at http://localhost:4040
You can see that in the images above:

But when I try to start the Spark context in KNIME, it returns an error:

ERROR Create Spark Context 2:204 HTTP Status code: 302 | Response Body:
ERROR Create Spark Context 2:204 Execute failed: Spark Jobserver gave unexpected response (for details see View > Open KNIME log). Possible reason: Incompatible Jobserver version, malconfigured Spark Jobserver

Here is the trace from the error log:

org.knime.bigdata.spark.core.exception.KNIMESparkException: Spark Jobserver gave unexpected response (for details see View > Open KNIME log). Possible reason: Incompatible Jobserver version, malconfigured Spark Jobserver
	at org.knime.bigdata.spark.core.context.jobserver.request.AbstractJobserverRequest.createUnexpectedResponseException(AbstractJobserverRequest.java:154)
	at org.knime.bigdata.spark.core.context.jobserver.request.AbstractJobserverRequest.handleGeneralFailures(AbstractJobserverRequest.java:123)
	at org.knime.bigdata.spark.core.context.jobserver.request.GetContextsRequest.sendInternal(GetContextsRequest.java:62)
	at org.knime.bigdata.spark.core.context.jobserver.request.GetContextsRequest.sendInternal(GetContextsRequest.java:1)
	at org.knime.bigdata.spark.core.context.jobserver.request.AbstractJobserverRequest.send(AbstractJobserverRequest.java:72)
	at org.knime.bigdata.spark.core.context.jobserver.JobserverSparkContext.remoteSparkContextExists(JobserverSparkContext.java:410)
	at org.knime.bigdata.spark.core.context.jobserver.JobserverSparkContext.access$3(JobserverSparkContext.java:408)
	at org.knime.bigdata.spark.core.context.jobserver.JobserverSparkContext$1.run(JobserverSparkContext.java:240)
	at org.knime.bigdata.spark.core.context.jobserver.JobserverSparkContext.runWithResetOnFailure(JobserverSparkContext.java:341)
	at org.knime.bigdata.spark.core.context.jobserver.JobserverSparkContext.open(JobserverSparkContext.java:230)
	at org.knime.bigdata.spark.core.context.SparkContext.ensureOpened(SparkContext.java:64)
	at org.knime.bigdata.spark.node.util.context.create.SparkContextCreatorNodeModel.executeInternal(SparkContextCreatorNodeModel.java:155)
	at org.knime.bigdata.spark.core.node.SparkNodeModel.execute(SparkNodeModel.java:242)
	at org.knime.core.node.NodeModel.executeModel(NodeModel.java:567)
	at org.knime.core.node.Node.invokeFullyNodeModelExecute(Node.java:1172)
	at org.knime.core.node.Node.execute(Node.java:959)
	at org.knime.core.node.workflow.NativeNodeContainer.performExecuteNode(NativeNodeContainer.java:561)
	at org.knime.core.node.exec.LocalNodeExecutionJob.mainExecute(LocalNodeExecutionJob.java:95)
	at org.knime.core.node.workflow.NodeExecutionJob.internalRun(NodeExecutionJob.java:179)
	at org.knime.core.node.workflow.NodeExecutionJob.run(NodeExecutionJob.java:110)
	at org.knime.core.util.ThreadUtils$RunnableWithContextImpl.runWithContext(ThreadUtils.java:328)
	at org.knime.core.util.ThreadUtils$RunnableWithContext.run(ThreadUtils.java:204)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.knime.core.util.ThreadPool$MyFuture.run(ThreadPool.java:123)
	at org.knime.core.util.ThreadPool$Worker.run(ThreadPool.java:246)

Can someone explain why I’m getting this error?

Thanks in advance.
~G

sascha.wolke · March 13, 2018, 5:34pm

Hi,
you need to install the Spark Jobserver on your Spark cluster: https://www.knime.com/knime-extension-for-apache-spark#install
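
For context: port 4040 is where a running Spark application (such as spark-shell) serves its web UI, so it is not a Jobserver endpoint. The stack trace in the first post shows KNIME sending a GetContextsRequest, and a web UI answering with a redirect would explain the HTTP 302. The Python sketch below is one way to probe a URL before entering it in KNIME. It assumes the Jobserver answers GET /contexts with JSON, as the open-source spark-jobserver REST API does to my knowledge; the function names probe and diagnose are made up for this example and are not part of KNIME.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from silently following 3xx responses."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx code


def probe(base_url, timeout=5):
    """GET <base_url>/contexts and return (status, content_type)."""
    opener = urllib.request.build_opener(NoRedirect)
    url = base_url.rstrip("/") + "/contexts"
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, resp.headers.get("Content-Type", "")
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Content-Type", "")


def diagnose(status, content_type):
    """Turn a probe result into a rough, human-readable guess."""
    if 300 <= status < 400:
        return "redirect: probably a web UI (e.g. the Spark UI), not a Jobserver"
    if status == 200 and "json" in content_type.lower():
        return "looks like a REST API answering /contexts"
    return "unexpected response: check URL, port and Jobserver logs"


# Hypothetical usage; the second URL assumes the Jobserver listens on port 8090:
# print(diagnose(*probe("http://localhost:4040")))
# print(diagnose(*probe("http://localhost:8090")))
```

If the probe reports a redirect for the URL configured in KNIME, the URL most likely points at the Spark UI and should be replaced with the address of an installed Jobserver.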

gujodm · March 13, 2018, 5:52pm

The problem is that the installation guide completely lacks the steps to install and configure it on a Mac. For example, the “useradd” command is not recognized. So this set of commands:

root@host$ useradd -d /opt/${LINKNAME}/ -M -r -s /bin/false spark-job-server
root@host$ su -l -c "hdfs dfs -mkdir -p /user/spark-job-server ; hdfs dfs -chown -R spark-job-server /user/spark-job-server" hdfs
root@host$ cp /path/to/spark-job-server-xxx.tar.gz /opt
root@host$ cd /opt
root@host$ tar xzf spark-job-server-xxx.tar.gz
root@host$ ln -s spark-job-server-xxx ${LINKNAME}
root@host$ chown -R spark-job-server:spark-job-server ${LINKNAME} spark-job-server-xxx/

Cannot be run successfully on a Mac.
Can someone provide a step-by-step guide for Mac, or something similar?

~G

tobias.koetter · March 13, 2018, 7:13pm

Hi,
I’m sorry, but we do not support installing the Spark Job Server on Mac, although I guess it should work if spark-submit works there.
Unfortunately, I’m not enough of a Mac expert to translate the different Linux commands for you, but you might get help with that from somebody else in the forum or on the Spark Job Server page.

If you only want to try out the KNIME Big Data Extensions, you could also have a look at the new local big data environment extension, which is available in our nightly build. You simply need to install the extension (see screenshot) and you are done. The extension comes with a Spark 2.2 runtime, Hive, and HDFS access to your local hard drive, all packaged within a new Create Local Big Data Environment node.

For more details, have a look at the “What’s new and cooking” session from the last KNIME Summit once it is available online at https://www.knime.com/about/events/knime-spring-summit-2018-berlin#agenda.
Bye
Tobias

gujodm · March 18, 2018, 11:10pm

Hi @tobias.koetter ,
thank you for the interesting suggestion. I’ll give it a try for sure.

Anyway, if I want to run the Spark context without the Create Local Big Data Environment node, what I need to do is:

1. First of all, install Spark 2.1 from the Apache Spark site (or is this step unnecessary because Spark is already included in Cloudera CDH 5.13? I know the answer is probably obvious to you, but I want to be sure of the steps).
2. Install that specific Cloudera CDH 5.13 (compatible with Spark 2.1). Honestly, I didn’t understand what the installation instructions for this step are, because the link points to a Cloudera datasheet page.
3. Install that specific Spark Job Server (CDH 5.7 - 5.13, for Apache Spark 2.1).

Have I missed something? The PDF with the installation instructions only covers how to set up the Spark Job Server, not the previous steps.

What I don’t understand is why I need to install those Cloudera resources (important: are they free or not?). I mean, shouldn’t it be the same to install the compatible Hadoop version directly from the Hadoop site?

I also have another question: what’s the main difference between installing Spark with these installation steps (for example with Cloudera) and just using the new Create Local Big Data Environment node? I would like to understand whether the final result is the same or whether there are big differences in configuration and practical use.

~G

bjoern.lohrmann · March 23, 2018, 3:54pm

Hi @gujodm

> Have I missed something? The PDF with the installation instructions only covers how to set up the Spark Job Server, not the previous steps.

Cloudera CDH and Hortonworks HDP are both Hadoop distributions that simplify the installation, configuration and administration of Hadoop clusters. Both CDH and HDP are free of charge; however, both vendors charge for some enterprise features and support. Our PDF guide explains how to install Spark Jobserver on CDH/HDP clusters.

> I mean, shouldn’t it be the same to install the compatible Hadoop version directly from the Hadoop site?

In principle you can do that; it’s just a lot harder to operate. I guess it is useful for the learning experience, but when you want to do a proper deployment you use one of the Hadoop distributions like HDP or CDH.

> I also have another question: what’s the main difference between installing Spark with these installation steps (for example with Cloudera) and just using the new Create Local Big Data Environment node? I would like to understand whether the final result is the same or whether there are big differences in configuration and practical use.

The Create Local Big Data Environment node is completely local, i.e. there is no cluster behind it, so you do not have to install Spark Jobserver. However, you are limited by the power of the machine that KNIME runs on.

The node is mostly useful for three use cases:

1. Learning how to use the KNIME big data nodes (on small/medium data).
2. Rapid prototyping of big data workflows in KNIME on a local subset of the real (large) data.
3. If you have a single big machine (lots of CPU cores and RAM), you can use Spark on “medium” sized data there.

If you are thinking about solving real-world big data use cases, e.g. learning models on giga- or terabytes of data, then you need an actual Hadoop cluster with Spark Jobserver.

Björn

gujodm · March 23, 2018, 4:12pm

Hi @bjoern.lohrmann, thank you for the reply, much appreciated.
I think I now understand the different scenarios quite well.

I wanna ask you another important question:

Is the Big Data/Spark extension also compatible with Cloudera CDH 5.14 (which is the latest version)? The list of compatible versions only goes up to 5.13.

Another question is about the default Spark version included in CDH, which is 1.6. Why don’t they provide the latest version of Spark by default? The 2.x version is already available. In this case, is it fine if I manually install Spark 2.x after the CDH installation? Will it work, or will it fail because of the default Spark 1.6 version?

Then I have another question about the section of the PDF installation guide called “HOW TO INSTALL ON A KERBEROS-SECURED CLUSTER”. Can someone clarify how to proceed step by step? Are these steps mandatory for the KNIME Spark extension to work, or can we skip them? My Ubuntu machine configured with CDH 5.13 has no kadmin command, so I assume I would need to type:

apt-get install krb5-user
Or should I enable Kerberos from the Cloudera Manager panel first? It is disabled by default.

Why do I ask whether these steps are mandatory? Because I have already configured the spark-job-server, and when I open KNIME and run the ‘Create Spark Context’ node again, it still doesn’t work.

~G

bjoern.lohrmann · March 29, 2018, 9:18am

Hi @gujodm

my apologies for the late reply.

> Is the Big Data/Spark extension also compatible with Cloudera CDH 5.14 (which is the latest version)? The list of compatible versions only goes up to 5.13.

CDH 5.14 will be officially supported with the KNIME summer 2018 release. Until then it is not on the “officially” supported list, but it will work.

> Another question is about the default Spark version included in CDH, which is 1.6. Why don’t they provide the latest version of Spark by default? The 2.x version is already available.

This is something you have to ask Cloudera. Probably because they have a lot of customers on 1.6 and migrating from 1.6 to 2.x is a big step. Cloudera provides additional parcels for Spark 2, so you can install Spark 1.x and 2.x in parallel on the same cluster. The documentation for that is on the Cloudera website:
https://www.cloudera.com/documentation/spark2/latest/topics/spark2.html

> In this case, is it fine if I manually install Spark 2.x after the CDH installation? Will it work, or will it fail because of the default Spark 1.6 version?

I would advise against a manual installation of Spark 2.x on a Cloudera cluster. Use the Cloudera Spark 2 parcels and follow their documentation:

https://www.cloudera.com/documentation/spark2/latest/topics/spark2.html

> Then I have another question about the section of the PDF installation guide called “HOW TO INSTALL ON A KERBEROS-SECURED CLUSTER”. Can someone clarify how to proceed step by step? Are these steps mandatory for the KNIME Spark extension to work, or can we skip them?

The steps in this section are mandatory if you already have a Kerberos-secured cluster. If you don’t, then you don’t have to do these steps. If you are unsure whether or not that is the case, then most probably your cluster is not Kerberos-secured.
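
One way to check, as a minimal sketch: Hadoop records its authentication mode in the hadoop.security.authentication property of core-site.xml, which is "kerberos" on a secured cluster and "simple" otherwise (also the default when the property is absent). The Python helper below reads that property. It is not part of KNIME or the installation guide, the function name is made up for this example, and the file path in the usage comment is an assumption about a typical CDH node.

```python
import xml.etree.ElementTree as ET


def hadoop_auth_mode(core_site_xml):
    """Return the value of hadoop.security.authentication from core-site.xml content."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == "hadoop.security.authentication":
            return (prop.findtext("value") or "").strip().lower()
    return "simple"  # Hadoop's default when the property is not set


# Hypothetical usage on a cluster node (the path is an assumption):
# with open("/etc/hadoop/conf/core-site.xml", "rb") as fh:
#     print(hadoop_auth_mode(fh.read()))
```

If this reports "simple", the Kerberos section of the guide most likely does not apply to the cluster.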

Björn
