Spark context on GCE

#1

Hi,

How can I set up a Spark context on Google Compute Engine (GCE) within a KNIME workflow?

Thanks,
Mihai

1 Like

#2

Hi @mihais1

you will have to install either Livy (recommended) or Spark Jobserver on your Dataproc cluster. For Livy, Google provides scripts that should make this easier.

IMPORTANT: You need to install Livy 0.5, not the current 0.6. Unfortunately, KNIME currently has no Livy 0.6 support (it is planned for the December 2019 release).
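If you want to double-check what got installed before pointing KNIME at the cluster, a quick sketch like the following can help. The host name is a placeholder, and it assumes Livy's default port 8998 and its REST GET /version endpoint:

```python
# Quick check (a sketch, not required by KNIME) that the Livy server on the
# Dataproc master is reachable and reports the expected 0.5.x version before
# configuring the Create Spark Context (Livy) node.
# Assumptions: Livy listens on its default port 8998 and exposes GET /version;
# replace the hostname with your master node.
import requests

LIVY_URL = "http://dataproc-master-host:8998"  # hypothetical hostname

resp = requests.get(f"{LIVY_URL}/version", timeout=10)
resp.raise_for_status()
info = resp.json()
print("Livy reports:", info)

version = info.get("version", "")
if not version.startswith("0.5"):
    print("Warning: KNIME (as of late 2019) only supports Livy 0.5, found", version)
```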

Best,
Björn

0 Likes

#3

Hi Björn,

I installed Livy on Google Dataproc as you suggested, but I am getting this error when creating the Livy context from KNIME:
“ERROR DFSClient Failed to close inode 17423
ERROR Create Spark Context (Livy) 0:2681 Execute failed: Remote file system upload test failed: File /.knime-spark-staging-55156831-d835-4b19-884e-5151e22bc2af/a688990b-4b92-4354-b63d-ba24b6782f96 could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1819)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2569)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:846)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:503)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:871)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:817)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2606)”

I’ve set the environment variable PYSPARK_ALLOW_INSECURE_GATEWAY to 1, and I also tried setting dfs.replication to 1 instead of 2. Any suggestions would be very helpful.

Thank you,
Mihai

0 Likes

#4

Hello Mihai,
it seems that you are using an HDFS Connection to the cluster, which requires direct access to all data nodes, and those are usually not accessible from outside the cluster. Try using the HttpFS Connection node instead.
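To rule out network issues first, a small sketch like this (host name, the default HttpFS port 14000, and the user name are assumptions for a non-Kerberos setup) can test whether an HttpFS endpoint is reachable at all from the machine running KNIME Analytics Platform:

```python
# Minimal reachability check (a sketch, not the KNIME node's internal logic)
# for an HttpFS endpoint, which speaks the WebHDFS REST protocol.
# Assumptions: HttpFS listens on its default port 14000, no Kerberos;
# hostname and user name are placeholders.
import requests

HTTPFS_URL = "http://dataproc-master-host:14000"  # hypothetical hostname
USER = "hdfs"  # hypothetical user for simple (non-Kerberos) auth

resp = requests.get(
    f"{HTTPFS_URL}/webhdfs/v1/",
    params={"op": "LISTSTATUS", "user.name": USER},
    timeout=10,
)
resp.raise_for_status()
# If this prints a directory listing, the file system is reachable through
# the single HttpFS port, without direct datanode access.
print(resp.json())
```

If the request cannot even connect, HttpFS is either not running on the cluster or blocked by the network, which matters for the next steps.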
Bye
Tobias

0 Likes

#5

Hello,

I’ve tried using the HttpFS Connection, but I couldn’t connect to the Google Dataproc master node, even though I’ve allowed all protocols from my local IP in the firewall rules.

Yes, in the past I used the HDFS Connection to access this cloud platform.

Thank you,
Mihai

0 Likes

#6

Hello, Mihai,
I’m sorry, it seems that Google Dataproc does not support HttpFS. Since we do not have a Google Cloud Storage connector as of now, you will need to use the HDFS Connection node. However, according to the error message “There are 2 datanode(s) running and 2 node(s) are excluded in this operation”, your data nodes are not reachable from KNIME Analytics Platform, most likely due to network settings. So you may need to set up a VPN connection between the machine KNIME Analytics Platform is running on and the Google Dataproc cluster.
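If you want to confirm that this is the problem, a rough check like the following sketch shows whether the datanodes can be reached directly from the KNIME machine. The datanode host names and the data-transfer port are assumptions; adjust them to your cluster:

```python
# A rough way (sketch only) to confirm the diagnosis: the HDFS client must open
# direct connections to every datanode, so check from the KNIME machine whether
# the datanodes' data-transfer port is reachable at all.
# Assumptions: hostnames are placeholders; the data-transfer port is 9866 on
# Hadoop 3.x clusters (older clusters use 50010).
import socket

DATANODES = ["cluster-w-0.internal", "cluster-w-1.internal"]  # hypothetical names
DATA_TRANSFER_PORT = 9866

for host in DATANODES:
    try:
        with socket.create_connection((host, DATA_TRANSFER_PORT), timeout=5):
            print(f"{host}:{DATA_TRANSFER_PORT} reachable")
    except OSError as exc:
        # Unreachable datanodes produce exactly the "excluded in this operation"
        # behaviour seen in the error above; a VPN or tunnel into the cluster
        # network is one way around it.
        print(f"{host}:{DATA_TRANSFER_PORT} NOT reachable: {exc}")
```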
Bye
Tobias

0 Likes

#7

Hello,

The VPN approach didn’t work for me.
But I’ve managed to create the Livy context on Google Dataproc by configuring Apache Knox to allow the Livy connection. I also set the PYSPARK_ALLOW_INSECURE_GATEWAY environment variable to 1. And I managed to use the HttpFS Connection by setting the port in the node’s configuration to 8443 (the gateway’s port).
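For anyone reading this later, the setup boils down to talking to Livy (and the file system) through the Knox gateway on port 8443 instead of directly. A rough sketch of such a check, assuming a Knox topology named “default” that maps Livy under livy/v1, placeholder credentials, and a self-signed certificate, looks like this:

```python
# Sketch of reaching Livy through the Apache Knox gateway on port 8443 instead
# of connecting to the Livy port directly.
# Assumptions: Knox topology "default" maps Livy under livy/v1; the gateway
# uses a self-signed certificate (hence verify=False); host name and
# credentials are placeholders.
import requests

GATEWAY = "https://dataproc-master-host:8443/gateway/default"  # hypothetical
AUTH = ("knox-user", "knox-password")  # hypothetical Knox credentials

# List existing Livy sessions through the gateway - roughly the endpoint the
# Create Spark Context (Livy) node talks to when pointed at port 8443.
resp = requests.get(f"{GATEWAY}/livy/v1/sessions", auth=AUTH, verify=False, timeout=10)
resp.raise_for_status()
print(resp.json())
```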

Thank you,
Mihai

1 Like

closed #8

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

0 Likes