Cannot connect to Livy on Amazon EMR

I am trying to use the new “Create Spark Context via Livy (preview)” node to connect to an EMR cluster running version 5.15.0 with Livy 0.4.0 installed on it.

I get the following error:

ERROR Create Spark Context via Livy (preview) 0:44       Execute failed: org.apache.hadoop.security.AccessControlException: Permission denied: user=livy, access=EXECUTE, inode="/user/hadoop/.knime-spark-staging-4cf52884-3321-4a35-8c33-65d4c7f208fb":hadoop:hadoop:drwx------

Any ideas?

Hi @cosmincatalin

when connecting to Livy on Amazon EMR you should use the “S3 Connection” node instead of the HDFS/HttpFS/WebHDFS Connection nodes; otherwise you will run into permission problems such as the one you are seeing.

To set up the “S3 Connection” node you will need to provide:

  • AWS Credentials: These can be directly provided to the node in the form of Access Key ID and Secret or by using the default credentials provider chain (see [1])
  • AWS Region: The region must be the same one that the EMR cluster was deployed into.
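If you go with the default credentials provider chain mentioned above, a minimal sketch of how credentials are typically supplied (the key values below are placeholders, not real keys):

```shell
# Placeholder credentials for illustration only.
# The default provider chain first checks these standard environment variables:
export AWS_ACCESS_KEY_ID="AKIA...your-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-key"

# Alternatively, the chain falls back to the shared credentials file
# (~/.aws/credentials) with a [default] profile containing the same two keys.
```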

Also, in the “Create Spark Context via Livy” node in the Advanced tab you need to specify an S3 bucket that both the EMR cluster nodes and your client have read+write access to.

One additional hint about Livy on EMR: Livy on Amazon EMR defaults to the “yarn-client” submit mode, which does not work reliably with Livy. This seems to be a bug in the EMR default settings for Livy. When you create the EMR cluster, you can provide the following software settings (JSON format) to set the submit mode to “yarn-cluster”:

[{"classification":"livy-conf", "properties":{"livy.spark.master":"yarn-cluster"}, "configurations":[]}]
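For reference, here is a sketch of how that configuration could be passed when creating the cluster with the AWS CLI; the cluster name, instance type/count and key name are placeholders for your own values:

```shell
# Create an EMR 5.15.0 cluster with Spark and Livy, overriding the
# Livy submit mode via the livy-conf classification shown above.
aws emr create-cluster \
  --name "my-livy-cluster" \
  --release-label emr-5.15.0 \
  --applications Name=Spark Name=Livy \
  --instance-type m4.large \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles \
  --configurations '[{"classification":"livy-conf","properties":{"livy.spark.master":"yarn-cluster"},"configurations":[]}]'
```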

Best,
Björn

[1] https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html#credentials-default


Thank you for the detailed response and the wonderful recommendations.
One follow-up question: isn’t it less performant to use the S3 connection in place of the cluster’s HDFS? And shouldn’t it work with HDFS anyway?

Hi @cosmincatalin

Since the connection is just used to exchange small temporary files between the KNIME Analytics Platform client and the remote Spark context it does not make much of a performance difference whether you use S3 or HDFS.

That depends on the setup. When using HDFS, both the KNIME Analytics Platform client and the remote Spark context need to access HDFS as the same user. With the default Amazon EMR settings, the remote Spark context accesses HDFS as user “livy”, but KNIME cannot access HDFS as user “livy”.

In a different setup (for example, a Kerberos-secured cluster where you authenticate to Livy using Kerberos and Livy then impersonates your user), you can use HDFS without running into problems with HDFS file ownership.

Best,
Björn
