KNIME 4 on AWS EMR Spark-Livy - Spark Context Livy (openOutputStream)

bug
#1

Hello,

I’m running Spark nodes on KNIME 4.0.0 against EMR 5.23 (Hive 2.3.4, Spark 2.4.0, Livy 0.5.0). To connect I use the S3 Connection node, but creating the Spark context fails with the following error:

ERROR Create Spark Context (Livy) 0:2637     Execute failed: Remote file system upload test failed: java.lang.UnsupportedOperationException: openOutputStream must not be used for S3 connections.

The Create Spark Context (Livy) node gets to 70-80% and then stops with the openOutputStream error.
On the older KNIME version 3.7.2 there is no such error and everything works well.

  • The staging area in the Spark Context is set to S3 (I also tried configuring it as local, with the same result)
  • The access keys are set up correctly, since the S3 Connection node executes successfully

How can I solve this problem?

Thanks,

0 Likes

#2

Hi @sirev

welcome to the KNIME community!

Thanks for bringing this to our attention. It seems something broke in KNIME 4.0 in the way we use S3 in the "Create Spark Context (Livy)" node. This will be fixed in the next bugfix release (4.0.1). I can also notify you once the fix is available in the KNIME nightly build.

In the meantime, as a workaround you can use the HttpFS Connection node instead of the S3 Connection node. The HttpFS Connection node needs to be configured as follows:

  • Host: Put in the hostname of the EMR master instance
  • Authentication: User
  • User: livy
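To check that the HttpFS service on the EMR master is reachable with those settings before running the workflow, you can query its WebHDFS REST endpoint directly. This is just a sketch: the hostname is a placeholder, and it assumes HttpFS is listening on its default port 14000.

```shell
# List the HDFS root directory through HttpFS, acting as the user "livy".
# Replace <emr-master-host> with the hostname of your EMR master instance.
curl "http://<emr-master-host>:14000/webhdfs/v1/?op=LISTSTATUS&user.name=livy"
```

If the service is up and accepts requests for that user, it responds with a JSON FileStatuses listing; an error response here points at a connectivity or security-group problem rather than a KNIME issue.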

Also, when creating your EMR cluster you must allow HttpFS to impersonate the livy user (which your Spark context runs as). To do this, click “Create Cluster” in the AWS Web Console, then click “Go to advanced configuration”, and under “Edit software settings” paste the following into the text field:

[
  {
    "Classification": "core-site",
    "Properties": {
      "hadoop.proxyuser.httpfs.groups": "hudson,testuser,root,hadoop,jenkins,oozie,hive,httpfs,hue,users,livy"
    }
  }
]
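If you create the cluster from the command line rather than the Web Console, the same classification JSON can be supplied via the AWS CLI. This is a minimal sketch: the instance type, count, and roles are placeholders for your own cluster settings.

```shell
# Save the classification JSON above as proxyuser.json, then pass it to
# "aws emr create-cluster" via --configurations. All other options are
# placeholders; adjust them to match your own cluster configuration.
aws emr create-cluster \
  --release-label emr-5.23.0 \
  --applications Name=Hive Name=Spark Name=Livy \
  --configurations file://proxyuser.json \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles
```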

This basically allows the httpfs service to also impersonate livy. Since your Spark context runs as the user “livy”, this setting is necessary so that the Spark context (cluster-side) and KNIME (client-side) can exchange files.

Hope this helps,
Björn

2 Likes

#3

Hi @bjoern.lohrmann,

Thank you for the answer, really appreciate it!

I’ll be waiting for your next release with the fix.
Thank you again for the temporary workaround!

1 Like