I’m running Spark nodes on KNIME 4.0.0 against EMR 5.23 (Hive 2.3.4, Spark 2.4.0, Livy 0.5.0). To connect, I use the S3 Connection node, but when creating the Spark context I get this error:
ERROR Create Spark Context (Livy) 0:2637 Execute failed: Remote file system upload test failed: java.lang.UnsupportedOperationException: openOutputStream must not be used for S3 connections.
The Create Spark Context (Livy) node gets to 70-80% and then stops with the openOutputStream error.
Note that the older KNIME version 3.7.2 gives no error and works well.
The staging area in the Spark context is set to S3 (I also tried configuring it as local, with the same result).
The access keys are set up correctly, since the S3 Connection node executes successfully.
Thanks for bringing this to our attention. It seems something broke in KNIME 4.0 in the way we are using S3 in the Create Spark Context (Livy) node. This will get fixed in the next bugfix release (4.0.1). I can also notify you once the fix is in the KNIME nightly build.
In the meantime, as a workaround you can use the HttpFS Connection node instead of the S3 Connection node. The HttpFS Connection node needs to be configured as follows:
Host: Put in the hostname of the EMR master instance
Authentication: User
User: livy
Also, when creating your EMR cluster you must allow HttpFS to impersonate the livy user (which your Spark context runs as). Do this as follows: when you click “Create Cluster” in the AWS Web Console, click “Go to advanced configuration”. Under “Edit software settings”, paste the following into the text field:
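A sketch of the required settings, assuming EMR’s `httpfs-site` classification and the standard HttpFS proxy-user properties (adjust if your EMR release names them differently):

```json
[
  {
    "Classification": "httpfs-site",
    "Properties": {
      "httpfs.proxyuser.livy.hosts": "*",
      "httpfs.proxyuser.livy.groups": "*"
    }
  }
]
```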
This basically allows the HttpFS service to also impersonate livy. Since your Spark context runs as the user “livy”, this setting is necessary for the Spark context (cluster side) and KNIME (client side) to exchange files.
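If you create the cluster from the AWS CLI instead of the console, the same settings can be passed via `--configurations`. A minimal sketch, assuming the JSON above is saved as `httpfs-config.json` (the cluster name, key pair, and instance settings are placeholders to adapt to your setup):

```bash
# Hypothetical invocation: applies the HttpFS impersonation settings at
# cluster creation time. Adjust names, instance types, and roles as needed.
aws emr create-cluster \
  --name "knime-spark" \
  --release-label emr-5.23.0 \
  --applications Name=Hadoop Name=Hive Name=Spark Name=Livy \
  --configurations file://httpfs-config.json \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --use-default-roles
```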