Connect remote Spark cluster with S3 in same virtual private cloud?

brylie · October 31, 2019, 12:53pm

I am trying to connect to a remote Spark cluster (via Livy) and give it access to an S3 bucket that is in the same VPC. The direct connection between our EMR cluster and the S3 bucket does not require credentials, as it is behind a protected VPN.

When I add an Amazon S3 Connection, it asks me to use my local AWS credentials, which are protected by multi-factor authentication.

Regardless, I don’t want to transfer any of the S3 data to my laptop; I would rather have Spark access S3 directly from the remote cluster.

How can I create a Spark Context (Livy) that uses the remote S3 file system directly, without using my local laptop/credentials?

sascha.wolke · November 1, 2019, 10:46am

Hi,

The Amazon S3 Connection is only used to provide features like file browsing in your local KNIME instance. If you use this connection with, e.g., a Spark to Parquet node, only the Hadoop-compatible path is transferred to the cluster: no credentials and no other data. The credentials from your cluster setup are used to access the S3 bucket. As you already explained, this should work on EMR without any additional configuration, using IAM service roles. All of the S3 data is transferred on the cluster side and not via your laptop. Does this help you?
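
To illustrate the idea outside of KNIME, here is a minimal sketch of what "only the path is sent" could look like if you talked to Livy yourself. The bucket name, object key, and both helper functions are hypothetical; the only thing taken from Livy is that a statement is submitted as a JSON body with a `code` field. It also assumes a PySpark session where `spark` is already defined and an EMR cluster whose instance role can read the bucket.

```python
import json


def hadoop_s3_path(bucket: str, key: str, scheme: str = "s3") -> str:
    """Build a Hadoop-compatible S3 URI (hypothetical helper).

    No access key or secret is embedded in the URI; the cluster is
    expected to resolve credentials itself (e.g. via its IAM role).
    """
    return f"{scheme}://{bucket}/{key.lstrip('/')}"


def livy_statement(path: str) -> dict:
    """Build the JSON body for a Livy statement (hypothetical helper).

    The code string runs on the cluster, so the read happens there
    and the data never passes through the local machine.
    """
    code = f"df = spark.read.parquet({path!r})\ndf.count()"
    return {"code": code}


# Hypothetical bucket and key, for illustration only.
path = hadoop_s3_path("example-bucket", "data/events.parquet")
payload = livy_statement(path)

# This is all that would leave the laptop: a path inside a code string.
print(json.dumps(payload, indent=2))
```

The payload contains only the `s3://` path, so nothing protected by your local multi-factor authentication is involved; whether the read succeeds then depends entirely on the permissions of the cluster's role.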
