Hi @Harun,
This depends on your setup. The most secure way is to configure Livy so that it starts each session/Spark context in yarn-cluster mode (i.e. master=yarn, deployMode=cluster) and performs impersonation (“proxyUser”, see [3]).
Livy-server does not actually execute the submitted code; the JVM that runs the Spark driver [1] does. In yarn-cluster mode, the Spark driver runs inside the YARN application master (AM), which is a YARN container somewhere on a worker node in the cluster. Typically, the Linux process of the AM container also runs as the Linux user with the same name as the Kerberos/Hadoop user that owns the YARN application (see [2]). This way, the Spark driver cannot even access the local file system of the machine where livy-server runs (where the Livy keytab resides).
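To make that concrete, here is a minimal sketch (not a definitive recipe) of the client side of such a setup: a REST call asking Livy to start a session impersonated as the calling user. The host name, user name, and the livy.conf keys in the comments are assumptions about a typical installation, not taken from your cluster; on a Kerberized Livy the request would additionally have to carry SPNEGO credentials (see the example further down).

```python
# Minimal sketch (Python + requests): creating a Livy session that is
# impersonated as the calling user. Assumes livy.conf contains something like:
#   livy.spark.master = yarn
#   livy.spark.deploy-mode = cluster
#   livy.impersonation.enabled = true
# The host name and user name below are placeholders.
import json
import requests

LIVY_URL = "http://livy-host:8998"  # hypothetical Livy endpoint

session_spec = {
    "kind": "pyspark",
    "proxyUser": "alice",  # the Hadoop user the driver/AM should run as
    "conf": {
        "spark.executor.instances": "2",
    },
}

resp = requests.post(
    f"{LIVY_URL}/sessions",
    data=json.dumps(session_spec),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
print("Created session:", resp.json()["id"])
```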
And they could also potentially inject malware and run it.
Yes, technically it is possible for an unprivileged user to run malware on the cluster nodes where YARN containers are started. But that has nothing to do with Livy; it is an inherent property of all distributed compute frameworks (Spark, Hadoop MapReduce, …) that run user code. That is their whole point: parallel execution of user-defined code.
I would say there are two answers to this, neither of which offers 100% security:
- Don’t let untrusted users onto your Hadoop cluster.
- If you are seriously worried about this, consider setting up Hadoop with Docker containers, which adds another isolation layer (Linux containers) around the user code. But even that is not 100% bullet-proof if the malware exploits a Linux kernel vulnerability that allows privilege escalation. Also, this is very much a non-trivial setup (see the sketch after this list for the per-application side).
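Purely as an illustration of the per-application side, and assuming the NodeManagers already have the Hadoop 3.x Docker container runtime enabled in yarn-site.xml (not shown here), the extra Spark settings usually come down to a few conf entries like the ones below; the image name is a placeholder.

```python
# Hedged sketch: extra Spark conf entries that request Docker containers
# for the AM and executors on a Hadoop 3.x cluster whose NodeManagers
# have the Docker runtime enabled. The image name is hypothetical.
DOCKER_IMAGE = "registry.example.com/spark-runtime:latest"

docker_conf = {
    # Run the YARN application master (and thus the Spark driver in
    # yarn-cluster mode) inside a Docker container.
    "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE": "docker",
    "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": DOCKER_IMAGE,
    # Do the same for the executors.
    "spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE": "docker",
    "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": DOCKER_IMAGE,
}

# These entries could be merged into the "conf" block of the Livy session
# request shown earlier, e.g. session_spec["conf"].update(docker_conf).
```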
And in the case of using HDFS with Active Directory to secure the data system, so that users need to specify a Kerberos key to access their files, how could I manage multiple principals in one server to get this working?
First, let’s distinguish two things: authentication (asserting an identity) and authorization (determining what an identity is allowed to do).
On a secured cluster, authentication is handled by Kerberos across all services. Authorization is handled differently by each service: HDFS has file system permissions, Hive/Impala follow the usual SQL GRANT/REVOKE scheme, and so on. Some Hadoop vendors (e.g. Cloudera, Hortonworks) offer services that centralize permission management (Cloudera: Sentry, Hortonworks: Ranger), which makes this easier.
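Regarding “multiple principals in one server”: with impersonation as sketched above, the Livy server normally needs only its own launch principal and keytab; end users authenticate with their own Kerberos credentials (obtained from Active Directory, e.g. via kinit) and Livy acts on their behalf, so you do not have to collect per-user keytabs on the server. Here is a minimal client-side sketch, assuming Livy’s HTTP endpoint is protected with Kerberos/SPNEGO and reusing the hypothetical host name from before.

```python
# Minimal sketch: a user on a Kerberized cluster talking to a
# SPNEGO-protected Livy server. Assumes the user already ran `kinit`
# (i.e. holds a valid Kerberos ticket) and that livy.conf enables
# Kerberos auth, e.g. livy.server.auth.type = kerberos.
import requests
from requests_kerberos import HTTPKerberosAuth, REQUIRED

LIVY_URL = "http://livy-host:8998"  # hypothetical Livy endpoint

# The SPNEGO handshake proves *who* the caller is (authentication).
auth = HTTPKerberosAuth(mutual_authentication=REQUIRED)

# Livy then impersonates that user (proxyUser), so HDFS permission
# checks (authorization) are evaluated against the caller's identity,
# not against the Livy service principal.
resp = requests.get(f"{LIVY_URL}/sessions", auth=auth)
resp.raise_for_status()
for session in resp.json().get("sessions", []):
    print(session["id"], session.get("proxyUser"), session["state"])
```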
Björn
[1] Cluster Mode Overview - Spark 2.3.1 Documentation
[2] Apache Hadoop 2.9.1 – YARN Secure Containers
[3] Apache Hadoop 3.3.6 – Proxy user - Superusers Acting On Behalf Of Other Users