Apache Livy Security

Hi,

How can I prevent users from accessing other users’ data, the file system, and the network on the Livy server?

My question is: how secure is Livy? Users can submit arbitrary code to run on Livy, which could give them the ability to access the file system of the host the Livy server resides on. Even if we run Livy as a dedicated Unix user with very few file system permissions, that still seems potentially dangerous to me; they could, for instance, access the keytab on the Livy server. And they could also potentially inject malware and run it.

I know that each session gets its own JVM, so one session lives in one JVM and cannot see another session’s data without the corresponding Kerberos ticket. But could I change the security settings of that JVM so that it can only access specific paths and specific IP addresses? Would that mean changing the Livy source code?

And in the case of using HDFS with Active Directory to secure the data, so that users need a Kerberos credential to access their files, how could I manage multiple principals on one server to get this working?

Hi @Harun

this depends on your setup. The most secure way is to configure Livy so that it starts each session/Spark context in yarn-cluster mode (i.e. master=yarn, deployMode=cluster) and performs impersonation (“proxyUser”, see [3]).
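
For reference, a minimal livy.conf sketch of such a setup could look like the one below. The property names are from Livy’s livy.conf.template; the hostnames, principals, and keytab paths are placeholders you would replace with your own:

    # livy.conf -- sketch: yarn-cluster mode with impersonation
    livy.spark.master = yarn
    livy.spark.deploy-mode = cluster

    # run each session as the user named in the session's "proxyUser" field
    livy.impersonation.enabled = true

    # authenticate REST clients via Kerberos/SPNEGO (placeholder principal/keytab)
    livy.server.auth.type = kerberos
    livy.server.auth.kerberos.principal = HTTP/livy-host.example.com@EXAMPLE.COM
    livy.server.auth.kerberos.keytab = /etc/security/keytabs/spnego.service.keytab

    # the principal/keytab Livy itself uses against the cluster (placeholders)
    livy.server.launch.kerberos.principal = livy/livy-host.example.com@EXAMPLE.COM
    livy.server.launch.kerberos.keytab = /etc/security/keytabs/livy.service.keytab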

Livy-server does not actually execute the submitted code. The JVM that runs the Spark driver[1] does that. In yarn-cluster mode, the Spark driver runs inside the YARN application master (AM), which is a YARN container somewhere on a worker node in the cluster. Typically, the Linux process of the AM container also runs as the Linux user that has the same name as the Kerberos/Hadoop user that owns the YARN application (see [2]). This way the Spark driver cannot even access the local file system of the machine where livy-server runs (where the livy keytab resides).
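
To make the proxyUser mechanism concrete, here is a small Python sketch (using the requests and requests-kerberos packages; the URL and user name are placeholders) that asks Livy to start a session impersonating a given user:

    # create_session.py -- sketch: start a Livy session that impersonates a user
    import requests
    from requests_kerberos import HTTPKerberosAuth  # pip install requests-kerberos

    LIVY_URL = "http://livy-host.example.com:8998"  # placeholder

    # POST /sessions is Livy's REST call for creating an interactive session;
    # "proxyUser" is the user the Spark driver (the YARN AM) will run as.
    resp = requests.post(
        f"{LIVY_URL}/sessions",
        json={"kind": "pyspark", "proxyUser": "harun"},
        auth=HTTPKerberosAuth(),
    )
    resp.raise_for_status()
    print(resp.json())  # contains the new session's id and state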

And they could also potentially inject malware and run it.

Yes, technically it is possible for an unprivileged user to run malware on the cluster nodes where YARN containers are started. But that has nothing to do with Livy; it is an inherent property of all distributed compute frameworks (Spark, Hadoop MapReduce, …) that run user code. Parallel execution of user-defined code is their whole point.

I would say there are two answers to this, neither of which offers 100% security:

  1. Don’t let untrusted users onto your Hadoop cluster.
  2. If you are seriously worried about this, consider setting up Hadoop with Docker containers, which adds another isolation layer (Linux containers) around the user code. But even that is not 100% bullet-proof if the malware exploits a Linux kernel vulnerability that allows privilege escalation. Also, this is very much a non-trivial setup (see the yarn-site.xml sketch after this list).
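
For completeness, a rough yarn-site.xml sketch for the Docker route. The property names are from the Hadoop YARN Docker documentation; additional settings (e.g. in container-executor.cfg) are required and not shown here:

    <!-- yarn-site.xml -- sketch: enable the Docker runtime for YARN containers -->
    <property>
      <name>yarn.nodemanager.container-executor.class</name>
      <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
    </property>
    <property>
      <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
      <value>default,docker</value>
    </property>
    <property>
      <!-- never allow privileged containers for untrusted code -->
      <name>yarn.nodemanager.runtime.linux.docker.privileged-containers.allowed</name>
      <value>false</value>
    </property>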

And in the case of using HDFS with Active Directory to secure the data, so that users need a Kerberos credential to access their files, how could I manage multiple principals on one server to get this working?

First, let’s distinguish two things: Authentication (to assert identity) and authorization (to assert what an identity is allowed to do).

On a secured cluster, authentication is handled by Kerberos across all services. Authorization is handled differently by each service: HDFS has file system permissions, Hive/Impala follow the usual SQL GRANT/REVOKE scheme, and so on. Some Hadoop vendors (e.g. Cloudera, Hortonworks) offer services to centralize permission management (Cloudera: Sentry, Hortonworks: Ranger), which makes this easier.
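
For HDFS specifically, authorization boils down to POSIX-style permissions on files and directories. For example (user name and path are made up):

    # lock down a user's home directory so only that user can read/write it
    hdfs dfs -chown harun:harun /user/harun
    hdfs dfs -chmod 700 /user/harun
    hdfs dfs -ls /user    # verify owner, group and mode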

Björn

[1] Cluster Mode Overview - Spark 2.3.1 Documentation
[2] Apache Hadoop 2.9.1 – YARN Secure Containers
[3] Apache Hadoop 3.3.6 – Proxy user - Superusers Acting On Behalf Of Other Users

Hi @bjoern.lohrmann,

thank you very much for the detailed answer.

One question that I still have and didn’t get 100%: how does this impersonation actually work? How does YARN distinguish between users when statements are submitted? Does it know that the proxyUser of session 3 is some user called harun, so that when the server submits a statement for session 3, it must be harun and should only get access to the resources harun is allowed to access?

And if a different user submits a statement in another session, the YARN cluster shouldn’t show resources from harun’s session. How does it do that? Isn’t there only one Kerberos ticket associated with both sessions? Is it really only the session number that distinguishes between resources?

The same question applies to HDFS, of course, but I suppose it is the same mechanism.

Thanks in advance, and again thank you very much for your initial answer, it was really helpful :smiley:

The general scheme is described in link [3] from above. It is part of Hadoop’s configuration to know that Livy’s Kerberos principal is a “superuser” that is allowed to impersonate other users (“secure impersonation”). Also, the scope on which Livy does impersonation is a session, not a statement within a session: a session always runs as one particular user, and that applies to all statements executed within that session.
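
Concretely, the “superuser” part is a couple of entries in Hadoop’s core-site.xml. This assumes Livy’s principal maps to the user “livy”; the host and group values are placeholders that you should narrow down as far as possible:

    <!-- core-site.xml -- sketch: allow the user "livy" to impersonate others -->
    <property>
      <name>hadoop.proxyuser.livy.hosts</name>
      <!-- only accept impersonation requests coming from the Livy machine -->
      <value>livy-host.example.com</value>
    </property>
    <property>
      <name>hadoop.proxyuser.livy.groups</name>
      <!-- only allow impersonating members of this group -->
      <value>livy-users</value>
    </property>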

And if a different user submits a statement in another session, the YARN cluster shouldn’t show resources from harun’s session. How does it do that? Isn’t there only one Kerberos ticket associated with both sessions? Is it really only the session number that distinguishes between resources?

The only one with a Kerberos ticket during secure impersonation is the Livy server itself; the user “harun” does not get a ticket of its own in this case. Livy requests so-called delegation tokens for HDFS, YARN, and Hive Metastore access (and other services), which are then made available to the Spark driver. The process of requesting delegation tokens also supports impersonation, so each delegation token only allows the Spark driver to do the things that user “harun” (for example) is allowed to do.
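
To see this in action, you could open two sessions with different proxyUser values and run the same read against a private HDFS path. A sketch along the lines of the earlier Python example (user names and path are made up; status polling is only hinted at, since statement execution is asynchronous):

    # isolation_demo.py -- sketch: same code, different proxyUser, different rights
    import requests
    from requests_kerberos import HTTPKerberosAuth

    LIVY_URL = "http://livy-host.example.com:8998"  # placeholder
    AUTH = HTTPKerberosAuth()

    def run_as(proxy_user, code):
        """Open a session impersonating proxy_user and submit one statement."""
        sess = requests.post(f"{LIVY_URL}/sessions",
                             json={"kind": "pyspark", "proxyUser": proxy_user},
                             auth=AUTH).json()
        # (in real code: poll GET /sessions/{id} until the session is "idle",
        #  then poll the statement, since execution is asynchronous)
        return requests.post(f"{LIVY_URL}/sessions/{sess['id']}/statements",
                             json={"code": code}, auth=AUTH).json()

    code = "spark.read.text('/user/harun/private.txt').count()"
    run_as("harun", code)  # works: harun's delegation tokens permit the read
    run_as("other", code)  # fails: HDFS rejects the read with a permission error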

See Hadoop Delegation Tokens Explained - Cloudera Blog for a further explanation of delegation tokens.

Björn

Hi again,

thanks for the answer, it explained a lot, and the links also helped very much in understanding what impersonation is and how it works with YARN.

Another question from me: is YARN really the best option to orchestrate the Spark cluster? That got me thinking, and while looking at the Spark documentation I came upon the support for Kubernetes. Would it be possible to specify a Kubernetes node as a Spark master and still be able to use the impersonation feature of Livy with Kerberos? As far as I’ve seen, Kubernetes also has its own impersonation ability, but that would be two levels of impersonation, and I think Livy can’t use this feature by default. Is there a smart way to run Spark in a Kubernetes cluster and still be able to impersonate the sessions somehow? Using Kubernetes would be very nice for orchestrating Spark and deploying it everywhere.

And also, I guess that would mean the impersonation isn’t based on Kerberos anymore.

Another question from me: is YARN really the best option to orchestrate the Spark cluster? That got me thinking, and while looking at the Spark documentation I came upon the support for Kubernetes.

Unfortunately I cannot speak to the Kubernetes support in Spark or Livy, since I have no experience with that.

The whole security architecture I have described so far is specific to Spark + Kerberos + Hadoop (YARN + HDFS). This is a well-tested and common type of setup, with good vendor support. Kubernetes support, on the other hand, was added only recently with Spark 2.3 and is still marked as experimental [1].

[1] Running Spark on Kubernetes - Spark 3.5.0 Documentation
