Set Spark config values in the PySpark node to access Data Lake from Databricks

I have connected KNIME to Azure Databricks through the Create Databricks Environment node and use the PySpark Script Source node to send Spark commands.

Databricks connects to Azure Data Lake Storage to fetch the data. To authenticate Databricks against Azure Data Lake, Azure Active Directory is used.

For authentication, I am following this blog post.

Scala code (the angle-bracket placeholders need to be replaced with your own values):

spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", dbutils.secrets.get(scope="<scope-name>", key="<key-name>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

In a Databricks notebook we can use the %scala magic inside a Python notebook and configure the authentication there, but we don't want a notebook in between. I am planning to use the PySpark node instead of a notebook.

My question is how to set the above configuration from inside the PySpark node.
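
For reference, the same calls can presumably be written directly in Python inside the PySpark Script Source node, since spark.conf.set has the same signature in PySpark. A sketch (the angle-bracket placeholders are assumptions and must be replaced with your own values; dbutils has to be available in the node, which is discussed further below):

# Sketch: OAuth configuration for ADLS Gen2, written in Python for the PySpark node.
# <storage-account-name>, <application-id>, <scope-name>, <key-name> and <directory-id>
# are placeholders, not values from this thread.
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", dbutils.secrets.get(scope="<scope-name>", key="<key-name>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")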


Hi,
There is also Python code in the blog post you linked:

spark.conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("fs.adl.oauth2.client.id", "<application-id>")
spark.conf.set("fs.adl.oauth2.credential", dbutils.secrets.get(scope="<scope-name>", key="<key-name>"))
spark.conf.set("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

Have you tried that? The Python and Scala calls look very similar; it seems you can copy them almost one-to-one.
Kind regards,
Alexander

I haven't given the Python code a shot yet.

Could you let me know how to import the DBUtils library?

from pyspark.dbutils import DBUtils


In the PySpark node I am not able to import the DBUtils package (installed with pip install DBUtils).

Hi,
I found a possible solution here:

def get_dbutils():
    try:
        # On a Databricks cluster / with databricks-connect, DBUtils ships with pyspark;
        # this assumes a SparkSession named `spark` is available, as in the PySpark node.
        from pyspark.dbutils import DBUtils
        return DBUtils(spark.sparkContext)
    except ImportError:
        # In a Databricks notebook, dbutils already exists in the IPython user namespace
        import IPython
        return IPython.get_ipython().user_ns['dbutils']
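
For illustration, usage inside the node might then look like this (the scope, key, and storage account names are placeholders, not taken from the blog):

# Fetch the service principal secret via dbutils and use it in the Spark config
dbutils = get_dbutils()
client_secret = dbutils.secrets.get(scope="<scope-name>", key="<key-name>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", client_secret)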

Can you try that?
Kind regards,
Alexander

Hi @AlexanderFillbrunn,

Yes, I tried that earlier, but there are errors.

I am not sure why the PySpark node is not letting me import the dbutils library from pyspark.

The dbutils module is not a standard part of pyspark. Instead, it is made available through the databricks-connect module, which supplies its own version of pyspark augmented with special, Databricks-specific capabilities. This is non-obvious when users are instructed to write code like from pyspark.dbutils import DBUtils (as the Databricks Connect documentation also advocates): the assumption that DBUtils comes from plain pyspark is incorrect – it comes from the databricks-connect module co-opting the pyspark name (and much of its code).

Do you have the databricks-connect module installed?
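
For what it's worth, a quick way to check, in the Python environment the node actually uses, which pyspark is on the path and whether it provides DBUtils (a sketch):

import importlib.util
import pyspark

# If databricks-connect is installed, pyspark should resolve to its bundled copy
print(pyspark.__file__)
# True only when the installed pyspark ships the Databricks DBUtils module
print(importlib.util.find_spec("pyspark.dbutils") is not None)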


Hi @potts,

Thanks for your response. I have installed databricks-connect as described in the Databricks documentation.

Even with databricks-connect installed, I am not able to import dbutils.

Hi @mathi,

I suggest verifying that the conda environment you are pip-installing into locally (a conda env named “py3_knime” on a Windows system, it appears) matches the environment on your Spark cluster (which appears to be Linux), where your Python code will actually run.

In a local conda env, I installed databricks-connect==7.1.1 against Python 3.8 and in the local Python shell I could successfully do the following: from pyspark.dbutils import DBUtils

In a different local conda env, I installed databricks-connect==5.5.3 (the version you have installed on your Windows system) against Python 3.6 (because 5.5.3 appears not to fully support 3.8), and again the same import of DBUtils succeeded.
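
For reference, a minimal local check along those lines (run inside the activated conda env, e.g. the “py3_knime” env mentioned in this thread):

# Raises DistributionNotFound if databricks-connect is not installed in this env
import pkg_resources
print(pkg_resources.get_distribution("databricks-connect").version)  # e.g. 5.5.3 or 7.1.1

# The import verified above; it fails with a stock pyspark installation
from pyspark.dbutils import DBUtils
print(DBUtils)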

As to where your Spark cluster actually is or which Python installation it uses, I have no insight beyond the clue that it appears to be a Unix system, based on the file paths captured in one of your screenshots. On a Databricks Spark cluster, I believe installing databricks-connect is part of the Databricks installation instructions – I mention this because connecting to a Spark cluster that is not a Databricks Spark cluster could lead to some confusion.

Hope this helps,

Davin

This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.