Parquet to Spark with Local Big Data Env and Azure Blob Connection

Hi all,

I have a similar problem to the one in this discussion: Azure and local Spark

I want to load parquet files from a folder in Azure Blob Storage (ADLS2) into the Spark context of my local big data environment.

I get the following error:

ERROR Parquet to Spark     4:1891     Execute failed: No FileSystem for scheme: wasbs (IOException)

I understand from the other discussion that I have to change some configuration in the “Create Local Big Data Environment” node in order to be able to use the Azure File System connection.

Does anybody know what exactly I have to do to make this work?
Thank you very much!

KNIME Version 4.3.1 (cannot upgrade currently)
KNIME Azure Cloud Connectors 4.3.0
KNIME Extension for Apache Spark 4.3.2
KNIME Extension for Local Big Data Environments 4.3.1

Hi @Mol1hua,

the Local Big Data Environment contains a reduced Hadoop version that does not support the Azure file systems. You can instead read the files with the normal Parquet Reader node and then use Table to Spark to get your data into the local Spark instance.

Note that Azure Blob Storage and Azure Data Lake Storage Gen 2 (ADLS2) behave slightly differently, and KNIME provides two different connectors to handle this.

Cheers
Sascha


Hi @sascha.wolke,

Thank you for your reply!

So just to be sure I understand correctly: it is currently not possible to connect Azure file systems to the Local Big Data Environment directly, not even with the configuration changes suggested in Azure and local Spark?

Hi @Mol1hua,

in production you usually use a real Spark cluster, and depending on your setup this should work with Azure Blob Storage because the cluster contains the required libraries. KNIME exports the selected path as a wasbs://... URL and reads the Parquet files in Spark using this URL.
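For illustration, this is roughly what happens on the Spark side of such a cluster; the container, account, and path below are just placeholders, not something KNIME generates for you:

```python
# Rough sketch of what the Parquet to Spark node triggers on a cluster that
# already ships the hadoop-azure libraries; container, account name, and
# path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet(
    "wasbs://mycontainer@myaccount.blob.core.windows.net/path/to/data"
)
df.printSchema()
```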

The linked post mentioned that it might be possible to get the Local Big Data Spark working together with the Azure stuff, but this can become tricky and I have not tried it myself. Any reason why you can’t use the normal Parquet Reader? As far as I remember, KNIME bundles Hadoop 2.7.6, so you would have to add the hadoop-azure jars and all their dependencies to the custom Spark settings, together with your credentials.
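If you really want to experiment, here is an untested sketch of what the setup would roughly look like, shown as a standalone PySpark session; the jar versions, the fs.wasbs.impl class, and the account/key values are assumptions on my side, and in KNIME the same keys would have to go into the custom Spark settings of the Create Local Big Data Environment node:

```python
# Untested sketch: make the wasbs:// scheme available to a plain local Spark
# session. Versions, account name, and key are placeholders / assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Pull in hadoop-azure and its azure-storage dependency (assumed versions).
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-azure:2.7.6,"
        "com.microsoft.azure:azure-storage:2.0.0",
    )
    # With Hadoop 2.7 the wasbs scheme may not be registered automatically.
    .config(
        "spark.hadoop.fs.wasbs.impl",
        "org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure",
    )
    # Storage account credentials, forwarded to the Hadoop configuration.
    .config(
        "spark.hadoop.fs.azure.account.key.myaccount.blob.core.windows.net",
        "<storage-account-key>",
    )
    .getOrCreate()
)

df = spark.read.parquet(
    "wasbs://mycontainer@myaccount.blob.core.windows.net/path/to/data"
)
```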

Cheers
Sascha

Hi @sascha.wolke,

We were looking into this topic during some timing tests, comparing the Local Big Data Environment, a KNIME loop with the Parquet Reader, and a Databricks cluster for aggregating multiple Parquet files in one go.

If I use the Parquet Reader, I convert to a KNIME table and then load it back into Spark. With the “Parquet to Spark” node, I can load directly into Spark without the KNIME step in between.
But I will keep it in mind as a workaround. 🙂

Thank you for your input; now we know that connecting the Local Big Data Environment directly to Azure is currently not possible.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.