KNIME Spark Context problems

KNIME presently offers two ways to create a Spark context: the Create Local Big Data Environment node and the Create Spark Context (Livy) node. With the Local Big Data Environment node, Spark can access files only on the local (Linux) file system, not on HDFS. Using the Create Spark Context (Livy) node requires an installation of either HDP or Cloudera, and even then both use Spark 2.4.x. Apache Livy itself is also outdated: its development stalled long ago, and it requires Spark versions compiled against Scala 2.10 or 2.11, whereas current Spark binaries (Spark 3.0) are compiled against Scala 2.12. Using the Create Spark Context (Livy) node with Cloudera or HDP brings its own set of problems (for example, I tried my best to forward and expose the Apache Livy port 8999 in the HDP Sandbox 3.0.1 but could not succeed).
Is there any other option (direct or indirect) through which one can access the HDFS file system and use Spark to read/write files on HDFS instead of the local file system? Thanks for any help.


Hello @ashokharnal -

This is a good question, and one I haven’t been able to find a quick answer for. I’ve asked internally and I’ll let you know what I find out.

Thanks for the reply; I will await your findings. I plan to use KNIME in my classes on Hadoop/Spark, and there is still some time before the classes begin. Thanks.


Hi @ashokharnal,

Hortonworks and Cloudera have merged into one company. Development on HDP has stopped and continues in the Cloudera Data Platform (CDP). You can find the supported Spark versions in the documentation: KNIME Big Data Extensions Admin Guide

Other alternatives can be found here (e.g. Databricks): KNIME Documentation

The current KNIME 4.7.0 release supports Spark 3.2 with Scala 2.12.

You do not need Spark at all if you only want to read files from HDFS into KNIME. See the HDFS Connector and e.g. the Parquet Reader.
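As an aside, reading a file from HDFS without any Spark involvement can also be done over Hadoop's WebHDFS REST interface. A minimal Python sketch; the host, port, and file path below are placeholder assumptions (9870 is the default NameNode HTTP port in Hadoop 3.x):

```python
# Sketch: reading a file from HDFS without Spark, via the WebHDFS REST API.
# NAMENODE, HTTP_PORT, and FILE_PATH are placeholders for your own cluster.
NAMENODE = "localhost"             # assumption: NameNode host
HTTP_PORT = 9870                   # default NameNode HTTP port in Hadoop 3.x
FILE_PATH = "/user/demo/data.csv"  # hypothetical HDFS path

def webhdfs_open_url(namenode: str, port: int, path: str) -> str:
    """Build the WebHDFS URL that streams a file's contents (op=OPEN)."""
    return f"http://{namenode}:{port}/webhdfs/v1{path}?op=OPEN"

url = webhdfs_open_url(NAMENODE, HTTP_PORT, FILE_PATH)
print(url)

# To actually fetch the bytes (requires a running NameNode):
# import urllib.request
# data = urllib.request.urlopen(url).read()
```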

Note that the Local Big Data Environment should only be used for testing; you should always use a real Spark cluster in production.

Cheers,
Sascha


Thank you for your reply. I intend to use KNIME for teaching Spark machine learning. I have installed Hadoop on a Linux system by simply following the 'Getting Started' steps for a single node (Pseudo-Distributed Operation) from here, and then downloaded and installed a compatible Spark binary (prebuilt for Apache Hadoop 3.3 and later) from here. After setting the requisite PATH and HOME variables in .bashrc, Spark can read files from HDFS, perform ML operations on the data and, if required, save files back to HDFS. That is, a minimal Hadoop-Spark system exists for experimentation. I am also able to carry out Spark Streaming experiments using pyspark.
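For reference, the .bashrc additions look roughly like this (the install paths are placeholders; adjust them to wherever Hadoop and Spark are unpacked on your system):

```shell
# Example ~/.bashrc entries for a single-node Hadoop + Spark setup.
# The install paths below are placeholders; adjust them to your system.
export HADOOP_HOME=/opt/hadoop
export SPARK_HOME=/opt/spark
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin
```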
On this system, KNIME can access HDFS through KNIME's HDFS Connector and can also access the Hive server through KNIME's Hive Connector. But as Apache Livy does not work with Spark versions beyond 2.4.x, I am unable to configure Apache Livy properly and therefore cannot use KNIME's 'Create Spark Context (Livy)' node to create a Spark context.
The KNIME Big Data Extensions Admin Guide is directed at Cloudera/HDP installations or at Databricks clusters and is not generic in nature, and that is the source of the problem. For example, if I have Hadoop plus a Spark 3.2 binary (prebuilt for Hadoop) installed, how do I use KNIME's 'Create Spark Context (Livy)' node to connect to Spark? I am prepared to install any other software. Does a KNIME parcel exist to help?
Thank you for your time,
Ashok Kumar Harnal

Hi,

Could you use the Create Local Big Data Environment node without HDFS? It includes your local file system as the default file system in Spark, so I am not sure why you need HDFS in that case.

If the Local Big Data Environment is not enough and you have a Spark cluster, then you need to use the Livy or Databricks node. This is the suggested way; there is no other way to connect KNIME and Spark in that case. Cloudera and AWS EMR already provide Spark 3.2 and Livy. See the Cloudera documentation about the parcel: Installing CDS 3.2.1 Powered by Apache Spark
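In case it helps to see what Livy does under the hood: creating a Spark session through Livy is just a REST call, a POST of a small JSON body to /sessions. A minimal Python sketch; the host and port are assumptions (Livy's default port is 8998):

```python
import json

# Sketch: the JSON body Livy expects when creating a session via
# POST /sessions. The URL below is an assumption; Livy defaults to 8998.
LIVY_URL = "http://localhost:8998"

def session_request(kind: str = "pyspark") -> str:
    """Serialize a minimal Livy session-creation request body."""
    return json.dumps({"kind": kind})

body = session_request()
print(body)

# To actually create the session (requires a running Livy server):
# import urllib.request
# req = urllib.request.Request(LIVY_URL + "/sessions", data=body.encode(),
#                              headers={"Content-Type": "application/json"})
# urllib.request.urlopen(req)
```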

If you don't want to use Cloudera/AWS/Databricks, you have to compile Livy yourself from the current master branch. I am not sure whether this is an easy task. It looks like there is no recent release, and the ticket about a new release is still open: [LIVY-901] Livy 0.8.0 Dependency Upgrades - ASF JIRA
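Building from source would look roughly like this (a sketch only; it assumes git, a JDK, and Maven are installed, and the exact Maven options may differ on the current master branch):

```shell
# Rough sketch of building Livy from source. Assumes git, a JDK, and
# Maven are on the PATH; flags may differ on the current master branch.
git clone https://github.com/apache/incubator-livy.git
cd incubator-livy
mvn clean package -DskipTests
```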

Does a KNIME parcel exist to help?

KNIME does not provide a parcel anymore, as Cloudera already provides one (see the link above).

Cheers,
Sascha
