Knime's support for HDP 3.0.1?

When will Knime’s “Extension for Apache Spark” be updated for Hortonworks HDP 3.0.1?

According to the extension’s web site, the most recent version of Spark Job Server available for Knime was for HDP 2.6.5. I went ahead to see how well it worked for HDP 3.0.1, since HDP 2.6.5 and 3.0.1 use slightly different Spark 2.3 versions (2.3.0 in HDP 2.6.5 versus 2.3.1 in HDP 3.0.1). My simple tests of linear regression, decision tree, correlation, and PCA seemed fine, but “Spark k-means” gave me the following error. Could this be due to the difference in Spark versions? If so, why does it only happen with k-means and not with the other models? Any pointers are appreciated.

ERROR Spark k-Means 2:18 Execute failed: org.apache.spark.ml.PipelineStage; local class incompatible: stream classdesc serialVersionUID = 7330592925129616646, local class serialVersionUID = 3275105016155696140 (InvalidClassException)

Hi @analytics1

yes, KNIME Extension for Apache Spark generally supports HDP 3.0.1, but there are some limitations which we are aware of.

The first limitation you have noticed yourself: Spark k-Means does not work. Reason: Spark in HDP 3.0.1 contains some changes that make its k-Means model incompatible with the k-Means model from Apache Spark (used by KNIME). Addressing this problem unfortunately requires some effort. We are hoping to have this particular problem (and others of this type) addressed for the summer release 2019. As far as we know, k-Means is currently the only node affected by this problem.
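To illustrate the mechanism behind the error message: Java serialization writes a class's serialVersionUID into the stream and rejects the stream on the receiving side if the local class has a different value. The following is only a minimal sketch of that check using plain JDK classes, not KNIME or Spark code. The class `Model` and the helper names are hypothetical stand-ins, and the mismatch is simulated by overwriting the UID bytes in the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

class SerialVersionUidDemo {

    // Hypothetical stand-in for a serializable model class.
    static class Model implements Serializable {
        private static final long serialVersionUID = 1L;
        double[] centers = {0.5, 1.5};
    }

    // Returns the serialVersionUID that the local JVM associates with a class.
    static long localUid(Class<?> cls) {
        ObjectStreamClass desc = ObjectStreamClass.lookup(cls);
        if (desc == null) {
            throw new IllegalArgumentException(cls.getName() + " is not serializable");
        }
        return desc.getSerialVersionUID();
    }

    // Serializes an object with standard Java serialization.
    static byte[] serialize(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns a copy of the stream whose class descriptor carries another UID,
    // simulating a sender that was built against a different version of the class.
    static byte[] withUid(byte[] stream, long uid) {
        byte[] copy = stream.clone();
        // Layout: magic(2) version(2) TC_OBJECT(1) TC_CLASSDESC(1) nameLength(2) name(n) uid(8)
        int nameLength = ((copy[6] & 0xFF) << 8) | (copy[7] & 0xFF);
        ByteBuffer.wrap(copy).putLong(8 + nameLength, uid);
        return copy;
    }

    // Tries to deserialize; returns null on success or the InvalidClassException message.
    static String readError(byte[] stream) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(stream))) {
            in.readObject();
            return null;
        } catch (InvalidClassException e) {
            return e.getMessage();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] original = serialize(new Model());
        System.out.println("local UID:         " + localUid(Model.class));
        System.out.println("matching stream:   " + readError(original));
        System.out.println("mismatched stream: " + readError(withUid(original, 42L)));
    }
}
```

The mismatched stream fails with a “local class incompatible” message of the same shape as the one reported above. Assuming the Spark jars are on the classpath, `localUid(Class.forName("org.apache.spark.ml.PipelineStage"))` could be run on the client and on the cluster to compare the two values.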

The second limitation has to do with accessing Hive from Spark. Starting with HDP 3, Hive and Spark use different metastore catalogs (see [1]). This means that the Hive to Spark and Spark to Hive nodes cannot read/write Hive tables on HDP 3. The tables that these nodes read/write live in a different metastore catalog than the normal Hive tables.

“According to the extension’s web site, the most recent version of Spark Job Server available for Knime was for HDP 2.6.5.”

We are currently in the process of restructuring the documentation of the Spark integration. Website and documentation will be updated in early January. My apologies for the delay.

Best,
Björn

[1] https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.0/hive-overview/content/hive-apache-hive-3-architecturural-overview.html


One addendum: There is a possible workaround for the second limitation (Spark and Hive not seeing each other's metastore tables).

It is possible to use “Hive Warehouse Connector” from inside a Spark DataFrame Java Snippet (Source).
You can pass in the SQL query via flow variable and then use the Hive Warehouse Connector as described in the Hortonworks blog post here (section 2.1):

https://de.hortonworks.com/blog/hive-warehouse-connector-use-cases/

The Scala code in the blog post should be very straightforward to convert to Java. You will need the hive-warehouse-connector jar both in KNIME AP (to compile the Java Snippet) and in the Spark classpath of your cluster (on HDP 3 it is hopefully preinstalled).

The jar can be obtained from the Hortonworks maven repository:

http://repo.hortonworks.com/content/groups/public/com/hortonworks/hive/hive-warehouse-connector_2.11/1.0.0.3.0.1.6-2/

Best,
Björn
