Spark to DB error

andrew.dun · August 10, 2021, 12:09am

I’m using the Spark nodes with a Databricks cluster, and I’m having trouble executing the Spark to DB node. As per the attached screenshot, this can be demonstrated with a four node workflow:

Create Databricks Environment → DB Table Selector → DB to Spark → Spark to DB

knime_spark

The first three nodes run okay, but the final node gives the following error:

ERROR Spark to DB          0:31:2913  Execute failed: Failed to load JDBC data: [Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: Error running query: org.apache.spark.sql.catalyst.parser.ParseException: 
no viable alternative at input '("Id"'(line 1, pos 50)

== SQL ==
CREATE TABLE `default`.`spark_test_survey_table` ("Id" INTEGER , "SurveyName" TEXT , "CreatedBy" TEXT , "CreatedDate" TEXT , "IsCheckboxSurvey" BIT(1) , "IsActive" BIT(1) , "AllowQA" BIT(1) , "PreviousSurveyId" INTEGER )
--------------------------------------------------^^^
, Query: CREATE TABLE `default`.`spark_test_survey_table` ("Id" INTEGER , "SurveyName" TEXT , "CreatedBy" TEXT , "CreatedDate" TEXT , "IsCheckboxSurvey" BIT(1) , "IsActive" BIT(1) , "AllowQA" BIT(1) , "PreviousSurveyId" INTEGER ).

It looks to me like the problem may be that the SQL output from the node is quoting the column names, eg “Id”, when, apparently, Databricks expects column names to be unquoted.

I’m wondering if this is the problem and if so whether there may be any workaround?

sascha.wolke · August 16, 2021, 1:17pm

Hi @andrew.dun,

not sure what you try to accomplish, but this is not the way the Databricks ENV ports should be used. The DB to Spark and Spark to DB nodes using a JDBC connection in Spark to transfer the Data between the DB and Spark (and this not very efficient). The Databricks ENV node exposes a DB and a Spark port. The DB port can be used with the Database nodes in KNIME to send SQL queries to Spark and execute them in the Spark cluster. The Spark nodes can be used to execute stuff using the Spark API on the cluster. The DB to Spark and Spark to DB nodes should be used to read/write data from/to an external DB that is not Spark.

Cheers
Sascha

andrew.dun · August 17, 2021, 8:14am

Thanks Sascha. That’s indeed the way I’ve been using them, and have no issues using the DB port to execute queries on the spark. As per the following screenshot, I’ve also succeeded in using the Spark API (eg row filtering, column renaming) - the problem comes when I want to transfer the data executed via the API back to the DB.

I can certainly still work fine with the DB nodes - however I was keen to see whether I might get better performance by working through the Spark API.

sascha.wolke · August 17, 2021, 9:41am

Hi @andrew.dun,

I highly suggest to use the Spark Node or the DB nodes and don’t mix/connect them. Is there any specific node missing? There is a DB Column Rename and DB Row Filter node too.

You can use the Spark DataFrame Java Snippet (Sink) node and create a virtual table from your data frame using dataFrame.write().saveAsTable("mytable") as code. After that, the DB table selector should be able to read the table.

Cheers
Sascha

andrew.dun · August 18, 2021, 6:52am

Thanks @sascha.wolke

This has worked and I’ve now been able to achieve what I wanted. For the record, this was initially just to benchmark the performance of moving data around with the API vs the DB nodes, of using plain parquet vs parquet with delta, and of processing data within the API vs the DB nodes (though currently the amount of data is trivial so its probably not a fair comparison).

sascha.wolke · August 18, 2021, 1:53pm

Hi @andrew.dun,

another thing you might like to test are the Hive to Spark and Spark to Hive nodes that let you store and read data from delta too. That’s another way to exchange the data between the DB and Spark nodes.

Can’t read the things on your screenshot on this resolution, but feel free to ask if you have further questions.

Cheers
Sascha

andrew.dun · August 19, 2021, 2:02am

Great, thanks Sascha. I’ve gone ahead and tried out the Hive nodes and they’re working well too. It’s helpful to know there are several options to choose from.

Cheers,
Andrew.

system · September 8, 2021, 1:19am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.