How to update(or delete/insert) spark data in hive table

hhkim · September 14, 2022, 8:18am

Hello.

I was wondering if I can update spark data in hive table.
I can’t use spark to table node because a data size is too large.
So I want to update spark data directly in hive table, but I don’t know how.
Please let me know if there is a way to update directly.

Thanks,
hhkim

mlauber71 · September 15, 2022, 3:51pm

@hhkim first you might want to familiarise yourself with the workings of Spark and what a data frame / RDD is (Spark SQL and DataFrames - Spark 3.3.0 Documentation). Typically they only exist during execution (Being Lazy is Useful — Lazy Evaluation in Spark | by Lakshay Arora | Analytics Vidhya | Medium). And you might have to store them to the (HDFS or other) file system in order to keep them. One intermediate way to do that is persist and un-persist but that might only be useful if you stay within a certain Spark task you are doing. Typically the RDDs cannot be changed but a new one would be created for every manipulation - and if the system runs thru and you would not persist they would only exist with data as long as they are needed (An overview of KNIME based functions to access big data systems - use it on your own big data system (including PySpark) – KNIME Hub).

With the help of KNIME nodes you can execute various Spark tasks (KNIME Extension for Apache Spark | KNIME) including SparkSQL and later store the result on the Big Data system as a hive file. What format that has is another debate, external table, ORC … and (you might have guessed it … depends on the version of hive (and Cloudera or whatever system you are using) (Load and Write Data into Hive Corporate DB - #4 by mlauber71):

So your task might involve some planning and you might want to decide where you want the task to be executed. This very much will depend on the performance you can allocate on the cluster and the nature of your task. Depending on your settings Hive or Spark (Create Spark Context (Livy) – KNIME Hub) might be configured to allocate resources dynamically in oder to fulfill the task (if the admin would let you).

https://docs.knime.com/latest/bigdata_extensions_user_guide/index.html#spark_livy

More examples of the workings between KNIME and Hive can be found on the hub:

hhkim · September 19, 2022, 5:35am

Thank you for your reply.
There were many things that helped me to use Spark and Hive environment.

But, as you know, there is no ‘append’ option in ‘Spark to Hive’ node.
Comparing with ‘Spark to DB’, you can see that it has ‘append’ option.

(Spark to Hive)

(Spark to DB)

So, if there is no way to append(insert) spark data into hive table, then is this the only way to create a new table each time?

Thanks,
hhkim

sascha.wolke · September 22, 2022, 9:01am

Hi @hhkim,

can you explain how your setup looks like?

If you have a Hive cluster, you should use ACID tables that support insert/update/append. Spark does not support these types of tables and requires a warehouse connector to access tables managed by Hive. Cloudera already includes them, and the Spark to Hive Node should you them by default. More information can be found in the Cloudera documentation.

If you use a Databricks cluster, delta tables should support these operations. See Databricks documentation.

If you use the Create Local Big Data Environment Node, don’t use this in production with large data. This is a testing node and the DB output port does not connect to any real Hive cluster, it’s only the Spark Thrift server that emulates this. You might use the DB Loader or, without any DB nodes, use the Spark_to_Parquet node instead that supports appending.

Cheers
Sascha

system · December 21, 2022, 9:02am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.