How to update (or delete/insert) Spark data in a Hive table

@hhkim first you might want to familiarise yourself with the workings of Spark and what a DataFrame / RDD is (Spark SQL and DataFrames - Spark 3.3.0 Documentation). Typically they only exist during execution (Being Lazy is Useful — Lazy Evaluation in Spark | by Lakshay Arora | Analytics Vidhya | Medium), so you might have to store them to the (HDFS or other) file system in order to keep them. One intermediate way to do that is persist/unpersist, but that is mainly useful while you stay within a given Spark task. RDDs themselves cannot be changed: every manipulation creates a new one, and if the job runs through without persisting, the data only exists as long as it is needed (An overview of KNIME based functions to access big data systems - use it on your own big data system (including PySpark) – KNIME Hub). A minimal sketch of this behaviour follows below.
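To make the lazy evaluation and persist/unpersist points concrete, here is a minimal PySpark sketch. The table name `my_db.my_table` and the column `value` are placeholders, not from this thread:

```python
from pyspark.sql import SparkSession

# Hive support so Spark can see tables in the metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.table("my_db.my_table")       # lazy: nothing is read yet
filtered = df.filter(df["value"] > 100)  # still lazy: a *new* DataFrame, the old one is untouched

filtered.persist()       # ask Spark to keep the result once it has been computed
print(filtered.count())  # first action triggers the actual computation
print(filtered.count())  # second action reuses the persisted data

filtered.unpersist()     # release the cached data when you are done
```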

With the help of KNIME nodes you can execute various Spark tasks (KNIME Extension for Apache Spark | KNIME), including Spark SQL, and later store the result on the big data system as a Hive table. What format that should have is another debate (managed or external table, ORC, …) and, you might have guessed it, depends on the version of Hive and on whether you are on Cloudera or another system (Load and Write Data into Hive Corporate DB - #4 by mlauber71).
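Purely as an illustration, assuming ORC and placeholder table names, storing a Spark result back as a Hive table from PySpark could look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

result = spark.table("my_db.my_table").filter("value > 100")

# Store the result as a managed Hive table in ORC format
(result.write
    .format("orc")
    .mode("overwrite")   # or "append" to add rows to an existing table
    .saveAsTable("my_db.result_table"))

# The same idea expressed as Spark SQL (create-table-as-select):
# spark.sql("""
#     CREATE TABLE my_db.result_table STORED AS ORC AS
#     SELECT * FROM my_db.my_table WHERE value > 100
# """)
```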

So your task might involve some planning, and you will want to decide where the work should be executed. This very much depends on the resources you can allocate on the cluster and on the nature of your task. Depending on your settings, Hive or Spark (Create Spark Context (Livy) – KNIME Hub) might be configured to allocate resources dynamically in order to fulfil the task (if the admin would let you).
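In KNIME these settings live in the Create Spark Context (Livy) node dialog, but if you build the Spark session yourself, dynamic allocation boils down to plain Spark properties. The executor counts below are made-up values, not recommendations:

```python
from pyspark.sql import SparkSession

# Hypothetical values: min/max executors depend on what your admin allows
spark = (SparkSession.builder
    .enableHiveSupport()
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .config("spark.shuffle.service.enabled", "true")  # needed so idle executors can be released safely
    .getOrCreate())
```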

https://docs.knime.com/latest/bigdata_extensions_user_guide/index.html#spark_livy

More examples of how KNIME and Hive work together can be found on the hub.
