How to insert data into hive table with partition

hhkim · December 15, 2022, 9:32am

Hello.

I was wondering if I can insert data into hive table by using this sql query:

INSERT OVERWRITE TABLE testdb.test_table PARTITION (base_date = ‘202212’)

I have used “DB Connection Table Writer” node but it can’t be used with PARTIOTION option.

I think there is no node that can insert connected data into table with PARTITION option.

I’d would appreciate it if you could tell me how I can use PARTITION option.

mlauber71 · December 15, 2022, 3:57pm

@hhkim You can use the „DB Table creator“ with additional settings for partitions and then load data into the newly created table. The DB Loader would need a temporary folder in the HDFS file system.

You can try and adapt this example here. Please note the partition would be the last column and is being omitted i the create on purpose since it will be set in the additional options.

Another option would be to load the data into a temporary table and then use an SQL Executor command to insert the data.

Here is a complete „self-healing“ reporting system with Hive that could demonstrate additional functions:

mlauber71 · December 18, 2022, 7:49pm

@hhkim I created an example workflow that would demonstrate the use of Hive tables and partitions with the KNIME DB nodes.

If you want to do an OVERWRITE of a specific partition the easiest way to do so would be to first explicitly drop the relevant partition (there is a node doing that in the example).

please download the whole workflow group to get a complete example:

hhkim · December 19, 2022, 5:25am

Thank you for your reply.

I think DB Table Creator or DB Reader node can use my local PC resource.

So I just want insert selected DB data into DB table directly by using “INSERT OVERWRITE with PARTITION” query.

“INSERT OVERWRITE with PARTITION” query can use when the table that I want to select already exists.
But when the table that I want to select comes from DB node, I don’t know how to do.

I hope my explanation is enough so you can understand what I mean.

mlauber71 · December 19, 2022, 7:19pm

@hhkim there ware two ways to add a partition to an existing one:

You can use the DB Connection Table Writer and just make sure your partitions are in the same order (at the end) of the file - like in this version the column “education”. Here it would append the data (all educations=Masters) if you would execute it more than once:

Or you could use the syntax you already have mentioned - “INSERT OVERWRITE” where you mention the partition column. Again: the partitions have to be the last columns (and have to have the same order as in the target file).

INSERT OVERWRITE TABLE `default`.`data_train`
PARTITION (`education`)
SELECT *
FROM `default`.`data_education_masters`

You would add the partition in the form of an additional folder or parquet files within the folder (on the HDFS):

The settings for the target table “data_train” have been set in the additional options of the DB Table Creator node

STORED AS PARQUET
PARTITIONED BY (education STRING)
TBLPROPERTIES ( 'parquet.compression'='snappy'
  , 'external.table.purge'='true'
  , 'transactional'='false'
  , 'discover.partitions' = 'false'
  )

hhkim · December 26, 2022, 6:51am

Thank you for your answer.

Summarizing your answer, I understood that I can insert the data through the “DB SQL Executor” node when the table I want select exists, and I can insert the data through the “DB Connection Table Writer” node when the table I select does not exist.

It was helpful to me

system · January 2, 2023, 6:52am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.