Detect changes during merge operation on DB

Hello,
I would like to MERGE new updates into an existing table and update a timestamp only if one of the SET fields actually changes value. If a record is matched on the key fields but none of its values has changed, there should be no timestamp update.
Is there a nice and simple way to do it?

Thanks in advance for your help.

Hi there @szawadski,

I think I know what you are talking about, but just to be sure, could you give an example with tables?

Br,
Ivan


Hi szawadski,

normally we just add an update timestamp column (and/or a creation timestamp column).
Then, when inserting/updating new data into the table, we set the timestamps.

So I would just add a Date&Time Configuration with "use execution time" enabled to create a new upd_tms column, which would then be set on every update as well?


Here is an example. The primary key is ID and is used to identify the records.

|-----------------------------------------|
| ID | Name | Qty1 | Qty2 | Last_update_on|
|-----------------------------------------|
| 1  | Tom  | 1    | 2    | 2020-01-01    |
| 2  | Ann  | 3    | 4    | 2020-01-02    |
| 3  | Els  | 5    | 6    | 2020-01-03    |
|-----------------------------------------|

I would like to merge with the following table.

|-----------------------------------------|
| ID | Name | Qty1 | Qty2 | Last_update_on|
|-----------------------------------------|
| 1  | Tom  | 7    | 8    | 2020-05-14    |
| 2  | Ann  | 3    | 4    | 2020-05-14    |
| 4  | Ron  | 9    | 1    | 2020-05-14    |
|-----------------------------------------|

I would expect the result to be

|-----------------------------------------|
| ID | Name | Qty1 | Qty2 | Last_update_on|
|-----------------------------------------|
| 1  | Tom  | 7    | 8    | 2020-05-14    | (1)
| 2  | Ann  | 3    | 4    | 2020-01-02    | (2)
| 3  | Els  | 5    | 6    | 2020-01-03    | (3)
| 4  | Ron  | 9    | 1    | 2020-05-14    | (4)
|-----------------------------------------|

(1) This record already existed and had Qty1 and Qty2 updated => timestamp updated
(2) This record already existed but has not been changed => timestamp not updated
(3) This record was not in the merge input, so it stays and remains unchanged => timestamp not updated
An alternative behaviour could be to remove the record (or to add a flag "DELETED = Y") and then update the timestamp.
(4) This record is new (i.e. a new ID) and is added to the table => new timestamp
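
For reference, here is roughly what this behaviour looks like in plain SQL. This is only a sketch, assuming Snowflake-style MERGE syntax, the column names from the example above, and a hypothetical staging table named updates that holds the incoming rows:

```sql
-- Sketch only: Snowflake-style conditional MERGE over the example columns.
-- "updates" is a hypothetical staging table with the incoming rows.
MERGE INTO target t
USING updates u
  ON t.ID = u.ID
WHEN MATCHED AND (t.Name <> u.Name OR t.Qty1 <> u.Qty1 OR t.Qty2 <> u.Qty2)
  -- use IS DISTINCT FROM instead of <> if the compared columns can be NULL
  THEN UPDATE SET
    Name = u.Name,
    Qty1 = u.Qty1,
    Qty2 = u.Qty2,
    Last_update_on = CURRENT_DATE   -- case (1): a value changed, stamp updated
WHEN NOT MATCHED
  THEN INSERT (ID, Name, Qty1, Qty2, Last_update_on)
       VALUES (u.ID, u.Name, u.Qty1, u.Qty2, CURRENT_DATE);  -- case (4)
-- Matched rows with identical values satisfy no WHEN branch, so their
-- timestamp stays untouched (case 2); rows absent from "updates" are left
-- alone (case 3).
```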

This seems to be a pretty common behaviour for an ETL-type job for a data warehouse. Usually, with a bit of effort, I manage to achieve what I need with KNIME. But in this case, the solutions I have in mind seem too complicated for such a common task, so I am asking on the forum: maybe there is a magic node or a magic trick that this great community will help me find… :wink:

@AnotherFraudUser, if I understand your proposal correctly, it would not handle case (2) properly: it would update the timestamp, while I need to keep it at 2020-01-02 (to know that this record last changed on 2020-01-02 and not on 2020-05-14).

FYI, I am trying to develop a sort of generic and configurable component to “Load table into Snowflake”. For now, I rebuild the table from scratch at each load, but the incremental merge would also be interesting.



I believe that part of what you need could be done by the DB Merge node. It may also gain extended functionality in a new release.


Hi szawadski,

The examples helped a lot to understand your problem.
Yes, I think if you do not want case (2) to be updated, it will be hard to do with the current KNIME DB nodes.
I think in this case you will either need to first use the DB Query Reader to read all existing rows into KNIME before the merge (to filter out the existing rows without updates),
or write the update contents into a volatile/temporary or staging table on the database and then do the merge with a DB SQL Executor.
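
To make the second option concrete, here is a minimal sketch of the staging step, assuming Snowflake and the example columns (the table name updates is illustrative):

```sql
-- Sketch only: stage the incoming KNIME rows in a session-scoped temporary
-- table (e.g. created via DB Table Creator and filled via DB Loader), then
-- merge them in a single statement.
CREATE TEMPORARY TABLE updates (
  ID             INTEGER,
  Name           VARCHAR,
  Qty1           INTEGER,
  Qty2           INTEGER,
  Last_update_on DATE
);
-- ...load the incoming rows into "updates"...
-- then run a conditional MERGE (like the one sketched earlier in the thread)
-- from a DB SQL Executor node.
```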

I guess you could also do it with an (ugly) solution within KNIME, as the current DB Merge node does not support skipping existing rows without changes, but I do not think that would be better than using a volatile/temporary table.

Well, one solution in KNIME could be the following for case (2), while handling all the other cases with the normal DB Merge node:
Use the DB Table Selector and Parameterized DB Query Reader to select all rows in your merge data where all columns (or the columns relevant to case (2)) are equal.
Then use the Reference Row Filter to filter these cases out before your merge.
Then do the merge.

And the Parameterized DB Query Reader would just use the SQL you want for checking case (2) :slight_smile:
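
For illustration, the per-row check for case (2) could look something like this. The $...$ placeholders stand for the values the node injects from each input row; the exact placeholder syntax of the Parameterized DB Query Reader may differ:

```sql
-- Sketch only: returns the ID when an incoming row is identical to the
-- stored one, i.e. a case (2) row that should be excluded from the merge.
SELECT ID
FROM target
WHERE ID   = $ID$
  AND Name = $Name$
  AND Qty1 = $Qty1$
  AND Qty2 = $Qty2$;
-- Feed the returned IDs into the Reference Row Filter to drop these rows
-- from the merge input before DB Merge runs.
```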


As far as I understand DB Merge, if I use "ID" as a WHERE field and the other fields as SET fields, I will update the timestamp in case (2), which is not the expected behaviour.
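
To illustrate why: with that configuration, DB Merge effectively issues an unconditional per-row update, something along these lines (a sketch of the idea, not the node's literal SQL):

```sql
-- Sketch of the kind of statement DB Merge runs on a key match (not its
-- literal SQL): the SET list always executes on a match, so Last_update_on
-- is rewritten even when Name/Qty1/Qty2 are unchanged, which is case (2).
UPDATE target
SET Name = ?, Qty1 = ?, Qty2 = ?, Last_update_on = ?
WHERE ID = ?;
```
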
I see the following as a potential solution:

  • first add all the new records to the table (with the new timestamp)
  • then group by all fields + min(timestamp)
  • then group by ID + max(timestamp) and join with the previous step.

This should work (a rough SQL sketch follows below), but it seems compute-intensive and is not adapted to most of my use cases. That's why I am looking for a smarter approach…
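
For the record, here is that idea translated into SQL, assuming the example columns (all names are illustrative):

```sql
-- Sketch only: the "insert everything, then double GROUP BY" idea.
-- Step 0 (not shown): INSERT all incoming rows with the new timestamp.
WITH dedup AS (
  -- identical versions of a row collapse into one, keeping the oldest stamp,
  -- so an unchanged re-insert does not move the timestamp (case 2)
  SELECT ID, Name, Qty1, Qty2, MIN(Last_update_on) AS Last_update_on
  FROM target
  GROUP BY ID, Name, Qty1, Qty2
),
latest AS (
  -- per ID, keep only the most recent surviving version (cases 1 and 4)
  SELECT ID, MAX(Last_update_on) AS Last_update_on
  FROM dedup
  GROUP BY ID
)
SELECT d.ID, d.Name, d.Qty1, d.Qty2, d.Last_update_on
FROM dedup d
JOIN latest l
  ON d.ID = l.ID AND d.Last_update_on = l.Last_update_on;
```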

Thanks for your advice. Unfortunately, this does not fit my use cases, for the following reasons:

  • I cannot afford loading all the data locally; I need to perform the merge on the DB (there are millions of rows in some tables)
  • I want something as generic as possible (i.e. I do not want to adapt the query each time I use a new table)

That's why I think the best solution will come from KNIME. Why not think about a sort of ETL/data-warehouse extension with this kind of feature, plus logging of the changes in a separate table, and so on…

In the meantime, I will test the idea described above (add all rows + double group-by + join).

Thanks anyway for your interest in my question and your help.


Then I think your best bet would be the Parameterized DB Query solution :thinking:
You would just need to set up the SQL statement with parameters depending on the merge data columns.

But good luck with your problem :call_me_hand:


Hi there @szawadski,

Thanks for the vivid example!

So you have a record with an updated timestamp but none of its attributes updated? It can happen while doing ETL (on the source you have attributes that have been updated but are not needed on the destination, so you end up with records that are actually not updated from the destination's point of view). But I wonder how often that happens, and whether KNIME should cover it with a smart approach or the ETL designer should take care of it on the destination side somehow? :confused:

The logic you would need in the DB Merge node is actually "only update if values have changed", with an appropriate dialog to choose the columns upon which this should be checked. The point about deleting/marking data which is not in the input table also makes sense.

Your potential solution seems fine to me!

Br,
Ivan

