Type 7 Slowly Changing Dimension Tables ("SCD")

SCD.knwf (950.0 KB)

I thought I’d take a stab at a more complex, Type 7 SCD problem using @stelfrich’s great component (found here) as a base. I made some updates to cover Type 7 more fully, but could use a hand from the community to make this a bit more robust.
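For anyone new to the pattern: a Type 7 dimension keeps full Type 2 history (one row per version, with validity dates and a current flag) alongside a durable natural key, so downstream queries can join either to the attributes in effect at the time or, via a “current” view, to the latest attributes. A rough SQL sketch of such a table, with hypothetical names loosely based on the listing fields below (the actual workflow builds this with KNIME DB nodes rather than hand-written DDL):

```sql
-- Rough sketch of a Type 7 dimension (hypothetical names, PostgreSQL-style DDL).
CREATE TABLE dim_listing (
    listing_sk   BIGSERIAL PRIMARY KEY,  -- surrogate key, one per version
    listing_id   VARCHAR(16) NOT NULL,   -- durable natural key (ListingID / ML Number)
    last_status  VARCHAR(32),
    list_price   NUMERIC(12,2),
    sold_price   NUMERIC(12,2),
    updated_ts   DATE,                   -- the external "Updated Timestamp" field
    -- ... remaining business columns (sold date, original price, postal code, ...)
    valid_from   TIMESTAMP NOT NULL,
    valid_to     TIMESTAMP,              -- NULL while the row is the current version
    is_current   BOOLEAN NOT NULL DEFAULT TRUE
);

-- Type-1-style view over the same table: only the current version of each listing.
CREATE VIEW dim_listing_current AS
SELECT * FROM dim_listing WHERE is_current;
```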

In the attached SCD workflow, I have a list of Scenarios on the left side from which I copy rows into the “Incremental Reporting” Table Creator (this mimics incremental CSVs or data pulls that may or may not contain duplicate/redundant rows).

For my scenario tests, I copy one row at a time and evaluate the effect. When a scenario fails, I copy the workflow, make changes until it passes, then copy a fresh workflow below it and proceed down the scenarios until the next failure.

I believe the new SCD component passes scenarios 1-7 but fails on scenario 8 (i.e., what happens when we need to load an initial, older SCD dataset into this new system).

Appreciate any support on evaluating this updated SCD component.

Scenario Test Table: (current component passes scenarios 1-7)

| Scenario | Sold Date | List Price | Last Status | ListingID | Original Price | Sold Price | Updated Time Stamp | Postal Code |
|---|---|---|---|---|---|---|---|---|
| 1: First row | | 2160000.0 | New | Z1033099 | 1699000.0 | | 2023-03-12 | M6H1N7 |
| 2: Unsold (duplicate), repeat insert | | 2160000.0 | New | Z1033099 | 1699000.0 | | 2023-03-12 | M6H1N7 |
| 3: Unsold, Updated Timestamp | | 2160000.0 | New | Z1033099 | 1699000.0 | | 2023-03-13 | M6H1N7 |
| 4: Unsold, Updated Timestamp, Price Changed | | 2200000 | PriceChange | Z1033099 | 1699000.0 | | 2023-03-14 | M6H1N7 |
| 5: Sold | 2023-03-15 | 2200000 | Sold | Z1033099 | 1699000.0 | 2200000.0 | 2023-03-15 | M6H1N7 |
| 6: Sold Duplicate | 2023-03-15 | 2200000 | Sold | Z1033099 | 1699000.0 | 2200000.0 | 2023-03-15 | M6H1N7 |
| 7: Sold Duplicate, Date Change (e.g., some irrelevant / nonvisible field changed) | 2023-03-15 | 2200000 | Sold | Z1033099 | 1699000.0 | 2200000.0 | 2023-03-16 | M6H1N7 |
| 8: All Scenarios Imported at Once | 2023-03-15 | 2200000 | Sold | Z1033099 | 1699000.0 | 2200000.0 | 2023-03-16 | M6H1N7 |

There is a potential scenario 9: if columns XYZ did not change but the updated time stamp (an external field) did change, then just update the “valid/current row” with the new updated time stamp. Extending the component to cover this scenario would further reduce the overall SCD table size when redundant updates are made.
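In rough SQL terms, the scenario 9 rule would look something like the sketch below, applied before the normal expire-and-insert path. The table and column names are hypothetical (continuing the dim_listing sketch above, with staging_listing standing in for the incoming incremental rows); this is only an illustration of the intent, not the component’s actual implementation.

```sql
-- Scenario 9 sketch: a redundant update only moves the timestamp forward
-- on the current row instead of creating a new version.
UPDATE dim_listing d
SET    updated_ts = s.updated_ts
FROM   staging_listing s                      -- hypothetical staging table of incoming rows
WHERE  d.listing_id = s.listing_id
  AND  d.is_current
  AND  s.updated_ts > d.updated_ts
  -- business-relevant columns unchanged -> no new version needed
  AND  d.last_status IS NOT DISTINCT FROM s.last_status
  AND  d.list_price  IS NOT DISTINCT FROM s.list_price
  AND  d.sold_price  IS NOT DISTINCT FROM s.sold_price;
```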

Expected SCD Final State from Scenario 1-8:

| Sold Date | List Price | Last Status | ML Number | Original Price | Sold Price | Updated Timestamp | Postal Code |
|---|---|---|---|---|---|---|---|
| | 2160000.0 | New | Z1033099 | 1699000.0 | | 2023-03-12 | M6H1N7 |
| | 2160000.0 | New | Z1033099 | 1699000.0 | | 2023-03-13 | M6H1N7 |
| | 2200000 | PriceChange | Z1033099 | 1699000.0 | | 2023-03-14 | M6H1N7 |
| 2023-03-15 | 2200000 | Sold | Z1033099 | 1699000.0 | 2200000.0 | 2023-03-15 | M6H1N7 |
| 2023-03-15 | 2200000 | Sold | Z1033099 | 1699000.0 | 2200000.0 | 2023-03-16 | M6H1N7 |

Expected with 9th Bonus Scenario:

| Sold Date | List Price | Last Status | ML Number | Original Price | Sold Price | Updated Timestamp | Postal Code |
|---|---|---|---|---|---|---|---|
| | 2160000.0 | New | Z1033099 | 1699000.0 | | 2023-03-13 | M6H1N7 |
| | 2200000 | PriceChange | Z1033099 | 1699000.0 | | 2023-03-14 | M6H1N7 |
| 2023-03-15 | 2200000 | Sold | Z1033099 | 1699000.0 | 2200000.0 | 2023-03-16 | M6H1N7 |

Fine-tuned to handle Scenario 9 and published to KNIME Hub:

Scenario 8 (multiple rows at once for a single ML Number) potentially now passes as well, but I have to test further to be sure.

So overall the updated logic does seem to allow all history for a given ID to be imported, while at the same time selecting which history to keep (so that redundant changes can be ignored).

This should be tested further with multiple IDs and change scenarios. The filter rules could also be rationalized further.
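To make the “select which history to keep” idea concrete, the filter can be sketched in SQL roughly as follows (same hypothetical staging table as above; the component expresses this with KNIME filter rules rather than SQL): order each ID’s rows by the updated timestamp and keep only those whose business-relevant columns differ from the immediately preceding row, so redundant updates are dropped before loading.

```sql
-- Sketch: collapse redundant rows before a bulk (scenario 8 style) load.
SELECT *
FROM (
    SELECT s.*,
           LAG(updated_ts)  OVER w AS prev_ts,
           LAG(last_status) OVER w AS prev_status,
           LAG(list_price)  OVER w AS prev_list_price,
           LAG(sold_price)  OVER w AS prev_sold_price
    FROM   staging_listing s
    WINDOW w AS (PARTITION BY listing_id ORDER BY updated_ts)
) t
WHERE prev_ts IS NULL                                   -- first row for this listing
   OR last_status IS DISTINCT FROM prev_status
   OR list_price  IS DISTINCT FROM prev_list_price
   OR sold_price  IS DISTINCT FROM prev_sold_price;
```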

So I’m trying this on a full data set (729,000 rows, 330 columns), and the two DB Merge nodes within the component are still running after 24+ hours; one of them has reached just 29%.

Curious whether this is normal for an initial load or just the nature of the SCD Type 7 component.

Still running.

@szawadski - just saw your discussion here; I think we have an identical use case and similar DB size concerns. I’m curious to know if you ended up solving it.


It seems that DB Merge is susceptible to “deadlock”, which I believe is what was causing the node to run for days even for just 300,000 rows. I made some changes, swapped out the DB Merge for Update and Insert operations, and improved the logic further.
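In plain SQL terms, the replacement pattern is roughly the two statements below (hypothetical names carried over from the earlier sketches, and assuming one incoming row per ID per run). The component itself does this with DB Update and DB Insert nodes rather than hand-written SQL.

```sql
-- Step 1: expire the current version if the incoming row differs on business columns.
UPDATE dim_listing d
SET    valid_to   = s.updated_ts,
       is_current = FALSE
FROM   staging_listing s
WHERE  d.listing_id = s.listing_id
  AND  d.is_current
  AND (d.last_status IS DISTINCT FROM s.last_status
    OR d.list_price  IS DISTINCT FROM s.list_price
    OR d.sold_price  IS DISTINCT FROM s.sold_price);

-- Step 2: insert a new current version for IDs with no current row
-- (brand-new listings or rows just expired above); unchanged rows are skipped.
INSERT INTO dim_listing
       (listing_id, last_status, list_price, sold_price, updated_ts,
        valid_from, valid_to, is_current)
SELECT s.listing_id, s.last_status, s.list_price, s.sold_price, s.updated_ts,
       s.updated_ts, NULL, TRUE
FROM   staging_listing s
WHERE  NOT EXISTS (
         SELECT 1 FROM dim_listing d
         WHERE  d.listing_id = s.listing_id AND d.is_current);
```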

The working-ish version is here; it updates or inserts a new row in SCD Type 7 fashion, with an option to define which headers determine insertion vs. update. Work still needs to be done to:

  1. improve speed (DB Update runs slowly, taking minutes even when processing just 1,000 rows)

  2. fix the Valid To logic for false rows (while each event is correctly marked True/False in the Valid field, the Valid To of previous False rows seems to get overwritten with the most recent value)

  3. make this more generic, with an option to parameterize the business-relevant headers

Hello @qdmt ,
regarding the speed improvement, did you create an index for the columns you select as identification columns in the DB Merge node? If not, the database will do a full table scan for each row you process, which might take ages depending on the size of the database table you want to update.
If the columns are unique, which I guess is the case, you can create a unique index via the “Keys” tab of the DB Table Creator node, where you can define a primary or unique key.
Bye
Tobias
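To illustrate the suggestion in SQL terms (hypothetical table and column names from the sketches above; in KNIME this would be configured via the DB Table Creator’s “Keys” tab rather than written by hand):

```sql
-- Index the identification columns used by DB Merge / DB Update,
-- so each lookup is an index seek instead of a full table scan.
CREATE INDEX idx_dim_listing_id ON dim_listing (listing_id, updated_ts);

-- If the identification columns uniquely identify a row, a unique index
-- (or primary key) gives the optimizer even more to work with:
-- CREATE UNIQUE INDEX uq_dim_listing ON dim_listing (listing_id, updated_ts);
```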


Hello @tobias.koetter, thank you. Indexing was not the driver of the speed improvement when switching from DB Merge to Update/Insert, as I hadn’t indexed in either case. I am going to try indexing next, as well as creating a staging table.

Broader discussions on issues with MERGE (here and here) had me shy away from DB Merge in favour of DB Update/Insert (though I admit I am not qualified to say whether those issues apply to KNIME’s implementation of the statement).

Hi @qdmt ,
the implementation depends on whether the database in use supports a merge statement. The fallback solution in our code is likewise to use Update/Insert.
Bye
Tobias
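To illustrate the point: where the database supports it, a merge-based upsert keyed on the identification columns looks roughly like the sketch below (generic, PostgreSQL-15-style syntax with the hypothetical names used earlier; KNIME’s actually generated statement will differ by dialect, and this shows a plain upsert rather than the full SCD versioning). Where MERGE is not available, the equivalent is the separate Update/Insert statements shown further up the thread.

```sql
-- Sketch of a merge-style upsert on the identification column(s).
MERGE INTO dim_listing AS d
USING staging_listing AS s
   ON d.listing_id = s.listing_id AND d.is_current
WHEN MATCHED THEN
    UPDATE SET updated_ts = s.updated_ts
WHEN NOT MATCHED THEN
    INSERT (listing_id, last_status, list_price, sold_price,
            updated_ts, valid_from, is_current)
    VALUES (s.listing_id, s.last_status, s.list_price, s.sold_price,
            s.updated_ts, s.updated_ts, TRUE);
```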

