Type 7 Slowly Changing Dimension Tables ("SCD")

SCD.knwf (950.0 KB)

I thought I’d take a stab at a more complex, Type 7 SCD problem using @stelfrich’s great component (found here) as a base. I made some updates to cover Type 7 more fully, but could use a hand from the community to make this a bit more robust.
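For anyone new to the pattern: a Type 7 dimension keeps full Type 2 history (one row per version, with validity dates and a current flag) alongside a durable natural key, so downstream queries can join either to the attributes in effect at the time or, via a “current” view, to the latest attributes. A rough SQL sketch of such a table, with hypothetical names loosely based on the listing fields below (the actual workflow builds this with KNIME DB nodes rather than hand-written DDL):

```sql
-- Rough sketch of a Type 7 dimension (hypothetical names, PostgreSQL-style DDL).
CREATE TABLE dim_listing (
    listing_sk   BIGSERIAL PRIMARY KEY,  -- surrogate key, one per version
    listing_id   VARCHAR(16) NOT NULL,   -- durable natural key (ListingID / ML Number)
    last_status  VARCHAR(32),
    list_price   NUMERIC(12,2),
    sold_price   NUMERIC(12,2),
    updated_ts   DATE,                   -- the external "Updated Timestamp" field
    -- ... remaining business columns (sold date, original price, postal code, ...)
    valid_from   TIMESTAMP NOT NULL,
    valid_to     TIMESTAMP,              -- NULL while the row is the current version
    is_current   BOOLEAN NOT NULL DEFAULT TRUE
);

-- Type-1-style view over the same table: only the current version of each listing.
CREATE VIEW dim_listing_current AS
SELECT * FROM dim_listing WHERE is_current;
```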

In the attached SCD workflow, I have a list of Scenarios on the left side from which I copy rows into the “Incremental Reporting” Table Creator (this mimics incremental CSVs or data pulls that may or may not contain duplicate/redundant rows).

For my scenario tests, I copy one row at a time and evaluate the effect. When a scenario fails, I copy the workflow, make changes until it passes, then copy a fresh workflow below it and proceed down the scenarios until the next failure.

I believe the new SCD component passes scenarios 1-7 but fails on scenario 8 (i.e., what happens when we need to load an initial, older SCD dataset into this new system).

Appreciate any support on evaluating this updated SCD component.

Scenario Test Table: (current component passes scenarios 1-7)

| Scenario | Sold Date | List Price | Last Status | ListingID | Original Price | Sold Price | Updated Time Stamp | Postal Code |
|---|---|---|---|---|---|---|---|---|
| 1: First row | | 2160000.0 | New | Z1033099 | 1699000.0 | | 2023-03-12 | M6H1N7 |
| 2: Unsold (duplicate), repeat insert | | 2160000.0 | New | Z1033099 | 1699000.0 | | 2023-03-12 | M6H1N7 |
| 3: Unsold, Updated Timestamp | | 2160000.0 | New | Z1033099 | 1699000.0 | | 2023-03-13 | M6H1N7 |
| 4: Unsold, Updated Timestamp, Price Changed | | 2200000 | PriceChange | Z1033099 | 1699000.0 | | 2023-03-14 | M6H1N7 |
| 5: Sold | 2023-03-15 | 2200000 | Sold | Z1033099 | 1699000.0 | 2200000.0 | 2023-03-15 | M6H1N7 |
| 6: Sold Duplicate | 2023-03-15 | 2200000 | Sold | Z1033099 | 1699000.0 | 2200000.0 | 2023-03-15 | M6H1N7 |
| 7: Sold Duplicate, Date Change (e.g., some irrelevant / nonvisible field changed) | 2023-03-15 | 2200000 | Sold | Z1033099 | 1699000.0 | 2200000.0 | 2023-03-16 | M6H1N7 |
| 8: All Scenarios Imported at Once | 2023-03-15 | 2200000 | Sold | Z1033099 | 1699000.0 | 2200000.0 | 2023-03-16 | M6H1N7 |

There is a potential scenario 9: if columns XYZ did not change but the updated time stamp (an external field) did change, then just update the “valid/current row” with the new updated time stamp. Extending the component to cover this scenario would further reduce the overall SCD table size when redundant updates are made.
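In rough SQL terms, the scenario 9 rule would look something like the sketch below, applied before the normal expire-and-insert path. The table and column names are hypothetical (continuing the dim_listing sketch above, with staging_listing standing in for the incoming incremental rows); this is only an illustration of the intent, not the component’s actual implementation.

```sql
-- Scenario 9 sketch: a redundant update only moves the timestamp forward
-- on the current row instead of creating a new version.
UPDATE dim_listing d
SET    updated_ts = s.updated_ts
FROM   staging_listing s                      -- hypothetical staging table of incoming rows
WHERE  d.listing_id = s.listing_id
  AND  d.is_current
  AND  s.updated_ts > d.updated_ts
  -- business-relevant columns unchanged -> no new version needed
  AND  d.last_status IS NOT DISTINCT FROM s.last_status
  AND  d.list_price  IS NOT DISTINCT FROM s.list_price
  AND  d.sold_price  IS NOT DISTINCT FROM s.sold_price;
```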

Expected SCD Final State from Scenario 1-8:

| Sold Date | List Price | Last Status | ML Number | Original Price | Sold Price | Updated Timestamp | Postal Code |
|---|---|---|---|---|---|---|---|
| | 2160000.0 | New | Z1033099 | 1699000.0 | | 2023-03-12 | M6H1N7 |
| | 2160000.0 | New | Z1033099 | 1699000.0 | | 2023-03-13 | M6H1N7 |
| | 2200000 | PriceChange | Z1033099 | 1699000.0 | | 2023-03-14 | M6H1N7 |
| 2023-03-15 | 2200000 | Sold | Z1033099 | 1699000.0 | 2200000.0 | 2023-03-15 | M6H1N7 |
| 2023-03-15 | 2200000 | Sold | Z1033099 | 1699000.0 | 2200000.0 | 2023-03-16 | M6H1N7 |

Expected with 9th Bonus Scenario:

| Sold Date | List Price | Last Status | ML Number | Original Price | Sold Price | Updated Timestamp | Postal Code |
|---|---|---|---|---|---|---|---|
| | 2160000.0 | New | Z1033099 | 1699000.0 | | 2023-03-13 | M6H1N7 |
| | 2200000 | PriceChange | Z1033099 | 1699000.0 | | 2023-03-14 | M6H1N7 |
| 2023-03-15 | 2200000 | Sold | Z1033099 | 1699000.0 | 2200000.0 | 2023-03-16 | M6H1N7 |

Fine-tuned to handle Scenario 9 and published to KNIME Hub:

Scenario 8 (multiple rows at once for a single ML Number) potentially now passes as well, but I have to test further to be sure.

So overall the updated logic does seem to allow all history for a given ID to be imported, while at the same time selecting which history to keep (so that redundant changes can be ignored).

This should be tested further with multiple IDs and change scenarios. The filter rules could also be rationalized further.
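To make the “select which history to keep” idea concrete, the filter can be sketched in SQL roughly as follows (same hypothetical staging table as above; the component expresses this with KNIME filter rules rather than SQL): order each ID’s rows by the updated timestamp and keep only those whose business-relevant columns differ from the immediately preceding row, so redundant updates are dropped before loading.

```sql
-- Sketch: collapse redundant rows before a bulk (scenario 8 style) load.
SELECT *
FROM (
    SELECT s.*,
           LAG(updated_ts)  OVER w AS prev_ts,
           LAG(last_status) OVER w AS prev_status,
           LAG(list_price)  OVER w AS prev_list_price,
           LAG(sold_price)  OVER w AS prev_sold_price
    FROM   staging_listing s
    WINDOW w AS (PARTITION BY listing_id ORDER BY updated_ts)
) t
WHERE prev_ts IS NULL                                   -- first row for this listing
   OR last_status IS DISTINCT FROM prev_status
   OR list_price  IS DISTINCT FROM prev_list_price
   OR sold_price  IS DISTINCT FROM prev_sold_price;
```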

So I’m trying this on a full data set (729,000 rows, 330 columns), and the two DB Merge nodes within the component are still running after 24+ hours; one of them has reached just 29%.

Curious whether this is normal for an initial load or just the nature of the SCD Type 7 component.

Still running.

@szawadski - just saw your discussion here; I think we have an identical use case and similar DB size concerns. I’m curious to know if you ended up solving it.


It seems that DB Merge is susceptible to “deadlock”, which I believe is what was causing the node to run for days even for just 300,000 rows. I made some changes, swapped out the DB Merge for Update and Insert operations, and improved the logic further.
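In plain SQL terms, the replacement pattern is roughly the two statements below (hypothetical names carried over from the earlier sketches, and assuming one incoming row per ID per run). The component itself does this with DB Update and DB Insert nodes rather than hand-written SQL.

```sql
-- Step 1: expire the current version if the incoming row differs on business columns.
UPDATE dim_listing d
SET    valid_to   = s.updated_ts,
       is_current = FALSE
FROM   staging_listing s
WHERE  d.listing_id = s.listing_id
  AND  d.is_current
  AND (d.last_status IS DISTINCT FROM s.last_status
    OR d.list_price  IS DISTINCT FROM s.list_price
    OR d.sold_price  IS DISTINCT FROM s.sold_price);

-- Step 2: insert a new current version for IDs with no current row
-- (brand-new listings or rows just expired above); unchanged rows are skipped.
INSERT INTO dim_listing
       (listing_id, last_status, list_price, sold_price, updated_ts,
        valid_from, valid_to, is_current)
SELECT s.listing_id, s.last_status, s.list_price, s.sold_price, s.updated_ts,
       s.updated_ts, NULL, TRUE
FROM   staging_listing s
WHERE  NOT EXISTS (
         SELECT 1 FROM dim_listing d
         WHERE  d.listing_id = s.listing_id AND d.is_current);
```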

The working-ish version is here; it updates or inserts a new row in SCD Type 7 fashion, with an option to define which headers determine insertion vs. update. Work still needs to be done to:

  1. improve speed (DB Update runs slowly, taking minutes even when processing just 1,000 rows)

  2. fix the Valid To logic for false rows (while each event is correctly marked True/False in the Valid field, the Valid To of previous False rows seems to get overwritten with the most recent value)

  3. make this more generic, with an option to parameterize the business-relevant headers

Hello @qdmt ,
regarding the speed improvement, did you create an index for the columns you select as identification columns in the DB Merge node? If not, the database will do a full table scan for each row you process, which might take ages depending on the size of the database table you want to update.
If the columns are unique, which I guess is the case, you can create a unique index via the “Keys” tab of the DB Table Creator node, where you can define a primary or unique key.
Bye
Tobias
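To illustrate the suggestion in SQL terms (hypothetical table and column names from the sketches above; in KNIME this would be configured via the DB Table Creator’s “Keys” tab rather than written by hand):

```sql
-- Index the identification columns used by DB Merge / DB Update,
-- so each lookup is an index seek instead of a full table scan.
CREATE INDEX idx_dim_listing_id ON dim_listing (listing_id, updated_ts);

-- If the identification columns uniquely identify a row, a unique index
-- (or primary key) gives the optimizer even more to work with:
-- CREATE UNIQUE INDEX uq_dim_listing ON dim_listing (listing_id, updated_ts);
```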


Hello @tobias.koetter, thank you. Indexing was not the driver of the speed improvement when switching from DB Merge to Update/Insert, as I hadn’t indexed in either case. I am going to try indexing next, as well as creating a staging table.

Broader discussions on issues with MERGE (here and here) had me shy away from DB Merge in favour of DB Update/Insert (though I admit I am not qualified to say whether those issues apply to KNIME’s implementation of the statement).

Hi @qdmt ,
the implementation depends on whether the database in use supports a merge statement. The fallback solution in our code is likewise to use Update/Insert.
Bye
Tobias
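To illustrate the point: where the database supports it, a merge-based upsert keyed on the identification columns looks roughly like the sketch below (generic, PostgreSQL-15-style syntax with the hypothetical names used earlier; KNIME’s actually generated statement will differ by dialect, and this shows a plain upsert rather than the full SCD versioning). Where MERGE is not available, the equivalent is the separate Update/Insert statements shown further up the thread.

```sql
-- Sketch of a merge-style upsert on the identification column(s).
MERGE INTO dim_listing AS d
USING staging_listing AS s
   ON d.listing_id = s.listing_id AND d.is_current
WHEN MATCHED THEN
    UPDATE SET updated_ts = s.updated_ts
WHEN NOT MATCHED THEN
    INSERT (listing_id, last_status, list_price, sold_price,
            updated_ts, valid_from, is_current)
    VALUES (s.listing_id, s.last_status, s.list_price, s.sold_price,
            s.updated_ts, s.updated_ts, TRUE);
```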

