Synapse PySpark Delta Lake MERGE: SCD Type 2 without a primary key

I have a set of rows coming from a previous process that has no primary key, and the candidate composite keys are liable to change, which makes them a poor choice. The only way a row is unique is the whole row (all keys and all values together). I need to implement SCD Type 2 on this data. The environment is Synapse PySpark, using the Delta Lake MERGE command.

What I tried
Using a row hash: without a primary or composite key, the challenge is detecting which rows have changed. Because any updated value changes the row hash, every update looks like a brand-new row rather than a new version of an existing one.
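To make the problem concrete, here is a minimal pure-Python sketch of the row-hash approach (plain `hashlib` instead of Spark, with made-up column names): one changed value yields a completely different hash, so an update is indistinguishable from a new row.

```python
import hashlib

def row_hash(row: dict) -> str:
    # Concatenate every column value in a fixed column order,
    # so the hash covers the whole row (keys and values alike).
    payload = "|".join(str(row[c]) for c in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical rows: the same entity before and after an update.
old = {"name": "Alice", "city": "Berlin", "amount": 100}
new = {"name": "Alice", "city": "Berlin", "amount": 120}

# Any changed value produces a different hash, so the hash alone
# cannot tell "updated row" apart from "new row".
assert row_hash(old) != row_hash(new)
```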

Please suggest how this problem can be solved. If you have any questions, please write back.

@jdisunilkumar welcome to the KNIME forum. You could try using some sort of similarity search on the whole row entry.

If the dimensions are indeed slowly changing, this might work. You could also introduce an artificial ID based on such a match to make further use easier.

This will depend on how many columns there are, what kind of data they contain, and how that data would typically change. For example, you might use just the string values if you suspect them to form an identity.
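As a sketch of that idea (plain Python with hypothetical column names; the assumption is that the string columns form a stable identity), an artificial key hashed from only the string columns survives changes in the numeric values:

```python
import hashlib

def surrogate_key(row: dict) -> str:
    # Assumed heuristic: only the string columns identify the entity,
    # so hash just those to build an artificial ID.
    stable = {c: v for c, v in row.items() if isinstance(v, str)}
    payload = "|".join(f"{c}={stable[c]}" for c in sorted(stable))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

v1 = {"name": "Alice", "city": "Berlin", "amount": 100}
v2 = {"name": "Alice", "city": "Berlin", "amount": 120}

# The numeric change no longer alters the key, so v2 can be treated
# as an updated version of v1 in an SCD Type 2 merge.
assert surrogate_key(v1) == surrogate_key(v2)
```

The same pattern can be moved into PySpark by concatenating the chosen columns and hashing them, then using that key in the MERGE condition.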

Where this would fail is with things like addresses: if a customer changed their address, the row would be completely different but still represent the same customer. In such a case you could try matching on just the numeric columns; the customer's portfolio or purchases might not have changed.

And as a remark: database systems without proper key values are … not a good idea. But I assume you already know that.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.