ETL Subsystem 9: Slowly Changing Dimensions (SCD)
Source: Tod means Fox [link]
This article is part of a series discussing the Kimball Group’s “34 Subsystems of ETL“. The Subsystems are a group of “Best Practices” for delivering a BI/DW solution. In my articles, I discuss how each Subsystem can be implemented in SSIS or hand coded in Visual FoxPro.
To this point, I’ve discussed “Data Extraction” and “Cleansing and Conforming” ETL Subsystems. Now, I’ll shift attention to data delivery (the “L” in ETL). As I have already discussed Slowly Changing Dimensions in a previous post, I’ll only highlight some of the points I think are most important here. So please, if you haven’t already, read my previous post “Dimensional Modeling: Loading the Slowly Changing Dimension” before you continue.
Some important takeaways:
- Dimensions change over time. We can identify how to process these changes by using SCD Types 1, 2, and 3 techniques.
- Dimensions (especially large ones) sometimes need a hybrid approach. This approach is sometimes referred to as a Type 1.5 SCD, which allow a dimension to be added with incomplete data. Updates are processed later as if the dimension was a Type 1 (empty and incomplete attributes are overwritten instead of retaining history) until the entire row can be labelled as “complete”. To indicate this, I have found including a “complete” flag in each dimension is a clear way to mark Type 1.5 dimension rows as complete, or in need of additional data.
- Like the Type 1.5, sometimes a dimension row must be added with no details. These are called Inferred Members. In this scenario, just a natural key and the system-generated surrogate key are added. The dimension is then updated much like the Type 1.5 SCD until its data is complete.
- For Type 1.5 and Type 2 SCDs to work, the row needs some sort of status indicator, the best being active “to” and “from” dates.
SQL Server 2005 Integration Services (SSIS)
I already mentioned in my previous post how SSIS’s Slowly Changing Dimension Task can be used to meet the requirements of this subsystem. Again, my only word of advice is to be careful about the amount of data you plan on passing through the generated SCD Tasks. I have seen performance degrade substantially on large datasets. It is best, in these scenarios, to introduce additional logic before the SCD logic to filter out inserts, updates, and deletes into different trees.
Also, it is a good idea to fully understand the different SCD types (inferred dimensions as well) before running through the wizard. Not only will the entire process be clearer, but you will better understand how to optimize the generated data flow when all is created.
Hand Coding with Visual FoxPro (VFP9)
I’ve had great success using FoxPro for hand-coded dimensional modeling projects. My first stab at loading Slowly Changing Dimensions was in 1998, using VFP6. Using a database trigger, I wrote code to update a denormalized data model that was used for reporting and ad-hoc queries. At the time, I had no idea what a SCD was, but somehow, instinctively I implemented it pretty close to the tried-and-true Kimball methods. That system is still in use today (the app is in VFP9 now) and I have redesigned it to match the Kimbal SCD methodology much closer than I originally conceived it.
If you are involved in a similar VFP project, or are at all interested in learning more about how VFP can be used in this way, please do contact me. I’m pretty passionate about VFP and Dimensional Modeling, so I’d love to hear from you!