Type 6 Slowly Changing Dimensions Example

Overview

Here we outline Matillion's approach to maintaining slowly changing dimension data using Matillion ETL by way of an extended example.

Often than using Matillion ETL, a customer requires the ability to  maintain Slowly Changing Dimensions (SCD) using Matillion ETL, in particular reference to Type 6/Hybrid SCD (further discussion by Mr Kimball here).

SCD is becoming a more common customer requirement as their data-warehousing project reaches greater levels of maturity. It allows visibility of the transition of data over time in a data warehouse and not just offering a snapshot of the current data image.

This document demonstrates a simple, repeatable and scalable approach to storing SCD data via Matillion ETL.
 

Getting Started with the example

The attached export illustrates a supplier dimension (supplier code, name, current state (location) and historical state), where supplier name and state location can be tracked over time and stored with unique keys, validity date ranges and an easy reference current record flag.

To import the Jobs you will need an environment connected to a blank schema then just use Project → Import Job and select the file.

The process is split into 2 sections, 'Detect Changes' and 'Transact Changes', detailed below.
 

Detect Changes

This job reads the data from the SCD and compares the active records against the changing data source based on a key field, in this case supplier code. It then processes new or changed records and writes the required data ready for inserting/updating the dimension.


Transact Changes

This job picks up the data collated in the Detect Changes job and writes new dimension records via a standard table output with the strategy set to append and updates existing SCD records via a table update component with the strategy set to Update/Insert.


Running through the SCD example job.

The job uses a variable to represent “now” so we can see how the job would react over time if it was run daily (clearly this could be run more often with a few tweaks) The data position starts at ‘day zero’ whereby no SCD data has been has been collated i.e. the supplier dimension database table is empty.

Day 1 ~ Initial Population

Set the parameter to today’s date - (YYYY-MM-DD)

In the SCD Process Changes job view the fixed input data for the 2 suppliers.

 

This is the initial data for the SCD and if set up as above would result in the target dimension table having open ended records for Acme Inc and Coyote Supplies in the respective states from 2016-04-29 to 2099-01-01.

 

On Day 1 the SCD Process Changes job will digest the fixed flow data, find no related SCD data stored and treat the 2 supplier records as ‘new data’ to be processed and uploaded into the target database.

To publish this data run the SCD Automator (on the initial run the supplier_location target table will be created).

 

Day 2 ~ No changes:

Set the parameter to the following days date and make no other changes.

On Day 2 the SCD Detect Changes job will digest the unchanged fixed flow data find no updates and no new records and have nothing output to process in Transact Changes.

Run the SCD Automator.

 

Day 5 ~ Coyote moves

Set the run date parameter to be 4 days from Day 1, and change the fixed flow data to move Coyote Supplies to another state code.

On Day 5 the SCD Detect Changes job will track the change from TX to CL for Coyote supplies and find no change for Acme Inc.  The Transact Changes  job will ‘close out’ the initial Coyote record for TX location (start date 2016-04-29 end date 2016-05-03), and set the current flag of that record to be ‘N’, create a new open dated record for Coyote with CL as the current state and TX as the historical state, dates will be (2016-05-03 to 2099-01-01) with the current flag set to Y.  The Acme record will remain unchanged.

Run the SCD Automator.

 

Day 7 ~ Both move to Texas

Set the run date parameter to be 7 days from Day 1, and change the fixed flow data to move Coyote Supplies to another state code. Change the fixed flow data for the supplier state as below.

On Day 7 the SCD Detect Changes job will track the change of location for both suppliers, the target table will now have 5 records in total with the initial Acme record being closed out and a new open ended record created, similarly the 2nd Coyote record will be closed out and a 3rd record created. The 3 Coyote records will provide a contiguous history from Day 1 showing the moving of states and the prior state for each record.​

Run the SCD Automator.

 

Day 10 ~ Coyote renamed

Set the run date parameter to be 10 days from Day 1, and change the fixed flow data to change the Coyote company name to Coyote Supplies Inc. Change the fixed flow data for the supplier state as below.

On Day 10 the SCD Detect Changes job will not track any Acme changes but will create a new record for Coyote Supplies Inc (same company code COYO) and close out the previous open ended record, the state location data will remain unchanged.  This shows that you can track changes in different ways i.e. you don’t have to maintain a ‘current’ field alongside a historical field.​

Run the SCD Automator.

The final output of the dimensions should look something like this.


Further Tests

You can add other suppliers during this process and in doing so the data model would treat those suppliers as starting on the run date used for that particular run.

If desired the ETL could be extended to reflect a supplier ceasing to exist i.e. if they disappeared from the periodical flow the the detect changes component would flag the record with an Indicator of ‘D’ this could be used to close out the record to the date the company code was no longer present in the fixed flow.

Matillion ETL offers the flexibility to track as many changes as you wish for a dimension and to build business logic to cater for almost any eventuality.

The example jobs can be downloaded here.