Manage CDC

Overview

Note: This feature is currently only available to AWS-hosted Matillion ETL instances

CDC (Change Data Capture) allows users to synchronise their RedshiftSnowflake data with their own database, ensuring the former is continually and automatically kept up to date with the latter. Note that this will incur some additional costs; namely for a DMS (Database Migration Service) Replication Instance and a Lambda Function that are required for the CDC process.

CDC in Matillion ETL works by DMS checking your source database for changes and recording those changes on an S3 bucket. This S3 then triggers an event that causes Matillion ETL to pull data from the bucket and update tables on RedshiftSnowflake accordingly. Thus, any changes made to the source database will be reflected in their respective RedshiftSnowflake tables. All changes are incremental with a default of 5 minutes between updates.

To access the Manage CDC Panel in Matillion ETL, browse to Project → Manage CDC.

This should open a CDC tab in the main panel of the client. Any existing CDC Tasks are listed here along with their Identifier, Status and Source. These details will not updated automatically and must be refreshed using the Refresh button below. All CDC Tasks and configurations are Project-specific.

For a CDC Task to be fully functional, it requires:

Inside AWS Console:

  1. The source database to be properly configured. Source tables must have primary keys.
  2. A DMS Replication instance to be set up by the user. This is used to listen to the source database for changes and write them to S3.
  3. A FIFO Queue to be set up by the user that is not being used for any other purpose than CDC Tasks on this specific Matillion ETL Project.
  4. An S3 bucket 
  5. Instance Credentials, Lambda Role ARN and S3 Role ARN giving correct permissions to be set up by the user. (See sections: Manage CDC Credentials, Lambda Role ARN, S3 Role ARN) A full list of all permissions required can be found on the IAM Roles & Permissions (AWS) article.
Inside the Matillion ETL client:
  1. CDC Configuration to be correctly Managed in the Matillion ETL client. (See section: New CDC Configurations)
  2. A new CDC Task to be correctly set up in the Matillion ETL  (See section: Adding a new CDC Task)


The CDC Tasks work differently depending on whether it is the initial run or not, illustrated by the below diagrams.
 

Initial Run

On the CDC Tasks's initial run, the entire table is loaded into S3 which Matillion ETL then gathers the table structure for and uses it to construct a RedshiftSnowflake table.

 

Subsequent Runs (incremental load)

After its initial run, CDC continues to write any changes to the source database into S3 which a lambda function writes back out with line numbers. Matillion ETL periodically checks the S3 Out bucket for new files and merges them into the target table(s).



New CDC Configurations

Before creating a new CDC Task, you must enter a CDC Configuration that tells this Matillion ETL Project which credentials to use to listen for incoming data. The + button to add new Tasks will be unavailable until this step is completed.

Clicking Manage will open a menu that allows the user to edit the CDC Configuration.

  • Credentials: The AWS Credentials used for various AWS resources including; Lambda Functions, listing Replication Instances, stopping/starting DMS Tasks and S3 Buckets and deleting resources when removing a task. Clicking Manage will allow you to create or edit user-defined credentials. More information on these credentials can be found in the Manage CDC Credentials section below.
  • Region: The AWS Region that contains the FIFO queue you wish to use.
  • Listen Queue: The FIFO Queue to be used to listen for incoming changes.
    Warning: This FIFO Queue MUST only be used for CDC on this specific Project and nothing else.
  • Lambda Role ARN: The ARN of an existing IAM role to use as the AWS Lambda Execution Role. Ensure it has read/write access to the S3 buckets you wish to use for CDC. This Role must have been set up by the user in advance. See the Lambda Role ARN section below for more details.

 

New CDC Tasks

After completing your CDC Configuration, clicking + will let you set up a new CDC Task. CDC Tasks will cause a number of job queues equal to the number of tables being updated. Care should be taken when keeping track of large numbers of tables as it may render the Matillion ETL instance.



1. Task Configuration

This first page sets up your DMS to point at your existing database. Currently we support MYSQL and Oracle databases.
  • Task Name: An arbitrary name given to your new CDC Task. This name will be used for the newly-created Job in the current Project as well as the name of the DMS Task.
  • Replication Instance: Select the name of your Replication Instance. This must be associated with the credentials used in your CDC Configuration.


2. Source Endpoint Configuration

  • Endpoint Name: An arbitrary name given to your new DMS Endpoint that points to your Source Database.
  • Database Type: The type of database to be used as a data source.
  • Host Address: The address of your source database.
  • Port: The port number for connecting to your source database. Not always required.
  • Database: The name of the database to connect to.
  • SSL Mode: The process of verification for your SSL connection. verify-ca will verify that the server is trustworthy by checking the certificate chain up to a trusted certificate authority (CA). verify-full will also verify that the server host name matches its certificate. The SSL connection will fail if the server certificate cannot be verified. verify-full is recommended in most cases.
  • Username: The username to log into your source database. Not always required.
  • Password: The password to log into your source database. Not always required.

Scrolling down, you can find the following Connection Option properties which can be edited to add or remove connection options:
  • JDBC Connection Options: User-defined JDBC connection options for the Database Type of choice that Matillion ETL uses in its connection to the source database during the initial CDC run.
    • Microsoft SQL Server - see their website for more details. 
    • Oracle - see their website for more details.
    • PostgreSQL - see their website for more details. 
    • MySQL - see their website for more details. 
      Note: For some databases, you must first provide a JDBC driver as not all drivers can be distributed with Matillion ETL. See this article for instructions on managing drivers. 
  • DMS Connection Options: User-defined ODBC connection options for the Database Type of choice that the Replication Instance uses in its connection to the source database.

These options can be tested using the Test button beneath.

Clicking Next will take a minute to confirm your configuration before moving on to the next part of the setup.


3. Creating Source Endpoint

In this step, Matillion ETL is attempting to create a DMS Endpoint and will fail if it cannot, or if the source database cannot be reached. 





 

4. Target Endpoint Configuration

Now we're setting up your S3 to accept table changes and your target database to be updated with them. It might be helpful to think of this as the logical counterpart of the config we just completed; first we connect CDC to the source, now we connect it to the target.
  • Endpoint Name: An arbitrary name that will be the name of the DMS endpoint.
  • S3 Role ARN: IAM Role ARN to attach to the DMS Task Endpoints that gives access to S3 buckets. This Role must be set up in advance by the user - see the S3 Role ARN Section below for details.
  • Bucket Name: The name of the desired S3 bucket. These are autodetected by the credentials you used in the Manage menu.
  • Batch Interval (s): Timing (in seconds) between checks to see if there are differences between your target and source (RedshiftSnowflake and MYSQL, respectively). We use a default of 300 seconds and suggest anywhere between 60 and 3600. Note that larger CDC Tasks require more time to complete and there is little point having a Max Batch Interval lower than the time it takes for the full CDC Task to complete.
  • Enable Logging: Enable logging of information of migrations during this CDC Task.
  • Table Selection: User-defined rules on which tables to include in the CDC Task (which source tables we want to keep updated in RedshiftSnowflake).
    • 1 source table results in 1 Matillion ETL job updating 1 RedshiftSnowflake table. Including hundreds or thousands of source tables can potentially bottleneck your Matillion ETL instance.
    • Includes nothing by default. A rule to include tables must be created before a rule to exclude any.
    • symbol denotes any number of wildcard characters. Using a lone will match all tables, such as the Include example below.
    • Schema name is: The schema name in your source database that holds the desired tables.
    • Table name is like: A string to match source table names by. Using a lone % denotes any and all tables in the source.
    • Action: Whether this rule is to Include or Exclude matched tables. All tables are excluded by default. You must provide an Include action before providing an Exclude action to filter it if desired.
 

5. Creating Target Endpoint



 

6. Job Configuration

  • Environment:  The Environment to run the job in.
  • Staging Schema:  The schema that contains the table(s) you wish to use for the staged data. The staged data is the raw change data from the source database.
  • Staging Table Prefix: An arbitrary prefix for target table names.
  • Target Schema: The schema that is to contain the target table(s). The target data is an identical copy of the source data.
  • Target Table Prefix: An arbitrary prefix for target table names.
  • Job Folder: The folder in which to put the CDC job in Matillion ETL. Clicking Manage will open an explorer where you can select a folder as shown below.
Clicking Next will create a DMS Target Endpoint with the name specified in the Target Name field. If that is successful completing the cdc setup on the following page will create the DMS Task and a Run Orchestration job with the same name as the DMS Task is created on the client in the current Version. Matillion ETL will prompt for the user to select whether they want to run the CDC Task immediately or not.

Regardless of your choice, you should see your new CDC Job in the folder you chose. This Job can be edited as like any other Job and it is often desirable to place a workflow after the CDC component such that a workflow is instigated after each update. The job is parameterised using Job Variables according to the CDC Task configuration you completed in the steps above. These Job Variables can be changed to alter the CDC Job's behaviour but this is absolutely not recommended.



Running and Managing CDC Tasks

From the Manage CDC Panel (Project → Manage CDC), CDC Tasks are  listed by Identifier with their Status, Source and Completion % displayed. Beside this information are buttons to ■ Stop or ► Start  the CDC Task and an X to Delete the CDC Task.


Starting and Stopping Jobs

When started, the initial run of a CDC Task will update all source tables fully and bring them into RedshiftSnowflake. After this, the CDC Task will continue to run and incrementally check for updates to the source tables until the Task is stopped via the Manage CDC panel.

When stopping a CDC Task, users can choose to either pause the task or fully stop it. The former can be resumed later while the latter will instigate a full update upon restarting, as though the task had never run before. Thus, it is more expensive (in both computational resources and monetary cost) to fully stop and resume a job.

CDC tasks can be paused and resumed or started again from scratch (fully reloading all tables).

Note: You must click Refresh to see the updated progress for CDC Tasks.


Deleting CDC Tasks

Tasks can be deleted if they are stopped using the X icon at their right side. Clicking this button will bring up a new dialog that asks the user to confirm the deletion and lists AWS resources that will be automatically destroyed as part of this CDC task being destroyed.



 


 

Manage CDC Credentials

The credentials requested in the Manage CDC dialog can either be Instance Credentials or User-Defined Credentials. These are used for AWS resource creation and modification, and for listening to the sqs queue. Since these credentials need only be used for the CDC Configuration, we recommend:

  1. Making a new IAM User in the AWS Console
  2. Create a new policy (attached to the user) with the actions given below.
  3. Taking the Access Key and Secret Key for this user.
  4. Creating User-Defined Credentials in Matillion ETL using the Access Key and Secret Key
  5. Using the new User-Defined Credentials solely for your Manage CDC Configuration.

Note 1: The below policy makes all resources available to the role but security-conscious users may opt to add their own resource ARNs.
Note 2: The sqs:listqueues and iam:listroles permissions are optional as they are used solely for populating selection boxes in the ui which can be filled in with the correct values regardless.


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sqs:DeleteMessage",
                "sqs:GetQueueUrl",
                "sqs:ChangeMessageVisibility",
                "sqs:ReceiveMessage",
                "s3:PutBucketNotification",
                "s3:GetBucketNotification",
                "dms:CreateEndpoint",
                "dms:CreateReplicationTask",
                "dms:DeleteEndpoint",
                "dms:DeleteReplicationTask",
                "dms:DescribeConnections",
                "dms:DescribeEndpoints",
                "dms:DescribeReplicationInstances",
                "dms:DescribeReplicationTasks",
                "dms:ModifyEndpoint",
                "dms:StartReplicationTask",
                "dms:StopReplicationTask",
                "dms:TestConnection",
                "ec2:DescribeRegions",
                "iam:ListRoles",
                "iam:PassRole",
                "lambda:AddPermission",
                "lambda:CreateFunction",
                "lambda:DeleteFunction",
                "lambda:GetFunction",
                "lambda:GetPolicy",
                "lambda:RemovePermission",
                "lambda:TagResource",
                "lambda:UpdateFunctionCode",
                "lambda:UpdateFunctionConfiguration",
                "s3:ListAllMyBuckets",
                "sqs:ListQueues"
            ],
            "Resource": "*"
        }
    ]
}
 

Lambda Role ARN

The Lambda Role ARN is used as part of the Manage Dialog. Your Lambda Role will need access to the S3 bucket and SQS Queue used in the CDC process. Create a new Lambda Role and attach the policy below. When creating a new Role, ensure that the type of trusted entity is set to AWS Service and Lambda is selected as the service.




Selecting the new Lambda Role in the AWS Console (IAM section) will expose the Role ARN that can then be used in Matillion ETL.

Note: The below policy makes all resources available to the role but security-conscious users may opt to add their own resource ARNs.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "sqs:GetQueueUrl",
                "sqs:SendMessage"
            ],
            "Resource": "*"
        }
    ]
}

 

S3 Role ARN

The S3 Role ARN is used in the creation of a new CDC Task. This is a Role that has DMS as a Trusted Entity (ergo, the Role has DMS as a Trust Relationship) and gives your Replication Instance access to the S3 bucket it needs to write changes to. Create a new DMS Role and attach the policy below. 

Warning: The S3 Role ARN is named so due to the permissions it gives. However, it MUST be attached to the DMS service. When creating a new Role, ensure that the type of trusted entity is set to AWS Service and DMS is selected as the service.



Selecting the new Lambda Role in the AWS Console (IAM section) will expose the Role ARN that can then be used in Matillion ETL.

Note: The below policy makes all resources available to the role but security-conscious users may opt to add their own resource ARNs.


{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject",
                "s3:ListBucket"
            ],
            "Resource": "*"
        }
    ]
}

 

CDC Log Table

The shared job writes a log table to the defined staging schema:

<staging schema>.mat_cdc_log

This will be created by the Shared Job the first time a CDC Task executes in that staging schema, it contains the following columns:
  • cdc_task_name: The CDC Task Name as entered when the CDC Task was created
  • cdc_target_table: The name of the table being updated in the <target schema>
  • cdc_dms_timestamp: The timestamp of the file produced by DMS (initial load files will be listed as 1900-01-01 00:00:00 as they have no timestamp)
  • cdc_mat_jobstart: The timestamp of when Matillion began processing the file
  • cdc_s3_bucket:The bucket containing the DMS file
  • cdc_s3_filename: The path to the DMS file within the bucket
  • cdc_table_status: The status code of the load of the file. Any non-zero value against table will cause subsequent loads to fail. However, a value of 99 can also indicate a load is still in progress and will be updated to 0 when the load completes successfully. Possible codes are:
    • 0: Load successful
    • 1: A previous load for this table failed or the file is dated before the last loaded file - the file has not been applied to the RedshiftSnowflake Table
    • 99: The is still in progress or has failed load - no further update files will be applied for this file

One row will be written for each DMS file Matillion ETL attempts to load - the table is designed to act as a complete history of each table.


Useful Queries

Find any loads in progress, or loads that did not complete successfully:

SELECT * 
FROM mat_cdc_log
WHERE cdc_table_status = 99
ORDER BY cdc_mat_jobstart DESC;

Find any tables that have failed loads:

SELECT DISTINCT cdc_task_name, cdc_target_table 
FROM staging.mat_cdc_log
WHERE cdc_table_status > 0
ORDER BY cdc_mat_jobstart 

Find the complete load history for a given table:

SELECT * 
FROM staging.mat_cdc_log
WHERE cdc_target_table = '<table name>'
ORDER BY cdc_mat_jobstart DESC
 

​Video