Incremental or High Water Mark data loading

Overview

It is a common use case to retrieve regular batches of data from a data source where the source table has a last updated time-stamp or other incrementing key. This makes it easy to retrieve just the data that has been updated. The following explains a technique for loading that data efficiently and processing micro-batches of the data in Matillion ETL.

Incremental loads are often easier dealt with by using one of Matillion ETL's Incremental Load Generators.

 

Explanation of the attached jobs

Import the jobs attached using Project → Import

The Orchestration Job uses a single recursion of the Table Iterator component to get the high water mark from a view created on the target target table of the audit information (see the Data Transformation Job), then its a simple where clause to get the new rows into a staging table and publish the output to Cloudwatch so we can keep an eye on it (and create a pretty chart).

The Transformation Job is a dummy but you can do what you like here so long as the max ID or Date is preserved in the output ready to be queried the next time.

The important element here is the second transformation that does an Aggregation to get the high watermark and ends on the Create View. This was the view used to iterate in the orchestration. This Transform does not need to be rerun every time because the view is preserved once created. If the data has a primary key you could use the Table Update component to effectively merge new output records.

Note this required a recent version of Matillion in order to be able to import the variable called max_event_date and rows_transferred.

 

Incremental Loading Video Guide

We apologise that there is no video guide for this article, yet. It is our goal to have video help for all aspects of Matillion ETL and we will update this page as soon as it is available.