Schemas

Overview

Matilion ETL is aware that you may have multiple schemas. Each Matillion ETL environment specifies a default schema.

The default schema is used by Matillion ELT in the following ways:

  • It is the default place that tables are searched for in components that read and write data such as Table Input
  • It is the schema where Matillion ETL will write its own internal views that represent transformation components.

Although powerful, using multiple schemas adds an additional layer of complexity which needs to be carefully considered and managed.

Single-schema setup

When creating your first project, an environment is created for you. An environment describes your connection to where your tables are stored, and includes a default schema. Often, this is called ‘public’.

If you have no particular need for multiple schemas you can simply leave this as ‘public’, ignore multiple schemas. In any component that allows you to specify a specific schema, just leave the default setting of [Environment Default]. Then, if you decide to change the default schema, you only have to change it in one place (the environment) and everything will continue to work in the new schema.

Whenever a component asks for a schema, it will almost always need a table or tables too - the schema defines which set of tables are visible.

Single-schema per environment setup

This is the setup favoured internally within Matillion for its own projects and balances the power of multiple schemas with the ease of use of Matillion ETL. It is particularly useful when you want to separate Test, Staging and Production environments within a single database, but expect all the same database tables to exist in each schema.

There is always a currently selected environment which can be changed at any time in the environments panel. Many operations will use the currently selected environment, and therefore its default schema, unless you choose to use a different environment.

Each environment can (and should) use a different Default Schema.

You can run a job in any environment, so you can design your ETLs and run them against a test environment, and when fully tested and production ready you can then schedule them to run in a Staging or Production environment.

The environment to run a job in is available everywhere - from the scheduler, within the SQS queue message, and the right-click menu on a job designer.

Component-level Explicit Schema

This setup can be useful if you want to use the “Single-schema per environment” setup outlined above, with the exception that all of those environments can share some of their data available in another schema. For example, source data may be staged into a single schema, which can then be run through to a Test or Production environment as above.

In this case, following the setup for the “Single-schema per environment”, but when reading or writing particular tables override the default schema within the component.

 

Once this is set, you may run the job in any environment but the selected schema will always be used for that component regardless of the default set in the environment.

You must ensure the database user specified in the environment has the necessary access to read/write data in those additional schemas.

 

Component-level Explicit Schema using Variables

One common best practice that Matillion customers employ is to use three schemas arranged as shown in this diagram.

This approach gives you separation of concerns between the schemas and allows security grants to be applied at schema level. It means that the views created by Matillion ETL do not "pollute" the staging data schema and are not visible to analysis users using a Data Warehouse Analysis tool. Setting this up is best achieved with variables.

The downside of section 3 (using Component-level explicit schemas) is that once they are selected they are fixed. If you ever rename a schema, or use multiple clusters or multiple databases within a cluster, your schemas may not be named consistently and your job may fail when run against another cluster.

For this, you must use a second level of indirection using variables. All environments, in addition to specifying the database connection details and default schema, carry a set of variables which can be referenced in many places. Define a variable to use as the schema using the Edit Environment Variables screen:

Then use the variable name when referencing a schema in a component:

The schema is then dynamic, and takes it values from the variable value associated with the environment the job runs in.


 

Creating External Schemas

Using external tables requires the availability of Amazon Spectrum. Currently-supported regions are US East (Northern Virginia), US East (Ohio), US West (Oregon), as well as Europe (Ireland) and Asia Pacific (Tokyo). For information on how to connect Amazon Redshift Spectrum to your Matillion ETL instance, see here.

  1. Expand the 'Environments' list at the bottom-left of the client.
  2. Right click on the intended environment (one that is associated with the Redshift cluster we previously enabled Amazon Redshift Spectrum policies on).
  3. Select 'Create External Schema' from the right-click menu.
  1. Enter a name for your new external schema. Ensure this name does not already exist as a schema of any kind.
  2. Add the name of your athena data catalog. A new catalog will be created if the name entered does not correspond to an existing Athena catalog associated with your AWS account.
  3. Add the Role ARN of the role used to allow Amazon Redshift Spectrum access to your EC2 instance.

Now components within Matillion that make use of external tables (and thus, Amazon Redshift Spectrum) can be used providing they use this external schema. Note that any tables that exist within the linked Athena Catalog will be instantly available for use in Matillion ETL through the external schema that links to them.
 

Refresh Source Schema

Sometimes you might effect change of a schema that Matillion ETL is using outside of the client. An example of this might be creating a new table or view via console commands and then attempting to find that table in a Data Staging component, only to find it does not appear. This is due to Matillion ETL using a cache of table data for many components.

To sync the Matillion ETL client with your platform and resolve this issue, right-click the component in question and select Refresh Sourrce Schema.