Getting Started with Amazon Redshift Spectrum
Amazon Redshift Spectrum allows users to create 'External' tables that reference data stored in S3, allowing transformation of large data sets without having to host the data on Redshift. This is done through Amazon Athena that allows SQL queries to be made directly against data in S3.
Currently-supported regions are US East (Northern Virginia), US East (Ohio), US West (Oregon), as well as Europe (Ireland) and Asia Pacific (Tokyo).
To enable these services in Matillion, users are required to add relevant permissions to their Matillion ETL instance through the AWS console as well as creating an external schema.
Create A Role
Log into your AWS account console and through the main upper-left dropdown menu, select 'IAM'. This will take you to a new page where you can then browse to the 'Roles' section via the list at the left-hand side.
- A new Role can be created through the 'Create New Role' button at the top-left.
- Select the 'Amazon Redshift' option when asked to select a role type.
- Tick boxes for the following roles:
- Press 'Next Step' at the bottom right.
- Enter a role name and description of your choice.
- Press 'Create role' at the bottom right.
Your new role will be instantly created. Remember its name for the next step.
Alternately, an existing role can be selected and the same policies can be added through the 'Attach Policy' button. In either case, selecting the role intended for Amazon Redshift Spectrum use will give an overview of the role and allow you to take note of the Role ARN, which is used in the next section.
Attaching A Role
- Log into your AWS account console and through the main upper-left dropdown menu, select 'Redshift'.
- From the Redshift page, select Clusters.
- Check the box of the cluster you wish to allow Amazon Redshift Spectrum on (the cluster that the Matillion ETL instance will be using for Redshift).
- Click 'Manage IAM Roles' above the cluster list.
- From the 'Available roles' dropdown, select the role created in the previous section.
- Click 'Apply Changes' to give this role to the cluster.
Once complete, your Matillion ETL client is ready to be configured for use with Amazon Redshift Spectrum.
Configuring The Matillion ETL Client
- Log into your Matillion ETL instance
- Expand the 'Environments' list at the bottom-left of the client.
- Right click on the intended environment (one that is associated with the Redshift cluster we previously enabled Amazon Redshift Spectrum policies on).
- Select 'Create External Schema' from the right-click menu.
- Enter a name for your new external schema. Ensure this name does not already exist as a schema of any kind.
- Add the name of your athena data catalog. A new catalog will be created if this name is not found.
- Add the Role ARN of the role used to allow Amazon Redshift Spectrum as defined in the previous section.
Now components within Matillion that make use of external tables (and thus, Amazon Redshift Spectrum) can be used providing they use this external schema. Note that any tables that exist within the linked Athena Catalog will be instantly available for use in Matillion ETL through the external schema that links to them.
Amazon Redshift Spectrum-Ready Components
External Tables can be created using the Create External Table component and overwritten using the Rewrite External Table component. These components can write external tables in the following formats:
Query components in Matillion ETL are ready to use external tables with only a minor amount of configuration. In each of these components, locate the 'Type' property and set it to 'External' from the dropdown list. Then, ensure that the schema specified in that component's 'Schema' property is an external schema. Available tables in the 'Target Table' property will then be external tables available to that schema.
Most data staging ('Query') components can make use of these external tables.