Create External Table

This component enables users to create an external table that references data stored in an S3 bucket. Note that the table only references data that is held externally; the table itself does not hold the data. It is important that the Matillion ETL instance has access to the chosen external data source.

Referencing externally held data can be valuable when you want to query large datasets without storing that same volume of data on the Redshift cluster.
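
In practice, the component issues a Redshift Spectrum CREATE EXTERNAL TABLE statement on your behalf. A minimal sketch of the kind of statement produced is shown below, assuming a simple delimited text file; the external schema, table name, columns, and S3 path are placeholders rather than values from a real job.

-- Illustrative only: the external schema, table, columns and S3 path are placeholders
CREATE EXTERNAL TABLE my_external_schema.my_table (
    id      INTEGER,
    payload VARCHAR(1000)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-bucket/my-data/';

No data is copied; the table simply points Redshift Spectrum at the files under that S3 location.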

External tables are part of Amazon Redshift Spectrum and may not be available in all regions. For a list of supported regions, see the Amazon documentation.

For full information on working with external tables, see the official documentation here.

Note: Nested data loads from JSON or Parquet file formats may also be set up using this component via the 'Define Nested Metadata' checkbox in the 'Table Metadata' property. An example of this can be found at the bottom of this article.


Properties

Property Setting Description
Name Text The descriptive name for the component.

This is automatically determined from the table name when the Table Name property is first set.
Schema Select Select the table schema. Note that external tables require external schemas; regular schemas will not work. To learn more about external schemas, please consult the 'Configuring The Matillion ETL Client' section of the Getting Started With Amazon Redshift Spectrum documentation. The special value, [Environment Default], will use the schema defined in the environment. For more information on using multiple schemas, see Schema Support.
New Table Name Text The name of the table to create or replace.
Create/Replace Select Create: The default option. This creates a new table and will generate an error if a table with the same name already exists, but will never destroy existing data.
Create if not exists: This will only create a new table if a table of the same name does not already exist. It will not destroy existing data. If the schema of the existing table does not match the schema defined in this component, no attempt is made to fix or correct it, which could lead to errors later in the job if you did not expect the table to already exist, or if it has a different schema to the one defined in this component.
Replace: This drops any existing table of the same name and then creates a new table. This guarantees that after the component succeeds the table matches the schema defined in this component; however, any existing data in the existing table will be lost.
Note: Since other database objects may depend upon this table, drop ... cascade is used, which may actually remove many other database objects.
Table Metadata Column Name The name of the new column.
Data Type Text: This type can hold any type of data, subject to a maximum size. More...
Integer: This type is suitable for whole-number types (no decimals). More...
Numeric: This type is suitable for numeric types, with or without decimals. More...
Boolean: This type is suitable for data that is either 'true' or 'false'. More...
Date: This type is suitable for dates without times. More...
DateTime: This type is suitable for dates, times, or timestamps (both date and time). More...
Size For Text types, this is the maximum length. This is a limit on the number of bytes, not characters. For Redshift, since all data is stored using UTF-8, any non-ASCII character will count as 2 or more bytes.
For Numeric types, this is the total number of digits allowed, whether before or after the decimal point.
Decimal Places Relevant only for Numeric, it is the maximum number of digits that may appear to the right of the decimal point.
Table Metadata (Nested) Checkbox When the Define Nested Metadata checkbox is ticked inside the Table Metadata property, a tree structure can be defined for the metadata. See Example 2 at the bottom of this article for more information.
Partition Select Multiple Mark one or more columns in this table as potential partitions. Partition columns allow queries on large datasets to be optimized when the query is made against the chosen partition columns. When a partition is created, values for that column become distinct S3 storage locations, allowing rows of data to be stored in a location that depends on their partition column value. For example, it is common for a date column to be chosen as a partition column, thus storing all other data according to the date it belongs to. When creating partitioned data using the Add Partition component, it is vital that those partitioned columns have already been marked using this property.
Location Select The S3 Bucket location for the external table data. The Matillion instance must have access to this data (typically, access is granted according to the AWS credentials on the instance or if the bucket is public).
Format Select Choose a format for the source file. Choose between PARQUET, RCFILE, SEQUENCEFILE, TEXTFILE and JSON.
Field Terminator Text Note: This parameter only applies when the Format parameter is set to TEXTFILE.
The delimiter that separates fields (columns) in the file. Defaults to \A.
Line Terminator Text Note: This parameter only applies when the Format parameter is set to TEXTFILE.
The delimiter that separates records (rows) in the file. Defaults to a newline; \n can also signify a newline, and \r can signify a carriage return.
Skip Header Rows Integer The number of rows at the top of the file to skip. Default is empty.
Strip Outer Array Select Note: This parameter only applies when the Format parameter is set to JSON.
Strips the outer array from the JSON file, enabling JSON files that contain a single, anonymous array to be loaded without error. Default is "No".
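
As a rough guide to how the TEXTFILE-related properties above map onto the generated DDL, a configuration with a custom field terminator, line terminator, and a single header row corresponds to a statement along these lines. This is a sketch only; the names and S3 path are placeholders, and the exact statement Matillion ETL generates may differ.

CREATE EXTERNAL TABLE my_external_schema.my_text_table (
    id   INTEGER,
    name VARCHAR(100)
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','                        -- Field Terminator
    LINES TERMINATED BY '\n'                        -- Line Terminator
STORED AS TEXTFILE                                  -- Format
LOCATION 's3://my-bucket/text-data/'                -- Location
TABLE PROPERTIES ('skip.header.line.count'='1');    -- Skip Header Rows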

Availability

Using external tables requires the availability of Amazon Redshift Spectrum. Currently supported regions are us-east-1, us-east-2, and us-west-2. For information on how to connect Amazon Redshift Spectrum to your Matillion ETL instance, see here.

Example

In this example, we have a large amount of data taken from the data staging component 'JIRA Query' and we wish to hold that data in an external table that is partitioned by date. By doing so, future queries against this data can be optimized when targeting specific dates. The orchestration job is shown below.


To begin, a new external table is created using the Create External Table component. In its properties (shown below) we give the table a name of our choosing and ensure its metadata matches the column names and types we expect from the JIRA Query component used later on.

If we are unsure about this metadata, it is possible to load data into a regular table using just the JIRA Query component, and then sample that data inside a Transformation job. The 'metadata' tab on the Table Input component will reveal the metadata for the loaded columns.

Most important are the 'Partition' and 'Location' properties. We choose to partition by the 'created' column - the date on which issues are created on JIRA, a sensible choice to sort the data by. The Location property is an S3 location of our choosing that will be the base path for the partitioned directories.

Note: The 'created' column is NOT included in the Table Metadata. The values for this column are implied by the S3 location paths, so there is no need to define a column for 'created'.
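
The definition built in this step is broadly equivalent to DDL of the following shape (a sketch; the column list and bucket path are illustrative rather than taken from the screenshots above):

CREATE EXTERNAL TABLE my_external_schema.jira_issues (
    issue_key VARCHAR(32),
    summary   VARCHAR(1000),
    status    VARCHAR(64)
)
PARTITIONED BY (created DATE)    -- the partition column is declared here, not in the column list
STORED AS TEXTFILE
LOCATION 's3://my-bucket/jira-issues/';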


To finish our partitioned table, we continue to the Add Partition component. Here we ensure the table name is the same as our newly created external table. We then choose a partition value, which is the value of the partition column ('created') for the rows that belong in this partition. In this case, we have chosen to take all rows from a specific date and partition that data. Finally, note that we have appended that same date to the Location we used before, so this partition has its own unique S3 location.
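
Behind the scenes, this amounts to an ALTER TABLE ... ADD PARTITION statement along these lines (a sketch using a placeholder date and bucket path):

ALTER TABLE my_external_schema.jira_issues
ADD PARTITION (created='2017-01-01')
LOCATION 's3://my-bucket/jira-issues/created=2017-01-01/';    -- each partition value gets its own S3 location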


Now all that's left is to load the data in via the JIRA Query component. Credentials for the chosen URL are entered and we make sure 'Data Selection' contains the columns we want for this data.


Note again that the included columns do NOT include the 'created' column that we will be partitioning the data by. However, we do add a Data Source filter to ensure we only take rows belonging to the date we want to create the partition for, shown below. Since this column's data type is 'datetime', we can specify all times within a certain date by ensuring the filter takes all rows after our date begins but before the next day starts.
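
The filter is the equivalent of a half-open date-range predicate of the following kind (a sketch; the staging table name and date are placeholders):

SELECT *
FROM jira_issues_staging
WHERE created >= '2017-01-01 00:00:00'    -- from the start of the chosen day...
  AND created <  '2017-01-02 00:00:00';   -- ...up to, but not including, the start of the next day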


The JIRA Query component is given a target table different to the external table we set up earlier. This is because data staging components will always drop an existing table and create a new one. Instead, we ensure this new external table points to the same S3 Location that we set up earlier for our partition. Thus, both this external table and our partitioned one will share the same location, but only our partitioned table contains information on the partitioning and can be used for optimized queries.

This data can be sampled using a Transformation job to ensure all has worked as planned. Simply use a Table Input component that is set to use an external schema and is pointed to the partitioned table we created earlier. This will bring the partitioned data into Matillion ETL, where it can be sampled.


Note that our sampled data DOES contain the 'created' column despite us not actually including it in the loaded data. This is because the partition column is implicitly given by the S3 location.
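
With the partition in place, a query that restricts on the partition column only needs to scan the matching S3 location (again a sketch with placeholder names):

SELECT issue_key, summary, status
FROM my_external_schema.jira_issues
WHERE created = '2017-01-01';    -- Redshift Spectrum prunes the scan to the single matching partition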


Example 2 - Nested Tables

Before using Matillion ETL's Nested Data Load component, it is necessary to create an external table capable of handling the nested data. This can be done by ticking the 'Define Nested Metadata' checkbox in the 'Table Metadata' property.

Below is a snippet of a JSON file that contains nested data. There are 4 top-level records, each with the name 's', and each contains a nested pair of columns: "col1", an integer, and "col2", a string.

{
    "s": {
    "col1":1,
    "col2":"one"
    }
}
{
    "s": {
    "col1":2,
    "col2":"two"
    }
}
{
    "s": {
    "col1":3,
    "col2":"three"
    }
}
{
    "s": {
    "col1":4,
    "col2":"four"
    }
}

Normally, Matillion ETL could not usefully load this data into a table, and Redshift itself has severely limited support for nested data. However, the Create External Table component can have a nested structure defined in the Table Metadata property by checking the Define Nested Metadata box.

To begin, we add a new structure by right-clicking the Columns structure and selecting Add. In the new menu that appears, we specify that our new Column Type is to be a structure and name it as we like. In this case, we name it "s" to match our rather arbitrary JSON.

Note: Struct, Array and Field names MUST match those in the JSON so that data can be mapped correctly. Failing to do so is unlikely to cause an error message but will cause Matillion ETL to overlook the data in the source files.

Note: Similar to the above, not all columns in the source JSON need to be defined and users are free to be selective over the data they include in the external table.


Now that we've added the 's' structure to our table, we need to add the data nested inside it. To do so, right-click the 's' structure we just created and again click Add. This time, we will be selecting Field as the column type and specifying what data type to expect. We do this process for each column to be added.


Since we added those columns to our 's' structure, they exist nested within it in our metadata, matching that of the JSON. We're now ready to complete the configuration for the new External Table.


Back on the component properties, we point the Location property to the S3 bucket that contains our nested JSON and set the Format property to JSON.

Note: Create External Table will attempt to take ALL files from the given S3 location, regardless of format, and load their data as an external table. Ensure that the given location contains only files to be loaded in this exact manner.
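
Taken together, the nested definition described above corresponds to DDL of roughly this shape (a sketch; the external schema, table name, and S3 path are placeholders):

CREATE EXTERNAL TABLE my_external_schema.nested_example (
    s struct<col1:int, col2:varchar(20)>    -- structure 's' containing the two nested fields
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/nested-json/';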


You now have an external table that references nested data. Matillion ETL (and Redshift) has limited functionality surrounding this form of data, and it is strongly advised that users refer to the Nested Data Load component documentation for help with loading this data into a practical form within a standard Redshift table.