S3 Load Component

Note: This feature is only available for instances hosted on AWS.

Load data into an existing table from objects stored in Amazon Simple Storage Service (Amazon S3).

Many of the configuration settings on this component have sensible defaults, mirroring the defaults provided by Redshift or Snowflake when that option is not specified. Mandatory settings are:

  1. Target Table Name
  2. Load Columns
  3. S3 URL Location
  4. S3 Object Prefix
  5. Data File Type

In addition, it is likely you will need to confirm the following settings (illustrated in the sketch after this list):

  1. Delimiter
  2. Compression Method
  3. Ignore Header Rows
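
To make these settings concrete, here is a minimal Python sketch of roughly the kind of Redshift COPY statement the component generates from them. All table, column, path, and role values are hypothetical placeholders, not values taken from this document; the component builds and runs the real statement for you.

```python
# Hypothetical illustration of how Delimiter, Compression Method and
# Ignore Header Rows map onto a Redshift COPY statement.
table = "public.flights"               # Target Table Name (placeholder)
columns = "flight_id, origin, dest"    # Load Columns (placeholder)
s3_url = "s3://example-bucket/landing/flights"  # S3 URL Location + Object Prefix

copy_sql = (
    f"COPY {table} ({columns}) "
    f"FROM '{s3_url}' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role' "
    "DELIMITER ',' "    # Delimiter
    "GZIP "             # Compression Method
    "IGNOREHEADER 1;"   # Ignore Header Rows
)
print(copy_sql)  # in practice, execute this through your Redshift connection
```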

Note: This component requires working AWS credentials with read access to the bucket containing the source data file(s). This is easily achieved by attaching an IAM role to the instance when launching Matillion ETL; however, it can also be managed manually by editing an Environment. See the example below.

Furthermore, Matillion requires the use of a policy that contains the s3:ListBucket action, such as the policy provided in the Managing Credentials documentation.
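
If you want to verify that the attached role or entered keys really do carry s3:ListBucket on the source bucket, a quick boto3 check is one way to do it. This is a sketch under assumptions: the bucket and prefix below are hypothetical placeholders.

```python
import boto3
from botocore.exceptions import ClientError

BUCKET = "example-bucket"  # hypothetical source bucket
PREFIX = "landing/"        # hypothetical object prefix

s3 = boto3.client("s3")  # picks up the instance role or configured keys
try:
    # list_objects_v2 exercises the s3:ListBucket permission
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=5)
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])
except ClientError as err:
    print("Listing failed; check the policy attached to your credentials:", err)
```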

Properties

Each property below is listed as Property (Setting type): Description. For more information on all the settings in this component, see the Amazon Redshift COPY syntax or the Snowflake COPY INTO table syntax. A sketch after this list shows how several of these properties map onto a Snowflake COPY INTO statement.
Stage (Select): Select a staging area for the data. Staging areas can be created through Snowflake using the CREATE STAGE command. Internal stages can be set up this way to store staged data within Snowflake. Selecting [Custom] will present the user with properties to specify a custom staging area on S3.
S3 Object Prefix (Text): The S3 bucket location and file path for any data to be loaded.
Pattern (Text): A string that will partially match all filenames that are to be included in the load. Defaults to '.*', indicating all files within the S3 Object Prefix.
Name (Text): The descriptive name for the component.
Warehouse (Select): Choose a Snowflake warehouse that will run the load.
Database (Select): Select the database that the target table belongs to.
Schema (Select): Select the table schema. The special value [Environment Default] will use the schema defined in the environment. For more information on using multiple schemas, see this article.
Table (Select): Select an existing table to load data into.
Load Columns (Select Multiple): One or more columns that exist in the target table.
Format (Select): Select a premade file format that will automatically set many of the S3 Load component properties accordingly. These formats can be created through the Create File Format component.
File Type (Select): Select the expected file type that data will be loaded from. Supports AVRO, CSV, JSON, ORC, PARQUET, and XML. Component properties will change to reflect the choice made here and give options based on the specific file type.
S3 URL Location (S3 Tree): The URL of the S3 bucket to get the files from. This follows the format s3://bucket-name/location, where location is optional.
S3 Object Prefix (Text): All files that begin with this prefix will be included in the load into the target table.
Encryption (Select, AWS Only): Decide how the files are encrypted inside the S3 bucket. This property is available when using an existing Amazon S3 location for staging.
None: No encryption.
SSE Encryption: Encrypt the data according to a key stored on KMS.
S3 Encryption: Encrypt the data according to a key stored on an S3 bucket.
Client Side Encryption: Encrypt the data according to a client-side master key.
KMS Key ID (Select, AWS Only): The ID of the KMS encryption key you have chosen to use in the 'Encryption' property.
Master Key (Select): The ID of the master key you have chosen to use in the 'Encryption' property.
IAM Role ARN (Text): Supply the value of a role ARN that is already attached to your Redshift cluster and has the necessary permissions to access S3. This is optional; without it, the credentials of the environment (instance credentials or manually entered access keys) will be used. See the Redshift documentation for more information about using a role ARN with Redshift.
Data Type (Select): Available options are:
Delimited
CSV
Fixed Width: requires an additional 'Fixed Width Spec'. See the Amazon documentation for details.
JSON: requires an additional 'JSON Format'. See the Amazon documentation for details.
Avro: requires an additional 'Avro Format'. See the Amazon documentation for details.
Compression (Select): Whether the input file is compressed in GZIP, BROTLI, BZ2, DEFLATE, RAW_DEFLATE, or ZSTD format, or not compressed at all.
Delimiter (Text): The delimiter that separates columns. The default is a comma. A [TAB] character can be specified as "\t".
Record Delimiter (Text): The delimiter that separates records (rows) in the file. Defaults to newline. "\n" can also signify a newline; "\r" can signify a carriage return.
Field Delimiter (Text): The delimiter that separates columns. The default is a comma. A [TAB] character can be specified as "\t".
Fixed Width Spec (Text): Loads the data from a file where each column is a fixed width, rather than separated by a delimiter. Each column is described by a name and a length, separated by a colon; each described column is then separated by a comma.
For example, given four columns (name, id, age, state) with respective lengths 12, 8, 2, and 2, the fixed-width specification would be:
name:12,id:8,age:2,state:2

Note that the columns can have any plaintext name. For more information on fixed-width inputs, please consult the AWS documentation.
CSV Quoter (Text): Specifies the character to be used as the quote character when using the CSV option.
JSON Layout (Text): Defaults to 'auto', which should work for the majority of JSON files if the fields match the table field names. Optionally, specify the URL of a JSONPaths file to map the data elements in the JSON source data to the columns in the target table.
AVRO Layout (Text): Defaults to 'auto', which should work for the majority of Avro files if the fields match the table field names. Optionally, specify the URL of a JSONPaths file to map the data elements in the Avro source data to the columns in the target table.
Region (Select): The Amazon S3 region hosting the S3 bucket. This is not normally required and can be left as "None" if the bucket is in the same region as your Redshift cluster.
Compression Method (Select): Whether the input file is compressed in GZIP format, LZOP format, or not compressed at all.
Encoding (Select): The encoding the data is in. This defaults to UTF-8.
Remove Quotes (Select): Whether to remove any quotes surrounding data values. Note: this option is not available if the chosen data type is CSV.
Replace Invalid Characters (Text): If there are any invalid Unicode characters in the data, this parameter specifies the single-character replacement for them. Defaults to '?'.
Maximum Errors (Text): The maximum number of individual parsing errors that cause the whole load to fail. Values up to this will be substituted as null values. This value defaults to 0, but the Amazon default is 1000.
Date Format (Text): Defaults to 'auto'; this can be used to manually specify a date format.
Time Format (Text): Defaults to 'auto'; this can be used to manually specify a time format.
Timestamp Format (Text): Defaults to 'auto'; this can be used to manually specify a timestamp format.
Ignore Header Rows / Skip Header (Text): The number of rows at the top of the file to ignore. Defaults to 0.
Accept Any Date (Select): If enabled, invalid dates such as '45-65-2018' are not considered an error, but will be loaded as the null value.
Ignore Blank Lines (Select): If set, any blank lines in the input file are ignored.
Truncate Columns (Select): If set, any instance of data in the input file that is too long to fit into the specified target column width will be truncated to fit instead of causing an error.
Fill Record (Select): Allows data files to be loaded when contiguous columns are missing at the end of some of the records. The remaining columns are set to null.
Trim Blanks / Trim Space (Select): Removes trailing and leading whitespace from the input data.
NULL As / Null If (Text): Replaces the specified string with null in the output table. Use this if your data has a particular representation of missing data.
Error On Column Count Mismatch (Select): Generate an error if the number of delimited columns in the input does not match that of the table. If false, extra columns are not loaded into the table and missing columns are recorded as NULL in the table.
Empty Field As Null (Select): If set, empty columns in the input file will become NULL.
Encoding Type (Select): The type of encoding that has been applied to the data. This defaults to UTF-8.
Enable Octal (Select): Enables parsing of octal numbers.
Allow Duplicates (Select): Allows duplicate object field names (keeping only the last one used).
Strip Outer Array (Select): Enables removal of outer square brackets from JSONs.
Strip Null Values (Select): Enables removal of null values.
Ignore UTF8 Errors (Select): When true, replaces invalid UTF-8 sequences with the Unicode replacement character (U+FFFD) instead of throwing an error.
Preserve Space (Select): When parsing XML files, leading and trailing spaces in elements are preserved if set to true.
Strip Outer Element (Select): When parsing XML files, strips the outermost XML element, exposing the second-level elements as separate documents.
Disable Snowflake Data (Select): When parsing XML files, disables recognition of Snowflake semi-structured data tags if set to true.
Disable Auto Convert (Select): When parsing XML files, disables automatic conversion of numeric and Boolean values from text to their native types if set to true.
Blanks As Null (Select): If set, blank columns in the input file will become NULL.
Comp Update (Select): Controls whether compression encodings are automatically applied during a COPY. This is usually a good idea to optimise the compression used when storing the data.
Escape (Select): When this option is specified, the backslash character (\) in input data is treated as an escape character.
Escape Unenclosed Field (Text): Single-character string used as the escape character for unenclosed field values. Accepts common escape sequences, octal values, or hex values. Also accepts a value of NONE (default). If a character is specified in the 'Escape' field, it will override this field.
Field Optionally Enclosed (Text): A character used to enclose strings. Can be single quote (') or double quote (") or NONE (default). Note that the character chosen can be escaped by that same character.
Stat Update (Select): Governs automatic computation and refresh of optimizer statistics at the end of a successful COPY command.
Round Decimals (Select): If set, rounds any decimals to fit into the column when the number of decimal places in the input data is larger than that defined for the target column.
Manifest (Select): Whether or not the given object prefix is that of a manifest file. See the Redshift documentation for more information.
Explicit IDs (Select): Whether or not to load data from the S3 objects into an IDENTITY column. See the Redshift documentation for more information.
On Error (Select): Choose what to do when this component encounters an error.
Size Limit (B) (Text): The upper limit on the size of any individual file loaded.
Purge S3 Files (Select): Choose whether to destroy the source file on the S3 bucket after loading its data.
Force Load (Select): When TRUE, loads all files into Snowflake even if a file has not changed since a prior load. When FALSE (default), the component will attempt to avoid such redundant loads.
Metadata Fields (Select Multiple): Choose any Snowflake metadata columns to be included in the load.
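
For the Snowflake variant, many of the properties above map directly onto clauses of the underlying COPY INTO statement. The sketch below shows roughly how a handful of them line up; the table, stage, and values are hypothetical placeholders rather than the component's actual output.

```python
# Hypothetical mapping of component properties to Snowflake COPY INTO clauses.
copy_into = """
COPY INTO demo_db.public.airports   -- Database / Schema / Table
FROM @my_s3_stage                   -- Stage
PATTERN = '.*'                      -- Pattern
FILE_FORMAT = (
    TYPE = CSV                      -- File Type
    FIELD_DELIMITER = ','           -- Field Delimiter
    SKIP_HEADER = 1                 -- Skip Header
    COMPRESSION = GZIP              -- Compression
)
ON_ERROR = ABORT_STATEMENT          -- On Error
PURGE = FALSE                       -- Purge S3 Files
FORCE = FALSE;                      -- Force Load
"""
print(copy_into)  # in practice, run this through a Snowflake connection
```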

Example

This example loads the Airports data into a new table.

The data files in this case are comma-separated, GZIP compressed files with a header row.

On completion, the number of rows loaded is shown in the status.

You can check that your instance credentials have S3 access, or define additional credentials, by editing the Environment. See this page for further details.
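
If you want to double-check the result outside Matillion, a short snowflake-connector-python sketch can confirm the row count shown in the status. The connection parameters and table name here are hypothetical placeholders.

```python
import snowflake.connector

# Hypothetical connection details; substitute your own account and credentials.
conn = snowflake.connector.connect(
    account="xy12345",
    user="etl_user",
    password="REPLACE_ME",
    warehouse="LOAD_WH",
    database="DEMO_DB",
    schema="PUBLIC",
)
try:
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM airports")  # table loaded in the example
    print("rows loaded:", cur.fetchone()[0])
finally:
    conn.close()
```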