The problem we see is that there is no way from Matillion to execute the .SQL file they produce in order to create the proper schema in Redshift. We have already seen month-to-month differences in the schema, so it's important to be able to create the monthly tables in Redshift according to the CREATE statement they produce.
Any ideas/suggestions are welcome.
8 Community Answers
Ian Funnell —
You should be able to execute the .SQL file from an SQL Script component. The generated SQL contains a CREATE TABLE statement and a COPY (i.e. a bulk load).
Other alternatives might be to use the Matillion S3 Load Generator on the file, or to access the files through Athena/Spectrum. But if the schema keeps changing like you’ve observed then it will be best to use their generated .SQL file.
Sorry, I'm still missing something here. The SQL that I need to execute is in an S3 object. The SQL Script component seems to only execute a SQL statement typed into the component text field. How would I execute the SQL stored in the S3 object?
You can load the SQL into a table in the data warehouse, then use a Table Iterator component to read the SQL from that table and move it into a Job Variable. Once the SQL is available in the variable, you can use it in the SQL Script component. I know it's a slightly involved process, but it would get the job done.
- a .SQL file in S3 with a script to create the target table
- a .CSV file with the data
I would recommend using a Python Script component. You can read the script from the file in S3 and use the cursor object to execute it against Redshift. There is an example of using the cursor object in the Python Script component. Please note, you need the Jython interpreter (the default) to use the cursor object.
Then use an S3 Load component to load the .CSV file to the newly created table.
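The first step above (read the .SQL file from S3 and execute it via the cursor) could be sketched roughly as below. This is a non-authoritative sketch, not Matillion's documented recipe: the bucket and key names are placeholders, `context.cursor()` is assumed to be the cursor Matillion exposes under the Jython interpreter, the availability of an S3 client such as boto3 inside the component is an assumption, and the naive semicolon split is only suitable for simple generated DDL (it does not handle semicolons inside string literals or comments).

```python
def split_statements(sql_text):
    """Split a SQL script into individual statements on semicolons.

    Naive split: does not handle semicolons embedded in string
    literals or comments, which is usually fine for simple
    machine-generated CREATE TABLE / COPY scripts.
    """
    return [s.strip() for s in sql_text.split(';') if s.strip()]


def run_sql_from_s3(bucket, key):
    """Read a .SQL object from S3 and execute each statement.

    Assumptions: boto3 is importable in this environment, and
    `context.cursor()` is the database cursor the Matillion Python
    Script component provides (Jython interpreter only).
    """
    import boto3  # assumption: available to the script component
    body = (boto3.client('s3')
                 .get_object(Bucket=bucket, Key=key)['Body']
                 .read()
                 .decode('utf-8'))
    cur = context.cursor()  # Matillion-provided cursor (assumption)
    for stmt in split_statements(body):
        cur.execute(stmt)
```

If the monthly script only ever contains the CREATE TABLE plus the COPY, this executes both in order, so the table exists before the load runs.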
An option on the S3 Load component that would be really useful here is to load only the specified columns from a CSV file, where the column name in the header row of the CSV matches a column name in the target table. In the current situation, I have no control over the contents of the CSV file, but I do have control of the target table schema. I would like to load only the columns from the CSV that match the columns in my target table, and ignore any other columns in the CSV.
The only way I can think of is to load the complete file and transfer the specific columns to your target table via a transformation job. However, you may already be doing this!
Alternatively, you would need some external process to extract the said columns to a separate file and then load that into Redshift. I can't think of any tools off the top of my head, but it must be possible. Are there any issues with your current process?
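The external preprocessing step suggested above could be a short script: read the CSV header, keep only the columns whose names appear in the target table, and write a trimmed file for the S3 Load component to pick up. A minimal sketch, assuming you can derive the target column list from your own table DDL (the function name and paths here are hypothetical):

```python
import csv


def filter_csv_columns(src_path, dest_path, target_columns):
    """Copy src CSV to dest, keeping only the columns whose header
    name appears in target_columns; any extra CSV columns are dropped.
    Column order follows the source file's header order.
    """
    with open(src_path, newline='') as src, \
         open(dest_path, 'w', newline='') as dest:
        reader = csv.reader(src)
        header = next(reader)
        # Indexes of the columns we want to keep
        keep = [i for i, name in enumerate(header) if name in target_columns]
        writer = csv.writer(dest)
        writer.writerow([header[i] for i in keep])
        for row in reader:
            writer.writerow([row[i] for i in keep])
```

Run against the incoming file before (re)uploading it to S3, the trimmed output then matches the target table exactly, so a plain S3 Load works even when the source feed grows new columns.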