I'm having trouble figuring out how (or whether it is possible) to identify the 'format' of a file using its contents.
I would like to be able to receive files with a non-identifying name like "file06018.gz" and use the first row of the file, containing some string like "Title:User" or "Title:Venue", to determine which type of Redshift table to load the data into.
Any help would be much appreciated,
3 Community Answers
David Lipowitz —
Thanks for the inquiry. Unfortunately, there’s no out-of-the-box Matillion component that does what you describe.
However, Matillion does have a Python Script component, and you may be able to do this using Python's gzip module.
You'll need to be very careful here and ensure that you're using a streaming approach, as loading very large files into memory can cause the Matillion instance to become unresponsive.
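Something along these lines might work. This is only a sketch: the header strings come from your question, but the table names and the detect_table helper are illustrative placeholders, not an existing Matillion API.

```python
import gzip

# Hypothetical mapping from the first-row header string to a
# Redshift staging table; adjust the names to match your schema.
HEADER_TO_TABLE = {
    "Title:User": "users_staging",
    "Title:Venue": "venues_staging",
}

def detect_table(path):
    """Return the target table for a gzip file by reading only its first line.

    gzip.open decompresses lazily, so readline() pulls just enough of the
    file to produce one line -- the whole file never sits in memory.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        first_line = f.readline().strip()
    return HEADER_TO_TABLE.get(first_line)
```

The Python Script component could then write the result into a job variable that a later S3 Load component reads to pick the target table.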
Perhaps some overall background would be helpful here. I am trying to create a system in which files can be loaded into S3 and, using a lambda function (per this article: https://redshiftsupport.matillion.com/customer/en/portal/articles/2243961-triggering-etl-from-an-s3-event-via-aws-lambda?b_id=8915), become accessible to Matillion.
Within the lambda function, a variable, file_to_load, is defined that becomes accessible to Matillion (I use it to load S3 files).
Since posting yesterday, I was wondering if perhaps this could be done using the Bash Script Component. I have tried doing so but have encountered some issues.
When running the bash script that is meant to perform the necessary transformation, I run into an issue accessing the actual file, since it lives in S3 rather than on the local EC2 instance the script runs on. Is there a way to use an S3 path to refer to and/or operate on the files as needed, or is that impossible? I am hoping I can use the file_to_load variable to make that reference.
If the file were stored locally, I could simply refer to it via the argument passed when calling the script and perform the needed transformations.
If this is possible, would that also cause the unresponsiveness you mentioned earlier?
Thanks for all the detail! Can you tell me more about the issues that you encountered with the Bash Script component? Bash scripts executed through Matillion should be able to invoke AWS CLI commands like “aws s3” if that’s what you were attempting.
That said, the typical use case for Matillion is to load the file directly from S3 into a Redshift table, and then to transform the data after it's loaded. Needing to inspect the file for its format does throw a wrinkle into that pattern, so if there's a way to know the format ahead of time, I think that will greatly simplify your workflow.
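If the header really does have to be inspected before loading, one way to avoid copying the whole object onto the instance is to stream just the start of it. A sketch using boto3 (assumed to be installed on the Matillion instance; the bucket and key would come from something like the file_to_load variable mentioned above):

```python
import gzip

def first_line_from_gz_stream(fileobj):
    """Read only the first decompressed line from a gzip byte stream."""
    with gzip.GzipFile(fileobj=fileobj) as gz:
        return gz.readline().decode("utf-8").strip()

def first_line_from_s3_gz(bucket, key):
    """Fetch an S3 object's body as a stream and peek at its first line.

    boto3's StreamingBody is read incrementally by GzipFile, so only as
    much of the object as is needed for one line is read into memory,
    rather than the whole file.
    """
    import boto3  # assumed to be available on the Matillion instance
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    return first_line_from_gz_stream(body)
```

The returned header ("Title:User" or "Title:Venue") could then drive which table the orchestration job loads into.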