This is what I have done already. I have created two separate buckets in S3, each of which will receive CSV files containing data. Once a file is put into a bucket, it triggers a Lambda, which then sends a message to SQS; that in turn kicks off the Matillion job as expected. I have pointed both Lambda functions at the same SQS queue, so whenever I insert a file into either of the buckets, it fires the corresponding Lambda and starts the Matillion job.
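As a minimal sketch of the trigger side, this is roughly what such a Lambda handler might look like. The queue URL, message shape, and helper name are illustrative assumptions, not taken from the actual setup:

```python
import json


def parse_s3_event(event):
    """Extract (bucket, key) pairs from an S3 put-event notification."""
    return [
        (rec["s3"]["bucket"]["name"], rec["s3"]["object"]["key"])
        for rec in event.get("Records", [])
    ]


def lambda_handler(event, context):
    """Forward each uploaded file's location to the shared SQS queue.

    The queue URL below is a placeholder -- substitute your own.
    """
    import boto3  # available in the AWS Lambda runtime

    sqs = boto3.client("sqs")
    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/matillion-queue"  # placeholder
    for bucket, key in parse_s3_event(event):
        sqs.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )
```

Since both buckets notify the same queue, the message body is where the downstream job can tell which file arrived.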
My problem is that, since I have two buckets, the CSV files may arrive at different times, and I have to wait until both files are loaded into the Redshift staging tables before doing the ETL transformations and joins and pushing the result into the prod table.
How am I supposed to handle that? In other words, how can I make sure that data has arrived from both buckets before running the transformation and the other operations?
Any help would be appreciated. Thanks.
4 Community Answers
Laura Malins —
The easiest way to control this would be to have a control table held in your Redshift database. It can hold the names of your two files and an indicator of whether each has been loaded. After loading a file, your Matillion job can set the indicator to say that file has been loaded. The next step is then an If check on whether both files have been loaded. If so, run your transform and reset the control table at the end. If not, just end the job with success.
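The control-table logic above can be sketched as follows. This is a runnable simulation using sqlite3 in place of the Redshift connection; the table and column names (`file_control`, `loaded`) are illustrative, and in Matillion each function would correspond to a SQL component in the Orchestration job rather than Python code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for the Redshift connection
cur = conn.cursor()

# Control table: one row per expected file, with a loaded indicator.
cur.execute(
    """
    CREATE TABLE file_control (
        file_name TEXT PRIMARY KEY,
        loaded    INTEGER NOT NULL DEFAULT 0
    )
    """
)
cur.executemany(
    "INSERT INTO file_control (file_name) VALUES (?)",
    [("bucket_a.csv",), ("bucket_b.csv",)],
)


def mark_loaded(file_name):
    """Run by the load job after a file lands in its staging table."""
    cur.execute("UPDATE file_control SET loaded = 1 WHERE file_name = ?", (file_name,))


def all_files_loaded():
    """The If check: true only when every expected file has been loaded."""
    cur.execute("SELECT COUNT(*) FROM file_control WHERE loaded = 0")
    return cur.fetchone()[0] == 0


def reset_control():
    """Reset the indicators after the transform has run."""
    cur.execute("UPDATE file_control SET loaded = 0")


mark_loaded("bucket_a.csv")
print(all_files_loaded())  # only one file loaded so far -> False
mark_loaded("bucket_b.csv")
print(all_files_loaded())  # both loaded -> run the transform, then reset
```

In the Matillion job, `mark_loaded` would be the SQL run after each staging load, `all_files_loaded` would feed the If component, and `reset_control` would run after the transform completes.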
So the control table is just another "Create/Replace Table" component in a Matillion Orchestration job?
If so, I'll have to store the file names in that table as they come in. What do you mean by an indicator? Could you point me to the relevant component? Sorry, I'm still a newbie to this.