Triggering Job Dependencies

Here is what I have done so far. I created two separate S3 buckets that will receive CSV files containing data. When a file is put into a bucket, it triggers a Lambda function, which sends a message to SQS; that should kick off the Matillion jobs as expected.
I have pointed both Lambda functions at the same SQS queue, so whenever I insert a file into either bucket, the corresponding Lambda fires and starts the Matillion job.
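For reference, a minimal sketch of what each Lambda handler might look like (the queue URL is a placeholder, and the message body here is only illustrative — Matillion's SQS listener expects its own message format with the project and job details):

    import json
    import boto3

    sqs = boto3.client("sqs")

    # Placeholder URL; substitute the queue your Matillion listener polls.
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/matillion-queue"

    def lambda_handler(event, context):
        # S3 can batch several records into one event; forward each one.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps({"bucket": bucket, "key": key}),
            )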

My problem is this: since I have two buckets, and the CSV files may be inserted at two different times, I have to wait until both files are loaded into the Redshift staging tables before I can do the ETL transformations and joins and push the result into the prod table.

So how am I supposed to handle that? In other words, how can I make sure that data has arrived in both buckets before running the transformation and the other operations?

Any help would be appreciated. Thanks.

4 Community Answers

Matillion Agent  

Laura Malins —

Hi Kulasangar

The easiest way to control this would be to have a control table held in your Redshift database. It can hold the names of your two files and an indicator of whether each has been loaded. After loading a file, your Matillion job sets the indicator to say that file has been loaded. The next step is to run an IF to check whether both files have been loaded. If so, run your transform and reset the control table at the end. If not, just end the job with success.
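For illustration, after only the first file has arrived, the control table would look something like this (file names are made up):

    filename     indicator
    -----------  ---------
    file_a.csv   true
    file_b.csv   false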

I hope that makes sense.

Thanks
Laura


Kulasangar Gowrisangar —

Hi Laura,

Thank you for the quick response.

So the control table is just another table created with the "Create/Replace Table" component in an Orchestration job?

If so, I'll have to store the names in that table as they come in. But what do you mean by an indicator? Could you point me to the right component for that? Sorry, I'm still a newbie to this.

Thanks.


Kulasangar Gowrisangar —

In addition to that, I'll actually be loading multiple files within those two folders. Could I still go with the same procedure?

Regards.


Matillion Agent  

Laura Malins —

Hi Kulasangar

Yes, it will work with multiple files – just use the bucket name rather than the filename. You can pass this through in your Lambda.

First of all, you need to create a table in Redshift with the columns below:
Bucketname varchar(200)
Indicator boolean

Then populate this with the names of your buckets and set the indicator to false.
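As a sketch, assuming Matillion's Python component and its context.cursor() database hook (the table and bucket names here are just examples), the one-off setup might look like:

    # One-off setup: create and seed the control table.
    cursor = context.cursor()
    cursor.execute("""
        create table if not exists load_control (
            bucketname varchar(200),
            indicator  boolean
        )
    """)
    cursor.execute(
        "insert into load_control values ('bucket-a', false), ('bucket-b', false)"
    )
    # After each successful load, the per-bucket job flips its flag, e.g.:
    #   update load_control set indicator = true where bucketname = 'bucket-a'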

Now create your Orchestration jobs to load the files into Redshift. After each file loads, run a Transformation job which sets the indicator column to true for the bucket you've just loaded data from.

Then check the table; you can do this using a database cursor in the Python component (see the sketch below). Use this to set a variable that says whether you want to run your Transformation on everything.

Check the variable in the IF component.

Run the Transform if required.
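A sketch of that Python component check, again assuming context.cursor() and Matillion's context.updateVariable() helper (run_transform is an example job variable you would declare on the job first):

    # Count buckets that have not been loaded yet.
    cursor = context.cursor()
    cursor.execute("select count(*) from load_control where indicator = false")
    pending = cursor.fetchone()[0]

    # The downstream IF component branches on this job variable.
    context.updateVariable("run_transform", "Y" if pending == 0 else "N")

    # If the transform runs, its final step should reset the control table:
    #   update load_control set indicator = false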

Thanks
Laura
