
Deploy Job only if it has been altered

Hi,

We deploy Jobs to PROD in a programmatic way using Matillion API. We use a Git repo and we deploy from release branch. When we deploy we check the Job names that are in release branch and the Job names that are in PROD and 1) if there are Jobs in PROD which are not in release branch we delete them from PROD and 2) we deploy all Jobs in release branch to PROD (some might be new Jobs, some might be edited Jobs and others can be exactly the same as in PROD but we deploy them anyway).
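
In pseudo-Python, the logic of that deployment step looks roughly like this (delete_job and import_job are placeholders for our wrappers around the Matillion API, not real code):

def sync_release_to_prod(release_jobs, prod_job_names, delete_job, import_job):
    """release_jobs: dict {job_name: exported JSON} from the release branch.
    prod_job_names: set of job names currently in PROD."""
    # 1) Remove PROD jobs that no longer exist in the release branch.
    for name in prod_job_names - release_jobs.keys():
        delete_job(name)
    # 2) (Re)deploy every job in the release branch, even if it is unchanged.
    for name, job_json in release_jobs.items():
        import_job(name, job_json)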

The problem is that Matillion doesn't handle it well when we deploy a Job that is running at the same time, and therefore the solution above isn't good enough. Sometimes we have altered only 1 Job and need to deploy it, but we actually deploy 10 or 15 Jobs (all that are stored in the same Git repo). It worked well at the beginning, when we didn't have that many Jobs, but not anymore.

Looking at the commit history of the Git repo to find only the Jobs that have been altered is a bit of overkill, because every time we release we would have to know the commit used in the last release (it can be the previous commit in the release branch, or there might have been several commits/pull requests/etc. in the release branch since the last release - when we are releasing we don't know when the last release was!).

Another approach I thought about was to compare the Jobs/JSON files, i.e., the one in the release branch when we are releasing and the one in PROD, so that I POST it to PROD only if they are different. But it doesn't work because the file in Git was developed in the DEV environment and committed to Git from there, while the one in PROD got different IDs when it was uploaded to PROD. Thus they are never exactly the same.
I thought I would find a method in Matillion API to compare 2 Jobs so that I could deploy only the ones that are different but I didn't find such a possibility.

Any suggestion from your side?

Tiago Silva

12 Community Answers

Matillion Agent  

Damian Chan —

Hello Tiago,

Is there a time slot where you’ll know that none of the jobs are running? If there is, you could use the API to export your current project out from the PROD env in the Matillion instance. Once that’s finished you can, again, use the API to delete all the jobs in the PROD Matillion instance and then finally import your project back in using the Git repo as your source. That way you’ll know that:

1) You’ve definitely got the new jobs inside Matillion.
2) You have a backup of what was previously running in the PROD Matillion instance.
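
In Python, the backup step could look something like this (the project-level export URL below just mirrors the job-level pattern and is an assumption on my part - please check the v1 API map for the exact paths, including the delete and import calls):

import requests

BASE = "http://<instance address>/rest/v1"
AUTH = ("api-user", "api-password")   # placeholder credentials; the v1 API uses HTTP basic auth

# 1) Back up what is currently in the PROD Matillion instance.
backup = requests.get(BASE + "/group/name/MyGroup/project/name/MyProject/export", auth=AUTH)
backup.raise_for_status()
with open("prod_backup.json", "w") as f:
    f.write(backup.text)

# 2) Delete the jobs and 3) re-import the project from your Git checkout via the
#    corresponding delete/import endpoints (again, see the API map for the exact paths).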

Best Regards,
Damian


Tiago Silva —

Hi Damian,
There is no time window when no jobs are running. There was at the beginning, but not anymore with the number of jobs we have in PROD.
I would like to either be able to deploy jobs that are currently running, or be able to identify whether a Job I have in the Git repo is identical to the one in PROD, in which case I wouldn't deploy it.

/TS


Matillion Agent  

Damian Chan —

Hello Tiago,

If you’re using the Matillion V1 API then you can check for a running task. Find out more in the link below.

https://redshiftsupport.matillion.com/customer/portal/articles/2920263-api-v1-map#header7
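
For example, you could wrap your deployment step in a small wait loop (the exact running-task endpoint and its response format are described in the article above; get_running_task_count below is only a placeholder for that call):

import time

def wait_until_idle(get_running_task_count, poll_seconds=30):
    """Block until the v1 API reports no running tasks, then it is safe to deploy."""
    while get_running_task_count() > 0:
        time.sleep(poll_seconds)

# wait_until_idle(my_running_task_check)   # ...then run the import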

I’m still not entirely certain what your Git repo structure looks like, but you can try experimenting with the above. Hopefully that’ll help you achieve what you want.

Best Regards,
Damian


Tiago Silva —

Hi,

My Git repo contains all jobs that populate the dimensional model in our DW. This means that when I change the job that populates table DIM_A and I want to deploy it to PROD, I will also deploy all jobs that populate all other dimension and fact tables, of which there are hundreds.
Even if I check whether a Job is running and wait for it to finish, it will take forever to deploy a single change to one job, because I am actually deploying 100 or 200 jobs. Do you get my point?

The solution here is to deploy only the jobs that were actually changed and not all the jobs in the Git branch. Today the solution works this way: "When we deploy we check the Job names that are in release branch and the Job names that are in PROD and 1) if there are Jobs in PROD which are not in release branch we delete them from PROD and 2) we deploy all Jobs in release branch to PROD (some might be new Jobs, some might be edited Jobs and others can be exactly the same as in PROD but we deploy them anyway)."
If I export a project and import it, as you suggested, I am going to have the same problem...

/TS


Matillion Agent  

Laura Malins —

Hi Tiago

I understand where you’re coming from here.

You can export, commit to Git and import at a job level, so that might be what you require. An example command is:
http://<instance address>/rest/v1/group/name/<>/project/name/<>/version/name/<>/job/name/<>/export
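
For example, a small script can pull a single job export into your Git working copy (the instance address, credentials and group/project/version/job names below are placeholders):

import requests

BASE = "http://<instance address>/rest/v1"
AUTH = ("api-user", "api-password")   # placeholder credentials

def export_job(group, project, version, job, out_path):
    """Export one job via the v1 API and write it to a file ready to commit."""
    url = (BASE + "/group/name/" + group + "/project/name/" + project +
           "/version/name/" + version + "/job/name/" + job + "/export")
    resp = requests.get(url, auth=AUTH)
    resp.raise_for_status()
    with open(out_path, "w") as f:
        f.write(resp.text)

# export_job("MyGroup", "MyProject", "default", "dim_a_load", "jobs/dim_a_load.json")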

I know we’re in the process of designing tighter integration with Git so I will feed this back to the product team and see what can be done.

Thanks
Laura


Tiago Silva —

Hi,

We do everything at job level.

Imagine this scenario:
1) We have jobs A, B, C and D in PROD. This implies our release branch contains jobs A, B, C and D.
2) Developer1 changes job A and commits it to master branch.
3) Developer2 changes job B and commits it to master branch.
4) We start a new release, i.e., we merge master branch into release branch (2 commits: jobs A and B) and right before deploying what's in release branch to PROD we find there is a bug in job C.
5) Developer1 changes job C and commits it to master branch.
6) We start a new release, i.e., we merge master branch into release branch (1 commit: job C) and we deploy what's in release branch to PROD (jobs A, B, C and D).

In this scenario we deployed job D and we didn't need to. This is the problem.
Why did we deploy job D? Because we don't know which jobs were changed since our last release. If we look at our last commit in release branch, only job C was changed but we actually need to deploy jobs A, B and C. Right?

Regards,
Tiago Silva


Matillion Agent  

Laura Malins —

Hi Tiago

Yes I hear you. Unfortunately unless you maintain a list of changed jobs you will have this situation.

As I said, I’ll have a chat with the product team to see what we can do here.

Thanks
Laura


Tiago Silva —

Hi,
Then the only feasible solution is to deploy only when no jobs are running.
This is a step back where we have to think twice before deploying even a very small change. It goes against the CI principles that I thought you supported. For example, we are not able to promote changes to the TEST environment automatically with a CI/CD pipeline every time a developer commits in the source control system, because there might be another job running in the TEST env that would be deployed too even though it wasn't changed.

Regards,
Tiago Silva


Matillion Agent  

Ian Funnell —

Hi Tiago,

Agreed that you will be safest to deploy when no jobs are running.

Matillion’s version control mechanism will never be identical to the CI mechanisms that you might find in an API (like API Gateway for example), the fundamental reason being that data transformation is not stateless like an API.

For example, if you make a change to a job which populates a Dimension or a Fact, you need to also run a data correction to fix all the historical data which is now incorrect because it does not adhere to the new business rules.

In CI mode, how would the logic for that data correction get generated, and when should it be run?

Best regards,
Ian


Tiago Silva —

Hi Ian,

If there is a data correction that needs to be run, it can be either a SQL script or the Matillion Job(s) will refill the table(s). In the latter case the developer just needs to commit the Matillion Job whenever he/she is done with his/her changes.
If there is a SQL script (or more than one), they are committed first to a Git repo used for Snowflake. Once they are committed they are deployed to TEST automatically (Continuous Integration). If everything went fine, the developer can commit the Matillion Job afterwards and it will be deployed to TEST automatically (Continuous Integration) as well.

All this is developed in the DEV environment first. It is committed afterwards, and the developer only needs to pay attention to the order in which he/she commits his/her changes, which is natural since it is the same order he/she used to develop them in the DEV env.
When we release to PROD, the same deployment order will be followed (automatically) and the same thing that happened in TEST will happen in PROD.

What I wrote above is a very interesting discussion, no doubt. And I am more than happy to have it with you and your team if you want to.
However, we are straying from the real problem here. As I wrote before:
"Imagine this scenario:
1) We have jobs A, B, C and D in PROD. This implies our release branch contains jobs A, B, C and D.
2) Developer1 changes job A and commits it to master branch.
3) Developer2 changes job B and commits it to master branch.
4) We start a new release, i.e., we merge master branch into release branch (2 commits: jobs A and B) and right before deploying what's in release branch to PROD we find there is a bug in job C.
5) Developer1 changes job C and commits it to master branch.
6) We start a new release, i.e., we merge master branch into release branch (1 commit: job C) and we deploy what's in release branch to PROD (jobs A, B, C and D).

In this scenario we deployed job D and we didn't need to. This is the problem."

Now imagine that instead of job D I have jobs D1, D2, ..., D1000.
I will have to deploy 1000 jobs that were not altered, and the worst part is that they cannot be running when I deploy them. So far none of the suggestions I got from Matillion solve this problem. Your official suggestion (on your website) is to export a project from DEV, for example, and import it in TEST. This implies that no jobs from that project can be running at the time of deployment. Is this a good/feasible solution?!

/TS


Matillion Agent  

Ian Funnell —

Hi Tiago,

Great to hear you have a good solution for the data correction problem.

In which case I think we’re back to the beginning of the discussion again :-)

If your release branch contains jobs A, B, C and D, and you promote them all, Matillion will simply try to import them all.

In contrast, Matillion does have job-level import/export (example: http://<instance address>/rest/v1/group/name/<>/project/name/<>/version/name/<>/job/name/<>/export as Laura mentioned) so the way to achieve this with Matillion is to store your exports at job level and maintain a list of which jobs need to be imported.
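
For example, the deployment script could read a plain manifest of changed jobs and import only those (import_job is a placeholder for a wrapper around the job-level import endpoint - see the API map for the exact call):

from pathlib import Path

def import_changed_jobs(manifest_path, jobs_dir, import_job):
    """manifest_path: text file listing one changed job name per line.
    import_job: callable that POSTs a job export back into the target project."""
    for name in Path(manifest_path).read_text().splitlines():
        name = name.strip()
        if not name:
            continue
        job_json = (Path(jobs_dir) / (name + ".json")).read_text()
        import_job(name, job_json)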

Best regards,
Ian


Mike Robinson —

Tiago,

I see two possible paths, and possibly a combo of the two:

1. Have your deployment script do a git diff between the last tagged release and this one, flagging each job export file for deployment if it has changed (see the sketch after this list).

2. Instead of diffing the entire JSON file of a job export, diff a known key (or set of keys). I'm considering something like this and will post code snippets if I end up trying it.
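
Roughly what I have in mind for option 1, plus a first stab at the key filtering from option 2 (untested; the jobs/ directory layout, the release tag and the ID-like key names are assumptions about your setup):

import json
import subprocess

def changed_job_files(last_release_tag, jobs_dir="jobs"):
    """Job export files that differ between the last tagged release and HEAD."""
    out = subprocess.run(
        ["git", "diff", "--name-only", last_release_tag, "HEAD", "--", jobs_dir],
        capture_output=True, text=True, check=True)
    return [p for p in out.stdout.splitlines() if p.endswith(".json")]

def strip_volatile_keys(obj, volatile=("id",)):
    """Recursively drop keys (e.g. environment-specific IDs) before comparing."""
    if isinstance(obj, dict):
        return {k: strip_volatile_keys(v, volatile)
                for k, v in obj.items() if k not in volatile}
    if isinstance(obj, list):
        return [strip_volatile_keys(v, volatile) for v in obj]
    return obj

def same_job(export_a, export_b):
    """True if two job exports match once the volatile keys are ignored."""
    return strip_volatile_keys(json.loads(export_a)) == strip_volatile_keys(json.loads(export_b))

# for path in changed_job_files("release-1.2.3"):
#     ...deploy only these files to PROD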

Good luck!
