Loading Task Information

Overview

It can be useful to take the task information from Matillion ETL and load it into a table, and/or to preserve it in file storage, allowing the user to run diagnostic or logistical analysis of the jobs that have been run. Task information is available within Matillion through Project → Task History (see the Task History documentation).

Using the Matillion API, the task information can be returned as a JSON string. Alternatively, the API Query Component in Matillion can be used to make the necessary API call and load the information into a table.

For more information, see the documentation on the Matillion API, API Examples and API Query Component.
 

Task Information API

The Matillion API can be used to retrieve task information as a JSON string. The format of the call is as follows:

curl -X GET "http://<InstanceURL>/rest/v0/tasks?groupName=<GroupName>&projectName=<ProjectName>&versionName=<VersionName>&environmentName=<EnvName>&jobName=<jobName>"

For example:

curl -X GET "http://matillioninstance.server.com/rest/v0/tasks?groupName=MainGroup&projectName=BookKeeping&versionName=default&environmentName=Live&jobName=Diagnostics"

The result is a JSON string describing the currently running tasks under the specified Group, Project, Version, Environment and Job. Query parameters can be omitted to broaden the search. For example, to gather any tasks under the project from the above example:

curl -X GET "http://matillioninstance.server.com/rest/v0/tasks?groupName=MainGroup&projectName=BookKeeping"

Note that this returns only currently running jobs. To retrieve the task history, we need to search for jobs that are not currently running by adding 'running=false'. The following API call is amended to return the task history.

curl -X GET "http://matillioninstance.server.com/rest/v0/tasks?groupName=MainGroup&projectName=BookKeeping&running=false"

This will return the task history as JSON data, but it will not create a file or import the information into a table. To do that, we can make use of the API Query Component.
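
To script this outside of Matillion, the same call can be made and its output preserved directly. Below is a minimal sketch in Python using the requests library, assuming the example instance URL from above and placeholder credentials (the Matillion API uses HTTP basic authentication, as covered later in this article):

import json
import requests

# Assumed values - substitute your own instance URL and credentials.
url = "http://matillioninstance.server.com/rest/v0/tasks"
params = {"groupName": "MainGroup", "projectName": "BookKeeping", "running": "false"}

response = requests.get(url, params=params, auth=("user", "password"))
response.raise_for_status()

# Preserve the raw task history in file storage for later analysis.
with open("task_history.json", "w") as f:
    json.dump(response.json(), f, indent=2)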
 

Example

A job can be built where the API Query Component is used to import Matillion task information, filter out any failed jobs, convert the dates into a more readable format and finally save the remaining entries into a table that is then stored in an S3 bucket. A job to this effect is available for download at the bottom of this page in the form of an importable JSON.

The API Query Component is set up to use the table 'task_table' that has been created in the Create/Replace Table Component. The component is set up to use the Matillion API included in Matillion ETL and the Data Source is specified as 'Run History Details' (details can be found through Project → Manage API Profiles → Matillion API).



Note that using the Matillion API in this way requires you to provide the authscheme, username and password for this API call. To do so, add the following entries to the component's 'Connection Options' property:

1. authscheme - BASIC
2. user - <your Matillion ETL instance username>
3. password - <your Matillion ETL instance password>
 
Extracting Task History Via API

This component uses an API call equivalent to the one below:

http://docker.dc.matillion.com:32815/rest/v0/tasks?projectName=DelMe&running=false

This delivers the task history in JSON form, such as the snippet below:

      {
        "type": "VALIDATE_ORCHESTRATION",
        "jobID": 1057,
        "jobName": "Regression Pack",
        "jobRevision": 2,
        "jobTimestamp": 1485246350127,
        "componentID": 1062,
        "componentName": "Orc/Trans Regression suite",
        "state": "SUCCESS",
        "rowCount": -1,
        "startTime": 1485246445284,
        "endTime": 1485246445313,
        "message": ""
      },
      {
        "type": "VALIDATE_ORCHESTRATION",
        "jobID": 3511,
        "jobName": "Run Tests",
        "jobRevision": 2,
        "jobTimestamp": 1485246351156,
        "componentID": 3540,
        "componentName": "Start 0",
        "state": "SUCCESS",
        "rowCount": -1,
        "startTime": 1485246445327,
        "endTime": 1485246445338,
        "message": ""
      },

The JSON is not particularly accessible when looking over a large number of jobs by eye. Thankfully, the API Query Component will take this JSON and reformat it into a table according to the specified profile. These profiles can be found through Project → Manage API Profiles.

As indicated by our component properties, we are using the Matillion API and the 'Run History Details' RSD contained within. This RSD defines the conversion of the JSON data into a table format. For more information on API Profiles, see the API documentation.
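
As a rough illustration of the flattening that such a profile performs, the Python sketch below turns the JSON saved by the earlier sketch into a CSV table, assuming the endpoint returns a flat list of task objects as in the snippet above. The column selection here is hypothetical; the real mapping is defined by the 'Run History Details' RSD.

import csv
import json

with open("task_history.json") as f:
    tasks = json.load(f)

# Hypothetical column selection; the RSD defines the actual mapping.
columns = ["jobID", "jobName", "componentName", "state", "startTime", "endTime"]

with open("task_table.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(tasks)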

Manipulating The Task History

The API Query Component stores the job data in a table that a Transformation job can use. In the Transformation job, the data is filtered to keep only successful runs, then passed through two SQL Components that reformat the start and end times. Finally, these data streams are joined and the desired columns are selected for output to a table.

The task_table contains the output of the API Query Component: the task data taken from the JSON string returned by the Matillion API.

It can be seen that several jobs have failed and, for our report, we are concentrating on the successful runs. The best way to remove the failed rows is to add a Filter Component set up to filter out any rows where the 'state' column is not 'SUCCESS'.

This filter returns only successful rows from task_table.
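
For reference, the same filtering applied to the tasks list loaded in the previous sketch is a one-line test on the 'state' field:

# Keep only tasks whose state is SUCCESS, mirroring the Filter Component.
successful = [t for t in tasks if t["state"] == "SUCCESS"]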

 
Using SQL To Reformat Task History

Now we can use the SQL Component to change the dates to a readable format using two SQL statements, one in each component:

SELECT TIMESTAMP 'epoch' + CONVERT(BIGINT, starttime / 1000) * INTERVAL '1 second' AS startdate, * FROM task_table
SELECT TIMESTAMP 'epoch' + CONVERT(BIGINT, endtime / 1000) * INTERVAL '1 second' AS enddate, * FROM task_table
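
As a quick check of what these expressions produce, the startTime of 1485246445284 from the JSON snippet above converts as follows (Python used here just for the arithmetic):

from datetime import datetime, timezone

# 1485246445284 ms / 1000 = 1485246445 s since the epoch.
print(datetime.fromtimestamp(1485246445284 / 1000, tz=timezone.utc))
# 2017-01-24 08:27:25.284000+00:00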

The Join Component is then used to take data from these two sources (termed A and B in the Join Component) and return a single table.

Joining two Task History tables.

The two tables are joined on the id and componentname columns, ensuring that rows from A and B describe the same task so that the enddate from B can be matched to the correct row in A. The newly formatted columns 'startdate' and 'enddate' are mapped in place of the old timestamps 'starttime' and 'endtime' respectively, and any unwanted columns can be omitted.
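
Conceptually, the join does the equivalent of the following sketch, where a_rows and b_rows are hypothetical sample rows standing in for the two input streams (the key column names follow the text above and may differ in a real task_table):

# Hypothetical sample rows standing in for input streams A and B.
a_rows = [{"id": 1062, "componentname": "Orc/Trans Regression suite", "startdate": "2017-01-24 08:27:25"}]
b_rows = [{"id": 1062, "componentname": "Orc/Trans Regression suite", "enddate": "2017-01-24 08:27:25"}]

# Index B by its join key, then attach B's enddate to the matching row in A.
b_lookup = {(r["id"], r["componentname"]): r["enddate"] for r in b_rows}
joined = [dict(r, enddate=b_lookup[(r["id"], r["componentname"])]) for r in a_rows]
print(joined)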

This data is then written to task_table using the Rewrite Table Component. The data can be checked using the Sample tab of either the Join or Rewrite Table Component.

From the sample, the starttime and endtime appear to have been reformatted correctly and the desired columns are present. This data will remain in task_table, which will be written to an S3 bucket by the S3 Unload Component in the Orchestration job.
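
Purely for illustration, the equivalent of that final step performed from outside Matillion might look like the boto3 sketch below, uploading the flattened file produced earlier (the bucket name is hypothetical; inside the job itself, the S3 Unload Component handles this):

import boto3

# Upload the flattened table produced earlier; bucket name is hypothetical.
s3 = boto3.client("s3")
s3.upload_file("task_table.csv", "example-task-history-bucket", "task_table.csv")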