Data Lineage

Data Lineage is an Enterprise Feature that allows users to track columns across transformations in order to understand how that column has been affected by components up to the point of inspection.

Consider the simple Transformation Job shown below. At any point, we could sample a component and see the current values that each columns holds. However, identifying where that column has come from and the journey it has taken can be difficult to determine, especially if one is not the creator of the job.

By selecting a component and browsing to the Lineage tab, we can attain the lineage for the data as per the conclusion of this component.

In the example below, we have checked the lineage of the Calculator component which yields 6 columns. Clicking one of the columns, projectid, we can see the lineage of this column on the right-hand side. Using the arrows to expand the tree, we can follow the PROJECTID column through different components. Each entry is given in the format:
 
<ComponentName or TableName>.<ColumnName>

We see that this column originated in the table DOC_TBL and has been brought in by a Table Input component, passed through a Convert Type component and finally ended up in the Calculator component that we are inspecting.

Clicking on any part of this tree will highlight the respective component on the job canvas. Without lineage, we can only guess as to where a column comes from without inspecting each component in turn, tracing it back - a substantial task for large and branched workflows.

For contrast, below we check the lineage for the summed column on the same component. In this case, we find that SUMMED did not even exist as a column before the Calculator component in our workflow. Usefully, data lineage still traces the origins of this column, knowing that the Calculator component used the KEY and ID columns in the creation of SUMMED.


If we click the View SQL button, that same tree is expanded in terms of what each component is doing in SQL terms. Any SQL that includes the column of interest is highlighted so we can easily understand its Transformation. As summed was the result of a calculation of columns id and key, those columns are highlighted when summed does not yet exist.