Python Script

Python Script Component

Run a Python script.

The script is executed in-process by an interpreter of the user's choice (Jython, Python2 or Python3). Any output written via 'print' statements will appear as the task completion message, and so output should be brief.

While it is valid to handle exceptions within the script using try/except, any uncaught exceptions will cause the component to be marked as failed and its failure link to be followed.
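
As a minimal sketch of the behaviour described above, the script below handles an expected error itself and leaves anything else uncaught. The function name parse_row_count and the sample input are hypothetical, chosen only for illustration.

```python
def parse_row_count(raw_value):
    """Convert text to an int, falling back to 0 when it is not numeric."""
    try:
        return int(raw_value)
    except ValueError:
        # Handled here: the script carries on and the component is not failed.
        print("Could not parse %r, using 0" % (raw_value,))
        return 0


row_count = parse_row_count("not-a-number")  # hypothetical input
print("Row count: %d" % row_count)

# An error that is NOT caught would propagate out of the script, so the
# component would be marked as failed and its failure link followed, e.g.:
# raise RuntimeError("unexpected condition")
```

Only the ValueError is caught; any other exception type raised inside the function would still fail the component.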

You may import any modules from the Python Standard Library. Matillion ETL does not uninstall any customer-installed Python libraries. Matillion ETL runs as a Tomcat user and care must be taken to ensure this user has sufficient access to resources.

AWS Users Note: For Jython and Python2, the 'boto' and 'boto3' APIs are made available to enable interaction with the rest of AWS. The AWS credentials defined in Matillion ETL are automatically made available, therefore it is not recommended (or necessary) to put security keys in the script.

Warning: Calling sys.exit() from a Jython script will shut down Matillion ETL and is to be avoided. Jython scripts can be safely terminated by allowing the script to run to the end or by using quit().


Properties

Property | Setting | Description
Name | Text | The descriptive name for the component.
Script | Text | The Python script to execute.
Interpreter | Select | Choice of interpreter: Jython, Python2 or Python3.
Timeout (Python2 and Python3 Only) | Integer | Timeout in seconds. Defaults to 300. If set, it is advised to choose a value above 1.

Strategy

Runs the Python script, redirecting any output it produces into the task message.


Variables

When run, the Python Script component creates a set of new variables of the same name, type and default value as those listed in the environment variables list. Thus, environment variables can be used directly within the script: the syntax ${variable} is not required, and you may simply use variable.

Since the Python script already contains Python counterparts of the environment variables, take care not to reuse those names for your own variables, especially with a different type.

Note that the Python script variables disappear after the script ends. If you need to push values back to environment variables for use in other components later in the job, use the special 'context' object like so:

context.updateVariable("variable", "new value")

Both arguments are strings that should parse as the target variable type.
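
The following is a small sketch of passing non-string values back, assuming two hypothetical environment variables named row_count (numeric) and load_status (text). The _StubContext class is not part of Matillion ETL; it is a stand-in used only when no real 'context' object exists, so that the sketch also runs outside the component.

```python
try:
    context  # provided by Matillion ETL when the component runs
except NameError:
    class _StubContext(object):
        """Hypothetical stand-in that simply records the updates."""
        def __init__(self):
            self.values = {}

        def updateVariable(self, name, value):
            self.values[name] = value

    context = _StubContext()

# Local names differ from the environment variable names to avoid clashes.
counted_rows = 42
status_text = "complete"

# Both arguments are passed as strings; numbers are converted with str().
context.updateVariable("row_count", str(counted_rows))
context.updateVariable("load_status", status_text)
```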

Database Access (Jython Only)

To access the database defined in the current environment, use the 'cursor' object provided.

cursor = context.cursor()

The cursor object is described in Python DB-API V2. The connection is made automatically for you using the current environment defined in Matillion ETL, and this connection will be closed automatically after the script terminates.

This feature is provided for convenience, and is not designed for retrieving large amounts of data. After executing a query, you should iterate the cursor to retrieve the results one row at a time, and avoid using fetchall() which may lead to out-of-memory issues.
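
The row-at-a-time pattern can be sketched as follows. So that the sketch runs anywhere, an in-memory sqlite3 database stands in for the cursor Matillion ETL provides; the flights columns and sample rows are invented for illustration. Inside the component, the setup lines would be replaced by cursor = context.cursor().

```python
import sqlite3

# Stand-in setup only: inside the component use cursor = context.cursor()
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("create table flights (carrier text, delay integer)")
cursor.executemany(
    "insert into flights values (?, ?)",
    [("AA", 5), ("BA", 12), ("AA", 7)],  # invented sample rows
)

cursor.execute("select carrier, delay from flights")
row_count = 0
total_delay = 0
for row in cursor:  # one row at a time, instead of fetchall()
    row_count += 1
    total_delay += row[1]

print("%d rows, total delay %d" % (row_count, total_delay))

# Needed for the stand-in only; Matillion ETL closes its own connection.
connection.close()
```

Keeping only running totals means memory use does not grow with the number of rows returned.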

If you execute database updates, you should not try to commit or roll back. Transactions are handled for you, either automatically (Auto-commit mode) or manually using the Begin/Commit/Rollback components. This is a change compared to previous versions, so older scripts may still contain commit() calls, which should be removed.


Grid Variables

Similar to Variables, Grid Variables can also be accessed through the Python Script component. Details on using Grid Variables in this manner can be found in the Grid Variables documentation.


Additional Modules (Jython & Python2 Only)

Additional python modules may be installed by running the pip command, e.g.

pip install modulename

Log into the instance with SSH and run the command as root:

sudo pip install modulename

As well as using pip, you may also upload your own modules to the instance. In that case, you must include the location of the modules in the Python search path, and this location must be readable by the 'tomcat' user. For example:

import sys
sys.path.append('/path/to/directory/with/python/modules/and/packages')

Note: Regardless of whether a module is installed with pip or manually, it must not rely on external C modules in order to run successfully on the embedded Jython interpreter. However, such scripts should work on Python2 and Python3.
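
The sketch below illustrates one way to confirm that an uploaded module directory is readable before adding it to the search path. The module name my_helpers and its greet function are hypothetical, and a temporary directory is created only so that the sketch is runnable; on the instance, module_dir would be your own upload location.

```python
import os
import sys
import tempfile

# Hypothetical upload location, created here only to make the sketch runnable.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "my_helpers.py"), "w") as handle:
    handle.write("def greet(name):\n    return 'Hello, ' + name\n")

# Check the current user can read and traverse the directory before using it.
if not os.access(module_dir, os.R_OK | os.X_OK):
    raise IOError("Cannot read module directory: " + module_dir)

if module_dir not in sys.path:
    sys.path.append(module_dir)

import my_helpers  # resolved through the directory appended above

print(my_helpers.greet("Matillion"))
```

Because the script runs as the 'tomcat' user, the os.access check reflects that user's permissions when executed inside the component.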
 


Additional Modules (Python3 Only)

Additional Python3 modules may be installed by running the pip command. Log into the instance with SSH and run the command as root to begin the installation:

sudo yum install python34-pip

Then to install the modules:

sudo pip-3.4 install <modulename>

Task Cancellation (Jython Only)

Scripts are never forcibly killed. If you want a long running script to respond to task cancellation, the script must check for cancellation and act accordingly, ensuring any resources are cleaned up. Cancellation can be checked by querying the context:

context.isCancelled()

Since the cancellation is being handled within the script, the component will still end successfully, because no uncaught exception has been thrown. In order for cancellation to also mark the script task as a failure, raise an exception. For example:

if context.isCancelled():
    raise Exception("Script cancelled during loop")
    

Task Cancellation (Python2 and Python3 Only)

If a script runs longer than its timeout (in seconds) it is forcibly killed - similar to the BASH component.



Example 1

This example moves all of the objects within an S3 bucket into another S3 bucket. You may wish to do this following an S3 Load, to ensure those same files are not loaded again by subsequent runs of this same job. The target bucket could also use Amazon Glacier to reduce the cost of storing the already loaded files.

The Python script imports the 'boto' module and uses it to move the files. In fact, the script copies the objects to the other bucket, and then removes the source object. A similar script could instead rename the objects and leave them within the same bucket. A list of available variables is given on the left of the window, and used in code written on the right. The script can be executed by clicking 'Run' as though the component had been run on the Matillion UI. The output of the code is shown beneath after running.
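
The original script is not reproduced here. As a rough sketch of the copy-then-delete approach, the function below uses the boto3 resource interface rather than 'boto', which is a substitution made for this illustration. The function name move_all_objects and the bucket names are hypothetical, and credentials are assumed to come from the environment as described earlier.

```python
def move_all_objects(s3, source_bucket, target_bucket):
    """Copy every object to target_bucket, then delete it from source_bucket."""
    moved = []
    for summary in s3.Bucket(source_bucket).objects.all():
        copy_source = {"Bucket": source_bucket, "Key": summary.key}
        s3.Object(target_bucket, summary.key).copy_from(CopySource=copy_source)
        summary.delete()  # only reached if the copy did not raise
        moved.append(summary.key)
    return moved


# Inside the component (bucket names are placeholders):
# import boto3
# moved = move_all_objects(boto3.resource("s3"), "source-bucket", "target-bucket")
# print("Moved %d objects" % len(moved))
```

Passing the S3 resource in as an argument keeps the function free of credentials and makes it easy to exercise with a test double.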


Example 2

The example script below shows a database query which retrieves a single (aggregate) row of data, and stores the result into a Variable for use elsewhere in Matillion ETL.

cursor = context.cursor()
cursor.execute("select count(*) from flights")
result = cursor.fetchone()

print result

context.updateVariable("total_count", str(result[0]))
