
Database query optimisation

I've recently experienced very slow load times when using the database query component.

Source: SQL Server running on AWS
Target: Redshift

I'm wondering what the potential bottlenecks are. I presume that the Matillion EC2 instance is responsible for the grunt work of getting the data from the remote server to S3, and then copying the data into Redshift itself, although I'm unsure of how much legwork, if any, Redshift must do to receive data via the COPY command.

What kind of protocol is used to stream the data from the remote server to S3?
What metrics should I be looking at in the AWS console to determine where the bottleneck lies? I can see CPU on the EC2 is at 100% for less than a quarter of the time it takes to load into Redshift, so there appears to be spare capacity there.
How would I go about determining if the problem lies on the remote server's end?

Cheers

4 Community Answers

Matillion Agent  

David Lipowitz —

Hi Alastair,

Thanks for writing in. In answer to your questions:

What kind of protocol is used to stream the data from the remote server to S3?

HTTPS.

What metrics should I be looking at in the AWS console to determine where the bottleneck lies?

One place to start is by exporting the “Time Taken to Load” and “Time Taken to Stage” variables as described here: Database Query Component — Variable Exports. Once done, you can print those values (perhaps using a Python Script Component) and determine whether staging the data or loading it is taking the majority of the time.
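The exported values can then be compared in a Python Script Component along these lines. This is only a sketch: the variable names `time_to_stage` and `time_to_load` are assumptions for whatever names you map the exports to, and the numeric values here are stand-in examples.

```python
# Sketch of a Python Script Component body. In Matillion these two values
# would come from the "Time Taken to Stage" / "Time Taken to Load" exports;
# the hard-coded numbers below are example stand-ins.
time_to_stage = 542.0  # seconds spent pulling from the source and landing in S3
time_to_load = 48.0    # seconds spent COPYing from S3 into Redshift

total = time_to_stage + time_to_load
stage_pct = 100.0 * time_to_stage / total
print("Staging: %.1f%% of total (%.0fs of %.0fs)" % (stage_pct, time_to_stage, total))

if stage_pct > 75:
    print("Bottleneck is likely the source query or the network to the source.")
else:
    print("Bottleneck is likely the Redshift COPY/load step.")
```

With these example numbers, staging dominates, which would point the investigation at the source system rather than Redshift.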

How would I go about determining if the problem lies on the remote server’s end?

If the suggestion above indicates that the “Time Taken to Stage” is the bulk of the time, then I would ask your source system’s DBA if they can help optimize the query that’s executing there.

FYI, Redshift can sometimes take a very long time to load data if there are a large number of columns. That’s because our default is to update compression and optimizer stats for the data that’s been loaded, and each column adds time to this total. One suggestion would be to disable compression and stat updates; you can do this by setting the “Comp Update” and “Stat Update” Load Options to “Off” in the Database Query component’s properties.
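For reference, those Load Options correspond to the COMPUPDATE and STATUPDATE parameters of Redshift's COPY command. A manually issued COPY with both disabled would look roughly like the statement constructed below; the table, bucket, and IAM role names are placeholders, not values from your environment.

```python
# Sketch: the kind of Redshift COPY a load corresponds to, with compression
# analysis and optimizer-statistics updates disabled. All names are placeholders.
table = "staging_table"                                     # hypothetical
s3_path = "s3://my-bucket/matillion/stage/"                 # hypothetical
iam_role = "arn:aws:iam::123456789012:role/redshift-copy"   # hypothetical

copy_sql = (
    "COPY %s FROM '%s'\n"
    "IAM_ROLE '%s'\n"
    "COMPUPDATE OFF STATUPDATE OFF;" % (table, s3_path, iam_role)
)
print(copy_sql)
```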

Hope that helps and please let us know if you have any additional questions on this or anything else Matillion-related.

Best Regards,
Dave


Alastair Munro —

Hi David, thanks for this excellent response.

I've created a process to monitor the time to stage and load, and it seems that staging takes the vast majority (90%+) of the time.

Perhaps this is a bit out of your scope, but I'll ask anyway: When requesting data via HTTPS, is it correct to say that both the sender and receiver do the same amount of work? Sort of like "for every byte uploaded there's a byte downloaded".

Therefore if there's spare CPU and network capacity on the receiving end, the issue should lie on the sender's end or somewhere in between?


Matillion Agent  

David Lipowitz —

Hi Alastair,

is it correct to say that both the sender and receiver do the same amount of work?

Yes, every packet that’s transmitted from the source is received by Matillion, which then streams those results into S3. That entire series of events encompasses the “Time Taken to Stage” metric. “Time Taken to Load” represents the time it takes Redshift to load the file landed in S3 into the cluster.

My guess would be that the query being executed by Matillion on the source system is taking up the bulk of the time here. Are your source system queries using any sort of WHERE clause? If so, and if it hasn't been done already, there may be an opportunity to optimize those queries by indexing the column being filtered on. For example, if you're executing the following against the source:

SELECT *
FROM some_table
WHERE last_updated_date > '2019-01-01'

then there may be an opportunity to speed up the query on the source system by indexing the last_updated_date column.
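One way to confirm the query itself is the slow part is to time it (execute plus full fetch) from a machine close to the source, which takes the wide-area network out of the picture. Below is a sketch of such a check; it uses an in-memory SQLite database as a stand-in for the real SQL Server connection, which in practice you would open with your usual driver.

```python
import sqlite3
import time

def time_source_query(cursor, sql):
    """Return (row_count, seconds) for executing and fully fetching a query."""
    start = time.perf_counter()
    cursor.execute(sql)
    rows = cursor.fetchall()
    return len(rows), time.perf_counter() - start

# Demo against an in-memory SQLite database as a stand-in for the real source.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE some_table (id INTEGER, last_updated_date TEXT)")
cur.executemany(
    "INSERT INTO some_table VALUES (?, ?)",
    [(i, "2019-01-%02d" % (i % 28 + 1)) for i in range(1000)],
)

n, secs = time_source_query(
    cur, "SELECT * FROM some_table WHERE last_updated_date > '2019-01-15'"
)
print("%d rows in %.4fs" % (n, secs))
```

If this runs quickly near the source but Matillion's "Time Taken to Stage" is still large, the network between the source and AWS becomes the prime suspect.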

Therefore if there’s spare CPU and network capacity on the receiving end, the issue should lie on the sender’s end or somewhere in between?

You raise another valid possibility here. If the source system is in an on-premises data center, and the network bandwidth between that data center and AWS (and therefore Matillion) is very narrow, then that can also increase the “Time Taken to Stage.” Note that the bandwidth between Matillion and S3 is very high so I believe it unlikely that this leg of the network is the bottleneck.
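A quick back-of-envelope check can tell you whether the network explanation is plausible: divide the volume staged by the observed "Time Taken to Stage" to get an effective throughput, then compare that against the known uplink capacity. The data size and timing below are assumed example numbers.

```python
# Back-of-envelope throughput check; both inputs are hypothetical examples.
data_gb = 25.0          # volume staged, in GB
stage_seconds = 5400.0  # observed "Time Taken to Stage"

throughput_mbps = data_gb * 8 * 1000 / stage_seconds  # megabits per second
print("Effective staging throughput: %.1f Mbit/s" % throughput_mbps)
# If this is close to the on-premises uplink capacity, the network is the
# bottleneck; if it is far below capacity, look at the source query instead.
```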

That said, I believe the easiest part of this to fix is the source system’s query execution performance, so I would recommend investigating there first.

Hope that gives you a direction to go in and please let us know if you need more support here.

Best Regards,
Dave


Alastair Munro —

Fantastic response David, much appreciated.
