Name is required.
Email address is required.
Invalid email address
Answer is required.
Exceeding max length of 5KB

Remove duplicates before storing into staging tables


Is it possible to, remove duplicates based on few columns( ie: when the column has a null or empty value should be removed) before storing it into the staging tables?

So basically I"m using an S3 Load to insert data from S3 bucket into my staging table. Before I store it into my Redshift table, I want to do the eliminate the duplicate records based on some conditions.

Is it possible to do that before storing it (ie: before doing the S3 Load)?

I tried to eliminate the records by using the DELETE command in an SQL component as a transformation job. But then seems like Matillion doesn't support DELETE commands if I"m not mistaken.

Any help would be appreciated.


3 Community Answers

Kulasangar Gowrisangar —

Thanks Laura, understood!

Matillion Agent  

Laura Malins —

Hi Kulasangar

To do this you would need to stage the data into a staging table in Redshift, then use a Transformation job to remove the duplicates. I can recommend using the DISTINCT component. Then write the data to a new table. You can then delete your initial staging table if you require.


Matillion Agent  

Kalyan Arangam —

Hi Kulasangar,

As you already know, we cannot process duplicates when loading from S3.

I would go with Laura’s suggestion of loading the data into Redshift and then using a transformation job to remove duplicates.

We also think you should look into using some window-functions or Ranking functions to associate a rank based on your criteria. Then you can filter rows where RANK=1, assuming that’s what you are after.

Hope that makes sense.


Post Your Community Answer

To add an answer please login