Remove duplicates before storing into staging tables
Is it possible to, remove duplicates based on few columns( ie: when the column has a null or empty value should be removed) before storing it into the staging tables?
So basically I"m using an S3 Load to insert data from S3 bucket into my staging table. Before I store it into my Redshift table, I want to do the eliminate the duplicate records based on some conditions.
Is it possible to do that before storing it (ie: before doing the S3 Load)?
I tried to eliminate the records by using the DELETE command in an SQL component as a transformation job. But then seems like Matillion doesn't support DELETE commands if I"m not mistaken.
To do this you would need to stage the data into a staging table in Redshift, then use a Transformation job to remove the duplicates. I can recommend using the DISTINCT component. Then write the data to a new table. You can then delete your initial staging table if you require.
As you already know, we cannot process duplicates when loading from S3.
I would go with Laura’s suggestion of loading the data into Redshift and then using a transformation job to remove duplicates.
We also think you should look into using some window-functions or Ranking functions to associate a rank based on your criteria. Then you can filter rows where RANK=1, assuming that’s what you are after.