
S3 Load

I have files in an S3 bucket, gzipped and delimited with ~|~.

I'm trying to figure out a way to use the S3 Load component, but I'm running into two issues.
First, I can't use S3 Load on its own, since it appears the field delimiter can only be a single character.

Second, I tried a Bash script to see if I could use sed to replace the ~|~ with just a single pipe |.
I'm getting different errors when trying gzip ("invalid command" when the path is prefixed with "aws ", "no such file" without the prefix).
This was my thought process in the Bash component:

gzip -cd s3://FileA.txt.gz s3://FileB.txt ## unzip to txt file
sed -i -e 's/~|~/|/g' s3://FileB.txt ## replace ~|~ to |
gzip -c s3://FileB.txt s3://FileA.txt.gz ## zip back


Can I use gzip/sed in the bash script component?

Is there another way this can be implemented without having to change the delimiter on the original files?

Thanks,
-john

5 Community Answers

Matillion Agent  

Ian Funnell —

Hi John,

You’re correct about the single delimiter limitation, but you certainly can do this conversion in a Matillion Bash script.

It’s possible to use the shell’s command line piping capability to download the files from S3, gunzip and convert them, then gzip again and re-upload in a single operation without requiring any storage on the Matillion server.

You’ll need a Bash Script containing something like the following:

aws s3 cp s3://your-bucket/yourfolder/FileA.txt.gz - | gunzip | sed "s/~|~/\t/g" | gzip | aws s3 cp - s3://your-bucket/yourfolder/FileA.tsv.gz

The above changes every ~|~ into a TAB, and re-uploads the file with a .tsv suffix.
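As a quick local check of the conversion step, the same gunzip | sed | gzip pipeline can be run against a small local file standing in for the `aws s3 cp` streams (a sketch only; GNU sed assumed for the `\t` escape, and the sample data is hypothetical):

```shell
# Build a small gzipped sample standing in for the S3 object (hypothetical data).
printf 'a~|~b~|~c\n1~|~2~|~3\n' | gzip > /tmp/FileA.txt.gz

# Same gunzip | sed | gzip pipeline, reading/writing local files
# instead of piping through "aws s3 cp ... -".
gzip -cd /tmp/FileA.txt.gz | sed 's/~|~/\t/g' | gzip > /tmp/FileA.tsv.gz

# Inspect the result: fields are now TAB-separated.
gzip -cd /tmp/FileA.tsv.gz
```

Because everything is piped, nothing larger than a buffer ever sits on disk in the real S3-to-S3 version.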

Best regards,
Ian


John Alwan —

Ian,

Thanks for confirming this. My only challenge here is that some fields in this dataset are free-form text, so there may be TABs or even | characters in there. I'm trying to see if I can use a non-standard ASCII char (\x7F) to replace the ~|~, and also remove double quotes, since those were causing issues with S3 Load as well.

Currently looks like:
aws s3 cp s3://FileA.txt.gz - | gunzip | sed "s/~|~/\x7F/g" | gzip | aws s3 cp - s3://FileB.txt.gz
aws s3 cp s3://FileB.txt.gz - | gunzip | sed "s/\"//g" | gzip | aws s3 cp - s3://FileC.txt.gz
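For what it's worth, the two sed passes above could be collapsed into one pipeline with two `-e` expressions, avoiding the intermediate FileB round-trip entirely (a sketch with local files standing in for the S3 copies; GNU sed assumed for the `\x7f` escape, hypothetical sample data):

```shell
# Sample row with ~|~ delimiters and quoted free-form text (hypothetical data).
printf '"one"~|~"two words"~|~"three"\n' | gzip > /tmp/FileA.txt.gz

# One pass: swap ~|~ for the \x7F byte AND strip double quotes,
# instead of round-tripping through an intermediate FileB.
gzip -cd /tmp/FileA.txt.gz \
  | sed -e 's/~|~/\x7f/g' -e 's/"//g' \
  | gzip > /tmp/FileC.txt.gz
```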

However, I'm getting the following error when using S3 Load with this character on FileC:
ERROR: COPY delimiter must be single character.

Appreciate the help.


Matillion Agent  

Laura Malins —

Hi John

I don’t think non-standard ASCII is accepted in the COPY command. I know another customer managed to get the ^A delimiter working by specifying the delimiter as \\001. Could you try that please?

Thanks
Laura


John Alwan —

Hi Laura,

This does not appear to work for me; can you provide a full script example?

aws s3 cp s3://FileA.txt.gz - | gunzip | sed "s/~|~/\\001/g" | gzip | aws s3 cp - s3://FileB.txt.gz

The above is what I tried.
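One thing worth checking here (an observation, not a confirmed fix): after the shell strips one backslash, sed receives \001, and GNU sed does not read that as an escape for the ^A byte; its octal and hex escape forms are \o001 and \x01. A local sketch with \x01 instead (hypothetical sample data):

```shell
# Sample ~|~-delimited row (hypothetical data).
printf 'a~|~b~|~c\n' | gzip > /tmp/FileA.txt.gz

# Use \x01 (GNU sed hex escape) so the replacement is the literal ^A byte;
# "\\001" reaches sed as \001, which is not its octal escape syntax.
gzip -cd /tmp/FileA.txt.gz | sed 's/~|~/\x01/g' | gzip > /tmp/FileB.txt.gz

# Inspect the bytes with od to confirm the ^A delimiter is really there.
gzip -cd /tmp/FileB.txt.gz | od -c
```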


Matillion Agent  

Ian Funnell —

Hi John,

Shouldn’t be a problem, but please could you email us a sample file? This will allow us to create a script which will work with your data.

I don’t think you can attach files to this thread, but please email direct to support@matillion.com mentioning case number 10777.

I’ll revisit this case and paste the eventual solution on here in case anyone else is interested.

Best regards,
Ian
