I have files in an S3 bucket, gzipped and delimited with ~|~.
I'm trying to figure out a way to use the S3 Load component, but I'm running into two issues. First, I can't use S3 Load on its own, since it appears I can only specify a single character as the field delimiter.
Second, I tried a Bash Script component to see if I could use sed to replace the ~|~ with a single pipe |, but I'm getting different errors when running gzip (invalid command when the path is prefixed with "aws ", no such file without the prefix). This was my thought process in the bash component:
gzip -cd s3://FileA.txt.gz s3://FileB.txt     ## unzip to txt file
sed -i -e 's/~|~/|/g' s3://FileB.txt          ## replace ~|~ with |
gzip -c s3://FileB.txt s3://FileA.txt.gz      ## zip back
Can I use gzip/sed in the bash script component?
Is there another way this can be implemented without having to change the delimiter on the original files?
5 Community Answers
Ian Funnell —
You’re correct about the single delimiter limitation, but you certainly can do this conversion in a Matillion Bash script.
It’s possible to use the shell’s command line piping capability to download the files from S3, gunzip and convert them, then gzip again and re-upload in a single operation without requiring any storage on the Matillion server.
You’ll need a Bash Script containing something like the following:
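A minimal sketch of that pipeline, assuming the AWS CLI is installed on the Matillion server and using placeholder bucket and key names:

#!/bin/bash
# Hypothetical locations - substitute your own bucket/prefix/key.
SRC=s3://my-bucket/incoming/FileA.txt.gz
DST=s3://my-bucket/converted/FileA.txt.gz

# Stream the object down from S3, gunzip it, convert the three-character
# delimiter ~|~ to a single pipe, then gzip and stream it straight back
# up to S3. Nothing is written to local disk.
aws s3 cp "$SRC" - | gzip -cd | sed 's/~|~/|/g' | gzip -c | aws s3 cp - "$DST"

The "-" arguments tell aws s3 cp to write to stdout and read from stdin, which is what keeps the whole conversion streaming rather than landing files on the Matillion server.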
Thanks for confirming this. My only challenge here is that some fields in this dataset are free-form text, so there may be a TAB or even a | in there. I'm trying to see if I can use a non-standard ASCII character (\x7F) to replace the ~|~ and also remove the double quotes, since those were causing issues with S3 Load as well.
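Purely as an illustration of that idea (the delimiter swap below is an assumption, not something from the thread), the same streaming pipeline could substitute the DEL character (0x7F) for ~|~ and strip double quotes in a single sed pass, reusing SRC and DST from the sketch above:

# Illustrative only: use the non-printing DEL byte (0x7F) as a
# temporary single-character delimiter and drop double quotes.
DELIM=$(printf '\x7f')
aws s3 cp "$SRC" - \
  | gzip -cd \
  | sed -e "s/~|~/${DELIM}/g" -e 's/"//g' \
  | gzip -c \
  | aws s3 cp - "$DST"

The S3 Load component would then need its field delimiter set to the same 0x7F character, assuming it accepts a non-printing character there.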