Load JSON with different structures

Hello,

I have a JSON file (a server log with web-tracking events), and the records have different structures...

Example:

Row 1:
{
"category_id":"1234"
,"q":"xxxx"
,"locale":"de"
,"controller":"xxxx"
,"action":"show"
,"company_slug":"xxxx"
}

Row 2:
{
"utm_source":"xxxx"
,"utm_medium":"1234"
,"utm_campaign":"xxxx"
,"locale":"de"
,"controller":"xxxx"
,"action":"show"
,"company_slug":"xxxx"
}

Row 3:
{
"locale":"en"
,"controller":"xxxx"
,"action":"show"
,"company_slug":"xxxx"
}

and so on...

The different structures are stored in "request_params". Some attributes repeat across rows, but sometimes in different positions.

How can I handle this dynamically without writing endless nested case-statements?

I have a target table with 10 columns for the attributes and would like to map them 1:1 from the JSON file.
If an attribute is not present, it should simply be omitted.
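
Just to make the goal concrete, here is a sketch of the kind of target table I mean (the column names are taken from the attributes in the example rows above; types and lengths are only placeholders, not the real DDL):

create table webtracking_request_params (
    category_id   varchar(64),
    q             varchar(256),
    locale        varchar(8),
    controller    varchar(128),
    action        varchar(64),
    company_slug  varchar(256),
    utm_source    varchar(128),
    utm_medium    varchar(128),
    utm_campaign  varchar(128)
);

Attributes that are missing from a given row would simply stay NULL.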

5 Community Answers

Matillion Agent  

Laura Malins —

Hi Daniel

How are you loading this? If it's via an API profile, you could have one RSD in the profile per output and write these into the same table.

Thanks
Laura


Daniel Petri —

Hi Laura,
the JSON files are stored in S3; I load them with the S3 Load component and write them to a target table.

This is a typical dataset:

{
"timestamp": "2019-01-27T00:00:01.000+01:00",
"uuid": "xxxx",
"page_type": "company_profile",
"category_id": 1234,
"category_name": null,
"doc_type": "page",
"search_term": null,
"search_params": {},
"companies": [
{
"company_id": 1234,
"category_id": 5678,
"is_customer": false,
"package_name": "free",
"is_bonus": true,
"topranking_position": 0
}
],
"products": [],
"videos": [],
"catalogs": [],
"suggested_categories": [],
"http_status_code": 200,
"language": "de",
"request_host": "www.xxx.at",
"request_path": "xxxx",
"request_params": {
"category_id": "1234",
"q": "xxxx",
"locale": "de",
"controller": "xxxx",
"action": "show",
"company_slug": "xxxx"
},
"ip": "12.34.5678.123",
"user_agent": "xxxx",
"referrer": null,
"session_id": "xxxx",
"cookies": {},
"bot_useragent": true,
"is_bot": true,
"is_ip_blacklisted": false,
"wlw_client_id": "xxxx",
"ga_client_id": null,
"wlw_user_id": null,
"full_tracking_allowed": true
}

The problem is the set of attributes underneath "request_params". They change from dataset to dataset.
So far I've been able to find 5 different variations.
My goal is to store certain attributes in the destination table. However, these are not always in the same position, or may not exist at all.

Can I solve this problem with a Python script?


Matillion Agent  

Laura Malins —

Hi Daniel

Thank you for the complete sample – that helps. I would load the request_params JSON into a single column in a table in Redshift and then use a Transformation job to split the JSON out into separate columns. This can be done in a Calculator component with the function below:

https://docs.aws.amazon.com/redshift/latest/dg/JSON_EXTRACT_ARRAY_ELEMENT_TEXT.html
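
As a rough sketch (table and column names here are placeholders, not taken from your actual job): once the raw request_params object has been landed in a single VARCHAR column, the related JSON_EXTRACT_PATH_TEXT function can pull individual values out by key in a Transformation job, for example:

-- staging table holds one row per event, with request_params kept as raw JSON text
select
    json_extract_path_text(request_params, 'category_id')  as category_id,
    json_extract_path_text(request_params, 'q')            as q,
    json_extract_path_text(request_params, 'locale')       as locale,
    json_extract_path_text(request_params, 'controller')   as controller,
    json_extract_path_text(request_params, 'action')       as action,
    json_extract_path_text(request_params, 'company_slug') as company_slug,
    json_extract_path_text(request_params, 'utm_source')   as utm_source,
    json_extract_path_text(request_params, 'utm_medium')   as utm_medium,
    json_extract_path_text(request_params, 'utm_campaign') as utm_campaign
from stg_webtracking_events;

In Matillion the same expressions would go into a Calculator component, each expression producing one output column.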

I’m no expert in Python, but I don’t think you need it in this scenario.

Thanks
Laura


Daniel Petri —

Hi Laura,

json_extract_path_text is the solution.
With this function I can pick every attribute by its name.
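
If I read the documentation right, a key that is missing from request_params comes back as an empty string rather than an error, so wrapping the call in NULLIF keeps absent attributes as NULL in the target table, e.g.:

nullif(json_extract_path_text(request_params, 'utm_source'), '') as utm_source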

Thank you!

Best,
Daniel


Matillion Agent  

Laura Malins —

Hi Daniel

Thank you for updating us. I’m glad you’ve got it working :-)

Thanks
Laura
