JSON formatting in PySpark
I have a JSON stored as a string in the below format:
{
'aaa':'',
'bbb':'',
'ccc':{
'ccc':[{dict of values}] //list of dictionaries
}
'ddd':'',
'eee':{
'eee':[{dict of values},{dict of values},{dict of values}] //list of dictionaries
}
}
I have nearly 70 million JSON strings in this format. I thought of using json_normalize from python pandas, but because of the record count I am considering PySpark instead. Could someone advise on the best way to process these JSON strings and store them in a Glue table? I need an output with all the keys in the JSON as columns and their data as rows, and I would store the result as parquet files.
Also, in some cases not all the keys will be present; in that case I need to store null/None for that key for that JSON string.
Sample input:
{"aaa":"123","bbb":"asdncj","ccc":{"ccc":[{"ccc1":true,"ccc2":"abcd","ccc3":"abcd"},{"ccc1":true,"ccc2":"abcde","ccc3":"abcdee"},{"ccc1":true,"ccc2":"abcdef","ccc3":"abcdefe"}]},"ddd":"aabcd","eee":{"eee":[{"eee1":"123","eee2":"1","eee3":"hcudh"},{"eee1":"2234","eee2":"1","eee3":"hhcb"}]}}
As output, I want to have 3 tables: one for the keys aaa, bbb and ddd; the second for the keys inside ccc; and the third for the keys inside eee.
Answers (1)
If you don't necessarily need to use PySpark (e.g. you just need to read it), then I would recommend using the built-in json module. Below is an example of its use. Could you also elaborate on how the JSON data needs to be formatted?
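For illustration, a minimal sketch of parsing one such string with the stdlib json module; the sample string is a shortened version of the one in the question, and the .get() calls make missing keys fall back to None or an empty list rather than raising:

```python
import json

# shortened sample record from the question; "eee" is deliberately omitted
s = '{"aaa":"123","bbb":"asdncj","ccc":{"ccc":[{"ccc1":true,"ccc2":"abcd","ccc3":"abcd"}]},"ddd":"aabcd"}'
record = json.loads(s)

# dict.get() returns None when a key is absent, which covers the
# "store null/None for missing keys" requirement
aaa = record.get("aaa")
ccc_rows = record.get("ccc", {}).get("ccc", [])  # list of dicts
eee_rows = record.get("eee", {}).get("eee", [])  # "eee" missing here -> []
```

For 70 million records this per-string approach would need to be wrapped in something that parallelizes the work, which is why the question leans toward PySpark.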