JSON formatting in PySpark

Posted on 2025-02-05 07:32:40

I have JSON stored as a string in the following format:

{
'aaa':'',
'bbb':'',
'ccc':{
       'ccc':[{dict of values}] //list of dictionaries
      },
'ddd':'',
'eee':{
       'eee':[{dict of values},{dict of values},{dict of values}] //list of dictionaries
      }
}

I have nearly 70 million JSON strings in this format. I considered using json_normalize from pandas, but because of the record count I am thinking of using PySpark instead. Could someone advise on the best way to process these JSON strings and store them in a Glue table? I need an output with all the JSON keys as columns and their values as rows, stored as Parquet files.

Also, in some cases not all of the keys will be present. In that case I need to store null (None) under the missing key for that JSON string.

Sample input:
{"aaa":"123","bbb":"asdncj","ccc":{"ccc":[{"ccc1":true,"ccc2":"abcd","ccc3":"abcd"},{"ccc1":true,"ccc2":"abcde","ccc3":"abcdee"},{"ccc1":true,"ccc2":"abcdef","ccc3":"abcdefe"}]},"ddd":"aabcd","eee":{"eee":[{"eee1":"123","eee2":"1","eee3":"hcudh"},{"eee1":"2234","eee2":"1","eee3":"hhcb"}]}}

Output: I want three tables. The first holds the keys aaa, bbb and ddd, the second holds the keys in ccc, and the third holds the keys in eee.

Comments (1)

梦中楼上月下 2025-02-12 07:32:40

If you don't necessarily need to use PySpark (e.g. you just need to read the data), then I would recommend the built-in json module. Here is an example of its use:

import json

# Parse the whole file as a single JSON document into Python objects.
with open("your_file.json", "r") as f:
    raw_json = json.load(f)

Could you also elaborate on how the JSON data needs to be formatted?
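
For what it's worth, here is a rough sketch of how the json module approach could produce the three tables described in the question, with None stored for any missing key. The helper names (split_record, nested_rows), the fixed column lists, and the choice to carry aaa into the ccc and eee rows as a join key are all assumptions for illustration; the question does not say which field identifies a record. It handles one JSON string at a time, so it is not a scaling solution for 70 million records by itself.

```python
import json

# Column lists are assumptions taken from the sample input in the question.
TOP_LEVEL_KEYS = ("aaa", "bbb", "ddd")
CCC_KEYS = ("ccc1", "ccc2", "ccc3")
EEE_KEYS = ("eee1", "eee2", "eee3")


def nested_rows(record, key, columns):
    """Return one row per dict in record[key][key]; missing keys become None."""
    # Tolerate a missing or empty outer key by falling back to an empty list.
    items = (record.get(key) or {}).get(key) or []
    # Carrying "aaa" along as a join key is an assumption, not from the question.
    return [
        {"aaa": record.get("aaa"), **{c: item.get(c) for c in columns}}
        for item in items
    ]


def split_record(json_string):
    """Split one JSON string into rows for the main, ccc, and eee tables."""
    record = json.loads(json_string)
    # dict.get returns None for absent keys, which gives the null behaviour asked for.
    main_row = {k: record.get(k) for k in TOP_LEVEL_KEYS}
    ccc_rows = nested_rows(record, "ccc", CCC_KEYS)
    eee_rows = nested_rows(record, "eee", EEE_KEYS)
    return main_row, ccc_rows, eee_rows


# Shortened sample: "ddd" and "eee3" are deliberately missing.
sample = (
    '{"aaa":"123","bbb":"asdncj",'
    '"ccc":{"ccc":[{"ccc1":true,"ccc2":"abcd","ccc3":"abcd"}]},'
    '"eee":{"eee":[{"eee1":"123","eee2":"1"}]}}'
)
main_row, ccc_rows, eee_rows = split_record(sample)
print(main_row)
print(ccc_rows)
print(eee_rows)
```

Each of the three results is a plain dict or list of dicts, so they can be collected into lists and handed to whatever writer you end up using for Parquet.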
