JSON formatting in PySpark
I have a JSON stored as a string in the below format:
{
'aaa':'',
'bbb':'',
'ccc':{
'ccc':[{dict of values}] //list of dictionaries
}
'ddd':'',
'eee':{
'eee':[{dict of values},{dict of values},{dict of values}] //list of dictionaries
}
}
I have nearly 70 million JSON strings in this format. I thought of using json_normalize from python pandas, but because of the record count I am considering PySpark instead. Could someone advise on the best way to process these JSON strings and store them in a Glue table? I need an output with all the keys in the JSON as columns and their data as rows, and I would store the result as parquet files.
Also, in some cases not all the keys will be present; in that case I need to store null/None for that key for that JSON string.
Sample input:
{"aaa":"123","bbb":"asdncj","ccc":{"ccc":[{"ccc1":true,"ccc2":"abcd","ccc3":"abcd"},{"ccc1":true,"ccc2":"abcde","ccc3":"abcdee"},{"ccc1":true,"ccc2":"abcdef","ccc3":"abcdefe"}]},"ddd":"aabcd","eee":{"eee":[{"eee1":"123","eee2":"1","eee3":"hcudh"},{"eee1":"2234","eee2":"1","eee3":"hhcb"}]}}
As output, I want to have 3 tables: one for the keys aaa, bbb and ddd; the second for the keys inside ccc; and the third for the keys inside eee.
Answers (1)
If you don't necessarily need to use PySpark (e.g. you just need to read it), then I would recommend using the built-in json module. Below is an example of its use. Could you also elaborate on how the JSON data needs to be formatted?
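For illustration, a minimal sketch of parsing one such string with the stdlib json module; the sample string is a shortened version of the one in the question, and the .get() calls make missing keys fall back to None or an empty list rather than raising:

```python
import json

# shortened sample record from the question; "eee" is deliberately omitted
s = '{"aaa":"123","bbb":"asdncj","ccc":{"ccc":[{"ccc1":true,"ccc2":"abcd","ccc3":"abcd"}]},"ddd":"aabcd"}'
record = json.loads(s)

# dict.get() returns None when a key is absent, which covers the
# "store null/None for missing keys" requirement
aaa = record.get("aaa")
ccc_rows = record.get("ccc", {}).get("ccc", [])  # list of dicts
eee_rows = record.get("eee", {}).get("eee", [])  # "eee" missing here -> []
```

For 70 million records this per-string approach would need to be wrapped in something that parallelizes the work, which is why the question leans toward PySpark.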