Why does reading JSON-format data put all records into _corrupt_record in PySpark?

Posted 2025-01-16 14:31:46

I am reading data from an API call, and the data is in JSON form, as shown below:

{'success': True, 'errors': [], 'requestId': '151a2#fg', 'warnings': [], 'result': [{'id': 10322433, 'name': 'sdfdgd', 'desc': '', 'createdAt': '2016-09-20T13:48:58Z+0000', 'updatedAt': '2020-07-16T13:08:03Z+0000', 'url': 'https://eda', 'subject': {'type': 'Text', 'value': 'Register now'}, 'fromName': {'type': 'Text', 'value': 'ramjdn fg'}, 'fromEmail': {'type': 'Text', 'value': '[email protected]'}, 'replyEmail': {'type': 'Text', 'value': '[email protected]'}, 'folder': {'type': 'Folder', 'value': 478, 'folderName': 'sjha'}, 'operational': False, 'textOnly': False, 'publishToMSI': False, 'webView': False, 'status': 'approved', 'template': 1031, 'workspace': 'Default', 'isOpenTrackingDisabled': False, 'version': 2, 'autoCopyToText': True, 'preHeader': None}]}

When I create a DataFrame from this data with the following code:

df = spark.read.json(sc.parallelize([data]))

I get only one column, _corrupt_record. The DataFrame output is shown below. I have tried setting multiLine to true, but I still do not get the desired output.

+--------------------+
|     _corrupt_record|
+--------------------+
|{'id': 12526, 'na...|
+--------------------+

The expected output is a DataFrame with the JSON exploded into separate columns, for example id as one column, name as another, and so on.

I have tried many things but have not been able to fix this.


兔小萌 2025-01-23 14:31:46

I made some changes and it worked.

I needed to define a custom schema.

Then I used this bit of code:

data = sc.parallelize([items])
df = spark.createDataFrame(data, schema=schema)

And it worked.
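
The answer does not show the schema itself. As a minimal sketch of what it might look like, assuming only a few top-level fields from the sample record are needed: the column selection and the names `COLUMNS`, `to_rows`, and `build_dataframe` are illustrative and do not come from the original post.

```python
# Hypothetical subset of top-level fields taken from the sample record.
COLUMNS = ["id", "name", "status", "operational"]


def to_rows(records, columns=COLUMNS):
    """Turn a list of dicts into tuples ordered like `columns`.

    Keys that are missing from a record become None.
    """
    return [tuple(rec.get(col) for col in columns) for rec in records]


def build_dataframe(spark, records):
    """Create a DataFrame with an explicit schema (requires pyspark)."""
    # Imported here so to_rows() stays usable without Spark installed.
    from pyspark.sql.types import (
        BooleanType,
        LongType,
        StringType,
        StructField,
        StructType,
    )

    # Field order must match COLUMNS; every field is nullable.
    schema = StructType([
        StructField("id", LongType(), True),
        StructField("name", StringType(), True),
        StructField("status", StringType(), True),
        StructField("operational", BooleanType(), True),
    ])
    return spark.createDataFrame(to_rows(records), schema=schema)


sample = [{"id": 10322433, "name": "sdfdgd", "status": "approved", "operational": False}]
rows = to_rows(sample)
```

With an active `SparkSession` named `spark`, the call would be `build_dataframe(spark, data['result'])`, assuming `data` is the parsed API response from the question.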

If there is a more optimized solution, please feel free to share.
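
One possible alternative, offered as a sketch rather than a confirmed fix: the sample record uses single quotes and `True`/`False`/`None`, which suggests `data` is a Python dict rather than a JSON string. The string form of a dict is not valid JSON, which would explain `_corrupt_record`. Serializing with `json.dumps` first yields text that a JSON parser accepts. The `data` value below is a shortened stand-in for the API response.

```python
import json

# Shortened stand-in for the API response shown in the question.
data = {
    "success": True,
    "result": [
        {"id": 10322433, "name": "sdfdgd", "operational": False, "preHeader": None},
    ],
}

# str() gives the Python repr (single quotes, True/False/None), which is not JSON.
try:
    json.loads(str(data))
    repr_is_json = True
except json.JSONDecodeError:
    repr_is_json = False

# One JSON document per record, suitable as the elements of an RDD of strings.
json_lines = [json.dumps(record) for record in data["result"]]

print(repr_is_json)  # False
print(json_lines[0])
```

With Spark available, `spark.read.json(sc.parallelize(json_lines))` should then infer columns such as `id` and `name`, where `spark` and `sc` are the session and context from the question.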



About the author: 雾里花
