PySpark: reading JSON with a custom nested schema does not work
I have this simple JSON file:
{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"right":false,"left":false}}}}
But when I try to read it like this:
    spark.read.option("inferSchema", "true") \
        .option("multiline", "true") \
        .json("///myfile.json") \
        .first() \
        .asDict()
I get:
{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"alarm":false,"muted":false}}}}
This is wrong: the fields under adas.lane.keepAssist come back as alarm and muted instead of right and left.
If I change one of the adas.lane.keepAssist values in the source JSON to true, the mapping is correct...
I also thought that inferSchema might be the root of the problem, so I defined a custom_schema:
    from pyspark.sql.types import BooleanType, StructField, StructType

    custom_schema = StructType([
        StructField("adas", StructType([
            StructField("parkAssist", StructType([
                StructField("rear", StructType([
                    StructField("alarm", BooleanType(), True),
                    StructField("muted", BooleanType(), True)
                ])),
                StructField("front", StructType([
                    StructField("alarm", BooleanType(), True),
                    StructField("muted", BooleanType(), True)
                ]))
            ])),
            StructField("lane", StructType([
                StructField("keepAssist", StructType([
                    StructField("right", BooleanType(), True),
                    StructField("left", BooleanType(), True)
                ]))
            ]))
        ]))
    ])
and read the file like this:
    spark.read.schema(custom_schema) \
        .option("multiline", "true") \
        .json("///myfile.json") \
        .first() \
        .asDict()
And I get the same wrong result:
{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"alarm":false,"muted":false}}}}
The funny thing is what happens if I change the order of the fields in my custom_schema like this:
    custom_schema = StructType([
        StructField("adas", StructType([
            StructField("lane", StructType([
                StructField("keepAssist", StructType([
                    StructField("right", BooleanType(), True),
                    StructField("left", BooleanType(), True)
                ]))
            ])),
            StructField("parkAssist", StructType([
                StructField("rear", StructType([
                    StructField("alarm", BooleanType(), True),
                    StructField("muted", BooleanType(), True)
                ])),
                StructField("front", StructType([
                    StructField("alarm", BooleanType(), True),
                    StructField("muted", BooleanType(), True)
                ]))
            ]))
        ]))
    ])
Now every field under adas.parkAssist (both rear and front) is wrong; the keys come back as right and left instead of alarm and muted:
{"adas":{"lane":{"keepAssist":{"right":false,"left":false}}, "parkAssist":{"rear":{"right":false,"left":false},"front":{"right":false,"left":false}}}}
Is this a limitation of PySpark?
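One way to pin down exactly where the returned structure diverges from the source is to compare the key layout of the two nested dicts. The sketch below is plain Python and does not touch Spark; the helper name structure_diff is hypothetical, and the two JSON strings are copied from the question. With a real DataFrame, the second dict would presumably come from something like row.asDict(recursive=True), which is an assumption about how the result is obtained rather than part of the original post.

```python
import json


def structure_diff(expected, actual, path=""):
    """Return (path, expected_keys, actual_keys) for every nested dict
    whose key set differs between the two inputs. Leaf values are ignored."""
    problems = []
    if isinstance(expected, dict) and isinstance(actual, dict):
        if set(expected) != set(actual):
            problems.append((path or "<root>", sorted(expected), sorted(actual)))
        # Only descend into keys that exist on both sides.
        for key in sorted(expected.keys() & actual.keys()):
            child_path = f"{path}.{key}" if path else key
            problems.extend(structure_diff(expected[key], actual[key], child_path))
    return problems


# Source file content, as given in the question.
source = json.loads(
    '{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},'
    '"front":{"alarm":false,"muted":false}},'
    '"lane":{"keepAssist":{"right":false,"left":false}}}}'
)

# Result reported in the question after first().asDict().
returned = json.loads(
    '{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},'
    '"front":{"alarm":false,"muted":false}},'
    '"lane":{"keepAssist":{"alarm":false,"muted":false}}}}'
)

mismatches = structure_diff(source, returned)
for where, want, got in mismatches:
    print(f"{where}: expected keys {want}, got {got}")
```

For the data in the question this reports a single mismatch at adas.lane.keepAssist, which matches what the post describes.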
Comments (2)
It's very strange to me too. I tried first, head and collect, but they all returned the same distorted structure. If I printed the schema before those calls, it was correct. So the problem seems to be that first, head and collect do not work correctly with nested structs... As a workaround, I converted the whole schema (which was correct after reading the JSON file) to a map type.
My spark version was 3.1.1.
After updating it to 3.2.0, the custom nested schema was read as expected.
Thanks!
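Since the behaviour changed between the two versions mentioned in this comment, a small guard on the version string can flag an environment that predates 3.2.0. This is only a sketch: the helper names version_tuple and predates_reported_fix are hypothetical, and the 3.2.0 threshold is taken from this thread alone (3.1.1 showed the problem, 3.2.0 did not), not from any release notes.

```python
import re

# Assumption: threshold based only on the report in this thread.
REPORTED_FIXED_VERSION = (3, 2, 0)


def version_tuple(version):
    """Turn a string like '3.1.1' into (3, 1, 1).

    Only the leading digits of each of the first three parts are used,
    so '3.2.0-SNAPSHOT' becomes (3, 2, 0). Missing parts count as 0.
    """
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def predates_reported_fix(version):
    """True if the version is older than the one reported to work here."""
    return version_tuple(version) < REPORTED_FIXED_VERSION


for candidate in ("3.1.1", "3.2.0"):
    print(candidate, predates_reported_fix(candidate))
```

In a live session the string would come from spark.version, for example predates_reported_fix(spark.version).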