PySpark: reading JSON with a custom nested schema does not work

Posted 2025-02-05 17:34:55

I have this simple JSON file:

{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"right":false,"left":false}}}}

But when I try to read it like this:

spark.read.option("inferSchema", "true") \
          .option("multiline", "true") \
          .json("///myfile.json") \
          .first() \
          .asDict()

I get:

{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"alarm":false,"muted":false}}}}

This is wrong because the adas_lane_keepAssist values are not correct: keepAssist comes back with alarm/muted keys instead of right/left.

If, in the source JSON, I change one of the adas_lane_keepAssist values to true, then the mapping is correct...
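
For reference, the schema Spark infers can be inspected before any row is converted to a Python dict; the following is only a sketch, reusing the placeholder path and the existing spark session from the snippet above:

# Sketch: print the inferred schema and convert the first row recursively.
# "///myfile.json" is the same placeholder path as in the snippets above.
df = spark.read.option("multiline", "true").json("///myfile.json")
df.printSchema()                          # keepAssist should list right/left here
print(df.first().asDict(recursive=True))  # recursive=True also converts nested Rows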

I also thought that maybe inferSchema was the root of the problem, so I made a custom_schema:

from pyspark.sql.types import StructType, StructField, BooleanType

custom_schema = StructType([
    StructField("adas",StructType([
        StructField("parkAssist",StructType([
            StructField("rear",StructType([
                StructField("alarm",BooleanType(),True),
                StructField("muted",BooleanType(),True)
            ])),
            StructField("front",StructType([
                StructField("alarm",BooleanType(),True),
                StructField("muted",BooleanType(),True)
            ]))
        ])),
        StructField("lane",StructType([
            StructField("keepAssist",StructType([
                StructField("right",BooleanType(),True),
                StructField("left",BooleanType(),True)
            ]))
        ]))
    ]))
  ])

and read it like this:

spark.read.schema(custom_schema) \
          .option("multiline", "true") \
          .json("///myfile.json") \
          .first() \
          .asDict()

And I get the same wrong result:

{"adas":{"parkAssist":{"rear":{"alarm":false,"muted":false},"front":{"alarm":false,"muted":false}},"lane":{"keepAssist":{"alarm":false,"muted":false}}}}

The funny thing is, if I change the order in my custom_schema like this:

custom_schema = StructType([
    StructField("adas",StructType([
        StructField("lane",StructType([
            StructField("keepAssist",StructType([
                StructField("right",BooleanType(),True),
                StructField("left",BooleanType(),True)
            ]))
        ])),
        StructField("parkAssist",StructType([
            StructField("rear",StructType([
                StructField("alarm",BooleanType(),True),
                StructField("muted",BooleanType(),True)
            ])),
            StructField("front",StructType([
                StructField("alarm",BooleanType(),True),
                StructField("muted",BooleanType(),True)
            ]))
        ]))
    ]))
  ])

Now every field under adas_parkAssist (both front and rear) is wrong; they come back with right/left keys instead of alarm/muted:

{"adas":{"lane":{"keepAssist":{"right":false,"left":false}}, "parkAssist":{"rear":{"right":false,"left":false},"front":{"right":false,"left":false}}}}

Is this a limitation of PySpark?
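
One way to narrow down where things go wrong is to serialize a row back to JSON on the JVM side, which bypasses the Python Row-to-dict conversion entirely; a minimal sketch, again with the placeholder path:

# Sketch: if toJSON() shows the correct values while first().asDict() does not,
# the JSON parsing itself is fine and the problem sits in the Row conversion.
df = spark.read.option("multiline", "true").json("///myfile.json")
print(df.toJSON().first())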


Comments (2)

雨落□心尘 2025-02-12 17:34:56

It's very strange to me too. I tried first(), head() and collect(), but they all returned the same distorted structure. If I printed the schema before those calls, it was correct. So the problem is that first(), head() and collect() do not handle the nested structs correctly...

Looking for a workaround, I transformed the whole schema (which was correct after reading the JSON file) to a map type.

from pyspark.sql import functions as F

# spark is an existing SparkSession; the path points to the test JSON file
df = spark.read.json(r"path\test_file.json")
df = df.withColumn('adas', F.create_map(
    F.lit('lane'), F.create_map(
        F.lit('keepAssist'), F.create_map(
            F.lit('left'), F.col('adas.lane.keepAssist.left'),
            F.lit('right'), F.col('adas.lane.keepAssist.right')
        )
    ),
    F.lit('parkAssist'), F.create_map(
        F.lit('front'), F.create_map(
            F.lit('alarm'), F.col('adas.parkAssist.front.alarm'),
            F.lit('muted'), F.col('adas.parkAssist.front.muted')
        ),
        F.lit('rear'), F.create_map(
            F.lit('alarm'), F.col('adas.parkAssist.rear.alarm'),
            F.lit('muted'), F.col('adas.parkAssist.rear.muted')
        )
    )
))
print(df.head().asDict())
# {'adas': {'lane': {'keepAssist': {'left': False, 'right': False}}, 'parkAssist': {'rear': {'alarm': False, 'muted': False}, 'front': {'alarm': False, 'muted': False}}}}

一张白纸 2025-02-12 17:34:56

My Spark version was 3.1.1.

After updating it to 3.2.0, the custom nested schema was read as expected.

Thx !
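
For anyone checking which version their session actually runs, a sketch (spark is assumed to be an existing SparkSession):

import pyspark

print(pyspark.__version__)   # version of the installed PySpark package
print(spark.version)         # version of the Spark runtime behind the session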
