Updating a highly nested column from string to struct

Posted on 2025-01-24 18:35:12

 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: long (nullable = true)
 |    |    |-- z: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- log: string (nullable = true)

I have the nested schema above, and I want to change the log field inside z from string to struct:

 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: long (nullable = true)
 |    |    |-- z: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- log: struct (nullable = true)
 |    |    |    |    |    |-- b: string (nullable = true)
 |    |    |    |    |    |-- c: string (nullable = true)

I'm not using Spark 3 but Spark 2.4.x. I would prefer a Scala solution, but Python works too, since this is a one-time manual job to backfill some past data.

Is there a way to do this with some udf or any other way?

I know it's easy to do this via from_json but the nested array of struct is causing issues.

Comments (2)

杯别 2025-01-31 18:35:12

I think it depends on the values in your log column, that is, on how you want to split the string into two separate fields.

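The following PySpark code just "moves" your log value into both the b and c fields. Note that it reads only the first element of z (e.z.log[0]) and rebuilds z as a one-element array.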

from pyspark.sql import functions as F, types as T

# Assumes an active SparkSession named `spark`.
# Example data:

schema = (
    T.StructType([
        T.StructField('x', T.ArrayType(T.StructType([
            T.StructField('y', T.LongType()),
            T.StructField('z', T.ArrayType(T.StructType([
                T.StructField('log', T.StringType())
            ]))),
        ])))
    ])
)
df = spark.createDataFrame([
    [
        [[
            9,
            [[
                'text'
            ]]
        ]]
    ]
], schema)

df.printSchema()
# root
#  |-- x: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- y: long (nullable = true)
#  |    |    |-- z: array (nullable = true)
#  |    |    |    |-- element: struct (containsNull = true)
#  |    |    |    |    |-- log: string (nullable = true)
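
# Rebuild each element of x: keep y, and wrap the first z.log value
# in a struct with fields b and c.
df = df.withColumn('x', F.expr('''
    transform(x, e -> struct(
        e.y as y,
        array(struct(struct(e.z.log[0] as b, e.z.log[0] as c) as log)) as z
    ))
'''))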

df.printSchema()
# root
#  |-- x: array (nullable = true)
#  |    |-- element: struct (containsNull = false)
#  |    |    |-- y: long (nullable = true)
#  |    |    |-- z: array (nullable = false)
#  |    |    |    |-- element: struct (containsNull = false)
#  |    |    |    |    |-- log: struct (nullable = false)
#  |    |    |    |    |    |-- b: string (nullable = true)
#  |    |    |    |    |    |-- c: string (nullable = true)

If the log string needs to be parsed or transformed, replace the e.z.log[0] expressions with the appropriate string transformation.
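
To keep every element of z and actually parse the string, one option is to nest a second transform inside the first and call from_json on each log value. The following is only a sketch under assumptions that are not in the question: it assumes each log string is a JSON object with string keys b and c, and the names LOG_DDL, nested_from_json_expr and convert_log_to_struct are hypothetical helpers made up for illustration. Both transform and from_json with a DDL schema string are available as SQL functions in Spark 2.4.

```python
# Sketch only: LOG_DDL and the JSON layout of `log` are assumptions,
# not something stated in the question.
LOG_DDL = "b STRING, c STRING"


def nested_from_json_expr(log_ddl: str = LOG_DDL) -> str:
    """Build a Spark SQL expression that parses every x[*].z[*].log with from_json."""
    # Inner lambda: rebuild each element of z, replacing the string with a struct.
    inner = f"transform(e.z, l -> struct(from_json(l.log, '{log_ddl}') as log))"
    # Outer lambda: rebuild each element of x, keeping y unchanged.
    return f"transform(x, e -> struct(e.y as y, {inner} as z))"


def convert_log_to_struct(df):
    """Return df with x.z.log converted from string to struct (requires pyspark)."""
    # Imported lazily so the expression builder can be inspected without Spark.
    from pyspark.sql import functions as F
    return df.withColumn("x", F.expr(nested_from_json_expr()))


print(nested_from_json_expr())
```

With the example DataFrame above, convert_log_to_struct(df).printSchema() would show whether log became a struct with fields b and c. If the string is not JSON, the from_json call would need to be swapped for a different parsing expression.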

我很坚强 2025-01-31 18:35:12

Higher-order functions are your friend in this case, together with coalesce. The code below fills log with the placeholder literals "1" and "2":

 # printSchema() returns None, so call it separately instead of assigning its result to df.
 df = df.withColumn('x', F.expr('transform(x, e -> struct(e.y as y, array(struct(coalesce(("1" as a,"2" as b)) as log)) as z))'))
 df.printSchema()

root
 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- y: long (nullable = true)
 |    |    |-- z: array (nullable = false)
 |    |    |    |-- element: struct (containsNull = false)
 |    |    |    |    |-- log: struct (nullable = false)
 |    |    |    |    |    |-- a: string (nullable = false)
 |    |    |    |    |    |-- b: string (nullable = false)