Updating a deeply nested column from string to struct

Posted on 2025-01-24 18:35:12

 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: long (nullable = true)
 |    |    |-- z: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- log: string (nullable = true)

I have the nested schema above, and I want to change log inside column z from string to struct, like this:

 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- y: long (nullable = true)
 |    |    |-- z: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- log: struct (nullable = true)
 |    |    |    |    |    |-- b: string (nullable = true)
 |    |    |    |    |    |-- c: string (nullable = true)

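I'm on Spark 2.4.x, not Spark 3. I'd prefer a Scala solution, but Python works too, since this is a one-time manual job to backfill some past data.

Is there a way to do this with a UDF, or any other way?

I know this is easy to do with from_json, but the nested array of structs is causing issues.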

杯别 2025-01-31 18:35:12

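I think it depends on the values in your log column, that is, on how you want to split the string into two separate fields.

The following PySpark code just "moves" your log value into both the b and c fields.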

from pyspark.sql import functions as F, types as T

# Example data (assumes an active SparkSession named `spark`):

schema = (
    T.StructType([
        T.StructField('x', T.ArrayType(T.StructType([
            T.StructField('y', T.LongType()),
            T.StructField('z', T.ArrayType(T.StructType([
                T.StructField('log', T.StringType())
            ]))),
        ])))
    ])
)
df = spark.createDataFrame([
    [
        [[
            9,
            [[
                'text'
            ]]
        ]]
    ]
], schema)

df.printSchema()
# root
#  |-- x: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- y: long (nullable = true)
#  |    |    |-- z: array (nullable = true)
#  |    |    |    |-- element: struct (containsNull = true)
#  |    |    |    |    |-- log: string (nullable = true)

# Rebuild every element of x with the new layout; the log string is copied into both b and c.
df = df.withColumn('x', F.expr('transform(x, e -> struct(e.y as y, array(struct(struct(e.z.log[0] as b, e.z.log[0] as c) as log)) as z))'))

df.printSchema()
# root
#  |-- x: array (nullable = true)
#  |    |-- element: struct (containsNull = false)
#  |    |    |-- y: long (nullable = true)
#  |    |    |-- z: array (nullable = false)
#  |    |    |    |-- element: struct (containsNull = false)
#  |    |    |    |    |-- log: struct (nullable = false)
#  |    |    |    |    |    |-- b: string (nullable = true)
#  |    |    |    |    |    |-- c: string (nullable = true)

If string transformations are needed on the log column, change the e.z.log[0] parts to include them.
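One thing to watch: e.z.log[0] reads only the first entry of z, so the rebuilt z always has exactly one element. Below is a minimal sketch of a variant that keeps every entry by nesting a second transform over e.z. It assumes each log string is a JSON object with string fields b and c (the question mentions from_json but does not show the data), so adjust the inner expression if the strings look different. The names LOG_SCHEMA, X_EXPR and convert are made up for illustration. Both transform and from_json exist in Spark 2.4, so this should not need Spark 3.

```python
# Assumption: each `log` string is a JSON object such as '{"b": "1", "c": "2"}'.
LOG_SCHEMA = "b STRING, c STRING"  # DDL-style schema passed to from_json

# The outer transform walks x, the inner transform walks every entry of z,
# so no element of z is dropped.
X_EXPR = (
    "transform(x, e -> struct("
    "e.y as y, "
    "transform(e.z, l -> struct(from_json(l.log, '{schema}') as log)) as z"
    "))"
).format(schema=LOG_SCHEMA)


def convert(df):
    """Return df with x rebuilt so that every z entry has log as struct<b, c>."""
    from pyspark.sql import functions as F  # imported lazily; needs PySpark 2.4+
    return df.withColumn("x", F.expr(X_EXPR))
```

With the example DataFrame above, convert(df).printSchema() can be used to check that log is now a struct. Strings that are not valid JSON would come back as null rather than raising an error, which is worth checking before overwriting any data.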

我很坚强 2025-01-31 18:35:12

Higher-order functions are your friend in this case. Coalesce, basically. Code below:

# printSchema() returns None, so keep it out of the assignment to df.
df = df.withColumn('x', F.expr('transform(x, e -> struct(e.y as y, array(struct(coalesce(("1" as a,"2" as b)) as log))as z))'))
df.printSchema()

root
 |-- x: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- y: long (nullable = true)
 |    |    |-- z: array (nullable = false)
 |    |    |    |-- element: struct (containsNull = false)
 |    |    |    |    |-- log: struct (nullable = false)
 |    |    |    |    |    |-- a: string (nullable = false)
 |    |    |    |    |    |-- b: string (nullable = false)
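Since the question also asks about a UDF, here is a rough sketch of that route for comparison. It again assumes log is a JSON object with fields b and c, which the thread does not confirm. The function names parse_log, convert_x and make_udf are hypothetical. The plain-Python part can be tried without Spark; only make_udf needs PySpark.

```python
import json


def parse_log(log):
    """Turn one JSON string into a (b, c) tuple; None or unparsable input gives None."""
    if log is None:
        return None
    try:
        obj = json.loads(log)
    except ValueError:
        return None
    if not isinstance(obj, dict):
        return None
    b, c = obj.get("b"), obj.get("c")
    return (None if b is None else str(b), None if c is None else str(c))


def convert_x(x):
    """Rebuild x as a list of (y, z) tuples, where each z entry is a (log,) tuple."""
    if x is None:
        return None
    out = []
    for e in x:
        if e is None:
            out.append(None)
            continue
        z = e["z"]  # works for both dicts and pyspark Row objects
        new_z = None if z is None else [
            None if item is None else (parse_log(item["log"]),) for item in z
        ]
        out.append((e["y"], new_z))
    return out


def make_udf():
    """Wrap convert_x in a Spark UDF that returns the target schema (needs PySpark)."""
    from pyspark.sql import functions as F, types as T
    log_t = T.StructType([T.StructField("b", T.StringType()),
                          T.StructField("c", T.StringType())])
    z_t = T.ArrayType(T.StructType([T.StructField("log", log_t)]))
    x_t = T.ArrayType(T.StructType([T.StructField("y", T.LongType()),
                                    T.StructField("z", z_t)]))
    return F.udf(convert_x, x_t)

# Usage, assuming df has the schema from the question:
# df = df.withColumn("x", make_udf()("x"))
```

A Python UDF is slower than the built-in transform because every row is serialized to a Python worker and back, but for a one-time backfill that may be acceptable, and arbitrary parsing logic is easier to write this way.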
