Updating a deeply nested column from string to struct
|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- y: long (nullable = true)
| | |-- z: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- log: string (nullable = true)
I have the above nested schema, and I want to change the log field inside z from a string to a struct, so that the schema becomes:
|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- y: long (nullable = true)
| | |-- z: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- log: struct (nullable = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
I'm on Spark 2.4.x, not Spark 3. I'd prefer a Scala solution, but Python works too since this is a one-time manual job to backfill some past data.
Is there a way to do this with a UDF or some other approach?
I know it's easy to do this via from_json, but the nested arrays of structs are causing issues.
2 Answers
I think it depends on the values in your log column, i.e. on the way you want to split the string into two separate fields. The following PySpark code will just "move" your log value into the b and c fields. If string transformations are needed on the log column, the parts that reference log (e.g. log[0]) need to be changed to include those transformations.
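A minimal PySpark sketch of that idea on Spark 2.4 (not the answer's original code): it rebuilds the nested arrays with the SQL transform higher-order function via expr and simply copies the raw log string into both b and c. The sample data, the session setup, and the named_struct layout are assumptions made for illustration.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up sample data matching the schema from the question:
# x: array<struct<y: long, z: array<struct<log: string>>>>
df = spark.createDataFrame(
    [([(1, [("some log text",)])],)],
    "x array<struct<y: bigint, z: array<struct<log: string>>>>",
)

# Spark 2.4 exposes higher-order functions (transform) only through SQL
# expressions, so the nested arrays are rebuilt inside F.expr.
# The log string is simply copied into both b and c; replace the two
# e.log references with whatever splitting logic your data actually needs.
result = df.withColumn(
    "x",
    F.expr("""
        transform(x, s -> named_struct(
            'y', s.y,
            'z', transform(s.z, e -> named_struct(
                'log', named_struct('b', e.log, 'c', e.log)
            ))
        ))
    """),
)

result.printSchema()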
Higher-order functions are your friend in this case. Coalesce, basically. Code below:
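As a hedged illustration of that approach (again, not the answer's original code), the sketch below combines transform with from_json, assuming each log string is a JSON document carrying b and c keys, and uses coalesce to fall back to a hand-built struct when parsing fails. The JSON assumption, the sample rows, and the fallback layout are made up for the example.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up sample where one log string is valid JSON and one is not.
df = spark.createDataFrame(
    [([(1, [('{"b": "foo", "c": "bar"}',), ("not json",)])],)],
    "x array<struct<y: bigint, z: array<struct<log: string>>>>",
)

# from_json returns NULL when the string does not parse, so coalesce
# falls back to a struct that keeps the raw line in b and leaves c null.
parsed = df.withColumn(
    "x",
    F.expr("""
        transform(x, s -> named_struct(
            'y', s.y,
            'z', transform(s.z, e -> named_struct(
                'log', coalesce(
                    from_json(e.log, 'b string, c string'),
                    named_struct('b', e.log, 'c', cast(null as string))
                )
            ))
        ))
    """),
)

parsed.printSchema()
parsed.show(truncate=False)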