Parse a PySpark DataFrame column with varying dictionary keys into a new column for one key
I have an input PySpark DataFrame df. The DataFrame df has a column "field1" whose values are dictionaries. The dictionaries do not all have the same keys. I would like to parse the "b" key into a new field "newcol". To further complicate things, field1 is of datatype string. I've tried the code below, but I'm getting the error that follows it. Does anyone have a suggestion for how to do this?
input df:
+--+---------------------+
|id|field1               |
+--+---------------------+
| 1|{"a":1,"b":"f"}      |
| 2|{"a":1,"b":"e","c":3}|
+--+---------------------+
output df:
+--+---------------------+------+
|id|field1               |newcol|
+--+---------------------+------+
| 1|{"a":1,"b":"f"}      |'f'   |
| 2|{"a":1,"b":"e","c":3}|'e'   |
+--+---------------------+------+
code:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import ArrayType, StringType

df.select(
    col('id'),
    col('field1'),
    from_json(col("field1"), ArrayType(StringType())).getItem("b")
).show(truncate=False)
error:
An error was encountered:
An error occurred while calling o571.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 8.0 failed 4 times, most recent failure: Lost task 2.3 in stage 8.0 (TID 59, ip-10-100-190-16.us-west-2.compute.internal, executor 49): org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to download file path: s3://treez-data-lake/product_api/products/part-00005-ab9a676d-e9fa-4594-a998-77e8ae0dd95b-c000.snappy.parquet, range: 0-41635218, partition values: [empty row], isDataPresent: false
...
answer:
Let's try the foreign (non-Spark) library function literal_eval, from Python's ast module, to convert the string to a MapType, and then use the PySpark function map_values to get the values into a list and slice by index.
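A minimal sketch of that approach, not the answerer's exact code: the to_map UDF name and the inline sample data are illustrative, element_at assumes Spark 2.4+, and slicing by position assumes "b" is the second key in every row (as it is in the question's data).

from ast import literal_eval

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, element_at, map_values, udf
from pyspark.sql.types import MapType, StringType

spark = SparkSession.builder.getOrCreate()

# Sample data mirroring the question's input df.
df = spark.createDataFrame(
    [(1, '{"a":1,"b":"f"}'), (2, '{"a":1,"b":"e","c":3}')],
    ["id", "field1"],
)

# Parse the string with literal_eval into a map column; values are
# cast to strings so the map has a uniform value type.
to_map = udf(
    lambda s: {k: str(v) for k, v in literal_eval(s).items()} if s else None,
    MapType(StringType(), StringType()),
)

df.withColumn(
    "newcol",
    # map_values collects the map's values into an array;
    # element_at(..., 2) takes the second value, which is "b" here.
    element_at(map_values(to_map(col("field1"))), 2),
).show(truncate=False)

Note that slicing by position only works while "b" happens to be the second key in every row; looking the value up by key on the parsed map, e.g. to_map(col("field1")).getItem("b"), avoids that assumption.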