Parse a PySpark DataFrame column with varying dictionary keys into a new column for one key
I have an input PySpark DataFrame df. The DataFrame df has a column "field1" whose values are dictionaries. The dictionaries do not all have the same keys. I would like to parse the "b" key into a new field "newcol". To further complicate things, field1 is of datatype string. I've tried the code below, but I'm getting the error that follows it. Does anyone have a suggestion for how to do this?
input df:
+--+---------------------+
|id|field1               |
+--+---------------------+
| 1|{"a":1,"b":"f"}      |
| 2|{"a":1,"b":"e","c":3}|
+--+---------------------+
output df:
+--+---------------------+------+
|id|field1               |newcol|
+--+---------------------+------+
| 1|{"a":1,"b":"f"}      |'f'   |
| 2|{"a":1,"b":"e","c":3}|'e'   |
+--+---------------------+------+
code:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import ArrayType, StringType

df.select(
    col('id'),
    col('field1'),
    from_json(col("field1"), ArrayType(StringType())).getItem("b")
).show(truncate=False)
error:
An error was encountered:
An error occurred while calling o571.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 8.0 failed 4 times, most recent failure: Lost task 2.3 in stage 8.0 (TID 59, ip-10-100-190-16.us-west-2.compute.internal, executor 49): org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to download file path: s3://treez-data-lake/product_api/products/part-00005-ab9a676d-e9fa-4594-a998-77e8ae0dd95b-c000.snappy.parquet, range: 0-41635218, partition values: [empty row], isDataPresent: false
...
answer:
Let's try the foreign (non-Spark) library function literal_eval, from Python's ast module, to convert the string to a MapType, and then use the PySpark function map_values to get the values into a list and slice by index.
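A minimal sketch of that approach, not the answerer's exact code: the to_map UDF name and the inline sample data are illustrative, element_at assumes Spark 2.4+, and slicing by position assumes "b" is the second key in every row (as it is in the question's data).

from ast import literal_eval

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, element_at, map_values, udf
from pyspark.sql.types import MapType, StringType

spark = SparkSession.builder.getOrCreate()

# Sample data mirroring the question's input df.
df = spark.createDataFrame(
    [(1, '{"a":1,"b":"f"}'), (2, '{"a":1,"b":"e","c":3}')],
    ["id", "field1"],
)

# Parse the string with literal_eval into a map column; values are
# cast to strings so the map has a uniform value type.
to_map = udf(
    lambda s: {k: str(v) for k, v in literal_eval(s).items()} if s else None,
    MapType(StringType(), StringType()),
)

df.withColumn(
    "newcol",
    # map_values collects the map's values into an array;
    # element_at(..., 2) takes the second value, which is "b" here.
    element_at(map_values(to_map(col("field1"))), 2),
).show(truncate=False)

Note that slicing by position only works while "b" happens to be the second key in every row; looking the value up by key on the parsed map, e.g. to_map(col("field1")).getItem("b"), avoids that assumption.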