How to write a NullType field in Parquet format in PySpark?
I am reading a JSON file and letting Spark infer the schema. One of the fields is arr: [], so when I try to write this JSON object in Parquet format, it throws an error: An error was encountered: 'Parquet data source does not support array<null> data type.;'
I have reproduced the error with the code below (the example defines the schema explicitly, but the actual code uses a Glue DynamicFrame):
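from pyspark.context import SparkContext
from pyspark.sql.types import ArrayType, NullType, StringType, StructField, StructType
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

data = {"key": "val", "arr": []}
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
schema = StructType([
    StructField("key", StringType()),
    StructField("arr", ArrayType(NullType())),  # empty array, so the element type is null
])
df = spark.createDataFrame([data], schema)
ddf = DynamicFrame.fromDF(df, glueContext, 'glue_df')
ddf.toDF().write.parquet("/home/file1")  # raises the array<null> error quoted above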
For now, since the arr field is empty, its element type is inferred as NullType(). That will not always be the case, because later files could contain values there, making it StringType() or some other type.
I want to update the code so that either the field is dropped or the NullType is cast to StringType. I tried attribute_master = DropNullFields.apply(frame=ddf), but it didn't work. What could be a workaround?