pyspark: Job aborted due to stage failure: Stage 86.0 failed 1 time. Possible cause: Parquet column cannot be converted
I'm running into an issue while writing a Parquet file from one blob to another. Below is the code I'm using.
df = spark.read.load(FilePath1,
                     format="parquet", modifiedAfter=datetime)
spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")
df.coalesce(1).write.format("parquet").mode("overwrite").save(FilePath2)
Error -
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 86.0 failed 1 times, most recent failure: Lost task 3.0 in stage 86.0 (TID 282) (10.0.55.68 executor driver): com.databricks.sql.io.FileReadException: Error while reading file dbfs:file.parquet. Possible cause: Parquet column cannot be converted.
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableInt cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong.
Any help would be appreciated. Thanks.
1 Answer
The cause of this error is possibly that a decimal type column is being decoded into binary format by the vectorized Parquet reader.
The vectorized Parquet reader is enabled by default in Databricks Runtime 7.3 and above for reading datasets in Parquet files. The read schema uses atomic data types: binary, boolean, date, text, and timestamp.
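To see whether it is currently enabled in your session, you can run:

spark.conf.get("spark.sql.parquet.enableVectorizedReader")  # "true" means the vectorized reader is active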
The solution is to disable the vectorized Parquet reader if your source data contains decimal type columns.
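To confirm the diagnosis, a minimal sketch that lists the decimal columns in the source schema (assuming df is the DataFrame read in the question; schema inspection is metadata-only, so it won't trigger the failing scan):

from pyspark.sql.types import DecimalType

# Collect the names of all columns whose type is DecimalType
decimal_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, DecimalType)]
print(decimal_cols)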
To disable the vectorized Parquet reader at the cluster level, set spark.sql.parquet.enableVectorizedReader to false in the cluster's Spark configuration.
At the notebook level, you can also disable the vectorized Parquet reader by running:
spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")
References:
Apache Spark job fails with Parquet column cannot be converted error
Pyspark job aborted error due to stage failure