PySpark: Job aborted due to stage failure: Stage 86.0 failed 1 time. Possible cause: Parquet column cannot be converted

Posted 2025-02-06 12:18:43


I am facing some issues while writing Parquet files from one blob to another. Below is the code I'm using.

df = spark.read.load(FilePath1,
                     format="parquet", modifiedAfter=datetime)
spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")
df.coalesce(1).write.format("parquet").mode("overwrite").save(FilePath2)
Error - 
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 86.0 failed 1 times, most recent failure: Lost task 3.0 in stage 86.0 (TID 282) (10.0.55.68 executor driver): com.databricks.sql.io.FileReadException: Error while reading file dbfs:file.parquet. Possible cause: Parquet column cannot be converted.
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableInt cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong.

Any help is appreciated. Thanks.
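One way to narrow this down, before changing any configuration, is to check whether all of the Parquet files under FilePath1 actually share the same schema: the ClassCastException (MutableInt cannot be cast to MutableLong) suggests that a column may be stored with different widths in different files. This is only a diagnostic sketch, assuming the same FilePath1 as above and a Databricks/PySpark session where spark is already defined:

# Diagnostic sketch (not from the original post): compare the schema Spark
# inferred for the whole dataset against the schema of each underlying file.
df = spark.read.load(FilePath1, format="parquet")
df.printSchema()

# DataFrame.inputFiles() returns the Parquet files that back this DataFrame.
for f in df.inputFiles():
    print(f)
    spark.read.parquet(f).printSchema()

If the column named in the cast error shows up as int in some files and long (or decimal) in others, that mismatch is what the reader is tripping over.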


Comments (1)

暮凉 2025-02-13 12:18:43

  • The likely cause of this error is that a decimal-type column is being decoded into binary format by the vectorized Parquet reader.

  • For reading datasets in Parquet files, the vectorized Parquet reader is enabled by default in Databricks Runtime 7.3 and higher. Binary, boolean, date, text, and timestamp are all atomic data types used in the read schema.

  • The solution is: if your source data contains decimal-type columns, disable the vectorized Parquet reader (an end-to-end sketch combining this with the question's code follows after this list).

  • To disable the vectorized Parquet reader at the cluster level, set spark.sql.parquet.enableVectorizedReader to false in the cluster’s Spark configuration.

  • At the notebook level, you can also disable the vectorized Parquet reader by running:

    spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")

References:
Apache Spark job fails with Parquet column cannot be converted error
Pyspark job aborted error due to stage failure
