PySpark: Job aborted due to stage failure: Stage 86.0 failed 1 time. Possible cause: Parquet column cannot be converted

Posted 2025-02-06 12:18:43


I am facing some issues while writing Parquet files from one blob to another. Below is the code I'm using.

df = spark.read.load(FilePath1,
                     format="parquet", modifiedAfter=datetime)
spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")
df.coalesce(1).write.format("parquet").mode("overwrite").save(FilePath2)
Error - 
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 86.0 failed 1 times, most recent failure: Lost task 3.0 in stage 86.0 (TID 282) (10.0.55.68 executor driver): com.databricks.sql.io.FileReadException: Error while reading file dbfs:file.parquet. Possible cause: Parquet column cannot be converted.
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableInt cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong.

Any help is appreciated. Thanks.
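One way to narrow this down, before changing any configuration, is to check whether all of the Parquet files under FilePath1 actually share the same schema: the ClassCastException (MutableInt cannot be cast to MutableLong) suggests that a column may be stored with different widths in different files. This is only a diagnostic sketch, assuming the same FilePath1 as above and a Databricks/PySpark session where spark is already defined:

# Diagnostic sketch (not from the original post): compare the schema Spark
# inferred for the whole dataset against the schema of each underlying file.
df = spark.read.load(FilePath1, format="parquet")
df.printSchema()

# DataFrame.inputFiles() returns the Parquet files that back this DataFrame.
for f in df.inputFiles():
    print(f)
    spark.read.parquet(f).printSchema()

If the column named in the cast error shows up as int in some files and long (or decimal) in others, that mismatch is what the reader is tripping over.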


Comments (1)

暮凉 2025-02-13 12:18:43

  • The likely cause of this error is that a decimal-type column is being decoded into binary format by the vectorized Parquet reader.

  • For reading datasets in Parquet files, the vectorized Parquet reader is enabled by default in Databricks Runtime 7.3 and higher. Binary, boolean, date, text, and timestamp are all atomic data types used in the read schema.

  • The solution is: if your source data contains decimal-type columns, disable the vectorized Parquet reader (an end-to-end sketch combining this with the question's code follows after this list).

  • To disable the vectorized Parquet reader at the cluster level, set spark.sql.parquet.enableVectorizedReader to false in the cluster’s Spark configuration.

  • At the notebook level, you can also disable the vectorized Parquet reader by running:

    spark.conf.set("spark.sql.parquet.enableVectorizedReader","false")

References:
Apache Spark job fails with Parquet column cannot be converted error
Pyspark job aborted error due to stage failure
