PySpark - Glue 3.0 release, upgrading to Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z

Posted on 2025-01-28 03:56:22


After upgrading to Glue 3.0, I got the following error when handling RDD objects:

An error occurred while calling o926.javaToPython. You may get a
different result due to the upgrading of Spark 3.0: reading dates
before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from
Parquet files can be ambiguous, as the files may be written by Spark
2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See
more details in SPARK-31404. You can set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to
rebase the datetime values w.r.t. the calendar difference during
reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to
'CORRECTED' to read the datetime values as it is.

I've already added the config mentioned in the doc:

--conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED
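For context, a hedged sketch of how such a value is typically attached to a Glue 3.0 job definition programmatically; the job name, role ARN, and script location below are placeholders, not from the original post:

# Sketch: passing the settings as the "--conf" default argument of a Glue 3.0 job via boto3.
# All names, ARNs, and paths are placeholders.
import boto3

glue = boto3.client("glue")
glue.update_job(
    JobName="my-glue-job",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/my-glue-role",
        "Command": {"Name": "glueetl", "ScriptLocation": "s3://your-bucket/scripts/job.py"},
        "GlueVersion": "3.0",
        "DefaultArguments": {
            "--conf": "spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED "
                      "--conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED "
                      "--conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED",
        },
    },
)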

This is really a blocking issue that prevents the Glue jobs from running!

Note: locally I'm using PySpark 3.1.2, and it works with no problem on the same data.
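For reference, the settings the error message points to can also be applied when building a plain PySpark session outside Glue; a minimal sketch with a placeholder Parquet path:

# Minimal local sketch (plain PySpark, not Glue): the same options can be passed
# through SparkSession.builder. The Parquet path is a placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rebase-mode-check")
    .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
    .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
    .getOrCreate()
)

df = spark.read.parquet("path/to/parquet")  # placeholder path
df.show(5)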


Comments (3)

白况 2025-02-04 03:56:22


I faced the same issue and followed the AWS doc, since the general Glue recommendation is not to set the --conf parameter ourselves, as it is used internally. My solution involved the following:

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Build the configuration with the Parquet rebase settings before any context exists
conf = SparkConf()
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")

# Create the SparkContext with this configuration, so no restart is needed later
sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

The problem I faced with Mauricio's answer was that sc.stop() actually stops execution on the Spark context when using Glue 3.0, and it disrupted the stream of data I was ingesting from the data source (RDS in my case).
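As a quick sanity check (the S3 path below is a placeholder, not from the original answer), a read through the Glue context created above should now succeed without the rebase error:

# Hypothetical read after the context above was created with CORRECTED rebase modes;
# the S3 path is a placeholder.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-bucket/path/to/parquet/"]},
    format="parquet",
)
print(dyf.count())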

匿名。 2025-02-04 03:56:22


I solved it like this. Default setup below:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job setup; args is assumed to come from getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

Then add the additional Spark configurations and recreate the context:

# Copy the current configuration and add the Parquet rebase settings
conf = sc.getConf()
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")

# Restart the Spark context so the new configuration takes effect
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)

... your code
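One way to confirm the restarted context actually picked up the settings (a small sketch, not part of the original answer):

# Re-grab the session from the recreated GlueContext and check the values;
# they should all print "CORRECTED".
spark = glueContext.spark_session
print(spark.conf.get("spark.sql.legacy.parquet.datetimeRebaseModeInRead"))
print(spark.conf.get("spark.sql.legacy.parquet.int96RebaseModeInRead"))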

榆西 2025-02-04 03:56:22


Setting it on the SparkContext did not work for me; I had to set it on the spark_session.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()  # version 3.1.1-amzn-0
conf = sc.getConf()
# NO: setting the option on the SparkContext configuration did not work for me.
# conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")  # SPARK-31404

sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# YES: setting it on the spark_session does work.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")  # SPARK-31404
print(spark.conf.get("spark.sql.legacy.parquet.datetimeRebaseModeInWrite"))

This may depend on how you are using sc and spark; I was querying with:

df = spark.read.format("xml").etc   
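As a follow-up sketch (the data and output path are made up, not from the original answer): once datetimeRebaseModeInWrite is set to LEGACY on the session, writing rows with pre-1900 timestamps to Parquet should go through without the rebase error:

# Sketch: write a pre-1900 timestamp to Parquet with the LEGACY rebase mode set above.
# The output path is a placeholder.
import datetime

old_df = spark.createDataFrame(
    [(1, datetime.datetime(1880, 1, 1, 0, 0, 0))],
    ["id", "event_ts"],
)
old_df.write.mode("overwrite").parquet("s3://your-bucket/legacy-ts-output/")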