PySpark - Glue 3.0 release, upgrading to Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z

Posted on 2025-01-28 03:56:22


After upgrading to Glue 3.0, I got the following error when handling RDD objects:

An error occurred while calling o926.javaToPython. You may get a
different result due to the upgrading of Spark 3.0: reading dates
before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from
Parquet files can be ambiguous, as the files may be written by Spark
2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See
more details in SPARK-31404. You can set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to
rebase the datetime values w.r.t. the calendar difference during
reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to
'CORRECTED' to read the datetime values as it is.

I've already added the config mentioned in the doc:

--conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED
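For context, a hedged sketch of how such a value is typically attached to a Glue 3.0 job definition programmatically; the job name, role ARN, and script location below are placeholders, not from the original post:

# Sketch: passing the settings as the "--conf" default argument of a Glue 3.0 job via boto3.
# All names, ARNs, and paths are placeholders.
import boto3

glue = boto3.client("glue")
glue.update_job(
    JobName="my-glue-job",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/my-glue-role",
        "Command": {"Name": "glueetl", "ScriptLocation": "s3://your-bucket/scripts/job.py"},
        "GlueVersion": "3.0",
        "DefaultArguments": {
            "--conf": "spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED "
                      "--conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED "
                      "--conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED",
        },
    },
)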

This is really a blocking issue that prevents the Glue jobs from running!

Note: locally I'm using PySpark 3.1.2, and it works with no problem on the same data.
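For reference, the settings the error message points to can also be applied when building a plain PySpark session outside Glue; a minimal sketch with a placeholder Parquet path:

# Minimal local sketch (plain PySpark, not Glue): the same options can be passed
# through SparkSession.builder. The Parquet path is a placeholder.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rebase-mode-check")
    .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
    .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
    .getOrCreate()
)

df = spark.read.parquet("path/to/parquet")  # placeholder path
df.show(5)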


Comments (3)

白况 2025-02-04 03:56:22


I faced the same issue and followed the AWS doc, since the general Glue recommendation is not to set the --conf parameter ourselves, as it is used internally. My solution involved the following:

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Build the configuration with the Parquet rebase settings before any context exists
conf = SparkConf()
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")

# Create the SparkContext with this configuration, so no restart is needed later
sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

The problem I faced with Mauricio's answer was that sc.stop() actually stops execution on the Spark context when using Glue 3.0, and it disrupted the stream of data I was ingesting from the data source (RDS in my case).
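As a quick sanity check (the S3 path below is a placeholder, not from the original answer), a read through the Glue context created above should now succeed without the rebase error:

# Hypothetical read after the context above was created with CORRECTED rebase modes;
# the S3 path is a placeholder.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-bucket/path/to/parquet/"]},
    format="parquet",
)
print(dyf.count())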

匿名。 2025-02-04 03:56:22


I solved it like this. Default setup below:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Standard Glue job setup; args is assumed to come from getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

Then add the additional Spark configurations and recreate the context:

# Copy the current configuration and add the Parquet rebase settings
conf = sc.getConf()
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")

# Restart the Spark context so the new configuration takes effect
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)

... your code
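One way to confirm the restarted context actually picked up the settings (a small sketch, not part of the original answer):

# Re-grab the session from the recreated GlueContext and check the values;
# they should all print "CORRECTED".
spark = glueContext.spark_session
print(spark.conf.get("spark.sql.legacy.parquet.datetimeRebaseModeInRead"))
print(spark.conf.get("spark.sql.legacy.parquet.int96RebaseModeInRead"))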

榆西 2025-02-04 03:56:22


Setting it on the SparkContext did not work for me; I had to set it on the spark_session.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()  # version 3.1.1-amzn-0
conf = sc.getConf()
# NO: setting the option on the SparkContext configuration did not work for me.
# conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")  # SPARK-31404

sc = SparkContext.getOrCreate(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# YES: setting it on the spark_session does work.
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")  # SPARK-31404
print(spark.conf.get("spark.sql.legacy.parquet.datetimeRebaseModeInWrite"))

This may depend on how you are using sc and spark; I was querying with:

df = spark.read.format("xml").etc   
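As a follow-up sketch (the data and output path are made up, not from the original answer): once datetimeRebaseModeInWrite is set to LEGACY on the session, writing rows with pre-1900 timestamps to Parquet should go through without the rebase error:

# Sketch: write a pre-1900 timestamp to Parquet with the LEGACY rebase mode set above.
# The output path is a placeholder.
import datetime

old_df = spark.createDataFrame(
    [(1, datetime.datetime(1880, 1, 1, 0, 0, 0))],
    ["id", "event_ts"],
)
old_df.write.mode("overwrite").parquet("s3://your-bucket/legacy-ts-output/")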