PySpark - Glue 3.0 issue after upgrading to Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z
After upgrading to Glue 3.0, I get the following error when handling RDD objects:
An error occurred while calling o926.javaToPython. You may get a
different result due to the upgrading of Spark 3.0: reading dates
before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from
Parquet files can be ambiguous, as the files may be written by Spark
2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See
more details in SPARK-31404. You can set
spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to
rebase the datetime values w.r.t. the calendar difference during
reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to
'CORRECTED' to read the datetime values as it is.
I've already added the config mentioned in the doc
--conf spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED
This is really a blocking issue that prevents the Glue jobs from running!
Note: locally I'm using pyspark 3.1.2, and for the same data it works with no problem.
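For reference, a minimal sketch of how the same three settings can be applied when building the SparkSession locally with pyspark 3.1.2 (the app name and the parquet path are placeholders, not from the original post):

from pyspark.sql import SparkSession

# Apply the same rebase settings locally that the Glue job passes via --conf.
spark = (
    SparkSession.builder
    .appName("rebase-mode-check")  # placeholder app name
    .config("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
    .config("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
    .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
    .getOrCreate()
)

df = spark.read.parquet("s3://my-bucket/path/to/data/")  # placeholder path
df.rdd.count()  # accessing the RDD is the kind of operation that triggered the error above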
Comments (3)
I faced the same issue by following the AWS doc, since the general Glue recommendation is that we should not set up and use the --conf parameter, as it is used internally. My solution involved the following:
The problem I faced using Mauricio's answer was that sc.stop() actually stops execution on the Spark context with Glue 3.0, and disrupts the stream of data I was ingesting from the data source (RDS in my case).
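A minimal sketch of that approach, assuming standard Glue 3.0 job boilerplate: build the SparkConf with the rebase options before the SparkContext is created, so sc.stop() is never needed (the variable names are illustrative):

from awsglue.context import GlueContext
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

# Set the rebase options on a SparkConf before any SparkContext exists,
# so the context never has to be stopped and recreated.
conf = SparkConf()
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session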
I solved it like this: start from the default Glue boilerplate, add the additional spark configurations, and then run your code (as sketched below).
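A sketch of that sequence under the assumption of standard Glue 3.0 boilerplate; restarting the context with sc.stop() is the step the previous answer warns about:

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Default Glue boilerplate
sc = SparkContext()

# Add additional spark configurations, then restart the context with them
conf = sc.getConf()
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)

glueContext = GlueContext(sc)
spark = glueContext.spark_session
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... your code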
Setting it on the SparkContext did not work for me; I had to set it in the spark_session.
This may depend on how you are using sc & spark; I was querying with: