Glue PySpark error: pyWriteDynamicFrame. org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
I have spent over 2 days debugging and trying to figure out what's going on, without success. AWS Support has not been able to help me here either, so let me explain.
I have a table called provisioned.customer, which is defined in Glue. This table is in the provisioned database.
I have another 2 tables, curated.customer_consents and curated.customer, with their respective schemas: https://i.sstatic.net/gg7rj.png
I have set all data types to string to simplify the process.
Now I have a Glue 3.0 job with PySpark, and the following code:
# more imports...
import functools

import pyspark.sql.types as T
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F

# UDF that concatenates a list of strings into a single string
def fudf(val):
    return functools.reduce(lambda x, y: x + y, val)

flattenUdf = F.udf(fudf, T.StringType())

customer_df = spark.read.table('curated.customer')
core_customer_consents_df = spark.read.table('curated.core_customer_consents')

# collect all purposes per customer into an array column
core_customer_consents_df = core_customer_consents_df.groupBy("tscid").agg(F.collect_list('purpose').alias('__purposes__'))

merged = customer_df.join(core_customer_consents_df, ['tscid'], how='inner')
merged = merged.select("*", flattenUdf("__purposes__").alias("cirrus_purposes"))
merged = merged.select([c for c in merged.columns if c not in {'__purposes__'}])

logger.info(merged)

glueContext.write_dynamic_frame.from_catalog(
    frame=DynamicFrame.fromDF(merged, glueContext, "a"),
    database='provisioned',
    table_name='customer',
    transformation_ctx="datasource0",
)
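As an aside, the same flattening can be done without a Python UDF using Spark's built-in array_join (available since Spark 2.4, so it is present on Glue 3.0). A minimal sketch, assuming an empty separator reproduces the UDF's plain concatenation:

# equivalent to flattenUdf, but stays in the JVM (no Python UDF overhead)
merged = merged.select("*", F.array_join("__purposes__", "").alias("cirrus_purposes"))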
This is the output of the logger.info call:
2022-06-15 08:35:26 INFO logger: DataFrame[tscid: string, first_name: string, last_name: string, gender: string, birthdate: string, qualification: string, country: string, estimated_annual_earning: string, cirrus_purposes: string]
But after executing this, I face the following issue:
An error occurred while calling o149.pyWriteDynamicFrame. org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
But I have checked that the schemas match, and they are exactly the same (see the logger.info output above), even in the same column order. Also, on my local machine this generates the data frame correctly. So I don't know what's really going on.
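For reference, one way to put both schemas side by side is to print the merged DataFrame's schema next to the schema Glue has registered for the target table (a rough sketch using the standard Spark/Glue calls):

# schema Spark inferred for the data about to be written
merged.printSchema()

# schema Glue has registered for the target table
target_dyf = glueContext.create_dynamic_frame.from_catalog(
    database='provisioned',
    table_name='customer',
)
target_dyf.printSchema()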
Have you faced this? Thank you in advance.
1 Answer
I have found the problem. I was generating the Parquet files from CSV using PyArrow. The column estimated_annual_earning was being generated as long. So when reading it there were apparently no issues, due to Spark's lazy evaluation, but the error surfaced when the action (the insert) was triggered, because of the data type mismatch when reading curated.customer. Solution: generate the Parquet files with the same data types as the ones declared in your Glue metastore.
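A minimal sketch of that fix (the file paths here are placeholders): force every column to string while converting the CSV with PyArrow, so the Parquet schema matches the all-string Glue table.

import pyarrow as pa
import pyarrow.csv as pv
import pyarrow.parquet as pq

# read_csv infers types by default, so a numeric-looking column such as
# estimated_annual_earning comes back as int64 (long)
table = pv.read_csv("customer.csv")

# cast every column to string so the Parquet schema matches the Glue
# catalog definition, which declares all columns as string
all_string = pa.schema([pa.field(name, pa.string()) for name in table.column_names])
pq.write_table(table.cast(all_string), "customer.parquet")

Alternatively, pyarrow.csv.ConvertOptions(column_types=...) lets you pin the problematic column's type at read time instead of casting afterwards.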