Glue PySpark error: pyWriteDynamicFrame. org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary



I have spent over 2 days debugging and trying to figure out what's going on, without success. AWS Support has not been able to help me here either, so let me explain.

I have a table called provisioned.customer, which is defined in Glue. This table is in the provisioned database.

(screenshot: provisioned.customer schema)

I have two other tables, curated.customer_consents and curated.customer, with their respective schemas:

(screenshot: curated.customer_consents schema)

(screenshot: curated.customer schema)

I have set all data types to string to simplify the process.

Now I have a Glue 3.0 job using PySpark, with the following code:

# more imports...
import functools

import pyspark.sql.types as T
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F

# Concatenate the collected list of purpose strings into a single string
def fudf(val):
    return functools.reduce(lambda x, y: x + y, val)

flattenUdf = F.udf(fudf, T.StringType())

customer_df = spark.read.table('curated.customer')
core_customer_consents_df = spark.read.table('curated.core_customer_consents')

# Collect all consent purposes per tscid into a list column
core_customer_consents_df = core_customer_consents_df.groupBy("tscid").agg(F.collect_list('purpose').alias('__purposes__'))

# Join onto the customer table, flatten the list into a single string column, then drop the helper column
merged = customer_df.join(core_customer_consents_df, ['tscid'], how='inner')
merged = merged.select("*", flattenUdf("__purposes__").alias("cirrus_purposes"))
merged = merged.select([c for c in merged.columns if c not in {'__purposes__'}])
logger.info(merged)

glueContext.write_dynamic_frame.from_catalog(
    frame=DynamicFrame.fromDF(merged, glueContext, "a"),
    database = 'provisioned', 
    table_name = 'customer', 
    transformation_ctx = "datasource0", 
)

This is the output of the logger.info call:

2022-06-15 08:35:26 INFO logger: DataFrame[tscid: string, first_name: string, last_name: string, gender: string, birthdate: string, qualification: string, country: string, estimated_annual_earning: string, cirrus_purposes: string]

But after executing this, I face the following issue:

An error occurred while calling o149.pyWriteDynamicFrame. org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
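
As a sanity check (this sketch is mine, not part of the original job), the catalog-side definition of the target table can be dumped with boto3's Glue client and compared against the DataFrame schema logged above; it assumes the role running it is allowed to call glue:GetTable:

import boto3

# Print the column names and types that the Glue Data Catalog holds for provisioned.customer
glue_client = boto3.client("glue")
target = glue_client.get_table(DatabaseName="provisioned", Name="customer")
for col in target["Table"]["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])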


But I have checked that the schemas match: they are exactly the same (see the logger.info output above), even in the same column order. Also, running this locally generates the data frame correctly. So I don't know what's really going on.
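
Another check worth doing is comparing the physical Parquet schema against the catalog, since the two can disagree; a minimal sketch, where the S3 path is a placeholder for the real location of curated.customer:

# Read the Parquet files directly, bypassing the Glue catalog, and print the schema taken from the file footers.
# A column reported as long/bigint here while the catalog says string would explain a dictionary-decoding error.
physical_df = spark.read.parquet("s3://my-bucket/curated/customer/")
physical_df.printSchema()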

Have you faced this? Thank you in advance.


1 Comment

驱逐舰岛风号 2025-02-14 17:37:42


I have found the problem. I was generating the Parquet files from CSV using PyArrow, and the column estimated_annual_earning was being generated as long. Whenever I read it there were apparently no issues, due to Spark's lazy evaluation, but the error appeared once the action (the insert) was triggered, because of the data type mismatch when reading curated.customer.

Solution: generate the Parquet files with the same data types as those defined in your Glue metastore.
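
A minimal sketch of that fix with PyArrow, where the input and output file names are placeholders; the point is to force estimated_annual_earning to string so the Parquet schema matches the all-string table definition in the catalog:

import pyarrow as pa
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Force the column to string instead of letting PyArrow infer long from the CSV values
convert_options = pv.ConvertOptions(column_types={"estimated_annual_earning": pa.string()})
table = pv.read_csv("customer.csv", convert_options=convert_options)
pq.write_table(table, "customer.parquet")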
