Non-partitioned table schema not updated by Glue ETL job

Posted on 2025-01-28 20:37:22

We have an ETL job that uses the below code snippet to update the catalog table:

sink = glueContext.getSink(connection_type='s3', path=config['glue_s3_path_bc'], enableUpdateCatalog=True, updateBehavior='UPDATE_IN_DATABASE')
sink.setFormat('glueparquet')
sink.setCatalogInfo(catalogDatabase=config['glue_db'], catalogTableName=config['glue_table_bc'], catalogId=args['catalog_id'])
sink.writeFrame(dyF)

The table is non-partitioned and needs to be overwritten with new data daily. Since glueContext does not support overwrite, we use the purge_s3_path and purge_table methods to empty the S3 location one step before the write shown above. We do the same for partitioned tables, and it has worked fine for us so far.
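
For readers following along, the purge-then-write flow described above can be sketched roughly as below. This is a minimal sketch, not the job's actual code: the function name `overwrite_table` and its parameters are hypothetical, and `glue_context` is assumed to be an `awsglue.context.GlueContext`. Note that `purge_s3_path` keeps files newer than its `retentionPeriod` option (168 hours by default), so the sketch sets it to 0 to empty the path completely.

```python
def overwrite_table(glue_context, dyf, s3_path, database, table, catalog_id=None):
    """Empty the S3 location, then write the frame and update the catalog table.

    Hypothetical helper: glue_context is assumed to be an awsglue GlueContext
    and dyf a DynamicFrame.
    """
    # retentionPeriod is in hours; 0 removes every file under the path.
    glue_context.purge_s3_path(s3_path, options={"retentionPeriod": 0})

    # Same sink configuration as in the question.
    sink = glue_context.getSink(
        connection_type="s3",
        path=s3_path,
        enableUpdateCatalog=True,
        updateBehavior="UPDATE_IN_DATABASE",
    )
    sink.setFormat("glueparquet")

    catalog_info = {"catalogDatabase": database, "catalogTableName": table}
    if catalog_id is not None:
        catalog_info["catalogId"] = catalog_id
    sink.setCatalogInfo(**catalog_info)

    sink.writeFrame(dyf)
```

Inside a job this would be called as, for example, `overwrite_table(glueContext, dyF, config['glue_s3_path_bc'], config['glue_db'], config['glue_table_bc'], args['catalog_id'])`.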

Recently, the schema of the data changed (a few new columns were added). When the ETL job completed, it updated the partitioned table with the new schema, but the schema of the non-partitioned table stayed the same. We verified by inspecting the S3 files directly that the new fields are present in the data files. Why is the schema not updated the way it is for the partitioned table? Is there a different method we can use?
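
As an illustration of what a "different method" could look like, one option is to add the missing columns to the catalog table explicitly through the Glue Data Catalog API after the write. The sketch below is an assumption, not a confirmed fix for the behavior described: the helper names `merge_columns` and `add_missing_columns` are hypothetical, and `glue_client` is assumed to be a boto3 Glue client (`boto3.client("glue")`). It relies on the `get_table` and `update_table` calls, and strips the read-only fields that `get_table` returns but `update_table` rejects.

```python
# Keys accepted inside TableInput by update_table. get_table also returns
# read-only fields (CreateTime, DatabaseName, ...) that must be dropped.
TABLE_INPUT_KEYS = {
    "Name", "Description", "Owner", "LastAccessTime", "LastAnalyzedTime",
    "Retention", "StorageDescriptor", "PartitionKeys", "ViewOriginalText",
    "ViewExpandedText", "TableType", "Parameters", "TargetTable",
}


def merge_columns(existing, incoming):
    """Return the existing columns followed by incoming columns with new names."""
    known = {col["Name"] for col in existing}
    return list(existing) + [col for col in incoming if col["Name"] not in known]


def add_missing_columns(glue_client, database, table, incoming, catalog_id=None):
    """Append columns missing from the catalog table (hypothetical helper)."""
    extra = {"CatalogId": catalog_id} if catalog_id else {}

    current = glue_client.get_table(DatabaseName=database, Name=table, **extra)["Table"]
    table_input = {k: v for k, v in current.items() if k in TABLE_INPUT_KEYS}

    # Copy the storage descriptor so the fetched response is not mutated.
    storage = dict(table_input["StorageDescriptor"])
    storage["Columns"] = merge_columns(storage.get("Columns", []), incoming)
    table_input["StorageDescriptor"] = storage

    glue_client.update_table(DatabaseName=database, TableInput=table_input, **extra)
    return storage["Columns"]
```

Columns are passed in the catalog's own shape, for example `[{"Name": "new_col", "Type": "string"}]`, where `new_col` is a placeholder name. Existing columns are left untouched, so this only covers the case described here, where columns were added.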
