Non-partitioned table schema not updated by Glue ETL job
We have an ETL job that uses the below code snippet to update the catalog table:
sink = glueContext.getSink(connection_type='s3', path=config['glue_s3_path_bc'], enableUpdateCatalog=True, updateBehavior='UPDATE_IN_DATABASE')
sink.setFormat('glueparquet')
sink.setCatalogInfo(catalogDatabase=config['glue_db'], catalogTableName=config['glue_table_bc'], catalogId=args['catalog_id'])
sink.writeFrame(dyF)
The table is non-partitioned and needs to be overwritten with new data daily. Since glueContext does not support overwrite, we use the purge_s3_path and purge_table methods to empty the S3 location in a step before the write above. We do the same for partitioned tables, and it has been working fine for us so far.
Recently, the schema of the data changed (a few new columns were added). When the ETL job completed, it updated the partitioned table with the new schema, but the schema of the non-partitioned table stayed the same. We verified by inspecting the S3 files directly that the new fields are present in the data files. Why is the schema not updated the way it is for the partitioned table? Is there a different method we can use?
Comments (1)
That's a known limitation.
If a table is not partitioned, you can just drop it and let it be recreated.
Notice in the docs: "Schema updates are not supported for non-partitioned tables (not using the "partitionKeys" option)"
https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html
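A minimal sketch of the "drop it and let it be recreated" step, assuming the job can call the Glue API through boto3. The helper name drop_table_if_exists is hypothetical, not part of Glue or boto3; it wraps the boto3 Glue client's delete_table call and treats a missing table as a no-op. As far as I know, purge_table only removes the data files and leaves the table definition in the catalog, which is why a separate drop is needed.

```python
def drop_table_if_exists(glue_client, database, table, catalog_id=None):
    """Drop a Data Catalog table; return True if dropped, False if it was absent.

    `glue_client` is expected to be a boto3 Glue client (boto3.client("glue")).
    The helper itself is a hypothetical convenience wrapper.
    """
    kwargs = {"DatabaseName": database, "Name": table}
    if catalog_id:
        # CatalogId is optional; when omitted, the caller's account is used.
        kwargs["CatalogId"] = catalog_id
    try:
        glue_client.delete_table(**kwargs)
        return True
    except glue_client.exceptions.EntityNotFoundException:
        # Table does not exist yet (e.g. first run): nothing to drop.
        return False


# Assumed usage inside the ETL job, before the sink write from the question:
#
#   import boto3
#   glue = boto3.client("glue")
#   drop_table_if_exists(glue, config["glue_db"], config["glue_table_bc"],
#                        catalog_id=args["catalog_id"])
#   sink.writeFrame(dyF)  # with enableUpdateCatalog=True the table is created again
```

Dropping the table discards anything set on it by hand (table properties, comments), so this only suits tables that the job fully owns.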