AWS Glue: multiple bookmarks in the same job?
Let's say I have a script that loads multiple frames with different schemas:
job.init()
DefinitionDyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [DefinitionPath], "recurse": True},
    format="csv",
    format_options={"withHeader": True},
    transformation_ctx="DefinitionBookmark",
)
TypeDyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [TypePath], "recurse": True},
    format="csv",
    format_options={"withHeader": True},
    transformation_ctx="TypeBookmark",
)
I do some transformations, write the results to another bucket, and end the script with
job.commit()
Would both bookmarks be updated, or just the first one? Is it recommended to split up bookmarks like this? Most of the examples I saw had only one bookmark per job.
Comments (1)
I believe the answer is yes, based on this documentation. It says there can be N "state elements" (plural) for a single state (one script with job.commit()), and each state is specific to a source. I am thinking of doing the same thing with dynamically generated sources, each with its own state element.
Further down on that page there is an example JSON with an attribute, "states", that contains multiple objects.
Then, in this documentation example, the code passes the transformation_ctx parameter for both the source and the target. Nothing there describes any limitation on handling multiple sources, so this approach should work. It makes sense to use as many transformation_ctx parameters as needed, as long as each one is uniquely named (see here: https://docs.aws.amazon.com/glue/latest/dg/programming-etl-connect-bookmarks.html#monitor-continuations-implement-context).
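For the dynamically generated sources mentioned above, here is a minimal sketch of how each source could get its own uniquely named transformation_ctx. The helper names bookmark_ctx and load_csv_sources, and the "<name>Bookmark" naming scheme, are my own assumptions for illustration and are not part of the Glue API; the only Glue call used is create_dynamic_frame_from_options, with the same arguments as in the question. The GlueContext is passed in as a parameter, so nothing here imports awsglue.

```python
from typing import Any, Dict


def bookmark_ctx(source_name: str) -> str:
    """Build a transformation_ctx string from a source name (hypothetical scheme)."""
    return f"{source_name}Bookmark"


def load_csv_sources(glue_context: Any, sources: Dict[str, str]) -> Dict[str, Any]:
    """Read each S3 path into its own frame, each with its own transformation_ctx.

    `sources` maps a source name to its S3 path. Dict keys are unique, so every
    generated transformation_ctx is unique as well.
    """
    frames = {}
    for name, path in sources.items():
        frames[name] = glue_context.create_dynamic_frame_from_options(
            connection_type="s3",
            connection_options={"paths": [path], "recurse": True},
            format="csv",
            format_options={"withHeader": True},
            transformation_ctx=bookmark_ctx(name),
        )
    return frames


# Intended use inside a Glue job (not executed here; DefinitionPath and TypePath
# are the variables from the question):
#   job.init(args["JOB_NAME"], args)
#   frames = load_csv_sources(
#       glueContext, {"Definition": DefinitionPath, "Type": TypePath}
#   )
#   ... transformations and writes ...
#   job.commit()
```

With the two sources from the question this produces the same "DefinitionBookmark" and "TypeBookmark" context names. Whether a single job.commit() persists the state for every context is my reading of the documentation linked above, not something this sketch verifies.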