AWS Glue, multiple bookmarks in the same job?

Posted 2025-01-12 23:55:40


Let's say I have a script that loads multiple frames with different schemas:

job.init()

DefinitionDyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [DefinitionPath], "recurse": True},
    format="csv",
    format_options={"withHeader": True},
    transformation_ctx="DefinitionBookmark",
)

TypeDyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [TypePath], "recurse": True},
    format="csv",
    format_options={"withHeader": True},
    transformation_ctx="TypeBookmark",
)

I do some transformations, then write to another bucket, and end the script with

job.commit()

Would both bookmarks be updated, or just the first one? Is it recommended to split up bookmarks like this? Most of the examples I saw had only one bookmark per job.


1 Comment

情话难免假 answered 2025-01-19 23:55:41


I believe the answer is yes, based on this documentation. It says there can be N "state elements" (plural) within a single state (one script with job.commit()). Each state element is specific to each source... I am thinking of doing the same thing with dynamically generated sources, all with their own state element.

"Job bookmarks store the states for a job. Each instance of the state is keyed by a job name and a version number. When a script invokes job.init, it retrieves its state and always gets the latest version. Within a state, there are multiple state elements, which are specific to each source, transformation, and sink instance in the script. These state elements are identified by a transformation context that is attached to the corresponding element (source, transformation, or sink) in the script. The state elements are saved atomically when job.commit is invoked from the user script.... The state elements in the job bookmark are source, transformation, or sink-specific data."

Further down in that page there is an example JSON, with an attribute, "states", containing multiple objects.

{
  "job_name": ...,
  "run_id": ...,
  "run_number": ...,
  "attempt_number": ...,
  "states": {
    "transformation_ctx1": {
      bookmark_state1
    },
    "transformation_ctx2": {
      bookmark_state2
    }
  }
}
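To make that shape concrete, here is a small, runnable sketch of such a bookmark document. The field values (job name, run id, the "last_processed" keys) are invented for illustration; the point is just that "states" maps each transformation_ctx to its own state element.

```python
import json

# Hypothetical bookmark document shaped like the JSON above.
# All values here are made up for illustration.
bookmark = json.loads("""
{
  "job_name": "my-etl-job",
  "run_id": "jr_example",
  "run_number": 7,
  "attempt_number": 0,
  "states": {
    "DefinitionBookmark": {"last_processed": "s3://bucket/definitions/part-0003"},
    "TypeBookmark": {"last_processed": "s3://bucket/types/part-0001"}
  }
}
""")

# One state element per transformation_ctx: each source tracks its own progress.
for ctx, state in bookmark["states"].items():
    print(ctx, "->", state["last_processed"])
```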

Then, in this documentation's examples, the code shows the transformation_ctx parameter for both source and target. Given that, and since nothing describes any limitation on handling multiple sources, the following should work:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "database",
    table_name = "relatedqueries_csv",
    transformation_ctx = "datasource0"
)

datasource1 = glueContext.create_dynamic_frame.from_catalog(
    database = "database",
    table_name = "relatedqueries1_csv",
    transformation_ctx = "datasource1"
)

It makes sense to use as many transformation_ctx parameters as needed, as long as they are uniquely named (see here: https://docs.aws.amazon.com/glue/latest/dg/programming-etl-connect-bookmarks.html#monitor-continuations-implement-context):

"...transformation_ctx, which is a unique identifier for the ETL operator
instance. The transformation_ctx parameter is used to identify state
information within a job bookmark for the given operator.
...The transformation_ctx serves as the key to search the bookmark state for a specific source in your script."
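The commit behavior the quote describes can be sketched with a toy model. This is not the awsglue API, just a minimal illustration of the idea that each operator instance stages state under its own unique transformation_ctx key, and a single job.commit-style call persists all of the staged state elements together.

```python
class ToyBookmark:
    """Toy model (not awsglue) of per-context bookmark state elements."""

    def __init__(self):
        self._pending = {}  # state elements staged during the run
        self._saved = {}    # what the last commit persisted

    def record(self, transformation_ctx, state):
        # Each operator instance stages state under its own unique key.
        self._pending[transformation_ctx] = state

    def commit(self):
        # All staged state elements are persisted by the one commit,
        # mirroring the atomic save the documentation describes.
        self._saved.update(self._pending)
        self._pending.clear()
        return self._saved


bm = ToyBookmark()
bm.record("DefinitionBookmark", {"pos": 3})
bm.record("TypeBookmark", {"pos": 1})
saved = bm.commit()
print(sorted(saved))  # both contexts persisted by the single commit
```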
