在分支PCollections的背景下如何处理？

发布于 2025-02-09 13:52:23 字数 2188 浏览 0 评论 0原文

Apache Beam跑步者可以执行 fusion 以优化图形执行。但是，在某些情况下，融合可能会导致次优性能。在Cloud DataFlow中，可以通过插入reshuffle变换或GroupByKey，它可以避免这种情况，从而破坏融合，按 decloveing-a-a-a-pipeline＃fusion-optimization 。否则，融合了几个ptransforms 导致融合阶段。值得注意的是，reshuffle被陈述为已弃用，有一个问题：（ apache beam/dataflow rephfeple已弃用，使用什么？）转换为窗口的groupByKey，然后进行峰值扩展。

如果发生故障，Cloud DataFlow将重试失败的捆绑包（用于批处理模式，最多可用于流式传输模式，最多4次），则根据部署管道＃错误和exception处理。重试仅在Apache Beam模型中的失败变换上发生，将其耦合（耦合失败）

除非基于此理解

，否则我执行了以下管道（简化为简洁）：管道a：读取文件＆gt;＆gt;＆gt;解析文件（高风扇 - 帕多转换）＆gt;＆gt; [写信给BigQuery，写入pub/sub]
管道B：读取文件＆gt;＆gt;解析文件（高风扇 - 帕多转换）＆gt;＆gt; [Reshuffle＆gt;＆gt;写信给BigQuery，写入pub/sub]

两个管道都写入不存在的酒吧/子主题以复制

管道A的错误场景A，我们观察到以下内容：

我们的pardo> （正在解析文件）正在重试，因为它无法写入pub/sub。
系统延迟正在增加，并且由于内部包含一个重新填充，因此bigquery没有输出，因此在reshuffle的groupbykey操作之前，每个步骤都是作为融合阶段的一部分而重试的，正如我们预期的那样（捆绑包具有错误的元素）。

对于管道B，我们观察到以下内容：

我们的pardo似乎并没有重试
，我们正在看到写入BigQuery的数据，即使数据流pub/sub指标中有错误（因此也有错误）希望重试的文件处理文件，甚至不再写入BigQuery）

，因为我以前的问题只会在下游观察到单个成功重试的输出/副作用：在工人的失败或捆绑捆绑中需要确切维护的处理？。。但是，管道A似乎符合此观察结果，而管道B则不符合。这些管道之间的唯一区别是添加了reshuffle ptransform，但都表现出不同的行为。

总而言之，我想我有以下问题：如何在融合阶段的背景下进行检索；特别是，与分支输出的融合阶段？例如，如果要重试捆扎，它是重试所有分支还是仅仅是失败的分支？

原文

Apache Beam runners can perform fusion to optimize the graph execution. However, in certain cases fusion may result in suboptimal performance.
In Cloud Dataflow, this can be avoided by inserting a Reshuffle Transform or GroupByKey, which breaks fusion, as per Deploying-a-pipeline#fusion-optimization . Otherwise, the fusion of several PTransforms
results in a fused Stage. Notably, a Reshuffle is stated to be deprecated, with the Question: (Apache Beam/Dataflow ReShuffle deprecated, what to use instead?) noting that Dataflow installs a ReshuffleOverrideFactory that essentially reduces the Reshuffle Transform to a windowed GroupByKey followed by iterable expansion.

In the event of a failure, Cloud Dataflow will retry the failed bundle (infinitely for streaming mode, and up to 4 times for batch mode), as per Deploying a Pipeline#error-and-exception-handling. The retry will only occur on the failing transform in the Apache Beam model,
unless it is coupled (Coupled Failing)

Based on this understanding, I have executed the following Pipelines(simplified for conciseness):

Pipeline A: Read Files >> Parse Files (High Fan-out ParDo transform) >> [Write to BigQuery, Write to Pub/Sub]
Pipeline B: Read Files >> Parse Files (High Fan-out ParDo transform) >> [Reshuffle >> Write to BigQuery, Write to Pub/Sub]

Both pipelines are writing to a non-existent Pub/Sub topic to replicate an error scenario

For pipeline A, we observed the following:

our ParDo (which is parsing the files) is retrying as since it could not write to Pub/Sub.
system latency is increasing and there is no output to BigQuery as it internally contains a Reshuffle, thus every step prior to the Reshuffle's groupbykey operation is being retried as part of the fused stage, as we expected (The bundle has an element with an error).

For pipeline B, we observed the following:

our ParDo does not appear to be retrying
Interestingly, we are seeing data written to BigQuery, even though there are errors reported in the Dataflow Pub/Sub Metrics(and thus expecting a retry to process the file instead of even writing to BigQuery)

Based on my previous question, I understand only the outputs/side effects of a single successful retry will be observed downstream: How is exactly-once processing maintained during worker failures or bundle retries?. However, Pipeline A appears to be conforming to this observation, while Pipeline B is not. The only difference between these Pipelines is the addition of the Reshuffle PTransform, yet both exhibit different behaviour.

In summary, I guess I have the following question:
How are retries handled in the context of a fused stage; in particular, fused stage with branching outputs? e.g. if there is a bundle retry, does it retry all branches or only the failing branch?

分享到QQ

分享到微博