OOM error in a sub-step (GroupByKey) of CombineFn perKey with hotKeyFanout

Posted on 2025-01-15 05:42:07

I have a Beam batch job that runs on Dataflow. Here is a summary of the job's flow:

  • -> Read from BigQuery // hundreds of millions of records
  • -> Transform each BigQuery record into an Order event
  • -> Key each Order event by its customer ID
  • -> Apply a combine function (with hotKeyFanout = 50) to calculate the average price of all orders per customer
  • -> Transform the combined result into a single TableRow
  • -> Write the TableRow to BigQuery

I know that some customers can have millions of orders (the hot key problem), so I use an advanced CombineFn with hotKeyFanout set to 50.

[Image: Combine function]

I was pretty sure that with this setup memory wouldn't be a problem, because the data is aggregated in an accumulator, which reduces the memory footprint in the pipeline.
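To make the accumulator idea concrete, here is a minimal plain-Java sketch of an averaging accumulator. It is not the CombineFn used in the job and has no Beam dependency. The class name `MeanAccumulatorSketch` and its fields are hypothetical, and only the four method names mirror Beam's `CombineFn` hooks.

```java
// Plain-Java sketch (no Beam dependency). The method names mirror the four
// CombineFn hooks; the class name and fields are hypothetical.
final class MeanAccumulatorSketch {
    // Constant-size state: one running sum and one count.
    static final class Accum {
        double sum;
        long count;
    }

    static Accum createAccumulator() {
        return new Accum();
    }

    // Fold one order price into the accumulator.
    static Accum addInput(Accum acc, double price) {
        acc.sum += price;
        acc.count++;
        return acc;
    }

    // Merge partial accumulators, e.g. from different workers or fanout shards.
    static Accum mergeAccumulators(Iterable<Accum> accums) {
        Accum merged = createAccumulator();
        for (Accum a : accums) {
            merged.sum += a.sum;
            merged.count += a.count;
        }
        return merged;
    }

    // Final average; returning 0.0 for an empty accumulator is an arbitrary choice here.
    static double extractOutput(Accum acc) {
        return acc.count == 0 ? 0.0 : acc.sum / acc.count;
    }
}
```

With an accumulator of this shape, the state per key stays at two numbers no matter how many orders a customer has, which is the reasoning behind the memory expectation above.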

Yet I ran into an OOM error in a small sub-step of the combine step, called GroupByKey, whose purpose I don't really understand. (Perhaps it is the merge-accumulators step?)

[Image: Dataflow combine step expanded]

Looking at the heap analysis from Cloud Profiler, there seems to be a huge StringBuilder object (~1 GB):
[Image: Heap analysis on worker with OOM problem]

At the moment I'm pretty lost as to why a huge amount of data ends up on one worker when I already use the hotKeyFanout option. My current guess is that 50 splits for a hot key aren't enough, so one split bundle is still huge. I'm inclined to increase hotKeyFanout to a larger number, but I'm not sure whether that will solve the problem or whether I'm looking at it from the wrong angle.
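The mental model behind that guess can be sketched in plain Java. This is a simplified model of the fanout idea, not Dataflow's actual implementation: the class `HotKeyFanoutSketch`, the round-robin shard choice, and the `{sum, count}` array layout are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of the fanout idea, not Dataflow's actual implementation.
// All names here are hypothetical.
final class HotKeyFanoutSketch {
    // Stage 1: spread one key's prices over `fanout` sub-keys and pre-combine
    // each sub-key into a {sum, count} pair.
    static Map<Integer, double[]> preCombine(double[] prices, int fanout) {
        Map<Integer, double[]> partials = new HashMap<>();
        for (int i = 0; i < prices.length; i++) {
            int shard = i % fanout; // assumption: a real runner picks shards differently
            double[] acc = partials.computeIfAbsent(shard, s -> new double[2]);
            acc[0] += prices[i];
            acc[1] += 1;
        }
        return partials;
    }

    // Stage 2: after regrouping by the original key, merge the partial results.
    static double mergeToMean(Map<Integer, double[]> partials) {
        double sum = 0;
        double count = 0;
        for (double[] acc : partials.values()) {
            sum += acc[0];
            count += acc[1];
        }
        return count == 0 ? 0.0 : sum / count;
    }
}
```

In this model the final merge for one customer sees at most `fanout` small partial results, however many orders that customer has. If that picture is roughly right, a ~1 GB object would have to come from somewhere other than the number of splits, but that is exactly the part I am unsure about.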

Looking for suggestions/recommendations on what I did wrong here.
