OOM error in sub-step (GroupByKey) of CombineFn perKey with hotKeyFanout
I have a Beam batch job that runs on Dataflow. The job's flow, in summary:
- Read from BigQuery (hundreds of millions of records)
- Transform each BigQuery record into an Order event
- Key each Order event by its customer id
- Apply a combine function (with hotKeyFanout = 50) to calculate the average price of all orders per customer
- Transform each combined result into a single TableRow
- Write the TableRow to BigQuery
I know that some customers can have millions of orders (the hot key problem), so I apply an advanced CombineFn with hotKeyFanout (50).
I was fairly sure that with this setup memory would not be a problem, because the data is aggregated into an accumulator, which reduces the memory footprint in the pipeline.
Yet I ran into an out-of-memory error in a small sub-step of the combine, called GroupByKey, whose purpose I don't really understand. (Perhaps it is the merge-accumulators step?)
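As a rough mental model of where a GroupByKey fits into a fanned-out combine, the sketch below simulates it in plain Python: a first stage partially combines each key over several random sub-keys, a grouping step collects the partial accumulators per original key, and a final stage merges them. This is a simplified illustration under stated assumptions, not the Dataflow implementation; the function names, the `(total, count)` tuple accumulator and the sample data are all hypothetical.

```python
import random
from collections import defaultdict

FANOUT = 50  # same value as in the question


def pre_combine(orders, fanout, rng):
    """Stage 1: spread each key over `fanout` sub-keys and combine partially."""
    partial = defaultdict(lambda: (0.0, 0))
    for customer_id, price in orders:
        shard = rng.randrange(fanout)  # random sub-key for this element
        total, count = partial[(customer_id, shard)]
        partial[(customer_id, shard)] = (total + price, count + 1)
    return partial


def group_by_key(partial):
    """Grouping step: gather the partial accumulators of each original key."""
    grouped = defaultdict(list)
    for (customer_id, _shard), acc in partial.items():
        grouped[customer_id].append(acc)
    return grouped


def final_combine(grouped):
    """Stage 2: merge the partial accumulators and extract the average."""
    result = {}
    for customer_id, accs in grouped.items():
        total = sum(t for t, _ in accs)
        count = sum(c for _, c in accs)
        result[customer_id] = total / count
    return result


rng = random.Random(0)
# Assumed data: one hot customer with 100,000 orders and one cold customer.
orders = [("hot", float(i % 10)) for i in range(100_000)] + [("cold", 7.0)]
partial = pre_combine(orders, FANOUT, rng)
grouped = group_by_key(partial)
averages = final_combine(grouped)
```

In this model the grouping step handles at most `FANOUT` small accumulators per key rather than the raw orders, so if the real accumulator is constant-size, a very large object at that step would be surprising and worth tracing back to what is actually being shuffled.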
Looking at the heap analysis from Cloud Profiler, there seems to be a huge string builder object of about 1 GB.
At the moment I'm pretty lost as to why a huge amount of data ends up on a single worker when I already use the hotKeyFanout option. My current guess is that 50 splits per hot key aren't enough, so one split bundle is still huge. I'm inclined to increase hotKeyFanout to a larger number, but I'm not sure whether that will solve the problem or whether I'm looking at it from the wrong angle.
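The "50 splits aren't enough" guess can be put into numbers with a small back-of-the-envelope calculation. The sketch below assumes a hot customer with 5,000,000 orders (an illustrative figure, not one from the job) and an even spread across sub-keys; `elements_per_sub_key` is a hypothetical helper.

```python
def elements_per_sub_key(orders_for_key, fanout):
    """Ceiling of orders_for_key / fanout, assuming an even spread."""
    return -(-orders_for_key // fanout)


hot_key_orders = 5_000_000  # assumed size of one hot key

for fanout in (50, 500, 5000):
    per_shard = elements_per_sub_key(hot_key_orders, fanout)
    print(f"fanout={fanout}: about {per_shard} orders per sub-key")
```

Under these assumptions a larger fanout reduces how many orders each sub-key processes, but it does not change the size of a single accumulator. If the accumulator really is just a sum and a count, the per-sub-key element count would mostly affect processing time, so it may be worth checking what is growing in memory before raising the fanout.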
I'm looking for suggestions on what I did wrong here.