Google Dataflow shuts down the JVM after 8 consecutive periods of GC thrashing during a global combine

Posted on 2025-02-13 03:21:35


In my pipeline I have around 4 million records, and the flow is as follows (a minimal sketch of this shape is included right after the list):

  1. Read all records from BigQuery

  2. Transform them to proto

  3. Combine globally and create a sorted KV-based SST file, which is later used for RocksDB
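
For context, here is a minimal, hypothetical sketch of that pipeline shape in Beam's Java SDK. The table name, the key/value column names, and the collect-into-a-sorted-list CombineFn are illustrative stand-ins, not taken from the original post (the real combiner would emit an SST file instead of a list). The point is that Combine.globally merges every accumulator down to a single value, so the whole sorted dataset has to fit on one worker's heap, which matches the GC-thrashing failure.

    // Hypothetical sketch of the pipeline shape described above.
    import java.util.ArrayList;
    import java.util.List;
    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Combine;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class SstPipelineSketch {

      // Collects every KV into one sorted list: this is what concentrates all ~4M records
      // on a single worker. A real implementation would write the SST file in extractOutput.
      static class CollectAndSortFn
          extends Combine.CombineFn<KV<String, String>, List<KV<String, String>>, List<KV<String, String>>> {
        @Override public List<KV<String, String>> createAccumulator() { return new ArrayList<>(); }
        @Override public List<KV<String, String>> addInput(List<KV<String, String>> acc, KV<String, String> in) {
          acc.add(in);
          return acc;
        }
        @Override public List<KV<String, String>> mergeAccumulators(Iterable<List<KV<String, String>>> accs) {
          List<KV<String, String>> merged = new ArrayList<>();
          for (List<KV<String, String>> a : accs) { merged.addAll(a); }
          return merged;
        }
        @Override public List<KV<String, String>> extractOutput(List<KV<String, String>> acc) {
          acc.sort((a, b) -> a.getKey().compareTo(b.getKey()));  // SST files need keys in ascending order
          return acc;
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadFromBigQuery",
                BigQueryIO.readTableRows().from("my-project:my_dataset.my_table"))  // placeholder table
         .apply("ToKv",
                MapElements
                    .into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings()))
                    .via((TableRow row) -> KV.of((String) row.get("key"), (String) row.get("value"))))
         .apply("CombineGloballyToSortedKvs", Combine.globally(new CollectAndSortFn()));

        p.run();
      }
    }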

This pipeline works for up to about 1.5 million records but fails beyond that with this error:

Shutting down JVM after 8 consecutive periods of measured GC
thrashing. Memory is used/total/max = 2481/2492/2492 MB, GC last/max =
97.50/97.50 %, #pushbacks=0, gc thrashing=true. Heap dump not written.

The error doesn't change even when I use several optimizations suggested in various other threads, such as (see the options sketch after this list):

  1. Changing the machine type to high memory
  2. Decreasing the accumulators (reduced the worker count to 1)
  3. Using an SSD disk
  4. --experiments=shuffle_mode=service
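
For reference, those knobs map onto the Dataflow runner's pipeline options roughly as below. The machine type, project, and zone values are placeholders; only the option names (workerMachineType, numWorkers, workerDiskType, experiments) are intended to be the standard Dataflow ones.

    // Rough sketch of applying the options listed above (values are illustrative).
    import java.util.Collections;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class OptionsSketch {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

        options.setWorkerMachineType("n1-highmem-4");  // 1. high-memory machine type
        options.setNumWorkers(1);                      // 2. fewer accumulators to merge
        options.setMaxNumWorkers(1);
        options.setWorkerDiskType(                     // 3. SSD persistent disks (placeholder project/zone)
            "compute.googleapis.com/projects/my-project/zones/us-central1-a/diskTypes/pd-ssd");
        options.setExperiments(                        // 4. Dataflow shuffle service
            Collections.singletonList("shuffle_mode=service"));

        // ... build and run the pipeline with these options ...
      }
    }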

Current stats: (screenshot of the job's worker metrics, not reproduced here)
I can't use a custom file sink, as the underlying SST writer doesn't support writing to a writable byte channel, as noted here; a sketch of the writer's file-path-based API is below.
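
To illustrate that constraint: RocksJava's SstFileWriter opens a local file path and requires keys in ascending order, so it can't simply be handed a writable byte channel by a file sink; a common workaround is writing the .sst to the worker's local disk and copying it out afterwards. The keys, values, and path below are placeholders.

    // Sketch of RocksJava SstFileWriter usage; it writes to a local path, not a byte channel.
    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import org.rocksdb.EnvOptions;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDB;
    import org.rocksdb.RocksDBException;
    import org.rocksdb.SstFileWriter;

    public class SstWriterSketch {
      public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        // Keys must be inserted in strictly ascending order, hence the sorted map.
        SortedMap<String, String> sorted = new TreeMap<>();
        sorted.put("a", "1");
        sorted.put("b", "2");

        try (Options options = new Options();
             EnvOptions envOptions = new EnvOptions();
             SstFileWriter writer = new SstFileWriter(envOptions, options)) {
          writer.open("/tmp/records-00000.sst");  // takes a local file path only
          for (Map.Entry<String, String> e : sorted.entrySet()) {
            writer.put(e.getKey().getBytes(StandardCharsets.UTF_8),
                       e.getValue().getBytes(StandardCharsets.UTF_8));
          }
          writer.finish();
        }
        // The .sst file can then be copied from the worker's local disk to durable storage.
      }
    }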

Any insight on resolving this would be helpful


Comments (1)

岁月静好 2025-02-20 03:21:35


Noticed that the current worker memory is still 3.75 GB; upgrading the worker machine type to n1-standard-2 worked.
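
For completeness: the default batch worker type, n1-standard-1, has 3.75 GB of RAM, while n1-standard-2 has 7.5 GB, which is presumably why the upgrade was enough. Setting it programmatically would look roughly like this (illustrative, not the answerer's code):

    // Illustrative: raising worker memory by choosing a larger machine type.
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class MachineTypeSketch {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
        options.setWorkerMachineType("n1-standard-2");  // 7.5 GB RAM instead of the default 3.75 GB
        // ... create and run the pipeline with these options ...
      }
    }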
