Google Dataflow: Shutting down JVM after 8 consecutive periods of GC thrashing during a global combine
In my pipeline I have around 4 million records and the flow is as follows (a rough code sketch of the pipeline follows the list):
- Read all records from BigQuery
- Transform to proto
- Combine globally and create a sorted KV-based SST file, which is later used for RocksDB
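For context, a minimal sketch of how such a pipeline might be wired up; the table name, the MyRecord proto type, and the ToProto/BuildSortedKvCombineFn/WriteSstFileFn pieces are placeholders I'm assuming for illustration, not the actual job code:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.transforms.Combine;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.TypeDescriptor;

    Pipeline p = Pipeline.create(options);

    p.apply("ReadFromBigQuery",
            BigQueryIO.readTableRows().from("my-project:my_dataset.my_table"))
     // Convert each TableRow into the proto used as the value in the SST file.
     .apply("TransformToProto",
            MapElements.into(TypeDescriptor.of(MyRecord.class))
                .via((TableRow row) -> toProto(row)))
     // A global combine funnels all ~4M records into a single accumulator,
     // which is why memory pressure grows with the record count.
     .apply("CombineGlobally", Combine.globally(new BuildSortedKvCombineFn()))
     // Write the combined, sorted KV data out as an SST file for RocksDB.
     .apply("WriteSstFile", ParDo.of(new WriteSstFileFn()));

    p.run();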
This pipeline works for up to 1.5 million records but fails beyond that with this error:
Shutting down JVM after 8 consecutive periods of measured GC
thrashing. Memory is used/total/max = 2481/2492/2492 MB, GC last/max =
97.50/97.50 %, #pushbacks=0, gc thrashing=true. Heap dump not written.
The error doesn't change even after applying several optimizations suggested in various other threads, such as the following (example flags are shown after the list):
- Changing the machine type to high-memory
- Decreasing the number of accumulators (reduced the worker count to 1)
- Using an SSD disk
- --experiments=shuffle_mode=service
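For reference, these tweaks correspond to standard Dataflow pipeline options; a sample invocation might look like this (the project, region, zone, and machine type values are just examples, not my actual settings):

    java -jar my-pipeline.jar \
      --runner=DataflowRunner \
      --project=MY_PROJECT \
      --region=us-central1 \
      --workerMachineType=n1-highmem-4 \
      --numWorkers=1 \
      --maxNumWorkers=1 \
      --workerDiskType=compute.googleapis.com/projects/MY_PROJECT/zones/us-central1-a/diskTypes/pd-ssd \
      --experiments=shuffle_mode=service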
Current stats: [screenshot of job metrics omitted]
I can't use a custom file sink, as the underlying SST writer doesn't support writing to a writable byte channel, as noted here (a sketch of the mismatch is below).
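To illustrate that limitation, a rough sketch assuming Beam's FileIO.Sink interface and the RocksDB Java SstFileWriter API: Beam hands the sink a WritableByteChannel, while SstFileWriter can only open a local file path, so the two don't compose directly.

    import java.io.IOException;
    import java.nio.channels.WritableByteChannel;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.values.KV;
    import org.rocksdb.EnvOptions;
    import org.rocksdb.Options;
    import org.rocksdb.RocksDBException;
    import org.rocksdb.SstFileWriter;

    // Sketch only: what a custom SST sink would have to look like.
    public class SstSink implements FileIO.Sink<KV<byte[], byte[]>> {
      private SstFileWriter writer;

      @Override
      public void open(WritableByteChannel channel) throws IOException {
        // Beam gives us a byte channel, but SstFileWriter can only open a
        // local file path, so the provided `channel` cannot be used here.
        writer = new SstFileWriter(new EnvOptions(), new Options());
        try {
          writer.open("/tmp/output.sst"); // local path only
        } catch (RocksDBException e) {
          throw new IOException(e);
        }
      }

      @Override
      public void write(KV<byte[], byte[]> element) throws IOException {
        try {
          // Keys must be added in sorted order for a valid SST file.
          writer.put(element.getKey(), element.getValue());
        } catch (RocksDBException e) {
          throw new IOException(e);
        }
      }

      @Override
      public void flush() throws IOException {
        try {
          writer.finish();
        } catch (RocksDBException e) {
          throw new IOException(e);
        }
      }
    }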
Any insight on resolving this would be helpful.
Comments (1)
Noticed that the current memory is still 3.75 GB; upgrading the worker machine type to n1-standard-2 worked.