How to split a large Parquet file using Apache Beam / Google Dataflow Python
I need to split a 30 GB Parquet file with Apache Beam / Google Dataflow.
Here is the code:
import apache_beam as beam

# pipeline_options and SCHEMA (a pyarrow schema) are defined elsewhere in my script.
with beam.Pipeline(options=pipeline_options) as p:
    (
        p
        | 'Read' >> beam.io.ReadFromParquet("gs://my-bucket/input/my-file.parquet")
        | 'Write' >> beam.io.WriteToParquet(
            file_path_prefix="gs://my-bucket/output/",
            schema=SCHEMA,
            codec='snappy',
            file_name_suffix='.parquet',
            num_shards=20,
        )
    )
When I run this code on a small Parquet file, it runs fine. But when I run it on a big file (30 GB Parquet), it gets stuck and throws an error after some idle time:
Root cause: The worker lost contact with the service.
I tried to run it on more powerful virtual machines, as recommended here:
--worker_machine_type=e2-standard-2 --disk_size_gb=500
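For context, pipeline_options is built from these flags roughly as follows (a minimal sketch; the project, region, and bucket values are placeholders, not my real settings):

from apache_beam.options.pipeline_options import PipelineOptions

# Rough sketch of how the options are assembled; project, region, and
# temp_location below are placeholders.
pipeline_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp/',
    '--worker_machine_type=e2-standard-2',
    '--disk_size_gb=500',
])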
But this time the job gets stuck on the same step and freezes forever.
I'm not an experienced Apache Beam / Dataflow user and haven't been using them for long. Any help is appreciated.