How to split a large Parquet file using Apache Beam / Google Dataflow Python
I need to split a 30 GB Parquet file with Apache Beam / Google Dataflow.
Here is the code:
import apache_beam as beam

# pipeline_options and SCHEMA (a pyarrow schema) are defined elsewhere in my script.
with beam.Pipeline(options=pipeline_options) as p:
    (
        p
        | 'Read' >> beam.io.ReadFromParquet("gs://my-bucket/input/my-file.parquet")
        | 'Write' >> beam.io.WriteToParquet(
            file_path_prefix="gs://my-bucket/output/",
            schema=SCHEMA,
            codec='snappy',
            file_name_suffix='.parquet',
            num_shards=20,
        )
    )
When I run this code on a small Parquet file, it runs fine. But when I run it on a big file (30 GB Parquet), it gets stuck and throws an error after some idle time:
Root cause: The worker lost contact with the service.
I tried to run it on more powerful virtual machines, as recommended here:
--worker_machine_type=e2-standard-2 --disk_size_gb=500
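For context, pipeline_options is built from these flags roughly as follows (a minimal sketch; the project, region, and bucket values are placeholders, not my real settings):

from apache_beam.options.pipeline_options import PipelineOptions

# Rough sketch of how the options are assembled; project, region, and
# temp_location below are placeholders.
pipeline_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp/',
    '--worker_machine_type=e2-standard-2',
    '--disk_size_gb=500',
])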
But this time the job gets stuck on the same step and freezes forever.
I'm not an experienced Apache Beam / Dataflow user and haven't been using them for long. Any help is appreciated.