Apache Beam: How to overwrite the source Parquet file with updated data

Posted on 2025-01-24 06:11:48

I have a Beam pipeline written in Python that reads two Parquet files: state and updates. The pipeline keeps the current data in the state file, reads an updates file, and updates the state (possibly adding new rows) with the content of updates. Ideally, the pipeline should overwrite the state file, so that the next time it runs with new updates I am comparing them against the most recent state.

Here is my issue. My starting file is called state.parquet. After the pipeline ends, it does not overwrite that file; instead it creates a new file called state-00000-of-00001.parquet. This makes me think that overwriting is probably not a good idea anyway, because as the file grows the output could be sharded across multiple separate files, which would cause problems.

What would be a better approach to accomplish what I am trying to do?

Comments (1)

Answer by 以为你会在, posted on 2025-01-31 06:11:48

You can give your files an execution id suffix, such as state-run0001-00000-of-0000x.parquet and update-run0002-00000-of-0000x.parquet. Use the file name prefixes state-run0001 and update-run0002 as the inputs to your pipeline, then write the output to state-run0002-00000-of-0000x.parquet. This way no run ever overwrites a file it is reading, and it does not matter how many shards each run produces.

You just need to keep track of the execution id run000x to schedule your jobs.
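
The naming scheme above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a definitive implementation: the helper names (run_name, state_prefix, updates_prefix, shard_glob, latest_run, run), the base_dir layout, and the merge_fn callback that combines the two collections are all hypothetical and do not come from the question. It relies on Beam's default shard naming (-SSSSS-of-NNNNN) so that a glob picks up every shard of the previous run.

```python
import re


def run_name(n: int) -> str:
    """Format an execution id, e.g. 2 -> 'run0002'."""
    return f"run{n:04d}"


def state_prefix(base_dir: str, n: int) -> str:
    """Path prefix of the state written by run n (hypothetical layout)."""
    return f"{base_dir}/state-{run_name(n)}"


def updates_prefix(base_dir: str, n: int) -> str:
    """Path prefix of the updates consumed by run n (hypothetical layout)."""
    return f"{base_dir}/update-{run_name(n)}"


def shard_glob(prefix: str) -> str:
    """Glob matching every shard written under a prefix."""
    return f"{prefix}-*.parquet"


def latest_run(file_names):
    """Return the highest run number among state file names, or None."""
    ids = []
    for name in file_names:
        match = re.search(r"state-run(\d{4})-", name)
        if match:
            ids.append(int(match.group(1)))
    return max(ids) if ids else None


def run(base_dir, n, schema, merge_fn):
    """Read state from run n-1 and updates for run n, write state for run n.

    schema is a pyarrow schema for the output rows; merge_fn is a
    hypothetical callable taking (state, updates) PCollections and
    returning the merged PCollection.
    """
    import apache_beam as beam  # imported lazily so the helpers work without Beam

    with beam.Pipeline() as p:
        state = p | "ReadState" >> beam.io.ReadFromParquet(
            shard_glob(state_prefix(base_dir, n - 1)))
        updates = p | "ReadUpdates" >> beam.io.ReadFromParquet(
            shard_glob(updates_prefix(base_dir, n)))
        merged = merge_fn(state, updates)
        # Writes state-run000n-00000-of-0000x.parquet; older runs stay untouched.
        merged | "WriteState" >> beam.io.WriteToParquet(
            state_prefix(base_dir, n), schema, file_name_suffix=".parquet")
```

The scheduler would only need to remember the last successful run number, or recover it by listing the directory and calling latest_run. If a single output file is strictly required, WriteToParquet also accepts num_shards and shard_name_template arguments, but writing to a fresh prefix per run avoids reading and writing the same path within one pipeline.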
