Apache Beam: how to overwrite the source parquet files with updated data
I have a Beam pipeline written in Python that reads two parquet files: state and updates. The pipeline keeps the current data in the state file, reads the updates file, and updates the state file (possibly adding new rows) with the content of updates. Ideally my pipeline should overwrite the state file, so that the next time it runs with new updates I am comparing them against the most recent state.
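For concreteness, here is a minimal sketch of the kind of pipeline I mean; the id key column, the pyarrow schema, and the keep-the-update-row merge logic are placeholders for illustration, not my exact code:

```python
import apache_beam as beam
import pyarrow

def pick_latest(kv):
    """Keep the update row for a key if one exists, otherwise the state row."""
    key, grouped = kv
    updates = list(grouped['updates'])
    state = list(grouped['state'])
    return updates[0] if updates else state[0]

# Assumed schema for illustration.
schema = pyarrow.schema([('id', pyarrow.int64()),
                         ('value', pyarrow.string())])

with beam.Pipeline() as p:
    state = (p | 'ReadState' >> beam.io.ReadFromParquet('state.parquet')
               | 'KeyState' >> beam.Map(lambda row: (row['id'], row)))
    updates = (p | 'ReadUpdates' >> beam.io.ReadFromParquet('updates.parquet')
                 | 'KeyUpdates' >> beam.Map(lambda row: (row['id'], row)))
    merged = ({'state': state, 'updates': updates}
              | 'Join' >> beam.CoGroupByKey()
              | 'PickLatest' >> beam.Map(pick_latest))
    # Writing back with the prefix 'state' produces sharded names such as
    # state-00000-of-00001.parquet instead of overwriting state.parquet.
    merged | 'Write' >> beam.io.WriteToParquet('state', schema,
                                               file_name_suffix='.parquet')
```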
Here is my issue. My starting file is called state.parquet. After my pipeline ends, it does not overwrite that file; instead it creates a new file called state-00000-of-00001.parquet. This makes me realise that this is probably not a good approach: as the data grows, the output could be sharded across multiple separate files, which would cause problems.
What would be a better approach to accomplish what I am trying to do?
1 Answer
You can give your files an execution id suffix, such as state-run0001-00000-of-0000x.parquet and update-run0002-00000-of-0000x.parquet. Use file name prefixes such as state-run0001 and update-run0002 as inputs of your pipeline, then write the output to state-run0002-00000-of-0000x.parquet. You just need to keep track of the execution id run000x to schedule your jobs.
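A minimal sketch of this scheme, assuming the run id is tracked externally (for example by your scheduler); the glob patterns, the assumed schema, and the trivial Flatten stand in for the real merge logic from the question:

```python
import apache_beam as beam
import pyarrow

run_id = 2                     # tracked by your scheduler (assumption)
prev = f'run{run_id - 1:04d}'  # previous execution id, e.g. run0001
curr = f'run{run_id:04d}'      # current execution id, e.g. run0002

# Assumed schema for illustration.
schema = pyarrow.schema([('id', pyarrow.int64()),
                         ('value', pyarrow.string())])

with beam.Pipeline() as p:
    # Read all shards of the previous state and the current updates via
    # glob patterns, so it no longer matters how many files each run wrote.
    state = p | 'ReadState' >> beam.io.ReadFromParquet(
        f'state-{prev}-*.parquet')
    updates = p | 'ReadUpdates' >> beam.io.ReadFromParquet(
        f'update-{curr}-*.parquet')

    # Placeholder for the real merge logic from the question.
    merged = (state, updates) | 'Merge' >> beam.Flatten()

    # Write the new state under the current run id; the next run reads the
    # whole state-run0002-* glob, so sharding is harmless.
    merged | 'Write' >> beam.io.WriteToParquet(
        f'state-{curr}', schema, file_name_suffix='.parquet')
```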