Picking up changes in JSON files that a PySpark readStream is reading?

Posted 2025-01-30 15:19:04

I have JSON files where each file describes a particular entity, including its state. I am trying to pull these into Delta by using readStream and writeStream. This works perfectly for new files. These JSON files are frequently updated (i.e., states are changed, comments added, history items added, etc.), but the changed JSON files are not picked up by the readStream. I assume that is because readStream does not reprocess files it has already seen. Is there a way around this?
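
For context, here is a minimal sketch of the kind of pipeline I mean (the schema, column names like entity_id / state / updated_at, and all paths are simplified placeholders, not my production code; spark is the active SparkSession):

    from pyspark.sql.types import StructType, StringType, TimestampType

    # Placeholder schema for the per-entity JSON files.
    entity_schema = (
        StructType()
        .add("entity_id", StringType())
        .add("state", StringType())
        .add("updated_at", TimestampType())
    )

    # Stream new JSON files from the landing directory...
    raw = (
        spark.readStream
        .schema(entity_schema)              # file sources require an explicit schema
        .json("/mnt/landing/entities/")
    )

    # ...and append each micro-batch to a Delta table.
    (
        raw.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/entities")
        .outputMode("append")
        .start("/mnt/delta/entities")
    )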

One thing I am considering is changing the initial write of the JSON to add a timestamp to the file name, so that each update becomes a new file as far as the stream is concerned (I already have to de-dupe in my writeStream anyway). However, I am trying not to modify the code that writes the JSON, since it is already in production.
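
The de-duping I have in mind on the write side is roughly a foreachBatch merge along these lines (again, entity_id, updated_at, and the paths are placeholders; raw is the stream from the sketch above, and the target Delta table is assumed to already exist):

    from pyspark.sql import functions as F, Window
    from delta.tables import DeltaTable

    def upsert_entities(batch_df, batch_id):
        # Keep only the newest record per entity inside the micro-batch, then
        # MERGE so a re-processed entity updates its existing row instead of
        # appending a duplicate.
        w = Window.partitionBy("entity_id").orderBy(F.col("updated_at").desc())
        latest = (
            batch_df
            .withColumn("rn", F.row_number().over(w))
            .filter("rn = 1")
            .drop("rn")
        )
        target = DeltaTable.forPath(spark, "/mnt/delta/entities")
        (
            target.alias("t")
            .merge(latest.alias("s"), "t.entity_id = s.entity_id")
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )

    (
        raw.writeStream
        .foreachBatch(upsert_entities)
        .option("checkpointLocation", "/mnt/checkpoints/entities")
        .start()
    )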

Ideally I would like to find something like the changeFeed functionality in Cosmos DB, but for reading JSON files.

Any suggestions?

Thanks!

Comments (1)

人疚 2025-02-06 15:19:04

This is not supported by Spark Structured Streaming - after a file is processed, it won't be processed again.

The closest thing to your requirement exists only in Databricks' Auto Loader - it has a cloudFiles.allowOverwrites option that allows modified files to be reprocessed.
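
A rough sketch of what that would look like (Databricks only; the paths and schema location below are placeholders):

    # Auto Loader ("cloudFiles") picks up new files, and with allowOverwrites
    # enabled it will also re-ingest a file that was overwritten in place.
    changed = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.allowOverwrites", "true")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/entities_schema")
        .load("/mnt/landing/entities/")
    )

    (
        changed.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/entities_autoloader")
        .start("/mnt/delta/entities")
    )

Note that re-ingested files arrive as new rows, so you would still want your de-duplication/merge on the write side.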

P.S. Potentially, if you use the cleanSource option for the file source (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources), it may reprocess files, but I'm not 100% sure.
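
If you want to experiment with it, the configuration would look roughly like this (schema and paths are placeholders; cleanSource archives or deletes files after they are committed, and I'm not claiming it will pick up later modifications):

    # Plain file source with cleanSource enabled; completed files are moved to
    # the archive directory after they have been committed. sourceArchiveDir
    # must be outside the source path.
    archived = (
        spark.readStream
        .schema(entity_schema)
        .option("cleanSource", "archive")             # "archive", "delete", or "off"
        .option("sourceArchiveDir", "/mnt/archive/entities")
        .json("/mnt/landing/entities/")
    )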
