How to pick up changes in JSON files that PySpark readStream is reading?
I have JSON files where each file describes a particular entity, including its state. I am trying to pull these into Delta using readStream and writeStream. This works perfectly for new files. However, these JSON files are frequently updated (i.e., states are changed, comments are added, history items are added, etc.), and the changed files are not picked up by the readStream. I assume that is because readStream does not reprocess items it has already seen. Is there a way around this?
One thing I am considering is changing my initial write of the json to add a timestamp to the file name so that it becomes a different record to the stream (I already have to do a de-duping in my writeStream anyway), but I am trying to not modify the code that is writing the json as it is already being used in production.
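The timestamp-in-filename idea can be sketched in plain Python. This is only an illustration of the de-duping logic: the filename pattern `<entity_id>_<epoch_ts>.json` is a hypothetical naming scheme I made up for the example, and the real writer's scheme may differ. The same "keep the newest record per entity" rule is what a writeStream's de-dup step would apply.

```python
import re

# Hypothetical filename pattern "<entity_id>_<epoch_ts>.json"
# (assumed for illustration; the production naming scheme may differ).
FILENAME_RE = re.compile(r"^(?P<entity>.+)_(?P<ts>\d+)\.json$")

def latest_per_entity(filenames):
    """Keep only the newest file per entity, based on the filename timestamp.

    This mirrors the de-duping a writeStream would do: each updated file
    arrives as a brand-new record, and only the latest one per entity wins.
    """
    latest = {}
    for name in filenames:
        m = FILENAME_RE.match(name)
        if not m:
            continue  # skip files that don't follow the pattern
        entity, ts = m.group("entity"), int(m.group("ts"))
        if entity not in latest or ts > latest[entity][0]:
            latest[entity] = (ts, name)
    return {entity: name for entity, (ts, name) in latest.items()}

files = [
    "order-42_1700000000.json",
    "order-42_1700000500.json",  # later update to the same entity
    "order-7_1700000100.json",
]
print(latest_per_entity(files))
# → {'order-42': 'order-42_1700000500.json', 'order-7': 'order-7_1700000100.json'}
```

The trade-off, as noted above, is that this requires touching the production code that writes the JSON.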
Ideally I would like to find something like the changeFeed functionality for Cosmos Db, but for reading json files.
Any suggestions?
Thanks!
1 Answer
This is not supported by Spark Structured Streaming - once a file has been processed, it won't be processed again.
The closest thing to your requirement exists only in Databricks' Auto Loader - its cloudFiles.allowOverwrites option allows modified files to be reprocessed.
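A minimal sketch of the Auto Loader approach, assuming a Databricks runtime (the `cloudFiles` format is Databricks-only, so this will not run on open-source Spark). The input path, checkpoint location, and table name are placeholders:

```python
# Databricks Auto Loader sketch - illustrative only; requires the
# Databricks runtime. Paths and the table name are placeholders.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Re-ingest files whose contents change after first processing:
    .option("cloudFiles.allowOverwrites", "true")
    .load("/mnt/raw/entities/")
)

(
    df.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/entities/")
    .toTable("entities_delta")
)
```

With allowOverwrites enabled, a modified file flows through the stream again as a new micro-batch record, so the writeStream side still needs de-duplication (e.g. keeping the latest version per entity).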
P.S. If you use the cleanSource option of the file source (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources), it might reprocess files, but I'm not 100% sure.