How can I reliably process files that an external agent uploads on a schedule?
It's a very common scenario: some process wants to drop a file on a server every 30 minutes or so. Simple, right? Well, I can think of a bunch of ways this could go wrong.
For instance, processing a file may take more or less than 30 minutes, so it's possible for a new file to arrive before I'm done with the previous one. I don't want the source system to overwrite a file that I'm still processing.
On the other hand, the files are large, so it takes a few minutes to finish uploading them. I don't want to start processing a partial file. The files are just transferred with FTP or SFTP (my preference), so OS-level locking isn't an option.
Finally, I do need to keep the files around for a while, in case I need to manually inspect one of them (for debugging) or reprocess one.
I've seen a lot of ad-hoc approaches to shuffling upload files around, swapping filenames, using datestamps, touching "indicator" files to assist in synchronization, and so on. What I haven't seen yet is a comprehensive "algorithm" for processing files that addresses concurrency, consistency, and completeness.
So, I'd like to tap into the wisdom of crowds here. Has anyone seen a really bulletproof way to juggle batch data files so they're never processed too early, never overwritten before done, and safely kept after processing?
The key is to do the initial juggling at the sending end. All the sender needs to do is:

1. Upload the file, and once the upload is complete, move it into a subdirectory called e.g. completed.

Assuming there is only a single receiver process, all the receiver needs to do is:

1. Periodically scan the completed directory for any files.
2. As soon as a file appears in completed, move it to a subdirectory called e.g. processed, and start working on it from there.

On any sane filesystem, file moves are atomic provided they occur within the same filesystem/volume. So there are no race conditions.
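The sender/receiver handoff described above can be sketched in a few lines of Python. This is only an illustration: the base directory is a throwaway temp dir, and the file name and payload are invented; only the completed/processed layout comes from the answer.

```python
import os
import tempfile

base = tempfile.mkdtemp()
incoming = os.path.join(base, "incoming")    # sender uploads land here
completed = os.path.join(base, "completed")  # finished uploads
processed = os.path.join(base, "processed")  # receiver works on files here
for d in (incoming, completed, processed):
    os.makedirs(d, exist_ok=True)

# --- sender side: write the file, then rename it into completed/
# only once the upload has fully finished.
tmp_path = os.path.join(incoming, "batch-001.dat")
with open(tmp_path, "wb") as f:
    f.write(b"payload")                      # stand-in for the real upload
os.rename(tmp_path, os.path.join(completed, "batch-001.dat"))

# --- receiver side (single receiver): move each file into processed/
# BEFORE working on it, so a new upload can never clobber it.
for name in os.listdir(completed):
    src = os.path.join(completed, name)
    dst = os.path.join(processed, name)
    os.rename(src, dst)                      # atomic on the same filesystem
    print("processing", dst)
```

Because both renames happen within one filesystem, the receiver never sees a half-written file: a file simply does not exist in completed/ until the sender's final rename.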
Multiple Receivers

If processing could take longer than the period between files being delivered, you'll build up a backlog unless you have multiple receiver processes. So, how to handle the multiple-receiver case?

Simple: each receiver process operates exactly as before. The key is that we attempt to move a file to processed before working on it: that, and the fact that same-filesystem file moves are atomic, means that even if multiple receivers see the same file in completed and try to move it, only one will succeed. All you need to do is make sure you check the return value of rename(), or whatever OS call you use to perform the move, and only proceed with processing if it succeeded. If the move failed, some other receiver got there first, so just go back and scan the completed directory again.
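The rename-and-check step for competing receivers might look like this in Python. The claim() helper is hypothetical, not part of any library; the point is that the receiver that loses the race gets ENOENT from rename() and simply moves on.

```python
import errno
import os
import tempfile

def claim(src, dst):
    """Try to claim a file by moving it; return True only on success."""
    try:
        os.rename(src, dst)              # atomic within one filesystem
        return True
    except OSError as e:
        if e.errno == errno.ENOENT:      # another receiver moved it first
            return False
        raise

# Demonstrate the race outcome with invented paths in a temp dir.
base = tempfile.mkdtemp()
completed = os.path.join(base, "completed")
processed = os.path.join(base, "processed")
os.makedirs(completed)
os.makedirs(processed)

path = os.path.join(completed, "batch-002.dat")
open(path, "wb").close()

target = os.path.join(processed, "batch-002.dat")
first = claim(path, target)    # this receiver wins and may process
second = claim(path, target)   # a rival's attempt now fails cleanly
print(first, second)
```

A receiver that gets False from claim() should not touch the file; it just goes back to scanning the completed directory, exactly as the answer describes.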
If the OS supports it, use filesystem hooks to intercept open and close file operations; something like Dazuko. Other operating systems may expose file operations in another way, for example Novell Open Enterprise Server lets you define epochs and read a list of files modified during an epoch.
Just realized that on Linux you can use the inotify subsystem, or the utilities from the inotify-tools package.
File transfer is one of the classics of system integration. I'd recommend you get the Enterprise Integration Patterns book to build your own answer to these questions -- to some extent, the answer depends on the technologies and platforms you are using for endpoint implementation and for file transfer. It's quite a comprehensive collection of workable patterns, and fairly well written.