How do I reliably handle files uploaded periodically by an external agent?

Published 2024-07-14 21:04:26

It's a very common scenario: some process wants to drop a file on a server every 30 minutes or so. Simple, right? Well, I can think of a bunch of ways this could go wrong.

For instance, processing a file may take more or less than 30 minutes, so it's possible for a new file to arrive before I'm done with the previous one. I don't want the source system to overwrite a file that I'm still processing.

On the other hand, the files are large, so it takes a few minutes to finish uploading them. I don't want to start processing a partial file. The files are just transferred with FTP or sftp (my preference), so OS-level locking isn't an option.

Finally, I do need to keep the files around for a while, in case I need to manually inspect one of them (for debugging) or reprocess one.

I've seen a lot of ad-hoc approaches to shuffling upload files around, swapping filenames, using datestamps, touching "indicator" files to assist in synchronization, and so on. What I haven't seen yet is a comprehensive "algorithm" for processing files that addresses concurrency, consistency, and completeness.

So, I'd like to tap into the wisdom of crowds here. Has anyone seen a really bulletproof way to juggle batch data files so they're never processed too early, never overwritten before done, and safely kept after processing?

Comments (3)

眼眸 2024-07-21 21:04:27

The key is to do the initial juggling at the sending end. All the sender needs to do is:

  1. Store the file with a unique filename.
  2. As soon as the file has been sent, move it to a subdirectory called e.g. completed.

Assuming there is only a single receiver process, all the receiver needs to do is:

  1. Periodically scan the completed directory for any files.
  2. As soon as a file appears in completed, move it to a subdirectory called e.g. processed, and start working on it from there.
  3. Optionally delete it when finished.

On any sane filesystem, file moves are atomic provided they occur within the same filesystem/volume. So there are no race conditions.
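The sender/receiver flow above can be sketched as follows (a minimal illustration, assuming a POSIX-style filesystem; the directory names `completed` and `processed` come from the answer, everything else — the base directory and the file name — is hypothetical):

```python
import os
import tempfile

# Hypothetical layout: uploads land in the top-level directory, the
# sender moves finished uploads into completed/, and the receiver
# claims them by moving them into processed/.
base = tempfile.mkdtemp()
completed = os.path.join(base, "completed")
processed = os.path.join(base, "processed")
os.makedirs(completed)
os.makedirs(processed)

# Sender side: write under a unique name, then move into completed/
# once the upload is finished. The move is atomic because both paths
# are on the same filesystem.
upload = os.path.join(base, "batch-20240714T2100.dat")
with open(upload, "w") as f:
    f.write("payload")
os.rename(upload, os.path.join(completed, "batch-20240714T2100.dat"))

# Receiver side: scan completed/, claim each file by moving it into
# processed/, and only then start working on it.
for name in os.listdir(completed):
    dst = os.path.join(processed, name)
    os.rename(os.path.join(completed, name), dst)  # atomic claim
    with open(dst) as f:
        data = f.read()  # safe: the sender is done with this file
```

Because the file only ever appears in `completed` after the upload is finished, the receiver can never see a partial file, and because it is moved out before processing, it is never picked up twice.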

Multiple Receivers

If processing could take longer than the period between files being delivered, you'll build up a backlog unless you have multiple receiver processes. So, how to handle the multiple-receiver case?

Simple: Each receiver process operates exactly as before. The key is that we attempt to move a file to processed before working on it: that, and the fact that same-filesystem file moves are atomic, means that even if multiple receivers see the same file in completed and try to move it, only one will succeed. All you need to do is make sure you check the return value of rename(), or whatever OS call you use to perform the move, and only proceed with processing if it succeeded. If the move failed, some other receiver got there first, so just go back and scan the completed directory again.
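The claim-by-rename race can be demonstrated directly (a sketch assuming POSIX semantics; in Python `os.rename` signals failure by raising rather than returning an error code, so the "check the return value" step becomes an exception handler):

```python
import os
import tempfile

# Hypothetical layout mirroring the answer: finished uploads appear in
# completed/ and receivers claim them by renaming into processed/.
base = tempfile.mkdtemp()
completed = os.path.join(base, "completed")
processed = os.path.join(base, "processed")
os.makedirs(completed)
os.makedirs(processed)

with open(os.path.join(completed, "batch.dat"), "w") as f:
    f.write("payload")

def try_claim(name):
    """Attempt to claim a file; return True iff this caller won the race."""
    try:
        os.rename(os.path.join(completed, name),
                  os.path.join(processed, name))
        return True
    except FileNotFoundError:
        # Another receiver renamed it first; go back to scanning.
        return False

# Simulate two receivers racing for the same file: exactly one wins.
results = [try_claim("batch.dat"), try_claim("batch.dat")]
```

The second claim fails because the source path no longer exists once the first rename has happened; the losing receiver simply resumes scanning.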

递刀给你 2024-07-21 21:04:27

If the OS supports it, use file system hooks to intercept open and close file operations. Something like Dazuko. Other operating systems may let you know about file operations in another way; for example, Novell Open Enterprise Server lets you define epochs, and read a list of files modified during an epoch.

Just realized that on Linux, you can use the inotify subsystem, or the utilities from the inotify-tools package.
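As a rough illustration of the inotify route (a Linux-only sketch calling the raw syscalls through ctypes; in practice you would use `inotifywait` from inotify-tools or a proper binding; the `completed/` directory name follows the earlier answer, and the `IN_MOVED_TO` constant comes from `<sys/inotify.h>`):

```python
import ctypes
import os
import struct
import tempfile

IN_MOVED_TO = 0x00000080  # a file was renamed into the watched directory

libc = ctypes.CDLL("libc.so.6", use_errno=True)

base = tempfile.mkdtemp()
completed = os.path.join(base, "completed")
os.makedirs(completed)

# Watch completed/ for files being moved in, instead of polling it.
fd = libc.inotify_init()
wd = libc.inotify_add_watch(fd, completed.encode(), IN_MOVED_TO)

# Simulate the sender: upload elsewhere, then move into completed/.
upload = os.path.join(base, "batch.dat")
with open(upload, "w") as f:
    f.write("payload")
os.rename(upload, os.path.join(completed, "batch.dat"))

# Read one event. struct inotify_event is: int wd; uint32_t mask,
# cookie, len; followed by a NUL-padded name of `len` bytes.
buf = os.read(fd, 4096)
event_wd, mask, cookie, name_len = struct.unpack_from("iIII", buf, 0)
name = buf[16:16 + name_len].rstrip(b"\0").decode()
os.close(fd)
```

A receiver built this way wakes up only when a new file actually arrives, which pairs naturally with the claim-by-rename scheme from the accepted answer.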

失退 2024-07-21 21:04:27

File transfer is one of the classics of system integration. I'd recommend you get the Enterprise Integration Patterns book to build your own answer to these questions -- to some extent, the answer depends on the technologies and platforms you are using for endpoint implementation and for file transfer. It's quite a comprehensive collection of workable patterns, and fairly well written.
