Preventing rsync from deleting unfinished source files

Posted 2024-07-04 02:05:12

I have two machines, speed and mass. speed has a fast Internet connection and is running a crawler which downloads a lot of files to disk. mass has a lot of disk space. I want to move the files from speed to mass after they're done downloading. Ideally, I'd just run:

$ rsync --remove-source-files speed:/var/crawldir .

but I worry that rsync will unlink a source file that hasn't finished downloading yet. (I looked at the source code and I didn't see anything protecting against this.) Any suggestions?

Comments (4)

空袭的梦i 2024-07-11 02:05:13

How much control do you have over the download process? If you roll your own, you can have the file being downloaded go to a temp directory or have a temporary name until it's finished downloading, and then mv it to the correct name when it's done. If you're using third party software, then you don't have as much control, but you still might be able to do the temp directory thing.
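The temporary-name approach can be sketched as a small shell helper. This is only an illustration: the function name `write_then_rename` and the `.downloading` suffix are assumptions, not part of any crawler. It relies on `mv` being a rename when the temporary file and the final file are in the same directory, so the final name never refers to a half-written file.

```shell
#!/bin/sh
# Hypothetical helper: run a command that writes the download to stdout,
# store it under "<dest>.downloading", and rename it only on success.
write_then_rename() {
    dest=$1
    shift
    tmp="$dest.downloading"
    if "$@" > "$tmp"; then
        # Same directory, so this is a rename rather than a copy.
        mv -- "$tmp" "$dest"
    else
        # The download command failed: discard the partial file.
        rm -f -- "$tmp"
        return 1
    fi
}
```

For example, `write_then_rename /var/crawldir/page.html curl -fsS https://example.com/page.html` would leave `page.html` in place only after curl exits successfully (the URL here is a placeholder).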

橘香 2024-07-11 02:05:13

Rsync can exclude files matching certain patterns. Even if you can't modify the crawler to make it download files to a temporary directory, maybe it has a convention of naming files differently during download (for example, foo.downloading while downloading a file named foo), and you can use that convention to exclude files which are still being downloaded from being copied.
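A minimal sketch of the exclude idea, assuming the crawler really does use a `.downloading` suffix for in-progress files (that suffix is only the example from this answer, and `move_finished` is a made-up wrapper name). The `-a` flag is added so that rsync recurses into the directory.

```shell
#!/bin/sh
# Hypothetical wrapper: move everything from src to dest except files
# whose names still end in ".downloading".
move_finished() {
    src=$1
    dest=$2
    rsync -a --remove-source-files --exclude='*.downloading' "$src" "$dest"
}
```

On mass this would be called as `move_finished speed:/var/crawldir .`. Excluded files are neither copied nor removed, so they are picked up by a later run once they have been renamed.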

各自安好 2024-07-11 02:05:13

If you have control over the crawling process, or it has predictable output, the above solutions (storing in a tempfile until finished, then mv'ing to the completed-downloads place, or ignoring files with a '.downloading' kind of name) might work. If all of that is beyond your control, you can make sure that the file is not opened by any process by doing 'lsof $filename' and checking if there's a result. Clearly if no one has the file open, it's safe to move it over.
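The lsof check could look roughly like the sketch below. The function name `list_closed_files` is hypothetical. It assumes lsof is installed on speed, that it is run by a user allowed to see the crawler's open files, and that file names contain no newlines. The check is also inherently racy: a file that is closed now may be reopened a moment later, for instance if the crawler resumes a download.

```shell
#!/bin/sh
# Hypothetical helper: print regular files under "$1" that no process
# currently has open, according to lsof.
list_closed_files() {
    command -v lsof >/dev/null 2>&1 || {
        echo "lsof not found" >&2
        return 2
    }
    find "$1" -type f | while IFS= read -r f; do
        # lsof exits non-zero when it finds no process holding the file.
        lsof -- "$f" >/dev/null 2>&1 || printf '%s\n' "$f"
    done
}
```

The resulting list could then be handed to rsync through its `--files-from` option, after stripping the directory prefix, since `--files-from` entries are interpreted relative to the source directory.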

幸福还没到 2024-07-11 02:05:12

It seems to me the problem is transferring a file before it's complete, not that you're deleting it.

If this is Linux, it's possible for a file to be open by process A and process B can unlink the file. There's no error, but of course A is wasting its time. Therefore, the fact that rsync deletes the source file is not a problem.
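The unlink-while-open behaviour is easy to try out in a shell. The sketch below uses a made-up function name, `unlink_while_open`, and file descriptor 3 to stand in for process A, with `rm` playing process B.

```shell
#!/bin/sh
# Hypothetical demo: keep a file open, unlink it, then keep writing to it.
unlink_while_open() {
    f=$(mktemp) || return 1
    exec 3>"$f"     # "process A": hold the file open for writing
    rm -- "$f"      # "process B": unlink the name
    if echo "more data" >&3; then
        echo "write succeeded after unlink"
    else
        echo "write failed after unlink"
    fi
    exec 3>&-       # closing the last descriptor releases the data
}
```

On Linux the write is expected to succeed, but the data is unreachable by name and disappears once the descriptor is closed, which is the "wasting its time" case described above.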

The problem is that rsync deletes the source file only after it has copied it, and if the file is still being written to disk at that point, what arrives on mass is a partial file.

How about this: mount mass as a remote file system (NFS would work) on speed. Then just have the crawler write the files directly to that mount.
