How do I fix wget mixing up downloaded data when running multiple concurrent instances?

Posted 2024-10-08 09:17:41

I am running a script that in turn calls another script multiple times in the background with different sets of parameters.

The secondary script first does a wget on an ftp url to get a listing of files at that url. It outputs that to a unique filename.

Simplified example:
Each of these is being called by a separate instance of the secondary script running in the background.

wget --no-verbose 'ftp://foo.com/' -O '/downloads/foo/foo_listing.html' 2>foo.log

wget --no-verbose 'ftp://bar.com/' -O '/downloads/bar/bar_listing.html' 2>bar.log
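
For context, a minimal sketch of the kind of driver loop described above (the script and parameter names here are hypothetical; the actual proprietary code is not shown):

# driver.sh -- hypothetical primary script: one backgrounded
# secondary-script instance per parameter set
for site in foo bar; do
    ./get_listing.sh "$site" &    # runs one of the wget calls above
done
wait    # block until every backgrounded instance has finished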

When I run the secondary script one at a time, everything behaves as expected. I get an html file with a list of files, links to them, and information about the files, the same way I would when viewing an ftp url through a browser.

Continuing the simplified example, the one-at-a-time (and expected) results:

foo_listing.html:

...
<a href="ftp://foo.com/foo1.xml">foo1.xml</a> ...
<a href="ftp://foo.com/foo2.xml">foo2.xml</a> ...
...

bar_listing.html:

...
<a href="ftp://bar.com/bar3.xml">bar3.xml</a> ...
<a href="ftp://bar.com/bar4.xml">bar4.xml</a> ...
...

When I run the secondary script many times in the background, some of the resulting files, although they have the correct base url (the one that was passed in), list files from a different run of wget.

Continuing the simplified example, the multiprocessing (and actual) results:

foo_listing.html:

...
<a href="ftp://foo.com/bar3.xml">bar3.xml</a> ...
<a href="ftp://foo.com/bar4.xml">bar4.xml</a> ...
...

bar_listing.html:
correct, as above

Oddly enough, all other files I download seem to work just fine. It's only these listing files that get jumbled up.

The current workaround is to put in a 5-second delay between backgrounded processes. With only that one change, everything works perfectly.
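
In terms of the hypothetical driver loop sketched earlier, the workaround amounts to a single added line:

for site in foo bar; do
    ./get_listing.sh "$site" &
    sleep 5    # workaround: stagger the backgrounded instances
done
wait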


Does anybody know how to fix this?

Please don't recommend using some other method of getting the listing files or not running concurrently. I'd like to actually know how to fix this when using wget in many backgrounded processes if possible.

EDIT:

Note:

I am not referring to the status output that wget spews to the screen. I don't care at all about that (that is actually also being stored in separate log files and is working correctly). I'm referring to the data wget is downloading from the web.

Also, I cannot show the exact code that I am using as it is proprietary for my company. There is nothing "wrong" with my code as it works perfectly when putting in a 5 second delay between backgrounded instances.

Comments (1)

残花月 2024-10-15 09:17:41

Log a bug with GNU, use something else for now whenever possible, and put in time delays between concurrent runs. Possibly create a wrapper for getting ftp directory listings that allows only one to be retrieved at a time.

:-/
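
For reference, a minimal sketch of such a serializing wrapper using flock(1) from util-linux; the script name and lock-file path are assumptions, not part of the original question. (One plausible collision point worth noting: for ftp urls, wget writes a temporary .listing file into the current working directory, so concurrent instances sharing a directory can clobber each other; serializing the listing retrievals sidesteps that.)

#!/bin/sh
# get_listing_serialized.sh -- hypothetical wrapper that retrieves
# one ftp directory listing at a time.
# Usage: get_listing_serialized.sh URL OUTPUT LOGFILE
url=$1; out=$2; log=$3
(
    flock 9    # block until no other instance holds the lock
    wget --no-verbose "$url" -O "$out" 2>"$log"
) 9>/tmp/wget_listing.lock    # fd 9 holds the shared lock file

Called exactly like the plain wget, e.g.:

./get_listing_serialized.sh 'ftp://foo.com/' '/downloads/foo/foo_listing.html' foo.log &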
