Parallel wget in Bash
I am getting a bunch of relatively small pages from a website and was wondering if I could somehow do it in parallel in Bash. Currently my code looks like this, but it takes a while to execute (I think what is slowing me down is the latency in the connection).
for i in {1..42}
do
wget "https://www.example.com/page$i.html"
done
I have heard of using xargs, but I don't know anything about that and the man page is very confusing. Any ideas? Is it even possible to do this in parallel? Is there another way I could go about attacking this?
Comments (5)
Much preferable to pushing wget into the background using & or -b, you can use xargs to the same effect, and better. The advantage is that xargs will synchronize properly with no extra work, which means that you are safe to access the downloaded files (assuming no error occurs). All downloads will have completed (or failed) once xargs exits, and you know by the exit code whether all went well. This is much preferable to busy-waiting with sleep and testing for completion manually.
Assuming that URL_LIST is a variable containing all the URLs (it can be constructed with a loop as in the OP's example, but could also be a manually generated list), running xargs over it will pass one argument at a time (-n 1) to wget, and execute at most 8 parallel wget processes at a time (-P 8). xargs returns after the last spawned process has finished, which is just what we wanted to know. No extra trickery is needed.
The "magic number" of 8 parallel downloads that I've chosen is not set in stone, but it is probably a good compromise. There are two factors in "maximising" a series of downloads:
One is filling "the cable", i.e. utilizing the available bandwidth. Assuming "normal" conditions (the server has more bandwidth than the client), this is already the case with one or at most two downloads. Throwing more connections at the problem will only result in packets being dropped and TCP congestion control kicking in, and N downloads with asymptotically 1/N bandwidth each, to the same net effect (minus the dropped packets, minus window-size recovery). Packets being dropped is a normal thing to happen in an IP network; this is how congestion control is supposed to work (even with a single connection), and normally the impact is practically zero. However, having an unreasonably large number of connections amplifies this effect, so it can become noticeable. In any case, it doesn't make anything faster.
The second factor is connection establishment and request processing. Here, having a few extra connections in flight really helps. The problem one faces is the latency of two round-trips (typically 20-40ms within the same geographic area, 200-300ms inter-continental) plus the odd 1-2 milliseconds that the server actually needs to process the request and push a reply to the socket. This is not a lot of time per se, but multiplied by a few hundred/thousand requests, it quickly adds up.
Having anything from half a dozen to a dozen requests in-flight hides most or all of this latency (it is still there, but since it overlaps, it does not sum up!). At the same time, having only a few concurrent connections does not have adverse effects, such as causing excessive congestion, or forcing a server into forking new processes.
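The command this answer refers to was lost in extraction; reconstructed from the flags it describes (-n 1, -P 8) and the URL_LIST variable it assumes, it would look like the following sketch (shown with echo prefixed as a dry run that only prints the wget commands; drop the echo to actually download):

```shell
# Build URL_LIST the same way as the question's loop (hypothetical variable name).
URL_LIST=$(for i in {1..42}; do echo "https://www.example.com/page$i.html"; done)

# -n 1: pass one URL per wget invocation; -P 8: at most 8 wget processes at a time.
# 'echo wget' makes this a dry run that only prints the commands it would run.
echo $URL_LIST | xargs -n 1 -P 8 echo wget
```

A non-zero exit status from xargs indicates that at least one invocation failed, which is the synchronization property the answer describes.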
Just running the jobs in the background is not a scalable solution: if you are fetching 10,000 URLs, you probably only want to fetch a few (say 100) in parallel. GNU Parallel is made for that:
See the man page for more examples:
http://www.gnu.org/software/parallel/man.html#example__download_10_images_for_each_of_the_past_30_days
You can use wget's -b option to background each download. If you don't want log files, add the -o /dev/null option.
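Applied to the question's loop, this would look like the following sketch (each wget detaches into the background immediately; note that, unlike the xargs approach, there is no easy way to know when they have all finished):

```shell
for i in {1..42}
do
    # -b: go to the background immediately after startup;
    # -o /dev/null: discard the wget-log file that -b would otherwise create.
    wget -b -o /dev/null "https://www.example.com/page$i.html"
done
```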
Adding an ampersand to a command makes it run in the background
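A sketch of this approach, with a trailing wait added so the script blocks until every background job has finished (shown with echo prefixed as a dry run; drop the echo to actually download):

```shell
for i in {1..42}
do
    # '&' runs each command in the background; 'echo' makes this a dry run.
    echo wget "https://www.example.com/page$i.html" &
done
wait    # block until all background jobs have finished
```

Note that this launches all 42 processes at once, which is exactly the scalability concern the GNU Parallel answer above raises.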
wget version 2 implements multiple connections.
https://github.com/rockdaboot/wget2