Crawling a website with wget and limiting the total number of crawled links

Posted 2024-10-17 04:31:34

I want to learn more about crawlers by playing around with the wget tool. I'm interested in crawling my department's website and finding the first 100 links on that site. So far, the command below is what I have. How do I make the crawler stop after 100 links?

wget -r -o output.txt -l 0 -t 1 --spider -w 5 -A html -e robots=on "http://www.example.com"

她如夕阳 2024-10-24 04:31:34

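You can't. wget does not support this, so if you want something like it, you would have to write a tool yourself.

You could fetch the main file, parse the links manually, and fetch them one by one with a limit of 100 items. But that is not something wget supports.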
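The manual approach (fetch the main page, parse its links, stop at 100) might look roughly like the Python sketch below. It is only an illustration under assumptions: the names `LinkCollector`, `first_links`, and `crawl_first_links` are made up for this example, only `<a href>` links are collected, and robots.txt handling and delays between requests are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect absolute <a href> targets, ignoring anything past the limit."""

    def __init__(self, base_url, limit=100):
        super().__init__()
        self.base_url = base_url
        self.limit = limit
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Skip non-anchor tags and stop collecting once the limit is reached.
        if tag != "a" or len(self.links) >= self.limit:
            return
        for name, value in attrs:
            if name == "href" and value:
                url = urljoin(self.base_url, value)  # resolve relative links
                if url not in self.links:            # count each link once
                    self.links.append(url)
                break


def first_links(html, base_url, limit=100):
    """Return up to `limit` distinct links found in an HTML string."""
    parser = LinkCollector(base_url, limit)
    parser.feed(html)
    return parser.links


def crawl_first_links(start_url, limit=100):
    """Download the start page and return its first `limit` links."""
    with urlopen(start_url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    return first_links(html, start_url, limit)


# Example (needs network access):
#   for url in crawl_first_links("http://www.example.com"):
#       print(url)
```

Each returned URL could then be fetched one by one, for instance with a separate wget call per link.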

You could also take a look at HTTrack for website crawling; it has quite a few extra options: http://www.httrack.com/


缪败 2024-10-24 04:31:34

Create a fifo file (mknod /tmp/httpipe p)
Fork
- In the child, run wget --spider -r -l 1 http://myurl --output-file /tmp/httpipe
- In the parent, read /tmp/httpipe line by line
- Parse each line with =~ m{^\-\-\d\d:\d\d:\d\d\-\- http://$self->{http_server}:$self->{tcport}/(.*)$} and print $1
- Count the lines; after 100 lines, close the file, which breaks the pipe and stops wget
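The same idea can be sketched in Python. Note the substitution: instead of a named fifo and an explicit fork, this version reads wget's log through an anonymous pipe from `subprocess`. The function names `take_urls` and `spider_first_urls` are hypothetical, and the regular expression assumes log lines of the form `--2024-10-17 04:31:34--  URL` (or the older time-only form), which may vary between wget versions.

```python
import re
import subprocess

# Assumed format of a request line in wget's log, for example:
#   --2024-10-17 04:31:34--  http://www.example.com/page.html
# Older versions print only the time: --04:31:34--  http://...
REQUEST_LINE = re.compile(r"^--(?:\d{4}-\d\d-\d\d )?\d\d:\d\d:\d\d--\s+(\S+)")


def take_urls(lines, limit=100):
    """Return up to `limit` distinct URLs parsed from wget log lines."""
    urls = []
    for line in lines:
        match = REQUEST_LINE.match(line)
        if not match:
            continue
        url = match.group(1)
        if url not in urls:  # the same URL can be logged more than once
            urls.append(url)
        if len(urls) >= limit:
            break  # stop reading; the caller then shuts wget down
    return urls


def spider_first_urls(start_url, limit=100):
    """Run wget in spider mode and stop it after `limit` requested URLs."""
    process = subprocess.Popen(
        ["wget", "--spider", "-r", "-l", "1", start_url],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,  # wget writes its log to stderr by default
        text=True,
    )
    try:
        return take_urls(process.stderr, limit)
    finally:
        process.stderr.close()  # plays the role of closing the fifo
        process.terminate()     # make sure wget does not keep crawling
        process.wait()


# Example (needs wget installed and network access):
#   for url in spider_first_urls("http://www.example.com"):
#       print(url)
```

Keeping the parsing in `take_urls` separate from the process handling means the counting logic can be checked against a saved log file without running wget at all.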