
Web crawling and robots.txt - II

Posted on 2024-11-18 01:24:07

This is a similar scenario to one of my previous questions.

Using wget, I run the following to pull down images from a site (sub-folder):
```
wget -r -A.jpg http://www.abc.com/images/
```
The command above downloads two images: Img1 and Img2.
The index.php file in http://www.abc.com/images/ refers only to Img2.jpg (I checked the page source).
If I enter http://www.abc.com/images/Img4.jpg or http://www.abc.com/images/Img5.jpg in the browser, I get two separate images.
But wget does not download these images.
How should I go about retrieving the entire set of images under http://www.abc.com/images/?

碍人泪离人颜 2024-11-25 01:24:07

I'm not exactly sure what you want, but try this:

wget --recursive --accept=gif,jpg,png http://www.abc.com

This will:

- Create a directory called www.abc.com\
- Crawl the pages on www.abc.com by following links (by default wget stops at a recursion depth of 5; `--level` changes that)
- Save all .gif, .jpg or .png files inside the corresponding directories under www.abc.com\

You can then delete every directory except the one you're interested in, namely www.abc.com\images\.
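The cleanup step can be scripted. The following is a minimal sketch, assuming a POSIX shell and a `find` that supports `-mindepth` and `-maxdepth` (GNU and BSD `find` do). `keep_only` is a hypothetical helper name, not part of wget, and on Unix-like systems the mirror directory uses forward slashes (www.abc.com/images/).

```shell
# Hypothetical helper: delete everything directly under a mirror
# directory except one named entry (here, the images directory).
# Assumes find supports -mindepth/-maxdepth (GNU and BSD find do).
keep_only() {
    mirror=$1   # top-level directory created by wget, e.g. www.abc.com
    keep=$2     # name of the entry to keep, e.g. images
    # Only look at direct children of the mirror; remove all but $keep.
    find "$mirror" -mindepth 1 -maxdepth 1 ! -name "$keep" -exec rm -rf {} +
}
```

After the crawl finishes, `keep_only www.abc.com images` would leave only the images directory in place. Because it calls `rm -rf`, it is worth running it on a copy first.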

Crawling all pages is time-consuming, but it is probably the only way to make sure you get every image referenced by any page on www.abc.com. Unless the server allows directory browsing, there is no other way to detect which images exist under http://www.abc.com/images/.
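Guessing is not detection, but if the file names happen to follow a predictable pattern, as Img1 through Img5 in the question suggest, one workaround is to generate candidate URLs and let wget try each one. This is a minimal sketch under that assumption: the `ImgN.jpg` pattern and the upper bound are guesses, not something the server confirms, and `candidate_urls` is a hypothetical helper name.

```shell
# Hypothetical helper: print candidate image URLs, one per line.
# The ImgN.jpg naming pattern is an assumption inferred from the
# question; the server may use different names or skip numbers.
candidate_urls() {
    base=$1   # directory URL without trailing slash, e.g. http://www.abc.com/images
    max=$2    # highest index to try (a guess)
    i=1
    while [ "$i" -le "$max" ]; do
        printf '%s/Img%d.jpg\n' "$base" "$i"
        i=$((i + 1))
    done
}
```

The list can be piped into wget, which reads URLs from standard input when given `--input-file=-`, for example `candidate_urls http://www.abc.com/images 20 | wget --no-clobber --wait=1 --input-file=-`. URLs that do not exist are reported as errors and skipped, and `--wait=1` keeps the request rate modest.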
