Spider a website and return only the URLs



I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through a grep, I can't seem to find the right magic to make it work:

wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'

The grep filter seems to have absolutely no effect on the wget output. Have I got something wrong, or is there another tool I should try that's more geared towards providing this kind of limited result set?

UPDATE

So I just found out offline that, by default, wget writes to stderr. I missed that in the man pages (in fact, I still haven't found it if it's in there). Once I redirected stderr to stdout, I got closer to what I need:

wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:'

I'd still be interested in other/better means for doing this kind of thing, if any exist.


3 Answers

漫漫岁月 2024-09-07 23:31:11


The absolute last thing I want to do is download and parse all of the content myself (i.e. create my own spider). Once I learned that Wget writes to stderr by default, I was able to redirect it to stdout and filter the output appropriately.

wget --spider --force-html -r -l2 $url 2>&1 \
  | grep '^--' | awk '{ print $3 }' \
  | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
  > urls.m3u

This gives me a list of the URIs for the content resources (i.e. resources that aren't images, CSS, or JS source files) that were spidered. From there, I can send the URIs off to a third-party tool for processing to meet my needs.

The output still needs to be streamlined slightly (it produces duplicates, as shown above), but it's almost there and I haven't had to do any parsing myself.

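For instance, a minimal tweak of the pipeline above (my own addition, not something from the original answer) is to pipe through sort -u so the duplicates are dropped before the list is written:

wget --spider --force-html -r -l2 $url 2>&1 \
  | grep '^--' | awk '{ print $3 }' \
  | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
  | sort -u > urls.m3u   # sort -u removes the duplicate URIs before saving the list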

离笑几人歌 2024-09-07 23:31:11


Create a few regular expressions to extract the addresses from all of the <a href="(ADDRESS_IS_HERE)"> tags.

Here is the solution I would use:

wget -q http://example.com -O - | \
    tr "\t\r\n'" '   "' | \
    grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | \
    sed -e 's/^.*"\([^"]\+\)".*$/\1/g'

This will output all http, https, ftp, and ftps links from a webpage. It will not give you relative URLs, only full URLs.

Explanation regarding the options used in the series of piped commands:

wget -q makes it not have excessive output (quiet mode).
wget -O - makes it so that the downloaded file is echoed to stdout, rather than saved to disk.

tr is the unix character translator, used in this example to translate newlines and tabs to spaces, as well as convert single quotes into double quotes so we can simplify our regular expressions.

grep -i makes the search case-insensitive.
grep -o makes it output only the matching portions.

sed is the Stream EDitor unix utility which allows for filtering and transformation operations.

sed -e just lets you feed it an expression.

Running this little script on "http://craigslist.org" yielded quite a long list of links:

http://blog.craigslist.org/
http://24hoursoncraigslist.com/subs/nowplaying.html
http://craigslistfoundation.org/
http://atlanta.craigslist.org/
http://austin.craigslist.org/
http://boston.craigslist.org/
http://chicago.craigslist.org/
http://cleveland.craigslist.org/
...
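
If relative links are wanted as well, a hedged variation of the same pipeline (my own sketch, with http://example.com standing in for a real site) can drop the protocol requirement from the grep and prefix root-relative paths with the base URL:

base="http://example.com"
wget -q "$base" -O - | \
    tr "\t\r\n'" '   "' | \
    grep -i -o '<a[^>]\+href[ ]*=[ \t]*"[^"]\+"' | \
    sed -e 's/^.*"\([^"]\+\)".*$/\1/g' | \
    sed -e "s|^/|$base/|"   # rewrites "/path" to "http://example.com/path"; other forms pass through unchanged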

一向肩并 2024-09-07 23:31:11


I've used a tool called xidel:

xidel http://server -e '//a/@href' | 
grep -v "http" | 
sort -u | 
xargs -L1 -I {}  xidel http://server/{} -e '//a/@href' | 
grep -v "http" | sort -u

A little hackish, but it gets you closer! This is only the first level. Imagine packing this up into a self-recursive script!
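
A rough sketch of that self-recursive idea, reusing the same xidel expression as above (http://server stays a placeholder, and the depth limit and URL joining are my own simplifying assumptions, not part of the original answer):

#!/usr/bin/env bash
# Naive recursive crawl: list relative links, prefix them with the site root,
# and recurse until the depth counter runs out.
crawl() {
  local url=$1 depth=$2
  [ "$depth" -le 0 ] && return
  xidel "$url" -e '//a/@href' 2>/dev/null |
    grep -v "http" |
    sort -u |
    while read -r path; do
      echo "http://server/$path"
      crawl "http://server/$path" $((depth - 1))
    done
}

crawl "http://server" 2 | sort -u   # a final sort -u drops URLs reached via more than one page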
