Using the Python robotparser

Posted on 2024-12-08 09:43:58


I am not understanding how to use the parse function in the robotparser module. Here is what I tried:

In [28]: rp.set_url("http://anilattech.wordpress.com/robots.txt")

In [29]: rp.parse("""# If you are regularly crawling WordPress.com sites please use our firehose to receive real-time push updates instead.
# Please see http://en.wordpress.com/firehose/ for more details.
Sitemap: http://anilattech.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
# har har
User-agent: *
Disallow: /activate/
User-agent: *
Disallow: /signup/
User-agent: *
Disallow: /related-tags.php
# MT refugees
User-agent: *
Disallow: /cgi-bin/
User-agent: *
Disallow:""")

In [48]: rp.can_fetch("*","http://anilattech.wordpress.com/signup/")
Out[48]: True

It seems rp.entries is []. I am not understanding what is wrong. I have tried a simpler example but get the same problem.

Comments (2)

妄司 2024-12-15 09:43:58


There are two issues here. Firstly, the rp.parse method takes a list of strings, so you should add .split("\n") to that line.

The second issue is that rules for the * user agent are stored in rp.default_entry rather than rp.entries. If you check that you'll see it contains an Entry object.

I'm not sure who is at fault here, but the Python implementation of the parser only respects the first User-agent: * section, so in the example you've given only /next/ is disallowed. The other Disallow lines are ignored. I haven't read the spec, so I can't say whether this is a malformed robots.txt file or whether the Python code is wrong. I would assume the former, though.
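
To make both points concrete, here is a minimal sketch, assuming Python 3's urllib.robotparser (the module was simply called robotparser in Python 2); the exact handling of repeated User-agent: * sections may vary between Python versions:

from urllib.robotparser import RobotFileParser

robots_txt = """Sitemap: http://anilattech.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
User-agent: *
Disallow: /signup/
"""

rp = RobotFileParser()
rp.set_url("http://anilattech.wordpress.com/robots.txt")
rp.parse(robots_txt.split("\n"))  # parse() wants a list of lines, not one big string

# Rules for "*" end up on the (undocumented) default_entry attribute, not in rp.entries.
print(rp.default_entry is not None)  # True
print(rp.entries)                    # named user-agent sections go here instead

# Only the first "User-agent: *" section is honoured: /next/ is blocked,
# but /signup/ from the second section is not.
print(rp.can_fetch("*", "http://anilattech.wordpress.com/next/page"))  # False
print(rp.can_fetch("*", "http://anilattech.wordpress.com/signup/"))    # True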

笙痞 2024-12-15 09:43:58


Well, I just found the answer.

1. The thing was that this robots.txt [from wordpress.com] contained multiple User-agent declarations, which the robotparser module does not support. A tiny hack of removing the excess User-agent: * lines solved the problem.

2. The argument to parse is a list, as Andrew pointed out.
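
A minimal sketch of that workaround, again assuming Python 3's urllib.robotparser: collapse the repeated User-agent: * sections into a single block so all the Disallow rules take effect:

from urllib.robotparser import RobotFileParser

robots_txt = """Sitemap: http://anilattech.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
Disallow: /activate/
Disallow: /signup/
Disallow: /related-tags.php
Disallow: /cgi-bin/
"""

rp = RobotFileParser()
rp.set_url("http://anilattech.wordpress.com/robots.txt")
rp.parse(robots_txt.split("\n"))  # a list of lines, as Andrew pointed out

print(rp.can_fetch("*", "http://anilattech.wordpress.com/signup/"))  # now False
print(rp.can_fetch("*", "http://anilattech.wordpress.com/"))         # still True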
