Using the Python robotparser

Posted on 2024-12-08 09:43:58


I am not understanding how to use the parse function in the robotparser module. Here is what I tried:

In [28]: rp.set_url("http://anilattech.wordpress.com/robots.txt")

In [29]: rp.parse("""# If you are regularly crawling WordPress.com sites please use our firehose to receive real-time push updates instead.
# Please see http://en.wordpress.com/firehose/ for more details.
Sitemap: http://anilattech.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
# har har
User-agent: *
Disallow: /activate/
User-agent: *
Disallow: /signup/
User-agent: *
Disallow: /related-tags.php
# MT refugees
User-agent: *
Disallow: /cgi-bin/
User-agent: *
Disallow:""")

In [48]: rp.can_fetch("*","http://anilattech.wordpress.com/signup/")
Out[48]: True

It seems rp.entries is []. I am not understanding what is wrong. I have tried a simpler example but get the same problem.

Comments (2)

妄司 2024-12-15 09:43:58


There are two issues here. Firstly, the rp.parse method takes a list of strings, so you should add .split("\n") to that line.

The second issue is that rules for the * user agent are stored in rp.default_entry rather than rp.entries. If you check that you'll see it contains an Entry object.

I'm not sure who is at fault here, but the Python implementation of the parser only respects the first User-agent: * section, so in the example you've given only /next/ is disallowed. The other Disallow lines are ignored. I haven't read the spec, so I can't say whether this is a malformed robots.txt file or whether the Python code is wrong. I would assume the former, though.
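
To make both points concrete, here is a minimal sketch, assuming Python 3's urllib.robotparser (the module was simply called robotparser in Python 2); the exact handling of repeated User-agent: * sections may vary between Python versions:

from urllib.robotparser import RobotFileParser

robots_txt = """Sitemap: http://anilattech.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
User-agent: *
Disallow: /signup/
"""

rp = RobotFileParser()
rp.set_url("http://anilattech.wordpress.com/robots.txt")
rp.parse(robots_txt.split("\n"))  # parse() wants a list of lines, not one big string

# Rules for "*" end up on the (undocumented) default_entry attribute, not in rp.entries.
print(rp.default_entry is not None)  # True
print(rp.entries)                    # named user-agent sections go here instead

# Only the first "User-agent: *" section is honoured: /next/ is blocked,
# but /signup/ from the second section is not.
print(rp.can_fetch("*", "http://anilattech.wordpress.com/next/page"))  # False
print(rp.can_fetch("*", "http://anilattech.wordpress.com/signup/"))    # True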

笙痞 2024-12-15 09:43:58


Well, I just found the answer.

1. The thing was that this robots.txt [from wordpress.com] contained multiple User-agent declarations, which the robotparser module does not support. A tiny hack of removing the excess User-agent: * lines solved the problem.

2. The argument to parse is a list, as Andrew pointed out.
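
A minimal sketch of that workaround, again assuming Python 3's urllib.robotparser: collapse the repeated User-agent: * sections into a single block so all the Disallow rules take effect:

from urllib.robotparser import RobotFileParser

robots_txt = """Sitemap: http://anilattech.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
Disallow: /activate/
Disallow: /signup/
Disallow: /related-tags.php
Disallow: /cgi-bin/
"""

rp = RobotFileParser()
rp.set_url("http://anilattech.wordpress.com/robots.txt")
rp.parse(robots_txt.split("\n"))  # a list of lines, as Andrew pointed out

print(rp.can_fetch("*", "http://anilattech.wordpress.com/signup/"))  # now False
print(rp.can_fetch("*", "http://anilattech.wordpress.com/"))         # still True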
