Using the Python robotparser
I don't understand how to use the parse function in the robotparser module. Here is what I tried:
In [28]: rp.set_url("http://anilattech.wordpress.com/robots.txt")
In [29]: rp.parse("""# If you are regularly crawling WordPress.com sites please use our firehose to receive real-time push updates instead.
# Please see http://en.wordpress.com/firehose/ for more details.
Sitemap: http://anilattech.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
# har har
User-agent: *
Disallow: /activate/
User-agent: *
Disallow: /signup/
User-agent: *
Disallow: /related-tags.php
# MT refugees
User-agent: *
Disallow: /cgi-bin/
User-agent: *
Disallow:""")
In [48]: rp.can_fetch("*","http://anilattech.wordpress.com/signup/")
Out[48]: True
It seems that rp.entries is []. I don't understand what is wrong. I have tried a simpler example but hit the same problem.
Comments (2)
There are two issues here. Firstly, the rp.parse method takes a list of strings, so you should add .split("\n") to that line.
The second issue is that rules for the * user agent are stored in rp.default_entry rather than in rp.entries. If you check that, you'll see it contains an Entry object.
I'm not sure who is at fault here, but the Python implementation of the parser only respects the first User-agent: * section, so in the example you've given only /next/ is disallowed. The other Disallow lines are ignored. I haven't read the spec, so I can't say whether this is a malformed robots.txt file or whether the Python code is wrong. I would assume the former, though.
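Putting both fixes together, here is a minimal sketch, assuming Python 2's robotparser module (in Python 3 the same class lives in urllib.robotparser) and a trimmed, illustrative robots.txt with a single wildcard section:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://anilattech.wordpress.com/robots.txt")

# One "User-agent: *" section, so every Disallow line below it is kept.
robots_txt = """User-agent: *
Disallow: /next/
Disallow: /signup/
"""

# parse() expects an iterable of lines, not one big string.
rp.parse(robots_txt.split("\n"))

print(rp.entries)        # [] -- the wildcard rules are not stored here
print(rp.default_entry)  # the Entry object holding the "User-agent: *" rules
print(rp.can_fetch("*", "http://anilattech.wordpress.com/signup/"))  # False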
Well, I just found the answer.
1. This robots.txt [from wordpress.com] contains multiple User-agent declarations, which the robotparser module does not support. A tiny hack of removing the excess User-agent: * lines solved the problem.
2. The argument to parse is a list, as Andrew pointed out.
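For reference, a rough sketch of that hack, assuming the robots.txt text pasted in the question: the repeated User-agent: * headers (and blank lines) are dropped so that all the Disallow rules end up in the single wildcard section that robotparser keeps:

import robotparser

robots_txt = """Sitemap: http://anilattech.wordpress.com/sitemap.xml
User-agent: IRLbot
Crawl-delay: 3600
User-agent: *
Disallow: /next/
User-agent: *
Disallow: /activate/
User-agent: *
Disallow: /signup/
User-agent: *
Disallow: /cgi-bin/"""

cleaned = []
seen_star = False
for line in robots_txt.split("\n"):
    if line.strip().lower() == "user-agent: *":
        if seen_star:
            continue          # keep only the first wildcard header
        seen_star = True
    if not line.strip():
        continue              # drop blank lines so the section is not cut short
    cleaned.append(line)

rp = robotparser.RobotFileParser()
rp.set_url("http://anilattech.wordpress.com/robots.txt")
rp.parse(cleaned)
print(rp.can_fetch("*", "http://anilattech.wordpress.com/signup/"))  # now False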