Python's robotparser ignores the Sitemap line
I have the following robots.txt:
User-agent: *
Disallow: /images/
Sitemap: http://www.example.com/sitemap.xml
and the following robotparser code:
import robotparser
import urlparse

def init_robot_parser(URL):
    robot_parser = robotparser.RobotFileParser()
    robot_parser.set_url(urlparse.urljoin(URL, "robots.txt"))
    robot_parser.read()
    return robot_parser
But when I do a print robot_parser just above the return robot_parser line, all I get is:
User-agent: *
Disallow: /images/
Why is it ignoring the Sitemap line? Am I missing something?
2 Answers
Sitemap is an extension to the standard, and robotparser doesn't support it. You can see in the source that it only processes "user-agent", "disallow", and "allow". For its current functionality (telling you whether a particular URL is allowed), understanding Sitemap isn't necessary.
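A minimal workaround (my sketch, not part of the original answer): since robotparser only parses the allow/disallow rules, you can fetch robots.txt yourself and pick out any Sitemap lines alongside it. This is Python 2, matching the question's code, and the hypothetical init_robot_parser below returns a (parser, sitemaps) pair instead of just the parser:

import urllib2
import urlparse
import robotparser

def init_robot_parser(URL):
    robots_url = urlparse.urljoin(URL, "robots.txt")

    # Let robotparser handle the allow/disallow rules as before.
    robot_parser = robotparser.RobotFileParser()
    robot_parser.set_url(robots_url)
    robot_parser.read()

    # Fetch the raw file separately and pull out the Sitemap lines,
    # which robotparser silently skips.
    sitemaps = []
    for line in urllib2.urlopen(robots_url):
        line = line.strip()
        if line.lower().startswith("sitemap:"):
            sitemaps.append(line.split(":", 1)[1].strip())

    return robot_parser, sitemaps

For the robots.txt in the question, this would return the parser together with ["http://www.example.com/sitemap.xml"].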
You can use reppy (https://github.com/seomoz/reppy) to parse robots.txt - it handles sitemaps.
Keep in mind, though, that in some cases there is a sitemap at the default location (/sitemaps.xml) that the site owners didn't mention in robots.txt (for example on toucharcade.com).
I also found at least one site that has its sitemaps compressed - that is, robots.txt points to a .gz file.
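If you run into one of those gzipped sitemaps, decompressing it before parsing is straightforward. Here is a rough sketch (my assumption of how you might handle it, again in Python 2, with sitemap_url standing in for whatever URL robots.txt gave you):

import gzip
import StringIO
import urllib2

def fetch_sitemap(sitemap_url):
    data = urllib2.urlopen(sitemap_url).read()
    if sitemap_url.endswith(".gz"):
        # Decompress the gzipped XML in memory before handing it to your parser.
        data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
    return data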