Python's robotparser ignores the Sitemap line
I have the following robots.txt:
User-agent: *
Disallow: /images/
Sitemap: http://www.example.com/sitemap.xml
and the following robotparser code:
import robotparser
import urlparse

def init_robot_parser(URL):
    robot_parser = robotparser.RobotFileParser()
    robot_parser.set_url(urlparse.urljoin(URL, "robots.txt"))
    robot_parser.read()
    return robot_parser
But when I do a print robot_parser just above the return robot_parser line, all I get is:
User-agent: *
Disallow: /images/
Why is it ignoring the Sitemap line? Am I missing something?
2 Answers
Sitemap is an extension to the standard, and robotparser doesn't support it. You can see in the source that it only processes "user-agent", "disallow", and "allow". For its current functionality (telling you whether a particular URL is allowed), understanding Sitemap isn't necessary.
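A minimal workaround (my sketch, not part of the original answer): since robotparser only parses the allow/disallow rules, you can fetch robots.txt yourself and pick out any Sitemap lines alongside it. This is Python 2, matching the question's code, and the hypothetical init_robot_parser below returns a (parser, sitemaps) pair instead of just the parser:

import urllib2
import urlparse
import robotparser

def init_robot_parser(URL):
    robots_url = urlparse.urljoin(URL, "robots.txt")

    # Let robotparser handle the allow/disallow rules as before.
    robot_parser = robotparser.RobotFileParser()
    robot_parser.set_url(robots_url)
    robot_parser.read()

    # Fetch the raw file separately and pull out the Sitemap lines,
    # which robotparser silently skips.
    sitemaps = []
    for line in urllib2.urlopen(robots_url):
        line = line.strip()
        if line.lower().startswith("sitemap:"):
            sitemaps.append(line.split(":", 1)[1].strip())

    return robot_parser, sitemaps

For the robots.txt in the question, this would return the parser together with ["http://www.example.com/sitemap.xml"].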
You can use reppy (https://github.com/seomoz/reppy) to parse robots.txt - it handles sitemaps.
Keep in mind, though, that in some cases there is a sitemap at the default location (/sitemaps.xml) that the site owners didn't mention in robots.txt (for example on toucharcade.com).
I also found at least one site that has its sitemaps compressed - that is, robots.txt points to a .gz file.
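If you run into one of those gzipped sitemaps, decompressing it before parsing is straightforward. Here is a rough sketch (my assumption of how you might handle it, again in Python 2, with sitemap_url standing in for whatever URL robots.txt gave you):

import gzip
import StringIO
import urllib2

def fetch_sitemap(sitemap_url):
    data = urllib2.urlopen(sitemap_url).read()
    if sitemap_url.endswith(".gz"):
        # Decompress the gzipped XML in memory before handing it to your parser.
        data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
    return data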