Anybody got any C# code to parse robots.txt and evaluate URLs against it?

Posted 2024-07-15 01:25:45


Short question:

Has anybody got any C# code to parse robots.txt and then evaluate URLs against it, to see whether they would be excluded or not?

Long question:

I have been creating a sitemap for a new site yet to be released to Google. The sitemap has two modes, a user mode (like a traditional sitemap) and an 'admin' mode.

The admin mode will show all possible URLs on the site, including customized entry URLs or URLs for a specific outside partner - such as example.com/oprah for anyone who sees our site on Oprah. I want to track published links somewhere other than in an Excel spreadsheet.

I would have to assume that someone might publish the /oprah link on their blog or somewhere. We don't actually want this 'mini-Oprah site' to be indexed, because it would result in non-Oprah viewers being able to find the special Oprah offers.

So at the same time I was creating the sitemap, I also added URLs such as /oprah to our robots.txt file so that they would be excluded.
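For reference, the exclusion in robots.txt looks something like this (the /oprah path is the one from above; other partner URLs would be added the same way):

    User-agent: *
    Disallow: /oprah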

Then (and this is the actual question) I thought 'wouldn't it be nice to be able to show on the sitemap whether or not files are indexed and visible to robots'. This would be quite simple - just parse robots.txt and then evaluate a link against it.

However, this is a 'bonus feature' and I certainly don't have time to go off and write it (even though it's probably not that complex) - so I was wondering if anyone has already written any code to parse robots.txt?
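To illustrate the kind of check I have in mind, a minimal sketch might look like the following. It assumes the robots.txt content has already been fetched as a string, and it only handles plain User-agent / Disallow prefix rules (no wildcards, Allow directives, or crawl-delay), so treat it as a starting point rather than a complete implementation:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Minimal robots.txt evaluator (sketch only): collects Disallow prefixes
    // from groups that apply to the given user-agent (or the "*" group) and
    // tests request paths against them with a simple prefix match.
    public class SimpleRobotsTxt
    {
        private readonly List<string> _disallowed = new List<string>();

        public SimpleRobotsTxt(string robotsTxtContent, string userAgent = "*")
        {
            bool inMatchingGroup = false;
            foreach (var rawLine in robotsTxtContent.Split('\n'))
            {
                // Strip comments and surrounding whitespace.
                var line = rawLine.Split('#')[0].Trim();
                if (line.Length == 0) continue;

                var parts = line.Split(new[] { ':' }, 2);
                if (parts.Length != 2) continue;

                var field = parts[0].Trim().ToLowerInvariant();
                var value = parts[1].Trim();

                if (field == "user-agent")
                {
                    // A group applies if it names "*" or (part of) our user-agent.
                    inMatchingGroup = value == "*" ||
                        userAgent.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0;
                }
                else if (field == "disallow" && inMatchingGroup && value.Length > 0)
                {
                    _disallowed.Add(value);
                }
            }
        }

        // True if the path (e.g. "/oprah") matches any Disallow prefix.
        public bool IsExcluded(string path)
        {
            return _disallowed.Any(prefix =>
                path.StartsWith(prefix, StringComparison.OrdinalIgnoreCase));
        }
    }

Each URL on the sitemap could then be flagged with something like:

    var robots = new SimpleRobotsTxt("User-agent: *\nDisallow: /oprah");
    Console.WriteLine(robots.IsExcluded("/oprah"));    // True  -> hidden from robots
    Console.WriteLine(robots.IsExcluded("/sitemap"));  // False -> visible to robots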


3 Answers

浪漫之都 2024-07-22 01:25:45


Hate to say that, but just google "C# robots.txt parser" and click the first hit. It's a CodeProject article about a simple search engine implemented in C# called "Searcharoo", and it contains a class Searcharoo.Indexer.RobotsTxt, described as:

  1. Check for, and if present, download and parse the robots.txt file on the site
  2. Provide an interface for the Spider to check each Url against the robots.txt rules
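This is not the actual Searcharoo API, but a rough illustration of those two steps - fetch robots.txt from the site root if it exists, then check individual URLs against the parsed rules - reusing the SimpleRobotsTxt sketch from the question above:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    public static class RobotsCheckExample
    {
        public static async Task Main()
        {
            var siteRoot = new Uri("https://example.com/");   // hypothetical site

            string robotsContent = "";
            using (var http = new HttpClient())
            {
                try
                {
                    // Step 1: check for, and if present, download robots.txt.
                    robotsContent = await http.GetStringAsync(new Uri(siteRoot, "/robots.txt"));
                }
                catch (HttpRequestException)
                {
                    // No robots.txt (or not reachable) -> nothing is excluded.
                }
            }

            // Step 2: evaluate a URL against the parsed rules.
            var robots = new SimpleRobotsTxt(robotsContent);
            Console.WriteLine(robots.IsExcluded("/oprah")
                ? "/oprah is blocked for robots"
                : "/oprah is visible to robots");
        }
    }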
未央 2024-07-22 01:25:45


I like the code and tests in http://code.google.com/p/robotstxt/ and would recommend it as a starting point.

全部不再 2024-07-22 01:25:45


A bit of self-promotion, but since I needed a similar parser and couldn't find anything I was happy with, I created my own:

http://nrobots.codeplex.com/

I'd love any feedback.
