Has anyone got any C# code to parse robots.txt and evaluate URLs against it?
Short question:
Has anybody got any C# code to parse robots.txt and then evaluate URLs against it, to see whether they would be excluded or not?
Long question:
I have been creating a sitemap for a new site that has yet to be released to Google. The sitemap has two modes: a user mode (like a traditional sitemap) and an 'admin' mode.
The admin mode will show all possible URLs on the site, including customized entry URLs and URLs for specific outside partners, such as example.com/oprah for anyone who sees our site on Oprah. I want to track published links somewhere other than in an Excel spreadsheet.
I have to assume that someone might publish the /oprah link on their blog or somewhere else. We don't actually want this 'mini-Oprah site' to be indexed, because that would let non-Oprah viewers find the special Oprah offers.
So while I was creating the sitemap, I also added URLs such as /oprah to the exclusions in our robots.txt file.
Then (and this is the actual question) I thought: wouldn't it be nice to show on the sitemap whether or not each file is indexed and visible to robots? This should be quite simple: just parse robots.txt and then evaluate a link against it.
However, this is a 'bonus feature' and I certainly don't have time to go off and write it (even though it's probably not that complex), so I was wondering whether anyone has already written any code to parse robots.txt?
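For reference, the "parse robots.txt, then evaluate a link against it" step the question describes can be sketched in a few dozen lines of C#. This is a minimal illustration, not the API of Searcharoo, robotstxt, or NRobots: the class name `RobotsTxtSketch` and its methods are hypothetical, it only handles `User-agent`, `Disallow`, and `Allow` prefix rules, and it ignores wildcards (`*`, `$`) and other extensions of the Robots Exclusion Protocol.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal robots.txt evaluator (prefix rules only; no wildcard support).
public class RobotsTxtSketch
{
    // Each rule is (Allow?, path prefix) collected from groups matching our user-agent.
    private readonly List<(bool Allow, string Path)> _rules = new List<(bool, string)>();

    public static RobotsTxtSketch Parse(string content, string userAgent = "*")
    {
        var robots = new RobotsTxtSketch();
        bool inMatchingGroup = false;

        foreach (var rawLine in content.Split('\n'))
        {
            // Strip comments and surrounding whitespace.
            var line = rawLine.Split('#')[0].Trim();
            if (line.Length == 0) continue;

            int colon = line.IndexOf(':');
            if (colon < 0) continue;
            var field = line.Substring(0, colon).Trim().ToLowerInvariant();
            var value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
            {
                // '*' applies to everyone; otherwise substring-match our agent name.
                inMatchingGroup = value == "*" ||
                    userAgent.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0;
            }
            else if (inMatchingGroup && field == "disallow" && value.Length > 0)
            {
                robots._rules.Add((false, value));
            }
            else if (inMatchingGroup && field == "allow" && value.Length > 0)
            {
                robots._rules.Add((true, value));
            }
        }
        return robots;
    }

    // Longest matching prefix wins; with no matching rule the URL is allowed.
    public bool IsAllowed(string path)
    {
        var match = _rules
            .Where(r => path.StartsWith(r.Path, StringComparison.Ordinal))
            .OrderByDescending(r => r.Path.Length)
            .FirstOrDefault();
        return match.Path == null || match.Allow;
    }
}
```

With a file containing `User-agent: *` and `Disallow: /oprah`, `IsAllowed("/oprah/offers")` returns false while `IsAllowed("/about")` returns true, which is exactly the check the sitemap feature would need per link. The longest-match precedence between `Allow` and `Disallow` follows the common interpretation used by major crawlers; a production parser should also handle wildcards and empty `Disallow:` lines per the spec.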
3 Answers
I hate to say it, but just Google "C# robots.txt parser" and click the first hit. It's a CodeProject article about a simple search engine implemented in C# called "Searcharoo", which contains a class Searcharoo.Indexer.RobotsTxt, described as:
I like the code and tests at http://code.google.com/p/robotstxt/ and would recommend it as a starting point.
A bit of self-promotion, but since I needed a similar parser and couldn't find anything I was happy with, I created my own:
http://nrobots.codeplex.com/
I'd love any feedback.