Is it possible to stop search engine spiders from infinitely crawling the paging links on search results?

Posted 2024-07-10 01:00:04


Our SEO team would like to open up our main dynamic search results page to spiders and remove the 'nofollow' from the meta tags. It is currently accessible to spiders by allowing the path in robots.txt, but with a 'nofollow' clause in the meta tag, which prevents spiders from going beyond the first page.

<meta name="robots" content="index,nofollow">

I am concerned that if we remove the 'nofollow', the impact on our search system will be catastrophic, as spiders will start crawling through all pages in the result set. I would appreciate advice as to:

1) Is there a way to remove the 'nofollow' from the meta tag but still prevent spiders from following certain links on the page, such as the paging links? I have read mixed opinions on rel="nofollow"; is this a viable option?

<a rel="nofollow" href="http://www.mysite.com/paginglink" >Next Page</a>

2) Is there a way to control the 'depth' of how far spiders will go? It wouldn't be so bad if they hit a few pages and then stopped. (See the robots.txt sketch after this question.)

3) Our search results pages have the standard next/previous links, which would in theory cause spiders to hit pages recursively to infinity. What is the effect of this on SEO?

I understand that different spiders behave differently, but am mainly concerned with the big players, such as Google, Yahoo, MSN.

Note that our search results pages and paging links are not bot-friendly: the URLs are not rewritten and carry a ?name=value query string. From what I've seen, though, spiders no longer simply abort when they see the '?', as the results pages ARE getting indexed with decent PageRank.
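Regarding point 2 above: robots.txt has no depth setting, but Google and Yahoo document wildcard pattern support as an extension of the original standard, so one option is a Disallow rule keyed to the paging parameter; that would keep those crawlers on the first page of results even with the meta tag set back to index,follow. The parameter name below is only a placeholder for whatever the real query string uses:

User-agent: *
# 'page' is a hypothetical name for the paging parameter
Disallow: /*?page=
Disallow: /*&page=

Engines that ignore wildcards would still fetch these URLs, so treat this as a mitigation for the big players rather than a guarantee.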


Comments (3)

独﹏钓一江月 2024-07-17 01:00:04


To be honest, you are looking at nofollow wrong. Chances are the search spiders (especially Google, Yahoo, and MSN) are already hitting the nofollow'd pages, because they still have to fetch those pages to see whether they carry a noindex.

The real problem is that nofollow doesn't actually mean "don't follow"; it just means "don't pass on my reputation to this link". So unless you are aggressively blocking bots, which it doesn't sound like you are, changing the ROBOTS meta tag and the robot commands on links will not affect performance, because the bots are already hitting your site. To confirm this, just look at your HTTP server log.

So my vote is that you will not see any problem with removing the robot limits.

心是晴朗的。 2024-07-17 01:00:04


I've seen Google index a calendar system that had relative links on each page through the end of time (Jan 19, 2038 - see: http://en.wikipedia.org/wiki/Year_2038_problem). We didn't notice the load on our servers until it exposed a bug in the source code dealing with dates in 2038.

I don't know about the other search engines, but Google offers a number of helpful tools for controlling how much the googlebot impacts your server infrastructure. See http://www.google.com/webmasters/.

There is an option in webmaster tools to set the crawl rate for your site.
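On crawl rate more generally: Googlebot only honors the rate set in Webmaster Tools, but as far as I know Yahoo! Slurp and msnbot also respect a Crawl-delay directive in robots.txt, interpreted as a number of seconds to wait between requests. A sketch with arbitrary example values:

User-agent: Slurp
Crawl-delay: 5

User-agent: msnbot
Crawl-delay: 5

Since Googlebot ignores Crawl-delay, the Webmaster Tools setting mentioned above remains the way to throttle it.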

木格 2024-07-17 01:00:04


Google bots are pretty intelligent about not traversing an entire database of dynamically generated pages, as long as the URLs give some hint that they are dynamic (e.g. a file extension such as .asp or .jsp, and numeric ids as query parameters). If you use rewrite rules to make your URLs "friendly", the bots have a harder time determining whether the page they are reading is static or dynamically generated. See this Google article for more information about dynamic vs. static URLs.

You may also want to consider creating a Google Sitemap to give the bots a better idea about what pages on your site can be indexed and which cannot.
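For what it's worth, a Google Sitemap here just means an XML file in the sitemaps.org format listing the URLs you do want crawled; a minimal sketch with a placeholder URL:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.mysite.com/search?name=widgets</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>

Note that a sitemap is only a hint about what to crawl; it does not by itself keep spiders off the paging URLs.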
