Can I allow restricted content to be indexed (by search engines) without making it public?

Posted 2024-09-28 02:01:51


I have a site with some restricted content. I want my site to appear in search results, but I do not want it to be public.

Is there a way I can allow crawlers to crawl my site but prevent them from making it public?

The closest solution I have found is Google First Click Free, but even that requires me to show the content on the first click.


Comments (4)

酷炫老祖宗 2024-10-05 02:01:51


Why do you want to allow people to search for a page that they can't access when they click the link? It's technically possible to make access difficult (check in your authentication code whether the User-Agent contains 'googlebot', though nothing stops people from faking this User-Agent if they want your content badly enough), but it's largely pointless.

Also, Google's official line (IIRC, though I can't find this stated anywhere) is that you may be penalized for deliberately showing Googlebot different content from what human users see.
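The User-Agent check described above can be made somewhat more trustworthy by combining it with Google's documented verification method: reverse-DNS the requesting IP, confirm the hostname belongs to googlebot.com or google.com, then forward-resolve it and check it maps back to the same IP. A minimal sketch (the function names are illustrative, not from any framework):

```python
import re
import socket

GOOGLEBOT_UA = re.compile(r"googlebot", re.IGNORECASE)

def looks_like_googlebot(user_agent: str) -> bool:
    """Cheap first-pass check: does the User-Agent *claim* to be Googlebot?
    Trivially spoofable, as the answer notes."""
    return bool(GOOGLEBOT_UA.search(user_agent))

def is_verified_googlebot(ip: str) -> bool:
    """Stronger check per Google's documentation: reverse DNS, domain
    check, then forward confirmation back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
```

Only requests that pass both checks would be allowed through the authentication layer; everything else gets the normal login wall.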

小糖芽 2024-10-05 02:01:51


You're pretty much locked into Google First Click Free. Your only other option is to risk violating Google's webmaster rules.

If you do use Google First Click Free, you can protect some of your content. One way is to paginate longer articles or forums and disallow crawling of the additional pages. Users looking for the rest of the content can then be prompted to register for your site.

A more advanced way is to allow all your content to be crawled and indexed. Through analytics, identify your more valuable content; then tell Google that you no longer want the "additional" or ancillary pages crawled (via rel= attributes, meta robots, X-Robots-Tag, etc.). Make sure you also apply noarchive to those pages so people can't back-door access the content via Google Cache. You've effectively allowed users to get the main content, but if they want to read more they'll have to register to gain access.

This could be viewed as "gray hat", since you're not really violating any of the webmaster guidelines, but you are creating an implementation that's not common. You're not serving different content to users; you're explicitly telling Google what you do and do not want crawled, and you're protecting the value of your site at the same time.

Of course, a system like this isn't easy to automate, but if you look around you'll see publications and certain forums / message boards doing something similar.
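The per-page directive scheme above could be sketched as a small helper that emits the X-Robots-Tag header value for each page type. The page-type labels here are hypothetical (they would come from your own CMS):

```python
def robots_header(page_type: str) -> str:
    """Return an X-Robots-Tag value for a page.

    Hypothetical page_type labels:
      'main'      -> the indexable teaser page; noarchive keeps it
                     out of Google Cache
      'ancillary' -> continuation pages reserved for registered users;
                     noindex keeps them out of the index entirely
    """
    if page_type == "ancillary":
        return "noindex, noarchive"
    return "index, follow, noarchive"
```

Your web server or framework would then send this as `X-Robots-Tag: <value>` on each response (or the equivalent `<meta name="robots" content="...">` tag in the page head).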

白馒头 2024-10-05 02:01:51


Not really.

You could set a cookie for requests coming from known search engines and allow those requests to access your content, but that won't prevent people from spoofing their requests, or from using something like Google Translate to proxy the information out.

行至春深 2024-10-05 02:01:51


Google Custom Search Engine has its own index: http://www.google.com/cse/manage/create. So you could basically push all your sites to Google Custom Search via on-demand indexing (http://www.google.com/support/customsearch/bin/topic.py?hl=en&topic=16792), and shortly thereafter block the real Googlebot from accessing them again and/or kick them out via Google Webmaster Tools.

But that would be a lot of hacking, and your site will probably escape into the wild at some point (or you'll drop out of the on-demand index sometimes).

And/or you could buy your own little Google (called Google Enterprise): http://www.google.com/enterprise/search/index.html. Then your Google can access it, but it won't be publicly available.

But reading your question again: that's probably not what you want, is it?
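Blocking the real Googlebot after the one-off indexing, as suggested above, would be a standard robots.txt rule, for example:

```
User-agent: Googlebot
Disallow: /
```

Note that robots.txt only stops compliant crawlers from fetching pages; already-indexed URLs may still need to be removed via Webmaster Tools, as the answer says.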
