What is the best way to blacklist search engines?

Posted 2024-10-24 01:35:33

I have built a photo community web application in PHP/MySQL, using CodeIgniter as a framework. All content is public, so search engines regularly drop by. This is exactly what I want, yet it has two unwanted side effects:

  • Each visit creates a session in my session table.
  • Each visit by a search engine to a photo page increases the view counter.

As for the second problem, I am rewriting the call to my view-count script so that it is made from JavaScript only; that should prevent search engines from increasing the count, right?
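
For concreteness, here is a minimal sketch of that JavaScript-only counting idea as a standalone PHP endpoint; the endpoint path, table, and column names are hypothetical:

```php
<?php
// count_view.php -- hypothetical endpoint. The photo page requests it via
// JavaScript (e.g. an XMLHttpRequest to /count_view.php?photo_id=123), so
// crawlers that only fetch the HTML never trigger the counter.
$photoId = filter_input(INPUT_GET, 'photo_id', FILTER_VALIDATE_INT);
if ($photoId === false || $photoId === null) {
    http_response_code(400); // missing or non-numeric id
    exit;
}

// Connection details and schema are illustrative.
$pdo = new PDO('mysql:host=localhost;dbname=photo_app', 'user', 'pass');
$stmt = $pdo->prepare('UPDATE photos SET view_count = view_count + 1 WHERE id = ?');
$stmt->execute(array($photoId));
```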

As for the session table, my thinking was to clean it up after the fact using a cron job, so there is no impact on performance. I'm recording the IP and user-agent string in the session table, so it seems to me that a blacklist approach is best. If so, what is the best way to approach it? Is there an easy/reusable way to determine that a session is from a search engine?

Comments (3)

一城柳絮吹成雪 2024-10-31 01:35:33
  • Identify major search engines (Hint)
  • Check visitors against your precompiled list (above)
  • Do not start a session or increase the counter on a match (see the sketch below)

Edit:

List of User-Agents
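
As a minimal sketch of this approach in PHP (the signature list below is illustrative, not exhaustive; maintain it from a published user-agent list such as the one linked above):

```php
<?php
// Return true if the user-agent string matches a known crawler signature.
function is_search_bot(string $userAgent): bool
{
    // Illustrative list only; real lists are much longer.
    $botSignatures = array(
        'googlebot', 'bingbot', 'slurp', 'duckduckbot',
        'baiduspider', 'yandexbot', 'msnbot',
    );
    $ua = strtolower($userAgent);
    foreach ($botSignatures as $signature) {
        if (strpos($ua, $signature) !== false) {
            return true;
        }
    }
    return false;
}

// Usage in a controller: skip session creation and counting on a match.
if (!is_search_bot($_SERVER['HTTP_USER_AGENT'] ?? '')) {
    // start the session / increment the view counter here
}
```

Since the question uses CodeIgniter, note that the framework also ships a User Agent library: after `$this->load->library('user_agent');`, the call `$this->agent->is_robot()` checks the visitor against the bot list in `config/user_agents.php`, which saves you maintaining your own list.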

泪之魂 2024-10-31 01:35:33

Why are you worried about either of these situations? The best strategy for dealing with crawlers is to treat them like any other user.

Sessions created by search engines are no different from any other session. They all have to be garbage collected, since you can't assume that every user will click the "logout" button when leaving your site. Handle them the same way you handle any expired session (a cleanup sketch follows below). You have to do this anyway, so why invest extra time in treating search engines differently?

As for search engines incrementing view counters, why is that a problem? "View count" is a misleading term anyway; what you're really telling people is how many times the page has been requested. It's not up to you to ensure that a pair of eyeballs actually sees the page, and there is no reasonable way to do so. For every bot you "blacklist", there will be a dozen more one-offs scraping your content without serving a friendly user-agent string.
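
To make "handle them like any expired session" concrete, a cron-driven cleanup might look like the following; it assumes CodeIgniter's default `ci_sessions` table, where `last_activity` is a Unix timestamp, and the lifetime and connection details are illustrative:

```php
<?php
// cron_session_gc.php -- run periodically from cron, e.g.:
//   */30 * * * * php /path/to/cron_session_gc.php
$maxLifetime = 7200; // seconds; keep in sync with your sess_expiration config

// Connection details are illustrative.
$pdo = new PDO('mysql:host=localhost;dbname=photo_app', 'user', 'pass');
$stmt = $pdo->prepare('DELETE FROM ci_sessions WHERE last_activity < ?');
$stmt->execute(array(time() - $maxLifetime));
```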

脸赞 2024-10-31 01:35:33

Use a robots.txt file to control exactly what search engine crawlers are allowed to see and do.
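
For example, a minimal robots.txt along these lines keeps the photo pages crawlable while steering well-behaved crawlers away from a view-count endpoint (the path is hypothetical):

```
User-agent: *
Disallow: /count_view.php
```

Note that robots.txt only governs crawlers that choose to honor it; it is a convention, not an access control.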
