What is the best way to blacklist search engines?
I have built a photo community web application in PHP/MySQL, using CodeIgniter as a framework. All content is public so search engines regularly drop by. This is exactly what I want, yet it has two unwanted side effects:
- Each visit creates a session in my session table.
- Each visit by a search engine to a photo page increments the view counter.
As for the second problem, I am rewriting the call to my view count script so it is made from JavaScript only; that should prevent count increases from search engines, right?
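To make the idea concrete, something like this minimal sketch is what I have in mind (the Viewcount controller and the photos table/views column are hypothetical placeholders, not my actual code). CodeIgniter's is_ajax_request() just checks the X-Requested-With header, which an XHR call sets and a plain crawler GET does not:

```php
<?php
// Sketch of an XHR-only counter endpoint; controller, table, and
// column names are placeholders.
class Viewcount extends CI_Controller {

    public function increment($photo_id)
    {
        // A direct (non-XHR) request, e.g. from a crawler fetching
        // the URL: do not count it.
        if ( ! $this->input->is_ajax_request())
        {
            show_404();
            return;
        }

        // Atomic increment; FALSE stops CodeIgniter from escaping
        // the raw 'views + 1' expression.
        $this->db->set('views', 'views + 1', FALSE);
        $this->db->where('id', (int) $photo_id);
        $this->db->update('photos');
    }
}
```

The obvious caveat is that a bot that executes JavaScript (or fakes the header) would still be counted, so this only filters out the well-behaved crawlers.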
As for the session table, my thinking was to clean it up after the fact using a cron job, so there is no impact on performance. I'm recording the IP and user agent string in the session table, so it seems to me that a blacklist approach is best? If so, what is the best way to approach it? Is there an easy/reusable way to determine that a session is from a search engine?
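The closest thing to a reusable check I have found so far is CodeIgniter's own User Agent library, which matches the UA string against the robot list shipped in application/config/user_agents.php. A rough sketch (the logging call is just an illustration of where a check like this could go):

```php
<?php
// CodeIgniter's User Agent class compares the current UA string
// against the crawler list in application/config/user_agents.php.
$this->load->library('user_agent');

if ($this->agent->is_robot())
{
    // Known crawler: e.g. skip writing a session row, or flag the
    // session so the cron job can delete it later.
    log_message('debug', 'Robot detected: '.$this->agent->robot());
}
```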
Edit:
List of User-Agents
Why are you worried about either of these situations? The best strategy for dealing with crawlers is to treat them like any other user.
Sessions created by search engines are no different from any other session. They all have to be garbage collected, as you can't possibly assume that every user is going to click the "logout" button when they leave your site. Handle them the same way as you handle any expired session. You have to do this anyway, so why invest extra time in treating search engines differently?
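If you want a concrete example of that garbage collection, a standalone cron script along these lines would do it. This is only a sketch: it assumes CodeIgniter's default ci_sessions table with a unix-timestamp last_activity column, and the DSN, credentials, and two-hour threshold are placeholders:

```php
<?php
// cleanup_sessions.php -- run from cron, e.g. hourly.
// Assumes CodeIgniter's default session table (ci_sessions) with a
// unix-timestamp last_activity column; credentials are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=photoapp', 'db_user', 'db_pass');

// Delete every session idle for more than two hours, whether it
// belonged to a human or a crawler.
$stmt = $pdo->prepare('DELETE FROM ci_sessions WHERE last_activity < ?');
$stmt->execute(array(time() - 7200));

echo $stmt->rowCount() . " expired sessions removed\n";
```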
As for search engines incrementing view counters, why is that a problem? "View count" is a misleading term anyway; what you're really telling people is how many times the page has been requested. It's not up to you to ensure that a pair of eyeballs actually sees the page, and there is really no reasonable way of doing so. For every bot you "blacklist", there will be a dozen more one-offs scraping your content without serving up friendly user-agent strings.
Use a robots.txt file to control exactly what search engine crawlers are allowed to see and do.
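For example, a minimal robots.txt that keeps the photo pages crawlable but tells crawlers to stay away from the counter endpoint might look like this (the /viewcount/ path is hypothetical, not taken from the question):

```
# Allow everything except the (hypothetical) view-count endpoint.
User-agent: *
Disallow: /viewcount/
```

Note that robots.txt is advisory: compliant crawlers honor it, but the ill-behaved scrapers mentioned in the other answer will simply ignore it.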