How do I ignore web crawlers?
I have a page that counts how many times it is visited by users (registered, guests, every kind of user...).
So I update a field in the database every time the page is viewed; yes, even if the page is refreshed quickly, but I don't mind that.
Of course, when bots/crawlers scan my website they will also increment this value, and I'd like to avoid that. So, is there a list of IP addresses to ignore? Or some mechanism that can help me do this?
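For context, the per-view update the question describes might look something like the sketch below. The query() helper is hypothetical (standing in for whatever database driver you use), and the table/column names are illustrative only.

```typescript
// Hypothetical query() helper standing in for your DB driver of choice.
declare function query(sql: string, params: unknown[]): Promise<void>;

// Increment the view counter for one page.
// Table and column names (page_stats, view_count, path) are illustrative.
async function recordPageView(pagePath: string): Promise<void> {
  await query(
    "UPDATE page_stats SET view_count = view_count + 1 WHERE path = ?",
    [pagePath],
  );
}
```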
3 Answers
Another way to do it is with AJAX. Most crawlers don't parse JavaScript.
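As a minimal client-side sketch of this idea: have the page fire the counting request from JavaScript after it loads, so clients that never execute scripts are never counted. The /api/page-hit endpoint name is hypothetical; point it at your own counting handler.

```typescript
// Fire a best-effort counting request once the page has loaded.
// Crawlers that do not execute JavaScript never send this request.
window.addEventListener("load", () => {
  fetch("/api/page-hit", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ path: window.location.pathname }),
  }).catch(() => {
    // Counting is best-effort; ignore network errors.
  });
});
```

Note that some modern crawlers (Googlebot, for one) do render JavaScript, so this only filters out the simpler ones.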
IP addresses can change, so they're not the best way to detect whether or not a visitor is a bot. Instead, I suggest looking at the user-agent string in the HTTP request headers.
Here's a list of user-agent strings: http://www.user-agents.org/. Look specifically under type R for "robots, crawlers, spiders".
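A user-agent check along these lines might look like the sketch below. The marker list is a hand-picked illustration, not the full list from user-agents.org, and the usage comment assumes a hypothetical request object with a headers map.

```typescript
// Illustrative substrings that commonly appear in crawler user-agents.
const BOT_MARKERS = ["bot", "crawler", "spider", "slurp"];

function looksLikeBot(userAgent: string | undefined): boolean {
  if (!userAgent) return true; // a missing UA header is itself suspicious
  const ua = userAgent.toLowerCase();
  return BOT_MARKERS.some((marker) => ua.includes(marker));
}

// Hypothetical usage inside a request handler:
// if (!looksLikeBot(req.headers["user-agent"])) await recordPageView(path);
```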
Most people don't have a static IP address. Have you set up a robots.txt to deny crawlers/bots access? You could periodically query your log files to try to identify those that don't respect robots.txt, though the user agent is easily spoofed/changed.
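For reference, a minimal robots.txt might look like the sketch below; the /stats-page path is hypothetical, so list whichever pages you don't want counted. Well-behaved crawlers will skip them, and any client that requests a disallowed URL anyway is a good candidate bot when you scan your logs.

```
# Minimal robots.txt sketch; the disallowed path is illustrative.
User-agent: *
Disallow: /stats-page
```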