Does Googlebot make invalid requests?
I'm building a component that bans spam bots' IPs based on the invalid requests they make all the time, requests that no real user could ever make by mistake.
For example, they are always trying to submit empty forms, or making GET requests to URLs that should only receive POST requests.
What I want to know is whether I risk banning Googlebot by doing so.
Is it smart enough not to crawl every URL it encounters? Does it avoid form URLs?
Comments (2)
Googlebot follows links. It will only request pages for which it finds a link. Of course, that link doesn't have to reside on your site, so it may not be under your direct control.
Googlebot normally makes only GET requests because the HTTP specification defines GET as a safe method: a GET request should not have side effects, so it should not change state on the server. Hint: never use a link (i.e. a GET) to perform or confirm a change to your site, or any web spider might trigger it.
Just to be safe, every CGI script you have that changes the state of your site should verify that the incoming request is indeed a POST.
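The method check described above can be sketched roughly as follows. This is only a minimal illustration: the paths in `POST_ONLY_PATHS` and the function name `check_method` are made up for the example, and a real site would wire the check into whatever framework or CGI layer it actually uses.

```python
from http import HTTPStatus

# Hypothetical endpoints that change state and therefore must only accept POST.
POST_ONLY_PATHS = {"/contact/submit", "/account/delete"}


def check_method(method: str, path: str) -> HTTPStatus:
    """Return the status a request should be answered with.

    A non-POST request to a state-changing endpoint gets 405 Method Not
    Allowed; everything else is passed through as 200 OK in this sketch.
    """
    if path in POST_ONLY_PATHS and method.upper() != "POST":
        return HTTPStatus.METHOD_NOT_ALLOWED
    return HTTPStatus.OK


if __name__ == "__main__":
    # A crawler that followed a link to a form URL would land here with GET.
    print(check_method("GET", "/contact/submit"))
    print(check_method("POST", "/contact/submit"))
```

Since a crawler can reach a POST-only URL simply by following a link to it, answering a stray GET with 405 is probably a safer first reaction than banning the IP outright.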
Googlebot does make invalid requests. I have found some requests whose "From:" header does not contain an "@" sign in the mailbox name it specifies. Other bots sometimes do this too. So watch out for invalid optional header data in requests.
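To illustrate the kind of header check this implies, here is a rough sketch that tests whether a "From:" value contains an "@" with something on both sides. The function name and the sample values are hypothetical, and the test is deliberately loose; it is not a full mailbox-syntax validator.

```python
from typing import Optional


def from_header_has_mailbox(value: Optional[str]) -> bool:
    """Loosely check that a From: header value looks like local@domain.

    The header is optional, so a missing header is treated as acceptable.
    """
    if value is None:
        return True
    # Split on the last "@"; all three parts must be non-empty.
    local, sep, domain = value.strip().rpartition("@")
    return bool(sep and local and domain)


if __name__ == "__main__":
    # Hypothetical sample values, not taken from real logs.
    for sample in ("crawler@example.com", "crawler(at)example.com", None):
        print(repr(sample), from_header_has_mailbox(sample))
```

Given that legitimate crawlers reportedly send such malformed values, a failed check is better treated as a weak signal than as grounds for a ban on its own.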