Is there an HTTP header field that can be used to detect spam bots?

Posted 2024-10-04 17:59:51 · 706 characters · 12 views · 0 comments


It stands to reason that scrapers and spambots wouldn't be built as well as normal web browsers. With this in mind, it seems like there should be some way to spot blatant spambots by just looking at the way they make requests.

Are there any methods for analyzing HTTP headers or is this just a pipe-dream?

Array
(
    [Host] => example.com
    [Connection] => keep-alive
    [Referer] => http://example.com/headers/
    [Cache-Control] => max-age=0
    [Accept] => application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
    [User-Agent] => Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.44 Safari/534.7
    [Accept-Encoding] => gzip,deflate,sdch
    [Accept-Language] => en-US,en;q=0.8
    [Accept-Charset] => ISO-8859-1,utf-8;q=0.7,*;q=0.3
)


Comments (2)

一绘本一梦想 2024-10-11 17:59:51


If I were writing a spam bot, I would fake the headers of a normal browser, so I doubt this is a viable approach. Some other suggestions that might help instead:

  • Use a CAPTCHA.
  • If that's too annoying, a simple but effective trick is to include a text input that is hidden by a CSS rule; users won't see it, but spam bots won't normally bother to parse and apply all the CSS rules, so they won't realise the field is invisible and will put something in it. On form submission, check that the field is empty, and reject the submission if it isn't.
  • Use a nonce on your forms; check that the nonce issued when you rendered the form is the same one that comes back when it's submitted. This won't catch everything, but it ensures the post was at least made by something that received the form in the first place. Ideally, change the nonce every time the form is rendered.
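The honeypot and nonce checks above can be sketched as follows. This is a minimal, framework-agnostic illustration, not a complete implementation: the field names (`website` for the honeypot, `nonce` for the token) and the in-memory nonce store are assumptions; a real site would keep nonces in a session or cache with expiry.

```python
# Sketch of the honeypot-field and nonce checks described above.
# Assumptions: form fields named "website" (the CSS-hidden honeypot)
# and "nonce"; an in-memory set stands in for a real session store.
import secrets

_nonces = set()  # in production, a session store or cache with expiry

def issue_nonce():
    """Generate a one-time token to embed in the form as a hidden field."""
    token = secrets.token_hex(16)
    _nonces.add(token)
    return token

def validate_submission(form):
    """Accept only if the honeypot is empty and the nonce is known.

    `form` is a dict of submitted fields. A human never sees the
    CSS-hidden "website" input, so anything in it marks a bot.
    """
    if form.get("website", ""):      # honeypot filled -> reject as bot
        return False
    nonce = form.get("nonce", "")
    if nonce not in _nonces:         # unknown nonce -> form was never served
        return False
    _nonces.discard(nonce)           # single use: a replayed nonce fails
    return True
```

Discarding the nonce on first use implements the "change the nonce every time" advice: each rendered form gets a fresh token, and a replayed submission is rejected.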
眼眸里的快感 2024-10-11 17:59:51


You can't find all bots this way, but you could catch some, or at least estimate the probability that a UA is a bot and use that in conjunction with another method.

Some bots forget the Accept-Charset and Accept-Encoding headers. You may also find impossible combinations of Accept and User-Agent (e.g. IE6 won't ask for XHTML, and Firefox doesn't advertise MS Office types).

When blocking, be careful about proxies, because they can modify headers. I recommend backing off if you see Via or X-Forwarded-For headers.

Ideally, instead of writing rules manually, you could use a Bayesian classifier. It could be as simple as joining the relevant headers together and using them as a single "word" in the classifier.
