Are there HTTP header fields that could be used to spot spambots?
It stands to reason that scrapers and spambots wouldn't be built as well as normal web browsers. With this in mind, it seems like there should be some way to spot blatant spambots by just looking at the way they make requests.
Are there any methods for analyzing HTTP headers or is this just a pipe-dream?
Array
(
[Host] => example.com
[Connection] => keep-alive
[Referer] => http://example.com/headers/
[Cache-Control] => max-age=0
[Accept] => application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
[User-Agent] => Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.517.44 Safari/534.7
[Accept-Encoding] => gzip,deflate,sdch
[Accept-Language] => en-US,en;q=0.8
[Accept-Charset] => ISO-8859-1,utf-8;q=0.7,*;q=0.3
)
If I were writing a spam bot, I would fake the headers of a normal browser, so I doubt this is a viable approach. Some other suggestions that might help instead:
You can't find all bots this way, but you could catch some, or at least get some probability that a UA is a bot and use that in conjunction with another method.

Some bots forget about the Accept-Charset and Accept-Encoding headers. You may also find impossible combinations of Accept and User-Agent (e.g. IE6 won't ask for XHTML, and Firefox doesn't advertise MS Office types).

When blocking, be careful about proxies, because they could modify the headers. I recommend backing off if you see Via or X-Forwarded-For headers.

Ideally, instead of writing rules manually, you could use a Bayesian classifier. It could be as simple as joining the relevant headers together and using them as a single "word" in the classifier.