We want to set up a small honeypot image in our HTML bodies to detect scrapers / bad bots.
Has anyone set up something like this before?
We were thinking the best way to go about it would be to:
a) Comment out the HTML via:
<!-- <img src="http://www.domain.com/honeypot.gif"/> -->
b) Apply CSS styles to the image that hide it from browsers via:
.... id="honeypot" ....
#honeypot{
display:none;
visibility:hidden;
}
Using the above, does anyone foresee any situation where a proper, real user agent would pull the image / attempt to render it?
The honeypot.gif would be a mod_rewritten php script where we would do our logging.
While I understand that any well-coded scraper might skip both of the conditions above, it would at least shed some light on the very dirty ones.
Any other pointers as to the best way to go about this?
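For reference, the mod_rewritten logging endpoint described in the question could be sketched roughly as below. The file name `honeypot.php`, the rewrite rule, and the log format are all assumptions for illustration, not a definitive implementation:

```php
<?php
// honeypot.php -- sketch of the logging endpoint (name assumed).
// An Apache rule such as:
//   RewriteRule ^honeypot\.gif$ honeypot.php [L]
// would map the fake image URL to this script.

// Build one tab-separated log line from the request environment.
function format_log_line(array $server, string $when): string {
    return sprintf("%s\t%s\t%s\t%s\n",
        $when,
        $server['REMOTE_ADDR'] ?? '-',
        $server['HTTP_USER_AGENT'] ?? '-',
        $server['HTTP_REFERER'] ?? '-');
}

if (PHP_SAPI !== 'cli') {
    // Record whoever fetched the bait image.
    file_put_contents(__DIR__ . '/honeypot.log',
        format_log_line($_SERVER, date('c')),
        FILE_APPEND | LOCK_EX);

    // Answer with a real 1x1 transparent GIF so nothing looks unusual.
    header('Content-Type: image/gif');
    echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');
}
```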
A bot will ignore your img tag because it's within a comment.
Instead, you might consider creating an invisible div that contains a link to a trigger URL on the same site (preferably within the same directory, in case the bot is depth-sensitive).
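That variant might look something like the fragment below; the trap URL and link text are made up for illustration:

```html
<!-- Hidden from human visitors, but present in the parsed DOM,
     so naive link-following bots will still fetch the trap URL. -->
<div style="display:none">
  <a href="/trap-link.php">do not follow this link</a>
</div>
```

The trap URL would point at the same kind of logging script as the image approach, and since it lives in the markup proper rather than inside a comment, even a strict HTML parser will see it.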
IMO I think any good scraper is going to know how to parse HTML with an SGML parser, and would just skip the commented-out image, but I could be wrong.

At most it will give you an idea when it happens, but it doesn't provide a way to counter the scraper. You would probably be better off coming up with some kind of cookie-based solution, as most bots probably don't care about those. You could also randomize image paths between requests and expire them after a short period.
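The randomize-and-expire idea could be sketched as signed, time-stamped tokens embedded in the image path. The function names, the HMAC scheme, and the 300-second TTL here are assumptions layered on top of the suggestion, not part of it:

```php
<?php
// Sketch: each page view embeds an image URL carrying a signed,
// time-stamped token. The logging endpoint rejects tokens that are
// expired or forged, so replayed or hard-coded paths stand out.

const HONEYPOT_SECRET = 'change-me';  // assumed server-side secret

// Build a token encoding the issue time, signed with HMAC-SHA256.
function make_token(int $issued_at): string {
    $sig = hash_hmac('sha256', (string)$issued_at, HONEYPOT_SECRET);
    return $issued_at . '-' . $sig;
}

// A token is valid if its signature matches and it has not expired.
function token_is_valid(string $token, int $now, int $ttl = 300): bool {
    $parts = explode('-', $token, 2);
    if (count($parts) !== 2) {
        return false;
    }
    [$issued_at, $sig] = $parts;
    $expected = hash_hmac('sha256', $issued_at, HONEYPOT_SECRET);
    return hash_equals($expected, $sig) && ($now - (int)$issued_at) <= $ttl;
}
```

The page template would emit something like `honeypot.php?t=<token>`, and the logger would treat an invalid or expired token as a stronger bot signal than a fresh one.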
Checking the referrer is an obvious one, if you don't care about browsers that don't send one or people who hide/alter it.
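A minimal sketch of that referrer check, with the caveats just mentioned built in (treating a missing or empty `Referer` header as suspicious is an assumption, and it will misfire on privacy-conscious browsers):

```php
<?php
// An in-page image fetched by a normal browser almost always carries
// a Referer header naming the page that embedded it; its absence on
// the honeypot URL is one more weak bot signal, not proof.
function referrer_is_suspicious(array $server): bool {
    return empty($server['HTTP_REFERER']);
}
```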