Methods for detecting web scraping
I need to detect scraping of info on my website. I have tried detection based on behavior patterns, and it seems promising, although it is relatively heavy computationally.
The basic idea is to collect the request timestamps of each client and compare its behavior pattern against a common (or precomputed) pattern.
To be more precise, I collect the time intervals between requests into an array, indexed by a function of the interval:
i = (integer) ln(interval + 1) / ln(N + 1) * N + 1
Y[i]++
X[i]++ for current client
where N is the time (count) limit; intervals greater than N are dropped. Initially X and Y are filled with ones.
Then, once I have collected enough samples in X and Y, it is time to make a decision. The criterion is the parameter C:
C = sqrt(sum((X[i]/norm(X) - Y[i]/norm(Y))^2) / k)
where X is the data for a particular client, Y is the common data, norm() is a calibration function, and k is a normalization coefficient that depends on the type of norm(). There are 3 types:
- norm(X) = sum(X)/count(X), k = 2
- norm(X) = sqrt(sum(X[i]^2)), k = 2
- norm(X) = max(X[i]), k is the square root of the number of non-empty elements of X
C is in the range (0..1); 0 means no behavior deviation and 1 is the maximum deviation.
Calibration of type 1 is best for repeating requests, type 2 for repeating requests with a few distinct intervals, and type 3 for non-constant request intervals.
What do you think? I would appreciate it if you would try this on your services.
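For reference, here is a rough Python sketch of the scheme as described above (the function names and N = 32 are placeholders, and the reading of "non-empty elements" is my own):

    import math

    N = 32  # number of bins / interval limit (example value)

    def bin_index(interval, n=N):
        # Intervals greater than N are dropped, as described above.
        if interval > n:
            return None
        return int(math.log(interval + 1) / math.log(n + 1) * n) + 1

    def new_histogram(n=N):
        # Bins are initially filled with ones; index 0 is unused.
        return [1] * (n + 2)

    def add_interval(hist, interval):
        i = bin_index(interval)
        if i is not None:
            hist[i] += 1

    def deviation(X, Y, norm_type=1):
        # C for a client histogram X against the common histogram Y.
        xs, ys = X[1:], Y[1:]
        if norm_type == 1:
            nx, ny = sum(xs) / len(xs), sum(ys) / len(ys)
            k = 2
        elif norm_type == 2:
            nx = math.sqrt(sum(x * x for x in xs))
            ny = math.sqrt(sum(y * y for y in ys))
            k = 2
        else:
            nx, ny = max(xs), max(ys)
            # "Non-empty" read as bins incremented past the initial 1 -- my assumption.
            k = math.sqrt(sum(1 for x in xs if x > 1)) or 1.0
        s = sum((x / nx - y / ny) ** 2 for x, y in zip(xs, ys))
        return math.sqrt(s / k)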
Comments (4)
To be honest, your approach is completely worthless because it is trivial to bypass. An attacker doesn't even have to write a line of code to get around it. Proxy servers are free, and you can boot up a new machine with a new IP address on Amazon EC2 for 2 cents an hour.
A better approach is Roboo, which uses cookie techniques to foil robots. The vast majority of robots can't run JavaScript or Flash, and this can be used to your advantage.
However, all of this is "(in)security through obscurity", and the ONLY REASON it might work is that your data isn't worth a programmer spending 5 minutes on it. (Roboo included.)
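As a rough sketch of the cookie idea (this is not Roboo itself, and a real implementation would set a signed, expiring token rather than a fixed value):

    from http import cookies

    # Page served to clients that have not yet passed the challenge; its script
    # sets a cookie and reloads, which most simple scrapers never do.
    CHALLENGE_PAGE = """<html><body><script>
    document.cookie = "js_ok=1; path=/";
    location.reload();
    </script></body></html>"""

    def handle(request_headers, serve_real_page):
        jar = cookies.SimpleCookie(request_headers.get("Cookie", ""))
        if "js_ok" in jar:
            return serve_real_page()   # challenge already passed
        return CHALLENGE_PAGE          # otherwise, send the challenge instead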
I do a lot of web scraping and always use multiple IP addresses and random intervals between each request.
When scraping a page, I typically only download the HTML and not the dependencies (images, CSS, etc.). So you could try checking whether the user downloads those dependencies.
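A rough sketch of that check, assuming you can pull (client_ip, request_path) pairs from your access log; the extension list and threshold are just examples:

    from collections import defaultdict

    ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".ico")

    def asset_skippers(log_entries, min_pages=5):
        # log_entries: iterable of (client_ip, request_path) pairs.
        pages = defaultdict(int)
        assets = defaultdict(int)
        for ip, path in log_entries:
            if path.lower().endswith(ASSET_EXTENSIONS):
                assets[ip] += 1
            else:
                pages[ip] += 1
        # Clients that fetched several pages but never a dependency look like scrapers.
        return [ip for ip, n in pages.items() if n >= min_pages and assets[ip] == 0]

Note that browsers with cached assets, or sites that serve assets from a separate CDN host, will also show up here, so treat it as a hint rather than proof.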
If you are asking specifically about the validity of your algorithm, it isn't bad, but it seems like you are overcomplicating it. You should use the basic methodologies already employed by WAFs to rate-limit connections. One such algorithm that already exists is the Leaky Bucket algorithm (http://en.wikipedia.org/wiki/Leaky_bucket).
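For illustration, a minimal leaky-bucket sketch (the capacity and leak rate are arbitrary example values; in practice you would keep one bucket per client IP or session and throttle whenever allow() returns False):

    import time

    class LeakyBucket:
        # Each request adds one unit to the bucket; the bucket drains at a fixed
        # rate. If it would overflow, the request is rejected (rate limited).
        def __init__(self, capacity=20, leak_rate=1.0):
            self.capacity = capacity    # maximum burst size
            self.leak_rate = leak_rate  # units drained per second
            self.level = 0.0
            self.last = time.monotonic()

        def allow(self):
            now = time.monotonic()
            self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
            self.last = now
            if self.level + 1.0 > self.capacity:
                return False
            self.level += 1.0
            return True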
As far as rate limiting to stop web scraping goes, there are two flaws in trying to rate-limit connections. The first is people's ability to use proxy networks or Tor to anonymize each request, which essentially nullifies your efforts. Even off-the-shelf scraping software like http://www.mozenda.com uses a huge block of IPs and rotates through them to solve this problem. The other issue is that you could potentially block people using a shared IP: companies and universities often use NAT, and your algorithm could mistake them for one person.
For full disclosure, I am a cofounder of Distil Networks, and we often poke holes in WAF features like rate limiting. We pitch that a more comprehensive solution is required, hence the need for our service.
OK, someone could build a robot that enters your website, downloads the HTML (not the images, CSS, etc., as in @hoju's response), and builds a graph of the links to be traversed on your site.
The robot could use random timings to make each request and change the IP in each of them using a proxy, a VPN, Tor, etc.
I was tempted to answer that you could try to trick the robot by adding hidden links using CSS (a common solution found on the Internet). But it is not a solution. When the robot accesses a forbidden link you can prohibit access to that IP. But you would end up with a huge list of banned IPs. Also, if someone started spoofing IPs and making requests to that link on your server, you could end up isolated from the world. Apart from anything else, it is possible that a solution can be implemented that allows the robot to see the hidden links.
A more effective way, I think, would be to check the IP of each incoming request, with an API that detects proxies, VPNs, Tor, etc. I searched Google for "api detection vpn proxy tor" and found some (paid) services. Maybe there are free ones.
If the API response is positive, forward the request to a Captcha.
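A rough sketch of that flow; the endpoint, query format, and is_proxy field below are placeholders for whatever detection service you choose:

    import json
    import urllib.request

    # Hypothetical endpoint and response field; substitute the actual
    # proxy/VPN/Tor detection API you pick.
    CHECK_URL = "https://ip-check.example.com/api?ip={ip}"

    def looks_like_proxy(ip):
        try:
            with urllib.request.urlopen(CHECK_URL.format(ip=ip), timeout=2) as resp:
                data = json.loads(resp.read().decode("utf-8"))
            return bool(data.get("is_proxy"))
        except OSError:
            return False  # fail open if the detection service is unreachable

    def handle_request(client_ip, serve_page, serve_captcha):
        if looks_like_proxy(client_ip):
            return serve_captcha()  # positive response -> forward to a captcha
        return serve_page()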