How to protect/monitor your site from crawling by malicious users

Posted 2024-07-10 06:50:59

Situation:

  • Site with content protected by username/password (not all controlled since they can be trial/test users)
  • a normal search engine can't get at it because of username/password restrictions
  • a malicious user can still login and pass the session cookie to a "wget -r" or something else.

The question would be what is the best solution to monitor such activity and respond to it (considering the site policy is no-crawling/scraping allowed)

I can think of some options:

  1. Set up some traffic monitoring solution to limit the number of requests for a given user/IP.
  2. Related to the first point: Automatically block some user-agents
  3. (Evil :)) Set up a hidden link that when accessed logs out the user and disables his account. (Presumably this would not be accessed by a normal user since he wouldn't see it to click it, but a bot will crawl all links.)
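
To make option 3 concrete, a rough sketch in Java (the platform mentioned further down is JBoss Seam behind apache2). The URL, the hidden anchor markup and the AccountService helper are placeholder assumptions, not an existing API; a cautious version would probably only flag the account quietly rather than disable it on the spot.

```java
// Hypothetical honeypot endpoint for option 3: the link is never visible to a
// normal user, so any authenticated hit on it is treated as automated crawling.
// The URL, the hidden anchor and the AccountService stub are illustrative
// placeholders, not an existing API.
//
// A page template would carry something like:
//   <a href="/internal/reports-archive" style="display:none" rel="nofollow">archive</a>
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

public class HoneypotServlet extends HttpServlet {   // mapped to /internal/reports-archive in web.xml

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        HttpSession session = req.getSession(false);
        if (session != null) {
            String user = (String) session.getAttribute("username");
            // Flag (or disable) the account and end the session.
            AccountService.flagAsCrawler(user, req.getRemoteAddr());
            session.invalidate();
        }
        // Answer with something bland so the bot gets no obvious signal.
        resp.sendError(HttpServletResponse.SC_NOT_FOUND);
    }

    /** Placeholder for whatever account-management code the site already has. */
    static class AccountService {
        static void flagAsCrawler(String user, String ip) {
            // e.g. mark the account in the database and/or notify an admin
        }
    }
}
```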

For point 1. do you know of a good already-implemented solution? Any experiences with it? One problem would be that some false positives might show up for very active but human
users.

For point 3: do you think this is really evil? Or do you see any possible problems with it?

Also accepting other suggestions.

Comments (9)

离鸿 2024-07-17 06:50:59

I would not recommend automatic lock-outs, not so much because they are necessarily evil, but because they provide immediate feedback to the malicious user that they tripped a sensor, and let them know not to do the same thing with the next account they sign up with.

And user-agent blocking is probably not going to be very helpful, because obviously user-agents are very easy to fake.

About the best you can probably do is monitoring, but then you still have to ask what you're going to do if you detect malicious behavior. As long as you have uncontrolled access, anyone you lock out can just sign up again under a different identity. I don't know what kind of info you require to get an account, but just a name and e-mail address, for instance, isn't going to be much of a hurdle for anybody.

It's the classic DRM problem -- if anyone can see the information, then anyone can do anything else they want with it. You can make it difficult, but ultimately if someone is really determined, you can't stop them, and you risk interfering with legitimate users and hurting your business.

度的依靠╰つ 2024-07-17 06:50:59

Point 1 has the problem you have mentioned yourself. Also it doesn't help against a slower crawl of the site, or if it does then it may be even worse for legitimate heavy users.

You could turn point 2 around and only allow the user-agents you trust. Of course this won't help against a tool that fakes a standard user-agent.

A variation on point 3 would just be to send a notification to the site owners, then they can decide what to do with that user.

Similarly for my variation on point 2, you could make this a softer action, and just notify that somebody is accessing the site with a weird user agent.

edit: Related, I once had a weird issue when I was accessing a URL of my own that was not public (I was just staging a site that I hadn't announced or linked anywhere). Although nobody should have even known this URL but me, all of a sudden I noticed hits in the logs. When I tracked this down, I saw it was from some content filtering site. Turned out that my mobile ISP used a third party to block content, and it intercepted my own requests - since it didn't know the site, it then fetched the page I was trying to access and (I assume) did some keyword analysis in order to decide whether or not to block. This kind of thing might be a tail end case you need to watch out for.
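
As a rough, assumption-based sketch of the softer "notify, don't block" variant: a servlet filter that lets every request through but records user agents that don't look like a browser, so the site owners can review them later. The allowlist and the logger name are illustrative, not anything from the answer.

```java
// Sketch of the "notify instead of block" variant: requests whose User-Agent does
// not match a small allowlist of browser signatures are logged for later review,
// but never rejected. The allowlist and logger name are illustrative assumptions.
import java.io.IOException;
import java.util.logging.Logger;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

public class UserAgentAuditFilter implements Filter {

    private static final Logger LOG = Logger.getLogger("ua-audit");

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest req = (HttpServletRequest) request;
        String ua = req.getHeader("User-Agent");
        boolean looksLikeBrowser = ua != null
                && (ua.contains("Mozilla") || ua.contains("Opera"));
        if (!looksLikeBrowser) {
            // Leave a trail the site owners can act on; don't give the client any signal.
            LOG.warning("Unusual User-Agent from " + req.getRemoteAddr() + ": " + ua);
        }
        chain.doFilter(request, response);   // the request proceeds either way
    }

    @Override public void init(FilterConfig config) { }
    @Override public void destroy() { }
}
```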

○闲身 2024-07-17 06:50:59

It depends on what kind of malicious user we are talking about.

If they know how to use wget, they can probably set up Tor and get a new IP every time, slowly copying everything you have. I don't think you can prevent that without inconveniencing your (paying?) users.

It is the same as DRM on games, music, and video. If the end user is supposed to see something, you cannot protect it.

一曲爱恨情仇 2024-07-17 06:50:59

Short answer: it can't be done reliably.

You can go a long way by simply blocking IP addresses that cause a certain number of hits in some time frame (some web servers support this out of the box, others require some modules, or you can do it by parsing your log file and e.g. using iptables), but you need to take care not to block the major search engine crawlers and large ISPs' proxies.
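
To make the log-parsing route slightly more concrete, a small offline pass over the access log can list the IPs worth a closer look. In the sketch below the file name, the threshold and the assumption that the client IP is the first whitespace-separated field are all illustrative, and the time-window aspect is left out for brevity.

```java
// Rough offline tally of hits per client IP from an Apache-style access log.
// The path, the threshold and the "IP is the first field" assumption are all
// illustrative; a real setup would also bucket by time window and could feed
// the result into iptables or a blocklist instead of just printing it.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class HitCounter {
    public static void main(String[] args) throws IOException {
        String logFile = args.length > 0 ? args[0] : "access.log";
        int threshold = 1000;                        // hits considered worth a look
        Map<String, Integer> hitsPerIp = new HashMap<>();

        for (String line : Files.readAllLines(Paths.get(logFile))) {
            if (line.isEmpty()) {
                continue;
            }
            String ip = line.split("\\s+", 2)[0];    // combined log format starts with the client IP
            hitsPerIp.merge(ip, 1, Integer::sum);
        }

        hitsPerIp.forEach((ip, hits) -> {
            if (hits > threshold) {
                System.out.println(ip + " made " + hits + " requests");
            }
        });
    }
}
```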

深巷少女 2024-07-17 06:50:59

The problem with option 3 is that the auto-logout would be trivial to avoid once the scraper figures out what is going on.

转身以后 2024-07-17 06:50:59

@frankodwyer:

  • Allowing only trusted user agents won't work; consider especially the IE user-agent string, which gets modified by add-ons or the installed .NET version. There would be too many possibilities, and it can be faked.
  • A variation on point 3 with notification to an admin would probably work, but it would mean an indeterminate delay if an admin isn't monitoring the logs constantly.

@Greg Hewgill:

  • The auto-logout would also disable the user account. At the least, a new account would have to be created, leaving more traces such as an email address and other information.

Randomly changing the logout/disable URL for point 3 would be interesting, but I don't know how I would implement it yet :)
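
One way the randomly changing trap URL could work (a sketch under assumptions, not something worked out in the thread): give each session its own random token, embed it in the hidden link, and only treat a hit as a trap hit if it carries the token issued to that same session. The names TRAP_TOKEN and /trap/ are placeholders.

```java
// Sketch of a per-session trap URL: each session gets its own random token, the
// hidden link embeds it, and the check only fires for the token issued to that
// session. Names (TRAP_TOKEN, /trap/...) are illustrative assumptions.
import java.math.BigInteger;
import java.security.SecureRandom;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpSession;

public class TrapLink {

    private static final SecureRandom RANDOM = new SecureRandom();
    private static final String TRAP_TOKEN = "trapToken";

    /** Called when rendering a page: returns the hidden href for this session. */
    public static String hiddenHref(HttpSession session) {
        String token = (String) session.getAttribute(TRAP_TOKEN);
        if (token == null) {
            token = new BigInteger(64, RANDOM).toString(36);   // e.g. "k3f9q0..."
            session.setAttribute(TRAP_TOKEN, token);
        }
        return "/trap/" + token;   // rendered as an invisible anchor in the page template
    }

    /** Called from a catch-all /trap/* handler: true only for this session's own token. */
    public static boolean isTrapHit(HttpServletRequest req) {
        HttpSession session = req.getSession(false);
        if (session == null) {
            return false;
        }
        String expected = (String) session.getAttribute(TRAP_TOKEN);
        return expected != null && req.getRequestURI().endsWith("/" + expected);
    }
}
```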

蓝礼 2024-07-17 06:50:59

http://recaptcha.net

Show it either every time someone logs in or during sign-up. Maybe you could show a captcha every tenth time.
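
If the "every tenth time" idea were implemented, the bookkeeping could be as small as the sketch below; the in-memory counter is an assumption, and the actual captcha verification (against recaptcha.net or anything else) is deliberately left out. captchaRequired("alice") would return true on her 10th, 20th, and so on login attempt.

```java
// Sketch of "show a captcha every tenth login". The per-user counter store is a
// hypothetical placeholder; the real captcha check itself is not shown here.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class CaptchaPolicy {

    private static final int EVERY_N_LOGINS = 10;
    private final Map<String, AtomicInteger> loginCounts = new ConcurrentHashMap<>();

    /** Decide whether this login attempt must also solve a captcha. */
    public boolean captchaRequired(String username) {
        int count = loginCounts
                .computeIfAbsent(username, u -> new AtomicInteger())
                .incrementAndGet();
        return count % EVERY_N_LOGINS == 0;
    }
}
```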

聽兲甴掵 2024-07-17 06:50:59

Added comments:

  • I know you can't completely protect something that a normal user should be able to see. I've been on both sides of the problem :)
  • From a developer's point of view, what do you think is the best ratio of time spent versus cases protected against? I'd guess some simple user-agent checks would remove half or more of the potential crawlers, and I know you can spend months of development protecting against the last 1%.

Again, from a service provider's point of view, I'm also interested in making sure that one user (crawler) doesn't consume CPU/bandwidth at the expense of others, so are there any good bandwidth/request limiters you can point out?

Response to comment: platform specifics: an application based on JBoss Seam running on JBoss AS, with apache2 in front of it (running on Linux).

挽梦忆笙歌 2024-07-17 06:50:59

Apache has some bandwidth-by-IP limiting modules AFAIK, and for my own largeish Java/JSP application with a lot of digital content I rolled my own servlet filter to do the same (and limit simultaneous connections from one IP block, etc).
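
The filter itself isn't shown in the thread; a stripped-down sketch along those lines might count requests per IP in a fixed window and, past a limit, simply slow the request down rather than refuse it, which also fits the "just seems slow and flaky" approach described next. The window size, the request limit and the sleep are arbitrary illustrative values.

```java
// Sketch of a throttling servlet filter: count requests per client IP in a fixed
// window and, past a limit, respond by slowing the request down instead of
// refusing it, so the crawler just sees a slow site. All limits are illustrative.
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class ThrottleFilter implements Filter {

    private static final long WINDOW_MILLIS = 60_000;   // 1-minute window
    private static final int MAX_REQUESTS = 120;        // per IP per window

    private static final class Counter {
        long windowStart;
        int count;
    }

    private final Map<String, Counter> counters = new ConcurrentHashMap<>();

    @Override
    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        String ip = req.getRemoteAddr();
        Counter c = counters.computeIfAbsent(ip, k -> new Counter());
        boolean overLimit;
        synchronized (c) {
            long now = System.currentTimeMillis();
            if (now - c.windowStart > WINDOW_MILLIS) {
                c.windowStart = now;
                c.count = 0;
            }
            overLimit = ++c.count > MAX_REQUESTS;
        }
        if (overLimit) {
            try {
                Thread.sleep(2_000);   // tarpit: just be slow, no error page
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        chain.doFilter(req, resp);
    }

    @Override public void init(FilterConfig config) { }
    @Override public void destroy() { }
}
```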

I agree with the comments above that it's better to be subtle, so that a malicious user cannot tell if/when they've tripped your alarms and thus don't know to take evasive action. In my case my server just seems to become slow and flaky and unreliable (so no change there then)...

Rgds

Damon
