Tracking and testing for abusive clients in PHP

Posted 2024-08-09 17:10:12

Now there is a subject that could be taken many ways. Hopefully I will be able to de-obfuscate it as I describe my problem and start getting suggestions.

I am developing a site that will be replacing an existing one. Historically, one of the problems we have had is spider bots coming in and sucking down all our content. Now, we don't mind that the content is being downloaded; in fact, we are glad for it. However, some of the bulk downloaders and download accelerators have proved problematic with the current site.

What I am looking for is something that sits at the beginning of my PHP and runs pretty much first. It takes a fingerprint of the page request (IP, referrer, request URI, cookies, session ID, whatever) and passes it to ...something. That something then compares the fingerprint to the fingerprints seen in the last second or three, and returns a message, based on some pre-configured thresholds, saying what to do with the request.

Some thresholds are:

  • The user has requested > x pages in the last 0.n seconds.
  • The user has requested the same page in < 0.n seconds.
  • The user has submitted the identical data to a form in the last n seconds.
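
Detecting windows that tight is feasible in plain PHP as long as the fingerprint store lives in memory. Below is a minimal sketch using APCu as that store; the function names, key layout, and limits are my own illustrative choices, not anything from the question, and it needs PHP 7.4+ for the arrow function.

    <?php
    // Sketch: per-client sliding-window check backed by APCu.
    // Entries expire on their own, so the heavy churn never
    // touches a file or a database.

    function clientFingerprint(): string
    {
        // Combine whatever identifies a client well enough for you.
        return sha1(implode('|', [
            $_SERVER['REMOTE_ADDR'] ?? '',
            $_SERVER['HTTP_USER_AGENT'] ?? '',
            session_id(), // empty string if no session was started
        ]));
    }

    function tooManyRequests(string $fp, int $maxHits, float $window): bool
    {
        $key = 'hits:' . $fp;
        $now = microtime(true);

        // Fetch recent timestamps and drop the ones outside the window.
        $hits = apcu_fetch($key) ?: [];
        $hits = array_filter($hits, fn ($t) => $now - $t < $window);
        $hits[] = $now;

        // A TTL just above the window means stale entries expire on
        // their own instead of needing cleanup.
        apcu_store($key, $hits, (int) ceil($window) + 1);

        return count($hits) > $maxHits;
    }

    // Threshold 1: more than 10 pages in the last 0.5 seconds.
    if (tooManyRequests(clientFingerprint(), 10, 0.5)) {
        // tell the client to back off (see the response question below)
    }

The fetch-filter-store above is not atomic, so two simultaneous requests can undercount each other; for a throttle that only needs to be roughly right, that is usually acceptable.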

So you see I am looking at some pretty tight windows. Is detecting such things even feasible? Would I be able to do it with some sort of file or db data source? Whatever I use to store the fingerprints between page loads is going to experience a lot of churn since most data will be held for a second or two. Should I just have something that parses the apache logs to check against the threshold? Should I be looking for some sort of external daemon that holds the data for a second or two in memory that I can call from the script? Is there something in apache that can handle this, and do I just need to punt to the server guy to handle this?
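
For what it's worth on the storage question: a file or database row with a one-to-two-second lifetime spends nearly all of its time being written and expired, which is exactly the churn problem described, while an in-memory daemon handles expiry for free. As a sketch of the "external daemon" route, here is a fixed one-second bucket counter against memcached; it assumes a local instance on port 11211, and the key scheme and limit are placeholders.

    <?php
    // Sketch: memcached as the external in-memory daemon. It expires
    // the counters itself, so there is no file or database churn.

    $m = new Memcached();
    $m->addServer('127.0.0.1', 11211);

    $fp  = $_SERVER['REMOTE_ADDR'] ?? 'unknown'; // or the fuller fingerprint
    $key = 'req:' . $fp . ':' . time();          // one bucket per second

    // add() only succeeds for a fresh key; otherwise bump the counter.
    if (!$m->add($key, 1, 2 /* seconds to live */)) {
        $count = $m->increment($key);
        if ($count !== false && $count > 10) {
            // more than 10 requests this second: throttle this client
        }
    }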

Assuming that this is something I can do in PHP or some called external daemon how do I respond to behavior out of the thresholds? My gut says HTTP responses, something like 408 or 503, but my gut is often wrong. What can I do to tell the client to back off a bit? Some sort of "Woah there" page?
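
On the response codes: 408 means the client was too slow (a request timeout), which is the opposite of this problem, while 503 with a Retry-After header is a conventional way to say "back off", and RFC 6585 later standardized 429 Too Many Requests for exactly this case. A minimal "whoa there" responder might look like the following; the wording and timing are placeholders.

    <?php
    // Sketch: refuse an over-threshold request with 503 + Retry-After,
    // which well-behaved bots and download tools generally honor.

    function backOff(int $seconds): void
    {
        http_response_code(503);           // or 429 where supported
        header('Retry-After: ' . $seconds);
        header('Content-Type: text/plain');
        echo "Whoa there - you're requesting pages too quickly. "
           . "Please wait $seconds seconds and try again.";
        exit;
    }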


Comments (3)

╰沐子 2024-08-16 17:10:12

If you don't have to have a software solution, why not program your router/firewall to handle this for you? Filtering out DOS attacks (or their equivalent) is part of what it's there for.
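
If the firewall route is open, Linux iptables can rate-limit per source IP before PHP ever runs. An illustrative rule pair using the recent match module follows; the port, window, and hit count are placeholders to tune.

    # Track new connections to port 80, then drop any source that has
    # opened more than 20 of them within 2 seconds.
    iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
             -m recent --set --name HTTP
    iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
             -m recent --update --seconds 2 --hitcount 20 --name HTTP -j DROP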

荒岛晴空 2024-08-16 17:10:12

Try mod_evasive
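
mod_evasive watches per-client request rates inside Apache and answers offenders with 403s for a configurable period, which lines up with the thresholds described in the question. An illustrative configuration (the values are starting points to tune, not recommendations):

    <IfModule mod_evasive20.c>
        # Flag a client that requests the same page more than 5 times
        # within a 1-second interval, or more than 50 pages site-wide
        # within 1 second, and answer it with 403s for 10 seconds.
        DOSHashTableSize    3097
        DOSPageCount        5
        DOSPageInterval     1
        DOSSiteCount        50
        DOSSiteInterval     1
        DOSBlockingPeriod   10
    </IfModule>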
