How would you protect a database of links from being scraped?
I have a large database of links, which are all sorted in specific ways and are attached to other information, which is valuable (to some people).
Currently my setup (which seems to work) simply calls a PHP file like link.php?id=123, which logs the request with a timestamp into the DB. Before it spits out the link, it checks how many requests were made from that IP in the last 5 minutes. If it's greater than x, it redirects you to a captcha page.
That all works fine and dandy, but the site has been getting really popular (and has also been getting DDoSed for about 6 weeks), so PHP has been getting floored, and I'm trying to minimize the times I have to hit up PHP to do something. I wanted to show links in plain text instead of through link.php?id= and have an onclick function that simply adds 1 to the view count. I'm still hitting up PHP, but at least if it lags, it does so in the background, and the user can see the link they requested right away.
Problem is, that makes the site REALLY scrapable. Is there anything I can do to prevent this, but still not rely on php to do the check before spitting out the link?
Comments (5)
It seems that the bottleneck is at the database. Each request performs an insert (logs the request), then a select (determine the number of requests from the IP in the last 5 minutes), and then whatever database operations are necessary to perform the core function of the application.
Consider maintaining the request throttling data (IP, request time) in server memory rather than burdening the database. Two solutions are memcache (http://www.php.net/manual/en/book.memcache.php) and memcached (http://php.net/manual/en/book.memcached.php).
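To make the in-memory idea concrete, here is a minimal sketch of the same "requests per IP in the last 5 minutes" check with no database involved. It is written in JavaScript purely for illustration; on the actual site the same logic would live in PHP with memcache or memcached holding the data. The names `isThrottled` and `requestsByIp` and the value of `MAX_REQUESTS` are assumptions, since the question only says "x".

```javascript
// Sliding-window throttle kept entirely in process memory.
// WINDOW_MS matches the "last 5 minutes" from the question;
// MAX_REQUESTS is a hypothetical stand-in for the unspecified "x".
const WINDOW_MS = 5 * 60 * 1000;
const MAX_REQUESTS = 30;

// ip -> array of request timestamps in milliseconds, oldest first
const requestsByIp = new Map();

// Records one request for `ip` and reports whether that IP is now over
// the limit (i.e. whether it should be sent to the captcha page).
function isThrottled(ip, nowMs = Date.now()) {
  const cutoff = nowMs - WINDOW_MS;
  const recent = (requestsByIp.get(ip) || []).filter((t) => t > cutoff);
  recent.push(nowMs);
  // Keep at most MAX_REQUESTS + 1 timestamps so a flood from one IP
  // cannot grow the array without bound.
  requestsByIp.set(ip, recent.slice(-(MAX_REQUESTS + 1)));
  return recent.length > MAX_REQUESTS;
}
```

A plain `Map` only works within a single long-running process. With several PHP workers, a shared store such as memcached would be needed so that all workers see the same counts.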
As others have noted, ensure that indexes exist for whatever keys are queried (fields such as the link id). If indexes are in place and the database still suffers from the load, try an HTTP accelerator such as Varnish (http://varnish-cache.org/).
You could do the IP throttling at the web server level. Maybe a module exists for your web server; or, as an example, with Apache you can write your own RewriteMap and have it consult a daemon program so you can do more complex things. Have the daemon program query an in-memory database. It will be fast.
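As a rough illustration of the daemon idea, the sketch below follows the line protocol Apache uses for external `prg:` rewrite maps: the server writes one lookup key per line to the program's stdin and reads one line back from stdout, with the string `NULL` meaning "no match". Everything else here is an assumption made for the example: the `--serve` flag, the `captcha` return value, the limit of 30, and the use of a simple fixed-window counter in a `Map` instead of a real in-memory database.

```javascript
const readline = require("readline");

const WINDOW_MS = 5 * 60 * 1000; // 5-minute window, as in the question
const MAX_REQUESTS = 30;         // hypothetical limit (the "x" in the question)

// ip -> { windowStart, count }: one fixed-window counter per IP
const counters = new Map();

// Returns "captcha" when the IP is over the limit, otherwise "NULL",
// which a prg: rewrite map treats as "no mapping found".
function lookup(ip, nowMs = Date.now()) {
  let entry = counters.get(ip);
  if (!entry || nowMs - entry.windowStart >= WINDOW_MS) {
    entry = { windowStart: nowMs, count: 0 };
    counters.set(ip, entry);
  }
  entry.count += 1;
  return entry.count > MAX_REQUESTS ? "captcha" : "NULL";
}

// Answers one lookup per input line, one reply line per lookup.
function serve() {
  const rl = readline.createInterface({ input: process.stdin });
  rl.on("line", (line) => {
    process.stdout.write(lookup(line.trim()) + "\n");
  });
}

// Only start reading stdin when explicitly asked to (hypothetical flag).
if (process.argv.includes("--serve")) {
  serve();
}
```

On the Apache side, a rewrite condition would pass the client address to the map as the key and redirect to the captcha page when the answer is `captcha`. The exact directives depend on the server setup and are not shown here.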
Check your database. Are you indexing everything properly? A table with this many entries will get big very fast and slow things down. You might also want to run a nightly process that deletes entries older than 1 hour etc.
If none of this works, you are looking at upgrading/load balancing your server. Linking directly to the pages will only buy you so much time before you have to upgrade anyway.
Nothing you do on the client side can be protected, so why not just use AJAX?
Have an onClick event that calls an AJAX function, which returns just the link and fills it into a DIV on your page. Because the request and the answer are small, it will be fast enough for what you need. Just make sure the function you call checks the timestamp: it is easy to write a script that calls that function many times to steal your links.
You can check out jQuery or other AJAX libraries (I use jQuery and sAjax). I have lots of pages that dynamically change content very fast, and the client doesn't even know it's not pure JS.
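A minimal sketch of the onClick-plus-AJAX approach, using the browser's built-in `fetch` rather than jQuery or sAjax. It assumes that `link.php?id=...` (the endpoint from the question) answers with the link as plain text and with a non-2xx status when the caller is throttled; that response format, the `reveal` class, the `data-id` attribute, and the `link-<id>` element ids are all hypothetical.

```javascript
// Builds the request URL for one link id.
function linkRequestUrl(id) {
  return "link.php?id=" + encodeURIComponent(id);
}

// Fetches the link for `id` and writes it into `target` (e.g. a DIV).
// `fetchImpl` is injectable so the function can be exercised outside a browser.
async function revealLink(id, target, fetchImpl = fetch) {
  const response = await fetchImpl(linkRequestUrl(id));
  if (!response.ok) {
    // Assumed behaviour: the server refuses throttled clients with an error status.
    target.textContent = "Link unavailable, please try again later.";
    return false;
  }
  target.textContent = (await response.text()).trim();
  return true;
}

// Browser wiring for assumed markup such as:
//   <a href="#" class="reveal" data-id="123">show link</a> <div id="link-123"></div>
if (typeof document !== "undefined") {
  document.addEventListener("click", (event) => {
    const trigger = event.target.closest(".reveal");
    if (!trigger) return;
    event.preventDefault();
    const id = trigger.dataset.id;
    revealLink(id, document.getElementById("link-" + id));
  });
}
```

The throttling itself still has to happen on the server inside the endpoint, as the answer says: a script can call the same URL directly just as easily as the page can.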
Most scrapers just analyze static HTML so encode your links and then decode them dynamically in the client's web browser with JavaScript.
Determined scrapers can still get around this, but they can get around any technique if the data is valuable enough.
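One possible shape for this, as a sketch only: the page carries each link in an encoded `data-link` attribute, and a small script decodes it in the browser. The encoding chosen here (reverse the string, then Base64) is an arbitrary illustration, not something the answer specifies, and the names `encodeLink`, `decodeLink`, and `data-link` are hypothetical. On the real site the encoding step would run in PHP when the page is rendered. The sketch also assumes ASCII-only URLs, because `btoa` rejects characters outside Latin-1.

```javascript
// Encoding side: reverse the URL, then Base64-encode it.
// Shown in JavaScript for symmetry; the server would do this at render time.
function encodeLink(url) {
  return btoa(url.split("").reverse().join(""));
}

// Decoding side: undo the Base64, then undo the reversal.
function decodeLink(encoded) {
  return atob(encoded).split("").reverse().join("");
}

// Browser wiring for assumed markup such as:
//   <a data-link="...encoded value...">Visit</a>
// Intended to run after the anchors exist, e.g. from a script at the end of <body>.
if (typeof document !== "undefined") {
  document.querySelectorAll("a[data-link]").forEach((anchor) => {
    anchor.href = decodeLink(anchor.dataset.link);
  });
}
```

This only hides the links from scrapers that read the raw HTML. Anything that executes JavaScript, or that simply reimplements `decodeLink`, sees the real URLs.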