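Implementing anti-crawler protection in Django
Basic anti-crawling strategies
Web servers such as nginx generally ship with modules that identify crawlers by IP address or User-Agent and throttle their request rate. Rate-limiting a specific logged-in user is harder to do at that layer, so it has to be handled by the application behind the web server. In Django, middleware combined with a cache can provide a reasonable degree of rate limiting.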
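As a rough illustration of the middleware-plus-cache idea, the sketch below counts requests per user in fixed time windows. It is deliberately framework-free so that it runs on its own: the `FixedWindowLimiter` class, its `allow` method, and the in-memory dict are hypothetical stand-ins, and in a real Django middleware the counters would presumably live in the configured cache backend, keyed by the logged-in user's id.

```python
import time


class FixedWindowLimiter:
    """Count hits per key in fixed time windows (illustrative only)."""

    def __init__(self, limit, period, clock=time.time):
        self.limit = limit      # maximum hits allowed per window
        self.period = period    # window length in seconds
        self.clock = clock      # injectable clock, handy for testing
        self.counters = {}      # stand-in for a shared cache such as Redis

    def allow(self, key):
        """Record one hit for `key`; return False once the limit is exceeded."""
        window = int(self.clock() // self.period)
        bucket = (key, window)
        self.counters[bucket] = self.counters.get(bucket, 0) + 1
        return self.counters[bucket] <= self.limit


# A hypothetical user id is allowed 3 requests per 60-second window.
fake_now = [0.0]
limiter = FixedWindowLimiter(limit=3, period=60, clock=lambda: fake_now[0])
results = [limiter.allow("user:42") for _ in range(4)]

fake_now[0] = 61.0  # move into the next window
after_reset = limiter.allow("user:42")
```

The dict here never evicts old windows. With a real cache, giving each counter key a timeout equal to the window length would let old entries expire on their own.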
django-ratelimit
The django-ratelimit library rate-limits requests at the application level, including per logged-in user. Typical usage looks like this:

from django.http import HttpResponse
from ratelimit.decorators import ratelimit


# Limit each IP to 5 requests per minute, without blocking the request
@ratelimit(key='ip', rate='5/m')
def myview(request):
    # Will be true if the same IP makes more than 5 requests/minute.
    was_limited = getattr(request, 'limited', False)
    return HttpResponse()


# Limit each IP to 5 requests per minute and block requests over the limit
@ratelimit(key='ip', rate='5/m', block=True)
def myview(request):
    # If the same IP makes >5 reqs/min, will raise Ratelimited
    return HttpResponse()


# Limit each POSTed username to 5 requests per minute;
# only GET and POST requests are counted
@ratelimit(key='post:username', rate='5/m', method=['GET', 'POST'])
def login(request):
    # If the same username is used >5 times/min, this will be True.
    # The `username` value will come from GET or POST, determined by the
    # request method.
    was_limited = getattr(request, 'limited', False)
    return HttpResponse()


# Several limits can be combined
@ratelimit(key='post:username', rate='5/m')
@ratelimit(key='post:password', rate='5/m')
def login(request):
    # Use multiple keys by stacking decorators.
    return HttpResponse()


@ratelimit(key='get:q', rate='5/m')
@ratelimit(key='post:q', rate='5/m')
def search(request):
    # These two decorators combine to form one rate limit: the same search
    # query can only be tried 5 times a minute, regardless of the request
    # method (GET or POST)
    return HttpResponse()


@ratelimit(key='ip', rate='4/h')
def slow(request):
    # Allow 4 reqs/hour.
    return HttpResponse()


# rate can be a callable; it receives the group and the request.
# is_authenticated is a property since Django 1.10 (call it on older versions).
rate = lambda g, r: None if r.user.is_authenticated else '100/h'

@ratelimit(key='ip', rate=rate)
def skipif1(request):
    # Only rate limit anonymous requests
    return HttpResponse()


# key can be the Django user; the IP is used when the user is not logged in
@ratelimit(key='user_or_ip', rate='10/s')
@ratelimit(key='user_or_ip', rate='100/m')
def burst_limit(request):
    # Implement a separate burst limit.
    return HttpResponse()


@ratelimit(group='expensive', key='user_or_ip', rate='10/h')
def expensive_view_a(request):
    return something_expensive()


@ratelimit(group='expensive', key='user_or_ip', rate='10/h')
def expensive_view_b(request):
    # Shares a counter with expensive_view_a
    return something_else_expensive()


# key can be the value of a request header
# (rate is stated explicitly here rather than relying on a default)
@ratelimit(key='header:x-cluster-client-ip', rate='5/m')
def post(request):
    # Uses the X-Cluster-Client-IP header value.
    return HttpResponse()


# key can be a callable; it receives the group and the request
@ratelimit(key=lambda g, r: r.META.get('HTTP_X_CLUSTER_CLIENT_IP',
                                       r.META['REMOTE_ADDR']),
           rate='5/m')
def myview(request):
    # Use `X-Cluster-Client-IP` but fall back to REMOTE_ADDR.
    return HttpResponse()
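To make the rate strings used above more concrete, here is a minimal sketch of how a value such as '5/m' could be turned into a count and a period in seconds. It is not django-ratelimit's own parser: `parse_rate`, `UNIT_SECONDS`, and `RATE_PATTERN` are hypothetical names, and the 'd' (day) unit is an assumption added alongside the 's', 'm', and 'h' forms that appear in the examples.

```python
import re

# Seconds per unit letter. 's', 'm' and 'h' appear in the examples above;
# 'd' for days is an assumption made for this sketch.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}
RATE_PATTERN = re.compile(r"^(\d+)/([smhd])$")


def parse_rate(rate):
    """Turn a string such as '5/m' into (count, period_in_seconds)."""
    match = RATE_PATTERN.match(rate)
    if match is None:
        raise ValueError("unrecognised rate: %r" % (rate,))
    count, unit = match.groups()
    return int(count), UNIT_SECONDS[unit]


per_minute = parse_rate("5/m")
per_hour = parse_rate("4/h")
```

Read this way, '5/m' means at most 5 requests in any 60-second window, and '4/h' means at most 4 requests in a 3600-second window.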
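Crawler alert emails
To notify administrators while a crawler is hitting the site, email can be sent asynchronously through celery. django-ratelimit lets you supply a custom handler that is called when Ratelimited is raised. Configure it in settings.py as follows: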
RATELIMIT_VIEW = 'ratelimit.views.ratelimited'
MIDDLEWARE_CLASSES = (
    'ratelimit.middleware.RatelimitMiddleware',
)
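Then define the view that RATELIMIT_VIEW points to:

from importlib import import_module

from django.conf import settings
from django.http import HttpResponseForbidden


def ratelimited(request, exception):
    # `usage` is expected to be attached to the exception, as in the
    # original code; fall back to None if it is missing.
    usage = getattr(exception, 'usage', None)
    task_path = getattr(settings, 'RATELIMIT_TASK', None)
    if usage and task_path:
        # Resolve the dotted path to the celery task and queue it.
        mod, attr = task_path.rsplit('.', 1)
        task = getattr(import_module(mod), attr)
        task.delay(usage)
    return HttpResponseForbidden()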
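The view loads the task from a dotted path string with `rsplit` and `import_module`. The standalone sketch below isolates that step so it can be tried without Django or celery. `resolve_dotted_path` is a hypothetical helper name, and 'json.dumps' merely stands in for a task path such as the one stored in RATELIMIT_TASK.

```python
from importlib import import_module


def resolve_dotted_path(path):
    """Import 'package.module.attr' and return the attribute."""
    module_name, attr_name = path.rsplit(".", 1)
    return getattr(import_module(module_name), attr_name)


# 'json.dumps' stands in for a dotted task path from settings.
dumps = resolve_dotted_path("json.dumps")
encoded = dumps({"count": 7, "limit": 5})
```

Keeping the task as a string in settings means the view never imports the task module directly, so the task can be swapped by changing configuration only.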
RATELIMIT_TASK names the celery task that sends the email asynchronously. Both celery and email need to be configured:

# email settings
EMAIL_HOST = '***'
EMAIL_PORT = '***'
EMAIL_USE_SSL = True
EMAIL_HOST_USER = '***'
EMAIL_HOST_PASSWORD = '***'
EMAIL_RECEIVER = ['**', '**']  # list of users who should receive the email

# Send an email only when the counter reaches one of these values,
# so that emails are not sent too often
RATELIMIT_COUNT_TO_SEND_EMAIL = [2, 50, 100]

REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379

RATELIMIT_TASK = 'ratelimit.task.send_mail'
CELERY_RESULT_BACKEND = "redis://%s:%d/2" % (REDIS_HOST, REDIS_PORT)
REDIS_BROKER = "redis://%s:%d/3" % (REDIS_HOST, REDIS_PORT)
Redis serves as the message queue here. The task referenced by RATELIMIT_TASK is as follows:

# -*- coding: utf-8 -*-
from __future__ import absolute_import

import json
import os

import django
from celery import Celery
from django.conf import settings
from django.core import mail

# Django must be configured before settings are accessed.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'proj.settings')
django.setup()

app = Celery('block_spider',
             backend=settings.CELERY_RESULT_BACKEND,
             broker=settings.REDIS_BROKER)
app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)


@app.task()
def send_mail(usage):
    # Arguments for mail.send_mail: subject, message, sender, recipients
    mail_args = ['block spider', json.dumps(usage),
                 settings.EMAIL_HOST_USER, settings.EMAIL_RECEIVER]
    # Send only when the number of requests over the limit reaches
    # one of the configured thresholds.
    over_limit = usage.get('count') - usage.get('limit')
    if over_limit in settings.RATELIMIT_COUNT_TO_SEND_EMAIL:
        mail.send_mail(*mail_args)
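The threshold check in the task can be tried out in isolation. The sketch below assumes, as the task does, that `usage` is a dict with `count` and `limit` keys; `should_notify` and `THRESHOLDS` are hypothetical names, and the simulated client is made up for illustration.

```python
# Same thresholds as the RATELIMIT_COUNT_TO_SEND_EMAIL setting above.
THRESHOLDS = [2, 50, 100]


def should_notify(usage, thresholds=THRESHOLDS):
    """Return True only when the overshoot equals one of the thresholds."""
    overshoot = usage["count"] - usage["limit"]
    return overshoot in thresholds


# Simulate a client limited to 5 requests that sends 110 in one window,
# and collect the request counts at which an email would go out.
notified_at = [
    count for count in range(1, 111)
    if should_notify({"count": count, "limit": 5})
]
# notified_at == [7, 55, 105]
```

Because the check uses exact membership rather than a comparison, a burst of 110 requests produces three emails instead of one per blocked request.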