15.3 Scrapy and BloomFilter
Scrapy ships with its own deduplication scheme, implemented by the RFPDupeFilter class. Looking at the RFPDupeFilter source, it still relies on a set() to perform the deduplication. Part of the source is shown below:
class RFPDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None, debug=False):
        self.file = None
        self.fingerprints = set()
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)
Reading further into the source, we can see that Scrapy builds the filter on top of the request_fingerprint method, which computes a fingerprint for each Request and adds it to the set(). Part of the source is shown below:
def request_fingerprint(request, include_headers=None):
    if include_headers:
        include_headers = tuple(to_bytes(h.lower()) for h in sorted(include_headers))
    cache = _fingerprint_cache.setdefault(request, {})
    if include_headers not in cache:
        fp = hashlib.sha1()
        fp.update(to_bytes(request.method))
        fp.update(to_bytes(canonicalize_url(request.url)))
        fp.update(request.body or b'')
        if include_headers:
            for hdr in include_headers:
                if hdr in request.headers:
                    fp.update(hdr)
                    for v in request.headers.getlist(hdr):
                        fp.update(v)
        cache[include_headers] = fp.hexdigest()
    return cache[include_headers]
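To make the effect of this fingerprint concrete, here is a minimal sketch (the URL and bodies are made up for illustration, and it assumes a Scrapy version that still exposes scrapy.utils.request.request_fingerprint): two requests for the same URL but with different bodies get different fingerprints, so the default filter treats them as two distinct requests.

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# Same URL, different POST bodies: the default fingerprints differ,
# so RFPDupeFilter does not consider these requests duplicates.
r1 = Request("http://example.com/item?id=1", method="POST", body="a=1")
r2 = Request("http://example.com/item?id=1", method="POST", body="a=2")

print(request_fingerprint(r1) == request_fingerprint(r2))  # False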
As the source above shows, the fingerprint is a sha1 over method + canonical URL + body (plus headers, if include_headers is given), and deduplication is performed on this combination as a whole, so relatively few requests are actually filtered out. Below we deduplicate on the URL alone by writing a custom filter. The code is as follows:
from scrapy.dupefilter import RFPDupeFilter   # in newer Scrapy versions: scrapy.dupefilters


class URLFilter(RFPDupeFilter):
    """Filter by URL"""

    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        if request.url in self.urls_seen:
            return True
        else:
            self.urls_seen.add(request.url)
This is still not ideal, though: URLs can be very long, which drives memory usage up. We can hash each URL with sha1 first and deduplicate on the digest instead. The improved version is as follows:
import hashlib

from scrapy.dupefilter import RFPDupeFilter   # in newer Scrapy versions: scrapy.dupefilters
from w3lib.url import canonicalize_url


class URLSha1Filter(RFPDupeFilter):
    """Filter by the sha1 of the URL"""

    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        fp = hashlib.sha1()
        # canonicalize_url returns a str; sha1 needs bytes on Python 3
        fp.update(canonicalize_url(request.url).encode('utf-8'))
        url_sha1 = fp.hexdigest()
        if url_sha1 in self.urls_seen:
            return True
        else:
            self.urls_seen.add(url_sha1)
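A quick way to see the space saving (a sketch only; the URL below is made up for illustration): no matter how long the canonical URL is, its sha1 hex digest is always 40 characters.

import hashlib

from w3lib.url import canonicalize_url

# A deliberately long, hypothetical URL with many query parameters.
url = "http://example.com/search?" + "&".join("k%d=v%d" % (i, i) for i in range(100))

digest = hashlib.sha1(canonicalize_url(url).encode('utf-8')).hexdigest()
print(len(url), len(digest))   # several hundred characters vs. a fixed 40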
That is somewhat better, but still not enough. The next optimization is to bring in a Bloom filter for the deduplication, via the ScalableBloomFilter class (assumed here to come from the pybloom / pybloom_live package). The improved version is as follows:
import hashlib

from pybloom_live import ScalableBloomFilter   # assumed package; the older pybloom exposes the same class
from scrapy.dupefilter import RFPDupeFilter    # in newer Scrapy versions: scrapy.dupefilters
from w3lib.url import canonicalize_url


class URLBloomFilter(RFPDupeFilter):
    """Filter by the sha1 of the URL, stored in a scalable Bloom filter"""

    def __init__(self, path=None):
        self.urls_sbf = ScalableBloomFilter(mode=ScalableBloomFilter.SMALL_SET_GROWTH)
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        fp = hashlib.sha1()
        fp.update(canonicalize_url(request.url).encode('utf-8'))
        url_sha1 = fp.hexdigest()
        if url_sha1 in self.urls_sbf:
            return True
        else:
            self.urls_sbf.add(url_sha1)
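For reference, here is a minimal sketch of the ScalableBloomFilter interface the filter above relies on (assuming the pybloom_live package): membership is tested with the in operator, elements are added with add(), and the filter grows automatically as items accumulate, at the cost of a small false-positive probability.

from pybloom_live import ScalableBloomFilter

sbf = ScalableBloomFilter(mode=ScalableBloomFilter.SMALL_SET_GROWTH)

print("abc" in sbf)   # False: nothing has been added yet
sbf.add("abc")
print("abc" in sbf)   # True (Bloom filters can give false positives, never false negatives)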
With this in place, deduplication capacity improves dramatically, but it is still not robust enough: everything lives in memory, so if the server goes down, all of the deduplication data is lost. A more stable choice is to combine Scrapy, the Bloom filter, and Redis; the next chapter covers Redis + BloomFilter deduplication. One thing to keep in mind: after the code is written, the deduplication component will not take effect until you set DUPEFILTER_CLASS in the project settings to the import path of your filter class, for example:
DUPEFILTER_CLASS = "test.test.bloomRedisFilter.URLBloomFilter"