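Setting up distributed crawling with pyspider
I wrote a crawler and want to use pyspider to spread its crawl across two machines, 1 and 2. However, one crawl round takes almost exactly as long as it does on a single machine: about 5 minutes 16 seconds in both cases. I don't know what to change so that the distributed setup is faster than the single-machine one.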
Machine setup:
Machine 1: 1 scheduler, 1 fetcher, 1 processor, 1 result_worker, 1 webui
Machine 2: 1 fetcher, 1 processor, 1 result_worker
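Crawler characteristics:
rate/burst is 20.0/3.0 (I also tried 100.0/3.0, with almost no difference); on_start is decorated with @every(seconds=60*60) and sends roughly 75 requests, each of which calls callback 1 (cat_start in the code);
callback 1 is decorated with @config(age=1) and sends a request that calls callback 2 (prod_start in the code);
callback 2 is decorated with @config(age=1) and writes to a file once;
Crawler code: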
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2015-08-14 11:04:00
# Project: pyspider
import json
import lxml.etree as etree
import time
import datetime
import codecs
import random
from pyspider.libs.base_handler import *
from projects.produtils import *
from time import gmtime, strftime


class Handler(DistTestHandler):
    crawl_config = {
    }

    def __init__(self):
        self.round = 0

    @every(seconds=60*60)
    def on_start(self):
        # One category request per (city, category) pair: 15 * 5 = 75 requests.
        self.round = self.round+1
        cities = ['110106','120116','130108','131003','130204','130304','130602','130703','130402','130804','130503','440305','440111','441900','440608']
        cats = ['fruit', 'food', 'drink', 'snack', 'milk']
        for city in cities:
            for cat in cats:
                cat_url = 'http://www.company.com/product/category/jsd-hb-{cat}?platform=android&access_token=&android_channel_value=wandoujia&version=3.1.0&prodcrgen={prodcrgen}'.format(cat=cat, prodcrgen=random.randrange(1,10000))
                # Note: region is built here but not passed to self.crawl below.
                region = '{"address_code":"'+city+'"}'
                self.crawl(cat_url, save={'round': self.round, 'cat': cat, 'region': city}, callback=self.cat_start)

    @config(age=1)
    def cat_start(self, response):
        # Callback 1: one product request for every product without a 'code' key.
        jsonresp = response.json
        products = jsonresp['products']
        for product in products:
            if 'code' in product:
                print('has code, skipped')
            else:
                prod_url = 'http://www.company.com/product/{sku}?platform=android&access_token=&android_channel_value=wandoujia&version=3.1.0&prodcrgen={prodcrgen}'.format(sku=product['sku'], prodcrgen=random.randrange(1,10000))
                # Note: region is built here but not passed to self.crawl below.
                region = '{"address_code":"'+response.save['region']+'"}'
                self.crawl(prod_url, save={'round': response.save['round'], 'cat': response.save['cat'], 'region': response.save['region']}, callback=self.prod_start)

    @config(age=1)
    def prod_start(self, response):
        # Callback 2: append one ' | '-separated record to the history file.
        jsonresp = response.json
        s = (u'{sku}{_separator}{name}{_separator}{cat}{_separator}{region}{_separator}{stock}{_separator}{price}{_separator}{round}{_separator}{create_time}'
             .format(sku=jsonresp['sku'], name=jsonresp['name'], cat=response.save['cat'], region=response.save['region'], stock=jsonresp['stock'],
                     price=jsonresp['price'], round=str(response.save['round']), create_time=strftime("%Y-%m-%d %H:%M:%S", gmtime()), _separator=' | '))
        with codecs.open('/home/ubuntu/pyspider/disttest1/history', 'a', 'utf-8') as f:
            f.write(s+'\n')
Comments (1)
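Raise burst so that it is no smaller than rate. When burst is too small, the token bucket is too shallow, so within one scheduler loop only 3 requests' worth of quota can be handed out, no matter how high rate is set.
Also watch the state of each queue on the dashboard, and add more instances of the module that sits downstream of whichever queue is backing up.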
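The point about burst can be illustrated with a small simulation. This is a simplified token bucket written for illustration only, not pyspider's actual scheduler code; the names TokenBucket and dispatched, and the 1-second loop interval, are assumptions made for the example. It shows that when the bucket is polled once per loop, the number of requests released per loop is capped by burst, so raising rate from 20 to 100 changes nothing while burst stays at 3.

```python
class TokenBucket(object):
    """Simplified token bucket (illustrative, not pyspider's implementation).

    Refills at `rate` tokens per second and never holds more than `burst`.
    """

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # the bucket starts full

    def refill(self, elapsed):
        # Tokens beyond `burst` overflow and are lost.
        self.tokens = min(self.burst, self.tokens + self.rate * elapsed)

    def take_all(self):
        # Release as many whole requests as the bucket currently allows.
        n = int(self.tokens)
        self.tokens -= n
        return n


def dispatched(rate, burst, loop_interval, duration):
    """Count requests released when the bucket is polled once per loop.

    `loop_interval` is a hypothetical scheduler loop period in seconds.
    """
    bucket = TokenBucket(rate, burst)
    total = 0
    for _ in range(int(round(duration / loop_interval))):
        total += bucket.take_all()
        bucket.refill(loop_interval)
    return total


if __name__ == '__main__':
    # Assumed loop period of 1 second, simulated for 10 seconds.
    for rate, burst in [(20, 3), (100, 3), (20, 20)]:
        print(rate, burst, dispatched(rate, burst, 1.0, 10))
```

Under these assumptions, both 20/3 and 100/3 release 30 requests in 10 simulated seconds, while 20/20 releases 200. The real scheduler loop runs on a different period, so the absolute numbers will differ, but the cap imposed by a small burst is the same effect the comment describes.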