Configuring distributed crawling in pyspider

Posted on 2022-09-02 14:34:48

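I wrote a crawler and want to use pyspider to distribute its crawl across two machines, machine 1 and machine 2. However, one crawl round takes almost exactly as long as it does on a single machine: about 5 minutes 16 seconds in both cases. I don't know what to change so that the distributed setup finishes faster than the single-machine one.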

Machine setup:
Machine 1: 1 scheduler, 1 fetcher, 1 processor, 1 result_worker, 1 webui
Machine 2: 1 fetcher, 1 processor, 1 result_worker
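
One way to reason about this layout: pyspider passes tasks through a chain of components (scheduler, fetcher, processor, result_worker), and a chain cannot move more tasks per second than its slowest stage. The sketch below is a toy model, not a measurement. Every capacity in it is a hypothetical number chosen for illustration, `pipeline_throughput` is a made-up helper, and picking the scheduler as the slowest stage is an arbitrary assumption.

```python
def pipeline_throughput(stages):
    """Return (tasks_per_second, bottleneck_name) for a linear pipeline.

    `stages` maps a stage name to (instances, per_instance_capacity).
    A stage's aggregate capacity is instances * capacity, and the
    pipeline runs at the speed of its slowest stage.
    """
    totals = {name: n * cap for name, (n, cap) in stages.items()}
    bottleneck = min(totals, key=totals.get)
    return totals[bottleneck], bottleneck


# Hypothetical per-instance capacities in tasks per second.
single_machine = {
    "scheduler": (1, 5.0),
    "fetcher": (1, 50.0),
    "processor": (1, 40.0),
    "result_worker": (1, 100.0),
}

# Machine 2 adds one fetcher, one processor and one result_worker,
# but there is still only one scheduler.
two_machines = dict(single_machine,
                    fetcher=(2, 50.0),
                    processor=(2, 40.0),
                    result_worker=(2, 100.0))

print(pipeline_throughput(single_machine))
print(pipeline_throughput(two_machines))
```

In this model both layouts report the same throughput, because doubling the stages that were not the limit changes nothing. Whether the real limit is the scheduler, the target server, or something else would have to be measured on the actual deployment.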

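Crawler characteristics:
rate/burst is 20.0/3.0 (I also tried 100.0/3.0, with almost no difference);
on_start is decorated with @every(seconds=60*60) and issues about 75 requests, each of which uses callback method 1 (cat_start);
callback method 1 is decorated with @config(age=1) and issues a request that uses callback method 2 (prod_start);
callback method 2 is decorated with @config(age=1) and writes to a file once.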
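
As a rough sanity check on the rate/burst setting, a token bucket with rate r and burst b can release at most b + r*t requests in t seconds, so the rate limit alone gives a lower bound on the round time. The sketch below is illustrative arithmetic only, not pyspider code. The function names are hypothetical, and so is the figure of 1000 product requests, since the post does not say how many product pages a round fetches.

```python
def min_round_seconds(total_requests, rate, burst):
    """Lower bound on the time a token bucket needs to release total_requests.

    The bucket is assumed to start full (burst tokens) and to refill
    at `rate` tokens per second.
    """
    remaining = max(0.0, total_requests - burst)
    return remaining / rate


def max_requests_in(seconds, rate, burst):
    """Upper bound on the requests a token bucket can release in `seconds`."""
    return burst + rate * seconds


# 75 category requests come from the post; 1000 product requests is made up.
hypothetical_total = 75 + 1000

for rate in (20.0, 100.0):
    bound = min_round_seconds(hypothetical_total, rate, burst=3.0)
    print("rate=%.1f/s -> at least %.1f s per round" % (rate, bound))

# Observed round time from the post: 5 min 16 s = 316 s.
observed = 5 * 60 + 16
print("at 20/s, %d s allows up to %d requests"
      % (observed, max_requests_in(observed, 20.0, 3.0)))
```

If the rate limit were what bounds the round, raising it from 20/s to 100/s should shorten the round noticeably. Since the measured time barely moved, the limit probably sits in another stage.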

Crawler code:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2015-08-14 11:04:00
# Project: pyspider

import json
import lxml.etree as etree

import time
import datetime
import codecs
import random
from pyspider.libs.base_handler import *
from projects.produtils import *
from time import gmtime, strftime

class Handler(DistTestHandler):
    crawl_config = {
    }

    def __init__(self):
        self.round = 0

    @every(seconds=60*60)
    def on_start(self):
        self.round = self.round+1
        cities = ['110106','120116','130108','131003','130204','130304','130602','130703','130402','130804','130503','440305','440111','441900','440608']
        cats = ['fruit', 'food', 'drink', 'snack', 'milk']
        for city in cities:
            for cat in cats:
                cat_url = 'http://www.company.com/product/category/jsd-hb-{cat}?platform=android&access_token=&android_channel_value=wandoujia&version=3.1.0&prodcrgen={prodcrgen}'.format(cat=cat, prodcrgen=random.randrange(1,10000))
                region = '{"address_code":"'+city+'"}'
                self.crawl(cat_url, save={'round': self.round, 'cat': cat, 'region': city}, callback=self.cat_start)

    @config(age=1)
    def cat_start(self, response):
        jsonresp = response.json

        products = jsonresp['products']
        for product in products:
            if 'code' in product:
                print('has code, skipped')
            else:
                prod_url = 'http://www.company.com/product/{sku}?platform=android&access_token=&android_channel_value=wandoujia&version=3.1.0&prodcrgen={prodcrgen}'.format(sku=product['sku'], prodcrgen=random.randrange(1,10000))
                region = '{"address_code":"'+response.save['region']+'"}'
                self.crawl(prod_url, save={'round': response.save['round'], 'cat': response.save['cat'], 'region': response.save['region']}, callback=self.prod_start)

    @config(age=1)
    def prod_start(self, response):
        jsonresp = response.json
        s = u'{sku}{_separator}{name}{_separator}{cat}{_separator}{region}{_separator}{stock}{_separator}{price}{_separator}{round}{_separator}{create_time}'.format(
            sku=jsonresp['sku'], name=jsonresp['name'], cat=response.save['cat'],
            region=response.save['region'], stock=jsonresp['stock'], price=jsonresp['price'],
            round=str(response.save['round']),
            create_time=strftime("%Y-%m-%d %H:%M:%S", gmtime()), _separator=' | ')

        # Append one record per product to a local history file.
        with codecs.open('/home/ubuntu/pyspider/disttest1/history', 'a', 'utf-8') as f:
            f.write(s + '\n')
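
To double-check the "about 75 requests" figure, the seed URLs built in on_start can be enumerated outside pyspider. The sketch below mirrors the two loops in the code above. `build_seed_tasks` is a hypothetical helper, and the URL template is shortened on the assumption that the constant query parameters do not affect the count. It also makes visible that the city travels only in `save`, not in the URL, so two seed tasks for the same category differ only in the random `prodcrgen` value.

```python
import itertools
import random

CITIES = ['110106', '120116', '130108', '131003', '130204', '130304', '130602',
          '130703', '130402', '130804', '130503', '440305', '440111', '441900',
          '440608']
CATS = ['fruit', 'food', 'drink', 'snack', 'milk']

# Shortened stand-in for the URL template used in on_start (assumption:
# the omitted query parameters are constant and irrelevant to the count).
URL_TEMPLATE = 'http://www.company.com/product/category/jsd-hb-{cat}?prodcrgen={prodcrgen}'


def build_seed_tasks(rng):
    """Return a list of (url, save) pairs, one per city/category combination."""
    tasks = []
    for city, cat in itertools.product(CITIES, CATS):
        url = URL_TEMPLATE.format(cat=cat, prodcrgen=rng.randrange(1, 10000))
        tasks.append((url, {'cat': cat, 'region': city}))
    return tasks


tasks = build_seed_tasks(random.Random(0))
distinct_urls = set(url for url, _ in tasks)
print("seed tasks: %d, distinct URLs: %d" % (len(tasks), len(distinct_urls)))
```

The loops produce 15 x 5 = 75 seed tasks. As far as I know, pyspider derives the task id from the URL by default, so if two of these URLs happened to coincide they would probably be treated as a single task.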
