How can I achieve concurrency in Scrapy when there is only one start URL?
I have a situation where I need to send 20 requests per second to the web server with Scrapy. I have a product listing page URL which I pass to the spider at start-up via self.start_urls, and the listing page yields many product URLs that I need to scrape, but the spider is sending requests to those product URLs sequentially. How can I make it concurrent?
Spider code logic:
import scrapy
from scrapy import Request


class WebSpider(scrapy.Spider):
    # name, default_headers, cookies, asin and the url mapping are assumed
    # to be defined elsewhere in the spider/project.

    def __init__(self, country=None, category=None, *args, **kwargs):
        super().__init__(*args, **kwargs)  # initializes self.start_urls
        self.start_urls.append(url[country][category])

    def start_requests(self):
        for url in self.start_urls:
            request = Request(
                url=url,
                headers=self.default_headers,
                callback=self.parse,
                cookies=self.cookies,
                cb_kwargs=dict(link=url, asin=self.asin),
            )
            yield request

    def parse(self, response, **kwargs):
        # extract the product links from the listing page
        products = response.xpath('//link')
        for product in products:
            product_url = response.urljoin(product.attrib['href'])
            yield Request(
                url=product_url,
                headers=self.default_headers,
                callback=self.product_parse,
                cookies=self.cookies,
                cb_kwargs=dict(link=product_url, asin=self.asin),
            )

    def product_parse(self, response, **kwargs):
        # get all product details into an item and yield it
        yield item
Comments (1)
As shown in the code below, you can add CONCURRENT_REQUESTS to the custom settings of this particular spider to enable concurrent requests. Spiders can define their own settings, which take precedence over and override the project-wide ones; to read more, see the Scrapy documentation on per-spider settings.

There are additional settings that you can use instead of, or together with, CONCURRENT_REQUESTS:

CONCURRENT_REQUESTS_PER_IP – sets the number of concurrent requests performed for each IP address.
CONCURRENT_REQUESTS_PER_DOMAIN – sets the maximum number of concurrent requests allowed for each domain.
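A minimal sketch of what that per-spider override could look like (the spider name and the concrete values are assumptions chosen to match the roughly 20 requests per second mentioned in the question; DOWNLOAD_DELAY is added here only because a non-zero delay would throttle the crawl again):

import scrapy


class WebSpider(scrapy.Spider):
    name = "web_spider"  # placeholder name

    # Per-spider settings take precedence over the project-wide settings.py.
    custom_settings = {
        "CONCURRENT_REQUESTS": 20,             # up to 20 requests in flight at once (assumed value)
        "CONCURRENT_REQUESTS_PER_DOMAIN": 20,  # don't cap per-domain concurrency below the global limit
        "DOWNLOAD_DELAY": 0,                   # any delay here would serialize requests again
    }

    # ... __init__ / start_requests / parse / product_parse as in the question ...

With a setup like this, the requests yielded from parse() are scheduled concurrently by Scrapy's downloader up to the configured limits; the spider logic itself does not need to change.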