How can I achieve concurrency in Scrapy when there is only one start URL?
I have a situation where I need to send 20 requests per second to the web server with Scrapy. I have a product listing page URL which I pass to the spider at start-up via self.start_urls, and the listing page yields many product URLs that I need to scrape, but the spider is sending requests to those product URLs sequentially. How can I make it concurrent?
Spider code logic:
import scrapy
from scrapy import Request


class WebSpider(scrapy.Spider):
    # name, default_headers, cookies, asin and the url mapping are assumed
    # to be defined elsewhere in the spider/project.

    def __init__(self, country=None, category=None, *args, **kwargs):
        super().__init__(*args, **kwargs)  # initializes self.start_urls
        self.start_urls.append(url[country][category])

    def start_requests(self):
        for url in self.start_urls:
            request = Request(
                url=url,
                headers=self.default_headers,
                callback=self.parse,
                cookies=self.cookies,
                cb_kwargs=dict(link=url, asin=self.asin),
            )
            yield request

    def parse(self, response, **kwargs):
        # extract the product links from the listing page
        products = response.xpath('//link')
        for product in products:
            product_url = response.urljoin(product.attrib['href'])
            yield Request(
                url=product_url,
                headers=self.default_headers,
                callback=self.product_parse,
                cookies=self.cookies,
                cb_kwargs=dict(link=product_url, asin=self.asin),
            )

    def product_parse(self, response, **kwargs):
        # get all product details into an item and yield it
        yield item
Comments (1)
As shown in the code below, you can add CONCURRENT_REQUESTS to the custom settings of this particular spider to enable concurrent requests. Spiders can define their own settings, which take precedence over and override the project-wide ones; to read more, see the Scrapy documentation on per-spider settings.

There are additional settings that you can use instead of, or together with, CONCURRENT_REQUESTS:

CONCURRENT_REQUESTS_PER_IP – sets the number of concurrent requests performed for each IP address.
CONCURRENT_REQUESTS_PER_DOMAIN – sets the maximum number of concurrent requests allowed for each domain.
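A minimal sketch of what that per-spider override could look like (the spider name and the concrete values are assumptions chosen to match the roughly 20 requests per second mentioned in the question; DOWNLOAD_DELAY is added here only because a non-zero delay would throttle the crawl again):

import scrapy


class WebSpider(scrapy.Spider):
    name = "web_spider"  # placeholder name

    # Per-spider settings take precedence over the project-wide settings.py.
    custom_settings = {
        "CONCURRENT_REQUESTS": 20,             # up to 20 requests in flight at once (assumed value)
        "CONCURRENT_REQUESTS_PER_DOMAIN": 20,  # don't cap per-domain concurrency below the global limit
        "DOWNLOAD_DELAY": 0,                   # any delay here would serialize requests again
    }

    # ... __init__ / start_requests / parse / product_parse as in the question ...

With a setup like this, the requests yielded from parse() are scheduled concurrently by Scrapy's downloader up to the configured limits; the spider logic itself does not need to change.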