Scrapy keeps getting stuck again and again while running

Posted on 2025-02-10 07:38:53


The question is solved. The answer is in this tutorial.

I have been running a Scrapy script for crawling and scraping. It was all going fine, but while running, it keeps getting stuck at some point.
Here is what it shows

[scrapy.extensions.logstats] INFO: Crawled 1795 pages (at 0 pages/min), scraped 1716 items (at 0 items/min)

I then stopped the running code with Control+Z and reran the spider. Then again, after crawling and scraping some data, it got stuck. Have you faced this problem before? How did you overcome it?

Here is the link to the whole code

Here is the code of the spider

import scrapy
from scrapy.loader import ItemLoader
from healthgrades.items import HealthgradesItem
from scrapy_playwright.page import PageMethod

# parse a block of "key: value" header lines into a dict
def get_headers(s, sep=': ', strip_cookie=True, strip_cl=True, strip_headers: list = None) -> dict:
    d = dict()
    for kv in s.split('\n'):
        kv = kv.strip()
        if kv and sep in kv:
            k = kv.split(sep)[0]
            if len(kv.split(sep)) == 1:
                v = ''
            else:
                v = kv.split(sep)[1]
            if v == "''":
                v = ''
            if strip_cookie and k.lower() == 'cookie': continue
            if strip_cl and k.lower() == 'content-length': continue
            if strip_headers and k in strip_headers: continue
            d[k] = v
    return d

# spider class
class DoctorSpider(scrapy.Spider):
    name = 'doctor'
    allowed_domains = ['healthgrades.com']
    url = 'https://www.healthgrades.com/usearch?what=Massage%20Therapy&entityCode=PS444&where=New%20York&pageNum={}&sort.provider=bestmatch&='

    # change the bot's headers so it looks like a browser
    def start_requests(self):
        h = get_headers(
            '''
            accept: */*
            accept-encoding: gzip, deflate, br
            accept-language: en-US,en;q=0.9
            dnt: 1
            origin: https://www.healthgrades.com
            referer: https://www.healthgrades.com/
            sec-ch-ua: ".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"
            sec-ch-ua-mobile: ?0
            sec-ch-ua-platform: "Windows"
            sec-fetch-dest: empty
            sec-fetch-mode: cors
            sec-fetch-site: cross-site
            user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
            '''
        )

        for i in range(1, 6):  # change the range to cover the page numbers you need
            # GET request for each results page
            yield scrapy.Request(self.url.format(i), headers=h, meta=dict(
                playwright=True,
                playwright_include_page=True,
                # wait for the doctor cards to load before returning the page
                playwright_page_methods=[PageMethod('wait_for_selector', 'h3.card-name a')],
            ))

    def parse(self, response):
        for link in response.css('div h3.card-name a::attr(href)'):  # individual doctor's link
            yield response.follow(link.get(), callback=self.parse_categories)  # follow into the profile page

    def parse_categories(self, response):
        l = ItemLoader(item=HealthgradesItem(), selector=response)

        l.add_xpath('name', '//*[@id="summary-section"]/div[1]/div[2]/div/div/div[1]/div[1]/h1')
        l.add_xpath('specialty', '//*[@id="summary-section"]/div[1]/div[2]/div/div/div[1]/div[1]/div[2]/p/span[1]')
        l.add_xpath('practice_name', '//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/p')
        l.add_xpath('address', 'string(//*[@id="summary-section"]/div[1]/div[2]/div/div/div[2]/div[1]/address)')

        yield l.load_item()


Comments (1)

〗斷ホ乔殘χμё〖 2025-02-17 07:38:53


The issue is that the concurrency settings cap how many requests can be in flight at once.

Here is the solution

Concurrent Requests

Adding concurrency to Scrapy is actually a very simple task. There is already a setting for the number of concurrent requests allowed; you just have to modify it.

You can choose to modify this in the custom settings of the spider you've made, or in the global settings, which affect all spiders.
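If you are not sure what value a project is currently running with, you can read it back from the active settings before changing anything. A minimal sketch (not part of the original answer; assumes it is run from inside the Scrapy project):

from scrapy.utils.project import get_project_settings

# load the settings Scrapy would use for this project
settings = get_project_settings()

# Scrapy's built-in default for CONCURRENT_REQUESTS is 16 unless it has been overridden
print(settings.getint('CONCURRENT_REQUESTS'))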

Global

To add this globally, just add the following line to your settings file.

CONCURRENT_REQUESTS = 30

We've set the number of concurrent requests to 30. You may use any value you wish, within a reasonable limit.

Local

To add the setting locally, we use the spider's custom_settings attribute to set the concurrency for just that Scrapy spider.

custom_settings = {
    'CONCURRENT_REQUESTS': 30,
}
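For context, this is roughly where that override sits in the DoctorSpider from the question; a minimal sketch with the rest of the spider omitted:

import scrapy

class DoctorSpider(scrapy.Spider):
    name = 'doctor'
    allowed_domains = ['healthgrades.com']

    # per-spider override; takes precedence over the project-wide settings.py value
    custom_settings = {
        'CONCURRENT_REQUESTS': 30,
    }

    # ... start_requests, parse and parse_categories stay as they were ...

Note that custom_settings must be a class attribute rather than something set in __init__, because Scrapy reads it before the spider is instantiated.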

Additional Settings

There are many additional settings that you can use instead of, or together with, CONCURRENT_REQUESTS; a short settings.py sketch follows the list below.

  • CONCURRENT_REQUESTS_PER_IP – Sets the number of concurrent requests per IP address.
  • CONCURRENT_REQUESTS_PER_DOMAIN – Defines the number of concurrent requests allowed for each domain.
  • MAX_CONCURRENT_REQUESTS_PER_DOMAIN – Sets a maximum limit on the number of concurrent requests allowed for a domain.
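For illustration, here is a minimal settings.py sketch that combines the overall cap with two of the settings from this list; the values are placeholders, not recommendations:

# settings.py

# overall cap on requests the downloader performs concurrently
CONCURRENT_REQUESTS = 30

# per-domain cap (Scrapy's default is 8)
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# per-IP cap; if this is non-zero, the per-domain setting is ignored
# and the limit is applied per IP address instead
CONCURRENT_REQUESTS_PER_IP = 0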