Scraping a website with Scrapy and Selenium together



The biggest challenge for me is scraping multiple pages with Selenium and Scrapy. I have searched through many questions about how to scrape multiple pages with Selenium and Scrapy, but I could not find a solution: the spiders in those answers only scrape one page.

I used Selenium on its own to scrape multiple pages and it worked, but Selenium is slow at scraping many pages, so I want to move to Scrapy, which is much faster by comparison. This is the page link: https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx

import scrapy
from scrapy import Request
from selenium import webdriver


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def __init__(self):
        # A standalone browser opened next to Scrapy; it is never wired into
        # Scrapy's download flow, so clicking "Next" below does not change
        # what `response` contains in parse().
        self.driver = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver.exe')

    def parse(self, response):
        for k in range(1, 10):
            # `response` is always the first page, so this loop extracts the
            # same set of links on every iteration.
            books = response.xpath("//div[@class='list-group']//@href").extract()
            for book in books:
                url = response.urljoin(book)
                if url.endswith('.ro') or url.endswith('.ro/'):
                    continue
                yield Request(url, callback=self.parse_book)

        # The click happens only in the standalone Selenium browser; Scrapy
        # never sees the next page.
        next_link = self.driver.find_element_by_xpath("//a[@id='MainContent_PagerTop_NavNext']")
        next_link.click()

    def parse_book(self, response):
        title = response.xpath("//span[@id='HeadingContent_lblTitle']//text()").get()
        d1 = response.xpath("//div[@class='col-md-10']//p[1]//text()").get()
        d1 = d1.strip()
        d2 = response.xpath("//div[@class='col-md-10']//p[2]//text()").get()
        d2 = d2.strip()
        d3 = response.xpath("//div[@class='col-md-10']//p[3]//span//text()").get()
        d3 = d3.strip()
        d4 = response.xpath("//div[@class='col-md-10']//p[4]//text()").get()
        d4 = d4.strip()

        yield {
            "title1": title,
            "title2": d1,
            "title3": d2,
            "title4": d3,
            "title5": d4,
        }
        


梦一生花开无言 2025-02-17 05:12:24


You would be better off using or creating a downloader middleware for your Scrapy project. You will find everything about Scrapy's downloader middlewares in the documentation: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
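
If you want to roll your own, the sketch below shows the general shape of a Selenium downloader middleware. This is my own illustration rather than anything from the linked docs; the headless-Chrome setup, the class name SeleniumMiddleware, and the module path myproject.middlewares used when enabling it are assumptions.

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class SeleniumMiddleware:
    """Fetch each request with a headless Chrome and hand the rendered HTML to Scrapy."""

    def __init__(self):
        options = Options()
        options.add_argument("--headless")
        # Assumes chromedriver is available on PATH.
        self.driver = webdriver.Chrome(options=options)

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Returning an HtmlResponse from process_request short-circuits the
        # normal downloader, so the spider receives the browser-rendered page.
        self.driver.get(request.url)
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )

    def spider_closed(self, spider):
        self.driver.quit()

Enable it in settings.py (hypothetical module path):

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SeleniumMiddleware": 543,
}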

I would recommend using an existing library such as scrapy-selenium-middleware:

  1. Install the library: pip install scrapy-selenium-middleware
  2. Set the following settings in your scrapy project settings file:

DOWNLOADER_MIDDLEWARES = {"scrapy_selenium_middleware.SeleniumDownloader":451}
CONCURRENT_REQUESTS = 1 # multiple concurrent browsers are not supported yet
SELENIUM_IS_HEADLESS = False
SELENIUM_PROXY = "http://user:password@my-proxy-server:port" # set to None to not use a proxy
SELENIUM_USER_AGENT = "User-Agent: Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>"           
SELENIUM_REQUEST_RECORD_SCOPE = ["api*"] # a list of regular expression to record the incoming requests by matching the url
SELENIUM_FIREFOX_PROFILE_SETTINGS = {}
SELENIUM_PAGE_LOAD_TIMEOUT = 120

Find more about the library here: https://github.com/Tal-Leibman/scrapy-selenium-middleware
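
Either way, once a Selenium downloader middleware is doing the rendering, the spider itself should not create a WebDriver at all; it just yields requests and parses the rendered responses. Here is a rough sketch of how the spider from the question could then look; the "Next"-page navigation is left out, because how you drive it depends on the middleware you choose:

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']

    def parse(self, response):
        # The middleware has already rendered the page, so the profile links
        # are present in the response just as they are in a real browser.
        for href in response.xpath("//div[@class='list-group']//@href").extract():
            url = response.urljoin(href)
            if url.endswith('.ro') or url.endswith('.ro/'):
                continue
            yield scrapy.Request(url, callback=self.parse_book)

    def parse_book(self, response):
        def text(xpath):
            value = response.xpath(xpath).get()
            return value.strip() if value else None

        yield {
            "title1": text("//span[@id='HeadingContent_lblTitle']//text()"),
            "title2": text("//div[@class='col-md-10']//p[1]//text()"),
            "title3": text("//div[@class='col-md-10']//p[2]//text()"),
            "title4": text("//div[@class='col-md-10']//p[3]//span//text()"),
            "title5": text("//div[@class='col-md-10']//p[4]//text()"),
        }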
