Scraping a website with Scrapy and Selenium together



The biggest challenge for me is scraping multiple pages with Selenium and Scrapy. I have searched through many questions about how to scrape multiple pages with Selenium and Scrapy, but I could not find a solution: the spiders in those answers only scrape one page.

I used Selenium on its own to scrape multiple pages and it worked, but Selenium is slow at scraping many pages, so I want to move to Scrapy, which is much faster by comparison. This is the page link: https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx

import scrapy
from scrapy import Request
from selenium import webdriver


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def __init__(self):
        # A standalone browser opened next to Scrapy; it is never wired into
        # Scrapy's download flow, so clicking "Next" below does not change
        # what `response` contains in parse().
        self.driver = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver.exe')

    def parse(self, response):
        for k in range(1, 10):
            # `response` is always the first page, so this loop extracts the
            # same set of links on every iteration.
            books = response.xpath("//div[@class='list-group']//@href").extract()
            for book in books:
                url = response.urljoin(book)
                if url.endswith('.ro') or url.endswith('.ro/'):
                    continue
                yield Request(url, callback=self.parse_book)

        # The click happens only in the standalone Selenium browser; Scrapy
        # never sees the next page.
        next_link = self.driver.find_element_by_xpath("//a[@id='MainContent_PagerTop_NavNext']")
        next_link.click()

    def parse_book(self, response):
        title = response.xpath("//span[@id='HeadingContent_lblTitle']//text()").get()
        d1 = response.xpath("//div[@class='col-md-10']//p[1]//text()").get()
        d1 = d1.strip()
        d2 = response.xpath("//div[@class='col-md-10']//p[2]//text()").get()
        d2 = d2.strip()
        d3 = response.xpath("//div[@class='col-md-10']//p[3]//span//text()").get()
        d3 = d3.strip()
        d4 = response.xpath("//div[@class='col-md-10']//p[4]//text()").get()
        d4 = d4.strip()

        yield {
            "title1": title,
            "title2": d1,
            "title3": d2,
            "title4": d3,
            "title5": d4,
        }
        


梦一生花开无言 2025-02-17 05:12:24


You would be better off using or creating a downloader middleware for your Scrapy project. You will find everything about Scrapy's downloader middlewares in the documentation: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
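
If you want to roll your own, the sketch below shows the general shape of a Selenium downloader middleware. This is my own illustration rather than anything from the linked docs; the headless-Chrome setup, the class name SeleniumMiddleware, and the module path myproject.middlewares used when enabling it are assumptions.

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class SeleniumMiddleware:
    """Fetch each request with a headless Chrome and hand the rendered HTML to Scrapy."""

    def __init__(self):
        options = Options()
        options.add_argument("--headless")
        # Assumes chromedriver is available on PATH.
        self.driver = webdriver.Chrome(options=options)

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Returning an HtmlResponse from process_request short-circuits the
        # normal downloader, so the spider receives the browser-rendered page.
        self.driver.get(request.url)
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding="utf-8",
            request=request,
        )

    def spider_closed(self, spider):
        self.driver.quit()

Enable it in settings.py (hypothetical module path):

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SeleniumMiddleware": 543,
}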

I would recommend using an existing library such as scrapy-selenium-middleware:

  1. Install the library: pip install scrapy-selenium-middleware
  2. Set the following settings in your scrapy project settings file:

DOWNLOADER_MIDDLEWARES = {"scrapy_selenium_middleware.SeleniumDownloader":451}
CONCURRENT_REQUESTS = 1 # multiple concurrent browsers are not supported yet
SELENIUM_IS_HEADLESS = False
SELENIUM_PROXY = "http://user:password@my-proxy-server:port" # set to None to not use a proxy
SELENIUM_USER_AGENT = "User-Agent: Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>"           
SELENIUM_REQUEST_RECORD_SCOPE = ["api*"] # a list of regular expression to record the incoming requests by matching the url
SELENIUM_FIREFOX_PROFILE_SETTINGS = {}
SELENIUM_PAGE_LOAD_TIMEOUT = 120

Find more about the library here: https://github.com/Tal-Leibman/scrapy-selenium-middleware
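
Either way, once a Selenium downloader middleware is doing the rendering, the spider itself should not create a WebDriver at all; it just yields requests and parses the rendered responses. Here is a rough sketch of how the spider from the question could then look; the "Next"-page navigation is left out, because how you drive it depends on the middleware you choose:

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']

    def parse(self, response):
        # The middleware has already rendered the page, so the profile links
        # are present in the response just as they are in a real browser.
        for href in response.xpath("//div[@class='list-group']//@href").extract():
            url = response.urljoin(href)
            if url.endswith('.ro') or url.endswith('.ro/'):
                continue
            yield scrapy.Request(url, callback=self.parse_book)

    def parse_book(self, response):
        def text(xpath):
            value = response.xpath(xpath).get()
            return value.strip() if value else None

        yield {
            "title1": text("//span[@id='HeadingContent_lblTitle']//text()"),
            "title2": text("//div[@class='col-md-10']//p[1]//text()"),
            "title3": text("//div[@class='col-md-10']//p[2]//text()"),
            "title4": text("//div[@class='col-md-10']//p[3]//span//text()"),
            "title5": text("//div[@class='col-md-10']//p[4]//text()"),
        }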
