Scrape a website with Scrapy and Selenium
The biggest challenge for me is scraping multiple pages with Selenium and Scrapy. I have searched many questions about how to scrape multiple pages with Selenium and Scrapy, but I could not find any solution; the problem I am facing is that my spider will scrape only one page.
I used Selenium alone to scrape multiple pages and it worked, but Selenium is slow, so I want to move to Scrapy, which is much faster by comparison. This is the page link: https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx
import scrapy
from scrapy import Request
from selenium import webdriver


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ifep.ro/justice/lawyers/lawyerspanel.aspx']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def __init__(self):
        # Raw string so the backslashes in the Windows path are not treated as escapes.
        self.driver = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver.exe')

    def parse(self, response):
        for k in range(1, 10):
            books = response.xpath("//div[@class='list-group']//@href").extract()
            for book in books:
                url = response.urljoin(book)
                if url.endswith('.ro') or url.endswith('.ro/'):
                    continue
                yield Request(url, callback=self.parse_book)
            # The click happens in the Selenium browser, but `response` never changes,
            # so every iteration re-parses the same first page.
            next_page = self.driver.find_element_by_xpath("//a[@id='MainContent_PagerTop_NavNext']")
            next_page.click()

    def parse_book(self, response):
        title = response.xpath("//span[@id='HeadingContent_lblTitle']//text()").get()
        d1 = response.xpath("//div[@class='col-md-10']//p[1]//text()").get().strip()
        d2 = response.xpath("//div[@class='col-md-10']//p[2]//text()").get().strip()
        d3 = response.xpath("//div[@class='col-md-10']//p[3]//span//text()").get().strip()
        d4 = response.xpath("//div[@class='col-md-10']//p[4]//text()").get().strip()
        yield {
            "title1": title,
            "title2": d1,
            "title3": d2,
            "title4": d3,
            "title5": d4,
        }
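For context, the spider above only ever parses the first page: the driver opened in __init__ never loads the URL, and the Scrapy response passed to parse never changes. Below is a minimal sketch of what a Selenium-driven pagination loop would have to look like, reusing the element ids from the code above; the explicit wait condition is an assumption, not something from the original code.

# Drop-in replacement for parse(); these extra imports go at module level.
from scrapy import Selector
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def parse(self, response):
    # The browser must actually load the page before anything can be clicked.
    self.driver.get(response.url)
    for k in range(1, 10):
        # Re-parse the browser's current HTML instead of the original response.
        page = Selector(text=self.driver.page_source)
        for book in page.xpath("//div[@class='list-group']//@href").extract():
            url = response.urljoin(book)
            if url.endswith('.ro') or url.endswith('.ro/'):
                continue
            yield Request(url, callback=self.parse_book)
        # Click "next" and wait until the pager link is clickable (assumed wait).
        next_page = WebDriverWait(self.driver, 10).until(
            EC.element_to_be_clickable((By.ID, "MainContent_PagerTop_NavNext")))
        next_page.click()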
1 Answer
You would be better off using or creating a downloader middleware for your Scrapy project. You will find everything about Scrapy's downloader middlewares in the documentation: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
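For illustration, here is a minimal sketch of such a middleware, using only standard Scrapy and Selenium APIs; the class name and the choice to render every request in the browser are my own assumptions, not part of the answer.

# middlewares.py (illustrative) -- fetch pages with a real browser.
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumDownloaderMiddleware:
    def __init__(self):
        # One shared browser; fine while CONCURRENT_REQUESTS_PER_DOMAIN is 1.
        self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        # Returning a Response from process_request short-circuits Scrapy's
        # downloader, so the spider receives the browser-rendered HTML.
        self.driver.get(request.url)
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )

A production version would also quit the driver when the spider closes, for example from the spider_closed signal.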
I would recommend using an existing library such as scrapy-selenium-middleware:
pip install scrapy-selenium-middleware
Find more about the library here: https://github.com/Tal-Leibman/scrapy-selenium-middleware
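Whichever middleware you use, it is enabled through Scrapy's standard DOWNLOADER_MIDDLEWARES setting. The module path below is illustrative (it matches the sketch above); the exact class path and extra settings for scrapy-selenium-middleware are in its README.

# settings.py -- register the downloader middleware (module path is illustrative).
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumDownloaderMiddleware': 543,
}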