How to crawl data from a newly opened tab
I'm trying to crawl the product details from this webpage, https://www.goo-net.com/php/search/summary.php, with scrapy-selenium.
Because I want the detail information for each product, I first collect every product URL from the listing page, then use a callback to hand each URL to another method that scrapes that page's information.
I have tried a lot of solutions, but my output never shows anything.
Here is my code:
import scrapy
import selenium
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.keys import Keys


class Goonet1Spider(scrapy.Spider):
    name = 'goonet1'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.goo-net.com/php/search/summary.php',
            wait_time=4,
            callback=self.parse
        )

    def parse(self, response):
        # Collect the detail-page link of every product on the listing page.
        links = response.xpath("//*[@class='heading_inner']/h3/a")
        url_detail = []
        for link in links:
            url = response.urljoin(link.xpath(".//@href").get())
            url_detail.append(url)
        # Request each detail page and parse it in parse_item.
        for i in url_detail:
            yield SeleniumRequest(
                url=i,
                wait_time=4,
                callback=self.parse_item
            )

    def parse_item(self, response):
        base_price = response.xpath("//table[@class='mainData']/tbody/tr[2]/td[1]/span/text()").get()
        yield {
            'base_price': base_price
        }
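Before digging into parse_item, it can help to confirm that parse finds any links at all. Below is a minimal sketch of the same spider with log lines added; the logging calls are my addition for debugging, not part of the original code:

import scrapy
from scrapy_selenium import SeleniumRequest


class Goonet1Spider(scrapy.Spider):
    name = 'goonet1'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.goo-net.com/php/search/summary.php',
            wait_time=4,
            callback=self.parse,
        )

    def parse(self, response):
        links = response.xpath("//*[@class='heading_inner']/h3/a")
        # If this logs 0, the listing page never rendered the product list
        # (or the XPath no longer matches), so parse_item is never reached.
        self.logger.info("found %d product links", len(links))
        for link in links:
            yield SeleniumRequest(
                url=response.urljoin(link.xpath(".//@href").get()),
                wait_time=4,
                callback=self.parse_item,
            )

    def parse_item(self, response):
        # Log the raw extraction result so a failing XPath shows up as None.
        base_price = response.xpath(
            "//table[@class='mainData']/tbody/tr[2]/td[1]/span/text()"
        ).get()
        self.logger.info("base_price = %r", base_price)
        yield {'base_price': base_price}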
Here is my settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

# SELENIUM
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # '--headless' if using chrome instead of firefox
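With those settings in place, one quick way to check whether the spider yields anything at all is to export items to a file and read the stats Scrapy prints at the end of the crawl; if no item_scraped_count line appears, nothing was ever yielded:

scrapy crawl goonet1 -o items.json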
Please help me.
Comments (1)
Add the base URL to each entry in url_detail to complete your links:
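The code that originally followed this comment was not preserved. A minimal sketch of what it suggests, using urllib.parse.urljoin to prefix the site's base URL onto each relative href; the helper name and the example href are my assumptions for illustration:

from urllib.parse import urljoin

BASE_URL = 'https://www.goo-net.com'  # assumed site root

def absolutize(hrefs):
    """Prefix the base URL onto each (possibly relative) href."""
    return [urljoin(BASE_URL, href) for href in hrefs]

# Hypothetical relative href from the listing page, for illustration only:
print(absolutize(['/php/search/detail.php?id=123']))
# -> ['https://www.goo-net.com/php/search/detail.php?id=123']

Note that the spider above already calls response.urljoin() on each href, which performs the same base-URL join, so it is worth verifying with logging that the joined URLs actually look absolute.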