How to crawl data from a newly opened tab

Posted 2025-01-31 10:49:24

I'm trying to crawl the product details from this webpage https://www.goo-net.com/php/search/summary.php with scrapy-selenium.

Because I want the detail information for each product, I first collect every product URL from the listing page, then use a callback to pass each URL to another method that scrapes the details from that page.

I have tried a lot of solutions, but my output never shows anything.

Here is my code:

import scrapy
from scrapy_selenium import SeleniumRequest


class Goonet1Spider(scrapy.Spider):
    name = 'goonet1'

    def start_requests(self):
        # Render the listing page with Selenium so JS-generated content loads
        yield SeleniumRequest(
            url='https://www.goo-net.com/php/search/summary.php',
            wait_time=4,
            callback=self.parse
        )

    def parse(self, response):
        # Collect the link of every product on the listing page
        links = response.xpath("//*[@class='heading_inner']/h3/a")
        url_detail = []
        for link in links:
            url = response.urljoin(link.xpath(".//@href").get())
            url_detail.append(url)
        # Request each detail page and scrape it in parse_item
        for i in url_detail:
            yield SeleniumRequest(
                url=i,
                wait_time=4,
                callback=self.parse_item
            )

    def parse_item(self, response):
        # Price cell on the detail page
        base_price = response.xpath("//table[@class='mainData']/tbody/tr[2]/td[1]/span/text()").get()
        yield {
            'base_price': base_price
        }
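For reference, a quick way to confirm whether any items are produced at all is to run the spider with a feed export, e.g. scrapy crawl goonet1 -o items.json, and check whether the file stays empty.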

Here is my settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

# SELENIUM
from shutil import which
SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # use '-headless' instead when driving firefox
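One thing worth checking when a spider like this yields nothing: scrapy-selenium's SeleniumRequest also accepts a wait_until condition, which makes the driver wait for a concrete element rather than sleeping a fixed number of seconds. A minimal sketch of start_requests using it, assuming the heading_inner class from the XPath above is a reasonable marker that the page has rendered (the 10-second ceiling is likewise an assumption):

from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

def start_requests(self):
    yield SeleniumRequest(
        url='https://www.goo-net.com/php/search/summary.php',
        # wait_time acts as the WebDriverWait timeout once wait_until is set
        wait_time=10,
        wait_until=EC.presence_of_element_located(
            (By.CLASS_NAME, 'heading_inner')  # assumed render marker
        ),
        callback=self.parse
    )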

Please help me.

Comments (1)

手心的海 2025-02-07 10:49:24

Add the base URL to the raw hrefs to complete your links:

def parse(self, response):
    links = response.xpath("//*[@class='heading_inner']/h3/a")
    url_detail = []
    for link in links:
        # keep the raw (relative) href here; the base URL is prepended below
        url_detail.append(link.xpath(".//@href").get())
    for i in url_detail:
        link = "https://www.goo-net.com" + i
        yield SeleniumRequest(
            url=link,
            wait_time=4,
            callback=self.parse_item
        )
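For what it's worth, response.urljoin() in the question's original parse already resolves a relative href against the page URL, so a variant that keeps urljoin and drops the manual concatenation should be equivalent (a sketch, not verified against the live site):

def parse(self, response):
    for link in response.xpath("//*[@class='heading_inner']/h3/a"):
        # urljoin() returns an absolute URL even when the href is relative
        url = response.urljoin(link.xpath(".//@href").get())
        yield SeleniumRequest(
            url=url,
            wait_time=4,
            callback=self.parse_item
        )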