Creating an XPath with Scrapy

Posted on 2025-02-11 23:36:56

import scrapy
from scrapy.http import Request
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://rejestradwokatow.pl/adwokat/list/strona/1/sta/2,3,9']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        books = response.xpath("//td[@class='icon_link']//a//@href").extract()
        for book in books:
            url = response.urljoin(book)
            yield Request(url, callback=self.parse_book)

    def parse_book(self, response):
        wev = {}
        d1 = response.xpath("//*[@class='line_list_K']//div//span")
        for i in range(len(d1)):
            if 'Status:' in d1[i].get():
                d2 = response.xpath("//div[" + str(i + 1) + "]//text()").get()
                print(d2)


I expect to get the status value, but it gives me empty output. This is the page link: https://rejestradwokatow.pl/adwokat/abramska-danuta-51494



2 Answers

久夏青 2025-02-18 23:36:57


Why not select your element more specifically by its text, and get the text from its next sibling:

//span[text()[contains(.,'Status')]]/following-sibling::div/text()

Example: http://xpather.com/ZUWI58a4

To get the email:

//span[text()[contains(.,'Email')]]/following-sibling::div/(concat(@data-ea,'@',@data-eb))
原谅过去的我 2025-02-18 23:36:57


Your d2 XPath isn't targeting the correct div.

This should work:

def parse_book(self, response):
    wev = {}  # <- this is never used
    for child in response.xpath('//div[@class="line_list_K"]/*'):
        # .get() can return None for rows without a <span>, so default to ''
        if 'Status:' in (child.xpath(".//span/text()").get() or ''):
            d2 = child.xpath(".//div/text()").get()
            print(d2)