XPath gives an invalid expression error

Posted 2025-02-11 18:44:45

The spider below fails with an "invalid path expression" error. I am trying to scrape the email address from profile pages such as https://rejestradwokatow.pl/adwokat/abaewicz-dominik-49965

import scrapy
from scrapy.http import Request
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://rejestradwokatow.pl/adwokat/list/strona/1/sta/2,3,9']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        books = response.xpath("//td[@class='icon_link']//a//@href").extract()
        for book in books:
            url = response.urljoin(book)
            yield Request(url, callback=self.parse_book)

    def parse_book(self, response):
        # This is the expression that raises the invalid path expression error.
        data=response.xpath("//span[text()[contains(.,'Email')]]/following-sibling::div/(concat(@data-ea,'@',@data-eb)")
        yield {
            'email': data
        }

Comments (1)

聽兲甴掵 2025-02-18 18:44:45

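As you said, your XPath is wrong. Scrapy evaluates XPath 1.0, where a function call such as concat() cannot be used as a step inside a location path, and the original expression also has an unbalanced opening parenthesis. Moving concat() to the outside and passing it two full paths works: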

import scrapy
from scrapy.http import Request
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://rejestradwokatow.pl/adwokat/list/strona/1/sta/2,3,9']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        books = response.xpath("//td[@class='icon_link']//a//@href").extract()
        for book in books:
            url = response.urljoin(book)
            yield Request(url, callback=self.parse_book)

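    def parse_book(self, response):
        # concat() wraps both attribute lookups, so the XPath returns a single string.
        data = response.xpath("concat(//span[text()[contains(.,'Email')]]/following-sibling::div/@data-ea, '@',//span[text()[contains(.,'Email')]]/following-sibling::div/@data-eb)").get()
        # If no email div is found, both attributes are empty and concat() yields just '@'.
        if data == '@':
            data = 'No Email Address'
        yield {
            'email': data
        }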

(By the way, you can also get the address without concat() if you prefer.)
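One possible way to do this without concat(), as a rough sketch rather than the answerer's own method: select the div once, read the two attributes separately, and join them in Python. The names `EMAIL_DIV`, `join_email`, and `extract_email` are hypothetical helpers introduced only for illustration, and the sketch assumes `data-ea` holds the part before the `@` and `data-eb` the part after it, as the concat() call in the answer implies.

```python
from typing import Optional

# Same node selection as in the answer, without the attribute step.
EMAIL_DIV = "//span[text()[contains(.,'Email')]]/following-sibling::div"


def join_email(local: Optional[str], domain: Optional[str]) -> Optional[str]:
    """Join the two halves, or return None if either half is missing or empty."""
    if not local or not domain:
        return None
    return f"{local}@{domain}"


def extract_email(response) -> Optional[str]:
    """Read both attributes from the email div and join them in Python.

    `response` is assumed to be a Scrapy response (or any object exposing
    the same .xpath(...).get() interface).
    """
    div = response.xpath(EMAIL_DIV)
    # .get() returns None when the attribute (or the div itself) is absent.
    local = div.xpath("@data-ea").get()
    domain = div.xpath("@data-eb").get()
    return join_email(local, domain)
```

Inside the spider, `parse_book` could then yield `{'email': extract_email(response) or 'No Email Address'}`. Compared with the concat() version, a missing address shows up as `None` instead of a bare `'@'` string, which may be easier to check for.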
