XPath gives an invalid expression error

Posted on 2025-02-11 18:44:45

Scrapy gives me an "invalid path expression" error when I try to scrape the email address from a page such as https://rejestradwokatow.pl/adwokat/abaewicz-dominik-49965

import scrapy
from scrapy.http import Request
from scrapy.crawler import CrawlerProcess

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://rejestradwokatow.pl/adwokat/list/strona/1/sta/2,3,9']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        books = response.xpath("//td[@class='icon_link']//a//@href").extract()
        for book in books:
            url = response.urljoin(book)
            yield Request(url, callback=self.parse_book)

    def parse_book(self, response):
        data = response.xpath("//span[text()[contains(.,'Email')]]/following-sibling::div/(concat(@data-ea,'@',@data-eb)")
        yield {
            'email': data
        }

聽兲甴掵 2025-02-18 18:44:45

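As you said, your XPath is wrong. Scrapy evaluates XPath 1.0, where a function call such as concat() cannot be used as a path step after "/", and the expression also has an unclosed parenthesis. Make concat() the outermost call and pass it the full path to each attribute: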

import scrapy
from scrapy.http import Request
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://rejestradwokatow.pl/adwokat/list/strona/1/sta/2,3,9']
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
        'DOWNLOAD_DELAY': 1,
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }

    def parse(self, response):
        books = response.xpath("//td[@class='icon_link']//a//@href").extract()
        for book in books:
            url = response.urljoin(book)
            yield Request(url, callback=self.parse_book)

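    def parse_book(self, response):
        # concat() is the outermost call; each argument is a full path to one attribute.
        data = response.xpath("concat(//span[text()[contains(.,'Email')]]/following-sibling::div/@data-ea, '@',//span[text()[contains(.,'Email')]]/following-sibling::div/@data-eb)").get()
        # When both attributes are missing, concat() returns just '@'.
        if data == '@':
            data = 'No Email Address'
        yield {
            'email': data
        }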

By the way, you can also get the address without concat(), for example by selecting the two attributes separately and joining them in Python.
