Getting the text associated with an href element on a given page in Scrapy

Posted on 2025-01-10 09:45:00

Currently, the 'yield' in my Scrapy spider looks like this:

yield {
    'hreflink': mylink,
    'Parentlink': response.url,
}

This returns a dict:

{
    'hreflink': "https://www.southeasthealth.org/wp-content/uploads/Southeast-Health-Standard-Charges-2022.xlsx",
    'Parentlink': "https://www.southeasthealth.org/financial-information-price-transparency/"
}

Now I also want the text associated with that particular hreflink on that particular Parentlink page, so my final output should look like this:

{
    'hreflink': "https://www.southeasthealth.org/wp-content/uploads/Southeast-Health-Standard-Charges-2022.xlsx",
    'Parentlink': "https://www.southeasthealth.org/financial-information-price-transparency/",
    'Yourtext': "Download Pricing Info"
}

What would be the simplest way to achieve that? I want to use an XPath expression to get the text of the <a> element on the parent page whose @href equals my link.

So far, here is what I tried:

Yourtext = response.xpath('//a[@href='+json.dumps(each)+']//text()').get()

but it is not printing anything. I tried printing my response, and it shows the right page: 'https://www.southeasthealth.org/financial-information-price-transparency/'
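
One way to get a link and its text together is to read both from the same <a> element, rather than looking the link up again afterwards. The sketch below shows the idea with Python's standard-library ElementTree on a hypothetical, well-formed snippet; PAGE and links_with_text are illustrative names, not part of the spider above. In a Scrapy callback the same loop could presumably be written with response.xpath('//a[@href]'), a.xpath('@href').get() and a.xpath('.//text()').getall().

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the parent page's markup.
PAGE = """
<div>
  <a href="/wp-content/uploads/Charges-2022.xlsx">
    <span class="fusion-button-text">Download Pricing Info</span>
  </a>
  <a href="/contact/">Contact us</a>
</div>
"""


def links_with_text(markup, parent_url):
    """Return one dict per <a href=...>, pairing the href with its visible text."""
    root = ET.fromstring(markup)
    items = []
    for anchor in root.iter("a"):
        href = anchor.get("href")
        if href is None:
            continue  # anchors without an href are not links
        # itertext() walks nested elements such as <span>; collapse whitespace.
        text = " ".join("".join(anchor.itertext()).split())
        items.append({
            "hreflink": href,
            "Parentlink": parent_url,
            "Yourtext": text,
        })
    return items


items = links_with_text(
    PAGE,
    "https://www.southeasthealth.org/financial-information-price-transparency/",
)
for item in items:
    print(item)
```

Note that the href here is the raw attribute value as written in the page. If mylink was made absolute (for example with response.urljoin), it may no longer equal the raw @href, which is one possible reason an exact-match XPath returns nothing.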


Comments (2)

墨离汐 2025-01-17 09:45:00

If I understand you correctly, you want to get the text belonging to the link "Download Pricing Info".

I suggest you try using:

response.xpath("//span[@class='fusion-button-text']//text()").get()

闻呓 2025-01-17 09:45:00

I found the answer to my question.

'//a[@href='+json.dumps(each)+']//text()' 

This is the correct expression. However, the comparison against the href value in 'each' is case-sensitive, so 'each' must match the page's href attribute exactly for this XPath to work.
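
To illustrate the point about exact matching, here is a minimal sketch using the standard-library ElementTree on a hypothetical snippet (PAGE, text_for_href and text_for_href_ignoring_case are made-up names for illustration). The first helper builds the same kind of predicate as the expression above; the second is one possible fallback that compares hrefs case-insensitively in Python.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical markup; note the capital letters in the href path.
PAGE = (
    '<div><a href="https://example.org/Files/Charges-2022.xlsx">'
    "<span>Download Pricing Info</span></a></div>"
)
root = ET.fromstring(PAGE)


def text_for_href(root, href):
    """Exact, case-sensitive match using an [@href="..."] predicate."""
    anchor = root.find(".//a[@href=" + json.dumps(href) + "]")
    if anchor is None:
        return None
    return "".join(anchor.itertext()).strip()


def text_for_href_ignoring_case(root, href):
    """Fallback: compare lower-cased href values in Python instead of XPath."""
    wanted = href.lower()
    for anchor in root.iter("a"):
        if anchor.get("href", "").lower() == wanted:
            return "".join(anchor.itertext()).strip()
    return None


exact = text_for_href(root, "https://example.org/Files/Charges-2022.xlsx")
wrong_case = text_for_href(root, "https://example.org/files/charges-2022.xlsx")
relaxed = text_for_href_ignoring_case(root, "https://example.org/files/charges-2022.xlsx")
print(exact, wrong_case, relaxed)
```

One caveat on json.dumps: it is used here only to wrap the URL in double quotes. XPath string literals have no backslash escapes, so a value containing a double quote or non-ASCII characters would be altered by json.dumps and would likely not match.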
