Get the text associated with an href element on a given page in Scrapy

Posted 2025-01-10 09:45:00

Currently, the 'yield' in my Scrapy spider looks as follows:

yield {
    'hreflink': mylink,
    'Parentlink': response.url,
}

This gives me a dict:

{
    'hreflink': "https://www.southeasthealth.org/wp-content/uploads/Southeast-Health-Standard-Charges-2022.xlsx",
    'Parentlink': "https://www.southeasthealth.org/financial-information-price-transparency/"
}

Now I also want the text associated with this particular hreflink on that particular Parentlink page. My final output should look like this:

{
    'hreflink': "https://www.southeasthealth.org/wp-content/uploads/Southeast-Health-Standard-Charges-2022.xlsx",
    'Parentlink': "https://www.southeasthealth.org/financial-information-price-transparency/",
    'Yourtext': "Download Pricing Info"
}

What would be the simplest way to achieve that? I want to use an XPath expression to get the text of the link on the parent page whose @href equals my hreflink.

So far, here is what I tried:

Yourtext = response.xpath('//a[@href=' + json.dumps(each) + ']//text()').get()

but it does not print anything. I tried printing my response, and it returns the right page: 'https://www.southeasthealth.org/financial-information-price-transparency/'

墨离汐 2025-01-17 09:45:00

If I understand you correctly, you want to get the text belonging to the "Download Pricing Info" link.

I suggest you try:

response.xpath("//span[@class='fusion-button-text']//text()").get()

闻呓 2025-01-17 09:45:00

I found the answer to my question.

'//a[@href='+json.dumps(each)+']//text()'

This is the correct expression. However, the comparison against @href is case sensitive, so the href link 'each' needs to match the attribute value in the page exactly for this XPath to work.

About the author: 两个我
