Beautiful Soup - how to get href

Posted 2024-12-04 07:01:20 | 413 characters | 0 views | 0 comments

I can't seem to be able to extract the href (there is only one <strong>Website:</strong> on the page) from the following soup of html:

<div id='id_Website'>
<strong>Website:</strong> 
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>

This is what I thought should work

href = soup.find("strong" ,text=re.compile(r'Website')).next["href"]

拥醉 2024-12-11 07:01:20

.next in this case is a NavigableString containing the whitespace between the <strong> tag and the <a> tag, so indexing it with ["href"] fails. Also, the text= argument is for matching NavigableStrings, rather than elements.

The following does what you want, I think:

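import re
from bs4 import BeautifulSoup  # Beautiful Soup 4; the original answer imported BeautifulSoup 3

html = '''<div id='id_Website'>
<strong>Website:</strong> 
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>'''

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

# string= is the bs4 name for the older text= argument
for t in soup.find_all(string=re.compile(r'Website:')):
    # Find the parent of the NavigableString, and see
    # whether that's a <strong>:
    s = t.parent
    if s.name == 'strong':
        # Skip the whitespace string after </strong> to reach the <a>
        print(s.next_sibling.next_sibling['href'])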

... but that isn't very robust. If the enclosing div has a predictable ID, then it would be better to find that, and then find the first <a> element within it.
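A minimal sketch of that more robust approach, assuming Beautiful Soup 4 (the `bs4` package) and that the enclosing div really does keep the id `id_Website` shown in the question. The helper name `extract_website_href` is made up for illustration and is not part of the library:

```python
from bs4 import BeautifulSoup

html = """<div id='id_Website'>
<strong>Website:</strong> 
<a href='http://google.com' target='_blank' rel='nofollow'>www.google.com</a>
</div></div><div>"""


def extract_website_href(markup, div_id="id_Website"):
    """Return the href of the first <a> inside the div with the given id, or None.

    Hypothetical helper; div_id defaults to the id from the question.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # Locate the container by its id instead of by the label text.
    container = soup.find("div", id=div_id)
    if container is None:
        return None
    # href=True only matches <a> tags that actually carry an href attribute.
    link = container.find("a", href=True)
    return link["href"] if link is not None else None


href = extract_website_href(html)
print(href)
```

Because this searches for the `<a>` tag inside the container, it does not depend on how much whitespace sits between `</strong>` and `<a>`, and it returns None instead of raising when the div or the link is missing.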
