Extracting a tag's value in BeautifulSoup when you can't match by position or attribute

Posted on 2024-09-13 07:54:52


I'm using BeautifulSoup (BS) to scrape a web page and I'm stuck on a small problem. Here's a snippet of HTML from the page.

<span style="font-family: arial;"><span style="font-weight: bold;">Artist:</span> M.I.A.<br>
</span>

Once I've got the soup, how can I find this tag and get the artist name, i.e. M.I.A.?
I cannot match the tag by its style attribute, because that style is used in a dozen places on the page. I don't know the exact location of the span tag either, because it changes position from page to page, so I can't match by position. The artist name changes, but the structure of the title span is always the same.

I would like to extract only the artist name (the M.I.A. bit).


川水往事 2024-09-20 07:54:52

BeautifulSoup 3 is more or less dead, since the SGMLParser it is built on is deprecated (BeautifulSoup 4 no longer depends on it). I suggest you use the lxml library instead; it even has XPath support.

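from lxml import html

# Sample markup: a bold "Artist:" label span nested inside an outer span.
text = '''
<span style="font-family: arial;">
    <span style="font-weight: bold;">Artist:</span>M.I.A.<br>
</span>
'''

doc = html.fromstring(text)
# Find the label span, step up to its parent, and collect the parent's own text nodes.
artist = ''.join(doc.xpath("//span/span[text()='Artist:']/../text()"))
# strip() drops the newlines and indentation that surround the name in the markup.
print(artist.strip())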

This XPath expression means "find the span tag that sits inside another span tag and whose text is 'Artist:', then grab all the text nodes that belong directly to its parent". It prints M.I.A., as expected.
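
The parent's text nodes also include the whitespace around the name, so a slightly narrower variant is to read only the text that immediately follows the label. This is a minimal sketch, assuming lxml is installed. `extract_artist` is a hypothetical helper name, not part of lxml. It relies on lxml keeping the text that follows an element's closing tag in that element's `.tail` attribute.

```python
from lxml import html


def extract_artist(page_source):
    """Return the text right after the 'Artist:' label span, or None.

    Hypothetical helper for illustration; assumes the label is a span
    whose whole text content is exactly 'Artist:'.
    """
    doc = html.fromstring(page_source)
    # normalize-space(.) tolerates stray whitespace inside the label span.
    labels = doc.xpath("//span[normalize-space(.)='Artist:']")
    if not labels:
        return None
    # .tail holds the text between the label's closing tag and the next tag.
    return (labels[0].tail or "").strip()


sample = '''
<span style="font-family: arial;"><span style="font-weight: bold;">Artist:</span> M.I.A.<br>
</span>
'''

print(extract_artist(sample))
```

Because the lookup keys on the label text rather than on the style attribute or the position in the page, it should keep working when the span moves around, as long as the "Artist:" label itself stays the same.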
