Beautiful Soup cannot find a CSS class if the object has other classes, too

Posted 2024-07-30 07:49:04

If a page has <div class="class1"> and <p class="class1">, then soup.findAll(True, 'class1') will find them both.

If it has <p class="class1 class2">, though, it will not be found. How do I find all objects with a certain class, regardless of whether they have other classes, too?

Comments (4)

﹉夏雨初晴づ 2024-08-06 07:49:04

It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:

Like:

soup.find_all("a", class_="class1")
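
To tie this back to the question, here is a minimal sketch of class_ matching a tag that carries more than one class. The sample markup and variable names are made up for illustration, and it assumes Beautiful Soup 4 (4.1.2 or later) with the standard-library "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the asker's real page.
html = """
<div class="class1">only class1</div>
<p class="class1 class2">class1 plus another class</p>
<p class="class2">only class2</p>
"""

soup = BeautifulSoup(html, "html.parser")

# True as the tag name matches every tag; class_ matches when
# "class1" is any one of the tag's space-separated classes.
matches = soup.find_all(True, class_="class1")
matched_names = [tag.name for tag in matches]

print(matched_names)
```

Both the <div> and the two-class <p> should come back, while the <p> that only has class2 is skipped.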
海未深 2024-08-06 07:49:04

You should use lxml. It works with multiple class values separated by spaces ('class1 class2').

Despite its name, lxml is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.

Ian Bicking agrees and prefers lxml over BeautifulSoup.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.

You can even use CSS selectors with lxml, so it's far easier to use than BeautifulSoup. Try playing around with it in an interactive Python console.

澜川若宁 2024-08-06 07:49:04

Just in case anybody comes across this question. BeautifulSoup now supports this:

Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)]
Type "copyright", "credits" or "license" for more information.

In [1]: import bs4

In [2]: soup = bs4.BeautifulSoup('<div class="foo bar"></div>')

In [3]: soup(attrs={'class': 'bar'})
Out[3]: [<div class="foo bar"></div>]

Also, you don't have to type findAll anymore.

蓦然回首 2024-08-06 07:49:04

Unfortunately, BeautifulSoup treats this as a class with a space in it 'class1 class2' rather than two classes ['class1','class2']. A workaround is to use a regular expression to search for the class instead of a string.

This works:

import re
soup.findAll(True, {'class': re.compile(r'\bclass1\b')})
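
One caveat with the \b pattern: a hyphen also counts as a word boundary, so it can match class names that merely start with class1-. The sketch below only exercises the regular expressions against raw class strings (no Beautiful Soup involved). The stricter pattern and the sample strings are my own assumptions for illustration, not part of the answer above.

```python
import re

# Pattern from the answer: word boundaries on both sides.
loose = re.compile(r"\bclass1\b")

# Stricter variant (an assumption, not from the answer): only
# whitespace or the ends of the string may delimit the class name.
strict = re.compile(r"(?:^|\s)class1(?:\s|$)")

# Hypothetical class attribute values.
samples = ["class1", "class1 class2", "class10", "class1-wide"]

loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)   # includes "class1-wide", since "-" is a word boundary
print(strict_hits)  # only values where class1 stands alone as a class
```

If none of your class names contain hyphens, the simpler \b version is probably enough.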