Beautiful Soup cannot find a CSS class if the object has other classes, too
If a page has <div class="class1"> and <p class="class1">, then soup.findAll(True, 'class1') will find them both. If it has <p class="class1 class2">, though, it will not be found. How do I find all objects with a certain class, regardless of whether they have other classes, too?
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_.
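The snippet that went with this answer did not survive, so here is a minimal sketch of what a class_ search might look like. It assumes Beautiful Soup 4 (the bs4 package) with the built-in html.parser; the markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mirroring the question: three tags carry class1,
# one of them together with a second class.
html = """
<div class="class1">a</div>
<p class="class1">b</p>
<p class="class1 class2">c</p>
<p class="other">d</p>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches any tag whose class list contains "class1",
# whether or not the tag has other classes as well.
matches = soup.find_all(class_="class1")
texts = [tag.get_text() for tag in matches]
print(texts)
```

The trailing underscore is only there to avoid the clash with Python's class keyword; the attribute being matched is still the HTML class attribute.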
You should use lxml. It works with multiple class values separated by spaces ('class1 class2').
Despite its name, lxml is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Bicking agrees and prefers lxml over BeautifulSoup.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
You can even use CSS selectors with lxml, so it's far easier to use than BeautifulSoup. Try playing around with it in an interactive Python console.
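As a rough sketch of the lxml route, assuming lxml is installed: the CSS-selector method in lxml relies on the separate cssselect package, so this example swaps in an equivalent XPath expression instead. The markup and names are invented for illustration.

```python
import lxml.html

# Hypothetical document mirroring the question.
page = """
<html><body>
<div class="class1">a</div>
<p class="class1">b</p>
<p class="class1 class2">c</p>
<p class="other">d</p>
</body></html>
"""

doc = lxml.html.fromstring(page)

# Pad the class attribute with spaces so that "class1" is matched as a
# whole token, not as a substring of some longer class name.
xpath = "//*[contains(concat(' ', normalize-space(@class), ' '), ' class1 ')]"
elements = doc.xpath(xpath)
lxml_texts = [el.text for el in elements]
print(lxml_texts)
```

With cssselect available, doc.cssselect(".class1") should select the same elements with much less typing.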
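Just in case anybody comes across this question: BeautifulSoup now supports this, so a search for one class also matches tags that carry additional classes.
Also, you don't have to type findAll anymore; calling the soup object directly runs the same search.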
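The original snippet is missing here, so the following is only a sketch of what the answer seems to describe, assuming Beautiful Soup 4 and the built-in html.parser. The markup and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first div has two classes.
soup = BeautifulSoup(
    '<div class="foo bar"></div><div class="baz"></div>', "html.parser"
)

# Searching for "bar" matches the div even though it also has "foo".
via_find_all = soup.find_all(attrs={"class": "bar"})

# Calling the soup object is shorthand for find_all().
via_call = soup(attrs={"class": "bar"})

# bs4 stores a multi-valued class attribute as a list of class names.
classes = via_call[0]["class"]
print(via_call, classes)
```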
Unfortunately, BeautifulSoup treats this as a single class with a space in it, 'class1 class2', rather than two classes, ['class1', 'class2']. A workaround is to use a regular expression, rather than a string, to search for the class.
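A sketch of the regular-expression workaround, since the original snippet was lost. The pattern \bclass1\b is my assumption about what such a search would use, and the example runs against Beautiful Soup 4 with html.parser, where a plain string search already handles multiple classes but a compiled pattern is still accepted.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup; class10 is included to show that the word
# boundaries keep the pattern from matching longer class names.
html = (
    '<p class="class1 class2">x</p>'
    '<p class="class1">y</p>'
    '<p class="class10">z</p>'
)
soup = BeautifulSoup(html, "html.parser")

# \b anchors the match to word boundaries, so "class1" must appear as
# its own word inside the class value.
pattern = re.compile(r"\bclass1\b")

# True matches any tag name; the dict restricts on the class attribute.
found = soup.find_all(True, {"class": pattern})
found_texts = [tag.get_text() for tag in found]
print(found_texts)
```

Note that \b also treats a hyphen as a boundary, so a class such as class1-wide would match too; a stricter pattern would be needed if that matters.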