How do I filter out tags without an attribute in Beautifulsoup's find_all() function?

Posted on 2025-01-12 18:40:43

Below is the simple HTML source I'm working with:

<html>
<head>
<title>Welcome to the comments assignment from www.py4e.com</title>
</head>
<body>
<h1>This file contains the actual data for your assignment - good luck!</h1>

<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>

Below is the code I'm using to try to get the <td>Melodie</td> line:

from bs4 import BeautifulSoup

html = 'html text file above'  # placeholder for the HTML source shown above

soup = BeautifulSoup(html, 'html.parser')

for tag in soup.find_all('td'):
    print(tag)
    print('----')

# Result:
#===============================================================================
# <td>Name</td>
# ----
# <td>Comments</td>
# ----
# <td>Melodie</td>
# ----
# <td><span class="comments">100</span></td>
# ----
# <td>Machaela</td>
# ----
# <td><span class="comments">100</span></td>
# ----
# <td>Rhoan</td>
# ----
#.........
#===============================================================================

Now I want to get only the <td>name</td> lines, not the lines containing 'span' and 'class'. I tried two filters, soup.find_all('td' and not 'span') and soup.find_all('td', attrs={'class':None}), but neither of them works. I know there are other ways to do this, but I want to use a filter in soup.find_all().
My expected output (my final goal is actually to get the person's name between the two <td> tags):

# <td>Name</td>
# ----
# <td>Comments</td>
# ----
# <td>Melodie</td>
# ----
# <td>Machaela</td>
# ----
# <td>Rhoan</td>
# ----
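
A note on the two attempted filters, as a minimal sketch rather than a full explanation: the first one is evaluated by Python before find_all() ever sees it, and the second one tests the class attribute of the <td> itself, while in this markup the class sits on the nested <span>. The one-row snippet below is a shortened stand-in for the page above, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Python evaluates the argument first: `not 'span'` is False,
# so `'td' and not 'span'` is False as well.
first_filter = 'td' and not 'span'
print(first_filter)  # False

# Shortened stand-in for the page in the question (assumption: one row is enough).
row = '<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>'
soup = BeautifulSoup(row, 'html.parser')

# Neither <td> carries a class attribute of its own; only the <span> does.
cell_attrs = [td.attrs for td in soup.find_all('td')]
print(cell_attrs)  # [{}, {}]
```

So a filter on the attributes of <td> cannot tell the two kinds of cell apart; the test has to look inside each cell.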

Comments (3)

流星番茄 2025-01-19 18:40:43

You can get the desired output with two separate selector calls:

from bs4 import BeautifulSoup

html = """
<body>
<table border="2">
<tr>
<td>Name</td><td>Comments</td>
</tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
"""
soup = BeautifulSoup(html, "lxml")

for elem in soup.select("td"):
    if not elem.select(".comments"):
        print(elem)

Output:

<td>Name</td>
<td>Comments</td>
<td>Melodie</td>
<td>Machaela</td>
<td>Rhoan</td>

As an aside, prefer lxml to html.parser. It's faster and more robust to malformed HTML.
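
Since the question asks specifically for a filter inside find_all(), the same check can probably be moved there: find_all() accepts a function that receives each tag and returns True or False. The sketch below is one way to write it, not the only one; the helper name td_without_span is made up for illustration, the table is closed off so the snippet is self-contained, and html.parser is used only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

html = """
<table border="2">
<tr><td>Name</td><td>Comments</td></tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
</table>
"""


def td_without_span(tag):
    """Return True for <td> tags that have no <span> descendant."""
    return tag.name == "td" and tag.find("span") is None


soup = BeautifulSoup(html, "html.parser")

# The function is called once per tag; only tags it accepts are returned.
cells = soup.find_all(td_without_span)
cell_texts = [cell.get_text() for cell in cells]
print(cell_texts)  # ['Name', 'Comments', 'Melodie', 'Machaela', 'Rhoan']
```

Replacing tag.find("span") with tag.find(class_="comments") would filter on the class instead of the tag name.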

因为看清所以看轻 2025-01-19 18:40:43

Select your elements via CSS selectors, e.g. by nesting the pseudo-classes :has() and :not():

soup.select('td:not(:has(span))')

or

soup.select('td:not(:has(.comments))')

Example

import urllib.request

from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://py4e-data.dr-chuck.net/comments_1430669.html').read()

soup = BeautifulSoup(html, 'html.parser')

# Keep only the <td> elements that do not contain a <span>
for e in soup.select('td:not(:has(span))'):
    print(e)

Output

<td>Name</td>
<td>Comments</td>
<td>Melodie</td>
<td>Machaela</td>
<td>Rhoan</td>
<td>Murrough</td>
<td>Lilygrace</td>
...
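
For the stated final goal (the person's name rather than the whole tag), a rough sketch along the same lines is to apply the selector per row and read the text of the matching cell. It uses an inline copy of the first rows instead of fetching the page, and it assumes the first <tr> is the header row, which is true for the markup in the question but is an assumption about any other page.

```python
from bs4 import BeautifulSoup

html = """
<table border="2">
<tr><td>Name</td><td>Comments</td></tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Machaela</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

people = []
for row in soup.select("tr")[1:]:  # assumption: the first row is the header
    # First <td> in this row that does not contain a <span>
    cell = row.select_one("td:not(:has(span))")
    if cell is not None:
        people.append(cell.get_text(strip=True))

print(people)  # ['Melodie', 'Machaela', 'Rhoan']
```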
路还长,别太狂 2025-01-19 18:40:43

I know it has been 12 months since the question was posted, but I hope this can help those who will come after us. I have tried and tried to find the most concise code for a beginner like me. Here it is:

import re

from bs4 import BeautifulSoup

# Create the variables (html is assumed to hold the page source from the question)
soup = BeautifulSoup(html, "html.parser")
my_list = []

# Ask BeautifulSoup for all <td> tags whose string contains at least one letter (a-z, A-Z)
names = soup.find_all("td", string=re.compile("[a-zA-Z]"))

for name in names:
    my_list.append(name)
    print(name)

print(my_list)
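
A short sketch of why the string filter works here, under the assumption that the cells look like those in the question: with string=, find_all() compares the pattern against each tag's .string, and a <td> whose only child is a <span> reports that span's text. The count cells therefore present "100" or "99", which contain no letters. The anchored pattern below is a stricter variant chosen for illustration, not something the answer above uses.

```python
import re

from bs4 import BeautifulSoup

html = """
<table>
<tr><td>Name</td><td>Comments</td></tr>
<tr><td>Melodie</td><td><span class="comments">100</span></td></tr>
<tr><td>Rhoan</td><td><span class="comments">99</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# What the string filter is compared against for each cell
cell_strings = [str(td.string) for td in soup.find_all("td")]
print(cell_strings)  # ['Name', 'Comments', 'Melodie', '100', 'Rhoan', '99']

# Stricter variant: the whole string must consist of letters
letters_only = re.compile(r"^[A-Za-z]+$")
name_texts = [td.get_text() for td in soup.find_all("td", string=letters_only)]
print(name_texts)  # ['Name', 'Comments', 'Melodie', 'Rhoan']
```

Note that the unanchored [a-zA-Z] would also accept a count cell such as "100 likes", so this approach depends on the data, whereas the structural filters look at the markup.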
