BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?

Posted 2024-10-06 19:50:56

I'm a newbie programmer trying to jump into Python by building a script that scrapes http://en.wikipedia.org/wiki/2000s_in_film and extracts a list of "Movie Title (Year)" entries.
My HTML source looks like:

<h3>Header3 (Start here)</h3>
<ul>
    <li>List items</li>
    <li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
    <li>List items</li>
    <ul>
        <li>Nested list items</li>
        <li>Nested list items</li></ul>
    <li>List items</li>
</ul>
<h2>Header 2 (end here)</h2>

I'd like all the li tags following the first h3 tag and stopping at the next h2 tag, including all nested li tags.

firstH3 = soup.find('h3')

...correctly finds the place I'd like to start.

firstH3 = soup.find('h3') # Start here
uls = []
for nextSibling in firstH3.findNextSiblings():
    if nextSibling.name == 'h2':
        break
    if nextSibling.name == 'ul':
        uls.append(nextSibling)

...gives me a list uls, each with li contents that I need.
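The sibling walk above can be tried without fetching Wikipedia. The following is a minimal sketch that runs it against the sample markup from the question; the names `SAMPLE_HTML` and `collect_uls` are made up for illustration, an extra list after the h2 is added only to show the stop condition, and `find_next_siblings()` is the current bs4 spelling of `findNextSiblings()`.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, plus one extra <ul> after the <h2>
# (an addition for this sketch) to show that the walk really stops there.
SAMPLE_HTML = """
<h3>Header3 (Start here)</h3>
<ul>
    <li>List items</li>
    <li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
    <li>List items</li>
    <ul>
        <li>Nested list items</li>
        <li>Nested list items</li></ul>
    <li>List items</li>
</ul>
<h2>Header 2 (end here)</h2>
<ul>
    <li>Past the stop point</li>
</ul>
"""


def collect_uls(soup):
    """Return the top-level <ul> tags between the first <h3> and the next <h2>."""
    first_h3 = soup.find("h3")
    uls = []
    for sibling in first_h3.find_next_siblings():
        if sibling.name == "h2":
            break  # reached the end of the section
        if sibling.name == "ul":
            uls.append(sibling)
    return uls


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
uls = collect_uls(soup)
print(len(uls))
```

Only siblings of the h3 are visited, so the nested ul is not collected on its own; it stays inside its parent ul.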

Excerpt of the uls list:

<ul>
...
    <li><i><a href="/wiki/Agent_Cody_Banks" title="Agent Cody Banks">Agent Cody Banks</a></i> (2003)</li>
    <li><i><a href="/wiki/Agent_Cody_Banks_2:_Destination_London" title="Agent Cody Banks 2: Destination London">Agent Cody Banks 2: Destination London</a></i> (2004)</li>
    <li>Air Bud series:
        <ul>
            <li><i><a href="/wiki/Air_Bud:_World_Pup" title="Air Bud: World Pup">Air Bud: World Pup</a></i> (2000)</li>
            <li><i><a href="/wiki/Air_Bud:_Seventh_Inning_Fetch" title="Air Bud: Seventh Inning Fetch">Air Bud: Seventh Inning Fetch</a></i> (2002)</li>
            <li><i><a href="/wiki/Air_Bud:_Spikes_Back" title="Air Bud: Spikes Back">Air Bud: Spikes Back</a></i> (2003)</li>
            <li><i><a href="/wiki/Air_Buddies" title="Air Buddies">Air Buddies</a></i> (2006)</li>
        </ul>
    </li>
    <li><i><a href="/wiki/Akeelah_and_the_Bee" title="Akeelah and the Bee">Akeelah and the Bee</a></i> (2006)</li>
...
</ul>

But I'm unsure of where to go from here.

Update:

Final Code:

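lis = []
for ul in uls:
    for li in ul.findAll('li'):
        # Skip any <li> that wraps a nested <ul>: findAll() already
        # returns the nested <li>s on their own, so keeping the parent
        # would print their text twice.
        if li.find('ul'):
            continue
        lis.append(li)

for li in lis:
    print(li.text)

The if...continue skips the LIs that contain ULs, since their nested LIs are returned separately by findAll() and would otherwise be duplicated.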

Print output is now:

102 Dalmatians(2000)
10th & Wolf(2006)
11:14(2006)
12:08 East of Bucharest(2006)
13 Going on 30(2004)
1408(2007)
...
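To see the filtering on markup shaped like the excerpt above, here is a small self-contained sketch. The `EXCERPT` string is a shortened copy of the excerpt and `leaf_item_texts` is a hypothetical helper name; `get_text(" ", strip=True)` is used instead of `.text` so the title and year stay separated by a space.

```python
from bs4 import BeautifulSoup

# Shortened copy of the excerpt: one parent <li> wrapping a nested <ul>.
EXCERPT = """
<ul>
    <li><i><a href="/wiki/Agent_Cody_Banks">Agent Cody Banks</a></i> (2003)</li>
    <li>Air Bud series:
        <ul>
            <li><i><a href="/wiki/Air_Bud:_World_Pup">Air Bud: World Pup</a></i> (2000)</li>
            <li><i><a href="/wiki/Air_Buddies">Air Buddies</a></i> (2006)</li>
        </ul>
    </li>
    <li><i><a href="/wiki/Akeelah_and_the_Bee">Akeelah and the Bee</a></i> (2006)</li>
</ul>
"""


def leaf_item_texts(ul):
    """Return the text of every <li> under `ul` that has no nested <ul>."""
    texts = []
    for li in ul.find_all("li"):
        if li.find("ul"):
            continue  # parent of a nested list; its children are visited anyway
        texts.append(li.get_text(" ", strip=True))
    return texts


soup = BeautifulSoup(EXCERPT, "html.parser")
titles = leaf_item_texts(soup.find("ul"))
for title in titles:
    print(title)
```

Using `break` in place of `continue` would end the inner loop at the "Air Bud series" item and drop every later entry in that list, which is why the skip matters.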

请别遗忘我 2024-10-13 19:50:56

.findAll() works for nested li elements:

for ul in uls:
    for li in ul.findAll('li'):
        print(li)

Output:

<li>List items</li>
<li>Etc...</li>
<li>List items</li>
<li>Nested list items</li>
<li>Nested list items</li>
<li>List items</li>
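As a rough illustration of why nested items show up, the sketch below contrasts the default recursive search with `recursive=False`, which only looks at direct children. The markup and the names `NESTED_HTML`, `all_items` and `direct_items` are invented for this example, and `find_all()` is the bs4 spelling of `findAll()`.

```python
from bs4 import BeautifulSoup

# Invented markup: item B wraps a nested list with two entries.
NESTED_HTML = """
<ul id="outer">
    <li>A</li>
    <li>B
        <ul>
            <li>B1</li>
            <li>B2</li>
        </ul>
    </li>
    <li>C</li>
</ul>
"""

soup = BeautifulSoup(NESTED_HTML, "html.parser")
outer = soup.find("ul", id="outer")

# Default: search all descendants, so nested <li> tags are included.
all_items = outer.find_all("li")

# recursive=False: only <li> tags that are direct children of `outer`.
direct_items = outer.find_all("li", recursive=False)

print(len(all_items), len(direct_items))
```

Note that the parent item B still contains the text of B1 and B2, so printing every result of the recursive search repeats the nested text.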

蓝天 2024-10-13 19:50:56

A list comprehension could work, too.

lis = [li for ul in uls for li in ul.findAll('li')]
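The comprehension can also carry a condition. The sketch below is one possible variant, with placeholder titles and a `uls` list built from an inline string standing in for the one from the question; it flattens several lists and leaves out any li that wraps a nested ul.

```python
from bs4 import BeautifulSoup

# Placeholder markup: two top-level lists, the second with a nested list.
html = """
<ul><li>First (2001)</li><li>Second (2002)</li></ul>
<ul><li>Series:<ul><li>Part one (2003)</li></ul></li><li>Third (2004)</li></ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Top-level lists only, standing in for the `uls` built in the question.
uls = soup.find_all("ul", recursive=False)

# Flatten every list and keep only items without a nested <ul>.
titles = [
    li.get_text(" ", strip=True)
    for ul in uls
    for li in ul.find_all("li")
    if not li.find("ul")
]
print(titles)
```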

春风十里 2024-10-13 19:50:56

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it (the "lxml" parser must be installed separately).
r = requests.get("https://www.w3schools.com/tags/tryit.asp?filename=tryhtml_list_test")
soup = BeautifulSoup(r.content, "lxml")

# find_all('li') searches recursively, so nested <li> tags are included.
for body in soup.find_all('body'):
    for li in body.find_all('li'):
        print(li)

Note: this gets every "li" inside the element you call find_all() on, which here is the body.
