How can I get different texts separately from one tag in Beautiful Soup?

Posted on 2025-02-07 08:00:00

I am trying to scrape Disney Pictures films data from this Wikipedia page: https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films

This is my code:

import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

url="https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films"
page=requests.get(url).content
soup=bs(page,"html.parser")

tbodies=soup.find_all("tbody")
for tbody in tbodies:
    trs=tbody.find_all("tr")
    for tr in trs:
        tds=tr.find_all("td")
        for td in tds:
            print(td.text)

This is a screenshot of the inspect pane:
[Image: screenshot of the inspect pane]

As you can see, the different texts I want to get (title, date and notes) are in this highlighted "td" tag.

I tried print(td[0].text) and print(td[2].text) at the end of my code, but both raise an error.

How can I print these three different texts separately?

P.S. I don't want to use pd.read_html(url)

Comments (1)

裂开嘴轻声笑有多痛 2025-02-14 08:00:00

To get the different texts separately, you can use CSS selectors instead of list slicing:

import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

url = "https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films"
page = requests.get(url).content
soup = bs(page, "html.parser")

# Titles: the link inside the italicised title in each table row.
titles = [x.get_text(strip=True) for x in soup.select('.wikitable.sortable tbody tr td i a')]

# Dates and notes: the cells that are the second and third child of their row.
dates = [x.get_text(strip=True) for x in soup.select('.wikitable.sortable tbody tr td:nth-child(2)')]
notes = [x.get_text(strip=True) for x in soup.select('.wikitable.sortable tbody tr td:nth-child(3)')]

# zip() pairs the three lists by position and stops at the shortest one.
df = pd.DataFrame(data=list(zip(titles, dates, notes)), columns=['Title', 'Date', 'Note'])
print(df)

Output:

                                            Title  ...                                               Note
0    Academy Award Review of Walt Disney Cartoons  ...      Anthology film. Distributed byUnited Artists.
1                 Snow White and the Seven Dwarfs  ...  First film to be distributed byRKO Radio Pictu...
2                                       Pinocchio  ...
3                                        Fantasia  ...                                     Anthology film
4                            The Reluctant Dragon  ...        Fictionalized tour around the Disney studio
..                                            ...  ...                                                ...
531                   The Return of the Rocketeer  ...                   co-production withThese Pictures
532                               Tower of Terror  ...
533                                    Tron: Ares  ...                         co-production withRideback
534                                  FC Barcelona  ...          co-production withPixar Animation Studios
535                       Young Woman and the Sea  ...

[536 rows x 3 columns]
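
A note on the original error and on the output above. In the question's loop, td is a single Tag, and square brackets on a Tag look up HTML attributes, so td[0] raises a KeyError; it is the list tds that can be indexed. Also, selecting each column on its own and zipping the lists can put rows out of step when some row lacks a cell, which may be why a few notes in the output look like they belong to other titles. And get_text(strip=True) joins text pieces with no separator, which gives "byUnited". Below is a minimal sketch of a row-by-row alternative. SAMPLE_HTML and extract_rows are hypothetical names and the markup is invented for illustration; the real page's layout (header cells, rowspans, column order) may differ, so the cell positions are an assumption to check in the inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for one table on the page;
# the real page's layout may differ (header cells, rowspans, column order).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tbody>
    <tr><th>Title</th><th>Date</th><th>Notes</th></tr>
    <tr>
      <td><i><a href="#">Film A</a></i></td>
      <td>January 1, 2000</td>
      <td>Distributed by <a href="#">Studio X</a></td>
    </tr>
    <tr>
      <td><i><a href="#">Film B</a></i></td>
      <td>February 2, 2001</td>
      <td></td>
    </tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return one (title, date, notes) tuple per data row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table.wikitable tr"):
        tds = tr.find_all("td")  # a list of cells, so tds[0] is valid
        if len(tds) < 3:
            continue  # skip header rows and rows missing a cell
        # The " " separator keeps a space between plain text and a nested link.
        title, date, notes = (td.get_text(" ", strip=True) for td in tds[:3])
        rows.append((title, date, notes))
    return rows


rows = extract_rows(SAMPLE_HTML)
for title, date, notes in rows:
    print(title, "|", date, "|", notes)
```

Because each tuple is built from the cells of one tr, a title can only be paired with the date and notes from its own row. The list of tuples can be passed to pd.DataFrame(rows, columns=['Title', 'Date', 'Note']) if a DataFrame is wanted.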