How can I get the different texts in one tag separately with Beautiful Soup?

Posted 2025-02-07 08:00:00 · 993 characters · 1 view · 0 comments

I am trying to scrape Disney Pictures films data from this Wikipedia page: https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films

This is my code:

import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

url="https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films"
page=requests.get(url).content
soup=bs(page,"html.parser")

tbodies=soup.find_all("tbody")
for tbody in tbodies:
    trs=tbody.find_all("tr")
    for tr in trs:
        tds=tr.find_all("td")
        for td in tds:
            print(td.text)

This is a screenshot of the inspect pane:

As you can see, the different texts I want to get (title, date and notes) are in this highlighted "td" tag.

I tried print(td[0].text) and print(td[2].text) at the end of my code, but both raise an error.

How can I print these three different texts separately?

P.S. I don't want to use pd.read_html(url)

裂开嘴轻声笑有多痛 2025-02-14 08:00:00

To get the different texts separately, you can use CSS selectors instead of list indexing:

import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

url="https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films"
page=requests.get(url).content
soup=bs(page,"html.parser")
# Each selector collects one column across every sortable wikitable on the page.
titles=[x.get_text(strip=True) for x in soup.select('.wikitable.sortable tbody tr td i a')]
dates=[x.get_text(strip=True) for x in soup.select('.wikitable.sortable tbody tr td:nth-child(2)')]
notes=[x.get_text(strip=True) for x in soup.select('.wikitable.sortable tbody tr td:nth-child(3)')]

# zip pairs the three lists by position and stops at the shortest one.
df = pd.DataFrame(data=list(zip(titles,dates,notes)),columns=['Title','Date', 'Note'])
print(df)

Output:

                                            Title  ...
0    Academy Award Review of Walt Disney Cartoons  ...      Anthology film. Distributed byUnited Artists.
1                 Snow White and the Seven Dwarfs  ...  First film to be distributed byRKO Radio Pictu...
2                                       Pinocchio  ...
3                                        Fantasia  ...                                     Anthology film
4                            The Reluctant Dragon  ...        Fictionalized tour around the Disney studio
..                                            ...  ...                                                ...
531                   The Return of the Rocketeer  ...                   co-production withThese Pictures
532                               Tower of Terror  ...
533                                    Tron: Ares  ...                         co-production withRideback
534                                  FC Barcelona  ...          co-production withPixar Animation Studios
535                       Young Woman and the Sea  ...

[536 rows x 3 columns]
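
One caveat with selecting each column independently: zip pairs items purely by position, so if some row has no linked title or is missing a cell, the columns can drift out of step. As for the error in the question, td there is a single Tag, and indexing a Tag looks up an HTML attribute, so td[0] fails; tds (the result of find_all) is the list that can be indexed. Below is a minimal row-by-row sketch of that idea. It runs on a small hypothetical HTML sample rather than the live page, the helper name extract_rows is made up for illustration, and it assumes every data row holds at least three plain td cells, which the real Wikipedia markup may not always satisfy (for example with rowspans).

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the shape of one film table;
# the markup of the real page may differ.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tbody>
    <tr><th>Title</th><th>Release date</th><th>Notes</th></tr>
    <tr><td><i><a href="#">Pinocchio</a></i></td><td>February 7, 1940</td><td></td></tr>
    <tr><td><i><a href="#">Fantasia</a></i></td><td>November 13, 1940</td><td>Anthology film</td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return one (title, date, notes) tuple per data row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("table.wikitable.sortable tr"):
        tds = tr.find_all("td")  # a list of cells, so tds[0] is valid
        if len(tds) < 3:
            continue  # skip header rows and rows without three cells
        # Read all three cells from the same row so they stay aligned.
        title, date, notes = (td.get_text(" ", strip=True) for td in tds[:3])
        rows.append((title, date, notes))
    return rows


for row in extract_rows(SAMPLE_HTML):
    print(row)
```

To try this on the live page, pass requests.get(url).content to extract_rows; the resulting list of tuples can go straight into pd.DataFrame(rows, columns=['Title','Date','Note']). Passing " " as the separator to get_text also keeps a space between linked and unlinked text inside one cell.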
