Issue scraping data with Beautiful Soup

Posted 2025-01-11 12:04:25

I am working on scraping the countries of astronauts from this website: https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch%20order. I am using BeautifulSoup to perform this task, but I'm having some issues. Here is my code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

data = []

url = 'https://www.supercluster.com/astronauts?ascending=false&limit=72&list=true&sort=launch%20order'

r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
tags = soup.find_all('div', class_ ='astronaut_index__content container--xl mxa f fr fw aifs pl15 pr15 pt0')

for item in tags:
    name = item.select_one('bau astronaut_cell__title bold mr05')
    country = item.select_one('mouseover__contents rel py05 px075 bau caps small ac').get_text(strip = True)
    data.append([name,country])
    
df = pd.DataFrame(data)

df

df is returning an empty list. Not sure what is going on. When I take the code out of the for loop, it can't seem to find the select_one function. Function should be coming from bs4 - not sure why that's not working. Also, is there a repeatable pattern for web scraping that I'm missing? Seems like it's a different beast every time I try to tackle these kinds of problems.

Any help would be appreciated! Thank you!

2 Comments

ぇ气 2025-01-18 12:04:25

The URL's data is generated dynamically by JavaScript, and BeautifulSoup can't grab dynamic data, so you can use an automation tool such as Selenium together with BeautifulSoup. Here I use Selenium to render the page and BeautifulSoup to parse it; just run the code. Note also that select_one expects a CSS selector, so each class name needs a leading dot (.bau.astronaut_cell__title.bold.mr05), not a bare space-separated class string.

Script:

from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time


data = []

url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'

# Let Selenium render the JavaScript-generated page, then hand the HTML to BeautifulSoup
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.get(url)
time.sleep(5)  # give the page time to finish rendering

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()

# Each astronaut is rendered as an .astronaut_cell.x card
tags = soup.select('.astronaut_cell.x')

for item in tags:
    name = item.select_one('.bau.astronaut_cell__title.bold.mr05').get_text()
    # The country badge is missing from some cards, so guard against None
    country = item.select_one('.mouseover__contents.rel.py05.px075.bau.caps.small.ac')
    if country:
        country = country.get_text()
    data.append([name, country])

cols = ['name', 'country']
df = pd.DataFrame(data, columns=cols)

print(df)

Output:

name                   country
0       Bess, Cameron  United States of America
1          Bess, Lane  United States of America
2          Dick, Evan  United States of America
3       Taylor, Dylan  United States of America
4    Strahan, Michael  United States of America
..                ...                       ...
295     Jones, Thomas  United States of America
296      Sega, Ronald  United States of America
297     Usachov, Yury                    Russia
298   Fettman, Martin  United States of America
299       Wolf, David  United States of America

[300 rows x 2 columns]
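
A side note on the fixed time.sleep(5) call: Selenium also provides explicit waits, which block only until the element you need has actually been rendered. A minimal sketch of that variant (same page, same .astronaut_cell.x selector as above):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager

url = 'https://www.supercluster.com/astronauts?ascending=false&limit=300&list=true&sort=launch%20order'

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)

# Wait up to 15 seconds for at least one astronaut card to appear,
# instead of sleeping for a fixed interval
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.astronaut_cell.x'))
)

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.close()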
月亮是我掰弯的 2025-01-18 12:04:25

The page is dynamically loaded using JavaScript, so requests can't get to it directly. The data is loaded from another address and received in JSON format. You can get to it this way:

url = "https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb_mobile.json"
req = requests.get(url)
data = json.loads(req.text)

Once you have it loaded, you can iterate through it and retrieve relevant information. For example:

for astro in data['astronauts']:
    print(astro['astroNumber'], astro['firstName'], astro['lastName'], astro['rank'])

Output:

1 Yuri Gagarin Colonel
10 Walter Schirra Captain
100 Georgi Ivanov Major General
101 Leonid Popov Major General
102 Bertalan Farkas Brigadier General

etc.

You can then load the output into a pandas DataFrame or whatever else you need.
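
For example, a minimal sketch that puts the same JSON into a DataFrame, keeping only the fields shown above (the feed contains more; the exact key holding each astronaut's country isn't shown here, so check the feed's keys for it):

import json
import requests
import pandas as pd

# Fetch the JSON feed that backs the astronaut list page
url = "https://supercluster-iadb.s3.us-east-2.amazonaws.com/adb_mobile.json"
data = json.loads(requests.get(url).text)

# Flatten the astronaut records and keep a few known columns
df = pd.DataFrame(data['astronauts'])[['astroNumber', 'firstName', 'lastName', 'rank']]
print(df.head())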
