Python BeautifulSoup Wikipedia Web Scraping - Learning

Posted 2025-02-02 02:43:51 · 1676 characters · 4 views · 0 comments

I am learning Python and BeautifulSoup.

I am trying to do some web scraping.

Let me first describe what I am trying to do.

The wiki page: https://en.m.wikipedia.org/wiki/List_of_largest_banks

I am trying to print out the

<span class="mw-headline" id="By_market_capitalization" tabindex="0" role="button" aria-controls="content-collapsible-block-1" aria-expanded="true">By market capitalization</span>

I want to print out the text: By market capitalization

Then the text of the table of banks, for example:

By market capitalization

| Rank | Bank | Cap Rate |
|---|---|---|
| 1 | JP Morgan | 466.1 |
| 2 | Bank of China | 300 |

all the way to 50.

My code starts out like this:

from bs4 import BeautifulSoup
import requests

html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
# text = soup.find('span', class_='mw-headline', id='By_market_capitalization').text
Ak_soup = soup.find_all('section', class_='mf-section-2 collapsible-block open-block', id='content-collapsible-block-1')
print(Ak_soup)

I believe my problem is more on the HTML side of things, but I am completely lost.
I inspected the element, and the tags I believe I should look for are:

{section class_='mf-section-2 collapsible-block open-block'}
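
A possible reason the `find_all` call above prints an empty list: the code requests the desktop URL (en.wikipedia.org), while the inspected markup came from the mobile page (en.m.wikipedia.org), and classes seen in a browser inspector can be added by JavaScript after the page loads. In addition, BeautifulSoup treats a multi-word `class_` string as an exact match against the whole attribute value. The sketch below illustrates only that second point, on a made-up HTML fragment (the fragment and the variable names are assumptions, not the real Wikipedia markup):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the class list lacks "open-block", as it might
# before any JavaScript has run in a browser.
html = '<section class="mf-section-2 collapsible-block" id="s2"><p>banks</p></section>'
soup = BeautifulSoup(html, "html.parser")

# A multi-word class_ string must equal the full class attribute value.
exact = soup.find_all("section", class_="mf-section-2 collapsible-block open-block")

# A single class name matches any element that carries that class.
by_one_class = soup.find_all("section", class_="mf-section-2")

# A CSS selector also matches on individual classes, in any order.
by_css = soup.select("section.collapsible-block.mf-section-2")

print(len(exact), len(by_one_class), len(by_css))
```

If the exact-string lookup finds nothing while the single-class lookup succeeds, the class list in the fetched HTML probably differs from what the inspector shows.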

Comments (2)

节枝 2025-02-09 02:43:52

You are close to your goal. Find the heading, then its next table, and transform that table into a DataFrame via pandas.read_html().

header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(str(header.find_next('table')))[0]

or

header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(html_text, match='Market cap')[0]

Example

from bs4 import BeautifulSoup
import requests
import pandas as pd

html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')

# Select the <h2> that has a direct child with id="By_market_capitalization"
header = soup.select_one('h2:has(>#By_market_capitalization)')

print(header.span.text)
# to_markdown() requires the optional "tabulate" package
print(pd.read_html(str(header.find_next('table')))[0].to_markdown(index=False))

Output

By market capitalization

| Rank | Bank name | Market cap(US$ billion) |
|---|---|---|
| 1 | JPMorgan Chase | 466.21[5] |
| 2 | Industrial and Commercial Bank of China | 295.65 |
| 3 | Bank of America | 279.73 |
| 4 | Wells Fargo | 214.34 |
| 5 | China Construction Bank | 207.98 |
| 6 | Agricultural Bank of China | 181.49 |
| 7 | HSBC Holdings PLC | 169.47 |
| 8 | Citigroup Inc. | 163.58 |
| 9 | Bank of China | 151.15 |
| 10 | China Merchants Bank | 133.37 |
| 11 | Royal Bank of Canada | 113.80 |
| 12 | Toronto-Dominion Bank | 106.61 |

...
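
The heading-then-next-table idea can also be tried without a network call and without pandas. This is a minimal sketch on a made-up fragment: the HTML, the placeholder "Example Bank" row, and the helper name `table_after_heading` are illustrative assumptions, and the two market-cap rows are copied from the output above. It looks the heading up by `id` alone, so it does not depend on which tag carries the id:

```python
from bs4 import BeautifulSoup

# Made-up fragment with two sections; only the second one is wanted.
html = """
<h2><span class="mw-headline" id="By_total_assets">By total assets</span></h2>
<table><tr><th>Rank</th><th>Bank name</th></tr>
<tr><td>1</td><td>Example Bank</td></tr></table>
<h2><span class="mw-headline" id="By_market_capitalization">By market capitalization</span></h2>
<table>
<tr><th>Rank</th><th>Bank name</th><th>Market cap (US$ billion)</th></tr>
<tr><td>1</td><td>JPMorgan Chase</td><td>466.21</td></tr>
<tr><td>2</td><td>Industrial and Commercial Bank of China</td><td>295.65</td></tr>
</table>
"""


def table_after_heading(soup, heading_id):
    """Return (heading text, rows) for the first table after the element with heading_id."""
    heading = soup.find(id=heading_id)
    table = heading.find_next("table")
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    return heading.get_text(strip=True), rows


soup = BeautifulSoup(html, "html.parser")
title, rows = table_after_heading(soup, "By_market_capitalization")

print(title)
for row in rows:
    print(" | ".join(row))
```

On a live page, `soup.find(id=...)` may return `None` if the markup differs from what was inspected, so checking for that before calling `find_next` is probably worthwhile.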

浊酒尽余欢 2025-02-09 02:43:52

Since you already know the desired header, you can simply print it directly. Then, with pandas, you can use a unique search term from the target table as a more direct way to select it:

import pandas as pd

df = pd.read_html('https://en.m.wikipedia.org/wiki/List_of_largest_banks', match = 'Market cap')[0].reset_index(level = 0,  drop = True)
print('By market capitalization')
print()
print(df.to_markdown(index = False))
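
To see how `match` narrows the result, here is a small offline sketch on a made-up two-table fragment (the HTML and the "Total assets" table are assumptions; the bank figures are copied from the first answer's output). It wraps the HTML in `io.StringIO`, since recent pandas versions warn when literal HTML is passed as a plain string, and it assumes an HTML parser such as `lxml` is installed for `read_html`:

```python
from io import StringIO

import pandas as pd

# Made-up fragment: only the second table contains the text "Market cap".
html = """
<table><tr><th>Rank</th><th>Total assets</th></tr>
<tr><td>1</td><td>100</td></tr></table>
<table><tr><th>Rank</th><th>Bank name</th><th>Market cap</th></tr>
<tr><td>1</td><td>JPMorgan Chase</td><td>466.21</td></tr>
<tr><td>2</td><td>Bank of America</td><td>279.73</td></tr></table>
"""

# match keeps only the tables whose text matches the given string or regex.
tables = pd.read_html(StringIO(html), match="Market cap")
df = tables[0]

print(len(tables))
print(df.to_string(index=False))
```

If several tables on a real page contain the search term, `read_html` returns all of them, so the `[0]` index is only safe when the term is unique to the target table.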