Python BeautifulSoup Wikipedia web scraping - learning

Posted on 2025-02-02 02:43:51

I am learning Python and BeautifulSoup.

I am trying to do some web scraping.

Let me first describe what I am trying to do.

The wiki page: https://en.m.wikipedia.org/wiki/List_of_largest_banks

I am trying to print out the text of this element:

<span class="mw-headline" id="By_market_capitalization" tabindex="0" role="button" aria-controls="content-collapsible-block-1" aria-expanded="true">By market capitalization</span>

I want to print out the text: By market capitalization

Then I want to print the text of the table of banks that follows it, for example:
By market capitalization

Rank	Bank	Cap Rate
1	JP Morgan	466.1
2	Bank of China	300

and so on, all the way to rank 50.

My code starts out like this:

from bs4 import BeautifulSoup
import requests

html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
# text = soup.find('span', class_='mw-headline', id='By_market_capitalization').text 
Ak_soup = soup.find_all('section', class_='mf-section-2 collapsible-block open-block', id='content-collapsible-block-1')
print(Ak_soup)

I believe my problem is more on the HTML side of things, but I am completely lost.
I inspected the element, and the tag I believe I should look for is:

{section class_='mf-section-2 collapsible-block open-block'}

节枝 2025-02-09 02:43:52

You are close to your goal: find the heading, then the next table after it, and convert that table to a DataFrame with pandas.read_html().

header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(str(header.find_next('table')))[0]

or

header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(html_text, match='Market cap')[0]

Example

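from bs4 import BeautifulSoup
import requests
import pandas as pd  # DataFrame.to_markdown() also needs the tabulate package

html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')

# Select the <h2> whose direct child has the id of the wanted section
header = soup.select_one('h2:has(>#By_market_capitalization)')

# Print the heading text, then the first table that follows it
print(header.span.text)
print(pd.read_html(str(header.find_next('table')))[0].to_markdown(index=False))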

Output

By market capitalization

Rank	Bank name	Market cap(US$ billion)
1	JPMorgan Chase	466.21[5]
2	Industrial and Commercial Bank of China	295.65
3	Bank of America	279.73
4	Wells Fargo	214.34
5	China Construction Bank	207.98
6	Agricultural Bank of China	181.49
7	HSBC Holdings PLC	169.47
8	Citigroup Inc.	163.58
9	Bank of China	151.15
10	China Merchants Bank	133.37
11	Royal Bank of Canada	113.80
12	Toronto-Dominion Bank	106.61

...
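If pandas is not available, the same "find the heading, then take the next table" idea can be sketched with BeautifulSoup alone. The snippet below is only an illustration: SAMPLE_HTML is a hypothetical, trimmed-down stand-in for the page (it copies the span from the question and the first two rows of the output above), and heading_and_rows is a helper name made up for this example. The markup that Wikipedia actually serves may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page, so the sketch runs offline.
SAMPLE_HTML = """
<h2><span class="mw-headline" id="By_market_capitalization">By market capitalization</span></h2>
<table>
  <tr><th>Rank</th><th>Bank name</th><th>Market cap (US$ billion)</th></tr>
  <tr><td>1</td><td>JPMorgan Chase</td><td>466.21</td></tr>
  <tr><td>2</td><td>Industrial and Commercial Bank of China</td><td>295.65</td></tr>
</table>
"""


def heading_and_rows(html, heading_id):
    """Return the heading text and the rows of the first table after it."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find(id=heading_id)
    if heading is None:
        raise ValueError(f"no element with id={heading_id!r}")
    table = heading.find_next("table")
    if table is None:
        raise ValueError("no table found after the heading")
    # One list of cell texts per table row (header cells included).
    rows = [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]
    return heading.get_text(strip=True), rows


title, rows = heading_and_rows(SAMPLE_HTML, "By_market_capitalization")
print(title)
for row in rows:
    print("\t".join(row))
```

Looking the element up by id avoids depending on which tag wraps the heading, which may make the lookup a little less sensitive to markup changes.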


浊酒尽余欢 2025-02-09 02:43:52

Since you already know the desired header, you can simply print it directly. Then, with pandas, you can use a unique search term from the target table as a more direct way to select it:

import pandas as pd

# match= keeps only the tables whose text contains 'Market cap'
df = pd.read_html('https://en.m.wikipedia.org/wiki/List_of_largest_banks', match='Market cap')[0].reset_index(level=0, drop=True)
print('By market capitalization')
print()
print(df.to_markdown(index=False))
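A side note on the find_all call in the question: when class_ is given a string containing several classes, BeautifulSoup compares it against the whole class attribute value, so a different class order or a missing class gives no match. The sketch below shows this on a hypothetical snippet (SAMPLE_HTML is made up for illustration). Whether this is what actually caused the empty result depends on the HTML that requests receives, which is not shown here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two of the classes from the question, in a different
# order and without 'open-block'.
SAMPLE_HTML = """
<section class="collapsible-block mf-section-2" id="content-collapsible-block-1">
  <p>table goes here</p>
</section>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A multi-class string must equal the full class attribute value,
# so this finds nothing here.
exact = soup.find_all("section", class_="mf-section-2 collapsible-block open-block")

# Matching on a single class is less brittle.
single = soup.find_all("section", class_="mf-section-2")

# A CSS selector requires every listed class but ignores their order.
css = soup.select("section.mf-section-2.collapsible-block")

print(len(exact), len(single), len(css))
```

Selecting by id, as in the answer above, sidesteps the class question entirely.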
