Python BeautifulSoup Wikipedia web scraping (learning)
I am learning Python and BeautifulSoup, and I am trying to do some web scraping.
Let me first describe what I am trying to do.
The wiki page: https://en.m.wikipedia.org/wiki/List_of_largest_banks
I am trying to print out the text of this element:
<span class="mw-headline" id="By_market_capitalization" tabindex="0" role="button" aria-controls="content-collapsible-block-1" aria-expanded="true">By market capitalization</span>
I want to print out the text: By market capitalization
Then the text of the table of the banks:
Example:
By market capitalization
Rank | Bank | Cap Rate |
---|---|---|
1 | JP Morgan | 466.1 |
2 | Bank of China | 300 |
and so on, all the way to rank 50.
My code starts out like this:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
# text = soup.find('span', class_='mw-headline', id='By_market_capitalization').text
Ak_soup = soup.find_all('section', class_='mf-section-2 collapsible-block open-block', id='content-collapsible-block-1')
print(Ak_soup)
I believe my problem is more on the HTML side of things, but I am completely lost.
I inspected the element, and the tag I believe I should look for is:
{section class_='mf-section-2 collapsible-block open-block'}
Comments (2)
You are close to your goal: find the heading, then the next table element after it, and convert that table to a DataFrame via pandas.read_html().
Example
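A minimal sketch of the heading-then-next-table approach described above. It looks the heading up by its id rather than by the section classes seen in the mobile page inspector, since those classes may not be present in the HTML that requests receives. The function name `heading_and_table` and the inline `SAMPLE_HTML` are hypothetical stand-ins (the two rows reuse the illustrative values from the question); for the live page you would pass `requests.get(url).text` instead. It also assumes pandas has an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page source. For the live page, use:
# html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
SAMPLE_HTML = """
<h2><span class="mw-headline" id="By_market_capitalization">By market capitalization</span></h2>
<table class="wikitable">
  <tr><th>Rank</th><th>Bank</th><th>Market cap</th></tr>
  <tr><td>1</td><td>JP Morgan</td><td>466.1</td></tr>
  <tr><td>2</td><td>Bank of China</td><td>300</td></tr>
</table>
"""


def heading_and_table(html_text, heading_id):
    """Return (heading text, DataFrame of the first table after the heading)."""
    soup = BeautifulSoup(html_text, "html.parser")
    # Look up by id only, so the tag type around the heading does not matter.
    heading = soup.find(id=heading_id)
    if heading is None:
        raise ValueError(f"no element with id={heading_id!r}")
    table = heading.find_next("table")
    if table is None:
        raise ValueError("no table found after the heading")
    # read_html returns a list of DataFrames; this snippet holds exactly one table.
    df = pd.read_html(StringIO(str(table)))[0]
    return heading.get_text(strip=True), df


title, banks = heading_and_table(SAMPLE_HTML, "By_market_capitalization")
print(title)
print(banks)
```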
Output
By market capitalization
...
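If the goal is only to print the rows in the pipe-separated form shown in the question, the same idea can be sketched with BeautifulSoup alone, without pandas. This is an illustration under assumptions, not part of the original answer: `table_rows` and `SAMPLE_HTML` are hypothetical names, and the sample rows reuse the illustrative values from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page source; replace with requests.get(url).text.
SAMPLE_HTML = """
<h2><span class="mw-headline" id="By_market_capitalization">By market capitalization</span></h2>
<table class="wikitable">
  <tr><th>Rank</th><th>Bank</th><th>Market cap</th></tr>
  <tr><td>1</td><td>JP Morgan</td><td>466.1</td></tr>
  <tr><td>2</td><td>Bank of China</td><td>300</td></tr>
</table>
"""


def table_rows(html_text, heading_id):
    """Return (heading text, list of rows), each row a list of cell strings."""
    soup = BeautifulSoup(html_text, "html.parser")
    heading = soup.find(id=heading_id)
    if heading is None:
        raise ValueError(f"no element with id={heading_id!r}")
    table = heading.find_next("table")
    if table is None:
        raise ValueError("no table found after the heading")
    rows = []
    for tr in table.find_all("tr"):
        # Collect header and data cells alike, stripped of surrounding whitespace.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return heading.get_text(strip=True), rows


title, rows = table_rows(SAMPLE_HTML, "By_market_capitalization")
print(title)
for row in rows:
    print(" | ".join(row))
```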
Since you already know the desired header, you can simply print it directly. Then, with pandas, you can use a unique search term from the target table as a more direct selection method:
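A minimal sketch of that selection, using the match parameter of pandas.read_html(), which returns only the tables whose text matches the given string or regex. The search term "Market cap" and the inline `SAMPLE_HTML` are assumptions for illustration; on the real page you would pick a term that appears only in the target table. It assumes pandas has an HTML parser such as lxml installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in with two tables; only the second contains "Market cap".
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Bank</th><th>Total assets</th></tr>
  <tr><td>1</td><td>Bank of China</td><td>300</td></tr>
</table>
<table>
  <tr><th>Rank</th><th>Bank</th><th>Market cap</th></tr>
  <tr><td>1</td><td>JP Morgan</td><td>466.1</td></tr>
  <tr><td>2</td><td>Bank of China</td><td>300</td></tr>
</table>
"""

# The header is already known, so it can be printed as a literal.
print("By market capitalization")

# match keeps only tables whose text contains the search term.
tables = pd.read_html(StringIO(SAMPLE_HTML), match="Market cap")
banks = tables[0]
print(banks)
```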