How do I drop a sub-header from a Wikipedia table?

Posted 2025-01-12 13:35:16 · 999 characters · 3 views · 0 comments

I am trying to scrape a Wikipedia table into a DataFrame. From that table I want to drop Population density, Land area, and specifically the Rank sub-column under Population. In the end I want to keep only State or territory and Population (People).

https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density

Here is my code:

    wiki = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density"
    table_class="wikitable sortable jquery-tablesorter"
    response=requests.get(wiki)
    
    soup = BeautifulSoup(response.text, 'html.parser')
    indiatable=soup.find('table',{'class':"wikitable"})

    df=pd.read_html(str(indiatable))
    df=pd.DataFrame(df[0])
    
    data = df.drop(["Population density","Population"["Rank"],"Land area"], axis=1)
   
    wikidata = data.rename(columns={"State or territory": "State","Population": "Population"})
    print (wikidata.head())

How do I reference that specific sub-header so that I can drop the Rank column under Population?

Comments (1)

Hello爱情风 2025-01-19 13:35:17

Note: Your question does not show an expected result, so you may need to adjust the headers. I assumed you want to rename People to Population, rather than the top-level Population header itself, and changed the code accordingly.

To reach your goal, set the header parameter when reading the HTML so that only the second header row is used as column names. That way you do not need to drop the first row separately:

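    # Needs `from io import StringIO`; newer pandas versions deprecate passing a raw HTML string.
    df = pd.read_html(StringIO(str(indiatable)), header=1)[0]
    df = df.rename(columns={"State or territory": "State", "People": "Population"}).drop(["Rank"], axis=1)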
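For comparison, if both header rows are kept, pandas builds MultiIndex columns, and a sub-header can then be addressed by its (top, sub) tuple, which is what the question literally asks for. The following is only a minimal sketch on a hand-made frame: the column labels are assumed to mirror the Wikipedia table, the rank values are placeholders, and nothing here is read from the actual page.

```python
import pandas as pd

# Hypothetical stand-in for the parsed table: two header rows become MultiIndex columns.
columns = pd.MultiIndex.from_tuples([
    ("State or territory", "State or territory"),
    ("Population", "Rank"),
    ("Population", "People"),
    ("Land area", "mi2"),
])
df = pd.DataFrame(
    [
        ["New Jersey", 11, 9288994, 7354],     # rank values are placeholders
        ["Rhode Island", 43, 1097379, 1034],
    ],
    columns=columns,
)

# A single sub-header is addressed by its (top, sub) tuple.
data = df.drop(columns=[("Population", "Rank")])

# Dropping by top-level label removes every sub-column in that group.
data = data.drop(columns="Land area", level=0)

# Flatten to single-level names by keeping the second header level.
data.columns = data.columns.get_level_values(1)
data = data.rename(columns={"State or territory": "State", "People": "Population"})
print(data)
```

Whether this or `header=1` is more convenient probably depends on how many groups share a sub-header name such as Rank.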

Example

    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    wiki = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density"
    response = requests.get(wiki)

    soup = BeautifulSoup(response.text, "html.parser")
    # First table on the page that carries the "wikitable" class
    indiatable = soup.find("table", {"class": "wikitable"})

    # header=1 takes the second header row as column names and skips the top-level group row
    df = pd.read_html(StringIO(str(indiatable)), header=1)[0]
    df = df.rename(columns={"State or territory": "State", "People": "Population"}).drop(["Rank"], axis=1)

Output

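| State | Rank(all) | Rank(50 states) | permi2 | perkm2 | Population | Rank.1 | mi2 | km2 |
|---|---|---|---|---|---|---|---|---|
| District of Columbia | 1 | | 11295 | 4361 | 689545 | 56 | 61 | 158 |
| New Jersey | 2 | 1 | 1263 | 488 | 9288994 | 46 | 7354 | 19046.8 |
| Rhode Island | 3 | 2 | 1061 | 410 | 1097379 | 51 | 1034 | 2678 |
| Puerto Rico | 4 | | 960 | 371 | 3285874 | 49 | 3515 | 9103.8 |
| Massachusetts | 5 | 3 | 901 | 348 | 7029917 | 45 | 7800 | 20201.9 |
| Connecticut | 6 | 4 | 745 | 288 | 3605944 | 48 | 4842 | 12540.7 |
| Guam | 7 | | 733 | 283 | 153836 | 52 | 210 | 543.9 |
| American Samoa | 8 | | 650 | 251 | 49710 | 55 | 77 | 199.4 |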
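
The output above still carries the density and land-area columns. Since the question asks for only the state and its population, it may be simpler to select those two columns than to drop the rest one by one. A small sketch, assuming the frame read with `header=1` has the column names shown above; the frame below is a hand-made stand-in with placeholder rank values, not data fetched from the page.

```python
import pandas as pd

# Hypothetical stand-in for the frame read with header=1 (only a few columns shown).
df = pd.DataFrame({
    "State or territory": ["New Jersey", "Rhode Island"],
    "Rank": [11, 43],                 # placeholder values
    "People": [9288994, 1097379],
    "Rank.1": [46, 51],
    "mi2": [7354, 1034],
})

# Keep just the two wanted columns, then give them the final names.
wikidata = (
    df[["State or territory", "People"]]
    .rename(columns={"State or territory": "State", "People": "Population"})
)
print(wikidata.head())
```

Selecting by name also avoids having to track the `Rank` / `Rank.1` suffixes that pandas adds to duplicate sub-headers.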