How do I drop a sub-header from a Wikipedia table?

Posted on 2025-01-12 13:35:16

I am trying to scrape a Wikipedia table into a DataFrame. From that table, I want to drop Population density, Land area, and specifically the Rank sub-column under Population. In the end I want to keep only State or territory and Population (People).

https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density

Here is my code:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    wiki = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density"
    table_class = "wikitable sortable jquery-tablesorter"
    response = requests.get(wiki)

    soup = BeautifulSoup(response.text, "html.parser")
    indiatable = soup.find("table", {"class": "wikitable"})

    df = pd.read_html(str(indiatable))
    df = pd.DataFrame(df[0])

    # Attempt to drop the "Rank" sub-header under "Population"; this is the line in question
    data = df.drop(["Population density", "Population"["Rank"], "Land area"], axis=1)

    wikidata = data.rename(columns={"State or territory": "State", "Population": "Population"})
    print(wikidata.head())

How do I reference that specific sub-header so I can drop Rank under Population?

Hello爱情风 2025-01-19 13:35:17

Note: Your question does not show the expected result, so you may need to adjust the headers. I assumed you want to rename People to Population (rather than Population to itself) and changed the rename accordingly.

To reach your goal, set the header parameter when reading the HTML so that only the second header row is used; then you do not need to drop the top row separately:

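    # header=1 takes the second header row as the column names.
    # StringIO (from io) is used because newer pandas versions deprecate passing literal HTML strings.
    df = pd.read_html(StringIO(str(indiatable)), header=1)[0]
    df = df.rename(columns={"State or territory": "State", "People": "Population"}).drop(["Rank"], axis=1)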
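
If you would rather keep both header rows, the sub-header can also be addressed directly. The following is only a minimal sketch. It assumes that reading the table without `header=1` yields two-level (MultiIndex) columns, and that a cell spanning both header rows repeats its text at each level. The frame is hand-built for illustration, using values from the output shown further down, and the sub-header labels `per mi2` and `mi2` are placeholders rather than the page's exact text.

```python
import pandas as pd

# Hand-built stand-in for a frame with a two-row header (layout is assumed).
columns = pd.MultiIndex.from_tuples([
    ("State or territory", "State or territory"),
    ("Population density", "per mi2"),
    ("Population", "People"),
    ("Population", "Rank"),
    ("Land area", "mi2"),
])
df = pd.DataFrame(
    [
        ["District of Columbia", 11295, 689545, 56, 61],
        ["New Jersey", 1263, 9288994, 46, 7354],
    ],
    columns=columns,
)

# A sub-header is addressed by its full (top, sub) tuple.
data = df.drop(columns=[("Population", "Rank")])
# Whole column groups can be dropped by their top-level label.
data = data.drop(columns=["Population density", "Land area"], level=0)

# Flatten the remaining two-level header into plain names.
data.columns = ["State", "Population"]
print(data)
```

The tuple form is what the expression `"Population"["Rank"]` in the question was reaching for: with two-level columns, each column label is a pair, so `("Population", "Rank")` identifies exactly one column.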

Example

    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    wiki = "https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population_density"
    response = requests.get(wiki)

    # Locate the first table with the "wikitable" class
    soup = BeautifulSoup(response.text, "html.parser")
    indiatable = soup.find("table", {"class": "wikitable"})

    # header=1 takes the second header row as the column names;
    # StringIO avoids the literal-HTML deprecation in newer pandas versions
    df = pd.read_html(StringIO(str(indiatable)), header=1)[0]
    df = df.rename(columns={"State or territory": "State", "People": "Population"}).drop(["Rank"], axis=1)

Output

State	Rank(all)	Rank(50 states)	permi2	perkm2	Population	Rank.1	mi2	km2
District of Columbia	1	—	11295	4361	689545	56	61	158
New Jersey	2	1	1263	488	9288994	46	7354	19046.8
Rhode Island	3	2	1061	410	1097379	51	1034	2678
Puerto Rico	4	—	960	371	3285874	49	3515	9103.8
Massachusetts	5	3	901	348	7029917	45	7800	20201.9
Connecticut	6	4	745	288	3605944	48	4842	12540.7
Guam	7	—	733	283	153836	52	210	543.9
American Samoa	8	—	650	251	49710	55	77	199.4
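
The output above still carries the density, rank, and area columns, while the question asks for only the state and the population. One way to finish the job, sketched here under the assumption that the flat column names match the output above, is to select the two wanted columns instead of listing every column to drop. The frame below is a hand-built stand-in with the first two rows and a subset of the columns.

```python
import pandas as pd

# Hand-built stand-in for the flat frame printed above (first two rows,
# subset of columns); on the live page this would come from pd.read_html.
df = pd.DataFrame({
    "State": ["District of Columbia", "New Jersey"],
    "permi2": [11295, 1263],
    "Population": [689545, 9288994],
    "Rank.1": [56, 46],
    "mi2": [61, 7354],
})

# Select the columns to keep rather than naming each column to drop.
wikidata = df[["State", "Population"]]
print(wikidata)
```

Selecting by name also keeps working if the page later adds or renames one of the unwanted columns, whereas a drop list would need updating.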