在 R 中导入维基百科表

发布于 2024-12-04 07:27:41 字数 260 浏览 1 评论 0原文

我经常从维基百科中提取表格。 Excel 的 Web 导入对于维基百科无法正常工作，因为它将整个页面视为表格。在谷歌电子表格中，我可以输入：

=ImportHtml("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan","table",3)

这个函数将从该页面下载第三个表，其中列出了密歇根州的所有县。

R中有类似的东西吗？或者可以通过用户定义的函数创建？

原文

I regularly extract tables from Wikipedia. Excel's web import does not work properly for wikipedia, as it treats the whole page as a table. In google spreadsheet, I can enter this:

=ImportHtml("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan","table",3)

and this function will download the 3rd table, which lists all the counties of the UP of Michigan, from that page.

Is there something similar in R? or can be created via a user defined function?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

夏末 2024-12-11 07:27:41

以 Andrie 的答案为基础，并解决 SSL 问题。如果您可以添加一个额外的库依赖项：

library(httr)
library(XML)

url <- "https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan"

r <- GET(url)

doc <- readHTMLTable(
  doc=content(r, "text"))

doc[6]

Building on Andrie's answer, and addressing SSL. If you can take one additional library dependency:

library(httr)
library(XML)

url <- "https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan"

r <- GET(url)

doc <- readHTMLTable(
  doc=content(r, "text"))

doc[6]

回复收藏 0 原文

墨小沫ゞ 2024-12-11 07:27:41

XML 包中的函数 readHTMLTable 非常适合此目的。

请尝试以下操作：

library(XML)
doc <- readHTMLTable(
         doc="http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")

doc[[6]]

            V1         V2                 V3                              V4
1       County Population Land Area (sqÂ mi) Population Density (per sqÂ mi)
2        Alger      9,862                918                            10.7
3       Baraga      8,735                904                             9.7
4     Chippewa     38,413               1561                            24.7
5        Delta     38,520               1170                            32.9
6    Dickinson     27,427                766                            35.8
7      Gogebic     17,370               1102                            15.8
8     Houghton     36,016               1012                            35.6
9         Iron     13,138               1166                            11.3
10    Keweenaw      2,301                541                             4.3
11        Luce      7,024                903                             7.8
12    Mackinac     11,943               1022                            11.7
13   Marquette     64,634               1821                            35.5
14   Menominee     25,109               1043                            24.3
15   Ontonagon      7,818               1312                             6.0
16 Schoolcraft      8,903               1178                             7.6
17       TOTAL    317,258             16,420                            19.3

readHTMLTable 返回 HTML 页面每个元素的 data.frame 列表。您可以使用 names 来获取有关每个元素的信息：

> names(doc)
 [1] "NULL"                                                                               
 [2] "toc"                                                                                
 [3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
 [4] "NULL"                                                                               
 [5] "Cities and Villages of the Upper Peninsula"                                         
 [6] "Upper Peninsula Land Area and Population Density by County"                         
 [7] "19th Century Population by Census Year of the Upper Peninsula by County"            
 [8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"   
 [9] "NULL"                                                                               
[10] "NULL"                                                                               
[11] "NULL"                                                                               
[12] "NULL"                                                                               
[13] "NULL"                                                                               
[14] "NULL"                                                                               
[15] "NULL"                                                                               
[16] "NULL"

The function readHTMLTable in package XML is ideal for this.

Try the following:

library(XML)
doc <- readHTMLTable(
         doc="http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")

doc[[6]]

            V1         V2                 V3                              V4
1       County Population Land Area (sqÂ mi) Population Density (per sqÂ mi)
2        Alger      9,862                918                            10.7
3       Baraga      8,735                904                             9.7
4     Chippewa     38,413               1561                            24.7
5        Delta     38,520               1170                            32.9
6    Dickinson     27,427                766                            35.8
7      Gogebic     17,370               1102                            15.8
8     Houghton     36,016               1012                            35.6
9         Iron     13,138               1166                            11.3
10    Keweenaw      2,301                541                             4.3
11        Luce      7,024                903                             7.8
12    Mackinac     11,943               1022                            11.7
13   Marquette     64,634               1821                            35.5
14   Menominee     25,109               1043                            24.3
15   Ontonagon      7,818               1312                             6.0
16 Schoolcraft      8,903               1178                             7.6
17       TOTAL    317,258             16,420                            19.3

readHTMLTable returns a list of data.frames for each element of the HTML page. You can use names to get information about each element:

> names(doc)
 [1] "NULL"                                                                               
 [2] "toc"                                                                                
 [3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
 [4] "NULL"                                                                               
 [5] "Cities and Villages of the Upper Peninsula"                                         
 [6] "Upper Peninsula Land Area and Population Density by County"                         
 [7] "19th Century Population by Census Year of the Upper Peninsula by County"            
 [8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"   
 [9] "NULL"                                                                               
[10] "NULL"                                                                               
[11] "NULL"                                                                               
[12] "NULL"                                                                               
[13] "NULL"                                                                               
[14] "NULL"                                                                               
[15] "NULL"                                                                               
[16] "NULL"

回复收藏 0 原文

内心激荡 2024-12-11 07:27:41

以下是适用于安全 (https) 链接的解决方案：

install.packages("htmltab")
library(htmltab)
htmltab("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan",3)

Here is a solution that works with the secure (https) link:

install.packages("htmltab")
library(htmltab)
htmltab("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan",3)

回复收藏 0 原文

一梦浮鱼 2024-12-11 07:27:41

一种简单的方法是使用 RGoogleDocs 界面让 Google 文档为您进行转换：

http://www.omegahat.org/RGoogleDocs/run.html

然后，您可以使用 =ImportHtml Google 文档功能及其所有预先构建的魔力。

回复收藏 0 原文

新人笑 2024-12-11 07:27:41

使用 rvest 的 tidyverse 解决方案。如果您需要根据某些关键字（例如表标题中的关键字）查找表格，那么它非常有用。这是一个我们想要获取埃及人口统计数据表的示例。注意：html_nodes(x = page, css = "table") 是浏览页面上可用表格的有用方法。

library(magrittr)
library(rvest)

# define the page to load
read_html("https://en.wikipedia.org/wiki/Demographics_of_Egypt") %>% 
    # list all tables on the page
    html_nodes(css = "table") %>% 
    # select the one containing needed key words
    extract2(., str_which(string = . , pattern = "Live births")) %>% 
    # convert to a table
    html_table(fill = T) %>%  
    view

A tidyverse solution using rvest. It's very useful if you need to find the table based on some keywords, for example in the table headers. Here is an example where we want to get the table on Vital statistics of Egypt. Note: html_nodes(x = page, css = "table") is a useful way to browse available tables on the page.

library(magrittr)
library(rvest)

# define the page to load
read_html("https://en.wikipedia.org/wiki/Demographics_of_Egypt") %>% 
    # list all tables on the page
    html_nodes(css = "table") %>% 
    # select the one containing needed key words
    extract2(., str_which(string = . , pattern = "Live births")) %>% 
    # convert to a table
    html_table(fill = T) %>%  
    view

回复收藏 0 原文

翻身的咸鱼 2024-12-11 07:27:41

该表是唯一的第二个 td 子表的子表，因此您可以使用 css 指定该模式。您可以使用更快的类，而不是使用表的类型选择器来获取子表：

library(rvest)

t <- read_html('https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan') %>% 
  html_node('td:nth-child(2) .wikitable') %>% 
  html_table()

print(t)

That table is the only table which is a child of the second td child of so you can specify that pattern with css. Rather than use a type selector of table to grab the child table you can use the class which is faster:

library(rvest)

t <- read_html('https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan') %>% 
  html_node('td:nth-child(2) .wikitable') %>% 
  html_table()

print(t)

回复收藏 0 原文

~没有更多了~