Importing a Wikipedia table into R
I regularly extract tables from Wikipedia. Excel's web import does not work properly for Wikipedia, as it treats the whole page as a table. In a Google spreadsheet, I can enter this:
=ImportHtml("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan","table",3)
and this function will download the third table on that page, which lists all the counties of the Upper Peninsula (UP) of Michigan.
Is there something similar in R, or could it be created via a user-defined function?
Building on Andrie's answer, and addressing SSL. If you can take one additional library dependency:
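The code for this answer did not survive on this page. As a minimal sketch, assuming the extra dependency is httr (an assumption, the answer does not name it here), one can download the page over https with httr and pass the HTML text to the XML package. The helper names tables_from_html and read_tables_https are hypothetical.

```r
library(httr)
library(XML)

# Hypothetical helper: parse HTML text that has already been downloaded
# and return every <table> as a data frame.
tables_from_html <- function(html) {
  doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")
  readHTMLTable(doc, stringsAsFactors = FALSE)
}

# Hypothetical helper: fetch the page over https with httr, then parse it.
read_tables_https <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)  # fail early on HTTP errors
  tables_from_html(content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (needs network access):
# tables <- read_tables_https("https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")
# length(tables)
```

The download and the parsing are kept separate so that the parsing step can be tried on any HTML string without a network connection.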
The function readHTMLTable in package XML is ideal for this. Try the following:
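The original snippet is missing from this page, so here is a minimal sketch of the idea rather than the answerer's exact code. It parses a small made-up HTML string (the table contents are toy data) so that it runs offline; for a real page, the string would be replaced by the downloaded HTML.

```r
library(XML)

# Toy HTML standing in for a downloaded page; the contents are made up.
html <- '<html><body>
<table id="first"><tr><th>x</th><th>y</th></tr>
<tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>
<table id="second"><tr><th>County</th><th>Seat</th></tr>
<tr><td>A</td><td>B</td></tr><tr><td>C</td><td>D</td></tr></table>
</body></html>'

# Parse the text, then convert every <table> into a data frame.
doc <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

length(tables)         # how many tables were found
second <- tables[[2]]  # pick one by position, like the index argument of ImportHtml
print(second)
```

Picking a table by position mirrors the third argument of =ImportHtml, with the same caveat: the index changes if tables are added to or removed from the page.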
readHTMLTable returns a list of data.frames, one for each table element of the HTML page. You can use names to get information about each element:
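A short sketch of inspecting the returned list, again on made-up HTML so it runs offline. The labels returned by names appear to be derived from table attributes such as the id; treat that as an assumption and check the output for your own page.

```r
library(XML)

# Toy HTML with two tables; the ids and contents are made up.
html <- '<html><body>
<table id="first"><tr><th>x</th><th>y</th></tr>
<tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>
<table id="second"><tr><th>u</th><th>v</th></tr>
<tr><td>e</td><td>f</td></tr><tr><td>g</td><td>h</td></tr></table>
</body></html>'

tables <- readHTMLTable(htmlParse(html, asText = TRUE), stringsAsFactors = FALSE)

print(names(tables))        # one label per table
print(sapply(tables, dim))  # rows and columns of each table, useful for spotting the one you want
```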
Here is a solution that works with the secure (https) link:
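The code for this answer is not preserved here. One possible approach, which is a swapped-in technique and not necessarily what the answerer used, is to let base R's download.file fetch the https page into a temporary file and then parse that file with the XML package. The helper name read_tables_via_tempfile is hypothetical.

```r
library(XML)

# Hypothetical helper: download the page to a temporary file, then read
# all of its tables from that local copy.
read_tables_via_tempfile <- function(url) {
  tmp <- tempfile(fileext = ".html")
  on.exit(unlink(tmp))  # remove the temporary file when the function exits
  download.file(url, destfile = tmp, quiet = TRUE, mode = "wb")
  doc <- htmlParse(tmp, encoding = "UTF-8")
  readHTMLTable(doc, stringsAsFactors = FALSE)
}

# Usage (needs network access):
# tables <- read_tables_via_tempfile("https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")
# names(tables)
```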
One simple way to do it is to use the RGoogleDocs interface to have Google Docs do the conversion for you: http://www.omegahat.org/RGoogleDocs/run.html
You can then use the =ImportHtml Google Docs function with all its pre-built magic.
A tidyverse solution using rvest. It's very useful if you need to find the table based on some keywords, for example in the table headers. Here is an example where we want to get the table on Vital statistics of Egypt. Note: html_nodes(x = page, css = "table") is a useful way to browse the available tables on the page.
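The example code is missing from this page. The keyword search described above can be sketched as follows, using a made-up HTML string in place of the real article so that it runs offline. The keyword "Live births" and all table contents are assumptions for illustration only.

```r
library(rvest)

# Toy page standing in for a real article; all values are made up.
html <- '<html><body>
<table><tr><th>City</th><th>Rank</th></tr>
<tr><td>A</td><td>1</td></tr></table>
<table><tr><th>Year</th><th>Live births</th></tr>
<tr><td>2000</td><td>10</td></tr>
<tr><td>2001</td><td>12</td></tr></table>
</body></html>'

page <- read_html(html)  # read_html() also accepts a URL

# List every table on the page, then keep those whose text mentions the keyword.
tables <- html_nodes(x = page, css = "table")
keyword <- "Live births"  # hypothetical header text to search for
hits <- tables[grep(keyword, html_text(tables), fixed = TRUE)]

# Convert the first matching table into a data frame.
vital <- html_table(hits[[1]])
print(vital)
```

Searching by header text is more robust than a fixed position when tables are added to or removed from the page, but it can return several matches, so it is worth checking length(hits) before taking the first one.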
That table is the only table which is a child of the second td child of its row, so you can specify that pattern with CSS. Rather than using a type selector of table to grab the child table, you can use its class, which is faster:
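The selector from the original answer is not preserved here. A minimal sketch of the pattern, on made-up HTML that nests one table inside the second cell of an outer layout table, could look like this. The class name wikitable is an assumption about the target table (it is the usual class for Wikipedia data tables), and the contents are toy data.

```r
library(rvest)

# Toy layout: an outer table whose second cell holds the table we want.
html <- '<html><body>
<table><tr>
<td>sidebar</td>
<td><table class="wikitable">
<tr><th>County</th><th>Seat</th></tr>
<tr><td>A</td><td>B</td></tr>
</table></td>
</tr></table>
</body></html>'

page <- read_html(html)

# Second td child of its row, then its direct child selected by class
# instead of by the table type selector.
inner <- html_node(page, css = "td:nth-child(2) > .wikitable")
counties <- html_table(inner)
print(counties)
```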