Parsing HTML tables using the XML / RCurl R packages, without using the readHTMLTable function
I am trying to scrape/extract data from the single html table on: http://www.theplantlist.org/tpl/record/kew-419248 and a number of very similar pages.
I initially tried using the following function to read the table, but it wasn't ideal because I want to separate each species name into its component parts (genus/species/infraspecies/author etc).
library(XML)
readHTMLTable("http://www.theplantlist.org/tpl/record/kew-419248")
I used SelectorGadget to identify a unique XPATH to each table element that I want to extract (not necessarily the shortest):
For genus names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "genus", " " ))]
For species names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "species", " " ))]
For infraspecies ranks: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]
For infraspecies names: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspe", " " ))]
For confidence levels (image): //*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img
For sources: //*[contains(concat( " ", @class, " " ), concat( " ", "source", " " ))]//a
I now want to extract the information into a dataframe/table.
I tried using the xpathSApply function of the XML package to extract some of this data:
e.g. for infraspecies ranks
library(XML)
library(RCurl)
infraspeciesrank = htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-419248"))
path=' //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]'
xpathSApply(infraspeciesrank, path)
However, this method is problematic because of gaps in the data (e.g. only some rows of the table have an infraspecies rank, so all I have returned is a list of the three ranks in the table, with no gaps). The data output is also of a class that I have had trouble attaching to a dataframe.
Does anyone know a better way to extract information from this table into a dataframe?
Any help would be much appreciated!
Tom
Here is another solution, which splits each species name into its component parts
It produces the following output
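The general idea of splitting each name into its parts while keeping the gaps can be sketched roughly as follows. This is only an illustration, not this answer's own code: the inline HTML is a made-up stand-in for the page, the names in it are invented, the "authorship" class is an assumption, and `first_text` is a hypothetical helper. The key point is that the XPath is evaluated once per table row rather than once for the whole document, so a row without an infraspecies rank yields NA instead of silently dropping out.

```r
library(XML)

# Made-up stand-in for the page's table (invented names, not real data).
# The classes genus, species, infraspr and infraspe come from the question;
# "authorship" is an assumed class name.
html <- '<table class="synonyms">
<tr><td class="Synonym"><i class="genus">Aus</i> <i class="species">bus</i> <span class="authorship">Smith</span></td></tr>
<tr><td class="Synonym"><i class="genus">Aus</i> <i class="species">cus</i> <span class="infraspr">var.</span> <i class="infraspe">dus</i> <span class="authorship">Jones</span></td></tr>
</table>'

doc <- htmlParse(html, asText = TRUE)

# One node per data row, so every row produces exactly one record.
rows <- getNodeSet(doc, "//table[contains(@class, 'synonyms')]//tr[td]")

# Hypothetical helper: text of the first descendant carrying the class,
# or NA when the row has no such element.
first_text <- function(node, cls) {
  xp <- sprintf(".//*[contains(concat(' ', @class, ' '), ' %s ')]", cls)
  hits <- xpathSApply(node, xp, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

parts <- c("genus", "species", "infraspr", "infraspe", "authorship")
records <- lapply(rows, function(r) {
  vapply(parts, function(p) first_text(r, p), character(1))
})

# Bind the per-row character vectors into a data frame, one column per part.
result <- as.data.frame(do.call(rbind, records), stringsAsFactors = FALSE)
print(result)
```

Padding the class attribute with spaces, as SelectorGadget does, matches whole class tokens, so "infraspr" and "infraspe" are not confused with each other. Against the real page, the row selector and class names would need checking against the actual markup.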
The following code parses your table into a matrix.
Caveats:
The code:
The results:
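As a rough sketch of what parsing the table cell by cell into a matrix might look like (this is not the answer's own code): the inline HTML and its values are made up, reading the confidence level from the image's `alt` attribute is an assumption about the markup, and `cell_value` is a hypothetical helper.

```r
library(XML)

# Made-up stand-in for the page's table (not real data). Reading the
# confidence level from the img "alt" attribute is an assumption.
html <- '<table class="synonyms">
<tr><th>Name</th><th>Confidence</th><th>Source</th></tr>
<tr><td class="Synonym">Aus bus Smith</td><td><img alt="H" src="h.png"/></td><td class="source"><a href="#">SRC1</a></td></tr>
<tr><td class="Synonym">Aus cus var. dus Jones</td><td><img alt="L" src="l.png"/></td><td class="source"></td></tr>
</table>'

doc <- htmlParse(html, asText = TRUE)
rows <- getNodeSet(doc, "//table[contains(@class, 'synonyms')]//tr[td]")

# Hypothetical helper: value of one cell. Image cells use the alt text,
# other cells use their trimmed text, and empty cells become NA.
cell_value <- function(cell) {
  img <- getNodeSet(cell, ".//img")
  if (length(img) > 0) return(xmlGetAttr(img[[1]], "alt", NA_character_))
  txt <- trimws(xmlValue(cell))
  if (length(txt) == 1 && nzchar(txt)) txt else NA_character_
}

# One matrix row per table row. This assumes every row has the same
# number of cells; otherwise rbind would recycle values.
m <- do.call(rbind, lapply(rows, function(r) {
  vapply(getNodeSet(r, "./td"), cell_value, character(1))
}))
colnames(m) <- xpathSApply(doc, "//table[contains(@class, 'synonyms')]//th", xmlValue)
print(m)
```

The result is a character matrix, so `as.data.frame(m, stringsAsFactors = FALSE)` converts it to a data frame if one is needed.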