Parsing an HTML table with the XML / RCurl R packages, without using the readHTMLTable function

Posted on 2024-11-16 09:01:33

I am trying to scrape/extract data from the single html table on: http://www.theplantlist.org/tpl/record/kew-419248 and a number of very similar pages.
I initially tried using the following function to read the table, but it wasn't ideal because I want to separate each species name into its component parts (genus/species/infraspecies/author etc).

library(XML)
readHTMLTable("http://www.theplantlist.org/tpl/record/kew-419248")

I used SelectorGadget to identify a unique XPATH to each table element that I want to extract (not necessarily the shortest):

For genus names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "genus", " " ))]

For species names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "species", " " ))]

For infraspecies ranks: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]

For infraspecies names: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspe", " " ))]

For confidence levels (image): //*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img

For sources: //*[contains(concat( " ", @class, " " ), concat( " ", "source", " " ))]//a

I now want to extract the information into a dataframe/table.

I tried using the xpathSApply function of the XML package to extract some of this data:

e.g. for infraspecies ranks

library(XML)
library(RCurl)
infraspeciesrank = htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-419248"))
path=' //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]'
xpathSApply(infraspeciesrank, path)

However, this method is problematic because of gaps in the data (e.g. only some rows of the table have an infraspecies rank, so all I have returned is a list of the three ranks in the table, with no gaps). The data output is also of a class that I have had trouble attaching to a dataframe.
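
The gap problem can be reproduced offline. What follows is a minimal sketch, assuming a made-up three-row fragment that only mimics the class names on the page (the real markup may differ):

```r
library(XML)

# Hypothetical three-row fragment; only rows 1 and 3 carry an infraspecies rank.
html <- '<table class="names synonyms"><tbody>
<tr><td><i class="genus">Hordeum</i> <i class="species">chilense</i> <span class="infraspr">var.</span> <i class="infraspe">chilense</i></td></tr>
<tr><td><i class="genus">Hordeum</i> <i class="species">cylindricum</i></td></tr>
<tr><td><i class="genus">Hordeum</i> <i class="species">pratense</i> <span class="infraspr">var.</span> <i class="infraspe">brongniartii</i></td></tr>
</tbody></table>'
doc <- htmlParse(html, asText = TRUE)

# A document-wide query returns one value per match, not one per row,
# so the row without a rank leaves no trace in the result.
rows  <- getNodeSet(doc, "//tr")
ranks <- xpathSApply(doc, "//*[@class = 'infraspr']", xmlValue)
length(rows)   # number of table rows
length(ranks)  # number of ranks found
```

Because the two lengths differ, the ranks cannot be lined up with the rows they came from.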

Does anyone know a better way to extract information from this table into a dataframe?

Any help would be much appreciated!

Tom


卖梦商人 2024-11-23 09:01:33

Here is another solution, which splits each species name into its component parts:

library(XML)
library(plyr)

# read the page into an internal HTML tree, decoding it as UTF-8
url = "http://www.theplantlist.org/tpl/record/kew-419248"
doc = htmlTreeParse(url, useInternalNodes = TRUE, encoding = "UTF-8")

# extract one node per row of the synonyms table
xp_expr = "//table[@class= 'names synonyms']/tbody/tr"
nodes = getNodeSet(doc, xp_expr)

# extract the desired fields from a given row node,
# leaving a blank wherever the row lacks that field
fields = c('genus', 'species', 'infraspe', 'authorship')
read_node = function(node){

    dl = lapply(fields, function(x) xpathSApply(node, 
       paste(".//*[@class = ", "'", x, "'", "]", sep = ""), xmlValue))
    tmp = rep(' ', length(dl))
    tmp[sapply(dl, length) == 1] = unlist(dl)
    # the confidence level is an image, so read its alt text
    confidence = xpathSApply(node, './/img', xmlGetAttr, 'alt')
    return(c(tmp, confidence))
}

# apply the function to every row and return a data frame
df = ldply(nodes, read_node)
names(df) = c(fields, 'confidence')

It produces the following output:

      genus      species     infraspe               authorship confidence
1 Critesion     chilense              (Roem. & Schult.) Á.Löve          H
2   Hordeum     chilense     chilense                                   L
3   Hordeum  cylindricum                                Steud.          H
4   Hordeum depauperatum                                Steud.          H
5   Hordeum     pratense brongniartii                Macloskie          L
6   Hordeum    secalinum     chilense                  É.Desv.          L
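
The same row-by-row idea can be tried without network access. The sketch below is only an illustration under stated assumptions: the HTML is a hypothetical two-row fragment that mimics the class names used above, the helpers `first_value` and `read_row` are made-up names, and base R replaces plyr. Missing fields become `NA` instead of blanks, and the infraspecies rank (`infraspr`) is kept as its own column.

```r
library(XML)

# Hypothetical fragment that mimics the class names used on the page.
html <- '<table class="names synonyms"><tbody>
<tr><td><i class="genus">Hordeum</i> <i class="species">chilense</i>
        <span class="infraspr">var.</span> <i class="infraspe">chilense</i></td>
    <td><img alt="L" src="l.png"/></td></tr>
<tr><td><i class="genus">Hordeum</i> <i class="species">cylindricum</i>
        <span class="authorship">Steud.</span></td>
    <td><img alt="H" src="h.png"/></td></tr>
</tbody></table>'
doc  <- htmlParse(html, asText = TRUE)
rows <- getNodeSet(doc, "//table[@class = 'names synonyms']/tbody/tr")

fields <- c("genus", "species", "infraspr", "infraspe", "authorship")

# Text of the first node with class `cls` inside `row`,
# or NA when the row has no such node.
first_value <- function(row, cls) {
  hit <- xpathSApply(row, sprintf(".//*[@class = '%s']", cls), xmlValue)
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

# One named character vector per row; the confidence comes from the img alt text.
read_row <- function(row) {
  vals <- vapply(fields, function(f) first_value(row, f), character(1))
  conf <- xpathSApply(row, ".//img", xmlGetAttr, "alt")
  c(vals, confidence = if (length(conf)) conf[[1]] else NA_character_)
}

# Bind the per-row vectors into a data frame with one row per table row.
df <- as.data.frame(do.call(rbind, lapply(rows, read_row)),
                    stringsAsFactors = FALSE)
```

Querying relative to each row (the leading `.` in the XPath) is what keeps the gaps: every row contributes exactly one value per field, present or not.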

浮世清欢 2024-11-23 09:01:33

The following code parses your table into a matrix.

Caveats:

The confidence level column is blank, because the cell contains an image rather than text. If this matters, you should be able to retrieve the image location and parse that.
There are some encoding issues (UTF-8 characters come out garbled on my machine). I don't yet know how to fix this.

The code:

library(XML)
library(RCurl)

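# download the page source as text
baseURL <- "http://www.theplantlist.org/tpl/record/kew-419248"
txt <- getURL(url=baseURL)

# parse the text and select every row of the table body
xmltext <- htmlParse(txt, asText=TRUE)
xmltable <- xpathApply(xmltext, "//table//tbody//tr")
# keep the text of child nodes 1, 3, 5 and 7 of each row, one matrix row per table row
t(sapply(xmltable, function(x)unname(xmlSApply(x, xmlValue))[c(1, 3, 5, 7)]))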

The results:

     [,1]                                                [,2]      [,3] [,4]  
[1,] "Critesion chilense (Roem. & Schult.) Ã.LÃ¶ve" "Synonym" ""   "WCSP"
[2,] "Hordeum chilense var. chilense "                   "Synonym" ""   "TRO" 
[3,] "Hordeum cylindricum Steud. [Illegitimate]"         "Synonym" ""   "WCSP"
[4,] "Hordeum depauperatum Steud."                       "Synonym" ""   "WCSP"
[5,] "Hordeum pratense var. brongniartii Macloskie"      "Synonym" ""   "WCSP"
[6,] "Hordeum secalinum var. chilense Ã.Desv."        "Synonym" ""   "WCSP"
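
Both caveats might be addressed along the following lines. This is a sketch under assumptions, not a tested fix for the live page: the HTML is a hypothetical one-row fragment, the image path is invented, and passing `encoding = "UTF-8"` to `htmlParse` assumes the page really is served as UTF-8.

```r
library(XML)

# Hypothetical row: the confidence level exists only as an image.
html <- '<table><tbody>
<tr><td>Hordeum secalinum var. chilense \u00c9.Desv.</td>
    <td><img alt="L" src="/img/L.png"/></td></tr>
</tbody></table>'

# Declaring the encoding asks the parser to decode the input as UTF-8
# instead of guessing.
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Attributes of the <img> nodes stand in for the missing cell text.
confidence <- xpathSApply(doc, "//tr//img", xmlGetAttr, "alt")
image_src  <- xpathSApply(doc, "//tr//img", xmlGetAttr, "src")
name_text  <- xpathSApply(doc, "//tr/td[1]", xmlValue)
```

Either the `alt` text or the file name in `src` could then be mapped to a confidence level, depending on which the page populates reliably.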
