Parsing an HTML table with the XML / RCurl R packages, without using the readHTMLTable function

Posted 2024-11-16 09:01:33 · 1635 words · 1 view · 0 comments

I am trying to scrape/extract data from the single html table on: http://www.theplantlist.org/tpl/record/kew-419248 and a number of very similar pages.
I initially tried using the following function to read the table, but it wasn't ideal because I want to separate each species name into its component parts (genus/species/infraspecies/author etc).

library(XML)
readHTMLTable("http://www.theplantlist.org/tpl/record/kew-419248")

I used SelectorGadget to identify a unique XPATH to each table element that I want to extract (not necessarily the shortest):

For genus names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "genus", " " ))]

For species names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "species", " " ))]

For infraspecies ranks: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]

For infraspecies names: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspe", " " ))]

For confidence levels (image): //*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img

For sources: //*[contains(concat( " ", @class, " " ), concat( " ", "source", " " ))]//a
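For readers unfamiliar with the long contains(concat(...)) pattern that SelectorGadget generates, the following is a minimal sketch of what it does. The HTML snippet is a made-up miniature, not the real page; in particular the "subgenus" paragraph is a hypothetical element added only to show the difference between the two tests.

```r
library(XML)

# Hypothetical miniature of the page: one table carrying two class tokens.
html <- '<html><body>
<table class="names synonyms"><tbody>
<tr><td><i class="genus">Hordeum</i> <i class="species">chilense</i></td></tr>
</tbody></table>
<p class="subgenus">not a genus cell</p>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

# Exact comparison fails: the attribute value is "names synonyms", not "synonyms".
exact <- getNodeSet(doc, "//*[@class = 'synonyms']")

# Token test: pad with spaces so " synonyms " is found inside " names synonyms ".
token <- getNodeSet(doc,
  "//*[contains(concat(' ', @class, ' '), ' synonyms ')]")

# A bare contains() also matches class="subgenus"; the padded form does not.
loose  <- xpathSApply(doc, "//*[contains(@class, 'genus')]", xmlValue)
strict <- xpathSApply(doc,
  "//*[contains(concat(' ', @class, ' '), ' genus ')]", xmlValue)

print(c(exact = length(exact), token = length(token)))
print(loose)
print(strict)
```

The padded form is longer, but it behaves like a CSS class selector, which is presumably why SelectorGadget emits it.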

I now want to extract the information into a dataframe/table.

I tried using the xpathSApply function of the XML package to extract some of this data:

e.g. for infraspecies ranks

library(XML)
library(RCurl)
infraspeciesrank = htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-419248"))
path=' //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]'
xpathSApply(infraspeciesrank, path)

However, this method is problematic because of gaps in the data (e.g. only some rows of the table have an infraspecies rank, so all I have returned is a list of the three ranks in the table, with no gaps). The data output is also of a class that I have had trouble attaching to a dataframe.
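The gap problem described above can be reproduced on a small scale. The sketch below uses a hypothetical three-row table (the markup is an assumption that only loosely imitates the real page) and a hypothetical helper named first_or_na. It contrasts a document-wide query with a query run inside each row, where a missing element can be recorded as NA.

```r
library(XML)

# Hypothetical three-row table; only the second row has an infraspecific rank.
html <- '<table class="names synonyms"><tbody>
<tr><td><i class="genus">Hordeum</i> <i class="species">cylindricum</i></td></tr>
<tr><td><i class="genus">Hordeum</i> <i class="species">pratense</i>
        <span class="infraspr">var.</span> <i class="infraspe">brongniartii</i></td></tr>
<tr><td><i class="genus">Critesion</i> <i class="species">chilense</i></td></tr>
</tbody></table>'
doc <- htmlParse(html, asText = TRUE)

# Document-wide query: one value for three rows, so row alignment is lost.
# Passing xmlValue returns character strings rather than node objects.
ranks_flat <- xpathSApply(doc, "//*[@class = 'infraspr']", xmlValue)

# Row-wise query: search inside each <tr> and record NA where nothing matches.
rows <- getNodeSet(doc, "//table/tbody/tr")
first_or_na <- function(row, cls) {
  hit <- xpathSApply(row, sprintf(".//*[@class = '%s']", cls), xmlValue)
  if (length(hit) == 0) NA_character_ else hit[[1]]
}
df <- data.frame(
  genus    = sapply(rows, first_or_na, cls = "genus"),
  species  = sapply(rows, first_or_na, cls = "species"),
  infraspr = sapply(rows, first_or_na, cls = "infraspr"),
  stringsAsFactors = FALSE
)
print(df)
```

Because every column is built from the same list of rows, each vector has one entry per row and the columns stay aligned.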

Does anyone know a better way to extract information from this table into a dataframe?

Any help would be much appreciated!

Tom


Comments (2)

卖梦商人 2024-11-23 09:01:33

Here is another solution, which splits each species name into its component parts

library(XML)
library(plyr)

# read url into html tree
url = "http://www.theplantlist.org/tpl/record/kew-419248"
doc = htmlTreeParse(url, useInternalNodes = TRUE)

# extract nodes containing desired information
xp_expr = "//table[@class= 'names synonyms']/tbody/tr"
nodes = getNodeSet(doc, xp_expr)

# function to extract desired fields from a given node    
fields = list('genus', 'species', 'infraspe', 'authorship')
read_node = function(node){

    dl = lapply(fields, function(x) xpathSApply(node, 
       paste(".//*[@class = ", "'", x, "'", "]", sep = ""), xmlValue))
    tmp = rep(' ', length(dl))
    tmp[sapply(dl, length) == 1] = unlist(dl)
    confidence = xpathSApply(node, './/img', xmlGetAttr, 'alt')
    return(c(tmp, confidence))
}

# apply function to all nodes and return data frame
df = ldply(nodes, read_node)
names(df) = c(fields, 'confidence')

It produces the following output

 genus      species     infraspe                      authorship confidence
1 Critesion     chilense              (Roem. & Schult.) Ã\u0081.Löve          H
2   Hordeum     chilense     chilense                                          L
3   Hordeum  cylindricum                                       Steud.          H
4   Hordeum depauperatum                                       Steud.          H
5   Hordeum     pratense brongniartii                       Macloskie          L
6   Hordeum    secalinum     chilense                   Ã\u0089.Desv.          L
浮世清欢 2024-11-23 09:01:33

The following code parses your table into a matrix.

Caveats:

  • The confidence level column is blank, since it contains an image rather than text. If this is important, you should be able to retrieve the image location and parse that.
  • There are some encoding issues (UTF-8 characters get converted into ASCII on my machine). I don't yet know how to fix this.
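Both caveats might be addressed along the following lines. This is only a sketch on a hypothetical one-row sample: the image path is made up, the alt text "H" simply mirrors the confidence codes reported in the other answer, and it assumes the page is encoded as UTF-8. The encoding argument of htmlParse is the relevant knob; whether it resolves the problem on a given machine is untested here.

```r
library(XML)

# Hypothetical one-row sample; the src path is invented for illustration.
txt <- paste0(
  '<table class="names synonyms"><tbody><tr>',
  '<td>Critesion chilense (Roem. &amp; Schult.) \u00c1.L\u00f6ve</td>',
  '<td>Synonym</td>',
  '<td><img src="/img/h.png" alt="H"/></td>',
  '<td>WCSP</td>',
  '</tr></tbody></table>')

# Assumption: the page is UTF-8. Declaring it explicitly is meant to stop the
# parser from falling back to a single-byte encoding.
doc <- htmlParse(txt, asText = TRUE, encoding = "UTF-8")

# Read attributes of the image instead of its (empty) text content.
confidence <- xpathSApply(doc, "//tbody/tr//img", xmlGetAttr, "alt")
img_src    <- xpathSApply(doc, "//tbody/tr//img", xmlGetAttr, "src")

# The name cell, for checking by eye whether accented characters survive.
name <- xpathSApply(doc, "//tbody/tr/td[1]", xmlValue)

print(confidence)
print(img_src)
print(name)
```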

The code:

library(XML)
library(RCurl)

baseURL <- "http://www.theplantlist.org/tpl/record/kew-419248"
txt <- getURL(url=baseURL)

xmltext <- htmlParse(txt, asText=TRUE)
xmltable <- xpathApply(xmltext, "//table//tbody//tr")
t(sapply(xmltable, function(x)unname(xmlSApply(x, xmlValue))[c(1, 3, 5, 7)]))

The results:

     [,1]                                                [,2]      [,3] [,4]  
[1,] "Critesion chilense (Roem. & Schult.) Ã.Löve" "Synonym" ""   "WCSP"
[2,] "Hordeum chilense var. chilense "                   "Synonym" ""   "TRO" 
[3,] "Hordeum cylindricum Steud. [Illegitimate]"         "Synonym" ""   "WCSP"
[4,] "Hordeum depauperatum Steud."                       "Synonym" ""   "WCSP"
[5,] "Hordeum pratense var. brongniartii Macloskie"      "Synonym" ""   "WCSP"
[6,] "Hordeum secalinum var. chilense Ã.Desv."        "Synonym" ""   "WCSP"
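Since the question asks for a data frame rather than a matrix, a conversion step could look like the sketch below. It rebuilds two rows of the matrix printed above by hand so that it runs without network access. The column names are assumptions inferred from the cell contents, not names taken from the page.

```r
# Two rows copied from the matrix printed above.
m <- rbind(
  c("Hordeum depauperatum Steud.", "Synonym", "", "WCSP"),
  c("Hordeum pratense var. brongniartii Macloskie", "Synonym", "", "WCSP"))

# Column names are assumptions inferred from the cell contents.
df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- c("name", "status", "confidence", "source")

# Turn empty strings into NA so that gaps are explicit.
df[df == ""] <- NA
str(df)
```

Keeping stringsAsFactors = FALSE leaves the names as plain character strings, which is convenient if they are later split into genus, species and author.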