如何将XML数据转换为data.frame？

发布于 2024-08-18 02:16:15 字数 497 浏览 3 评论 0原文

我正在尝试学习 R 的 XML 包。我正在尝试从 books.xml 示例 xml 数据文件创建 data.frame。这就是我得到的：

library(XML)
books <- "http://www.w3schools.com/XQuery/books.xml"
doc <- xmlTreeParse(books, useInternalNodes = TRUE)
doc
xpathApply(doc, "//book", function(x) do.call(paste, as.list(xmlValue(x))))
xpathSApply(doc, "//book", function(x) strsplit(xmlValue(x), " "))
xpathSApply(doc, "//book/child::*", xmlValue)

这些 xpathSApply 中的每一个都没有让我接近我的意图。应该如何构建一个格式良好的 data.frame？

原文

I'm trying to learn R's XML package. I'm trying to create a data.frame from books.xml sample xml data file. Here's what I get:

library(XML)
books <- "http://www.w3schools.com/XQuery/books.xml"
doc <- xmlTreeParse(books, useInternalNodes = TRUE)
doc
xpathApply(doc, "//book", function(x) do.call(paste, as.list(xmlValue(x))))
xpathSApply(doc, "//book", function(x) strsplit(xmlValue(x), " "))
xpathSApply(doc, "//book/child::*", xmlValue)

Each of these xpathSApply's don't get me even close to my intention. How should one proceed toward a well formed data.frame?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

岁月染过的梦 2024-08-25 02:16:15

通常，我建议尝试 xmlToDataFrame() 函数，但我相信这实际上相当棘手，因为它一开始的结构就不好。

我建议使用此函数：

xmlToList(books)

一个问题是每本书有多个作者，因此您需要在构建数据框架时决定如何处理该问题。

一旦你决定如何处理多作者问题，那么使用 plyr 中的 ldply() 函数（或者只使用 lapply 并将使用 do.call("rbind"...) 将值返回到 data.frame 中。

这是一个完整的示例（不包括作者）：

library(XML)
books <-  "w3schools.com/xsl/books.xml"
library(plyr)
ldply(xmlToList(books), function(x) { data.frame(x[!names(x)=="author"]) } )

   .id        title.text title..attrs year price   .attrs
 1 book  Everyday Italian           en 2005 30.00  COOKING
 2 book      Harry Potter           en 2005 29.99 CHILDREN
 3 book XQuery Kick Start           en 2003 49.99      WEB
 4 book      Learning XML           en 2003 39.95      WEB

这是包含作者的情况。您需要使用 ldply 。 > 在这种情况下，因为列表是“锯齿状的”...lapply 无法正确处理该问题[否则您可以将 lapply 与 rbind.fill 一起使用（也由您提供）。 Hadley），但是当 plyr 自动为你做这件事时为什么还要麻烦呢？]：

ldply(xmlToList(books), data.frame)

   .id        title.text title..attrs              author year price   .attrs
1 book  Everyday Italian           en Giada De Laurentiis 2005 30.00  COOKING
2 book      Harry Potter           en        J K. Rowling 2005 29.99 CHILDREN
3 book XQuery Kick Start           en      James McGovern 2003 49.99      WEB
4 book      Learning XML           en         Erik T. Ray 2003 39.95      WEB
     author.1   author.2   author.3               author.4
1        <NA>       <NA>       <NA>                   <NA>
2        <NA>       <NA>       <NA>                   <NA>
3 Per Bothner Kurt Cagle James Linn Vaidyanathan Nagarajan
4        <NA>       <NA>       <NA>                   <NA>

Ordinarily, I would suggest trying the xmlToDataFrame() function, but I believe that this will actually be fairly tricky because it isn't well structured to begin with.

I would recommend working with this function:

xmlToList(books)

One problem is that there are multiple authors per book, so you will need to decide how to handle that when you're structuring your data frame.

Once you have decided what to do with the multiple authors issue, then it's fairly straight forward to turn your book list into a data frame with the ldply() function in plyr (or just use lapply and convert the return value into a data.frame by using do.call("rbind"...).

Here's a complete example (excluding author):

library(XML)
books <-  "w3schools.com/xsl/books.xml"
library(plyr)
ldply(xmlToList(books), function(x) { data.frame(x[!names(x)=="author"]) } )

   .id        title.text title..attrs year price   .attrs
 1 book  Everyday Italian           en 2005 30.00  COOKING
 2 book      Harry Potter           en 2005 29.99 CHILDREN
 3 book XQuery Kick Start           en 2003 49.99      WEB
 4 book      Learning XML           en 2003 39.95      WEB

Here's what it looks like with author included. You need to use ldply in this instance since the list is "jagged"...lapply can't handle that properly. [Otherwise you can use lapply with rbind.fill (also courtesy of Hadley), but why bother when plyr automatically does it for you?]:

ldply(xmlToList(books), data.frame)

   .id        title.text title..attrs              author year price   .attrs
1 book  Everyday Italian           en Giada De Laurentiis 2005 30.00  COOKING
2 book      Harry Potter           en        J K. Rowling 2005 29.99 CHILDREN
3 book XQuery Kick Start           en      James McGovern 2003 49.99      WEB
4 book      Learning XML           en         Erik T. Ray 2003 39.95      WEB
     author.1   author.2   author.3               author.4
1        <NA>       <NA>       <NA>                   <NA>
2        <NA>       <NA>       <NA>                   <NA>
3 Per Bothner Kurt Cagle James Linn Vaidyanathan Nagarajan
4        <NA>       <NA>       <NA>                   <NA>

回复收藏 0 原文

~没有更多了~