Converting an uneven hierarchical list into a data frame

Posted on 2024-09-13 02:25:14

I don't think this has been asked yet: is there a way to combine the information from a list with multiple levels and an uneven structure into a data frame in "long" format?

Specifically:

library(XML)
library(plyr)
xml.inning <- "http://gd2.mlb.com/components/game/mlb/year_2009/month_05/day_02/gid_2009_05_02_chamlb_texmlb_1/inning/inning_5.xml"
xml.parse <- xmlInternalTreeParse(xml.inning)
xml.list <- xmlToList(xml.parse)
## $top$atbat
## $top$atbat$pitch
##             des              id            type               x               y 
##          "Ball"           "310"             "B"         "70.39"        "125.20"

The list has the following structure:

> llply(xml.list, function(x) llply(x, function(x) table(names(x))))
$top
$top$atbat
.attrs  pitch 
     1      4 
$top$atbat
.attrs  pitch 
     1      4 
$top$atbat
.attrs  pitch 
     1      5 
$bottom
$bottom$action
     b    des  event      o  pitch player      s 
     1      1      1      1      1      1      1 
$bottom$atbat
.attrs  pitch 
     1      5 
$bottom$atbat
.attrs  pitch 
     1      5 
$bottom$atbat
.attrs  pitch runner 
     1      5      1 
$bottom$atbat
.attrs  pitch runner 
     1      7      1 
$.attrs
$.attrs$num
character(0)
$.attrs$away_team
character(0)
$.attrs$

What I'd like to have is a data frame built from the named vectors in the pitch category, with additional columns recording where each pitch sits in the hierarchy (top or bottom, then atbat, then pitch). That means ignoring the levels that won't fit into a data.frame because they have a different number of columns. Something like this:

   first second third    des     x
1    top  atbat pitch   Ball 70.29
2    top  atbat pitch Strike 69.24
3 bottom  atbat pitch    Out 67.22

Is there an elegant way of doing this? Thanks!

挽梦忆笙歌 2024-09-20 02:25:14

I don't know about elegant, but this works. Those more familiar with plyr could probably provide a more general solution.

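cleanFun <- function(x) {
   a <- x[["atbat"]]                            # [[ ]] matches only the first atbat in each half-inning
   b <- do.call(rbind, a[names(a) == "pitch"])  # one matrix row per pitch vector
   as.data.frame(b)                             # return the data frame visibly
}
ldply(xml.list[c("top","bottom")], cleanFun)[,1:5]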
     .id             des  id type      x
1    top            Ball 310    B  70.39
2    top   Called Strike 311    S 118.45
3    top   Called Strike 312    S  86.70
4    top In play, out(s) 313    X  79.83
5 bottom            Ball 335    B  15.45
6 bottom   Called Strike 336    S  77.25
7 bottom Swinging Strike 337    S  99.57
8 bottom            Ball 338    B 106.44
9 bottom In play, out(s) 339    X 134.76
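
Note that x[["atbat"]] picks up only the first atbat of each half-inning, which is why the output above has 4 and 5 rows. A minimal sketch of one way to cover every atbat is below. It uses base R rather than plyr, and it runs on xml.mock, a hypothetical and heavily trimmed stand-in for xml.list (the values are partly made up), so it does not need the original URL. pitchFrame and the atbat.no column are names introduced here for illustration, and the sketch assumes every pitch vector carries the same attributes.

```r
# Hypothetical stand-in for the xmlToList() result (assumption: same nesting
# as the structure printed in the question, with far fewer fields).
pitch <- function(des, id, type, x) c(des = des, id = id, type = type, x = x)
xml.mock <- list(
  top = list(
    atbat = list(pitch = pitch("Ball", "310", "B", "70.39"),
                 pitch = pitch("Called Strike", "311", "S", "118.45"),
                 .attrs = c(num = "1", des = "Lines out.")),
    atbat = list(pitch = pitch("Ball", "314", "B", "60.09"),
                 .attrs = c(num = "2", des = "Singles."))
  ),
  bottom = list(
    action = c(b = "0", s = "0", o = "0", des = "Pitching change."),
    atbat = list(pitch = pitch("Ball", "335", "B", "15.45"),
                 runner = c(id = "1", end = "1B"),
                 .attrs = c(num = "3", des = "Walks."))
  ),
  .attrs = c(num = "5")
)

# Build one data frame for a half-inning, covering every atbat in it.
pitchFrame <- function(half, half.name) {
  atbats <- half[names(half) == "atbat"]   # all atbats, not just the first
  rows <- lapply(seq_along(atbats), function(i) {
    pitches <- atbats[[i]][names(atbats[[i]]) == "pitch"]
    if (length(pitches) == 0) return(NULL)
    # unname() avoids duplicated "pitch" row names in the matrix
    data.frame(first = half.name, atbat.no = i,
               do.call(rbind, unname(pitches)),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, Filter(Negate(is.null), rows))
}

halves <- c("top", "bottom")
pitches.df <- do.call(rbind, lapply(halves, function(h) pitchFrame(xml.mock[[h]], h)))
pitches.df$x <- as.numeric(pitches.df$x)   # attributes arrive as character
print(pitches.df)
```

With the real xml.list in place of xml.mock the same call should apply, provided the pitch vectors all share the same names. If they do not, rbind.fill() from plyr is probably the safer way to stack them.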

硪扪都還晓 2024-09-20 02:25:14

The .id feature of ldply() is nice, but the .id columns seem to collide once you nest one ldply() inside another.

Here is a fairly general function that uses rbind.fill():

aho <- ldply(llply(xml.list[[1]], function(x) ldply(x, function(x) rbind.fill(data.frame(t(x))))))
> aho[1:5,1:4]
     .id                                                       des   id type
1  pitch                                                      Ball  310    B
2  pitch                                             Called Strike  311    S
3  pitch                                             Called Strike  312    S
4  pitch                                           In play, out(s)  313    X
5 .attrs Alexei Ramirez lines out to second baseman Ian Kinsler.   <NA> <NA>

The .id of the outer ldply() (which would read atbat) is missing because the inner result already has an .id column. We can work around this by giving the inner .id a different name, but it doesn't feel coherent.

aho2 <- ldply(llply(xml.list[[1]], function(x) {
  out <- ldply(x, function(x) rbind.fill(data.frame(t(x))))
  names(out)[1] <- ".id2"
  out
}))
> aho2[1:5,1:4]
    .id   .id2                                                       des   id
1 atbat  pitch                                                      Ball  310
2 atbat  pitch                                             Called Strike  311
3 atbat  pitch                                             Called Strike  312
4 atbat  pitch                                           In play, out(s)  313
5 atbat .attrs Alexei Ramirez lines out to second baseman Ian Kinsler.   <NA>
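
One way to sidestep the .id collision is to carry the names of the enclosing levels down explicitly while walking the list. The sketch below is base R, not plyr, and runs on xml.mock, a hypothetical trimmed stand-in for xml.list with made-up fields. collectLeaves is a name introduced here, and the code assumes the kept leaves always sit exactly three levels deep (half-inning, atbat, pitch) and share the same attributes.

```r
# Hypothetical stand-in for the xmlToList() result (assumption: nesting is
# half-inning -> atbat -> pitch, as in the structure printed in the question).
xml.mock <- list(
  top = list(
    atbat = list(
      pitch  = c(des = "Ball", id = "310", type = "B"),
      pitch  = c(des = "Called Strike", id = "311", type = "S"),
      .attrs = c(num = "1", des = "Lines out.")
    )
  ),
  bottom = list(
    action = c(des = "Pitching change.", pitch = "1"),
    atbat = list(
      pitch  = c(des = "Ball", id = "335", type = "B"),
      runner = c(id = "1", end = "1B")
    )
  )
)

# Walk the list recursively, remembering the names passed on the way down.
# Only list nodes are descended into, so an atomic vector such as `action`
# (which merely has an attribute named "pitch") is never mistaken for a pitch.
collectLeaves <- function(node, keep, path = character(0)) {
  rows <- list()
  for (i in seq_along(node)) {
    nm <- names(node)[i]
    child <- node[[i]]
    if (is.list(child)) {
      rows <- c(rows, collectLeaves(child, keep, c(path, nm)))
    } else if (identical(nm, keep)) {
      # assumes path has exactly two entries here (half-inning, atbat)
      rows[[length(rows) + 1]] <- data.frame(
        first = path[1], second = path[2], third = nm,
        t(child), stringsAsFactors = FALSE)
    }
  }
  rows
}

pitch.df <- do.call(rbind, collectLeaves(xml.mock, keep = "pitch"))
print(pitch.df)
```

Base rbind() needs identical columns in every row. If the real pitch vectors differ in their attributes, passing the list returned by collectLeaves to plyr's rbind.fill() instead should pad the gaps with NA.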
