Converting an uneven hierarchical list into a data frame
I don't think this has been asked yet, but is there a way to combine the information in a list with multiple levels and an uneven structure into a "long"-format data frame?
Specifically:
library(XML)
library(plyr)
xml.inning <- "http://gd2.mlb.com/components/game/mlb/year_2009/month_05/day_02/gid_2009_05_02_chamlb_texmlb_1/inning/inning_5.xml"
xml.parse <- xmlInternalTreeParse(xml.inning)
xml.list <- xmlToList(xml.parse)
## $top$atbat
## $top$atbat$pitch
## des id type x y
## "Ball" "310" "B" "70.39" "125.20"
The structure is as follows:
> llply(xml.list, function(x) llply(x, function(x) table(names(x))))
$top
$top$atbat
.attrs pitch
1 4
$top$atbat
.attrs pitch
1 4
$top$atbat
.attrs pitch
1 5
$bottom
$bottom$action
b des event o pitch player s
1 1 1 1 1 1 1
$bottom$atbat
.attrs pitch
1 5
$bottom$atbat
.attrs pitch
1 5
$bottom$atbat
.attrs pitch runner
1 5 1
$bottom$atbat
.attrs pitch runner
1 7 1
$.attrs
$.attrs$num
character(0)
$.attrs$away_team
character(0)
$.attrs$
What I'd like to have is a data frame built from the named vectors in the pitch category, along with the levels each one came from (top or bottom, atbat, pitch). That means ignoring levels that won't fit into a data.frame because they have a different number of columns. Something like this:
  first  second third des    x
1 top    atbat  pitch Ball   70.29
2 top    atbat  pitch Strike 69.24
3 bottom atbat  pitch Out    67.22
Is there an elegant way of doing this? Thanks!
Comments (2)
I don't know about elegant, but this works. Those more familiar with plyr could probably provide a more general solution.
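As a rough illustration of one way to walk the list (a sketch only, not necessarily the code this answer had in mind), the base R snippet below loops over the half-innings, keeps only the atbat entries, and turns each pitch vector into one row. The function name pitches_to_df and the small xml.list are hypothetical: the list only mimics the shape shown in the question, with made-up values so that it runs offline. It also assumes every pitch carries the same attribute names.

```r
# Hypothetical stand-in for the xmlToList() result in the question:
# same shape, made-up values, so the example runs without a network call.
xml.list <- list(
  top = list(
    atbat = list(
      pitch  = c(des = "Ball", id = "310", type = "B", x = "70.39", y = "125.20"),
      pitch  = c(des = "Called Strike", id = "311", type = "S", x = "69.24", y = "130.10"),
      .attrs = c(num = "1", batter = "100")
    )
  ),
  bottom = list(
    action = c(b = "0", s = "0", o = "1", des = "Pitching change",
               event = "Pitching Substitution", pitch = "3", player = "200"),
    atbat = list(
      pitch  = c(des = "In play, out(s)", id = "320", type = "X", x = "67.22", y = "140.00"),
      runner = c(id = "300", start = "", end = "1B"),
      .attrs = c(num = "2", batter = "101")
    )
  ),
  .attrs = c(num = "5", away_team = "cha")
)

# Hypothetical helper: collect every pitch under every atbat into one data frame.
pitches_to_df <- function(xml.list) {
  rows <- list()
  # Skip the top-level .attrs entry; what remains are the half-innings.
  for (half in setdiff(names(xml.list), ".attrs")) {
    events <- xml.list[[half]]
    for (i in seq_along(events)) {
      # Ignore anything that is not an atbat list (e.g. action vectors).
      if (names(events)[i] != "atbat" || !is.list(events[[i]])) next
      atbat   <- events[[i]]
      pitches <- atbat[names(atbat) == "pitch"]   # drops .attrs, runner, ...
      for (p in pitches) {
        rows[[length(rows) + 1]] <- data.frame(
          first = half, second = "atbat", third = "pitch",
          as.list(p), stringsAsFactors = FALSE
        )
      }
    }
  }
  # Assumes all rows share the same columns; returns NULL if no pitch was found.
  do.call(rbind, rows)
}

pitch.df <- pitches_to_df(xml.list)
print(pitch.df[, c("first", "second", "third", "des", "x")])
```

All columns come back as character, so something like as.numeric(pitch.df$x) would be needed before doing arithmetic on the coordinates.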
The .id feature of ldply() is nice, but the ids seem to overlap once you do another ldply().
Here is a fairly general function that uses rbind.fill():
The .id for the second ldply() is missing because we already had an .id. We could fix this by naming the first .id something different, but that doesn't seem coherent.