Efficiently locating group-wise constant columns in a data.frame
How can I efficiently extract group-wise constant columns from a data frame? I've included a plyr implementation below to make precise what I'm trying to do, but it's slow. How can I do it as efficiently as possible? (Ideally without splitting the data frame at all.)
library(plyr)  # provides ddply()
base <- data.frame(group = 1:1000, a = sample(1000), b = sample(1000))
df <- data.frame(
base[rep(seq_len(nrow(base)), length = 1e6), ],
c = runif(1e6),
d = runif(1e6)
)
is.constant <- function(x) length(unique(x)) == 1
constant_cols <- function(x) head(Filter(is.constant, x), 1)
system.time(constant <- ddply(df, "group", constant_cols))
# user system elapsed
# 20.531 1.670 22.378
stopifnot(identical(names(constant), c("group", "a", "b")))
stopifnot(nrow(constant) == 1000)
In my real use case (deep inside ggplot2) there may be an arbitrary number of constant and non-constant columns. The size of the data in the example is about the right order of magnitude.
Comments (6)
(Edited to possibly address the issue of consecutive groups with the same value)
I'm tentatively submitting this answer, but I haven't completely convinced myself that it correctly identifies within-group constant columns in all cases. It's definitely faster, though (and can probably be improved):
My basic idea was to use rle, obviously.
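The answer's code did not survive on this page. As a minimal sketch of the rle idea (not the original code): sort by the grouping column, count the runs in each column, and keep the columns whose run count equals the number of groups. The helper name rle_constant_cols and the toy data are hypothetical. The run count is only a heuristic: it misses a column such as f below, whose value repeats across adjacent groups, and it could in principle accept a non-constant column that happens to have the right number of runs.

```r
# Hypothetical helper: flags columns whose number of runs (after sorting
# by group) equals the number of groups. Assumes atomic columns, no NAs.
rle_constant_cols <- function(df, group = "group") {
  df <- df[order(df[[group]]), , drop = FALSE]
  n_groups <- length(unique(df[[group]]))
  n_runs <- vapply(df, function(x) length(rle(x)$lengths), integer(1))
  names(df)[n_runs == n_groups]
}

# Toy data: a is constant within each group, c is not,
# f is constant everywhere (so it has a single run).
toy <- data.frame(
  group = rep(1:3, each = 2),
  a = rep(c(10, 20, 30), each = 2),
  c = c(1, 2, 3, 4, 5, 6),
  f = 1
)

found <- rle_constant_cols(toy)
print(found)  # "group" "a"; f is constant within groups but is missed
```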
I'm not sure if this is exactly what you are looking for, but it identifies columns a and b.
(edit: better answer)
What about something like
is.constant <- function(x) length(which(x == x[1])) == length(x)
This seems to be a nice improvement. Compare the following.
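The comparison that followed is missing from this page. A rough way to reproduce it with base R only is sketched here; the names is_constant_unique and is_constant_first are hypothetical, and timings will vary by machine, so none are quoted. The two versions are not fully equivalent: for an all-NA column the unique() version returns TRUE, while the x == x[1] version returns FALSE because which() drops NA comparisons.

```r
# The question's test and the proposed alternative, under hypothetical names.
is_constant_unique <- function(x) length(unique(x)) == 1
is_constant_first  <- function(x) length(which(x == x[1])) == length(x)

x <- runif(1e6)  # a non-constant column of the size used in the question

# Repeat each call a few times so the difference is measurable.
t_unique <- system.time(for (i in 1:20) is_constant_unique(x))
t_first  <- system.time(for (i in 1:20) is_constant_first(x))
print(rbind(unique = t_unique, first = t_first))
```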
A bit slower than what hadley suggested above, but I think it should handle the case of equal adjacent groups:
The intuition is that if a column is constant groupwise then the breaks in the column values (sorted by the group value) will be a subset of the breaks in the group value.
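The intuition can be sketched as follows. This is an illustration under stated assumptions, not the answer's original code: the helper name breaks_constant_cols and the toy data are hypothetical, and the columns are assumed to be atomic with no missing values.

```r
# Hypothetical helper: a column is group-wise constant when every position
# where its value changes is also a position where the group changes.
breaks_constant_cols <- function(df, group = "group") {
  df <- df[order(df[[group]]), , drop = FALSE]
  n <- nrow(df)
  # TRUE wherever a value differs from the one in the previous row
  changed <- function(x) x[-1] != x[-n]
  group_breaks <- changed(df[[group]])
  ok <- vapply(
    df,
    function(x) !any(changed(x) & !group_breaks),
    logical(1)
  )
  names(df)[ok]
}

toy <- data.frame(
  group = rep(1:3, each = 2),
  a = rep(c(10, 20, 30), each = 2),
  c = c(1, 2, 3, 4, 5, 6),
  f = 1  # same value in adjacent groups
)

found <- breaks_constant_cols(toy)
print(found)  # "group" "a" "f"
```

Because a column with no breaks at all trivially satisfies the subset condition, f is reported as constant here.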
Now, compare it with hadley's (with a small modification to ensure n is defined):
Inspired by @Joran's answer, here's a similar strategy that's a little faster (1 s vs 1.5 s on my machine):
It has the same flaw though, in that it won't detect columns that have the same values for adjacent groups (e.g. df$f <- 1).
With a bit more thinking plus @David's ideas:
And that gives the correct result.
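The code for this answer is not preserved on this page. The sketch below is a reconstruction of the general approach under assumptions, not the original: it sorts once, marks where each column changes, keeps columns that only change where the group changes, and returns the first row of each group. The helper name constant_by_group is hypothetical, the data are the question's with an extra all-constant column f, the row count is scaled down from 1e6 to 1e5, and the fixed seed is only there to make the sketch reproducible.

```r
set.seed(1)    # assumption: fixed seed for reproducibility only
n_rows <- 1e5  # the question uses 1e6

base <- data.frame(group = 1:1000, a = sample(1000), b = sample(1000))
df <- data.frame(
  base[rep(seq_len(nrow(base)), length.out = n_rows), ],
  c = runif(n_rows),
  d = runif(n_rows),
  f = 1  # identical in every group, the case a run count misses
)

# Hypothetical helper. Assumes at least one row, atomic columns, no NAs.
constant_by_group <- function(df, group = "group") {
  df <- df[order(df[[group]]), , drop = FALSE]
  n <- nrow(df)
  # TRUE at the first row and wherever a value differs from the previous row
  changed <- function(x) c(TRUE, x[-1] != x[-n])
  group_changed <- changed(df[[group]])
  keep <- vapply(
    df,
    function(x) !any(changed(x) & !group_changed),
    logical(1)
  )
  # The first row of each group carries that group's constant values.
  out <- df[group_changed, keep, drop = FALSE]
  rownames(out) <- NULL
  out
}

constant <- constant_by_group(df)
stopifnot(identical(names(constant), c("group", "a", "b", "f")))
stopifnot(nrow(constant) == 1000)
```

The data frame is never split: the only expensive steps are one order() call and one vectorised comparison per column.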
How fast does is.unsorted(x) fail for non-constant x? Sadly I don't have access to R at the moment. It also seems that's not your bottleneck, though.
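One way to measure this once R is at hand is sketched below. Whether is.unsorted() returns early is left for the timings to show rather than asserted here. The helper is_constant_sorted is hypothetical and is only one guess at how is.unsorted() could serve as a constancy test: assuming no missing values, a vector that is sorted and whose first and last elements are equal must be constant.

```r
n <- 1e6
x_early <- c(3, 2, rep(3, n))  # out of order at the second element
x_late  <- c(rep(3, n), 2)     # out of order only at the last element

# Compare how long the check takes when the violation is early vs late.
t_early <- system.time(for (i in 1:100) is.unsorted(x_early))
t_late  <- system.time(for (i in 1:100) is.unsorted(x_late))
print(rbind(early = t_early, late = t_late))

# Hypothetical constancy test built on is.unsorted(); assumes no NAs.
is_constant_sorted <- function(x) {
  !is.unsorted(x) && x[1] == x[length(x)]
}
```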