在 data.frame 中有效地定位分组常量列

发布于 2024-12-22 23:16:15 字数 694 浏览 5 评论 0原文

如何有效地从数据框中提取分组常量列?我在下面包含了一个 plyr 实现,以精确说明我想要做的事情,但它很慢。我怎样才能尽可能高效地做到这一点? (理想情况下根本不分割数据框)。

base <- data.frame(group = 1:1000, a = sample(1000), b = sample(1000))
df <- data.frame(
  base[rep(seq_len(nrow(base)), length = 1e6), ], 
  c = runif(1e6), 
  d = runif(1e6)
)


is.constant <- function(x) length(unique(x)) == 1
constant_cols <- function(x) head(Filter(is.constant, x), 1)
system.time(constant <- ddply(df, "group", constant_cols))
#   user  system elapsed 
# 20.531   1.670  22.378 
stopifnot(identical(names(constant), c("group", "a", "b")))
stopifnot(nrow(constant) == 1000)

在我的实际用例中(ggplot2深处)可能有任意数量的常量和非常量列。示例中数据的大小大约是正确的数量级。

How can I efficiently extract group-wise constant columns from a data frame? I've included an plyr implementation below to make precise what I'm trying to do, but it's slow. How can I do it as efficiently as possible? (Ideally without splitting the data frame at all).

base <- data.frame(group = 1:1000, a = sample(1000), b = sample(1000))
df <- data.frame(
  base[rep(seq_len(nrow(base)), length = 1e6), ], 
  c = runif(1e6), 
  d = runif(1e6)
)


is.constant <- function(x) length(unique(x)) == 1
constant_cols <- function(x) head(Filter(is.constant, x), 1)
system.time(constant <- ddply(df, "group", constant_cols))
#   user  system elapsed 
# 20.531   1.670  22.378 
stopifnot(identical(names(constant), c("group", "a", "b")))
stopifnot(nrow(constant) == 1000)

In my real use case (deep inside ggplot2) there may be an arbitrary number of constant and non-constant columns. The size of the data in the example is about the right order of magnitude.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

思念绕指尖 2024-12-29 23:16:15

(编辑可能是为了解决具有相同值的连续组的问题)

我暂时提交了这个答案,但我还没有完全说服自己在所有情况下它都会在组常量列中正确识别。但它肯定更快(并且可能可以改进):

constant_cols1 <- function(df,grp){
    df <- df[order(df[,grp]),]

    #Adjust values based on max diff in data
    rle_group <- rle(df[,grp])
    vec <- rep(rep(c(0,ceiling(diff(range(df)))),
               length.out = length(rle_group$lengths)),
               times = rle_group$lengths)
    m <- matrix(vec,nrow = length(vec),ncol = ncol(df)-1)
    df_new <- df
    df_new[,-1] <- df[,-1] + m

    rles <- lapply(df_new,FUN = rle)
    nms <- names(rles)
    tmp <- sapply(rles[nms != grp],
                  FUN = function(x){identical(x$lengths,rles[[grp]]$lengths)})
    return(tmp)
}

显然,我的基本想法是使用 rle 。

(Edited to possibly address the issue of consecutive groups with the same value)

I'm tentatively submitting this answer, but I haven't completely convinced myself that it will correctly identify within group constant columns in all cases. But it's definitely faster (and can probably be improved):

constant_cols1 <- function(df,grp){
    df <- df[order(df[,grp]),]

    #Adjust values based on max diff in data
    rle_group <- rle(df[,grp])
    vec <- rep(rep(c(0,ceiling(diff(range(df)))),
               length.out = length(rle_group$lengths)),
               times = rle_group$lengths)
    m <- matrix(vec,nrow = length(vec),ncol = ncol(df)-1)
    df_new <- df
    df_new[,-1] <- df[,-1] + m

    rles <- lapply(df_new,FUN = rle)
    nms <- names(rles)
    tmp <- sapply(rles[nms != grp],
                  FUN = function(x){identical(x$lengths,rles[[grp]]$lengths)})
    return(tmp)
}

My basic idea was to use rle, obviously.

苍暮颜 2024-12-29 23:16:15

我不确定这是否正是您正在寻找的内容,但它标识了 a 列和 b 列。

require(data.table)
is.constant <- function(x) identical(var(x), 0)
dtOne <- data.table(df)
system.time({dtTwo <- dtOne[, lapply(.SD, is.constant), by=group]
result <- apply(X=dtTwo[, list(a, b, c, d)], 2, all)
result <- result[result == TRUE] })
stopifnot(identical(names(result), c("a", "b"))) 
result

I'm not sure if this is exactly what you are looking for, but it identifies columns a and b.

require(data.table)
is.constant <- function(x) identical(var(x), 0)
dtOne <- data.table(df)
system.time({dtTwo <- dtOne[, lapply(.SD, is.constant), by=group]
result <- apply(X=dtTwo[, list(a, b, c, d)], 2, all)
result <- result[result == TRUE] })
stopifnot(identical(names(result), c("a", "b"))) 
result
稚气少女 2024-12-29 23:16:15

(编辑:更好的答案)

is.constant<-function(x) length(which(x==x[1])) == length(x)

这样的东西怎么样?很好的改进。比较以下内容。

> a<-rnorm(5000000)

> system.time(is.constant(a))
   user  system elapsed 
  0.039   0.010   0.048 
> 
> system.time(is.constantOld(a))
   user  system elapsed 
  1.049   0.084   1.125 

(edit: better answer)

What about something like

is.constant<-function(x) length(which(x==x[1])) == length(x)

This seems to be a nice improvement. Compare the following.

> a<-rnorm(5000000)

> system.time(is.constant(a))
   user  system elapsed 
  0.039   0.010   0.048 
> 
> system.time(is.constantOld(a))
   user  system elapsed 
  1.049   0.084   1.125 
浅听莫相离 2024-12-29 23:16:15

比哈德利上面建议的要慢一点,但我认为它应该处理相等相邻组的情况

findBreaks <- function(x) cumsum(rle(x)$lengths)

constantGroups <- function(d, groupColIndex=1) {
  d <- d[order(d[, groupColIndex]), ]
  breaks <- lapply(d, findBreaks)
  groupBreaks <- breaks[[groupColIndex]]
  numBreaks <- length(groupBreaks)
  isSubset <- function(x) length(x) <= numBreaks && length(setdiff(x, groupBreaks)) == 0
  unlist(lapply(breaks[-groupColIndex], isSubset))
}

直觉是,如果列在分组上是恒定的,那么列值中的中断(按组值排序)将是以下的子集团体价值的中断。

现在,将其与 hadley 的进行比较(进行少量修改以确保定义 n)

# df defined as in the question

n <- nrow(df)
changed <- function(x) c(TRUE, x[-1] != x[-n])

constant_cols2 <- function(df,grp){
  df <- df[order(df[,grp]),]
  changes <- lapply(df, changed)
  vapply(changes[-1], identical, changes[[1]], FUN.VALUE = logical(1))
}

> system.time(constant_cols2(df, 1))
   user  system elapsed 
  1.779   0.075   1.869 
> system.time(constantGroups(df))
   user  system elapsed 
  2.503   0.126   2.614 
> df$f <- 1
> constant_cols2(df, 1)
    a     b     c     d     f 
 TRUE  TRUE FALSE FALSE FALSE 
> constantGroups(df)
    a     b     c     d     f 
 TRUE  TRUE FALSE FALSE  TRUE

A bit slower than what hadley suggested above, but I think it should handle the case of equal adjacent groups

findBreaks <- function(x) cumsum(rle(x)$lengths)

constantGroups <- function(d, groupColIndex=1) {
  d <- d[order(d[, groupColIndex]), ]
  breaks <- lapply(d, findBreaks)
  groupBreaks <- breaks[[groupColIndex]]
  numBreaks <- length(groupBreaks)
  isSubset <- function(x) length(x) <= numBreaks && length(setdiff(x, groupBreaks)) == 0
  unlist(lapply(breaks[-groupColIndex], isSubset))
}

The intuition is that if a column is constant groupwise then the breaks in the column values (sorted by the group value) will be a subset of the breaks in the group value.

Now, compare it with hadley's (with small modification to ensure n is defined)

# df defined as in the question

n <- nrow(df)
changed <- function(x) c(TRUE, x[-1] != x[-n])

constant_cols2 <- function(df,grp){
  df <- df[order(df[,grp]),]
  changes <- lapply(df, changed)
  vapply(changes[-1], identical, changes[[1]], FUN.VALUE = logical(1))
}

> system.time(constant_cols2(df, 1))
   user  system elapsed 
  1.779   0.075   1.869 
> system.time(constantGroups(df))
   user  system elapsed 
  2.503   0.126   2.614 
> df$f <- 1
> constant_cols2(df, 1)
    a     b     c     d     f 
 TRUE  TRUE FALSE FALSE FALSE 
> constantGroups(df)
    a     b     c     d     f 
 TRUE  TRUE FALSE FALSE  TRUE
紙鸢 2024-12-29 23:16:15

受@Joran's回答的启发,这里有一个类似的策略,速度更快一些(我的机器上是1秒与1.5秒),

changed <- function(x) c(TRUE, x[-1] != x[-n])

constant_cols2 <- function(df,grp){
  df <- df[order(df[,grp]),]
  n <- nrow(df)
  changes <- lapply(df, changed)

  vapply(changes[-1], identical, changes[[1]], FUN.VALUE = logical(1))
}
system.time(cols <- constant_cols2(df, "group")) # about 1 s

system.time(constant <- df[changed(df$group), cols])
#   user  system elapsed 
#  1.057   0.230   1.314 

stopifnot(identical(names(constant), c("group", "a", "b")))
stopifnot(nrow(constant) == 1000)

但它具有相同的缺陷,因为它不会检测到相邻组具有相同值的列(例如< code>df$f <- 1)

多一点思考再加上@David 的想法:

constant_cols3 <- function(df, grp) {
  # If col == TRUE and group == FALSE, not constant
  matching_breaks <- function(group, col) {
    !any(col & !group)
  }

  n <- nrow(df)
  changed <- function(x) c(TRUE, x[-1] != x[-n])

  df <- df[order(df[,grp]),]
  changes <- lapply(df, changed)
  vapply(changes[-1], matching_breaks, group = changes[[1]], 
    FUN.VALUE = logical(1))
}

system.time(x <- constant_cols3(df, "group"))
#   user  system elapsed 
#  1.086   0.221   1.413 

这会给出正确的结果。

Inspired by @Joran's answer, here's similar strategy that's a little faster (1 s vs 1.5 s on my machine)

changed <- function(x) c(TRUE, x[-1] != x[-n])

constant_cols2 <- function(df,grp){
  df <- df[order(df[,grp]),]
  n <- nrow(df)
  changes <- lapply(df, changed)

  vapply(changes[-1], identical, changes[[1]], FUN.VALUE = logical(1))
}
system.time(cols <- constant_cols2(df, "group")) # about 1 s

system.time(constant <- df[changed(df$group), cols])
#   user  system elapsed 
#  1.057   0.230   1.314 

stopifnot(identical(names(constant), c("group", "a", "b")))
stopifnot(nrow(constant) == 1000)

It has the same flaws though, in that it won't detect columns that are have the same values for adjacent groups (e.g. df$f <- 1)

With a bit more thinking plus @David's ideas:

constant_cols3 <- function(df, grp) {
  # If col == TRUE and group == FALSE, not constant
  matching_breaks <- function(group, col) {
    !any(col & !group)
  }

  n <- nrow(df)
  changed <- function(x) c(TRUE, x[-1] != x[-n])

  df <- df[order(df[,grp]),]
  changes <- lapply(df, changed)
  vapply(changes[-1], matching_breaks, group = changes[[1]], 
    FUN.VALUE = logical(1))
}

system.time(x <- constant_cols3(df, "group"))
#   user  system elapsed 
#  1.086   0.221   1.413 

And that gives the correct result.

豆芽 2024-12-29 23:16:15

对于非常数 x,is.unsorted(x) 失败的速度有多快?遗憾的是我目前无法访问 R。不过,这似乎也不是你的瓶颈。

How fast does is.unsorted(x) fail for non-constant x? Sadly I don't have access to R at the moment. Also seems that's not your bottleneck though.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文