在 R 中对两个大量逻辑向量进行交叉制表的最快方法

发布于 2025-01-03 06:55:19 字数 1689 浏览 5 评论 0原文

对于两个逻辑向量，x和y，长度> 1E8，计算 2x2 交叉表的最快方法是什么？

我怀疑答案是用 C/C++ 编写它，但我想知道 R 中是否有一些东西已经非常聪明地解决这个问题，因为它并不罕见。

示例代码，针对 300M 条目（如果 3E8 太大，请随意让 N = 1E8；我选择的总大小略低于 2.5GB (2.4GB)。我的目标密度为 0.02，只是为了使其更有趣（可以使用稀疏向量，如果有帮助，但类型转换可能需要时间）

set.seed(0)
N = 3E8
p = 0.02
x = sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
y = sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)

一些明显的方法：

table
bigtabulate
简单的逻辑运算（例如）。 sum(x & y))
向量乘法 (boo)
data.table
以上部分内容，以及来自 多核的 parallel 包（或新的 parallel 包）

我尝试了前三个选项（请参阅我的答案），但我觉得一定有更好更快的东西

。该table运行速度非常慢。 bigtabulate 对于一对逻辑向量来说似乎有点大材小用。最后，进行普通的逻辑运算似乎是一种拼凑，而且它对每个向量查看太多次（3X？7X？），更不用说这一点了。它在处理过程中会占用大量额外的内存，这会浪费大量时间。

向量乘法通常是一个坏主意，但是当向量稀疏时，将其存储起来然后使用向量乘法可能会获得优势。

当子样本足以用于创建交叉表时，为什么要使用如此多的数据？数据来自于两个变量的 TRUE 观察结果都非常罕见的情况。一个是数据异常的结果，另一个是由于代码中可能存在的错误（可能的错误是因为我们只看到计算结果 - 将变量 x 视为“Garbage In”，并且 因此，问题是代码导致的输出问题是否仅仅是数据异常的情况，还是还有其他一些好的数据变坏的情况？（这就是为什么我问关于遇到 NaN、NA 或 Inf 时停止。）

这也解释了为什么我的示例出现的概率很低正确值；这些的发生率确实远低于 0.1%，

这是否表明有不同的解决方案路径？是的：这表明我们可以使用两个索引（即每个集合中 TRUE 的位置）并且我避免了集合交集，因为 Matlab 会先对集合中的元素进行排序，然后再进行交集。（我隐约记得复杂性更令人尴尬：比如 O(n^2) 而不是 O(n log n)。）

那么我该如何在 R 中做到这一点？

原文

For two logical vectors, x and y, of length > 1E8, what is the fastest way to calculate the 2x2 cross tabulations?

I suspect the answer is to write it in C/C++, but I wonder if there is something in R that is already quite smart about this problem, as it's not uncommon.

Example code, for 300M entries (feel free to let N = 1E8 if 3E8 is too big; I chose a total size just under 2.5GB (2.4GB). I targeted a density of 0.02, just to make it more interesting (one could use a sparse vector, if that helps, but type conversion can take time).

set.seed(0)
N = 3E8
p = 0.02
x = sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
y = sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)

Some obvious methods:

table
bigtabulate
Simple logical operations (e.g. sum(x & y))
Vector multiplication (boo)
data.table
Some of the above, with parallel from the multicore package (or the new parallel package)

I've taken a stab at the first three options (see my answer), but I feel that there must be something better and faster.

I find that table works very slowly. bigtabulate seems like overkill for a pair of logical vectors. Finally, doing the vanilla logical operations seems like a kludge, and it looks at each vector too many times (3X? 7X?), not to mention that it fills a lot of additional memory during processing, which is a massive time waster.

Vector multiplication is usually a bad idea, but when the vector is sparse, one may get an advantage out of storing it as such, and then using vector multiplication.

Why use so much data when a sub-sample might be adequate for the purposes of creating a cross-tabulation? The data arises from cases where the TRUE observations are very rare, for both variables. One is a result of a data anomaly, the other due to a possible bug in code (possible bug because we only see the computational result - think of variable x as "Garbage In", and y as "Garbage Out". As a result, the question is whether the issues in the output caused by the code are solely those cases where the data is anomalous, or are there some other instances where good data goes bad? (This is why I asked a question about stopping when a NaN, NA, or Inf is encountered.)

That also explains why my example has a low probability for TRUE values; these really occur much less than 0.1% of the time.

Does this suggest a different solution path? Yes: it suggests that we may use two indices (i.e. the locations of TRUE in each set) and count set intersections. I avoided set intersections because I was burned awhile back by Matlab, which would first sort elements of a set before it does an intersection. (I vaguely recall the complexity was even more embarrassing: like O(n^2) instead of O(n log n).)

So how do I do it in R?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

转瞬即逝 2025-01-10 06:55:19

如果您要对巨大的逻辑向量进行大量操作，请查看位包。它通过将布尔值存储为真正的 1 位布尔值来节省大量内存。

这对table没有帮助；它实际上使情况变得更糟，因为由于位向量的构造方式，位向量中有更多唯一值。但它确实有助于逻辑比较。

# N <- 3e7
require(bit)
xb <- as.bit(x)
yb <- as.bit(y)
benchmark(replications = 1, order = "elapsed", 
    bit = {res <- func_logical(xb,yb)},
    logical = {res <- func_logical(x,y)}
)
#      test replications elapsed relative user.self sys.self user.child sys.child
# 1     bit            1   0.129  1.00000     0.132    0.000          0         0
# 2 logical            1   3.677 28.50388     2.684    0.928          0         0

If you're doing a lot of operations on huge logical vectors, take a look at the bit package. It saves a ton of memory by storing the booleans as true 1-bit booleans.

This doesn't help with table; it actually makes it worse because there are more unique values in the bit vector due to how it's constructed. But it really helps with logical comparisons.

# N <- 3e7
require(bit)
xb <- as.bit(x)
yb <- as.bit(y)
benchmark(replications = 1, order = "elapsed", 
    bit = {res <- func_logical(xb,yb)},
    logical = {res <- func_logical(x,y)}
)
#      test replications elapsed relative user.self sys.self user.child sys.child
# 1     bit            1   0.129  1.00000     0.132    0.000          0         0
# 2 logical            1   3.677 28.50388     2.684    0.928          0         0

回复收藏 0 原文

悸初 2025-01-10 06:55:19

这个答案给出了三种简单方法的计时，这是相信 table 很慢的基础。然而，要认识到的关键是“逻辑”方法的效率非常低。看看它在做什么：

4 个逻辑向量运算
4 个类型转换（逻辑到整数或 FP - 用于 sum）
4 个向量求和
8 个赋值（1 个用于逻辑运算，1 个用于求和）

不仅如此，但它甚至没有被编译或并行化。然而，它仍然击败了table。请注意，带有额外类型转换（1 * cbind...）的 bigtabulate 仍然优于 table。

以下是 N = 3E8 时逻辑方法、table 和 bigtabulate 的结果：

         test replications elapsed relative user.self sys.self
2     logical            1  23.861 1.000000     15.36     8.50
3 bigtabulate            1  36.477 1.528729     28.04     8.43
1       table            1 184.652 7.738653    150.61    33.99

在这种情况下，table 是一场灾难。

为了进行比较，这里是 N = 3E6：

         test replications elapsed relative user.self sys.self
2     logical            1   0.220 1.000000      0.14     0.08
3 bigtabulate            1   0.534 2.427273      0.45     0.08
1       table            1   1.956 8.890909      1.87     0.09

在这一点上，似乎编写自己的逻辑函数是最好的，尽管这滥用了 sum，并多次检查每个逻辑向量。我还没有尝试编译这些函数，但这应该会产生更好的结果。

更新 1 如果我们给出的 bigtabulate 值已经是整数，即如果我们在外部进行类型转换 1 * cbind(v1,v2) bigtabulate，则 N=3E6 的倍数是 1.80，而不是 2.4。相对于“逻辑”方法的N=3E8倍数仅为1.21，而不是1.53。

更新 2

正如 Joshua Ulrich 所指出的，转换为位向量是一项重大改进 - 我们分配和移动的数据要少得多：R 的逻辑向量每个条目消耗 4 个字节（“为什么？”），你可能会问...嗯，我不知道知道，但答案可能会出现在这里。），而位向量每个条目消耗一位，即 1/32 的数据量。因此，x 消耗 1.2e9 字节，而 xb（下面代码中的位版本）仅消耗 3.75e7 字节。

我从更新的基准测试中删除了 table 和 bigtabulate 变体 (N=3e8)。请注意，ticalB1 假定数据已经是位向量，而 ticalB2 是相同的操作，但会因类型转换而受到惩罚。由于我的逻辑向量是对其他数据进行运算的结果，因此我没有从位向量开始的好处。尽管如此，所支付的罚款相对较小。 [“逻辑3”系列仅执行3次逻辑运算，然后进行减法。由于是交叉制表，我们知道总数，正如 DWin 所说。]

        test replications elapsed  relative user.self sys.self
4 logical3B1            1   1.276  1.000000      1.11     0.17
2  logicalB1            1   1.768  1.385580      1.56     0.21
5 logical3B2            1   2.297  1.800157      2.15     0.14
3  logicalB2            1   2.782  2.180251      2.53     0.26
1    logical            1  22.953 17.988245     15.14     7.82

我们现在已将其速度加快到只需要 1.8-2.8 秒，尽管存在许多严重的低效率问题。毫无疑问，在不到 1 秒的时间内完成此操作应该是可行的，更改包括以下一项或多项：C 代码、编译和多核处理。毕竟，所有 3（或 4）个不同的逻辑运算都可以独立完成，尽管这仍然浪费计算周期。

最相似的最佳挑战者 logic3B2 的速度比 table 快约 80 倍。它比简单的逻辑运算快大约 10 倍。而且它还有很大的改进空间。

这是产生上述内容的代码。注意我建议注释掉一些操作或向量，除非您有大量 RAM - 创建 x、x1 和 < code>xb 以及相应的 y 对象将占用相当多的内存。

另请注意：我应该使用 1L 作为 bigtabulate 的整数乘数，而不仅仅是 1。在某个时候，我将重新运行此更改，并向任何使用 bigtabulate 方法的人推荐此更改。

library(rbenchmark)
library(bigtabulate)
library(bit)

set.seed(0)
N <- 3E8
p <- 0.02

x <- sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
y <- sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)

x1 <- 1*x
y1 <- 1*y

xb <- as.bit(x)
yb <- as.bit(y)

func_table  <- function(v1,v2){
    return(table(v1,v2))
}

func_logical  <- function(v1,v2){
    return(c(sum(v1 & v2), sum(v1 & !v2), sum(!v1 & v2), sum(!v1 & !v2)))
}

func_logicalB  <- function(v1,v2){
    v1B <- as.bit(v1)
    v2B <- as.bit(v2)
    return(c(sum(v1B & v2B), sum(v1B & !v2B), sum(!v1B & v2B), sum(!v1B & !v2B)))
}

func_bigtabulate    <- function(v1,v2){
    return(bigtabulate(1*cbind(v1,v2), ccols = c(1,2)))
}

func_bigtabulate2    <- function(v1,v2){
    return(bigtabulate(cbind(v1,v2), ccols = c(1,2)))
}

func_logical3   <- function(v1,v2){
    r1  <- sum(v1 & v2)
    r2  <- sum(v1 & !v2)
    r3  <- sum(!v1 & v2)
    r4  <- length(v1) - sum(c(r1, r2, r3))
    return(c(r1, r2, r3, r4))
}

func_logical3B   <- function(v1,v2){
    v1B <- as.bit(v1)
    v2B <- as.bit(v2)
    r1  <- sum(v1B & v2B)
    r2  <- sum(v1B & !v2B)
    r3  <- sum(!v1B & v2B)
    r4  <- length(v1) - sum(c(r1, r2, r3))
    return(c(r1, r2, r3, r4))
}

benchmark(replications = 1, order = "elapsed", 
    #table = {res <- func_table(x,y)},
    logical = {res <- func_logical(x,y)},
    logicalB1 = {res <- func_logical(xb,yb)},
    logicalB2 = {res <- func_logicalB(x,y)},

    logical3B1 = {res <- func_logical3(xb,yb)},
    logical3B2 = {res <- func_logical3B(x,y)}

    #bigtabulate = {res <- func_bigtabulate(x,y)},
    #bigtabulate2 = {res <- func_bigtabulate2(x1,y1)}
)

This answer gives timings on three naive methods, which is the basis for believing table is slow. Yet, the key thing to realize is that the "logical" method is grossly inefficient. Look at what it's doing:

4 logical vector operations
4 type conversions (logical to integer or FP - for sum)
4 vector summations
8 assignments (1 for the logical operation, 1 for the summation)

Not only that, but it's not even compiled or parallelized. Yet, it still beats the pants off of table. Notice that bigtabulate, with an extra type conversion (1 * cbind...) still beats table.

Here are results for the logical method, table, and bigtabulate, for N = 3E8:

         test replications elapsed relative user.self sys.self
2     logical            1  23.861 1.000000     15.36     8.50
3 bigtabulate            1  36.477 1.528729     28.04     8.43
1       table            1 184.652 7.738653    150.61    33.99

In this case, table is a disaster.

For comparison, here is N = 3E6:

         test replications elapsed relative user.self sys.self
2     logical            1   0.220 1.000000      0.14     0.08
3 bigtabulate            1   0.534 2.427273      0.45     0.08
1       table            1   1.956 8.890909      1.87     0.09

At this point, it seems that writing one's own logical functions is best, even though that abuses sum, and examines each logical vector multiple times. I've not yet tried compiling the functions, but that should yield better results.

Update 1 If we give bigtabulate values that are already integers, i.e. if we do the type conversion 1 * cbind(v1,v2) outside of bigtabulate, then the N=3E6 multiple is 1.80, instead of 2.4. The N=3E8 multiple relative to the "logical" method is only 1.21, instead of 1.53.

Update 2

As Joshua Ulrich has pointed out, converting to bit vectors is a significant improvement - we're allocating and moving around a LOT less data: R's logical vectors consume 4 bytes per entry ("Why?", you may ask... Well, I don't know, but an answer may turn up here.), whereas a bit vector consumes, well, one bit, per entry - i.e. 1/32 as much data. So, x consumes 1.2e9 bytes, while xb (the bit version in the code below) consumes only 3.75e7 bytes.

I've dropped table and the bigtabulate variations from the updated benchmarks (N=3e8). Note that logicalB1 assumes that the data is already a bit vector, while logicalB2 is the same operation with the penalty for type conversion. As my logical vectors are the results of operations on other data, I don't have the benefit of starting off with a bit vector. Nonetheless, the penalty to be paid is relatively small. [The "logical3" series only performs 3 logical operations, and then does a subtraction. Since it's cross-tabulation, we know the total, as DWin has remarked.]

        test replications elapsed  relative user.self sys.self
4 logical3B1            1   1.276  1.000000      1.11     0.17
2  logicalB1            1   1.768  1.385580      1.56     0.21
5 logical3B2            1   2.297  1.800157      2.15     0.14
3  logicalB2            1   2.782  2.180251      2.53     0.26
1    logical            1  22.953 17.988245     15.14     7.82

We've now sped this up to taking only 1.8-2.8 seconds, even with many gross inefficiencies. There is no doubt it should be feasible to do this in well under 1 second, with changes including one or more of: C code, compilation, and multicore processing. After all the 3 (or 4) different logical operations could be done independently, even though that's still a waste of compute cycles.

The most similar of the best challengers, logical3B2, is about 80X faster than table. It's about 10X faster than the naive logical operation. And it still has a lot of room for improvement.

Here is code to produce the above. NOTE I recommend commenting out some of the operations or vectors, unless you have a lot of RAM - the creation of x, x1, and xb, along with the corresponding y objects, will take up a fair bit of memory.

Also, note: I should have used 1L as the integer multiplier for bigtabulate, instead of just 1. At some point I will re-run with this change, and would recommend that change to anyone who uses the bigtabulate approach.

library(rbenchmark)
library(bigtabulate)
library(bit)

set.seed(0)
N <- 3E8
p <- 0.02

x <- sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
y <- sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)

x1 <- 1*x
y1 <- 1*y

xb <- as.bit(x)
yb <- as.bit(y)

func_table  <- function(v1,v2){
    return(table(v1,v2))
}

func_logical  <- function(v1,v2){
    return(c(sum(v1 & v2), sum(v1 & !v2), sum(!v1 & v2), sum(!v1 & !v2)))
}

func_logicalB  <- function(v1,v2){
    v1B <- as.bit(v1)
    v2B <- as.bit(v2)
    return(c(sum(v1B & v2B), sum(v1B & !v2B), sum(!v1B & v2B), sum(!v1B & !v2B)))
}

func_bigtabulate    <- function(v1,v2){
    return(bigtabulate(1*cbind(v1,v2), ccols = c(1,2)))
}

func_bigtabulate2    <- function(v1,v2){
    return(bigtabulate(cbind(v1,v2), ccols = c(1,2)))
}

func_logical3   <- function(v1,v2){
    r1  <- sum(v1 & v2)
    r2  <- sum(v1 & !v2)
    r3  <- sum(!v1 & v2)
    r4  <- length(v1) - sum(c(r1, r2, r3))
    return(c(r1, r2, r3, r4))
}

func_logical3B   <- function(v1,v2){
    v1B <- as.bit(v1)
    v2B <- as.bit(v2)
    r1  <- sum(v1B & v2B)
    r2  <- sum(v1B & !v2B)
    r3  <- sum(!v1B & v2B)
    r4  <- length(v1) - sum(c(r1, r2, r3))
    return(c(r1, r2, r3, r4))
}

benchmark(replications = 1, order = "elapsed", 
    #table = {res <- func_table(x,y)},
    logical = {res <- func_logical(x,y)},
    logicalB1 = {res <- func_logical(xb,yb)},
    logicalB2 = {res <- func_logicalB(x,y)},

    logical3B1 = {res <- func_logical3(xb,yb)},
    logical3B2 = {res <- func_logical3B(x,y)}

    #bigtabulate = {res <- func_bigtabulate(x,y)},
    #bigtabulate2 = {res <- func_bigtabulate2(x1,y1)}
)

回复收藏 0 原文

离去的眼神 2025-01-10 06:55:19

这是使用 Rcpp 糖的答案。

N <- 1e8
x <- sample(c(T,F),N,replace=T)
y <- sample(c(T,F),N,replace=T)

func_logical  <- function(v1,v2){
    return(c(sum(v1 & v2), sum(v1 & !v2), sum(!v1 & v2), sum(!v1 & !v2)))
}


library(Rcpp)
library(inline)

doCrossTab1 <- cxxfunction(signature(x="integer", y = "integer"), body='
  Rcpp::LogicalVector Vx(x);
  Rcpp::LogicalVector Vy(y);
  Rcpp::IntegerVector V(4);

  V[0] = sum(Vx*Vy);
  V[1] = sum(Vx*!Vy);
  V[2] = sum(!Vx*Vy);
  V[3] = sum(!Vx*!Vy);
  return( wrap(V));
  '
, plugin="Rcpp")

system.time(doCrossTab1(x,y))

require(bit)
system.time(
{
xb <- as.bit(x)
yb <- as.bit(y)
func_logical(xb,yb)
})

结果是：

> system.time(doCrossTab1(x,y))
   user  system elapsed 
  1.067   0.002   1.069 
> system.time(
+ {
+ xb <- as.bit(x)
+ yb <- as.bit(y)
+ func_logical(xb,yb)
+ })
   user  system elapsed 
  1.451   0.001   1.453

因此，我们可以在位包上获得一点速度，尽管我对时代的竞争如此激烈感到惊讶。

更新：为了纪念 Iterator，这里有一个 Rcpp 迭代器解决方案：

doCrossTab2 <- cxxfunction(signature(x="integer", y = "integer"), body='
  Rcpp::LogicalVector Vx(x);
  Rcpp::LogicalVector Vy(y);
  Rcpp::IntegerVector V(4);
    V[0]=V[1]=V[2]=V[3]=0;
  LogicalVector::iterator itx = Vx.begin();
  LogicalVector::iterator ity = Vy.begin();
  while(itx!=Vx.end()){
    V[0] += (*itx)*(*ity);
    V[1] += (*itx)*(!*ity);
    V[2] += (!*itx)*(*ity);
    V[3] += (!*itx)*(!*ity);    
    itx++;
    ity++;
  }
  return( wrap(V));
  '
, plugin="Rcpp")

system.time(doCrossTab2(x,y))
#   user  system elapsed 
#  0.780   0.001   0.782

Here is an answer using Rcpp sugar.

N <- 1e8
x <- sample(c(T,F),N,replace=T)
y <- sample(c(T,F),N,replace=T)

func_logical  <- function(v1,v2){
    return(c(sum(v1 & v2), sum(v1 & !v2), sum(!v1 & v2), sum(!v1 & !v2)))
}


library(Rcpp)
library(inline)

doCrossTab1 <- cxxfunction(signature(x="integer", y = "integer"), body='
  Rcpp::LogicalVector Vx(x);
  Rcpp::LogicalVector Vy(y);
  Rcpp::IntegerVector V(4);

  V[0] = sum(Vx*Vy);
  V[1] = sum(Vx*!Vy);
  V[2] = sum(!Vx*Vy);
  V[3] = sum(!Vx*!Vy);
  return( wrap(V));
  '
, plugin="Rcpp")

system.time(doCrossTab1(x,y))

require(bit)
system.time(
{
xb <- as.bit(x)
yb <- as.bit(y)
func_logical(xb,yb)
})

which results in:

> system.time(doCrossTab1(x,y))
   user  system elapsed 
  1.067   0.002   1.069 
> system.time(
+ {
+ xb <- as.bit(x)
+ yb <- as.bit(y)
+ func_logical(xb,yb)
+ })
   user  system elapsed 
  1.451   0.001   1.453

So, we can get a little speed up over the bit package, though I'm surprised at how competitive the times are.

Update: In honor of Iterator, here is a Rcpp iterator solution:

doCrossTab2 <- cxxfunction(signature(x="integer", y = "integer"), body='
  Rcpp::LogicalVector Vx(x);
  Rcpp::LogicalVector Vy(y);
  Rcpp::IntegerVector V(4);
    V[0]=V[1]=V[2]=V[3]=0;
  LogicalVector::iterator itx = Vx.begin();
  LogicalVector::iterator ity = Vy.begin();
  while(itx!=Vx.end()){
    V[0] += (*itx)*(*ity);
    V[1] += (*itx)*(!*ity);
    V[2] += (!*itx)*(*ity);
    V[3] += (!*itx)*(!*ity);    
    itx++;
    ity++;
  }
  return( wrap(V));
  '
, plugin="Rcpp")

system.time(doCrossTab2(x,y))
#   user  system elapsed 
#  0.780   0.001   0.782

回复收藏 0 原文

空袭的梦i 2025-01-10 06:55:19

另一种策略是考虑仅设置交集，使用 TRUE 值的索引，利用样本的偏差很大（即大多数为 FALSE）。

为此，我介绍了 func_find01 和使用 bit 包的翻译 (func_find01B)；上面答案中未出现的所有代码都粘贴在下面。

我重新运行了完整的 N=3e8 评估，除了忘记使用 func_find01B ；我在第二遍中重新运行了更快的方法来对抗它。

          test replications elapsed   relative user.self sys.self
6   logical3B1            1   1.298   1.000000      1.13     0.17
4    logicalB1            1   1.805   1.390601      1.57     0.23
7   logical3B2            1   2.317   1.785054      2.12     0.20
5    logicalB2            1   2.820   2.172573      2.53     0.29
2       find01            1   6.125   4.718798      4.24     1.88
9 bigtabulate2            1  22.823  17.583205     21.00     1.81
3      logical            1  23.800  18.335901     15.51     8.28
8  bigtabulate            1  27.674  21.320493     24.27     3.40
1        table            1 183.467 141.345917    149.01    34.41

只是“快速”方法：

        test replications elapsed relative user.self sys.self
3     find02            1   1.078 1.000000      1.03     0.04
6 logical3B1            1   1.312 1.217069      1.18     0.13
4  logicalB1            1   1.797 1.666976      1.58     0.22
2    find01B            1   2.104 1.951763      2.03     0.08
7 logical3B2            1   2.319 2.151206      2.13     0.19
5  logicalB2            1   2.817 2.613173      2.50     0.31
1     find01            1   6.143 5.698516      4.21     1.93

因此，find01B 是不使用预先转换的位向量的方法中最快的，但差距很小（2.099 秒与 2.327 秒）。 find02 来自哪里？我随后编写了一个使用预先计算的位向量的版本。这是现在最快的了。

一般来说，“指数法”方法的运行时间可能会受到边际和边际的影响。联合概率。我怀疑当概率更低时，它会特别有竞争力，但人们必须先验地或通过子样本知道这一点。

更新 1. 我还对 Josh O'Brien 的建议进行了计时，使用 tabulate() 而不是 table()。 12 秒后的结果约为 find01 的 2 倍，约为 bigtabulate2 的一半。现在最好的方法都接近1秒，这也相对较慢：

 user  system elapsed 
7.670   5.140  12.815

代码：

func_find01 <- function(v1, v2){
    ix1 <- which(v1 == TRUE)
    ix2 <- which(v2 == TRUE)

    len_ixJ <- sum(ix1 %in% ix2)
    len1    <- length(ix1)
    len2    <- length(ix2)
    return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
             length(v1) - len1 - len2 + len_ixJ))
}

func_find01B <- function(v1, v2){
    v1b = as.bit(v1)
    v2b = as.bit(v2)

    len_ixJ <- sum(v1b & v2b)
    len1 <- sum(v1b)
    len2 <- sum(v2b)

    return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
             length(v1) - len1 - len2 + len_ixJ))
}

func_find02 <- function(v1b, v2b){
    len_ixJ <- sum(v1b & v2b)
    len1 <- sum(v1b)
    len2 <- sum(v2b)

    return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
             length(v1b) - len1 - len2 + len_ixJ))
}

func_bigtabulate2    <- function(v1,v2){
    return(bigtabulate(cbind(v1,v2), ccols = c(1,2)))
}

func_tabulate01 <- function(v1,v2){
    return(tabulate(1L + 1L*x + 2L*y))
}

benchmark(replications = 1, order = "elapsed", 
    table = {res <- func_table(x,y)},
    find01  = {res <- func_find01(x,y)},
    find01B  = {res <- func_find01B(x,y)},
    find02  = {res <- func_find01B(xb,yb)},
    logical = {res <- func_logical(x,y)},
    logicalB1 = {res <- func_logical(xb,yb)},
    logicalB2 = {res <- func_logicalB(x,y)},

    logical3B1 = {res <- func_logical3(xb,yb)},
    logical3B2 = {res <- func_logical3B(x,y)},

    tabulate    = {res <- func_tabulate(x,y)},
    bigtabulate = {res <- func_bigtabulate(x,y)},
    bigtabulate2 = {res <- func_bigtabulate2(x1,y1)}
)

A different tactic is to consider just set intersections, using the indices of the TRUE values, taking advantage that the samples are very biased (i.e. mostly FALSE).

To that end, I introduce func_find01 and a translation that uses the bit package (func_find01B); all of the code that doesn't appear in the answer above is pasted below.

I re-ran the full N=3e8 evaluation, except forgot to use func_find01B; I reran the faster methods against it, in a second pass.

          test replications elapsed   relative user.self sys.self
6   logical3B1            1   1.298   1.000000      1.13     0.17
4    logicalB1            1   1.805   1.390601      1.57     0.23
7   logical3B2            1   2.317   1.785054      2.12     0.20
5    logicalB2            1   2.820   2.172573      2.53     0.29
2       find01            1   6.125   4.718798      4.24     1.88
9 bigtabulate2            1  22.823  17.583205     21.00     1.81
3      logical            1  23.800  18.335901     15.51     8.28
8  bigtabulate            1  27.674  21.320493     24.27     3.40
1        table            1 183.467 141.345917    149.01    34.41

Just the "fast" methods:

        test replications elapsed relative user.self sys.self
3     find02            1   1.078 1.000000      1.03     0.04
6 logical3B1            1   1.312 1.217069      1.18     0.13
4  logicalB1            1   1.797 1.666976      1.58     0.22
2    find01B            1   2.104 1.951763      2.03     0.08
7 logical3B2            1   2.319 2.151206      2.13     0.19
5  logicalB2            1   2.817 2.613173      2.50     0.31
1     find01            1   6.143 5.698516      4.21     1.93

So, find01B is fastest among methods that do not use pre-converted bit vectors, by a slim margin (2.099 seconds versus 2.327 seconds). Where did find02 come from? I subsequently wrote a version that uses pre-computed bit vectors. This is now the fastest.

In general, the running time of the "indices method" approach may be affected by the marginal & joint probabilities. I suspect that it would be especially competitive when the probabilities are even lower, but one has to know that a priori, or via a sub-sample.

Update 1. I've also timed Josh O'Brien's suggestion, using tabulate() instead of table(). The results, at 12 seconds elapsed, are about 2X find01 and about half of bigtabulate2. Now that the best methods are approaching 1 second, this is also relatively slow:

 user  system elapsed 
7.670   5.140  12.815

Code:

func_find01 <- function(v1, v2){
    ix1 <- which(v1 == TRUE)
    ix2 <- which(v2 == TRUE)

    len_ixJ <- sum(ix1 %in% ix2)
    len1    <- length(ix1)
    len2    <- length(ix2)
    return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
             length(v1) - len1 - len2 + len_ixJ))
}

func_find01B <- function(v1, v2){
    v1b = as.bit(v1)
    v2b = as.bit(v2)

    len_ixJ <- sum(v1b & v2b)
    len1 <- sum(v1b)
    len2 <- sum(v2b)

    return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
             length(v1) - len1 - len2 + len_ixJ))
}

func_find02 <- function(v1b, v2b){
    len_ixJ <- sum(v1b & v2b)
    len1 <- sum(v1b)
    len2 <- sum(v2b)

    return(c(len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
             length(v1b) - len1 - len2 + len_ixJ))
}

func_bigtabulate2    <- function(v1,v2){
    return(bigtabulate(cbind(v1,v2), ccols = c(1,2)))
}

func_tabulate01 <- function(v1,v2){
    return(tabulate(1L + 1L*x + 2L*y))
}

benchmark(replications = 1, order = "elapsed", 
    table = {res <- func_table(x,y)},
    find01  = {res <- func_find01(x,y)},
    find01B  = {res <- func_find01B(x,y)},
    find02  = {res <- func_find01B(xb,yb)},
    logical = {res <- func_logical(x,y)},
    logicalB1 = {res <- func_logical(xb,yb)},
    logicalB2 = {res <- func_logicalB(x,y)},

    logical3B1 = {res <- func_logical3(xb,yb)},
    logical3B2 = {res <- func_logical3B(x,y)},

    tabulate    = {res <- func_tabulate(x,y)},
    bigtabulate = {res <- func_bigtabulate(x,y)},
    bigtabulate2 = {res <- func_bigtabulate2(x1,y1)}
)

回复收藏 0 原文

静待花开 2025-01-10 06:55:19

这是 Rcpp 的答案，仅列出那些不都是 0 的条目。我怀疑一定有几种方法可以改进这一点，因为这异常缓慢；这也是我第一次尝试 Rcpp，因此移动数据可能会导致一些明显的低效率问题。我故意写了一个简单的例子，它应该让其他人演示如何改进它。

library(Rcpp)
library(inline)
doCrossTab <- cxxfunction(signature(x="integer", y = "integer"), body='
  Rcpp::IntegerVector Vx(x);
  Rcpp::IntegerVector Vy(y);
  Rcpp::IntegerVector V(3);
  for(int i = 0; i < Vx.length(); i++) {
    if( (Vx(i) == 1) & ( Vy(i) == 1) ){ V[0]++; } 
    else if( (Vx(i) == 1) & ( Vy(i) == 0) ){ V[1]++; } 
    else if( (Vx(i) == 0) & ( Vy(i) == 1) ){ V[2]++; } 
 }
  return( wrap(V));
  ', plugin="Rcpp")

N = 3E8 的计时结果：

   user  system elapsed 
 10.930   1.620  12.586

这需要比我的第二个答案中的 func_find01B 多 6 倍的时间。

Here's an answer with Rcpp, tabulating only those entries that are not both 0. I suspect there must be several ways to improve this, as this is unusually slow; it's also my first attempt with Rcpp, so there may be some obvious inefficiencies associated with moving the data around. I wrote an example that is purposefully plain vanilla, which should let others demonstrate how this can be improved.

library(Rcpp)
library(inline)
doCrossTab <- cxxfunction(signature(x="integer", y = "integer"), body='
  Rcpp::IntegerVector Vx(x);
  Rcpp::IntegerVector Vy(y);
  Rcpp::IntegerVector V(3);
  for(int i = 0; i < Vx.length(); i++) {
    if( (Vx(i) == 1) & ( Vy(i) == 1) ){ V[0]++; } 
    else if( (Vx(i) == 1) & ( Vy(i) == 0) ){ V[1]++; } 
    else if( (Vx(i) == 0) & ( Vy(i) == 1) ){ V[2]++; } 
 }
  return( wrap(V));
  ', plugin="Rcpp")

Timing results for N = 3E8:

   user  system elapsed 
 10.930   1.620  12.586

This takes more than 6X as long as func_find01B in my 2nd answer.

回复收藏 0 原文

橘和柠 2025-01-10 06:55:19

这是一个使用子集和三个求和来求解列联表的函数。结果向量的顺序与table中的顺序相同。

xtab.logical <- function(x, y) {
  n <- length(x)
  sx <- sum(x)
  sy <- sum(y)
  out <- integer(4)
  if (sx < n/2) {
    if (sy < n/2) {
      out[4] <- sum(if (sx < sy) y[x] else x[y])
      out[2] <- sx - out[4]
    } else {
      out[2] <- sum(if (sx < n - sy) !y[x] else x[!y])
      out[4] <- sx - out[2]
    }
    out[1] <- n - sy - out[2]
    out[3] <- sy - out[4]
  } else {
    if (sy < n/2) {
      out[3] <- sum(if (n - sx < sy) y[!x] else !x[y])
      out[1] <- n - sx - out[3]
    } else {
      out[1] <- sum(if (sx > sy) !y[!x] else !x[!y])
      out[3] <- n - sx - out[1]
    }
    out[4] <- sy - out[3]
    out[2] <- sx - out[4]
  }
  out
}

针对 @Iterator 的 func_find02 进行基准测试（修改为以与 table 相同的顺序返回值）：

func_find02 <- function(v1b, v2b){
  len_ixJ <- sum(v1b & v2b)
  len1 <- sum(v1b)
  len2 <- sum(v2b)
  
  c(length(v1b) - len1 - len2 + len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
    len_ixJ)
}

set.seed(0)
N <- 3E8
p <- 0.02

x <- sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
y <- sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)

bench::mark(
  xtab.logical = xtab.logical(x, y),
  func_find02 = func_find02(x, y),
  iterations = 10,
  relative = TRUE
)
#> # A tibble: 2 × 6
#>   expression     min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 xtab.logical  1      1         1.99      1.04     1.99
#> 2 func_find02   1.92   1.92      1         1        1

xtab.logic 的速度大约是前者的两倍。

如果知道x和y是稀疏的，我们可以使用which通过整数索引来获得更快的速度（尽管会消耗内存）。

xtab.logical2 <- function(x, y) {
  n <- length(x)
  ix <- which(x)
  iy <- which(y)
  sx <- length(ix)
  sy <- length(iy)
  out <- integer(4)
  out[4] <- sum(if (sx < sy) y[ix] else x[iy])
  out[2] <- sx - out[4]
  out[1] <- n - sy - out[2]
  out[3] <- sy - out[4]
  out
}

bench::mark(
  xtab.logical = xtab.logical(x, y),
  xtab.logical2 = xtab.logical2(x, y),
  func_find02 = func_find02(x, y),
  iterations = 10,
  relative = TRUE
)
#> # A tibble: 3 × 6
#>   expression      min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 xtab.logical   1.09   1.14      1.99      1.04     1.99
#> 2 xtab.logical2  1      1         2.30      2.06     4.60
#> 3 func_find02    2.09   2.26      1         1        1

Here's a function that uses subsetting and three sums to solve the contingency table. The order of the resulting vector is the same as from table.

xtab.logical <- function(x, y) {
  n <- length(x)
  sx <- sum(x)
  sy <- sum(y)
  out <- integer(4)
  if (sx < n/2) {
    if (sy < n/2) {
      out[4] <- sum(if (sx < sy) y[x] else x[y])
      out[2] <- sx - out[4]
    } else {
      out[2] <- sum(if (sx < n - sy) !y[x] else x[!y])
      out[4] <- sx - out[2]
    }
    out[1] <- n - sy - out[2]
    out[3] <- sy - out[4]
  } else {
    if (sy < n/2) {
      out[3] <- sum(if (n - sx < sy) y[!x] else !x[y])
      out[1] <- n - sx - out[3]
    } else {
      out[1] <- sum(if (sx > sy) !y[!x] else !x[!y])
      out[3] <- n - sx - out[1]
    }
    out[4] <- sy - out[3]
    out[2] <- sx - out[4]
  }
  out
}

Benchmarking against @Iterator's func_find02 (modified to return the values in the same order as table):

func_find02 <- function(v1b, v2b){
  len_ixJ <- sum(v1b & v2b)
  len1 <- sum(v1b)
  len2 <- sum(v2b)
  
  c(length(v1b) - len1 - len2 + len_ixJ, len1 - len_ixJ, len2 - len_ixJ,
    len_ixJ)
}

set.seed(0)
N <- 3E8
p <- 0.02

x <- sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)
y <- sample(c(TRUE, FALSE), N, prob = c(p, 1-p), replace = TRUE)

bench::mark(
  xtab.logical = xtab.logical(x, y),
  func_find02 = func_find02(x, y),
  iterations = 10,
  relative = TRUE
)
#> # A tibble: 2 × 6
#>   expression     min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 xtab.logical  1      1         1.99      1.04     1.99
#> 2 func_find02   1.92   1.92      1         1        1

xtab.logical is about twice as fast.

If it is known that x and y are sparse, we can get more speed with integer indexing using which (although at a memory cost).

xtab.logical2 <- function(x, y) {
  n <- length(x)
  ix <- which(x)
  iy <- which(y)
  sx <- length(ix)
  sy <- length(iy)
  out <- integer(4)
  out[4] <- sum(if (sx < sy) y[ix] else x[iy])
  out[2] <- sx - out[4]
  out[1] <- n - sy - out[2]
  out[3] <- sy - out[4]
  out
}

bench::mark(
  xtab.logical = xtab.logical(x, y),
  xtab.logical2 = xtab.logical2(x, y),
  func_find02 = func_find02(x, y),
  iterations = 10,
  relative = TRUE
)
#> # A tibble: 3 × 6
#>   expression      min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 xtab.logical   1.09   1.14      1.99      1.04     1.99
#> 2 xtab.logical2  1      1         2.30      2.06     4.60
#> 3 func_find02    2.09   2.26      1         1        1

回复收藏 0 原文

~没有更多了~