What is the fastest way to create multi-row group identifiers using data.table in R?

Posted on 2025-01-16 14:11:41

I have a dataframe that identifies a set of values with an id:

library(data.table)

dt <- data.table(
  id = rep(c("a", "b", "c"), each = 2),
  value1 = c(1, 1, 1, 2, 1, 1),
  value2 = c(0, 3, 0, 3, 0, 3)
)
dt
#>    id value1 value2
#> 1:  a      1      0
#> 2:  a      1      3
#> 3:  b      1      0
#> 4:  b      2      3
#> 5:  c      1      0
#> 6:  c      1      3

As you can see, the ids a and c both identify the same set of values. So I want to create a "pattern id" that identifies the set of values associated with the ids a and c (note: an id might identify more than two rows; I limited them to two here for simplicity).

I did manage to come up with a solution using nested data.tables and match():

dt <- dt[, .(data = list(.SD)), by = id]

unique_groups <- unique(dt$data)
dt[, pattern_id := match(data, unique_groups)]
dt[, data := NULL]

dt
#>    id pattern_id
#> 1:  a          1
#> 2:  b          2
#> 3:  c          1

It does the trick, but it is not as fast as I'd like. The match() documentation is pretty clear about its efficiency with lists:

Matching for lists is potentially very slow and best avoided except in simple cases.

As you can see, I don't need the actual pattern data in my final result, only a table that associates the ids with the pattern ids. Nesting the data, using it to match and then dropping it afterwards feels a bit wasteful, but I'm not sure there's a better way. I was thinking of something that transforms each data frame into a string or, even better, something that avoids the nesting altogether, but I couldn't come up with anything better than what I have now.

I have created a bigger dataset to play around with and test different solutions:

set.seed(0)
size <- 1000000
dt <- data.table(
  id = rep(1:(size / 2), each = 2),
  value1 = sample(1:10, size, replace = TRUE),
  value2 = sample(1:10, size, replace = TRUE)
)

酒与心事 2025-01-23 14:11:41

We can try the code below

dt[
    ,
    q := toString(unlist(.SD)), id
][
    ,
    pattern_id := .GRP, q
][
    ,
    q := NULL
][]

or

dt[
    ,
    q := toString(unlist(.SD)),
    id
][
    ,
    pattern_id := as.integer(factor(match(q, q)))
][
    ,
    q := NULL
][]

which gives

   id value1 value2 pattern_id
1:  a      1      0          1
2:  a      1      3          1
3:  b      1      0          2
4:  b      2      3          2
5:  c      1      0          1
6:  c      1      3          1
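
If only the id-to-pattern table from the question is needed, the per-row result above can be collapsed afterwards. This is a minimal sketch on the toy data from the question; the name `lookup` is only illustrative and not part of the original answer.

```r
library(data.table)

# Toy data from the question
dt <- data.table(
  id = rep(c("a", "b", "c"), each = 2),
  value1 = c(1, 1, 1, 2, 1, 1),
  value2 = c(0, 3, 0, 3, 0, 3)
)

# Build one string key per id, number the distinct keys, drop the key
dt[, q := toString(unlist(.SD)), by = id]
dt[, pattern_id := .GRP, by = q]
dt[, q := NULL]

# Collapse to one row per id (hypothetical name `lookup`)
lookup <- unique(dt[, .(id, pattern_id)])
lookup
```

Here `lookup` has one row per id, with a and c sharing a pattern id and b getting its own.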
夜访吸血鬼 2025-01-23 14:11:41

Updated (to remove join):

This one replicates your approach (i.e. it requires that the order is the same as well as the values)

unique(
  dt[, pattern:=.(paste0(c(value1,value2), collapse=",")), by=id][,.(id,pattern)]
)[,grp:=.GRP, by=pattern][,pattern:=NULL]

       id   grp
   <char> <int>
1:      a     1
2:      b     2
3:      c     1

Prior solution:

library(magrittr)  # provides %>%

dt[dt[, .(paste0(sort(c(value1,value2)), collapse=",")), by=id] %>% 
     .[,pattern:=.GRP, by=V1] %>% 
     .[,V1:=NULL], on=.(id)]

Output:

       id value1 value2 pattern
   <char>  <num>  <num>   <int>
1:      a      1      0       1
2:      a      1      3       1
3:      b      1      0       2
4:      b      2      3       2
5:      c      1      0       1
6:      c      1      3       1
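
One thing worth checking with the sort-based version is that sort(c(value1, value2)) pools the values of both columns, so two ids with different rows but the same overall values get the same pattern. If the goal is to ignore row order while still keeping the columns apart, a possible sketch (my assumption about the intent, with made-up data and the illustrative names `pooled` and `by_rows`) is to order the rows first and paste row by row:

```r
library(data.table)

# Made-up data: a and b hold the same rows in a different order;
# c holds different rows that happen to contain the same values overall
dt <- data.table(
  id = rep(c("a", "b", "c"), each = 2),
  value1 = c(1, 2, 2, 1, 2, 1),
  value2 = c(2, 1, 1, 2, 2, 1)
)

# Pooled key: sorts all values of both columns together
pooled <- dt[, .(key = paste(sort(c(value1, value2)), collapse = ",")), by = id]

# Row-wise key: order rows within each id, then paste each row with separators
by_rows <- dt[order(id, value1, value2),
              .(key = paste(value1, value2, sep = "|", collapse = ";")),
              by = id]
by_rows[, pattern := .GRP, by = key]

pooled
by_rows
```

With this data the pooled key is identical for all three ids, while the row-wise key groups a with b and keeps c separate.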
于我来说 2025-01-23 14:11:41

With toString, as suggested by the data.table error message you get when using a list column in by:

Column or expression 1 of 'by' is type 'list' which is not currently supported.
As a workaround, consider converting the column to a supported type, e.g. by=sapply(list_col, toString)

dt <- dt[, .(data = list(.SD)), by = id]                       # nest each group's rows
dt[, pattern_id := .GRP, by = sapply(data, toString)]         # group on a string key
dt[, unlist(data, recursive = FALSE), by = .(id, pattern_id)]  # unnest again

       id pattern_id value1 value2
   <char>      <int>  <num>  <num>
1:      a          1      1      0
2:      a          1      1      3
3:      b          2      1      0
4:      b          2      2      3
5:      c          1      1      0
6:      c          1      1      3

However, this is slower than match.

此刻的回忆 2025-01-23 14:11:41

Assuming each id appears exactly twice, reshape each 2x2 group into a single row of four columns. Then get the group ID with .GRP by grouping on all columns except id:

res <- dt[, c(.SD[ 1 ], .SD[ 2 ]), by = id]
setnames(res, make.unique(colnames(res)))
res[, pattern_id := .GRP, by = res[, -1] ][, .(id, pattern_id)]
#             id pattern_id
#      1:      1          1
#      2:      2          2
#      3:      3          3
#      4:      4          4
#      5:      5          5
#    ---                  
# 499996: 499996       1010
# 499997: 499997       3175
# 499998: 499998       3996
# 499999: 499999       3653
# 500000: 500000       4217

On the bigger dataset this takes about half a second.

Edit: another version uses dcast, but it is 8x slower:

res <- dcast(dt, id ~ value1 + value2, length)
res[, pattern_id :=.GRP, by = res[, -1] ][, .(id, pattern_id)]
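
If ids can have different numbers of rows, the same reshape idea can be sketched with dcast() and rowid(), which numbers the rows within each id; shorter groups are padded with NA. This is an untimed sketch on made-up data, not a claim about speed, and `wide` and `value_cols` are illustrative names.

```r
library(data.table)

# Made-up data: id "d" has only one row
dt <- data.table(
  id = c("a", "a", "b", "b", "c", "c", "d"),
  value1 = c(1, 1, 1, 2, 1, 1, 1),
  value2 = c(0, 3, 0, 3, 0, 3, 0)
)

# One row per id; columns are value1_1, value1_2, value2_1, value2_2
wide <- dcast(dt, id ~ rowid(id), value.var = c("value1", "value2"))

# Group on every column except id and number the groups
value_cols <- setdiff(names(wide), "id")
wide[, pattern_id := .GRP, by = value_cols]

wide[, .(id, pattern_id)]
```

Note that this compares rows by position within each id, so it is order-sensitive like the original approach.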
月牙弯弯 2025-01-23 14:11:41

Here are some benchmarks of the solutions that don't rely on each id identifying exactly two rows; the results are posted below.

library(data.table)

set.seed(0)
size <- 500000
dt <- data.table(
  id = rep(1:(size / 2), each = 2),
  value1 = sample(1:10, size, replace = TRUE),
  value2 = sample(1:10, size, replace = TRUE)
)

my_solution <- function(x) {
  x <- x[, .(data = list(.SD)), by = id]

  unique_groups <- unique(x$data)
  x[, pattern_id := match(data, unique_groups)]
  x[, data := NULL]
  x[]
}

langtang_solution <- function(x) {
  x <- x[, .(data = paste0(value1, "|", value2, collapse = ";")), by = id]
  x[, pattern_id := .GRP, by = data]
  x[, data := NULL]
  x[]
}

thomasiscoding_solution <- function(x) {
  x <- x[, .(data = toString(unlist(.SD))), by = id]
  x[, pattern_id := .GRP, by = data]
  x[, data := NULL]
  x[]
}

identical(my_solution(dt), langtang_solution(dt))
#> [1] TRUE
identical(my_solution(dt), thomasiscoding_solution(dt))
#> [1] TRUE

microbenchmark::microbenchmark(
  my_solution(dt),
  langtang_solution(dt),
  thomasiscoding_solution(dt),
  times = 50L
)
#> Unit: seconds
#>                         expr      min       lq     mean   median       uq
#>              my_solution(dt) 3.174106 3.566495 3.818829 3.793850 4.015176
#>        langtang_solution(dt) 1.369860 1.467013 1.596558 1.529327 1.649607
#>  thomasiscoding_solution(dt) 3.014511 3.154224 3.280713 3.256732 3.370015
#>       max neval
#>  4.525275    50
#>  2.279064    50
#>  3.681657    50

This was very enriching. I didn't know about .GRP, which in my tests performs very similarly to match(), although a (very small) bit better. The best answer seems to be using paste() to convert each group into a string and then finding the group based on that string.

忆梦 2025-01-23 14:11:41

How about reshaping wider and using paste0()?

library(dplyr)
library(tidyr)

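dt <- dt %>% group_by(id) %>%
  mutate(inst = row_number()) %>%   # position of each row within its id
  pivot_wider(values_from = c(value1, value2),
              names_from = inst) %>% 
  ungroup() %>%
  # separators keep the key unambiguous when values have several digits
  mutate(pattern_id = paste0(value1_1, "_", value1_2, "_", value2_1, "_", value2_2))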
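
A small base R check of why a separator matters when values are pasted into a key: without one, different value sequences can collapse to the same string. The vectors `x` and `y` below are made up for illustration.

```r
# Two different sequences of values
x <- c(1, 11)
y <- c(11, 1)

# Pasted without a separator, both become the same string
no_sep <- c(paste0(x, collapse = ""), paste0(y, collapse = ""))

# With a separator, the keys stay distinct
with_sep <- c(paste(x, collapse = "_"), paste(y, collapse = "_"))

no_sep[1] == no_sep[2]      # TRUE: the keys collide
with_sep[1] == with_sep[2]  # FALSE: the keys differ
```

Any separator that cannot appear inside the values themselves should do.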