What is the fastest way to create a multi-row group identifier with data.table in R?
I have a dataframe that identifies a set of values with an id:
library(data.table)
dt <- data.table(
id = rep(c("a", "b", "c"), each = 2),
value1 = c(1, 1, 1, 2, 1, 1),
value2 = c(0, 3, 0, 3, 0, 3)
)
dt
#> id value1 value2
#> 1: a 1 0
#> 2: a 1 3
#> 3: b 1 0
#> 4: b 2 3
#> 5: c 1 0
#> 6: c 1 3
As you can see, the ids a and c both identify the same set of values. So I want to create a "pattern id" that identifies the set of values associated with the ids a and c (note: an id might identify more than two rows; I limited them to two rows here for the sake of simplicity).
I did manage to come up with a solution using nested data.tables and match():
dt <- dt[, .(data = list(.SD)), by = id]
unique_groups <- unique(dt$data)
dt[, pattern_id := match(data, unique_groups)]
dt[, data := NULL]
dt
#> id pattern_id
#> 1: a 1
#> 2: b 2
#> 3: c 1
It does the trick, but it is not as fast as I'd like it to be. The match() documentation is pretty clear regarding its efficiency with lists:
Matching for lists is potentially very slow and best avoided except in simple cases.
As you can see, I don't need the actual pattern data in my final result, only a table that associates the ids with the pattern ids. Nesting the data, using it to match, and then dropping it afterwards feels a bit wasteful, but I'm not sure if there's a better way. I was thinking of something that transforms each data frame into a string or, even better, something that avoids the nesting altogether, but I couldn't come up with anything better than what I have now.
I have created a bigger dataset to play around with and test different solutions:
set.seed(0)
size <- 1000000
dt <- data.table(
id = rep(1:(size / 2), each = 2),
value1 = sample(1:10, size, replace = TRUE),
value2 = sample(1:10, size, replace = TRUE)
)
We can try the code below
or
which gives
Updated (to remove join):
This one replicates your approach (i.e. it requires that the order is the same as well as the values)
Prior solution:
Output:
With toString, as suggested by the data.table error message when using a list as by:
However, this is slower than match.
Assuming each id is repeated twice, "reshape" - convert 2x2 into 1x4 columns. Then get group ID using .GRP by grouping by all columns excluding id:
Using the bigger dataset takes about half a second.
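The answer's original code is not shown here, so the following is only a minimal sketch of the reshape-then-.GRP idea, under the same assumption that every id has exactly two rows. The column names v1_1, v1_2, v2_1 and v2_2 are made up for illustration.

```r
library(data.table)

dt <- data.table(
  id = rep(c("a", "b", "c"), each = 2),
  value1 = c(1, 1, 1, 2, 1, 1),
  value2 = c(0, 3, 0, 3, 0, 3)
)

# Assumption: every id has exactly two rows, and row order matters.
# Collapse each id's 2x2 block into one row with four columns
# (v1_1 ... v2_2 are illustrative names, not from the original answer).
wide <- dt[, .(
  v1_1 = value1[1], v1_2 = value1[2],
  v2_1 = value2[1], v2_2 = value2[2]
), by = id]

# .GRP numbers each distinct combination of the four value columns,
# in order of first appearance.
wide[, pattern_id := .GRP, by = .(v1_1, v1_2, v2_1, v2_2)]

res <- wide[, .(id, pattern_id)]
res$pattern_id
# 1 2 1
```

Because the wide layout hard-codes two rows per id, this sketch does not generalize to ids with a varying number of rows.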
Edit: another version using dcast, but it is 8x slower:
Here are some benchmarks of the solutions that do not rely on each id identifying exactly two rows; I'm posting the results below.
This was very enriching. I didn't know about .GRP, which in my tests performs very similarly to match(), although a (very small) bit better. The best answer seems to be using paste() to convert the group into a string and then finding the group based on that string.
How about reshaping wider and using paste0()?
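As a rough illustration of the string-key idea discussed in this thread (not the code from any of the answers, which is not shown here), each id's rows can be collapsed into a single string and the strings matched instead of nested tables. The separators "_" and "|" are assumptions; they must not occur in the data, otherwise different groups could produce the same key.

```r
library(data.table)

dt <- data.table(
  id = rep(c("a", "b", "c"), each = 2),
  value1 = c(1, 1, 1, 2, 1, 1),
  value2 = c(0, 3, 0, 3, 0, 3)
)

# Build one string key per id: "_" joins the columns within a row,
# "|" joins the rows of the group. Works for any number of rows per id.
# Like the nested-table approach, this is sensitive to row order.
keys <- dt[, .(key = paste(value1, value2, sep = "_", collapse = "|")),
           by = id]

# match() now compares plain character strings rather than list elements.
keys[, pattern_id := match(key, unique(key))]

res <- keys[, .(id, pattern_id)]
res$pattern_id
# 1 2 1
```

For the two-row example, the keys are "1_0|1_3" for a and c and "1_0|2_3" for b. With non-integer numeric columns, the keys depend on how paste() formats the numbers, so values that differ only beyond the printed precision would be treated as equal.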