Merging data frames in R on pre-sorted columns?

Posted on 2024-12-12 21:31:08

I usually work with large data frames that are already well sorted (or can easily be sorted).

Given two data frames, both sorted by 'user':

some.data <user> <data_1> <data_2>
user <user> <user_attr_1> <user_attr_2>

when I run m = merge(some.data, user), I get this result:

m = <user> <data_1> <data_2> <user_attr_1> <user_attr_2>

That is the result I want.

However, merge() does not take advantage of the data frames being sorted on the common column, which makes the merge quite CPU- and memory-heavy. With sorted inputs, the merge could be done in O(n).

Is there a way in R to merge sorted datasets efficiently?
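
To make the O(n) claim concrete, here is a minimal sketch of a single-pass lookup over two sorted key vectors. It assumes the keys in `user` are unique while `some.data` may repeat a key; the function name `sorted_lookup` and the example data are hypothetical, and an interpreted R loop like this is meant only to illustrate the idea, not to be fast.

```r
# Single pass over two sorted key vectors.
# Assumptions: x is sorted (keys may repeat), y is sorted with unique keys.
# Returns, for each element of x, its position in y (NA if absent).
sorted_lookup <- function(x, y) {
  idx <- rep(NA_integer_, length(x))
  j <- 1L
  m <- length(y)
  for (i in seq_along(x)) {
    # advance j until y[j] is no longer smaller than x[i]; j never moves back
    while (j <= m && y[j] < x[i]) j <- j + 1L
    if (j <= m && y[j] == x[i]) idx[i] <- j
  }
  idx
}

# Hypothetical example data, both sorted by `user`
some.data <- data.frame(user = c(1, 1, 2, 4), data_1 = c(10, 11, 20, 40))
user <- data.frame(user = c(1, 3, 4), user_attr_1 = c("a", "c", "d"))

idx <- sorted_lookup(some.data$user, user$user)
keep <- !is.na(idx)

# Inner join: keep matched rows and append the attribute columns
# (assumes the key is the first column of `user`)
m <- cbind(some.data[keep, , drop = FALSE],
           user[idx[keep], -1, drop = FALSE])
```

Both indices only move forward, so the loop does work proportional to the combined length of the two key vectors.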

Comments (2)

无边思念无边月 2024-12-19 21:31:09

I don't have any experience with it, but as far as I know, this is one of the problems the data.table package was designed to address.

For most practical purposes, data.table = data.frame + index. As a consequence, when used correctly, it improves the performance of quite a few operations on large data.

There is a risk that turning your data.frame into a data.table (i.e. adding the index) takes some time (although I expect this to be well optimized), but once that is done, functions like merge can use the index for better performance.
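
As a rough illustration of that idea, here is a minimal sketch of a keyed merge with data.table. The example tables and their values are made up; only the column name `user` comes from the question. setkey() sorts each table by reference on the key column and marks it as sorted, so the subsequent join can rely on that order.

```r
library(data.table)

# Hypothetical example data; values are assumptions for illustration
some.data <- data.table(user = c(1L, 1L, 2L, 3L), data_1 = c(10, 11, 20, 30))
user <- data.table(user = 1:3, user_attr_1 = c("a", "b", "c"))

# Set the key (the "index") on the common column, by reference
setkey(some.data, user)
setkey(user, user)

# merge() dispatches to data.table's method and joins on the shared key
m <- merge(some.data, user)

# Keyed join syntax: look up each row of some.data in user
m2 <- user[some.data]
```

Here `m` has the columns user, data_1, user_attr_1, while `m2` holds the same rows with the columns of `user` first.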

难如初 2024-12-19 21:31:09

If your set of common keys/indexes is totally overlapping, that is...

Reduce(`&`, user$user.id %in% some.data$user.id)

...returns TRUE, and they are, as you said, sorted, and there are no duplicate keys, then your merging problem is reduced to adding columns to a data.frame. Something along these lines...

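library(log4r)

# Assumption: my.logger is not defined in the original post;
# a console logger at INFO level is created here.
my.logger <- logger("INFO")

# Baseline: time the regular merge()
t1 <- system.time(z <- merge(user, some.data, by='user.id'))
info(my.logger, paste('Elapsed time with merge():', t1['elapsed']))

# Alternative: keys line up row for row, so just add the columns
t2 <- Sys.time()
r <- data.frame(user.id=user$user.id, V1.x=user$V1, V2.x=user$V2)
r[,names(some.data)] <- some.data[,names(some.data)]
t3 <- Sys.time()
info(my.logger, paste('Elapsed time without:', as.numeric(difftime(t3, t2, units='secs'))))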

If the assumptions above do not hold, it gets slightly messier (set union of both key sets, a translation function, NA padding), but the assumptions of sorted, fully overlapping keys alone get you a long way.

Note also that the timing of the second approach is biased, since it calls Sys.time() twice, unlike the merge() timing, which calls system.time() only once.
(Excuse my lame usage of S.O. mark-up)
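
For the messier case where the key sets do not fully overlap, the steps mentioned above (key union, translation, NA padding) could be sketched as follows. This is only an illustration under stated assumptions: keys are unique within each table, the example data is made up, and base R's match() stands in for the "translation function". match() does not exploit the sort order, so this is not a true sorted merge.

```r
# Hypothetical example data with partially overlapping keys
user <- data.frame(user.id = c(1L, 2L, 4L), V1 = c("a", "b", "d"),
                   stringsAsFactors = FALSE)
some.data <- data.frame(user.id = c(2L, 3L, 4L), V2 = c(20, 30, 40))

# Union of both key sets, kept sorted
keys <- sort(union(user$user.id, some.data$user.id))

# "Translation": row position of each key in each table, NA where absent
iu <- match(keys, user$user.id)
id <- match(keys, some.data$user.id)

# Indexing with NA yields NA, which provides the padding
r <- data.frame(user.id = keys,
                V1 = user$V1[iu],
                V2 = some.data$V2[id],
                stringsAsFactors = FALSE)

# Reference result: full outer join with merge()
m <- merge(user, some.data, by = "user.id", all = TRUE)
```

With these inputs, `r` and `m` contain the same rows, with NA wherever a key is missing from one of the tables.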
