Merging data frames in R on a pre-sorted column?
I usually work with large data frames that are already sorted (or can easily be sorted).
Given two data frames, both sorted by 'user':
some.data <user> <data_1> <data_2>
user <user> <user_attr_1> <user_attr_2>
when I run m = merge(some.data, user), I get this result:
m = <user> <data_1> <data_2> <user_attr_1> <user_attr_2>
That is the result I want.
However, merge() does not take advantage of both data frames being sorted on the common column, which makes the merge quite CPU- and memory-heavy, even though merging two inputs that are sorted on the key could be done in O(n).
Is there a way in R to perform an efficient merge on sorted datasets?
Comments (2)
I don't have any experience with it, but as far as I know, this is one of the issues that the data.table package was designed to improve. For most practical purposes, data.table = data.frame + index. As a consequence, when used correctly, it improves the performance of quite a few operations on large data.
There is a risk that turning your data.frame into a data.table (i.e. adding the index) could take some time (although I expect this to be well optimized), but once that is done, functions like merge can use the index for better performance.
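For the data.table suggestion, a minimal sketch might look like the following. It assumes the data.table package is installed; the example values are hypothetical, and only the column names come from the question. setkey() sorts each table by the key column where needed and records it as the key, and merge() on two data.tables can then join on that column.

```r
library(data.table)

# Hypothetical example data, both already sorted by `user`
some.data <- data.table(
  user   = c(1L, 1L, 2L, 3L),
  data_1 = c(10, 11, 20, 30),
  data_2 = c("a", "b", "c", "d")
)
user <- data.table(
  user        = 1:3,
  user_attr_1 = c("x", "y", "z"),
  user_attr_2 = c(TRUE, FALSE, TRUE)
)

# Declare `user` as the key of both tables (this is the "index" step)
setkey(some.data, user)
setkey(user, user)

# Keyed merge; `by` is given explicitly to avoid relying on defaults
m <- merge(some.data, user, by = "user")
print(m)
```

Whether this is actually faster than merge() on plain data frames depends on the data size, so it is worth timing both on a realistic sample.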
If your sets of common keys/indexes overlap completely, that is...
all(user$user.id %in% some.data$user.id)
...returns TRUE, and the data frames are, as you said, sorted with no duplicate keys, then your merging problem reduces to adding columns to a data.frame. Something along the lines of...
If the assumptions above do not hold, it gets slightly messier (set union of both key sets, translation function, NA padding), but the sorted, fully overlapping case alone gets you a long way.
Note also that the timing of the second approach is biased, since it calls Sys.time() twice, whereas the merge() timing calls system.time() only once.
(Excuse my lame usage of S.O. mark-up)
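As a rough illustration of the column-binding idea in this answer (not the answerer's own code), here is a minimal base R sketch. It assumes both data frames are sorted by `user.id` and contain exactly the same keys with no duplicates; the example values are hypothetical. The match() variant at the end is an added assumption for the case where rows are not perfectly aligned but every key in some.data exists in user.

```r
# Hypothetical example data: same keys, same order, no duplicates
some.data <- data.frame(
  user.id = 1:5,
  data_1  = c(0.1, 0.2, 0.3, 0.4, 0.5),
  data_2  = c(10, 20, 30, 40, 50)
)
user <- data.frame(
  user.id     = 1:5,
  user_attr_1 = letters[1:5],
  user_attr_2 = LETTERS[1:5]
)

# Verify the assumptions before skipping merge()
stopifnot(
  !is.unsorted(some.data$user.id),
  !anyDuplicated(some.data$user.id),
  identical(some.data$user.id, user$user.id)
)

# With identical, aligned keys the merge is just column binding
attr.cols <- setdiff(names(user), "user.id")
m <- cbind(some.data, user[attr.cols])

# If rows are not aligned, look up each key's position in `user` first
idx <- match(some.data$user.id, user$user.id)
m2 <- cbind(some.data, user[idx, attr.cols, drop = FALSE])
```

If the assumption checks fail, falling back to merge() is the safer choice, since cbind() pairs rows purely by position.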