从一个 data.frame 中选择第二个 data.frame 中不存在的行
我有两个 data.frames:
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
我想找到 a1 有而 a2 没有的行。
是否有针对此类操作的内置函数?
(ps:我确实为此编写了一个解决方案,我只是好奇是否有人已经制作了更精巧的代码)
这是我的解决方案:
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
rows.in.a1.that.are.not.in.a2 <- function(a1,a2)
{
a1.vec <- apply(a1, 1, paste, collapse = "")
a2.vec <- apply(a2, 1, paste, collapse = "")
a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
return(a1.without.a2.rows)
}
rows.in.a1.that.are.not.in.a2(a1,a2)
I have two data.frames:
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
I want to find the rows a1 have that a2 doesn't.
Is there a built in function for this type of operation?
(p.s: I did write a solution for it, I am simply curious if someone already made a more crafted code)
Here is my solution:
a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
rows.in.a1.that.are.not.in.a2 <- function(a1,a2)
{
a1.vec <- apply(a1, 1, paste, collapse = "")
a2.vec <- apply(a2, 1, paste, collapse = "")
a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
return(a1.without.a2.rows)
}
rows.in.a1.that.are.not.in.a2(a1,a2)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(14)
sqldf
提供了一个很好的解决方案两个数据框中的行:
dplyr
的新版本有一个函数anti_join
,正是针对这些 行和
semi_join
来过滤a1
中同时也在a2
中的sqldf
provides a nice solutionAnd the rows which are in both data frames:
The new version of
dplyr
has a function,anti_join
, for exactly these kinds of comparisonsAnd
semi_join
to filter rows ina1
that are also ina2
在dplyr中:
基本上,
setdiff(bigFrame,smallFrame)
可以获取第一个表中的额外记录。在 SQLverse 中,这称为
对于所有连接选项和设置主题的详细描述,这是迄今为止我见过的最好的摘要之一:http://www.vertabelo.com/blog/technical-articles/sql-joins
但回到这个问题 - 以下是使用 OP 数据时
setdiff()
代码的结果:或者甚至
anti_join(a1,a2)
也会得到相同的结果。欲了解更多信息: https://www.rstudio .com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
In dplyr:
Basically,
setdiff(bigFrame, smallFrame)
gets you the extra records in the first table.In the SQLverse this is called a
For good descriptions of all join options and set subjects, this is one of the best summaries I've seen put together to date: http://www.vertabelo.com/blog/technical-articles/sql-joins
But back to this question - here are the results for the
setdiff()
code when using the OP's data:Or even
anti_join(a1,a2)
will get you the same results.For more info: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
这不会直接回答您的问题,但会为您提供共同点。这可以通过 Paul Murrell 的包
compare
< 来完成/a>:函数
compare
在允许进行哪种比较方面为您提供了很大的灵活性(例如,更改每个向量元素的顺序、更改变量的顺序和名称、缩短变量、更改字符串的情况)。由此,您应该能够找出其中一个或另一个缺少什么。例如(这不是很优雅):This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package
compare
:The function
compare
gives you a lot of flexibility in terms of what kind of comparisons are allowed (e.g. changing order of elements of each vector, changing order and names of variables, shortening variables, changing case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):对于这个特定目的来说,它当然效率不高,但在这些情况下我经常做的是在每个 data.frame 中插入指示符变量,然后合并:
included_a1 中的缺失值将记录 a1 中缺失了哪些行。 a2 也类似。
您的解决方案的一个问题是列顺序必须匹配。另一个问题是,很容易想象行被编码为相同而实际上不同的情况。使用合并的优点是您可以免费获得良好解决方案所需的所有错误检查。
It is certainly not efficient for this particular purpose, but what I often do in these situations is to insert indicator variables in each data.frame and then merge:
missing values in included_a1 will note which rows are missing in a1. similarly for a2.
One problem with your solution is that the column orders must match. Another problem is that it is easy to imagine situations where the rows are coded as the same when in fact are different. The advantage of using merge is that you get for free all error checking that is necessary for a good solution.
因为我遇到了同样的问题,所以我写了一个包(https://github.com/alexsanjoseph/compareDF)。
一个更复杂的例子:
该包还有一个 html_output 命令用于快速检查
I wrote a package (https://github.com/alexsanjoseph/compareDF) since I had the same issue.
A more complicated example:
The package also has an html_output command for quick checking
您可以使用
daff
包(它包装了daff.js
库 使用V8
包):产生以下差异对象:
表格差异格式描述此处 应该是非常不言自明的。第一列
@@
中带有+++
的行是a1
中新增的行,并且在a2< 中不存在/代码>。
差异对象可用于
patch_data()
,使用write_diff()
存储差异以用于文档目的,或使用render_diff()可视化差异)
:生成简洁的 HTML 输出:
You could use the
daff
package (which wraps thedaff.js
library using theV8
package):produces the following difference object:
The tabular diff format is described here and should be pretty self-explanatory. The lines with
+++
in the first column@@
are the ones which are new ina1
and not present ina2
.The difference object can be used to
patch_data()
, to store the difference for documentation purposes usingwrite_diff()
or to visualize the difference usingrender_diff()
:generates a neat HTML output:
使用
diffobj
包:Using
diffobj
package:我改编了
merge
函数来获得此功能。在较大的数据帧上,它比完整合并解决方案使用更少的内存。我可以使用关键列的名称。另一种解决方案是使用库
prob
。I adapted the
merge
function to get this functionality. On larger dataframes it uses less memory than the full merge solution. And I can play with the names of the key columns.Another solution is to use the library
prob
.您的示例数据没有任何重复项,但您的解决方案会自动处理它们。这意味着在出现重复的情况下,某些答案可能与函数的结果不匹配。
这是我的解决方案,其地址重复的方式与您的相同。它的规模也很大!
需要data.table 1.9.8+
Your example data does not have any duplicates, but your solution handle them automatically. This means that potentially some of the answers won't match to results of your function in case of duplicates.
Here is my solution which address duplicates the same way as yours. It also scales great!
It needs data.table 1.9.8+
也许它太简单了,但我使用了这个解决方案,并且当我有一个可以用来比较数据集的主键时,我发现它非常有用。希望它能有所帮助。
Maybe it is too simplistic, but I used this solution and I find it very useful when I have a primary key that I can use to compare data sets. Hope it can help.
另一个基于 plyr 中的 match_df 的解决方案。
这是plyr的match_df:
我们可以将其修改为negate:
然后:
Yet another solution based on match_df in plyr.
Here's plyr's match_df:
We can modify it to negate:
Then:
使用
子集
:Using
subset
:以下代码同时使用
data.table
和fastmatch
来提高速度。The following code uses both
data.table
andfastmatch
for increased speed.非常快速的比较,以计算差异。
使用特定的列名称。
对于完整的数据框,不要提供列或索引名称
Really fast comparison, to get count of differences.
Using specific column name.
For complete dataframe, do not provide column or index name