从一个 data.frame 中选择第二个 data.frame 中不存在的行

发布于 2024-09-07 22:05:11 字数 654 浏览 8 评论 0原文

我有两个 data.frames:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

我想找到 a1 有而 a2 没有的行。

是否有针对此类操作的内置函数?

(ps:我确实为此编写了一个解决方案,我只是好奇是否有人已经制作了更精巧的代码)

这是我的解决方案:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

rows.in.a1.that.are.not.in.a2  <- function(a1,a2)
{
    a1.vec <- apply(a1, 1, paste, collapse = "")
    a2.vec <- apply(a2, 1, paste, collapse = "")
    a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
    return(a1.without.a2.rows)
}
rows.in.a1.that.are.not.in.a2(a1,a2)

I have two data.frames:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

I want to find the rows a1 have that a2 doesn't.

Is there a built in function for this type of operation?

(p.s: I did write a solution for it, I am simply curious if someone already made a more crafted code)

Here is my solution:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

rows.in.a1.that.are.not.in.a2  <- function(a1,a2)
{
    a1.vec <- apply(a1, 1, paste, collapse = "")
    a2.vec <- apply(a2, 1, paste, collapse = "")
    a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
    return(a1.without.a2.rows)
}
rows.in.a1.that.are.not.in.a2(a1,a2)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(14

原来分手还会想你 2024-09-14 22:05:11

sqldf 提供了一个很好的解决方案

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

require(sqldf)

a1NotIna2 <- sqldf('SELECT * FROM a1 EXCEPT SELECT * FROM a2')

两个数据框中的行:

a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT * FROM a2')

dplyr 的新版本有一个函数 anti_join,正是针对这些 行

require(dplyr) 
anti_join(a1,a2)

semi_join 来过滤 a1 中同时也在 a2 中的

semi_join(a1,a2)

sqldf provides a nice solution

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

require(sqldf)

a1NotIna2 <- sqldf('SELECT * FROM a1 EXCEPT SELECT * FROM a2')

And the rows which are in both data frames:

a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT * FROM a2')

The new version of dplyr has a function, anti_join, for exactly these kinds of comparisons

require(dplyr) 
anti_join(a1,a2)

And semi_join to filter rows in a1 that are also in a2

semi_join(a1,a2)
昔梦 2024-09-14 22:05:11

dplyr中:

setdiff(a1,a2)

基本上,setdiff(bigFrame,smallFrame)可以获取第一个表中的额外记录。

在 SQLverse 中,这称为

Left Exclusion Join Venn 图

对于所有连接选项和设置主题的详细描述,这是迄今为止我见过的最好的摘要之一:http://www.vertabelo.com/blog/technical-articles/sql-joins

但回到这个问题 - 以下是使用 OP 数据时 setdiff() 代码的结果:

> a1
  a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e

> a2
  a b
1 1 a
2 2 b
3 3 c

> setdiff(a1,a2)
  a b
1 4 d
2 5 e

或者甚至 anti_join(a1,a2) 也会得到相同的结果。
欲了解更多信息: https://www.rstudio .com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

In dplyr:

setdiff(a1,a2)

Basically, setdiff(bigFrame, smallFrame) gets you the extra records in the first table.

In the SQLverse this is called a

Left Excluding Join Venn Diagram

For good descriptions of all join options and set subjects, this is one of the best summaries I've seen put together to date: http://www.vertabelo.com/blog/technical-articles/sql-joins

But back to this question - here are the results for the setdiff() code when using the OP's data:

> a1
  a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e

> a2
  a b
1 1 a
2 2 b
3 3 c

> setdiff(a1,a2)
  a b
1 4 d
2 5 e

Or even anti_join(a1,a2) will get you the same results.
For more info: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

回忆凄美了谁 2024-09-14 22:05:11

这不会直接回答您的问题,但会为您提供共同点。这可以通过 Paul Murrell 的包 compare< 来完成/a>:

library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
#  a b
#1 1 a
#2 2 b
#3 3 c

函数 compare 在允许进行哪种比较方面为您提供了很大的灵活性(例如,更改每个向量元素的顺序、更改变量的顺序和名称、缩短变量、更改字符串的情况)。由此,您应该能够找出其中一个或另一个缺少什么。例如(这不是很优雅):

difference <-
   data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
colnames(difference) <- colnames(a1)
difference
#  a b
#1 4 d
#2 5 e

This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package compare:

library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
#  a b
#1 1 a
#2 2 b
#3 3 c

The function compare gives you a lot of flexibility in terms of what kind of comparisons are allowed (e.g. changing order of elements of each vector, changing order and names of variables, shortening variables, changing case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):

difference <-
   data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
colnames(difference) <- colnames(a1)
difference
#  a b
#1 4 d
#2 5 e
人疚 2024-09-14 22:05:11

对于这个特定目的来说,它当然效率不高,但在这些情况下我经常做的是在每个 data.frame 中插入指示符变量,然后合并:

a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all=TRUE)

included_a1 中的缺失值将记录 a1 中缺失了哪些行。 a2 也类似。

您的解决方案的一个问题是列顺序必须匹配。另一个问题是,很容易想象行被编码为相同而实际上不同的情况。使用合并的优点是您可以免费获得良好解决方案所需的所有错误检查。

It is certainly not efficient for this particular purpose, but what I often do in these situations is to insert indicator variables in each data.frame and then merge:

a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all=TRUE)

missing values in included_a1 will note which rows are missing in a1. similarly for a2.

One problem with your solution is that the column orders must match. Another problem is that it is easy to imagine situations where the rows are coded as the same when in fact are different. The advantage of using merge is that you get for free all error checking that is necessary for a good solution.

述情 2024-09-14 22:05:11

因为我遇到了同样的问题,所以我写了一个包(https://github.com/alexsanjoseph/compareDF)。

  > df1 <- data.frame(a = 1:5, b=letters[1:5], row = 1:5)
  > df2 <- data.frame(a = 1:3, b=letters[1:3], row = 1:3)
  > df_compare = compare_df(df1, df2, "row")

  > df_compare$comparison_df
    row chng_type a b
  1   4         + 4 d
  2   5         + 5 e

一个更复杂的例子:

library(compareDF)
df1 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", "Duster 360", "Merc 240D"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Mer"),
                 hp = c(110, 110, 181, 110, 245, 62),
                 cyl = c(6, 6, 4, 6, 8, 4),
                 qsec = c(16.46, 17.02, 33.00, 19.44, 15.84, 20.00))

df2 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", " Hornet Sportabout", "Valiant"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Val"),
                 hp = c(110, 110, 93, 110, 175, 105),
                 cyl = c(6, 6, 4, 6, 8, 6),
                 qsec = c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22))

> df_compare$comparison_df
    grp chng_type                id1 id2  hp cyl  qsec
  1   1         -  Hornet Sportabout Dus 175   8 17.02
  2   2         +         Datsun 710 Dat 181   4 33.00
  3   2         -         Datsun 710 Dat  93   4 18.61
  4   3         +         Duster 360 Dus 245   8 15.84
  5   7         +          Merc 240D Mer  62   4 20.00
  6   8         -            Valiant Val 105   6 20.22

该包还有一个 html_output 命令用于快速检查

df_compare$html_output
在此处输入图片描述


I wrote a package (https://github.com/alexsanjoseph/compareDF) since I had the same issue.

  > df1 <- data.frame(a = 1:5, b=letters[1:5], row = 1:5)
  > df2 <- data.frame(a = 1:3, b=letters[1:3], row = 1:3)
  > df_compare = compare_df(df1, df2, "row")

  > df_compare$comparison_df
    row chng_type a b
  1   4         + 4 d
  2   5         + 5 e

A more complicated example:

library(compareDF)
df1 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", "Duster 360", "Merc 240D"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Mer"),
                 hp = c(110, 110, 181, 110, 245, 62),
                 cyl = c(6, 6, 4, 6, 8, 4),
                 qsec = c(16.46, 17.02, 33.00, 19.44, 15.84, 20.00))

df2 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", " Hornet Sportabout", "Valiant"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Val"),
                 hp = c(110, 110, 93, 110, 175, 105),
                 cyl = c(6, 6, 4, 6, 8, 6),
                 qsec = c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22))

> df_compare$comparison_df
    grp chng_type                id1 id2  hp cyl  qsec
  1   1         -  Hornet Sportabout Dus 175   8 17.02
  2   2         +         Datsun 710 Dat 181   4 33.00
  3   2         -         Datsun 710 Dat  93   4 18.61
  4   3         +         Duster 360 Dus 245   8 15.84
  5   7         +          Merc 240D Mer  62   4 20.00
  6   8         -            Valiant Val 105   6 20.22

The package also has an html_output command for quick checking

df_compare$html_output
enter image description here

秋千易 2024-09-14 22:05:11

您可以使用 daff(它包装了 daff.js 使用 V8):

library(daff)

diff_data(data_ref = a2,
          data = a1)

产生以下差异对象:

Daff Comparison: ‘a2’ vs. ‘a1’ 
  First 6 and last 6 patch lines:
   @@   a   b
1 ... ... ...
2       3   c
3 +++   4   d
4 +++   5   e
5 ... ... ...
6 ... ... ...
7       3   c
8 +++   4   d
9 +++   5   e

表格差异格式描述此处 应该是非常不言自明的。第一列 @@ 中带有 +++ 的行是 a1 中新增的行,并且在 a2< 中不存在/代码>。

差异对象可用于patch_data(),使用write_diff()存储差异以用于文档目的,或使用render_diff()可视化差异)

render_diff(
    diff_data(data_ref = a2,
              data = a1)
)

生成简洁的 HTML 输出:

在此处输入图像描述

You could use the daff package (which wraps the daff.js library using the V8 package):

library(daff)

diff_data(data_ref = a2,
          data = a1)

produces the following difference object:

Daff Comparison: ‘a2’ vs. ‘a1’ 
  First 6 and last 6 patch lines:
   @@   a   b
1 ... ... ...
2       3   c
3 +++   4   d
4 +++   5   e
5 ... ... ...
6 ... ... ...
7       3   c
8 +++   4   d
9 +++   5   e

The tabular diff format is described here and should be pretty self-explanatory. The lines with +++ in the first column @@ are the ones which are new in a1 and not present in a2.

The difference object can be used to patch_data(), to store the difference for documentation purposes using write_diff() or to visualize the difference using render_diff():

render_diff(
    diff_data(data_ref = a2,
              data = a1)
)

generates a neat HTML output:

enter image description here

千纸鹤带着心事 2024-09-14 22:05:11

使用 diffobj 包:

library(diffobj)

diffPrint(a1, a2)
diffObj(a1, a2)

在此处输入图像描述

在此处输入图像描述

Using diffobj package:

library(diffobj)

diffPrint(a1, a2)
diffObj(a1, a2)

enter image description here

enter image description here

善良天后 2024-09-14 22:05:11

我改编了 merge 函数来获得此功能。在较大的数据帧上,它比完整合并解决方案使用更少的内存。我可以使用关键列的名称。

另一种解决方案是使用库prob

#  Derived from src/library/base/R/merge.R
#  Part of the R package, http://www.R-project.org
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  A copy of the GNU General Public License is available at
#  http://www.r-project.org/Licenses/

XinY <-
    function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
             notin = FALSE, incomparables = NULL,
             ...)
{
    fix.by <- function(by, df)
    {
        ## fix up 'by' to be a valid set of cols by number: 0 is row.names
        if(is.null(by)) by <- numeric(0L)
        by <- as.vector(by)
        nc <- ncol(df)
        if(is.character(by))
            by <- match(by, c("row.names", names(df))) - 1L
        else if(is.numeric(by)) {
            if(any(by < 0L) || any(by > nc))
                stop("'by' must match numbers of columns")
        } else if(is.logical(by)) {
            if(length(by) != nc) stop("'by' must match number of columns")
            by <- seq_along(by)[by]
        } else stop("'by' must specify column(s) as numbers, names or logical")
        if(any(is.na(by))) stop("'by' must specify valid column(s)")
        unique(by)
    }

    nx <- nrow(x <- as.data.frame(x)); ny <- nrow(y <- as.data.frame(y))
    by.x <- fix.by(by.x, x)
    by.y <- fix.by(by.y, y)
    if((l.b <- length(by.x)) != length(by.y))
        stop("'by.x' and 'by.y' specify different numbers of columns")
    if(l.b == 0L) {
        ## was: stop("no columns to match on")
        ## returns x
        x
    }
    else {
        if(any(by.x == 0L)) {
            x <- cbind(Row.names = I(row.names(x)), x)
            by.x <- by.x + 1L
        }
        if(any(by.y == 0L)) {
            y <- cbind(Row.names = I(row.names(y)), y)
            by.y <- by.y + 1L
        }
        ## create keys from 'by' columns:
        if(l.b == 1L) {                  # (be faster)
            bx <- x[, by.x]; if(is.factor(bx)) bx <- as.character(bx)
            by <- y[, by.y]; if(is.factor(by)) by <- as.character(by)
        } else {
            ## Do these together for consistency in as.character.
            ## Use same set of names.
            bx <- x[, by.x, drop=FALSE]; by <- y[, by.y, drop=FALSE]
            names(bx) <- names(by) <- paste("V", seq_len(ncol(bx)), sep="")
            bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
            bx <- bz[seq_len(nx)]
            by <- bz[nx + seq_len(ny)]
        }
        comm <- match(bx, by, 0L)
        if (notin) {
            res <- x[comm == 0,]
        } else {
            res <- x[comm > 0,]
        }
    }
    ## avoid a copy
    ## row.names(res) <- NULL
    attr(res, "row.names") <- .set_row_names(nrow(res))
    res
}


XnotinY <-
    function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
             notin = TRUE, incomparables = NULL,
             ...)
{
    XinY(x,y,by,by.x,by.y,notin,incomparables)
}

I adapted the merge function to get this functionality. On larger dataframes it uses less memory than the full merge solution. And I can play with the names of the key columns.

Another solution is to use the library prob.

#  Derived from src/library/base/R/merge.R
#  Part of the R package, http://www.R-project.org
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  A copy of the GNU General Public License is available at
#  http://www.r-project.org/Licenses/

XinY <-
    function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
             notin = FALSE, incomparables = NULL,
             ...)
{
    fix.by <- function(by, df)
    {
        ## fix up 'by' to be a valid set of cols by number: 0 is row.names
        if(is.null(by)) by <- numeric(0L)
        by <- as.vector(by)
        nc <- ncol(df)
        if(is.character(by))
            by <- match(by, c("row.names", names(df))) - 1L
        else if(is.numeric(by)) {
            if(any(by < 0L) || any(by > nc))
                stop("'by' must match numbers of columns")
        } else if(is.logical(by)) {
            if(length(by) != nc) stop("'by' must match number of columns")
            by <- seq_along(by)[by]
        } else stop("'by' must specify column(s) as numbers, names or logical")
        if(any(is.na(by))) stop("'by' must specify valid column(s)")
        unique(by)
    }

    nx <- nrow(x <- as.data.frame(x)); ny <- nrow(y <- as.data.frame(y))
    by.x <- fix.by(by.x, x)
    by.y <- fix.by(by.y, y)
    if((l.b <- length(by.x)) != length(by.y))
        stop("'by.x' and 'by.y' specify different numbers of columns")
    if(l.b == 0L) {
        ## was: stop("no columns to match on")
        ## returns x
        x
    }
    else {
        if(any(by.x == 0L)) {
            x <- cbind(Row.names = I(row.names(x)), x)
            by.x <- by.x + 1L
        }
        if(any(by.y == 0L)) {
            y <- cbind(Row.names = I(row.names(y)), y)
            by.y <- by.y + 1L
        }
        ## create keys from 'by' columns:
        if(l.b == 1L) {                  # (be faster)
            bx <- x[, by.x]; if(is.factor(bx)) bx <- as.character(bx)
            by <- y[, by.y]; if(is.factor(by)) by <- as.character(by)
        } else {
            ## Do these together for consistency in as.character.
            ## Use same set of names.
            bx <- x[, by.x, drop=FALSE]; by <- y[, by.y, drop=FALSE]
            names(bx) <- names(by) <- paste("V", seq_len(ncol(bx)), sep="")
            bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
            bx <- bz[seq_len(nx)]
            by <- bz[nx + seq_len(ny)]
        }
        comm <- match(bx, by, 0L)
        if (notin) {
            res <- x[comm == 0,]
        } else {
            res <- x[comm > 0,]
        }
    }
    ## avoid a copy
    ## row.names(res) <- NULL
    attr(res, "row.names") <- .set_row_names(nrow(res))
    res
}


XnotinY <-
    function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
             notin = TRUE, incomparables = NULL,
             ...)
{
    XinY(x,y,by,by.x,by.y,notin,incomparables)
}
负佳期 2024-09-14 22:05:11

您的示例数据没有任何重复项,但您的解决方案会自动处理它们。这意味着在出现重复的情况下,某些答案可能与函数的结果不匹配。
这是我的解决方案,其地址重复的方式与您的相同。它的规模也很大!

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
rows.in.a1.that.are.not.in.a2  <- function(a1,a2)
{
    a1.vec <- apply(a1, 1, paste, collapse = "")
    a2.vec <- apply(a2, 1, paste, collapse = "")
    a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
    return(a1.without.a2.rows)
}

library(data.table)
setDT(a1)
setDT(a2)

# no duplicates - as in example code
r <- fsetdiff(a1, a2)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE

# handling duplicates - make some duplicates
a1 <- rbind(a1, a1, a1)
a2 <- rbind(a2, a2, a2)
r <- fsetdiff(a1, a2, all = TRUE)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE

需要data.table 1.9.8+

Your example data does not have any duplicates, but your solution handle them automatically. This means that potentially some of the answers won't match to results of your function in case of duplicates.
Here is my solution which address duplicates the same way as yours. It also scales great!

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
rows.in.a1.that.are.not.in.a2  <- function(a1,a2)
{
    a1.vec <- apply(a1, 1, paste, collapse = "")
    a2.vec <- apply(a2, 1, paste, collapse = "")
    a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
    return(a1.without.a2.rows)
}

library(data.table)
setDT(a1)
setDT(a2)

# no duplicates - as in example code
r <- fsetdiff(a1, a2)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE

# handling duplicates - make some duplicates
a1 <- rbind(a1, a1, a1)
a2 <- rbind(a2, a2, a2)
r <- fsetdiff(a1, a2, all = TRUE)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE

It needs data.table 1.9.8+

烟织青萝梦 2024-09-14 22:05:11

也许它太简单了,但我使用了这个解决方案,并且当我有一个可以用来比较数据集的主键时,我发现它非常有用。希望它能有所帮助。

a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
different.names <- (!a1$a %in% a2$a)
not.in.a2 <- a1[different.names,]

Maybe it is too simplistic, but I used this solution and I find it very useful when I have a primary key that I can use to compare data sets. Hope it can help.

a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
different.names <- (!a1$a %in% a2$a)
not.in.a2 <- a1[different.names,]
好听的两个字的网名 2024-09-14 22:05:11

另一个基于 plyr 中的 match_df 的解决方案。
这是plyr的match_df:

match_df <- function (x, y, on = NULL) 
{
    if (is.null(on)) {
        on <- intersect(names(x), names(y))
        message("Matching on: ", paste(on, collapse = ", "))
    }
    keys <- join.keys(x, y, on)
    x[keys$x %in% keys$y, , drop = FALSE]
}

我们可以将其修改为negate:

library(plyr)
negate_match_df <- function (x, y, on = NULL) 
{
    if (is.null(on)) {
        on <- intersect(names(x), names(y))
        message("Matching on: ", paste(on, collapse = ", "))
    }
    keys <- join.keys(x, y, on)
    x[!(keys$x %in% keys$y), , drop = FALSE]
}

然后:

diff <- negate_match_df(a1,a2)

Yet another solution based on match_df in plyr.
Here's plyr's match_df:

match_df <- function (x, y, on = NULL) 
{
    if (is.null(on)) {
        on <- intersect(names(x), names(y))
        message("Matching on: ", paste(on, collapse = ", "))
    }
    keys <- join.keys(x, y, on)
    x[keys$x %in% keys$y, , drop = FALSE]
}

We can modify it to negate:

library(plyr)
negate_match_df <- function (x, y, on = NULL) 
{
    if (is.null(on)) {
        on <- intersect(names(x), names(y))
        message("Matching on: ", paste(on, collapse = ", "))
    }
    keys <- join.keys(x, y, on)
    x[!(keys$x %in% keys$y), , drop = FALSE]
}

Then:

diff <- negate_match_df(a1,a2)
醉酒的小男人 2024-09-14 22:05:11

使用子集

missing<-subset(a1, !(a %in% a2$a))

Using subset:

missing<-subset(a1, !(a %in% a2$a))
请恋爱 2024-09-14 22:05:11

以下代码同时使用 data.tablefastmatch 来提高速度。

library("data.table")
library("fastmatch")

a1 <- setDT(data.frame(a = 1:5, b=letters[1:5]))
a2 <- setDT(data.frame(a = 1:3, b=letters[1:3]))

compare_rows <- a1$a %fin% a2$a
# the %fin% function comes from the `fastmatch` package

added_rows <- a1[which(compare_rows == FALSE)]

added_rows

#    a b
# 1: 4 d
# 2: 5 e

The following code uses both data.table and fastmatch for increased speed.

library("data.table")
library("fastmatch")

a1 <- setDT(data.frame(a = 1:5, b=letters[1:5]))
a2 <- setDT(data.frame(a = 1:3, b=letters[1:3]))

compare_rows <- a1$a %fin% a2$a
# the %fin% function comes from the `fastmatch` package

added_rows <- a1[which(compare_rows == FALSE)]

added_rows

#    a b
# 1: 4 d
# 2: 5 e
踏雪无痕 2024-09-14 22:05:11

非常快速的比较,以计算差异。
使用特定的列名称。

colname = "CreatedDate" # specify column name
index <- match(colname, names(source_df)) # get index name for column name
sel <- source_df[, index] == target_df[, index] # get differences, gives you dataframe with TRUE and FALSE values
table(sel)["FALSE"] # count of differences
table(sel)["TRUE"] # count of matches

对于完整的数据框,不要提供列或索引名称

sel <- source_df[, ] == target_df[, ] # gives you dataframe with TRUE and FALSE values
table(sel)["FALSE"] # count of differences
table(sel)["TRUE"] # count of matches

Really fast comparison, to get count of differences.
Using specific column name.

colname = "CreatedDate" # specify column name
index <- match(colname, names(source_df)) # get index name for column name
sel <- source_df[, index] == target_df[, index] # get differences, gives you dataframe with TRUE and FALSE values
table(sel)["FALSE"] # count of differences
table(sel)["TRUE"] # count of matches

For complete dataframe, do not provide column or index name

sel <- source_df[, ] == target_df[, ] # gives you dataframe with TRUE and FALSE values
table(sel)["FALSE"] # count of differences
table(sel)["TRUE"] # count of matches
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文