Finding differences in rows of two different data frames that have been updated

Posted 2025-01-11 16:07:31 | 1272 words | 3 views | 0 comments

I am new to R and am trying to figure out how to find the differences between two data sets after merging them. I merged the data sets with setdiff and found 19 differing rows in the new df, but that does not tell me which columns changed. Since the dfs have hundreds of columns, it is not practical to check every row and column by hand. Is there a way to determine the exact change in each row of the new df?

Example of the new df

IP Name Address  ZIP 
1  Bob  3456 st  2012
2  Jane 2456 st  4302
3  Mike 9698 st  2398

Example of the old df

IP Name Address  ZIP 
1  Bob  3000 st  2012
2  Jane 2456 st  4302
3  Mike 9698 st  2000 

If the new df had changes to Bob's address and Mike's ZIP, how would I find that in R? I have tried setdiff and compare, but those did not work. I would like output for only the specific changes to the data frame, along with the row in which each one happened.

EDIT: Another example, from the comments:

new <- data.frame(
          stringsAsFactors = FALSE, 
                        IP = c(1L, 2L, 3L, 4L, 5L, 6L), 
                      Name = c("Bob", "Jack", "Jane", "Mike", "Alex", "Amy"), 
                   Address = c("3000 st", "5678 st", "2456 st", "9698 st", 
                               "9776 st", "1002 st"), 
                       ZIP = c(2012L, 1121L, 4302L, 2398L, 3476L, 4655L)
       )  

old <- data.frame(
          stringsAsFactors = FALSE, 
                        IP = c(1L, 2L, 3L, 4L), 
                      Name = c("Bob", "Jane", "Mike", "Jack"), 
                   Address = c("3456 st", "2456 st", "9698 st", "5678 st"), 
                       ZIP = c(2012L, 4302L, 2012L, 1121L)
       )

清浅ˋ旧时光 2025-01-18 16:07:31

EDIT #2:

If you want to find new names added from the old to the new data, and the Names column is a unique identifier, you could use this on the updated example data:

library(dplyr)
new %>%
  filter(!Name %in% old$Name)

#  IP Name Address  ZIP
#1  5 Alex 9776 st 3476
#2  6  Amy 1002 st 4655
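
The same check can be written in base R without any packages. The sketch below is only an illustration under the same assumption as above (Name uniquely identifies a row); the variable names added and removed are mine, not from the question. It also looks in the other direction, for names that disappeared.

```r
# Updated example data from the question
new <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L, 4L, 5L, 6L),
  Name = c("Bob", "Jack", "Jane", "Mike", "Alex", "Amy"),
  Address = c("3000 st", "5678 st", "2456 st", "9698 st", "9776 st", "1002 st"),
  ZIP = c(2012L, 1121L, 4302L, 2398L, 3476L, 4655L)
)
old <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L, 4L),
  Name = c("Bob", "Jane", "Mike", "Jack"),
  Address = c("3456 st", "2456 st", "9698 st", "5678 st"),
  ZIP = c(2012L, 4302L, 2012L, 1121L)
)

# Assumption: Name is a unique identifier in both data frames.
# Rows whose Name exists only in the new data (added rows)
added <- new[!new$Name %in% old$Name, ]

# Rows whose Name exists only in the old data (removed rows)
removed <- old[!old$Name %in% new$Name, ]

print(added)
print(removed)
```

With this data, added holds the Alex and Amy rows and removed is empty.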

There must be a more elegant way to do this, but another approach could be to join the new data to the old data using Name as a key, and then reshape to identify differences between the two:

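library(tidyverse)
new %>%
  # after the join, columns from `new` end in .x and columns from `old` in .y
  left_join(old, by = "Name") %>%
  # convert everything to character so all columns can share one value column
  mutate(across(everything(), as.character)) %>%
  pivot_longer(-Name, names_to = c("name", "src"), names_sep = "\\.") %>%
  pivot_wider(names_from = src, values_from = value) %>%
  group_by(Name) %>%
  # keep only cells that differ; rows with an NA on either side are dropped
  filter(x != y) %>%
  ungroup()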

## A tibble: 5 x 4
#  Name  name    x       y      
#  <chr> <chr>   <chr>   <chr>  
#1 Bob   Address 3000 st 3456 st
#2 Jack  IP      2       4      
#3 Jane  IP      3       2      
#4 Mike  IP      4       3      
#5 Mike  ZIP     2398    2012   

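Here x holds the value from new and y the value from old. This output tells us that Bob's Address field changed, that Jack, Jane, and Mike have different IP values (they appear in different rows), and that Mike's ZIP changed. Alex and Amy do not show up because they have no match in old, so their y values are NA and filter() drops them.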
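
For comparison, here is a minimal base R sketch of the same key-based idea with no package dependencies. The helper name diff_cells is hypothetical (it is not part of any package), and the sketch assumes the key column uniquely identifies rows in both data frames. It only reports rows whose key exists in both tables.

```r
# Hypothetical helper: list every changed cell for rows whose key
# appears in both data frames. Assumes `key` is unique in each table.
diff_cells <- function(new, old, key) {
  common_keys <- intersect(new[[key]], old[[key]])
  cols <- setdiff(intersect(names(new), names(old)), key)

  # Align both tables to the same key order
  new_m <- new[match(common_keys, new[[key]]), , drop = FALSE]
  old_m <- old[match(common_keys, old[[key]]), , drop = FALSE]

  pieces <- lapply(cols, function(col) {
    new_val <- as.character(new_m[[col]])
    old_val <- as.character(old_m[[col]])
    # NA vs. non-NA counts as a change; NA vs. NA counts as unchanged
    changed <- (is.na(new_val) != is.na(old_val)) |
      (!is.na(new_val) & !is.na(old_val) & new_val != old_val)
    data.frame(
      key = common_keys[changed],
      column = rep(col, sum(changed)),
      old = old_val[changed],
      new = new_val[changed],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

# Updated example data from the question
new <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L, 4L, 5L, 6L),
  Name = c("Bob", "Jack", "Jane", "Mike", "Alex", "Amy"),
  Address = c("3000 st", "5678 st", "2456 st", "9698 st", "9776 st", "1002 st"),
  ZIP = c(2012L, 1121L, 4302L, 2398L, 3476L, 4655L)
)
old <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L, 4L),
  Name = c("Bob", "Jane", "Mike", "Jack"),
  Address = c("3456 st", "2456 st", "9698 st", "5678 st"),
  ZIP = c(2012L, 4302L, 2012L, 1121L)
)

changes <- diff_cells(new, old, key = "Name")
print(changes)
```

The result has one row per changed cell, with the key, the column name, and the old and new values as character strings. Converting to character keeps the comparison simple across column types, at the cost of losing the original types in the report.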


Original answer

The waldo package offers an easy way to do this:

waldo::compare(new, old)

#`old$Address`: "3456 st" "2456 st" "9698 st"
#`new$Address`: "3000 st" "2456 st" "9698 st"
#
#`old$ZIP`: 2012 4302 2398
#`new$ZIP`: 2012 4302 2000

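While not visible here, the output in the console highlights the values that changed in green. Note that waldo::compare() labels its first argument `old` and its second argument `new` by default, so with the call above the labels are swapped relative to the data frame names; calling waldo::compare(old, new) makes them line up.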

EDIT - A better option might be diffdf::diffdf(new, old), which outputs a summary of the specific differences:

Differences found between the objects!

A summary is given below.

Not all Values Compared Equal
All rows are shown in table below

  =============================
   Variable  No of Differences 
  -----------------------------
   Address           1         
     ZIP             1         
  -----------------------------


All rows are shown in table below

  ===========================================
   VARIABLE  ..ROWNUMBER..   BASE    COMPARE 
  -------------------------------------------
   Address         1        3456 st  3000 st 
  -------------------------------------------


All rows are shown in table below

  ========================================
   VARIABLE  ..ROWNUMBER..  BASE  COMPARE 
  ----------------------------------------
     ZIP           3        2398   2000   
  ----------------------------------------

Example data in loadable form:

new <- data.frame(
  stringsAsFactors = FALSE,
                IP = c(1L, 2L, 3L),
              Name = c("Bob", "Jane", "Mike"),
           Address = c("3456 st", "2456 st", "9698 st"),
               ZIP = c(2012L, 4302L, 2398L)
       )

old <- data.frame(
  stringsAsFactors = FALSE,
                IP = c(1L, 2L, 3L),
              Name = c("Bob", "Jane", "Mike"),
           Address = c("3000 st", "2456 st", "9698 st"),
               ZIP = c(2012L, 4302L, 2000L)
       )
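
If both data frames have the same shape and the same row order, as in this small example, a position-based comparison in base R is also possible. This is only a sketch under that assumption; the name report is mine, and cells where either side is NA are silently skipped because which() ignores NA.

```r
new <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L),
  Name = c("Bob", "Jane", "Mike"),
  Address = c("3456 st", "2456 st", "9698 st"),
  ZIP = c(2012L, 4302L, 2398L)
)
old <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L),
  Name = c("Bob", "Jane", "Mike"),
  Address = c("3000 st", "2456 st", "9698 st"),
  ZIP = c(2012L, 4302L, 2000L)
)

# Assumption: identical dimensions, column order, and row order.
# Comparing two data frames cell by cell yields a logical matrix.
changed <- new != old

# Row and column positions of every TRUE cell
idx <- which(changed, arr.ind = TRUE)
rows <- unname(idx[, "row"])
cols <- unname(idx[, "col"])

# One line per changed cell, with values shown as character strings
report <- data.frame(
  row = rows,
  column = names(new)[cols],
  old = vapply(seq_along(rows),
               function(i) as.character(old[rows[i], cols[i]]), character(1)),
  new = vapply(seq_along(rows),
               function(i) as.character(new[rows[i], cols[i]]), character(1)),
  stringsAsFactors = FALSE
)
print(report)
```

For the data above, report contains two rows: the Address change in row 1 and the ZIP change in row 3. If rows can be added, removed, or reordered, a key-based comparison is the safer choice.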