Find differences in rows of two different data frames that have been updated
I am new to R and trying to figure out how to find the differences between two data sets. I compared the data sets with setdiff and found 19 rows in the new df that differ. However, that does not tell me which columns have changed. Since the data frames have hundreds of columns, it is not practical to search every row and column by hand to find the change. Is there a way to determine exactly what changed in each row of the new df?
Example of the new df:
IP  Name  Address  ZIP
1   Bob   3456 st  2012
2   Jane  2456 st  4302
3   Mike  9698 st  2398
Example of the old df:
IP  Name  Address  ZIP
1   Bob   3000 st  2012
2   Jane  2456 st  4302
3   Mike  9698 st  2000
If the new df had changes to Bob's address and Mike's ZIP, how would I find them in R? I have tried setdiff and compare, but those did not work. I would like the output to show only the specific changes to the data frame and the rows in which they occurred.
EDIT: Another example, from the comments:
new <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L, 4L, 5L, 6L),
  Name = c("Bob", "Jack", "Jane", "Mike", "Alex", "Amy"),
  Address = c("3000 st", "5678 st", "2456 st", "9698 st",
              "9776 st", "1002 st"),
  ZIP = c(2012L, 1121L, 4302L, 2398L, 3476L, 4655L)
)
old <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L, 4L),
  Name = c("Bob", "Jane", "Mike", "Jack"),
  Address = c("3456 st", "2456 st", "9698 st", "5678 st"),
  ZIP = c(2012L, 4302L, 2012L, 1121L)
)
Comments (1)
EDIT #2:
If you want to find names that were added to the new data and are absent from the old data, and the Name column is a unique identifier, you could use this on the updated example data:
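The code for this step is not shown here. A minimal base R sketch of the idea, assuming Name uniquely identifies a person, keeps the rows of new whose Name never occurs in old. The variable name added is only illustrative.

```r
# Updated example data from the question
new <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L, 4L, 5L, 6L),
  Name = c("Bob", "Jack", "Jane", "Mike", "Alex", "Amy"),
  Address = c("3000 st", "5678 st", "2456 st", "9698 st",
              "9776 st", "1002 st"),
  ZIP = c(2012L, 1121L, 4302L, 2398L, 3476L, 4655L)
)
old <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L, 4L),
  Name = c("Bob", "Jane", "Mike", "Jack"),
  Address = c("3456 st", "2456 st", "9698 st", "5678 st"),
  ZIP = c(2012L, 4302L, 2012L, 1121L)
)

# Keep the rows of `new` whose Name does not appear anywhere in `old`
added <- new[!new$Name %in% old$Name, ]
print(added)
```

With this data the result contains the rows for Alex and Amy. If dplyr is available, dplyr::anti_join(new, old, by = "Name") expresses the same filter.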
There must be a more elegant way to do this, but another approach could be to join the old data to the new data using Name as a key, and then reshape the result to identify differences between the two:
This output tells us that Bob's Address field changed; that Jack, Jane, and Mike appear in different rows; and that Mike's ZIP changed.
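The code and output for the join approach are not shown here. As one possible sketch in base R (the answer itself may well have used dplyr and tidyr), merge() joins the two data frames on Name, and each remaining field is then compared between its old and new copy. The names joined, fields, and changes, and the _old/_new suffixes, are illustrative choices.

```r
# Updated example data from the question
new <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L, 4L, 5L, 6L),
  Name = c("Bob", "Jack", "Jane", "Mike", "Alex", "Amy"),
  Address = c("3000 st", "5678 st", "2456 st", "9698 st",
              "9776 st", "1002 st"),
  ZIP = c(2012L, 1121L, 4302L, 2398L, 3476L, 4655L)
)
old <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L, 4L),
  Name = c("Bob", "Jane", "Mike", "Jack"),
  Address = c("3456 st", "2456 st", "9698 st", "5678 st"),
  ZIP = c(2012L, 4302L, 2012L, 1121L)
)

# Inner join on Name; non-key columns get _old / _new suffixes
joined <- merge(old, new, by = "Name", suffixes = c("_old", "_new"))

# Every column except the key is a field to compare
fields <- setdiff(names(old), "Name")

# Build one long table: one row per (Name, field) whose value changed.
# Values are compared as character so all fields fit in the same columns.
changes <- do.call(rbind, lapply(fields, function(f) {
  old_val <- as.character(joined[[paste0(f, "_old")]])
  new_val <- as.character(joined[[paste0(f, "_new")]])
  differs <- old_val != new_val
  data.frame(
    Name = joined$Name[differs],
    field = rep(f, sum(differs)),
    old = old_val[differs],
    new = new_val[differs],
    stringsAsFactors = FALSE
  )
}))
print(changes)
```

Because the inner join keeps only names present in both data frames, Alex and Amy do not appear in changes. The sketch also assumes there are no missing values, since NA != NA evaluates to NA in R.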
Original answer
The waldo package offers an easy way to do this. While not visible here, the output in the console highlights the values that changed in green.
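The waldo call itself is not shown here. A minimal sketch, assuming the waldo package is installed from CRAN and that its compare() function is what was meant, applied to the first example from the question (typed in by hand, with column types assumed):

```r
# First example from the question; integer IP/ZIP columns are an assumption
new <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L),
  Name = c("Bob", "Jane", "Mike"),
  Address = c("3456 st", "2456 st", "9698 st"),
  ZIP = c(2012L, 4302L, 2398L)
)
old <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L),
  Name = c("Bob", "Jane", "Mike"),
  Address = c("3000 st", "2456 st", "9698 st"),
  ZIP = c(2012L, 4302L, 2000L)
)

# compare(x, y) reports where the two objects differ
report <- waldo::compare(old, new)
print(report)
```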
EDIT - A better option might be diffdf::diffdf(new, old), which outputs a summary of the specific differences.
Example data in loadable form:
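The data block that followed is not shown here. As a stand-in, the first example can be rebuilt from the tables in the question; the integer types for IP and ZIP are an assumption that mirrors the second example. The sketch then runs the diffdf call mentioned above, assuming the diffdf package is installed from CRAN. The suppress_warnings argument is an addition to the call as written in the answer.

```r
# First example from the question, typed in by hand from its tables
new <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L),
  Name = c("Bob", "Jane", "Mike"),
  Address = c("3456 st", "2456 st", "9698 st"),
  ZIP = c(2012L, 4302L, 2398L)
)
old <- data.frame(
  stringsAsFactors = FALSE,
  IP = c(1L, 2L, 3L),
  Name = c("Bob", "Jane", "Mike"),
  Address = c("3000 st", "2456 st", "9698 st"),
  ZIP = c(2012L, 4302L, 2000L)
)

# No `keys` argument is given, so rows are matched by position.
# diffdf() warns when it finds differences; suppress_warnings = TRUE
# silences that warning while still returning the full report.
result <- diffdf::diffdf(new, old, suppress_warnings = TRUE)
print(result)
```

When rows can move between versions, as in the second example, passing a key such as keys = "Name" matches rows by that column instead of by position.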