Matching rows to other rows with the smallest difference on a column
I want to perform matching between two groups in a data frame, where all rows belonging to one group (binary) are matched with observations from the other group (with replacement) if their difference on another column is smaller than a pre-set threshold. Let's use the toy-dataset below:
set.seed(123)
df <- data.frame(id = c(1:10),
                 group = rbinom(10, 1, 0.3),
                 value = round(runif(10), 2))
threshold <- round(sd(df$value), 2)
Which looks like this
> df
id group value
1 1 0 0.96
2 2 1 0.45
3 3 0 0.68
4 4 1 0.57
5 5 1 0.10
6 6 0 0.90
7 7 0 0.25
8 8 1 0.04
9 9 0 0.33
10 10 0 0.95
> threshold
[1] 0.35
In this case, I want to match rows with group==1 with rows with group==0, where the difference in value is smaller than threshold (0.35). This should lead to a data frame looking like this (apologies for potential errors, I did it manually).
id matched_id
1 2 3
2 2 7
3 2 9
4 4 3
5 4 6
6 4 7
7 4 9
8 5 7
9 5 9
10 8 7
11 8 9
Thank you!
Answers (2)
You can use df |> left_join(df, by = character()), which is the tidyverse way of performing a cartesian product. Then filter according to threshold.
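A runnable sketch of that suggestion, assuming the dplyr package; the "_matched" suffix and the final column names below are my own choices, not part of the original answer:

library(dplyr)

# Cross join df with itself: every row paired with every row.
# (dplyr >= 1.1.0 prefers cross_join(df, df) over by = character().)
matches <- df |>
  left_join(df, by = character(), suffix = c("", "_matched")) |>
  # keep group-1 rows on the left, group-0 rows on the right,
  # and only pairs whose values differ by less than the threshold
  filter(group == 1,
         group_matched == 0,
         abs(value - value_matched) < threshold) |>
  select(id, matched_id = id_matched)

matches

With the toy data above, this should reproduce the id / matched_id pairs listed in the question.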
UPDATED ANSWER: Was going slow on a larger dataset, so I tried to make the code a bit more efficient.
Came up with a solution that seems to do what I want. Not sure how efficient this code is on larger data but seems to work. It leads to the id / matched_id data frame shown in the question.
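The code from this self-answer did not survive in this copy of the post. Purely as an illustration (not necessarily the poster's actual solution), one compact base-R way to get the same id / matched_id pairs without a joined data frame is:

# Split the rows by group
g1 <- df[df$group == 1, ]
g0 <- df[df$group == 0, ]

# Pairwise absolute differences: rows index group-1 rows, columns index group-0 rows
diffs <- abs(outer(g1$value, g0$value, "-"))

# All pairs whose difference is below the threshold
hits <- which(diffs < threshold, arr.ind = TRUE)

matched <- data.frame(id = g1$id[hits[, "row"]],
                      matched_id = g0$id[hits[, "col"]])
matched[order(matched$id, matched$matched_id), ]

Keeping the comparison in one vectorised matrix rather than a row-by-row join is usually noticeably faster on larger data.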